Linux Data Storage & Filesystem Explained

In this article we will answer some questions about storage that you might have come across while learning about or working with it.

Here we will try to answer some of these questions ranging from:

how computers store data
where data is stored
types of storage
how data is organised (access, modification, persistence, etc.), which is the main focus of this article

These questions will help you build a mental model of storage at a high level. We will use Linux storage primitives as the reference in this article, but the fundamentals remain the same across other platforms at a high level.

How Do Computers Store Data?

You might have already guessed the answer to this: everything in a computer is ultimately binary.
At the lowest level, all data is stored as a series of bits, which are just 0s and 1s.

1 bit = 0 or 1
8 bits = 1 byte (e.g., 01001000 = the letter "H" in ASCII)

So whether it's a document, a photo, a program, or even your Linux OS itself — it's just billions (or trillions) of 0s and 1s stored somewhere.
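To make the bit-level view concrete, here is a minimal Python sketch that turns text into its bits and back. The helper names `to_bits` and `from_bits` are made up for this illustration.

```python
def to_bits(text: str) -> list[str]:
    """Return every byte of the UTF-8 encoded text as an 8-character bit string."""
    return [format(byte, "08b") for byte in text.encode("utf-8")]


def from_bits(bits: list[str]) -> str:
    """Rebuild the text from a list of 8-character bit strings."""
    return bytes(int(chunk, 2) for chunk in bits).decode("utf-8")


if __name__ == "__main__":
    print(to_bits("H"))                 # ['01001000']
    print(from_bits(to_bits("Hi")))     # Hi
```

For plain ASCII characters, each character maps to exactly one byte, which matches the "H" example above.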

We will not go deeper into this here, as our focus is on the peripherals and the software abstractions used for storing and organising these bits.

Storage Devices: Where is data stored physically?

These 0s and 1s need to be stored physically. That’s done on storage devices.

| Device type | Description |
| --- | --- |
| HDD (Hard Disk Drive) | Stores data on spinning magnetic disks. Older but cheaper, good for large storage. |
| SSD (Solid-State Drive) | Stores data on flash memory (no moving parts). Faster and more durable. |
| RAM (Random Access Memory) | Temporary memory used while programs run. Volatile (data lost on shutdown). |
| USB Drives, SD Cards | Portable flash-based storage devices. |

At a high level we can split storage into two categories: persistent (data that survives reboots, as on SSDs and HDDs) and non-persistent (data that is lost on reboot, as in RAM and CPU caches). Here we are more interested in persistent devices and how software abstractions like filesystems work on top of them.

Block Devices

Data on disk is stored in fixed-size units called blocks (e.g., 4 KB per block), so any device that stores and retrieves data in blocks is called a block device. Persistent devices like SSDs and HDDs are generally block devices.

💡

Now you might be wondering what other types of devices exist in terms of I/O. Apart from block devices, we have character devices (for example, input devices like keyboards), which transfer data as a stream of characters rather than in blocks. Linux categorises devices by their I/O behaviour, not by whether they're persistent or volatile. We will focus on block devices here.

Files can span multiple blocks. I always like to think of a block like a page in a book.

The filesystem keeps track of which blocks belong to which file, and in what order.
Each non-empty file occupies at least one block, often many (especially large files).
Even if a file is only 1 byte, it'll still use an entire block (e.g., 4 KB).
Common sizes are 512 bytes (the traditional disk sector size) and 4 KB (a typical filesystem block size).
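The block arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up, and treating an empty file as using zero data blocks is a simplifying assumption.

```python
def blocks_needed(file_size: int, block_size: int = 4096) -> int:
    """Number of whole blocks needed to hold file_size bytes."""
    if file_size <= 0:
        return 0  # assumption: an empty file has no contents to store
    # Integer "ceiling division": round up to the next whole block.
    return (file_size + block_size - 1) // block_size


def wasted_bytes(file_size: int, block_size: int = 4096) -> int:
    """Unused bytes in the last block allocated to the file."""
    return blocks_needed(file_size, block_size) * block_size - file_size


if __name__ == "__main__":
    print(blocks_needed(1))      # 1
    print(wasted_bytes(1))       # 4095
    print(blocks_needed(4097))   # 2
```

So a 1-byte file on a 4 KB block size still costs one full block, and a file that is one byte over 4 KB costs two.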

Once connected to a Linux system, a block device is represented as a device node like /dev/sda or /dev/nvme0n1 (more about device nodes later) and can be formatted with a filesystem like ext4 or xfs.

💡

When I say “represented in Linux”, I mean that the kernel detects the device when you connect it, and a user-space service called udev then sets up the matching device node under /dev. Our focus is on what happens after this connection phase.

So far we've seen that:

data is stored on a physical device as raw 1s and 0s
the storage happens in fixed-size chunks called blocks

Now the important question is: how does the OS understand, read, write, or navigate the data on a given storage device? And what does formatting a block device mean? The answer to both is the filesystem.

File System: How is data structured on storage devices?

A storage device without a filesystem is just a sea of 0s and 1s: a sequence of bytes/blocks with no interpretation. A filesystem is a software layer that organises raw block storage into usable files and directories.

The filesystem defines:

How files are named
How files are stored and retrieved
How permissions work
How data is located and organised

Think of a storage device like a huge library. The filesystem is the catalog system that helps you find the right book (file), and knows where on the shelf (disk sectors) to look.

In Linux, common filesystem types include:

ext4 (most popular on Linux)
xfs, btrfs, zfs
vfat, ntfs (for compatibility with Windows)

Here ext4, xfs, btrfs, etc. are filesystem drivers that run in kernel space, which gives them direct access to hardware and memory. When you use a filesystem like ext4, support for it is either built into the Linux kernel or loaded on demand as a kernel module.

They are responsible for implementing the rules and logic of that filesystem format, like:

Free space tracking (which blocks are free?)
Metadata management (timestamps, permissions, file sizes)
Directory hierarchy (folders and nested files)
Journaling (tracking recent writes to prevent corruption)

💡

To clarify further: your physical device exists independently of any filesystem. The Linux kernel creates a device node in /dev to give you access to it (more about /dev in a later section). When you format and mount that device, it gets mapped (by you) to a different place in the filesystem, e.g., /mnt/data or /home/hello.

Filesystem Hierarchy

When you mount a filesystem on a block device, it gives you a hierarchical view: directories and files arranged like a tree.

For example:

/home/
/home/vishnu/
/home/vishnu/file.txt

Even though file.txt may physically be split across scattered disk sectors, the filesystem presents a single logical view to the user.

`/dev` :

/dev is part of the Linux filesystem hierarchy, but the files under /dev (like /dev/sda) are not normal files. They are special files called device nodes, which act as interfaces to hardware devices. They don't "contain" the device; they refer to drivers in the kernel, which talk to the hardware.

The /dev directory is populated dynamically, with help from the udev service, at boot and whenever a device is connected. A device appears in /dev as soon as Linux detects it, whether it's formatted or not. At that point it exists as a block device, and you can interact with it using raw tools (like dd, hexdump, etc.). But unless it has a known filesystem on it, you cannot mount it, browse its files, or run ls on its contents.
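If you want to check what kind of node a path is, Python's standard `os` and `stat` modules expose the file type stored in its mode bits. The sketch below is a small illustration; the helper names are made up, and the example paths (such as /dev/sda) may not exist on your machine.

```python
import os
import stat


def describe_mode(mode: int) -> str:
    """Classify a file by the type bits of its st_mode value."""
    if stat.S_ISBLK(mode):
        return "block device"
    if stat.S_ISCHR(mode):
        return "character device"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISREG(mode):
        return "regular file"
    return "other"


def describe_path(path: str) -> str:
    """Classify whatever lives at the given path."""
    return describe_mode(os.stat(path).st_mode)


if __name__ == "__main__":
    # Example paths only; adjust them to devices present on your system.
    for path in ("/dev/null", "/dev/sda", "/tmp"):
        try:
            print(path, "->", describe_path(path))
        except OSError as exc:
            print(path, "-> not available:", exc)
```

On a typical Linux machine a disk such as /dev/sda would show up as a block device, while something like /dev/null is a character device.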

To summarise this point:

/dev/sda1 → The interface to your storage from Linux.
/mnt/data → The directory where the contents of that device (via its filesystem) appear, if you mount it there manually.
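To see this device-to-directory mapping on a running system, Linux lists the current mounts in /proc/mounts, one per line, starting with the device, the mount point, and the filesystem type. Here is a minimal parsing sketch; the `Mount` type and function names are made up for this example, and escaped characters in paths (such as `\040` for a space) are left undecoded.

```python
from typing import NamedTuple, Optional


class Mount(NamedTuple):
    device: str
    mount_point: str
    fs_type: str


def parse_mounts(text: str) -> list[Mount]:
    """Parse /proc/mounts style text into Mount records."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:  # device, mount point, filesystem type, ...
            mounts.append(Mount(fields[0], fields[1], fields[2]))
    return mounts


def mount_point_of(device: str, mounts: list[Mount]) -> Optional[str]:
    """Return where a device is mounted, or None if it is not mounted."""
    for mount in mounts:
        if mount.device == device:
            return mount.mount_point
    return None


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as handle:
            for mount in parse_mounts(handle.read()):
                print(f"{mount.device} -> {mount.mount_point} ({mount.fs_type})")
    except OSError as exc:
        print("could not read /proc/mounts:", exc)
```

A device that exists in /dev but has not been mounted simply will not appear in this list.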

Block device formatting

So far we've mentioned that a block device has to be formatted with a filesystem before the raw data on it makes sense. What exactly does this formatting process involve?

Formatting a block device (or its partition) means writing the filesystem layout (like ext4's structures) into the disk, so that the OS can organise and access data.

💡

Remember, formatting is not the same as mounting. Mounting a filesystem means attaching it to your OS's directory tree so you can access its files. The filesystem must already exist before mounting.

When you format a partition, let's say /dev/sda1, you write a set of structures to the disk, such as:

Write a superblock
Create inode tables
Initialise a root directory
Set up structures to track files and free space

This provides structure to your data. Without it, the kernel doesn't know how to find files, the partition can't be mounted, and commands like ls, cd, and cp have nothing to work with.

💡

Formatting is done using Linux utilities like mkfs, e.g., mkfs.ext4 /dev/sda1

If you have formatted a disk on your personal computer before, you might be wondering: “doesn’t formatting erase the data on the disk?” Yes, in effect it does: formatting overwrites parts of the disk, and how much depends on the type of formatting you do.

Formatting overwrites metadata to start fresh. Since the OS must have a clean, consistent filesystem structure, it has no way to just “add structure” to a device without potentially losing or ignoring old data.

To summarise, when you “format” a partition with a filesystem, you're laying down that filesystem's structures (superblock, inode table, bitmaps, data blocks, and so on) at specific reserved locations on the disk. Most of these structures have a well-defined location, and we will discuss them one by one.

So far we've seen the basics of a filesystem: why it is needed (to bring structure to data) and how it does that (formatting). Now let's go over each of the structures that get written during the formatting process.
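Here is a toy sketch of what "laying down structure" can look like, using an in-memory byte array as the disk. Everything about this layout (the magic number, the one-block-per-structure arrangement, the names `toy_format` and `read_superblock`) is invented for illustration and is not how ext4 or any real filesystem arranges its data.

```python
import struct

BLOCK_SIZE = 1024          # hypothetical toy block size
TOY_MAGIC = 0x544F5946     # arbitrary marker meaning "a toy filesystem lives here"
SUPERBLOCK_FMT = "<IIII"   # magic, block size, total blocks, inode count

# Toy layout: one block per structure, data blocks start right after them.
LAYOUT = {"superblock": 0, "block_bitmap": 1, "inode_bitmap": 2,
          "inode_table": 3, "first_data_block": 4}


def toy_format(image: bytearray, inode_count: int = 16) -> dict:
    """Write toy filesystem metadata into image and return the layout."""
    total_blocks = len(image) // BLOCK_SIZE
    if total_blocks <= LAYOUT["first_data_block"]:
        raise ValueError("image too small for the toy layout")
    # Only the metadata blocks are wiped; old bytes in the data area stay put.
    for block in range(LAYOUT["first_data_block"]):
        start = block * BLOCK_SIZE
        image[start:start + BLOCK_SIZE] = bytes(BLOCK_SIZE)
    struct.pack_into(SUPERBLOCK_FMT, image, 0,
                     TOY_MAGIC, BLOCK_SIZE, total_blocks, inode_count)
    # Mark the four metadata blocks (bits 0 to 3) as used in the block bitmap.
    image[LAYOUT["block_bitmap"] * BLOCK_SIZE] = 0b00001111
    # Reserve inode 0 for the root directory in the inode bitmap.
    image[LAYOUT["inode_bitmap"] * BLOCK_SIZE] = 0b00000001
    return dict(LAYOUT)


def read_superblock(image: bytearray) -> dict:
    """Read back the toy superblock, failing if the magic number is missing."""
    magic, block_size, total_blocks, inode_count = struct.unpack_from(
        SUPERBLOCK_FMT, image, 0)
    if magic != TOY_MAGIC:
        raise ValueError("no toy filesystem found")
    return {"block_size": block_size, "total_blocks": total_blocks,
            "inode_count": inode_count}


if __name__ == "__main__":
    disk = bytearray(b"\xAA" * (8 * BLOCK_SIZE))   # "old data" everywhere
    toy_format(disk)
    print(read_superblock(disk))
```

Note how the old bytes in the data area are not touched. They are simply no longer referenced by any structure, which is the same reason data on a freshly formatted disk is treated as gone.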

Superblock

Superblock is a special block at a fixed location (usually at the start) that stores global metadata about the filesystem (think of it as the “table of contents” for the file system).

It contains:

Filesystem type (ext4, xfs, etc.)
Total number of blocks/inodes
Block size
UUID of filesystem
Location of free space tracking structures
Last mount time, last check time, etc.

Location on disk:

For ext filesystems, the primary superblock starts at a fixed offset of 1024 bytes into the partition (inside block 0 when the block size is larger than 1 KB, or in block 1 when it is exactly 1 KB).
Backup copies of the superblock are stored in some of the later block groups for recovery.

Why it is important:

Without it, the OS has no idea how to interpret the partition’s contents.
Corrupt superblock = filesystem appears “dead” until you use a backup.

So the superblock does not store any actual file data or per-file metadata. Instead, it stores global metadata: metadata about the filesystem itself.
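As a rough illustration, the sketch below decodes a few superblock fields from an ext2/3/4 filesystem. The field offsets follow the published ext4 on-disk layout (little-endian values, magic number 0xEF53 at offset 0x38), but treat this as a reading aid rather than a complete parser. The function names are made up, only the low 32 bits of the block count are read, and opening a real device such as /dev/sda1 normally requires root.

```python
import struct
import uuid

SUPERBLOCK_OFFSET = 1024   # bytes from the start of the partition
SUPERBLOCK_SIZE = 1024
EXT_MAGIC = 0xEF53


def parse_ext_superblock(raw: bytes) -> dict:
    """Decode a handful of fields from a raw ext2/3/4 superblock."""
    if len(raw) < SUPERBLOCK_SIZE:
        raise ValueError("superblock is 1024 bytes long")
    (magic,) = struct.unpack_from("<H", raw, 0x38)
    if magic != EXT_MAGIC:
        raise ValueError("no ext2/3/4 magic number found")
    inodes_count, blocks_count_lo = struct.unpack_from("<II", raw, 0x00)
    (log_block_size,) = struct.unpack_from("<I", raw, 0x18)
    (blocks_per_group,) = struct.unpack_from("<I", raw, 0x20)
    (inodes_per_group,) = struct.unpack_from("<I", raw, 0x28)
    return {
        "inodes_count": inodes_count,
        "blocks_count_lo": blocks_count_lo,        # low 32 bits only
        "block_size": 1024 << log_block_size,      # stored as a power-of-two shift
        "blocks_per_group": blocks_per_group,
        "inodes_per_group": inodes_per_group,
        "uuid": str(uuid.UUID(bytes=bytes(raw[0x68:0x78]))),
    }


def read_superblock(path: str) -> dict:
    """Read the primary superblock from a device or filesystem image file."""
    with open(path, "rb") as device:
        device.seek(SUPERBLOCK_OFFSET)
        return parse_ext_superblock(device.read(SUPERBLOCK_SIZE))
```

For example, `read_superblock("disk.img")` could be pointed at a filesystem image file (a hypothetical name here) instead of a live device, which is a safer way to experiment.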

Inode Table

In many filesystems the inode table sits near the start of the disk, after the superblock and other bookkeeping structures, but the exact placement depends on the filesystem design.

When you format an ext-style filesystem, it pre-allocates a fixed number of inodes (index nodes), and these inodes are stored in a contiguous region on disk called the inode table.

Each inode in the inode table has a unique number known as the inode number. The inode number is essentially the index (position) of that file's inode inside the inode table; in ext filesystems the numbering starts at 1. The filesystem uses this number to look up all the file's metadata and figure out where the actual content is stored.
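Since ext filesystems keep one inode table per block group, turning an inode number into a position takes a little arithmetic. The sketch below shows that calculation, assuming inode numbers start at 1. The function name, the 8192 inodes per group, and the 256-byte inode size are example values chosen for illustration, not values read from a real filesystem.

```python
def locate_inode(inode_number: int, inodes_per_group: int,
                 inode_size: int = 256) -> tuple[int, int, int]:
    """Return (block group, index within that group's inode table, byte offset
    within that table) for an inode number that starts counting at 1."""
    if inode_number < 1:
        raise ValueError("inode numbers start at 1")
    zero_based = inode_number - 1
    group = zero_based // inodes_per_group
    index_in_group = zero_based % inodes_per_group
    byte_offset = index_in_group * inode_size
    return group, index_in_group, byte_offset


if __name__ == "__main__":
    print(locate_inode(2, 8192))       # (0, 1, 256)
    print(locate_inode(8193, 8192))    # (1, 0, 0)
```

This is why a lookup by inode number is cheap: it is plain arithmetic followed by a read at a known position.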

But what is an inode?

Every file has its own inode, and this inode stores all the metadata about that file except its name and its actual content. For example, it stores the file type (regular file, directory, or symlink), permissions, UID/GID of the owner, size, pointers to data blocks, and so on. The pointers are the most important part, as this is how the inode keeps track of where the actual data lives.

So where is the name of the file stored? It lives in a directory entry (a mapping of name → inode number), and that's why you can have hard links: multiple directory entries (names) pointing to the same inode. We will discuss this in detail in the Directory Block section; for now, just keep this point in mind!

Purpose of inode table:
- Holds one inode per file/directory.
- Each inode stores:
  - File type (regular file, dir, symlink…)
  - Owner (UID/GID)
  - Permissions
  - Timestamps (access, modification, status change; ext4 also records creation time)
  - Pointers to data blocks on disk
Location on disk:
- ext splits the disk into block groups; each block group has its own inode table.
- 💡 In ext4, block groups are used to organise the filesystem into smaller, manageable chunks. This design helps reduce fragmentation, speeds up filesystem checks, and allows for more efficient allocation of storage space, especially for large files. We are not going deeper into this as it is out of scope for this article, but it is good to know!
- The inode table is a reserved contiguous area inside each block group.
Why important:
- The inode is how the filesystem knows where the file’s data lives.
- Filenames are not stored in inodes — only metadata and block addresses.
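You can look at most of this inode metadata from user space with `os.stat`, which is part of Python's standard library. The small sketch below collects a few fields; the function name `inode_summary` and the choice of fields are just for illustration.

```python
import os
import stat


def file_type(mode: int) -> str:
    """Translate the type bits of st_mode into a short label."""
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISLNK(mode):
        return "symlink"
    if stat.S_ISREG(mode):
        return "regular file"
    return "other"


def inode_summary(path: str) -> dict:
    """Collect inode metadata for a path without following symlinks."""
    st = os.lstat(path)
    return {
        "inode": st.st_ino,
        "type": file_type(st.st_mode),
        "permissions": stat.filemode(st.st_mode),
        "uid": st.st_uid,
        "gid": st.st_gid,
        "size": st.st_size,
        "links": st.st_nlink,      # number of names pointing at this inode
        "modified": st.st_mtime,
    }


if __name__ == "__main__":
    for key, value in inode_summary(".").items():
        print(f"{key}: {value}")
```

Notice that nothing in the result is the file's name. You had to supply the name yourself to find the inode in the first place.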

Data Block

A data block is the smallest chunk of space on disk that a filesystem uses to store the actual contents of a file. If your ext4 filesystem uses a 4 KB block size, then each data block is 4 KB of raw storage, which means a file larger than 4 KB will use multiple blocks. Even a tiny file (say 200 bytes) still consumes a whole block; the wasted, unused part of the block is known as internal fragmentation.

Purpose:
- This is where the actual file contents live.
- Each block is a fixed size (e.g., 4 KB) determined at format time.
Location on disk:
- Spread across block groups.
- Inodes (stored in the inode table) point to these blocks.
Why important:
- Lose the mapping from inode to block, and the data becomes “orphaned.”

Earlier we discussed pointers in an inode, and these pointers are how data blocks are connected to each inode. The inode doesn't store file contents; rather, it stores block addresses pointing to where the content is in the data block region. A single inode may contain multiple pointers (depending on the size of the file) to different data blocks.

Data blocks themselves don't “point” to each other like a linked list. Instead, the inode is the hub that maps a file to its data blocks. Each block number is just a location in the block region of the filesystem. And remember, two consecutive blocks of a file might be far apart on disk if fragmentation has occurred.

So, data blocks are chunks of actual file content. Each file has an inode stored in the inode table, and this inode stores pointers to the data blocks. Most importantly, data blocks don't know about each other; the inode and its pointer blocks act as the “map”.

💡

Just to be clear your file might be stored in multiple data blocks spread across the disk and the inode for that file contains pointers to those blocks in the correct order.
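The relationship between an inode and its data blocks can be modelled with a toy in-memory "disk". In the sketch below the disk is a dictionary from block number to bytes, and the inode keeps an ordered list of block numbers. The tiny 8-byte block size, the `Inode` class, and the `store`/`read` helpers are all invented for illustration and ignore real-world details such as indirect blocks.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 8  # tiny block size so the example stays readable


@dataclass
class Inode:
    size: int = 0
    block_pointers: list[int] = field(default_factory=list)


def store(disk: dict[int, bytes], free_blocks: list[int], data: bytes) -> Inode:
    """Split data into blocks, place them in free blocks, and record the order."""
    inode = Inode(size=len(data))
    for start in range(0, len(data), BLOCK_SIZE):
        block_number = free_blocks.pop(0)   # may be anywhere on the "disk"
        chunk = data[start:start + BLOCK_SIZE]
        disk[block_number] = chunk.ljust(BLOCK_SIZE, b"\0")  # pad the last block
        inode.block_pointers.append(block_number)
    return inode


def read(disk: dict[int, bytes], inode: Inode) -> bytes:
    """Follow the inode's pointers in order and trim the padding."""
    raw = b"".join(disk[number] for number in inode.block_pointers)
    return raw[:inode.size]


if __name__ == "__main__":
    disk: dict[int, bytes] = {}
    inode = store(disk, [7, 2, 9, 4], b"hello filesystem!")
    print(inode.block_pointers)   # [7, 2, 9]
    print(read(disk, inode))      # b'hello filesystem!'
```

The blocks land at positions 7, 2 and 9, which are not adjacent, yet the file reads back correctly because the inode remembers the order.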

Directory Block

A Directory Block is just a special kind of data block that stores a list of directory entries. Each directory entry contains: Filename, Inode number and optionally Length & type

Purpose:

Special data blocks that contain filename → inode number mappings.
Example:
- resume.pdf → inode 1025
- photos → inode 2048

Location on disk:

Stored in regular data blocks, but interpreted differently by the filesystem.

Why important:

Without directories, you could still read files by inode number, but there would be no way to find a file by name or to know where it belongs in the hierarchy.

Imagine a folder /docs with two files named resume.pdf & notes.txt, the directory block for /docs might contain something like this:

| Filename | Inode Number |
| --- | --- |
| `resume.pdf` | 105 |
| `notes.txt` | 202 |

The directory itself is a file, but its data blocks store a table like above instead of plain text

And when you run ls /docs, the filesystem:

Looks up /docs’s inode to find its data blocks.
Reads those blocks to get the list of (filename → inode number) mappings.
For each inode number, loads its inode to get file size, permissions, etc.

💡

Try ls -i to see the inode number of each entry.
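The same name → inode number view is available from Python through `os.scandir`, whose entries expose an `inode()` method. The sketch below builds a small dictionary from it, roughly what `ls -i` prints; the function name `directory_entries` is made up for this example.

```python
import os


def directory_entries(path: str) -> dict[str, int]:
    """Return a {filename: inode number} mapping for one directory."""
    with os.scandir(path) as entries:
        return {entry.name: entry.inode() for entry in entries}


if __name__ == "__main__":
    for name, inode_number in sorted(directory_entries(".").items()):
        print(inode_number, name)
```

The entries `.` and `..` are not included, because `os.scandir` skips them.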

Connecting the dots: as we discussed in the inode section, the inode does not store the file name. The name → inode mapping is kept in the directory block, and the file's inode then tells the OS where the file's actual data blocks live.

You might be wondering why names are connected to inodes this way instead of being stored in the inode directly.
Decoupling the name from the inode lets us create hard links: multiple names, possibly in different directories, pointing to the same inode. In other words, the same inode can have multiple names without duplicating the file. Apart from hard links, this design also makes renaming cheap (only the directory entry changes) and treats directories as first-class citizens: directories are just files whose data blocks map names to inodes.

From the user/UI perspective:

You never directly “see” inodes or block numbers.
You interact with names in directories (/home/vishnu/file.txt).
The kernel handles translating names → inode → data blocks.

From the filesystem’s perspective:

File names are just labels stored in directories.
Inodes are the “real identity” of the file.
That’s why you can have:
- Multiple names (hard links) → same inode.
- No name at all (deleted file still open in a process) → inode still exists until all references drop.
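Hard links are easy to try out with `os.link` from the standard library. The sketch below creates a file, gives it a second name, and then removes the first name. The file names and the function `demo_hard_link` are made up for illustration, and it assumes the directory lives on a filesystem that supports hard links.

```python
import os


def demo_hard_link(directory: str) -> dict:
    """Create a file, hard-link it, remove the first name, and report what happened."""
    original = os.path.join(directory, "file.txt")
    alias = os.path.join(directory, "alias.txt")
    with open(original, "w") as handle:
        handle.write("same data")

    os.link(original, alias)   # a second directory entry for the same inode
    same_inode = os.stat(original).st_ino == os.stat(alias).st_ino
    links_before = os.stat(original).st_nlink

    os.unlink(original)        # removes one name, not the inode
    with open(alias) as handle:
        content = handle.read()
    links_after = os.stat(alias).st_nlink

    return {"same_inode": same_inode, "links_before": links_before,
            "links_after": links_after, "content": content}


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as tmp:
        print(demo_hard_link(tmp))
```

Deleting file.txt only removes one directory entry. The data stays reachable through alias.txt because the inode still has a name pointing at it.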

💡

When you run mkdir, the filesystem allocates a new inode for the directory, creates a minimal data block containing . and .., and updates the parent directory’s listing with the new entry. No file data blocks for actual contents are allocated yet — only metadata and the small block for the directory table are written. If the filesystem is journaling, these metadata updates are logged before being committed to disk.

Bitmaps

Now bitmaps are like the filesystem’s "availability map" which keeps track of what’s free and what’s used. It is just a long sequence of bits (0 and 1): 0 means free, 1 means occupied.

Purpose:

Track which inodes and which data blocks are free or used.
The filesystem uses two main bitmaps:
- Block bitmap, tracks which data blocks are free or in use.
- Inode bitmap, tracks which inode entries in the inode table are free or in use.
1 bit per item (block/inode) — very space-efficient.

Location on disk:

Each block group has its own bitmaps right next to its inode table.

Why important:

This is how the filesystem quickly finds free space for new files.
Corruption here = possible overwrites of in-use files.

This is especially useful when allocating space. When you create a file, the filesystem can quickly scan the bitmaps to find the next free inode and free data blocks.

It also helps with consistency checks: tools like fsck can compare the bitmaps against the inodes and blocks that are actually in use to detect and repair inconsistencies.

Since it uses 1 bit per item, it is very space-efficient: even huge disks can be mapped using very little space and searched quickly.

In ext4, each block group (remember we discussed: the FS splits disk into block groups) has:

Its own block bitmap
Its own inode bitmap

This keeps allocation local and helps avoid having to search the whole disk for free space.
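A bitmap allocator is small enough to sketch in full. The `Bitmap` class below is a toy written for this article: it scans for the first free bit, which is a simplification, since real filesystems use smarter placement strategies.

```python
class Bitmap:
    """One bit per item: 0 means free, 1 means used."""

    def __init__(self, item_count: int):
        self.item_count = item_count
        self.bits = bytearray((item_count + 7) // 8)   # round up to whole bytes

    def is_used(self, index: int) -> bool:
        return bool(self.bits[index // 8] & (1 << (index % 8)))

    def mark_used(self, index: int) -> None:
        self.bits[index // 8] |= 1 << (index % 8)

    def mark_free(self, index: int) -> None:
        self.bits[index // 8] &= ~(1 << (index % 8)) & 0xFF

    def allocate(self) -> int:
        """Find the first free item, mark it used, and return its index."""
        for index in range(self.item_count):
            if not self.is_used(index):
                self.mark_used(index)
                return index
        raise RuntimeError("no free items left")


if __name__ == "__main__":
    block_bitmap = Bitmap(10)
    print(block_bitmap.allocate())   # 0
    print(block_bitmap.allocate())   # 1
    block_bitmap.mark_free(0)
    print(block_bitmap.allocate())   # 0
```

Ten items fit in two bytes here, which shows why bitmaps stay small even for large disks.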

Let's summarise what we went through so far:

In this article we've covered the story of how data is stored on disks, starting from 0s and 1s and moving up to filesystem internals: how data is actually stored and interpreted by the OS. Along the way we saw why a filesystem matters, what each of its components does, and how they interact to let the OS use disks efficiently.

I've packed a lot of detail into one article while trying to keep the jargon minimal. That made the article a bit long, but some of the topics had to be discussed in detail for better understanding. I hope that by the end of this article you were able to connect the dots and better understand filesystems in general. Next I'll be writing a short one about the different types of links in the Linux filesystem, covering topics like hard links and symlinks. Stay tuned for more!

Feel free to share your thoughts about this and let me know if it lacks clarity or substance!
Thanks for reading!!


Vishnu Mohan
