A Beginner’s Guide to HDFS: The Heart of Hadoop Storage

Anamika Patel

The Hadoop Distributed File System (HDFS) is designed to handle huge volumes of data by distributing it across multiple machines in a cluster.

  1. Data Splitting into Blocks

    When a user uploads or copies a file into HDFS (e.g., using hdfs dfs -put), the system doesn’t store the entire file as one unit. Instead, the file is split into fixed-size blocks, usually 128 MB or 256 MB. The block size can be configured in hdfs-site.xml using dfs.blocksize. For example, a 500 MB file will be split into four blocks: 128 MB, 128 MB, 128 MB, and 116 MB.
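
    As a quick sketch (the file names and HDFS paths here are only placeholders), an upload and a block-size check look like this with the standard Hadoop CLI:

    ```bash
    # Copy a local file into HDFS
    hdfs dfs -put sample.csv /data/sample.csv

    # Print the cluster's configured default block size, in bytes (134217728 = 128 MB)
    hdfs getconf -confKey dfs.blocksize

    # Override the block size for a single upload (268435456 bytes = 256 MB)
    hdfs dfs -D dfs.blocksize=268435456 -put big_file.csv /data/big_file.csv
    ```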

  2. Where are These Blocks Stored?

    The blocks are stored on Datanodes (worker machines in the cluster).

    The Namenode (master node) doesn’t store the actual data. Instead, it keeps track of:

    * Which blocks each file has been broken into

    * Where each block (and its replicas) is stored across the Datanodes; you can inspect this mapping yourself, as shown below
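
    To see this mapping for a file, `hdfs fsck` lists every block along with the Datanodes holding its replicas (the path is the placeholder file from step 1):

    ```bash
    # Show the blocks of a file and where each replica is stored
    hdfs fsck /data/sample.csv -files -blocks -locations
    ```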

  3. Replication of Data (for Fault Tolerance)

    To make sure data is not lost if a node fails, HDFS automatically creates multiple copies of each block:

    * Default replication factor: 3

    * This means each block is stored on 3 different Datanodes (the factor can be changed per file or globally; see the commands after the example below)

    Example:

    * Block A: stored on Datanodes 1, 3, and 5

    * Block B: stored on Datanodes 2, 4, and 6
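
    As a small sketch (again using the placeholder path from step 1), the replication factor of a file can be checked with `hdfs dfs -stat` and changed with `hdfs dfs -setrep`:

    ```bash
    # %r = replication factor, %o = block size, %n = file name
    hdfs dfs -stat "%r %o %n" /data/sample.csv

    # Raise the replication factor of this one file to 5 and wait (-w) until it finishes
    hdfs dfs -setrep -w 5 /data/sample.csv
    ```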

    The replication follows the rack-awareness policy:

    * One copy on the node (and rack) where the write happens

    * One on a Datanode in a different rack

    * One on another Datanode in that same remote rack, so that a single rack failure never takes out every copy (the command below shows how the Namenode maps Datanodes to racks)
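
    To check how the Namenode has grouped Datanodes into racks (this typically needs HDFS admin privileges, and on a single-rack test cluster everything appears under /default-rack):

    ```bash
    # Print the rack -> Datanode topology known to the Namenode
    hdfs dfsadmin -printTopology
    ```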

  4. Why All This?

    * If a Datanode fails, HDFS still has 2 other copies of each block.

    * If the replication count drops (e.g., due to a failure), HDFS re-replicates the affected blocks automatically; the commands below show how to monitor this.
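
    Both situations can be watched from the command line: `hdfs dfsadmin -report` summarizes live and dead Datanodes, and a filesystem check reports any blocks that are currently under-replicated:

    ```bash
    # Cluster summary: capacity plus live and dead Datanodes (admin command)
    hdfs dfsadmin -report

    # Namespace health check; the summary includes an under-replicated blocks count
    hdfs fsck /
    ```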

  5. Summary Flow:

    1. User copies a file to HDFS

    2. HDFS splits the file into blocks (128 MB by default)

    3. Each block is stored across multiple Datanodes.

    4. The Namenode keeps track of the block locations.

    5. Blocks are replicated (usually 3 copies) across different machines for fault tolerance.


Written by

Anamika Patel

I'm a Software Engineer with 3 years of experience building scalable web apps using React.js, Redux, and MUI. At Philips, I contributed to healthcare platforms involving DICOM images, scanner integration, and real-time protocol management. I've also worked on Java backends and am currently exploring Data Engineering and AI/ML with tools like Hadoop, MapReduce, and Python.