Hadoop Step-by-Step Learning Process


If you’re someone who wants to learn Hadoop from scratch, this guide is just for you. In this article, I’ll walk you through the step-by-step learning path, from setting up your environment to understanding core components like HDFS, MapReduce, and Hive.
🧱 What is Hadoop?
Apache Hadoop is an open-source framework for storing and processing large-scale datasets on clusters of commodity hardware using parallel and distributed computing. Originally developed by Doug Cutting and Mike Cafarella, Hadoop was inspired by Google’s MapReduce and GFS (Google File System) papers.
Hadoop Architecture
Hadoop is a batch-processing, loosely coupled framework: its components (storage, resource management, and computation) are independent pieces that work together as an integrated whole.
[Figure: Hadoop architecture]
📁 HDFS (Hadoop Distributed File System)
HDFS is the primary storage system of Hadoop, designed to store very large files reliably.
🔸 Key Concepts:
Block Storage: Files are split into fixed-size blocks (default 128 MB)
NameNode: Master daemon — maintains metadata (namespace, block locations)
DataNode: Slave daemon — stores actual blocks on disk
🔹 Example:
Uploading a 300MB file:
It’s split into 3 blocks: 128MB + 128MB + 44MB
Each block is replicated across 3 different DataNodes for fault tolerance
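To make this concrete, here is a minimal Java sketch using Hadoop’s FileSystem API. It assumes a running HDFS cluster whose configuration (core-site.xml / hdfs-site.xml) is on the classpath; the class name and the paths sample.csv and /data/sample.csv are just placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();    // picks up cluster settings from the classpath
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; the client splits it into blocks as it writes
        Path target = new Path("/data/sample.csv");  // placeholder HDFS path
        fs.copyFromLocalFile(new Path("sample.csv"), target);

        // Ask the NameNode (metadata) where the replicas of each block live (DataNodes)
        FileStatus status = fs.getFileStatus(target);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
For the 300MB file above, the loop would print three blocks, each listing the DataNodes that hold its replicas.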
🧮 MapReduce
MapReduce is the computation engine that handles distributed data processing.
🔸 Map Phase:
Input split into records
Each record is processed into key-value pairs
(key1, value1) → (key2, value2)
🔸 Shuffle and Sort:
Group all values by key
🔸 Reduce Phase:
Aggregate or process grouped values to produce final output
key2 → list(value2) → (key3, value3)
Example: Word Count
Map:
"The cat sat" → ("The", 1), ("cat", 1), ("sat", 1)
Reduce (after shuffle and sort group the pairs from every mapper by word):
"The" → [1, 1, 1] → ("The", 3)
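For reference, the same word count written against the Hadoop MapReduce Java API looks roughly like this. It is a sketch of the classic example; the class names (WordCount, TokenizerMapper, SumReducer) are just illustrative.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in an input line
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);          // e.g. ("The", 1), ("cat", 1), ("sat", 1)
                }
            }
        }
    }

    // Reduce phase: the framework has already grouped values by key, so just sum them
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum)); // e.g. ("The", 3)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);        // optional local pre-aggregation
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Packaged into a jar, it runs with hadoop jar wordcount.jar WordCount /input /output, where /input holds the text files in HDFS and /output must not exist yet.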
Hive — SQL on Hadoop:
Apache Hive is a data warehouse system built on top of Hadoop.
Allows SQL-like querying using HQL (Hive Query Language)
Stores data in HDFS
Great for batch querying, ETL pipelines, and analytics
🔸 Hive Architecture:
HiveQL Parser → Compiler → Optimizer → Execution Plan
Execution happens as MapReduce or on Tez/Spark engines
Alongside this, learn the concepts of Hive partitioning (static vs. dynamic), Hive bucketing and how to choose a bucket count, and the Hive ORC file format. A sketch covering these follows the basic example below.
Example code:
-- A simple managed table over comma-delimited text files in HDFS
CREATE TABLE emp (
  sno  INT,
  name STRING,
  city STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
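As a follow-up to the partitioning, bucketing, and ORC topics mentioned above, here is a hedged Java JDBC sketch that creates a partitioned, bucketed, ORC-backed variant of the emp table and fills it using dynamic partitioning. It assumes an unsecured HiveServer2 at localhost:10000 with the hive-jdbc driver on the classpath; the table name emp_part and the bucket count of 4 are just illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HivePartitionDemo {
    public static void main(String[] args) throws Exception {
        // Classic driver registration; requires hive-jdbc (and its dependencies) on the classpath
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {

            // Partitioned by city (one HDFS directory per city value),
            // bucketed on sno (4 files per partition), stored as ORC
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS emp_part (" +
                "  sno INT," +
                "  name STRING" +
                ") PARTITIONED BY (city STRING)" +
                " CLUSTERED BY (sno) INTO 4 BUCKETS" +
                " STORED AS ORC");

            // Dynamic partitioning: Hive derives each row's partition from the
            // last column of the SELECT instead of requiring a fixed city value
            stmt.execute("SET hive.exec.dynamic.partition=true");
            stmt.execute("SET hive.exec.dynamic.partition.mode=nonstrict");
            stmt.execute(
                "INSERT INTO TABLE emp_part PARTITION (city) " +
                "SELECT sno, name, city FROM emp");
        }
    }
}
Partitioning by city means queries that filter on city read only that partition's directory, while bucketing on sno splits each partition into a fixed number of files that Hive can use for sampling and bucketed joins.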
✅ Summary Table
Hadoop — Big Data Framework
HDFS — Distributed File Storage
MapReduce — Parallel Computation Model
Hive — SQL-like Query Engine over HDFS
YARN — Resource Management Layer
References:
https://www.geeksforgeeks.org/linux-unix/basic-linux-commands/
https://www.youtube.com/watch?v=rsOSrEbK7sU&list=PLLa_h7BriLH2UYJIO9oDoP3W-b6gQeA12
💬 Final Thoughts
If you’re also on a journey of learning, remember this:
🚀 Stay curious. Stay consistent. Happy coding!
Let’s keep learning, building, and growing — one day at a time. 💪 If this post resonated with you, feel free to connect or drop a comment.
💬 Let’s Connect!
🔗 Connect with me on LinkedIn
🐦 Follow my journey on X (Twitter)
Subscribe to my newsletter