Understanding Big Data Frameworks: Apache Hadoop

Ricky Suhanry

In the previous article, we already talked a little bit about big data and Apache Hadoop.

Consider once more the hundreds of thousands of commodity hardware nodes and the hundreds of thousands of petabytes (PB) of data.

In the context of big data, this article will concentrate on the core components and architecture of Hadoop.


Introduction

Academic and business communities took an interest in Apache Hadoop because it is an open-source solution. The Hadoop ecosystem is a framework and a collection of Apache open-source projects that support processing large datasets across distributed computing environments. Hadoop not only processes large amounts of data but also stores it in its system.

Hadoop logo | Source: commons.wikimedia.org

Short History

Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally developed to support distribution for the Nutch search engine project. After Google released its GFS paper in 2003 and its MapReduce paper in 2004, the Hadoop creators quickly realized that workloads of this scale could not run on a single processor. Hadoop needed a distributed approach, which meant breaking the problem into smaller parts and running them concurrently; this became the original MapReduce engine of Hadoop 1.x. YARN (Hadoop 2.x) arrived considerably later.

In the late 2000s came Pig, which was built on top of Hadoop and MapReduce and functioned much like a SQL-style processor. It was originally developed internally at Yahoo! in 2006 to create and execute MapReduce jobs over large datasets.

Hive came four years later, in 2010: an open-source data warehouse system built on top of Hadoop. Apache Hive is one of the technologies developed to address data warehousing requirements at Facebook. Hive is a batch processing framework best suited for long-running jobs.

Drill offers a far better experience than Hive for interactive data exploration and business intelligence. Apache Drill was first released in 2014 and was inspired by Google's Dremel system, whose externalized version is now part of Google Cloud Platform as BigQuery.

The latest version of Hadoop as of 2025 is Hadoop 3.x, which offers significant improvements over its predecessors, Hadoop 1.x and 2.x. Hadoop 3.x introduces advanced features such as erasure coding for fault tolerance, improved scalability beyond 10,000 nodes, and enhanced storage efficiency with 50% less overhead compared to Hadoop 2.x.

Difference between Hadoop 1 and Hadoop 2 | Source: geeksforgeeks.org

Apache Hadoop 3 Quick Start Guide | Source: packtpub.com

Hadoop Cluster Architecture

A Hadoop cluster is a collection of computers, known as nodes, that function as a single, centralized system working on the same job. Every node is an autonomous, self-sufficient entity with its own memory and disk space.

Master nodes

The master nodes manage the two main functional components of Hadoop: storing large amounts of data (HDFS) and running parallel computations on that data (MapReduce). The master daemons, namely the NameNode, the Secondary NameNode, and the JobTracker (replaced by the ResourceManager from Hadoop 2.x onward), usually run on better hardware, each on its own machine.

Slave or Worker nodes

A slave or worker node performs the jobs assigned by the master nodes. Each worker node runs the DataNode and TaskTracker services, which receive instructions from the master nodes.

Client nodes

A client node serves as a gateway between the Hadoop cluster and outside systems and applications. Client nodes are responsible for loading data into the cluster, submitting MapReduce jobs that specify how the data should be processed, and retrieving or viewing the job output once processing is complete.

Understanding Hadoop Clusters and the Network | Source: bradhedlund.com

Core Components

Hadoop Common

A set of shared libraries and utilities that support the other Hadoop modules. Essential services including serialization, I/O, and persistent data storage are offered by Hadoop Common. Hadoop Common's main objective is to offer a necessary collection of tools, libraries, and Java APIs that support the operation of other Hadoop modules, including YARN, MapReduce, and the Hadoop Distributed File System (HDFS).
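
As a small illustration of what Hadoop Common provides, the sketch below round-trips two values through the Writable serialization classes in org.apache.hadoop.io. It is a minimal example that assumes the hadoop-common library is on the classpath; the key and value names are made up for illustration.

```java
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Minimal sketch: round-trip two Writable values through Hadoop Common's
// compact binary serialization (assumes hadoop-common on the classpath).
public class WritableRoundTrip {
    public static void main(String[] args) throws Exception {
        Text key = new Text("page_views");        // illustrative key
        IntWritable value = new IntWritable(42);  // illustrative value

        // Serialize both values into an in-memory buffer.
        DataOutputBuffer out = new DataOutputBuffer();
        key.write(out);
        value.write(out);

        // Deserialize from the same bytes.
        DataInputBuffer in = new DataInputBuffer();
        in.reset(out.getData(), out.getLength());
        Text keyCopy = new Text();
        IntWritable valueCopy = new IntWritable();
        keyCopy.readFields(in);
        valueCopy.readFields(in);

        System.out.println(keyCopy + " = " + valueCopy.get());
    }
}
```

This compact binary format is the same mechanism MapReduce uses for its keys and values.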

Hadoop Distributed File System (HDFS)

Hadoop Distributed File System (HDFS) is a distributed file system that provides high-throughput access to application data. HDFS maintains multiple replicas of data across different nodes to significantly enhance the system's fault tolerance.

In the event of a failure of any node within the cluster, HDFS can retrieve the data from a replica stored on another node, often on a different rack. By default, HDFS keeps three replicas of each block.
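
To make the replication behaviour concrete, here is a hedged sketch using the HDFS FileSystem API: it writes a small file, requests three replicas, and reads the file back. The NameNode address (hdfs://namenode:8020) and the path /demo/hello.txt are placeholders, and the example assumes the hadoop-client libraries are available.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: assumes a reachable cluster at hdfs://namenode:8020 (placeholder)
// and hadoop-client dependencies on the classpath.
public class HdfsReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/demo/hello.txt");          // hypothetical path

        // Write a small file; the NameNode records the metadata,
        // the DataNodes store the actual blocks.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello from hdfs");
        }

        // Ask HDFS to keep three replicas of this file (the default).
        fs.setReplication(path, (short) 3);
        System.out.println("replication = " + fs.getFileStatus(path).getReplication());

        // Read the file back through any DataNode holding a replica.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```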

Yet Another Resource Negotiator (YARN)

Yet Another Resource Negotiator (YARN) is a core component of the Hadoop ecosystem, serving as its resource management and job scheduling framework. YARN consists of three daemons: the ResourceManager (RM), the NodeManager (NM), and the WebAppProxy.

→ ResourceManager (RM)

This is the central authority in charge of resource management for the Hadoop cluster. The ResourceManager has two main components: the Scheduler and the ApplicationsManager.

  • The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints such as queues and capacities.

  • The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster (AM), and providing the service for restarting the ApplicationMaster container on failure.

→ NodeManager (NM)

The NodeManager (NM) is the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network), and reporting that usage to the ResourceManager (RM) and its Scheduler.

→ WebAppProxy

The WebAppProxy proxies web requests to the ApplicationMaster user interfaces. It runs as part of the ResourceManager by default, but it can be configured to run as a standalone service.

Apache Hadoop 3.4.1 – Apache Hadoop YARN | Source: hadoop.apache.org
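
The sketch below uses the YarnClient API to ask the ResourceManager for cluster-level information, roughly the figures that the NodeManagers described above report upstream. It is an assumption-laden example: it expects a running cluster whose yarn-site.xml is on the classpath.

```java
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

import java.util.List;

// Sketch only: assumes yarn-site.xml is on the classpath and a
// ResourceManager is reachable.
public class YarnClusterInfo {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // The ResourceManager aggregates what every NodeManager reports.
        System.out.println("NodeManagers: "
                + yarnClient.getYarnClusterMetrics().getNumNodeManagers());

        // Per-node reports: capacity and current usage of each worker.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + " capability=" + node.getCapability()
                    + " used=" + node.getUsed());
        }

        yarnClient.stop();
    }
}
```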

MapReduce

MapReduce is a framework for distributed, parallel processing of large volumes of data. By dividing petabytes of data into smaller pieces and processing them concurrently on Hadoop's commodity servers, MapReduce makes large-scale processing manageable. It is the primary processing component of Apache Hadoop and handles job division, shuffling, and filtering.

Hadoop Component Architecture

HDFS architecture

The HDFS architecture follows a master/slave design, with the NameNode as master and the DataNodes as slaves. HDFS consists of three daemons: the NameNode, the SecondaryNameNode, and the DataNode.

→ NameNode

The NameNode is a key component of an HDFS file system and runs on Hadoop's master node. It regulates client access to files and maintains the filesystem namespace on the master server, performing operations such as opening, closing, and renaming files and directories.

→ SecondaryNameNode

The SecondaryNameNode is responsible for maintaining a copy of the filesystem metadata on disk. Despite its name, it is not a standby replacement for the NameNode; its main purpose is checkpointing. It periodically creates checkpoints of the namespace so that the filesystem metadata stays up to date and can be recovered in case of a NameNode failure.

→ DataNode

DataNodes are the worker nodes in HDFS, responsible for storing and retrieving actual data blocks as instructed by the NameNode. They carry out operations such as block creation, replication, and deletion.

HDFS Interview Questions and Answers | Source: whizlabs.com
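
The division of labour between the NameNode (metadata) and the DataNodes (blocks) is visible from a client. Under the same placeholder-cluster assumptions as the earlier HDFS sketch, the example below asks the NameNode which DataNodes hold the blocks of a file.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: /demo/hello.txt is a hypothetical file on a placeholder cluster.
public class BlockLocationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/demo/hello.txt"));

        // The NameNode answers from its namespace metadata: for each block
        // of the file, which DataNodes currently hold a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```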

MapReduce architecture

A MapReduce job usually divides the input data set into independent chunks that are processed entirely in parallel by the map tasks. MapReduce has two phases: the Map Phase and the Reduce Phase.

→ Map Phase

There are two steps in this phase: splitting and mapping. In the splitting step, the input data is divided into fixed-size units called input splits. In the mapping step, each split is fed to a mapper as key/value pairs <key, value>, and the mapper emits intermediate key/value pairs.

→ Reduce Phase

The Reducer also takes key/value pairs <key, value> as input and produces key/value pairs as output. It receives the intermediate output of the mappers, which is shuffled and sorted by key, and condenses it into a smaller set of values.

Learn Everything about MapReduce Architecture & its Components | Source: analyticsvidhya.com
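
To tie the two phases together, here is the classic word-count example, lightly commented: the mapper emits <word, 1> pairs during the Map Phase, and after the shuffle and sort the reducer sums the counts during the Reduce Phase. It follows the canonical example from the Hadoop documentation and is shown here as a sketch; input and output paths are passed as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map Phase: each input split is fed to a mapper line by line;
    // the mapper emits <word, 1> for every token it finds.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce Phase: after shuffle and sort, each reducer sees <word, [1, 1, ...]>
    // and emits <word, total count>.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```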

Common Hadoop Use Cases

Big data analysis

Hadoop facilitates organizations in the centralized aggregation and storage of extensive datasets from various systems. In the retail sector, e-commerce enterprises use Hadoop to monitor the associations of products purchased concurrently by consumers. For instance, when a customer intends to procure a mobile device, the system may recommend accessories such as a mobile back cover or screen protector.

Log processing

Organizations leverage Hadoop for the processing and examination of log data from web servers, applications, and network devices. When an individual accesses a website, Hadoop is capable of capturing data regarding the visitor's origin prior to arriving at a specific website. Advertising targeting platforms utilize Hadoop to capture and analyze clickstream, video, transaction, and social media data.

Machine learning and AI

Hadoop serves as a foundational framework for numerous machine learning and artificial intelligence paradigms by effectively managing the data sets for large models. In healthcare, Hadoop facilitates the examination of vast quantities of data derived from medical apparatus, laboratory findings, clinical observations, imaging diagnostics, and other relevant sources. Through the utilization of Hadoop, data storage can be administered and processed with optimal efficiency, thereby allowing for a concentrated emphasis on the architecture and training of artificial intelligence algorithms.

Data warehouse

Hadoop stores and manages large volumes of structured and unstructured data, serving as a cost-effective data warehouse solution. Within governmental sectors, Hadoop is leveraged for the advancement of national, state, and municipal development by analyzing vast amounts of data.

Fraud detection

Financial institutions leverage Hadoop to detect fraudulent activities by analyzing large volumes of transaction data. They utilize Apache Hadoop to mitigate risk, detect unscrupulous traders, and examine patterns of fraud. Feature engineering techniques are then used to ensure that detection models have enough information to recognize fraudulent activities.

Hadoop Advantages

Cost-effectiveness

As an open-source framework that can run on commodity hardware, Hadoop significantly reduces the cost of storing and processing large datasets compared to traditional solutions. In other words, Hadoop offers us two key advantages at a reasonable price: first, it is open-source, meaning it is free to use; second, it employs commodity hardware, which is also reasonably priced.

Scalability

By adding more nodes to the cluster, Hadoop can effortlessly scale from gigabytes to petabytes of data, allowing enterprises to manage growing data volumes. This horizontal scaling contrasts with vertical scaling, in which you add CPU or RAM to individual machines to boost their capacity.

Processing Speed

Hadoop leverages parallel processing to break down large tasks into smaller ones that can be processed concurrently, significantly improving processing speed and efficiency. Speed is crucial when working with large amounts of unstructured data; Hadoop makes it simple to process terabytes of data in a matter of minutes.

Fault Tolerance

Hadoop runs on commodity hardware, that is, low-cost systems that can crash at any moment. Data in Hadoop is replicated across many DataNodes inside a Hadoop cluster, guaranteeing data availability in the event that one of your systems crashes.

Hadoop Limitations

Handling Small Files

Hadoop's architecture is not optimized for handling a large number of small files. Small files are those that are significantly smaller than Hadoop's block size, which is 128 MB by default. Since the NameNode keeps the entire HDFS namespace in memory, it gets overloaded when there are too many tiny files.

Latency

Hadoop's MapReduce framework is built for batch processing, where large datasets are processed in bulk rather than interactively. MapReduce takes considerable time to complete these jobs, which increases latency. Map takes a piece of data and transforms it into another set of data, while Reduce takes the map's output as input and processes it further.

No real-time data processing

Apache Hadoop is designed for batch processing: it can ingest and process large amounts of data and then deliver results. Hadoop is not suitable for real-time data processing.

Summary

Apache Hadoop is an open-source framework designed to process and store vast datasets across distributed computing environments. Created in 2005 by Doug Cutting and Mike Cafarella, it was initially developed to support the Nutch search engine project. The evolution of Hadoop, influenced by Google's GFS and MapReduce papers, led to the distributed processing capabilities seen in Hadoop 1.x and the later introduction of YARN in Hadoop 2.x. Hadoop 3.x, the latest version as of 2025, offers significant improvements in fault tolerance, scalability, and storage efficiency.

Hadoop's core components include Hadoop Common, HDFS, YARN, and MapReduce, all working together to manage and process big data. HDFS (Hadoop Distributed File System) handles high-throughput data access by breaking files into chunks and storing them across nodes. Hadoop finds common use cases in big data analysis, log processing, machine learning and AI, data warehousing, and fraud detection. I hope you enjoyed reading this.


Written by

Ricky Suhanry

Data Engineer based in Jakarta, Indonesia. When I first started out in my career, I was all about becoming a Software Engineer or Backend Engineer. But then I realized that I was actually more interested in being a Data Practitioner. Currently focused on data engineering and cloud infrastructure. In my free time, I enjoy jogging and running, listening to J-pop music, and learning the Japanese language.