Hadoop Framework
Introduction
What is Hadoop?
With the rise of Big Data, a framework was needed to manage it. Hadoop is one such framework, created by Doug Cutting and Mike Cafarella in 2005. It was developed to solve the problem of processing and storing massive data sets that traditional systems could not handle, and it is based on the principles of Google's MapReduce and the Google File System (GFS).
Why Hadoop?
→ Scalability: One of the major features Hadoop provides is horizontal scalability. It can scale out simply by adding more nodes to the cluster.
→ Cost-Efficiency: Hadoop runs on commodity hardware, which makes processing Big Data affordable.
→ Fault Tolerance: Hadoop replicates data three times by default (the replication factor is configurable), making it resilient to hardware failures and ensuring high availability.
Hadoop Ecosystem
The Hadoop ecosystem mainly consists of three major components.
Hadoop Distributed File System (HDFS)
HDFS provides highly scalable, reliable, and distributed storage for datasets. It stores data across multiple nodes in the cluster and ensures fault tolerance through data replication.
Key Features
→ HDFS splits large files into blocks (128 MB by default, often configured to 256 MB) and stores them across multiple machines, or “nodes“. Each block is replicated across several nodes (the default replication factor is 3).
→ HDFS is designed to scale horizontally, i.e. more storage can be added simply by adding more nodes to the cluster.
→ HDFS enables data locality: instead of moving data to the computation, the computation task is moved to the node where the data resides. This reduces network congestion and improves performance.
→ HDFS is optimized for a ‘write once, read many‘ access pattern: once data is written, it can be read many times efficiently, but it is not designed for in-place modification. A small example of working with HDFS follows this list.
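To make this concrete, here is a minimal sketch of writing and then reading a file through the HDFS Java API. The NameNode URI hdfs://namenode:9000 and the path /data/example.txt are placeholder values for illustration; in a real cluster the filesystem address, block size, and replication factor come from the cluster configuration (core-site.xml and hdfs-site.xml).

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWriteReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally taken from fs.defaultFS in core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/data/example.txt");

        // Write once: HDFS transparently splits the file into blocks and
        // replicates each block across DataNodes (default replication factor 3).
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read many: reads are served from whichever replica is closest.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}
```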
MapReduce:
MapReduce enables parallel processing of large datasets across multiple nodes. It follows a divide-and-conquer approach, splitting a job into smaller tasks and running them simultaneously across the distributed cluster. MapReduce has enough concepts to fill an article of its own, but here we will take a bird's-eye view.
Key Concepts
→ Map Phase (Data processing): The Map phase takes the input dataset, splits it into chunks, and processes each chunk in parallel across the nodes, emitting intermediate key-value pairs. These intermediate pairs are then sorted and shuffled by key.
→ Reduce Phase (Analysis/Aggregation): In the Reduce phase, the output from the Map phase is grouped by key, and the values corresponding to each key are aggregated or analyzed to produce the final result.
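As an illustration of these two phases, here is a condensed sketch of the classic word-count job written against the Hadoop MapReduce Java API. The class names and the use of command-line arguments for input and output paths are just conventions for this example.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: values arrive grouped by word; sum them for the final count.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A job like this would typically be packaged into a jar and submitted to the cluster with something like hadoop jar wordcount.jar WordCount /input /output.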
Yet Another Resource Negotiator (YARN):
YARN, as the name suggests, is a resource negotiator. It enables efficient management of cluster resources and scheduling of jobs across the distributed cluster.
Key Components
→ Resource Manager (RM):
The Resource Manager is the central authority that manages and allocates resources (e.g., CPU, memory) across the entire Hadoop cluster. It acts as the master node in YARN's architecture.
It has two main subcomponents:
- Scheduler: Responsible for allocating resources to running applications based on resource availability and configured scheduling policies. It does not monitor or restart failed applications; it simply allocates resources.
- Application Manager: Manages the lifecycle of submitted applications (e.g., MapReduce jobs) and negotiates the first container for an application to start the Application Master.
→ Node Manager (NM):
The Node Manager is responsible for managing resources and monitoring the health of an individual node in the cluster. It reports the node's resource usage (e.g., CPU, memory) to the Resource Manager and enforces resource constraints.
It also monitors and reports the status of containers running on that node.
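To tie these pieces together, the short sketch below uses the YARN client API to ask the Resource Manager for the node reports it aggregates from the Node Managers. It assumes a yarn-site.xml pointing at the cluster is available on the classpath; the class name is just an example.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterNodesExample {
    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml from the classpath to locate the Resource Manager.
        YarnConfiguration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // The Resource Manager aggregates the heartbeats it receives from each Node Manager.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.printf("%s  used: %s  capacity: %s%n",
                    node.getNodeId(),
                    node.getUsed(),        // resources currently in use on the node
                    node.getCapability()); // total resources the Node Manager offers
        }

        yarnClient.stop();
    }
}
```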
In simple terms, Hadoop has reshaped how we handle enormous amounts of data, offering a smart and flexible way to store and process it across many machines at once. With HDFS to store data and MapReduce to crunch the numbers, Hadoop helps businesses dig through huge piles of information quickly and efficiently. Whether it's analyzing trends or finding patterns, it’s a tool that turns chaos into clarity.
In the next article, we’ll take a closer look at how Hadoop is built, uncovering how its different parts fit together and power the magic behind big data processing. Stay tuned for a clearer view!