Introduction to Big Data Analysis

Madhav Ganesan
15 min read

Data refers to raw, unprocessed facts, statistics, or information collected for reference, analysis, and processing.

Data comes in different formats:

Structured Data: Organized in a defined format, such as rows and columns in a database

Unstructured Data: No predefined structure, such as text, emails, images, videos, and social media posts.

(Note: If 20 percent of the data available to enterprises is structured data, the other 80 percent is unstructured.)

Semi-Structured Data: A hybrid of structured and unstructured data, like JSON or XML documents that have tags or markers to organize the data but don’t fit into a traditional database schema.
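For example, a JSON record is tagged and easy to parse, but different records may carry different fields, so it does not map cleanly onto a fixed table schema. A minimal Python sketch (the record below is invented purely for illustration):

import json

# A made-up, semi-structured customer record: fields are tagged with names,
# but the nesting and optional fields don't fit a rigid relational schema.
record = """
{
    "id": 101,
    "name": "Asha",
    "orders": [
        {"item": "laptop", "price": 55000},
        {"item": "mouse", "price": 700}
    ],
    "notes": "prefers email contact"
}
"""

data = json.loads(record)                 # parse the tagged document
print(data["name"], len(data["orders"]))  # navigate the nested fields directly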

Sources of Structured Data

Machine-Generated Structured Data: sensor data, web log data, point-of-sale data, and financial data.

Human-Generated Structured Data: input data, click-stream data, and gaming-related data.

Sources of Unstructured Data

Machine-Generated Unstructured Data: satellite images, scientific data, photographs and video, and radar or sonar data.

Human-Generated Unstructured Data: text internal to your company, social media data, mobile data, and website content.

Big Data

It refers to large and complex datasets that are difficult to process, analyze, and manage using traditional data processing tools and techniques.

Key Characteristics of Big Data

1) Volume: This refers to the size/magnitude of the data, typically in the range of terabytes, petabytes, or even exabytes.

Limitations of Traditional RDBMS:

  • Traditional RDBMS are designed primarily for structured data organized in tables. They struggle with unstructured or semi-structured data

  • RDBMS have limitations in scaling out horizontally (adding more servers) to handle increased data volume. They often rely on vertical scaling, which is not always feasible for big data scenarios.

  • IoT devices and sensors generate continuous data streams that traditional databases are not equipped to handle efficiently.

2) Velocity: Big Data is generated at high speed and often requires real-time or near-real-time processing to derive meaningful insights.

3) Variety: This refers to different types and sources of data that need to be integrated and analyzed

Advantages: competitive advantage and enhanced analytics. Ex. structured, unstructured, and semi-structured data.

4) Veracity (the 4th V, coined by IBM): It refers to the reliability, accuracy, and truthfulness of data.

  • Data must be accurate and reliable to make informed decisions. False or misleading data can lead to incorrect conclusions and misguided actions.

  • Ensure that data is complete, accurate, and timely.

  • High-quality data leads to more reliable analysis outcomes.

  • Implement processes to verify data accuracy and consistency.

  • Assess the reliability of data sources. Data from reputable and validated sources is generally more trustworthy.

  • The data must be relevant and contextually appropriate for the specific problem

5) Variability: It refers to the fluctuations in the data flow rates and the changing patterns of data over time. It highlights the inconsistency in how data is generated, stored, and processed.

6) Visualization: It is a crucial aspect of big data that involves converting complex data into graphical formats such as graphs, charts, and maps. This transformation allows for easier comprehension and actionable insights from large and complex datasets.

Ex. Nanocubes is a visualization tool developed by AT&T that supports high-dimensional data visualization.

7) Value: This refers to the usefulness of the data

Examples of Big Data

Social Media: posts, tweets, comments, likes, shares, and multimedia content.
Internet of Things (IoT): data from sensors, devices, wearables, and smart systems.
E-commerce: customer behavior, transaction logs, and purchase history.
Healthcare: medical records, imaging data, and genomic data.
Finance: stock market data, transaction data, and financial records.
Web Analytics: clickstreams, browsing history, and user interactions on websites.

Data warehouse

It is a centralized repository that stores large volumes of structured and sometimes semi-structured data from various sources

Key Characteristics of a Data Warehouse

Subject-Oriented: It is organized around key business subjects such as sales, finance, inventory, etc.

Integrated: Data from different sources (e.g., transactional systems, databases, flat files, cloud applications) is cleaned, transformed, and integrated into a unified format

Non-Volatile: Once data is loaded into the data warehouse, it is not modified or deleted. This ensures historical data is preserved for long-term analysis.

Time-Variant: Data in a data warehouse is stored with time stamps, allowing users to track and analyze trends over time.

Data Mart

It is a specialized subset of a data warehouse that is designed to focus on a specific business department, such as sales, marketing, finance, or operations.

Tools for building a data warehouse

Amazon Redshift (AWS), Google BigQuery, Snowflake, Apache Hive, Microsoft Azure Synapse Analytics, IBM Db2 Warehouse on Cloud

Snapshots of data

A snapshot is a static view of the data at a particular point in time. Organizations might store snapshots to preserve a record of important information without keeping a continuously updated dataset.

Big data management cycle


Big data architecture


Layer 0: Redundant Physical Infrastructure

The infrastructure should be resilient to failures and changes, with sufficient redundant resources in place, ready to jump into action.

Here, "elastic" refers to the ability of a system or infrastructure to dynamically scale its resources up or down in response to varying demands. Elasticity ensures that the system can handle fluctuations in workload efficiently without significant performance degradation or wasted resources.

This layer typically utilizes multiple machines to handle large-scale data processing tasks. Redundancy helps in managing failures and scaling operations efficiently.

Layer 1: Security Infrastructure

Security infrastructure is vital to protect sensitive data and ensure compliance with regulations. Security measures should be integrated into the architecture from the beginning, including encryption, access control, and monitoring.

Adhering to industry standards and regulations (e.g., GDPR, HIPAA) is crucial for data security and privacy.

Layer 2: Operational Databases

Database engines store and manage the data collections relevant to your business. They must be fast, scalable, and reliable. Different engines may be more suitable depending on the big data environment, often requiring a mix of database technologies.

Structured Data: Relational Database

Unstructured Data: NoSQL Database


Layer 3: Organizing Data Services and Tools

This layer addresses the challenges of handling large volumes of unstructured data and ensuring continuous data capture without relying on snapshots.

Key Characteristics of Virtualization for Big Data

Partitioning

For big data environments, we can run multiple applications or instances of data processing frameworks (like Hadoop) on the same physical hardware

Isolation

Each virtual machine (VM) is isolated from others and from the host system. This means that if one VM fails or encounters issues, it doesn’t impact others.

Encapsulation

A VM can be encapsulated into a single file or set of files, which includes the virtual machine’s operating system, applications, and data.

Virtualization

It is the process of creating a simulated version of something rather than a physical version. This can include hardware platforms, storage devices, and network resources.

Virtualization allows multiple virtual systems to run on a single physical machine by separating the virtual environments from the underlying hardware

Imagine a small company with only one physical server. The company needs to run several different applications, such as a web server, a database server, and an email server. Instead of buying multiple physical servers for each application, they use virtualization to run multiple virtual servers on a single physical server.

How It Works

Physical Server: The company has one physical server with sufficient hardware resources (CPU, RAM, disk space).

Virtualization Software: The company installs virtualization software (such as VMware, Hyper-V, or VirtualBox) on this physical server.

Creating Virtual Machines (VMs):

VM 1: Runs the web server application.
VM 2: Runs the database server application.
VM 3: Runs the email server application.

Each VM operates as if it were a separate physical server, with its own operating system and applications.

Resource Allocation: The virtualization software allocates the physical server’s resources (CPU, RAM, disk space) to each VM as needed. For instance, if the web server needs more resources, the virtualization software can allocate additional resources from the pool.


Applications

Server Consolidation, Software Virtualization, Storage Virtualization, Network Virtualization

Hypervisor (virtual machine monitor)

It is a software layer that allows multiple virtual machines (VMs) to run on a single physical host machine. It sits between the physical hardware and the virtual machines, managing and allocating the host's hardware resources among the VMs.

Types of Hypervisors

  1. Type 1 Hypervisor (Bare-Metal Hypervisor): It runs directly on the physical hardware of the host machine, without an underlying operating system. It interacts with the hardware and manages the VMs directly. Ex. VMware ESXi, Microsoft Hyper-V, Xen, KVM (Kernel-based Virtual Machine)

  2. Type 2 Hypervisor (Hosted Hypervisor): It runs on top of a host operating system, which then interacts with the physical hardware. The hypervisor relies on the host OS to manage hardware resources.

Ex. VMware Workstation, Oracle VirtualBox, Parallels Desktop

Data Team Structure

Data Engineers

They manage the infrastructure, ensuring data pipelines, databases, and computing resources are effectively set up and maintained.

Data Modelers

They focus on analyzing data, developing models, and creating predictive / inferential products.

Subject Matter Experts (SMEs)

They provide deep knowledge about the specific domain or industry to guide data-driven decision-making.

How does Software Development compare with Big Data Analysis?

Tools: Version control (like Git) parallels distributed file systems (HDFS) for managing data across nodes. IDEs are like big data platforms (e.g., Spark) for efficient processing.

Processes: Modularization in development is like data pipelines (e.g., Apache NiFi) for organizing data steps. Continuous integration mirrors data streaming tools (Kafka) for real-time updates.

Algorithms: Sorting/searching algorithms compare to MapReduce in big data, distributing processing across clusters. Testing in dev is similar to data cleaning for reliable results.

Use Cases: Debugging matches anomaly detection, optimizing code resembles query optimization, and developing features aligns with predictive analytics for actionable insights.

Team Structure: Cross-functional dev teams are like big data teams (engineers, analysts) working together for robust, insightful products.

Distributed File System (DFS)

A distributed file system (DFS) is a storage system that allows files to be stored across multiple servers or nodes in a way that makes them accessible as if they were on a single device.

Example systems: Hadoop Distributed File System (HDFS), Google File System (GFS), Amazon S3

Running HDFS on a laptop?

(This is only for testing purposes, since HDFS is designed for large-scale parallel processing.) It works in pseudo-distributed mode, which simulates a distributed environment on a single machine. All Hadoop components (NameNode, DataNode, ResourceManager, and NodeManager) run on the same machine, and the laptop acts as both the master and the worker node.


Why DFS?

Redundancy: Data is replicated across multiple nodes, ensuring that if one node fails, the data is still available from another node.

Scalability: DFS can easily scale out by adding more nodes to the cluster. This allows organizations to increase storage capacity and processing power as needed without significant changes to the existing infrastructure.

Parallel Processing: By distributing data across multiple nodes, a DFS allows for parallel processing of data.

How does file storage happen in a DFS?

Scenario: Suppose we have a large video file that we want to store in a DFS.

1. Chunking the File

The video file is divided into chunks of 64 MB each. For instance, if the video file is 192 MB, it will be divided into three chunks: Chunk 1 (64 MB), Chunk 2 (64 MB), and Chunk 3 (64 MB).

2. Replication of Chunks

Each chunk is replicated three times for redundancy. The replicas are distributed across different compute nodes to ensure that data remains available even if some nodes fail. Example:

Chunk 1: Replica 1 on Node A (Rack 1), Replica 2 on Node C (Rack 2), Replica 3 on Node D (Rack 2)
Chunk 2: Replica 1 on Node B (Rack 1), Replica 2 on Node A (Rack 1), Replica 3 on Node C (Rack 2)
Chunk 3: Replica 1 on Node D (Rack 2), Replica 2 on Node B (Rack 1), Replica 3 on Node A (Rack 1)

3. Master Node (NameNode)

There is a master node (also known as the NameNode) that keeps track of where each chunk is located. For our example, the NameNode maintains the mapping:

Chunk 1: Node A, Node C, Node D
Chunk 2: Node B, Node A, Node C
Chunk 3: Node D, Node B, Node A

The NameNode itself is also replicated across different nodes for fault tolerance, ensuring that if one NameNode fails, others can take over.
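To make the bookkeeping concrete, here is a minimal Python sketch that models the NameNode's metadata for the example above (the node names, chunk size, and replication factor come from the scenario; this is a conceptual model, not HDFS's actual implementation):

import math

CHUNK_SIZE_MB = 64
REPLICATION = 3

# NameNode-style metadata for the 192 MB video file:
# chunk id -> nodes holding a replica of that chunk.
chunk_locations = {
    "video.mp4_chunk1": ["Node A", "Node C", "Node D"],
    "video.mp4_chunk2": ["Node B", "Node A", "Node C"],
    "video.mp4_chunk3": ["Node D", "Node B", "Node A"],
}

# Every chunk should have exactly REPLICATION replicas recorded.
assert all(len(nodes) == REPLICATION for nodes in chunk_locations.values())

def chunk_count(file_size_mb):
    # Number of chunks needed for a file of the given size.
    return math.ceil(file_size_mb / CHUNK_SIZE_MB)

def locate(chunk_id):
    # Ask the "NameNode" where a chunk's replicas live.
    return chunk_locations.get(chunk_id, [])

print(chunk_count(192))            # 3 chunks for the 192 MB file
print(locate("video.mp4_chunk2"))  # ['Node B', 'Node A', 'Node C']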

MapReduce

It is a programming model and processing framework designed for processing and generating large data sets in parallel across a distributed computing environment. It was introduced by Google to handle massive amounts of unstructured data efficiently.


Core concepts of MapReduce

Map Phase: The map function processes input data in parallel. It divides the data into smaller chunks and applies a specific function to each chunk to produce intermediate key-value pairs. Ex. {"word":1}

Shuffle and Sort: After the map phase, the framework sorts and groups the intermediate key-value pairs by key. Ex. In the word count application, all pairs with the same word key are grouped together

Reduce Phase: The reduce function processes the grouped intermediate data. It takes each group of key-value pairs and aggregates them into a final result.
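The whole model can be sketched in a few lines of plain Python for the classic word-count example (this only illustrates the flow; a real MapReduce framework runs the map and reduce steps in parallel across many machines):

from collections import defaultdict

documents = ["big data is big", "data needs processing"]

# Map phase: emit intermediate (word, 1) pairs for every word in every chunk.
intermediate = []
for doc in documents:
    for word in doc.split():
        intermediate.append((word, 1))

# Shuffle and sort: group the intermediate values by key.
groups = defaultdict(list)
for key, value in sorted(intermediate):
    groups[key].append(value)

# Reduce phase: aggregate each group into a final (word, count) result.
result = {key: sum(values) for key, values in groups.items()}
print(result)  # {'big': 2, 'data': 2, 'is': 1, 'needs': 1, 'processing': 1}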


Work of master and worker nodes

Master (JobTracker)

Task Management: Coordinates the execution of MapReduce jobs, creating Map and Reduce tasks based on user input.
Task Assignment: Assigns tasks to Workers based on availability and load, monitoring their health.
Status Tracking: Keeps track of each task's status (idle, in progress, completed) and provides job progress updates.
Intermediate Data Handling: Manages the creation and flow of intermediate files between Map and Reduce tasks.
Job Scheduling: Schedules task execution, optimizing resource use and data locality.

Worker (TaskTracker)

Task Execution: Executes either Map or Reduce tasks, processing data assigned by the Master.
Map Tasks: Processes input chunks, generates intermediate key-value pairs, and writes them to local disk.
Reduce Tasks: Retrieves intermediate data, applies the reduce function, and produces final output results.
Communication with Master: Reports task status to the Master, requests new tasks, and notifies it of completions.


Applications

Google used it for large matrix-vector multiplications for calculating PageRank. Amazon used it for performing analytical queries to identify users with similar buying patterns.

What is Hadoop?

A comprehensive open-source framework part of Apache Software Foundation that includes both distributed storage (HDFS) and processing (MapReduce) components for managing and analyzing large-scale datasets.

Master node

NameNode: It stores the directory tree of the file system, file metadata, and the locations of each file in the cluster.

SecondaryNameNode (Master): It performs housekeeping tasks and checkpointing on behalf of the NameNode. (Note: it is not a backup of the NameNode.)

ResourceManager (Master, YARN): It allocates and monitors available cluster resources for applications and handles the scheduling of jobs on the cluster.

ApplicationMaster (Master, YARN): It coordinates a particular application being run on the cluster, as scheduled by the ResourceManager.

Worker Nodes

DataNode: It stores and manages HDFS blocks on the local disk. It also reports health and status of individual data stores back to the NameNode.

NodeManager: It runs and manages processing tasks on an individual node as well as reports the health and status of tasks as they’re running.


How Hadoop Works

Data Access in HDFS

  • Client Request: The client application sends a request to the NameNode to locate the data.

  • Metadata Retrieval: The NameNode replies with a list of DataNodes that store the requested data blocks.

  • Direct Data Access: The client directly requests each data block from the respective DataNodes.

  • NameNode Role: The NameNode acts as a traffic cop, managing metadata and directing clients, but it does not store or transfer the data itself (see the sketch below).
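As a rough illustration of this flow from client code, the snippet below reads a file over WebHDFS using the third-party HdfsCLI package (pip install hdfs); the endpoint URL, user name, and file path are placeholders that depend on your cluster setup:

from hdfs import InsecureClient  # third-party HdfsCLI package (pip install hdfs)

# Assumed WebHDFS endpoint and user; adjust to match your cluster.
client = InsecureClient("http://localhost:9870", user="hadoop")

# The client first obtains block locations (metadata) via the NameNode,
# then streams the actual bytes from the DataNodes that hold the blocks.
print(client.list("/"))                         # browse the root directory
with client.read("/data/input.txt") as reader:  # hypothetical file path
    content = reader.read()
print(len(content), "bytes read")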

Job Execution in YARN

  • Resource Request: Clients submit job requests to the ResourceManager for resource allocation.

  • ApplicationMaster Creation: The ResourceManager assigns an ApplicationMaster specific to the job for its duration.

  • Job Tracking: The ApplicationMaster tracks job execution and resource requirements.

  • Node Management: The ResourceManager monitors the status of the cluster nodes, while each NodeManager handles resource allocation and task execution within containers.

  • Task Execution: NodeManagers create containers and execute tasks as specified by the ApplicationMaster.

Components

  1. Hadoop Distributed File System (HDFS)

  2. MapReduce

  3. YARN (Yet Another Resource Negotiator): A cluster resource management layer for Hadoop that manages and schedules resources across the cluster.

  4. Hadoop Ecosystem Components (Apache Hive, Apache HBase, Apache Pig, Apache Spark, Apache Flume, Apache Sqoop, Apache Zookeeper)


Workflow Summary

  1. Job Submission: User submits a job.

  2. Task Creation: Master creates Map and Reduce tasks.

  3. Task Assignment: Master assigns tasks to Workers.

  4. Map Execution: Workers process data and generate intermediate results.

  5. Intermediate File Creation: Workers store intermediate data locally.

  6. Reduce Execution: Master assigns Reduce tasks; Workers process intermediate data.

  7. Completion Notification: Workers notify the Master upon task completion.

  8. Output Storage: Final results are saved to HDFS.

HDFS Blocks

HDFS files are divided into blocks, typically 64 MB or 128 MB, with high-performance systems often using 256 MB. The block size is configurable and represents the minimum data unit for read/write operations. By default, each block is replicated three times to ensure data availability in case of node failure. This replication factor is configurable at runtime.
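As a quick back-of-the-envelope sketch, block size and replication factor determine how many blocks and replicas a single file produces (the numbers below are only illustrative):

import math

def hdfs_blocks(file_size_mb, block_size_mb=128, replication=3):
    # Blocks created for one file, and the total replicas spread across DataNodes.
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication

print(hdfs_blocks(1024))       # 1 GB file with the defaults: (8, 24) -> 8 blocks, 24 replicas
print(hdfs_blocks(1024, 256))  # 1 GB file with 256 MB blocks: (4, 12)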

Hadoop Streaming

It is a utility that allows developers to create and run MapReduce jobs using any executable or script as the mapper and/or reducer. This provides a way to process data in Hadoop without needing to write Java code, making it accessible to a broader range of programming languages.

hadoop jar /path/to/hadoop-streaming.jar \
  -input /path/to/input.txt \
  -output /path/to/output \
  -mapper /path/to/mapper.py \
  -reducer /path/to/reducer.py

Example 1: Word Count

mapper.py:

#!/usr/bin/env python
# Mapper: emit one "word<TAB>1" pair for every word read from stdin.
import sys

if __name__ == "__main__":
    for line in sys.stdin:
        for word in line.split():
            sys.stdout.write("{}\t1\n".format(word))

reducer.py:

#!/usr/bin/env python
# Reducer: sum the counts for each word. Hadoop's shuffle/sort guarantees that
# all pairs sharing a key arrive on consecutive lines of stdin.
import sys

if __name__ == '__main__':
    curkey = None
    total = 0

    for line in sys.stdin:
        key, val = line.strip().split("\t")
        val = int(val)

        if key == curkey:
            total += val
        else:
            if curkey is not None:
                sys.stdout.write("{}\t{}\n".format(curkey, total))
            curkey = key
            total = val

    # Emit the count for the final key once the input is exhausted.
    if curkey is not None:
        sys.stdout.write("{}\t{}\n".format(curkey, total))
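To sanity-check the two scripts without a cluster, you can emulate the streaming pipeline locally with something like: cat input.txt | python mapper.py | sort | python reducer.py. The sort step stands in for Hadoop's shuffle-and-sort phase, which is what guarantees the reducer sees all values for a given key on consecutive lines.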

Advanced MapReduce:

Job Chaining: It allows for the execution of multiple MapReduce jobs in sequence, where the output of one job can serve as the input for the next.

Types:

  1. Linear Job Chaining: This involves a straight sequence of jobs, where each job runs after the previous one completes.


  2. Data Flow Job Chaining: This involves more complex workflows where the output of one job may feed into multiple subsequent jobs, allowing for more intricate data processing pipelines. Ex. ETL (Extract, Transform, Load)
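A minimal pure-Python sketch of linear job chaining, where the output of a first word-count-style job becomes the input of a second filtering job (the "jobs" here are plain functions, not actual Hadoop jobs):

from collections import Counter

def job1_word_count(lines):
    # First job: boil a map/reduce word count down to a single function.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def job2_filter_frequent(counts, threshold=2):
    # Second job: consume job 1's output and keep only the frequent words.
    return {word: c for word, c in counts.items() if c >= threshold}

raw = ["big data is big", "big data needs tools"]
intermediate = job1_word_count(raw)         # output of job 1 ...
final = job2_filter_frequent(intermediate)  # ... becomes the input of job 2
print(final)  # {'big': 3, 'data': 2}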

Combiners: It is an optional component in the MapReduce process that acts as a mini-reducer to optimize the data flow between the Mapper and Reducer phases. It processes the output of the Mapper to reduce the amount of data transferred to the Reducer by performing local aggregation (e.g., summing values) before sending it over the network.
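A small Python sketch of the idea for word count: aggregate locally on the mapper side so that fewer key-value pairs have to cross the network (this is a conceptual illustration, not Hadoop's Combiner API):

from collections import Counter

def map_with_local_combine(chunk):
    # Instead of emitting ("word", 1) once per occurrence, combine locally so
    # each distinct word in this chunk is emitted once with its partial count.
    return list(Counter(chunk.split()).items())

print(map_with_local_combine("to be or not to be"))
# [('to', 2), ('be', 2), ('or', 1), ('not', 1)] -> 4 pairs shipped instead of 6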

Partitioners: It determines how the output from the Mapper is distributed to the Reducer tasks. It controls the partitioning of the intermediate key-value pairs into different Reducers based on the key.
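Conceptually, the default strategy is to hash the key and take it modulo the number of reducers, so every pair with a given key lands on the same reducer. A small illustrative Python sketch (not Hadoop's actual Partitioner API):

import zlib

def partition(key, num_reducers):
    # Hash-partitioner style routing: hash of the key modulo the reducer count.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

for key in ["big", "data", "is", "needs", "processing"]:
    print(key, "-> reducer", partition(key, num_reducers=2))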


Stay Connected! If you enjoyed this post, don’t forget to follow me on social media for more updates and insights:

Twitter: madhavganesan

Instagram: madhavganesan

LinkedIn: madhavganesan
