A Brief Introduction to Big Data and Analytics

Ricky Suhanry
12 min read

Introduction

Data is one of the essentials of any successful business. Businesses and their operations are significantly shaped by the enormous volumes of data they generate. More and more data is produced every second, and to extract value from it, that data must be stored and analyzed. The technical term "Big Data" refers to the enormous volume of diverse datasets produced and disseminated so quickly that traditional methods for processing, analyzing, retrieving, storing, and visualizing them are no longer appropriate or sufficient.

Short History

The famous Library of Alexandria, established around 300 B.C., can be considered a first attempt by the ancient Egyptians to capture all ‘data’ within the empire. Whereas up until the 1950s most data analysis was done manually and on paper, we now have the technology and capability to analyze terabytes of data within split seconds. A single big data collection can range from a few dozen terabytes (TB) to many petabytes (PB), and such collections are continuously growing; the total amount of data created worldwide was expected to reach 44 zettabytes by 2020. Large corporations such as Meta, Amazon, Apple, Netflix, Google, and OpenAI (MAANGO) are currently analyzing vast amounts of extremely detailed data in an effort to uncover previously unknown insights. Understanding how each of the following stages influenced the current definition of big data is crucial to comprehending its context today.

  1. Phase-1: Structured Content

Database administration has long been the source of data analysis, data analytics, and big data. The techniques that are used in these systems, such as structured query language (SQL) and the extraction, transformation and loading (ETL) of data, started to professionalize in the 1970s.

  2. Phase-2: Web-based Unstructured Content

The internet and related web applications began producing enormous volumes of data in the early 2000s. With the expansion of web traffic and online stores, companies such as Yahoo, Amazon, and eBay started to analyze customer behavior through click rates, IP-specific location data, and search logs.

  3. Phase-3: Mobile and Sensor Based Content

Big Data has entered a new phase and opened up new possibilities with the rise of smartphones and other mobile devices, sensor data, wearable technology, the Internet of Things (IoT), and many more data generators. Since this development is not expected to stop anytime soon, it could be stated that the race to extract meaningful and valuable information out of these new data sources has only just begun.

The Three major Phases in the evolution of Big Data | Source: bigdataframework.org

Characteristics

To understand how big data grew to such popularity, it helps to look at its defining characteristics. Each of these traits highlights a distinct set of opportunities and difficulties for organizations, reflecting the growing complexity of big data management and analysis.

The first 3 Vs

In a 2001 study, analyst Doug Laney explained the features of big data using the 3V model: Volume, Variety, and Velocity.

  1. Volume

    Volume means the amount of data. It refers to all the data collected and stored from different sources like IoT devices (connected sensors), mobile apps, cloud services, websites, and social media. As time passed, better tools like data lakes and Hadoop were created, which are now commonly used to store, process, and analyze data.

  2. Variety

    Besides the volume and speed of data, Big Data also includes handling different types of data from different sources. Big Data can be grouped into three types: structured, semi-structured, and unstructured or raw data.

  3. Velocity

    This feature shows how quickly data is generated. Alongside the exponentially growing volume of incoming data, the speed at which it arrives and must be processed is crucial.

The 5 V’s of Big Data

The 5V model, described by Oguntimilehin, extends the original three Vs with Veracity and Value. Big data poses unique challenges across several aspects, including analysis, visualization, integration, and architecture, due to its inherent high dimensionality.

  1. Volume

    As in the 3V model above, volume refers to the amount of data collected and stored from sources such as IoT devices, mobile apps, cloud services, websites, and social media.

  2. Variety

    As above, variety covers the different types of data arriving from different sources: structured, semi-structured, and unstructured or raw data.

  3. Velocity

    As above, velocity describes how quickly data is generated and how fast it must be processed.

  4. Veracity

    Veracity means how true, accurate, and trustworthy the data is. The biggest issue people face is that big data is often unclear and uncertain. If the data isn’t reliable and well-checked, then analyzing it won’t help you make good decisions.

  5. Value

    The most important "V" in terms of business impact is the value of data. Value comes from finding new insights and patterns that give a company a competitive advantage. Real results only happen when data becomes useful information.

The 10 Vs of Big Data

In 2014, Kirk Borne, who serves on several national and international advisory boards and journal editorial boards related to big data, extended the list to ten Vs.

  1. Volume

    As defined in the 3V model above: the sheer amount of data collected and stored.

  2. Variety

    As above: structured, semi-structured, and unstructured data coming from many different sources.

  3. Velocity

    As above: the speed at which data is generated and must be processed.

  4. Veracity

    As in the 5V model above: how true, accurate, and trustworthy the data is.

  5. Value

    As above: the business value unlocked when data is turned into useful information.

  6. Vulnerability

    Big data introduces new security issues. Big data systems hold complicated and sensitive information, making them easy targets of cyberattacks. Managing vulnerabilities is important for keeping big data secure because it helps find and fix problems before they can be exploited, lowering the chance of data breaches, leaks, and unauthorized access to important information.

  7. Validity

    Validity means how correct and accurate the data is for the purpose it’s used for. If the input data is accurate and the processing is done right, the results should be reliable.

  8. Variability

    Variability, which is sometimes mistaken for variety, is a measure of the irregularities in the data: the number of discrepancies that must be found by anomaly and outlier detection techniques before any meaningful analytics can take place (a small quality-check sketch follows this list). If you consistently derive different meanings from the same dataset, variability can significantly affect data homogeneity.

  9. Volatility

    Volatility refers to how quickly the value of data changes, because new data is continuously being produced. Data volatility usually falls under data governance and is assessed by data engineers.

  10. Visualization

    Visualization describes the process of transforming vast amounts of data into a form that is easier to comprehend and explore. The limits of in-memory technology and the inadequate scalability, functionality, and response time of current big data visualization tools pose technical difficulties.
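Veracity, validity, and variability stay abstract until you run checks against actual records. Below is a minimal pandas sketch, using a made-up table, of the kind of basic quality checks these Vs imply: missing values, duplicates, and out-of-range values. Production pipelines would rely on dedicated data-quality or anomaly-detection tooling; this is only an illustration.

```python
import pandas as pd

# Hypothetical raw records, standing in for data arriving from an ingestion pipeline.
raw = pd.DataFrame({
    "user_id": [1, 2, 2, None, 5],
    "age":     [34, 29, 29, 41, 208],   # 208 is implausible
    "country": ["ID", "JP", "JP", "US", "??"],
})

# Veracity / validity: how much of the data is missing or duplicated?
print("Missing values per column:")
print(raw.isna().sum())
print("Duplicate rows:", raw.duplicated().sum())

# Variability: flag out-of-range values with a simple rule-based check
# (real pipelines would use proper anomaly/outlier detection).
implausible = ~raw["age"].between(0, 120)
print("Rows with implausible ages:")
print(raw[implausible])
```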

Key Components

Big Data technologies are broadly classified into three parts:

  1. File system

A file system is responsible for the organization, storage, naming, sharing, and protection of files. In the context of big data and cloud computing, various file systems and storage solutions are available, including:

  • Distributed file system

    When dealing with the challenge of accessing big data stored in a distributed fashion across a cluster, a cluster file system is a common solution. It provides location-transparent access to data files for the servers in the cluster.

    Hadoop Distributed File System (HDFS) is a popular cluster file system designed to reliably store large amounts of data across the machines of a large-scale cluster. Amazon S3 is a highly scalable and durable object storage service provided by Amazon Web Services (AWS), Google Cloud Storage is Google’s object storage service for storing and retrieving data, and Azure Blob Storage is Microsoft’s object storage service for the Azure cloud platform. A short access sketch for these services follows this list.

  • Local file system

    A Local file system is the structure within a UNIX-like operating system where files, remote file shares, and devices are organized and accessed. It allows users to view, manage, and interact with files stored locally on the system. The most common local file system on UNIX platforms is the Unix File System (UFS), which is relatively basic but still widely used today.
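To make the storage layer concrete, here is a minimal Python sketch of reading a Parquet file from object storage and from HDFS with PyArrow. The bucket names, paths, and hostnames are hypothetical, and the HDFS call assumes the Hadoop client libraries are available on the machine; treat it as an illustration of location-transparent access rather than production code.

```python
import pyarrow.parquet as pq
from pyarrow import fs

# Object storage (Amazon S3): bucket and key below are hypothetical examples.
s3 = fs.S3FileSystem(region="us-east-1")
events = pq.read_table("my-data-lake/events/2024/01/part-0.parquet",
                       filesystem=s3)

# Cluster file system (HDFS): assumes Hadoop native libraries (libhdfs)
# are installed and a namenode is reachable at this hypothetical address.
hdfs = fs.HadoopFileSystem(host="namenode.internal", port=8020)
logs = pq.read_table("/warehouse/logs/part-0.parquet", filesystem=hdfs)

print(events.num_rows, logs.num_rows)
```

The same read_table call works against either backend, which is exactly the location transparency a cluster file system or object store is meant to provide.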

  2. Data Processing frameworks

Data processing frameworks usually provide direct access to data sources, building on top of data storage services and focusing on querying, conditional filtering, and data transportation. They can be divided into three typical kinds by their interfaces and computation models:

  • Batch Querying

    Batch data processing is the method of collecting data over time and processing it in bulk at scheduled intervals. This approach is suitable for handling high volumes of data where immediate results aren't necessary, focusing on throughput and resource efficiency.

    • Apache Hive

      Apache Hive is a data warehousing solution built on top of Hadoop. Its query language, HiveQL, simplifies big data analytics by translating queries into MapReduce jobs or Tez tasks that run on a Hadoop cluster.

    • Apache Spark

      Apache Spark started as a research project at the UC Berkeley AMPLab in 2009 and was open-sourced in early 2010. Spark SQL is a Spark module that brings native support for SQL for processing high-volume datasets (see the batch sketch after this list).

  • Real-Time Querying

    Real-time processing systems are inherently stateful. They require constant access to internal states for incremental computation.

    • Spark Streaming

      Spark Streaming, part of the same Apache Spark project, breaks the stream into small micro-batches and processes them, offering a balance between real-time and batch processing.

    • Apache Flink

      Apache Flink originated from the Stratosphere research project, started in 2010 by several German universities. It is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.

    • Apache Kafka

      Apache Kafka was originally developed at LinkedIn in 2010 to meet the growing need for a high-throughput, low-latency system for processing real-time event data. It is an open-source, distributed streaming platform that has changed the way streaming data is processed (see the streaming sketch after this list).

    • Apache Storm

      Apache Storm is a distributed, open-source real-time computation system initially developed by Nathan Marz at BackType. It processes data events as they arrive, offering low latency and real-time responsiveness.

    • Risingwave

      RisingWave's history began in early 2021 with the goal of creating a more accessible and efficient stream processing system. It is a stream processing platform that uses SQL to enhance data analysis, offering improved insights on real-time data.

  • Multi-Dimensional Query

    One of the most important elements of Online Analytical Processing (OLAP) is the multidimensional query. OLAP queries are multidimensional queries that can present cube data in a wide variety of views, and Multidimensional Expressions (MDX) is a query language for OLAP databases.

    • Apache Kylin

      Apache Kylin is among the best-known examples of a multi-dimensional query component. Relational Online Analytical Processing (ROLAP) is a form of OLAP that allows for dynamic multi-dimensional analysis of data stored in a relational database.
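As a concrete illustration of the batch-querying style above, the following PySpark sketch registers a Parquet dataset as a table and runs a Hive-style SQL aggregation over it. The dataset path and column names are hypothetical; it assumes PySpark is installed and, for the s3a:// path, that the Hadoop S3 connector is configured.

```python
from pyspark.sql import SparkSession

# Start a Spark session (locally here; in production this would run on a cluster).
spark = SparkSession.builder.appName("batch-query-sketch").getOrCreate()

# Load a hypothetical Parquet dataset and register it as a temporary view.
orders = spark.read.parquet("s3a://my-bucket/warehouse/orders/")
orders.createOrReplaceTempView("orders")

# A Hive/Spark SQL-style aggregation over the whole dataset.
daily_revenue = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
    ORDER BY order_date
""")
daily_revenue.show()
spark.stop()
```

For the real-time side, here is a minimal Spark Structured Streaming sketch that consumes a Kafka topic and maintains per-minute counts as events arrive. The broker address and topic name are placeholders, and the Kafka source requires the spark-sql-kafka connector package on the Spark classpath; a Flink or RisingWave job would express the same idea with different APIs.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Subscribe to a hypothetical Kafka topic as an unbounded stream.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "page_views")
    .load()
)

# Kafka delivers keys/values as binary; cast the value to a string and
# count events per page in one-minute windows as they arrive.
counts = (
    events.selectExpr("CAST(value AS STRING) AS page", "timestamp")
    .groupBy(window(col("timestamp"), "1 minute"), col("page"))
    .count()
)

# Emit the running counts to the console via micro-batches.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```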

  3. Data analytics

Many big data solutions first prepare data for analysis and then serve it in a structured format that analytical tools can use. The store that answers these queries can be a traditional Kimball-style relational data warehouse, commonly used in business intelligence (BI) systems, or a lakehouse that uses a medallion architecture. Big data analytics also covers techniques for finding patterns, trends, and relationships in the data, including descriptive, diagnostic, predictive, and prescriptive analytics, as well as the use of machine learning and artificial intelligence.
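As a small, self-contained example of descriptive analytics on prepared data, the pandas sketch below builds a region-by-month revenue rollup, conceptually similar to the OLAP cube views mentioned earlier. The dataset is made up purely for illustration; in practice the same aggregation would typically run in the warehouse or lakehouse engine itself.

```python
import pandas as pd

# A tiny, made-up sales table standing in for prepared warehouse/lakehouse data.
sales = pd.DataFrame({
    "region":  ["EU", "EU", "US", "US", "APAC", "APAC"],
    "month":   ["Jan", "Feb", "Jan", "Feb", "Jan", "Feb"],
    "revenue": [120.0, 135.5, 200.0, 180.25, 90.0, 110.75],
})

# Descriptive analytics: a region-by-month revenue rollup, conceptually
# similar to a multidimensional cube view.
cube = sales.pivot_table(index="region", columns="month",
                         values="revenue", aggfunc="sum")
print(cube)

# A simple diagnostic-style follow-up: month-over-month change per region.
print(cube["Feb"] - cube["Jan"])
```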

Big Data Format

In the world of big data, choosing the right file format is an important decision that affects both query performance and storage footprint. Big data formats are designed to store and handle very large amounts of data quickly and efficiently. The most common formats are CSV, JSON, Avro, Parquet, and ORC; a short sketch after the comparison image below shows the difference in practice.

  1. CSV

    A simple, text-based format ideal for tabular data. CSV is a row-based format, meaning each line in the file represents a row in the table.

  2. JSON (JavaScript Object Notation)

    JSON is a format that people can read easily and supports complex, nested data. It is commonly used for APIs and settings files.

  3. Apache Parquet

    Apache Parquet was officially released on 13 March 2013. It was initially developed by Twitter and Cloudera. Parquet files store data in a columnar format, which makes them very efficient for read-heavy analytics queries that only need to access a few columns.

  4. Apache Avro

    Apache Avro was created as part of the Apache Hadoop project in 2009. It is used for processing and generating large data sets. Avro stores data in a row-based format and is also described as a data serialization system that works well for write-heavy operations like streaming.

  5. Apache ORC (Optimized Row Columnar)

    Apache ORC was created in 2013 by Hortonworks, in collaboration with Facebook. ORC is a columnar format, which allows for efficient reads when you only need specific columns.

Big Data Format Comparison | Image created by author
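The practical difference between row-based and columnar formats is easy to demonstrate. The sketch below writes the same synthetic table as CSV and as Parquet, compares the file sizes, and then reads back only two columns from the Parquet file. It assumes pandas and pyarrow are installed; the file names and data are illustrative.

```python
import os
import pandas as pd

# A small synthetic table; real big data tables would be far larger.
df = pd.DataFrame({
    "user_id": range(100_000),
    "country": ["ID", "JP", "US", "DE"] * 25_000,
    "amount":  [round(i * 0.37, 2) for i in range(100_000)],
})

# Row-based text format: simple and universally readable.
df.to_csv("events.csv", index=False)

# Columnar binary format (requires the pyarrow package): compressed,
# schema-aware, and efficient for analytics that touch few columns.
df.to_parquet("events.parquet", index=False)

for path in ("events.csv", "events.parquet"):
    print(path, os.path.getsize(path), "bytes")

# Columnar advantage: read back only the columns a query actually needs.
subset = pd.read_parquet("events.parquet", columns=["country", "amount"])
print(subset.head())
```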

Summary

In conclusion, this blog covered Big Data: the rapidly growing volume of structured and unstructured data that traditional tools can no longer effectively process. Its evolution has been shaped by technological advances, from early database systems and internet expansion to today's mobile, IoT, and real-time data sources. The concept is defined by several characteristics, commonly known as the 5 Vs or even up to 10 Vs, including Volume, Variety, Velocity, Veracity, and Value, along with others such as Vulnerability and Visualization.

Big Data systems rely on distributed storage (like HDFS, Amazon S3) and advanced processing frameworks (e.g., Apache Spark, Flink, Kafka) to analyze and extract insights. Choosing the right data format (such as Parquet, ORC, or JSON) and leveraging analytics techniques including machine learning are essential for unlocking value from Big Data. I hope you enjoyed reading this.


Written by

Ricky Suhanry

Data Engineer based in Jakarta, Indonesia. When I first started out in my career, I was all about becoming a Software Engineer or Backend Engineer. But then, I realized that I was actually more interested in being a Data Practitioner. Currently focused on data engineering and cloud infrastructure. In my free time, I jog and running as a hobby, listening to Jpop music, and trying to learn the Japanese language.