Big Data
With the rise of social media, online shopping, streaming services, and more, vast amounts of information are generated with every click and interaction we make online. This data can provide valuable insights that help companies improve their products, define the business direction they need to take, and improve the online user experience. But the data is so huge and complex that it is impossible to manage and analyze with traditional systems. Such data is referred to as “Big Data”.
A quick Google search will reveal that approximately 400+ million terabytes of data were produced daily in 2024, a figure that keeps growing due to the rise of social media, the Internet of Things (IoT), and more. Processing such huge volumes of data requires powerful, capable systems that are not only costly but also not easily available.
Big Data is generally defined using the 5 V’s:
Volume: This refers to the sheer size of the data being produced through various mediums like social media, sensors, transactions, etc.
Variety: This refers to data coming in various types: structured data (databases), semi-structured data (JSON, XML), and unstructured data (social media images, video). Dealing with such different types of data poses a unique challenge.
Velocity: Data is generated at high speed, and in some scenarios it needs to be processed in real time. For example, money transactions through cards or net banking need to be processed and analyzed in real time to detect fraudulent transactions.
Veracity: Data is produced at such a high rate that it tends to be noisy and inconsistent. It needs a lot of processing before it is ready for analysis and can yield any insight.
Value: The ultimate motive of Big Data is to produce insights into user behavior, define the business routes needed to maximize profit, and increase user retention and improve user experience.
To address the challenges posed by Big Data, various technologies and tools have been developed. These technologies deal with the storage, processing, analysis, and management of such huge data. The following are some of the key technologies that have proved foundational in the Big Data field:
Hadoop Framework
Hadoop is one such foundational technology. It laid the foundation for distributed file systems and parallel computation. It consists of three major components:
HDFS (Hadoop Distributed File System): This provides a distributed file system that uses commodity hardware to store data. It ensures data availability and prevents data loss through replication.
MapReduce: This is a programming model used to process data in parallel, making it practical to process data at sufficient speed on commodity hardware. Initially, MapReduce programs could be written only in Java, but support later expanded to Python, C++, and Ruby as well (see the word-count sketch after this list).
YARN (Yet Another Resource Negotiator): This manages jobs, allocates resources, and schedules jobs in the Hadoop cluster.
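To make the MapReduce model concrete, here is a minimal word-count sketch in Python for Hadoop Streaming, which pipes data through stdin/stdout; the file names mapper.py and reducer.py are just placeholders for this example.

```python
#!/usr/bin/env python3
# mapper.py -- emits "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums the counts per word; Hadoop Streaming delivers
# the mapper output sorted by key, so equal words arrive together.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.strip().split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)

if current_word is not None:
    print(f"{current_word}\t{count}")
```

These two small scripts would typically be submitted to the cluster with the Hadoop Streaming JAR, which runs them over data stored in HDFS.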
Apache Spark
Even though the Hadoop framework provides MapReduce to process data, it is considered slow. Apache Spark provides faster processing thanks to its in-memory computation. It is useful for advanced analytics, offers ease of use, and also provides tools to process data in real time, which is useful for streaming data. Spark, even though written in Scala, supports multiple languages like Python, Java, and R, and was recently extended to C# as well. It is ideal for iterative tasks, machine learning, and real-time data processing.
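To show that ease of use, here is the same word count as a minimal PySpark sketch; the HDFS input path is a placeholder.

```python
# A minimal PySpark word count; assumes pyspark is installed and
# the (placeholder) input path points at a text file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

counts = (
    spark.sparkContext.textFile("hdfs:///data/input.txt")  # placeholder path
    .flatMap(lambda line: line.split())   # split each line into words
    .map(lambda word: (word, 1))          # pair every word with a count of 1
    .reduceByKey(lambda a, b: a + b)      # sum the counts per word
)

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```

The whole mapper/reducer pair from the previous section collapses into a few chained transformations, and intermediate results stay in memory instead of being written to disk between stages.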
Apache Hive
Even though MapReduce does the job of processing the data, a lot of code needs to be written even for simple queries, and on top of that, familiarity with a language such as Java is required. To deal with this challenge, Apache Hive was developed, which uses simple SQL-like queries (HiveQL) to get results and transform structured data.
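As a sketch of what that looks like, here is a HiveQL query submitted from Python through Spark’s Hive support; the transactions table is hypothetical, and a configured Hive metastore is assumed.

```python
# Running a SQL-like (HiveQL) query from Python via Spark's Hive
# integration; the "transactions" table is a hypothetical example.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("HiveExample")
    .enableHiveSupport()
    .getOrCreate()
)

# A few lines of SQL replace what would otherwise be a full
# MapReduce program: total spend per user, biggest spenders first.
top_spenders = spark.sql("""
    SELECT user_id, SUM(amount) AS total_spent
    FROM transactions
    GROUP BY user_id
    ORDER BY total_spent DESC
    LIMIT 10
""")
top_spenders.show()
```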
Apache Pig
Again, to combat the problems of MapReduce, Apache Pig was created. It uses a simple scripting language called Pig Latin, which can express a MapReduce program in roughly ten times fewer lines of code than MapReduce itself requires.
Data Lakes
Data lakes are a storage architecture that allows organizations to store huge amounts of structured, semi-structured, and unstructured data. They provide various tiers for data storage, like a hot tier, cool tier, and archive tier, which offer retrieval times ranging from a few milliseconds to a few hours and are priced accordingly. Technologies like Amazon S3, Azure Data Lake, and Google Cloud Storage are the major data lake players in the market.
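As a small illustration of tiered storage, here is a sketch using Amazon S3 through boto3; the bucket name and keys are placeholders, and S3’s STANDARD, STANDARD_IA, and GLACIER storage classes play roughly the role of the hot, cool, and archive tiers described above.

```python
# Writing objects to different S3 storage tiers with boto3.
# Bucket name and keys are placeholders; AWS credentials are
# assumed to be configured in the environment.
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-lake-example"  # hypothetical bucket

# Hot tier: frequently accessed data, millisecond retrieval.
s3.put_object(Bucket=BUCKET, Key="hot/events.json",
              Body=b'{"click": 1}', StorageClass="STANDARD")

# Cool tier: infrequently accessed data, lower storage cost.
s3.put_object(Bucket=BUCKET, Key="cool/logs-2023.json",
              Body=b"...", StorageClass="STANDARD_IA")

# Archive tier: rarely accessed data, retrieval can take hours.
s3.put_object(Bucket=BUCKET, Key="archive/logs-2019.json",
              Body=b"...", StorageClass="GLACIER")
```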
These technologies, tools, and frameworks provide different capabilities to effectively address the various challenges posed by Big Data. Companies use them to manage, process, and analyze data and get real value out of it.
In conclusion, Big Data is no longer a buzzword; it is a critical asset that companies use to drive their business, gain a competitive advantage, and improve their user experience. As data keeps growing exponentially, so will the technologies to harness value from it. But, as Uncle Ben says, “With great power comes great responsibility”: data privacy and security must remain the top priority.