Diving into Big Data with Apache Spark

Dinesh

My Journey Through Module 5 of Data Engineering Zoomcamp

This week marked an exciting milestone in my data engineering journey as I completed Module 5 of the Data Engineering Zoomcamp, which focused on batch processing with Apache Spark.

🛠️ Setting Up the Environment

The module began with setting up Apache Spark and PySpark on my Linux machine:

  • Created a Python virtual environment named 'zoomcamp'

  • Installed PySpark 3.4.1 and necessary dependencies

  • Set up Jupyter for interactive exploration

  • Configured Spark session with local processing mode

Getting that first Spark session running and seeing the version number appear was a small but satisfying victory.
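
For reference, getting that session up was roughly the following sketch (the app name is just what I happened to use, and local[*] simply means "use all the cores on this machine"):

from pyspark.sql import SparkSession

# Local mode: run Spark inside this process, using all available cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("zoomcamp") \
    .getOrCreate()

print(spark.version)  # prints 3.4.1 with the setup above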

📊 Working with NYC Yellow Taxi Data

The homework assignment centered around analyzing the NYC Yellow Taxi dataset for October 2024:

Data Processing Steps:

  1. Read the parquet file into a Spark DataFrame

  2. Examined the schema and data structure

  3. Repartitioned the data into 4 partitions

  4. Measured the resulting partition sizes (about 23MB each)

Key Learning: Spark's lazy evaluation means transformations aren't executed until an action is called, which lets Spark optimize the whole pipeline before actually processing any data.
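
Putting those steps together, the code was roughly this sketch (the input and output paths are placeholders for wherever the October 2024 file lives):

# Reading and repartitioning are set up lazily; the write is the action that runs them
df = spark.read.parquet("yellow_tripdata_2024-10.parquet")
df.printSchema()              # inspect the schema (metadata only, no job runs)

df = df.repartition(4)        # transformation: recorded, not yet executed
df.write.parquet("output/")   # action: triggers the shuffle and writes 4 files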

🔍 Finding Insights in the Data

The analysis questions pushed me to apply various Spark functions and techniques:

Question 3: Trip Count Analysis

  • Used timestamp functions to filter for a specific date (see the snippet after this list)

  • Found nearly 129,000 taxi trips on October 15th alone

  • Learned to use year(), month(), and dayofmonth() functions
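
In code, the filter was roughly along these lines (tpep_pickup_datetime is the pickup timestamp column in the yellow taxi schema):

from pyspark.sql import functions as F

# Count trips whose pickup falls on 15 October 2024
trips_on_oct_15 = df.filter(
    (F.year("tpep_pickup_datetime") == 2024) &
    (F.month("tpep_pickup_datetime") == 10) &
    (F.dayofmonth("tpep_pickup_datetime") == 15)
).count()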

Question 4: Longest Trip Duration

  • Calculated trip durations using unix_timestamp conversions (sketched after this list)

  • Discovered a trip lasting 162.62 hours (almost a week!)

  • This outlier highlighted the importance of data quality checks
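
The duration calculation was roughly this sketch (again assuming the standard pickup and dropoff timestamp columns):

from pyspark.sql import functions as F

# unix_timestamp converts each timestamp to seconds since the epoch,
# so the difference divided by 3600 gives the trip length in hours
longest_trip = df.withColumn(
    "duration_hours",
    (F.unix_timestamp("tpep_dropoff_datetime") - F.unix_timestamp("tpep_pickup_datetime")) / 3600
).agg(F.max("duration_hours"))

longest_trip.show()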

💻 SQL in Spark

For the final question, I leveraged Spark's SQL capabilities. The trips and zones DataFrames have to be registered as temporary views first (the variable names below are placeholders for however the two DataFrames were loaded):

# Expose the DataFrames to Spark SQL under the names used in the query
df.createOrReplaceTempView("trips")
zones.createOrReplaceTempView("zones")

query = """
SELECT z.Zone, COUNT(*) AS trip_count
FROM trips t
JOIN zones z ON t.PULocationID = z.LocationID
GROUP BY z.Zone
ORDER BY trip_count ASC
LIMIT 1
"""

least_frequent_zone = spark.sql(query)
least_frequent_zone.show()

Interesting Finding: Governor's Island/Ellis Island/Liberty Island had only a single pickup in the entire month!

🌟 Key Takeaways

What impressed me most about Spark:

  • Versatility: Seamlessly switching between DataFrame operations and SQL queries

  • Performance: Ability to process large datasets efficiently

  • Scalability: Understanding how partitioning affects parallelism and output file sizes

  • Integration: Works well with existing data formats like Parquet

🔮 Looking Ahead

As I move forward in my data engineering journey:

  • Excited to build on these batch processing concepts

  • Planning to explore streaming data in the next module

  • Looking forward to applying Spark to larger and more complex datasets

  • Interested in exploring Spark ML capabilities in the future

#DEZOOMCAMP #DataEngineering #ApacheSpark #BigData
