Diving into Big Data with Apache Spark

My Journey Through Module 5 of the Data Engineering Zoomcamp
This week marked an exciting milestone in my data engineering journey: completing Module 5 of the Data Engineering Zoomcamp, which focuses on batch processing with Apache Spark.
🛠️ Setting Up the Environment
The module began with setting up Apache Spark and PySpark on my Linux machine:
Created a Python virtual environment named 'zoomcamp'
Installed PySpark 3.4.1 and necessary dependencies
Set up Jupyter for interactive exploration
Configured a Spark session in local mode (a minimal sketch follows this list)
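For reference, a minimal version of that session setup looks like this (the app name is just what I used; `local[*]` tells Spark to use all available cores):

```python
import pyspark
from pyspark.sql import SparkSession

# Local-mode Spark session using all available cores
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("zoomcamp")
    .getOrCreate()
)

print(pyspark.__version__)  # e.g., 3.4.1
```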
Getting that first Spark session running and seeing the version number appear was a small but satisfying victory.
📊 Working with NYC Yellow Taxi Data
The homework assignment centered on analyzing the NYC Yellow Taxi dataset for October 2024:
Data Processing Steps:
Read the parquet file into a Spark DataFrame
Examined the schema and data structure
Repartitioned the data into 4 partitions
Measured the resulting partition sizes (about 23 MB each; see the sketch after this list)
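In code, those steps boil down to something like this (the file paths are illustrative placeholders for wherever you saved the data):

```python
# Read the October 2024 Yellow Taxi data (path is illustrative)
df = spark.read.parquet("yellow_tripdata_2024-10.parquet")
df.printSchema()

# Repartition into 4 partitions and write back out as Parquet
df.repartition(4).write.parquet("data/pq/yellow/2024/10/", mode="overwrite")
```

A quick `ls -lh` on the output directory is enough to confirm the part-file sizes.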
Key Learning: Spark's lazy evaluation means transformations aren't executed until an action is called, which lets Spark optimize the whole plan before actually processing any data.
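A quick illustration of the difference (column name per the Yellow Taxi schema):

```python
# Transformation: lazy, only builds up the query plan
long_trips = df.filter(df.trip_distance > 10)

# Action: triggers actual execution of the optimized plan
print(long_trips.count())
```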
🔍 Finding Insights in the Data
The analysis questions pushed me to apply various Spark functions and techniques:
Question 3: Trip Count Analysis
Used timestamp functions to filter for a specific date
Found nearly 129,000 taxi trips on October 15th alone
Learned to use the year(), month(), and dayofmonth() functions (see the snippet below)
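The filter looked roughly like this:

```python
from pyspark.sql import functions as F

# Trips picked up on October 15, 2024
trips_oct_15 = df.filter(
    (F.year("tpep_pickup_datetime") == 2024)
    & (F.month("tpep_pickup_datetime") == 10)
    & (F.dayofmonth("tpep_pickup_datetime") == 15)
)
print(trips_oct_15.count())  # nearly 129,000
```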
Question 4: Longest Trip Duration
Calculated trip durations using unix_timestamp conversions (sketched after this list)
Discovered a trip lasting 162.62 hours (almost a week!)
This outlier highlighted the importance of data quality checks
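A sketch of the duration calculation:

```python
from pyspark.sql import functions as F

# Trip duration in hours from unix_timestamp deltas
(df
 .withColumn(
     "duration_hours",
     (F.unix_timestamp("tpep_dropoff_datetime")
      - F.unix_timestamp("tpep_pickup_datetime")) / 3600
 )
 .agg(F.max("duration_hours").alias("longest_trip_hours"))  # ~162.62 hours
 .show())
```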
💻 SQL in Spark
For the final question, I leveraged Spark's SQL capabilities:
query = """
SELECT z.Zone, COUNT(*) as trip_count
FROM trips t
JOIN zones z ON t.PULocationID = z.LocationID
GROUP BY z.Zone
ORDER BY trip_count ASC
LIMIT 1
"""
least_frequent_zone = spark.sql(query)
Interesting Finding: Governor's Island/Ellis Island/Liberty Island had only a single pickup in the entire month!
📚 Key Takeaways
What impressed me most about Spark:
Versatility: Seamlessly switching between DataFrame operations and SQL queries
Performance: Ability to process large datasets efficiently
Scalability: Understanding how partitioning affects parallelism and processing throughput
Integration: Works well with existing data formats like Parquet
🔮 Looking Ahead
As I move forward in my data engineering journey:
Excited to build on these batch processing concepts
Planning to explore streaming data in the next module
Looking forward to applying Spark to larger and more complex datasets
Interested in exploring Spark ML capabilities in the future
#DEZOOMCAMP #DataEngineering #ApacheSpark #BigData