Diving into Big Data with Apache Spark

Dinesh

My Journey Through Module 5 of Data Engineering Zoomcamp

This week marked an exciting milestone in my data engineering journey as I completed Module 5 of the Data Engineering Zoomcamp, which focused on batch processing with Apache Spark.

🛠️ Setting Up the Environment

The module began with setting up Apache Spark and PySpark on my Linux machine:

  • Created a Python virtual environment named 'zoomcamp'

  • Installed PySpark 3.4.1 and necessary dependencies

  • Set up Jupyter for interactive exploration

  • Configured Spark session with local processing mode

Getting that first Spark session running and seeing the version number appear was a small but satisfying victory.
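
For reference, getting that session up was roughly the following sketch (the app name is just what I happened to use, and local[*] simply means "use all the cores on this machine"):

from pyspark.sql import SparkSession

# Local mode: run Spark inside this process, using all available cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("zoomcamp") \
    .getOrCreate()

print(spark.version)  # prints 3.4.1 with the setup above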

📊 Working with NYC Yellow Taxi Data

The homework assignment centered around analyzing the NYC Yellow Taxi dataset for October 2024:

Data Processing Steps:

  1. Read the parquet file into a Spark DataFrame

  2. Examined the schema and data structure

  3. Repartitioned the data into 4 partitions

  4. Measured the resulting partition sizes (about 23MB each)

Key Learning: Spark's lazy evaluation means transformations aren't executed until an action is called, which lets Spark optimize the whole pipeline before actually processing any data.
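
Putting those steps together, the code was roughly this sketch (the input and output paths are placeholders for wherever the October 2024 file lives):

# Reading and repartitioning are set up lazily; the write is the action that runs them
df = spark.read.parquet("yellow_tripdata_2024-10.parquet")
df.printSchema()              # inspect the schema (metadata only, no job runs)

df = df.repartition(4)        # transformation: recorded, not yet executed
df.write.parquet("output/")   # action: triggers the shuffle and writes 4 files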

🔍 Finding Insights in the Data

The analysis questions pushed me to apply various Spark functions and techniques:

Question 3: Trip Count Analysis

  • Used timestamp functions to filter for a specific date (see the snippet after this list)

  • Found nearly 129,000 taxi trips on October 15th alone

  • Learned to use year(), month(), and dayofmonth() functions
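
In code, the filter was roughly along these lines (tpep_pickup_datetime is the pickup timestamp column in the yellow taxi schema):

from pyspark.sql import functions as F

# Count trips whose pickup falls on 15 October 2024
trips_on_oct_15 = df.filter(
    (F.year("tpep_pickup_datetime") == 2024) &
    (F.month("tpep_pickup_datetime") == 10) &
    (F.dayofmonth("tpep_pickup_datetime") == 15)
).count()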

Question 4: Longest Trip Duration

  • Calculated trip durations using unix_timestamp conversions (sketched after this list)

  • Discovered a trip lasting 162.62 hours (almost a week!)

  • This outlier highlighted the importance of data quality checks
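
The duration calculation was roughly this sketch (again assuming the standard pickup and dropoff timestamp columns):

from pyspark.sql import functions as F

# unix_timestamp converts each timestamp to seconds since the epoch,
# so the difference divided by 3600 gives the trip length in hours
longest_trip = df.withColumn(
    "duration_hours",
    (F.unix_timestamp("tpep_dropoff_datetime") - F.unix_timestamp("tpep_pickup_datetime")) / 3600
).agg(F.max("duration_hours"))

longest_trip.show()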

💻 SQL in Spark

For the final question, I leveraged Spark's SQL capabilities. The trips and zones DataFrames have to be registered as temporary views first (the variable names below are placeholders for however the two DataFrames were loaded):

# Expose the DataFrames to Spark SQL under the names used in the query
df.createOrReplaceTempView("trips")
zones.createOrReplaceTempView("zones")

query = """
SELECT z.Zone, COUNT(*) AS trip_count
FROM trips t
JOIN zones z ON t.PULocationID = z.LocationID
GROUP BY z.Zone
ORDER BY trip_count ASC
LIMIT 1
"""

least_frequent_zone = spark.sql(query)
least_frequent_zone.show()

Interesting Finding: Governor's Island/Ellis Island/Liberty Island had only a single pickup in the entire month!

🌟 Key Takeaways

What impressed me most about Spark:

  • Versatility: Seamlessly switching between DataFrame operations and SQL queries

  • Performance: Ability to process large datasets efficiently

  • Scalability: Understanding how partitioning affects parallelism and output file sizes

  • Integration: Works well with existing data formats like Parquet

🔮 Looking Ahead

As I move forward in my data engineering journey:

  • Excited to build on these batch processing concepts

  • Planning to explore streaming data in the next module

  • Looking forward to applying Spark to larger and more complex datasets

  • Interested in exploring Spark ML capabilities in the future

#DEZOOMCAMP #DataEngineering #ApacheSpark #BigData
