Data Engineering with dbt: Mastering Analytics Engineering Through NYC Taxi Data

Dinesh
7 min read

What I Learned in Module 4 of the Data Engineering Zoomcamp

Today I completed Module 4 of the Data Engineering Zoomcamp, focusing on Analytics Engineering with dbt (data build tool). This module took me on a comprehensive journey through transforming raw taxi data into sophisticated analytics-ready models that answer complex business questions. The process was both challenging and rewarding, pushing me to apply theoretical knowledge to real-world data problems.

Setting Up the Environment: From Raw Data to Cloud Storage

The foundation of any good data project starts with proper data engineering. I began by preparing three massive NYC Taxi & Limousine Commission (TLC) datasets in Google Cloud Platform:

  1. Green Taxi dataset: 7,778,101 records spanning 2019-2020

  2. Yellow Taxi dataset: 109,047,518 records spanning 2019-2020

  3. For-Hire Vehicle dataset: 43,244,696 records from 2019

Using Python scripts, I:

  • Generated download URLs for all data files

  • Downloaded each compressed CSV file to temporary storage

  • Uploaded them to Google Cloud Storage buckets

  • Created external tables in BigQuery with appropriate schemas (an example DDL appears below)

  • Validated record counts to ensure data integrity before proceeding

This process taught me about handling large datasets efficiently, working with compressed files, and the challenges of moving data between systems without overwhelming local resources.
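
As a concrete illustration of the external-table step, the BigQuery DDL might look roughly like this; the project, dataset, bucket, and table names here are placeholders rather than the exact ones from my project:

CREATE OR REPLACE EXTERNAL TABLE `my-gcp-project.raw_nyc_tripdata.ext_green_tripdata`
OPTIONS (
  format = 'CSV',
  compression = 'GZIP',
  uris = ['gs://my-taxi-data-bucket/green/green_tripdata_2019-*.csv.gz',
          'gs://my-taxi-data-bucket/green/green_tripdata_2020-*.csv.gz']
);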

dbt Project Structure: Building a Logical Data Transformation Flow

With the raw data available in BigQuery, I designed a dbt project following modern analytics engineering principles:

  1. Staging Layer: Created initial models (stg_green_tripdata, stg_yellow_tripdata, stg_fhv_tripdata) that:

    • Cleaned and standardized column names

    • Applied appropriate type casting

    • Handled data quality issues like duplicate rows

    • Generated surrogate keys for easier joining (a minimal staging model sketch appears below)

  2. Dimension Layer: Created dimension tables including:

    • dim_zones: Normalized representation of taxi zones

    • dim_fhv_trips: Clean representation of for-hire vehicle trips

  3. Fact Layer: Built fact tables like fact_trips and specialized analytical models:

    • fct_taxi_trips_quarterly_revenue: For revenue trend analysis

    • fct_taxi_trips_monthly_fare_p95: For fare distribution analysis

    • fct_fhv_monthly_zone_traveltime_p90: For travel time analysis

This multi-layered approach taught me how to organize transformations logically, creating a balance between reusability, performance, and maintainability.
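
To make the staging layer concrete, here is a minimal sketch of what a model like stg_green_tripdata could look like. The source table name and columns are illustrative (based on the TLC schema), not a verbatim copy of my project code:

-- models/staging/stg_green_tripdata.sql
{{ config(materialized='view') }}

with source_data as (
    select
        *,
        row_number() over (partition by vendorid, lpep_pickup_datetime) as rn
    from {{ source('raw_nyc_tripdata', 'ext_green_tripdata') }}
)

select
    -- surrogate key for easier joining downstream
    {{ dbt_utils.generate_surrogate_key(['vendorid', 'lpep_pickup_datetime']) }} as tripid,
    {{ dbt.safe_cast('vendorid', api.Column.translate_type('integer')) }} as vendorid,
    cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
    cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,
    {{ dbt.safe_cast('pulocationid', api.Column.translate_type('integer')) }} as pickup_locationid,
    {{ dbt.safe_cast('dolocationid', api.Column.translate_type('integer')) }} as dropoff_locationid,
    cast(fare_amount as numeric) as fare_amount,
    cast(total_amount as numeric) as total_amount
from source_data
where rn = 1  -- keep only one row per duplicate key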

dbt Fundamentals: Beyond Basic SQL

The heart of this module was learning how dbt enhances SQL with software engineering best practices:

Model Resolution and Configuration

I learned how dbt resolves references with environment variables and default values:

sources:
  - name: raw_nyc_tripdata
    database: "{{ env_var('DBT_BIGQUERY_PROJECT', 'dtc_zoomcamp_2025') }}"
    schema: "{{ env_var('DBT_BIGQUERY_SOURCE_DATASET', 'raw_nyc_tripdata') }}"

Understanding this resolution process is crucial for creating projects that work seamlessly across development, testing, and production environments.
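
For example, a staging model selecting from this source resolves against whatever those variables hold at runtime, and falls back to the defaults when they are unset (the table name below is illustrative):

select * from {{ source('raw_nyc_tripdata', 'ext_green_tripdata') }}
-- with neither env var set, this compiles to:
-- select * from `dtc_zoomcamp_2025`.`raw_nyc_tripdata`.`ext_green_tripdata`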

Jinja Templating and Macros

I worked with Jinja macros to create reusable code blocks, including:

{% macro resolve_schema_for(model_type) -%}
    {%- set target_env_var = 'DBT_BIGQUERY_TARGET_DATASET'  -%}
    {%- set stging_env_var = 'DBT_BIGQUERY_STAGING_DATASET' -%}

    {%- if model_type == 'core' -%} {{- env_var(target_env_var) -}}
    {%- else -%}                    {{- env_var(stging_env_var, env_var(target_env_var)) -}}
    {%- endif -%}
{%- endmacro %}

This macro dynamically determines the target schema based on model type, allowing sensible defaults while maintaining flexibility.
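
For context, a model would typically invoke this macro from its config block, so the schema is decided at compile time, roughly like so:

{{ config(
    materialized='table',
    schema=resolve_schema_for('core')
) }}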

I also used cross-database macros such as safe_cast (built into dbt) and dbt_utils.generate_surrogate_key to improve code quality and maintainability.

Data Lineage and Model Selection

One of the most powerful features of dbt is its ability to understand relationships between models. I explored the lineage graph showing how models depend on each other:

  • Raw sources feed into staging models

  • Staging models feed into dimension tables

  • Dimension tables combine to create fact tables

  • Fact tables power specific analytical models

This lineage affects how we select models for execution. I learned various selection patterns:

  • dbt run: Runs all models

  • dbt run --select +models/core/: Runs all core models and their dependencies

  • dbt run --select +models/core/model.sql: Runs a specific model and its dependencies

  • dbt run --select models/staging/+: Runs all staging models and their descendants

Understanding these selection patterns is crucial for efficient development and deployment workflows.

Advanced SQL Analysis: From Raw Data to Business Insights

The assignment pushed my SQL skills with complex analytical requirements:

Year-over-Year Growth Analysis

I created a quarterly revenue model to analyze the pandemic's impact:

WITH quarterly_revenue AS (
    SELECT
        service_type,
        year,
        quarter,
        year_quarter,
        SUM(total_amount) AS revenue
    FROM {{ ref('fact_trips') }}
    WHERE year IN (2019, 2020)
    GROUP BY service_type, year, quarter, year_quarter
),
revenue_with_prior_year AS (
    SELECT
        current_year.service_type,
        current_year.year,
        current_year.quarter,
        current_year.year_quarter,
        current_year.revenue AS current_revenue,
        prior_year.revenue AS prior_revenue
    FROM quarterly_revenue AS current_year
    LEFT JOIN quarterly_revenue AS prior_year
        ON current_year.service_type = prior_year.service_type
        AND current_year.quarter = prior_year.quarter
        AND prior_year.year = current_year.year - 1
    WHERE current_year.year = 2020
)
SELECT
    service_type,
    year,
    quarter,
    year_quarter,
    current_revenue AS quarterly_revenue,
    prior_revenue AS prev_year_revenue,
    CASE
        WHEN prior_revenue IS NOT NULL AND prior_revenue != 0
        THEN ROUND((current_revenue - prior_revenue) / prior_revenue * 100, 2)
        ELSE NULL
    END AS yoy_growth_percentage
FROM revenue_with_prior_year
ORDER BY service_type, quarter

This analysis revealed which quarters in 2020 performed best or worst compared to 2019, giving insight into the pandemic's impact on different taxi services.

Percentile Calculations for Fare Distribution

Using window functions, I analyzed fare amount distributions:

PERCENTILE_CONT(fare_amount, 0.97) OVER (PARTITION BY service_type, year, month) AS p97,
PERCENTILE_CONT(fare_amount, 0.95) OVER (PARTITION BY service_type, year, month) AS p95,
PERCENTILE_CONT(fare_amount, 0.90) OVER (PARTITION BY service_type, year, month) AS p90

This helped me understand what passengers typically pay at the upper end of the distribution, revealing differences between green and yellow taxis during specific periods.
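
Because these are window functions, every trip row in a partition carries the same percentile value, so the surrounding model has to collapse them to one row per group. A minimal sketch, assuming year and month columns exist on fact_trips and using an illustrative filter, could look like:

with valid_trips as (
    select service_type, year, month, fare_amount
    from {{ ref('fact_trips') }}
    where fare_amount > 0
)

select distinct
    service_type,
    year,
    month,
    percentile_cont(fare_amount, 0.97) over (partition by service_type, year, month) as p97,
    percentile_cont(fare_amount, 0.95) over (partition by service_type, year, month) as p95,
    percentile_cont(fare_amount, 0.90) over (partition by service_type, year, month) as p90
from valid_trips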

Travel Time Analysis with Ranked Results

I built a model calculating the 90th percentile of trip durations between specific pickup and dropoff zones:

PERCENTILE_CONT(trip_duration, 0.90) OVER (
    PARTITION BY year, month, pickup_locationid, dropoff_locationid
) AS p90_duration

Then applied dense ranking to find the second-longest destinations:

DENSE_RANK() OVER (
    PARTITION BY year, month, pickup_zone
    ORDER BY p90_duration DESC
) AS duration_rank

This revealed interesting patterns about travel times from specific neighborhoods, potentially highlighting traffic patterns or service efficiency issues.
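
Putting the two window functions together, here is a rough sketch of how the second-longest destination per pickup zone could be pulled out; the column names and the dim_fhv_trips reference are assumptions on my part rather than the exact model:

with route_p90 as (
    select distinct
        year,
        month,
        pickup_zone,
        dropoff_zone,
        -- assumes a 1:1 mapping between location IDs and zone names
        percentile_cont(trip_duration, 0.90) over (
            partition by year, month, pickup_locationid, dropoff_locationid
        ) as p90_duration
    from {{ ref('dim_fhv_trips') }}
),

ranked as (
    select
        *,
        dense_rank() over (
            partition by year, month, pickup_zone
            order by p90_duration desc
        ) as duration_rank
    from route_p90
)

select year, month, pickup_zone, dropoff_zone, p90_duration
from ranked
where duration_rank = 2  -- the second-longest destination for each pickup zone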

Challenges and Solutions

Throughout the project, I encountered several challenges:

  1. Data Volume Management: Working with over 160 million records required careful consideration of materialization strategies. I used views for staging models and tables for frequently queried fact models.

  2. Schema Evolution: The taxi data schemas changed slightly over time. I used dbt's safe_cast function to handle potential schema inconsistencies.

  3. Environment Configuration: Managing environment variables for different contexts required careful planning. I learned to use nested variable resolution for maximum flexibility.

  4. Performance Optimization: Some complex analytical queries were slow to execute. Since BigQuery doesn't rely on traditional indexes, I leaned on pre-aggregated models plus partitioning and clustering for the largest tables to improve performance (sketched below).
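
Tying together points 1 and 4, the config block of a heavily queried fact model might look roughly like this on BigQuery (the partition and cluster fields are assumptions for illustration):

{{ config(
    materialized='table',
    partition_by={
        'field': 'pickup_datetime',
        'data_type': 'timestamp',
        'granularity': 'day'
    },
    cluster_by=['service_type']
) }}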

Business Insights Discovered

Beyond technical learning, the assignment revealed interesting patterns in NYC taxi data:

  1. Pandemic Impact: The data starkly showed how COVID-19 affected taxi services in NYC:

    • Q1 2020 showed the beginning impacts

    • Q2 2020 demonstrated the most dramatic revenue drops as lockdowns took effect

    • Yellow and Green taxis were affected differently in various quarters

  2. Fare Distribution Patterns: The analysis of p90/p95/p97 fare amounts showed:

    • Green taxis consistently had higher upper percentile fares than yellow taxis in certain months

    • The distribution tightened during pandemic months, suggesting fewer long/expensive trips

  3. Geographic Insights: The travel time analysis revealed:

    • Certain airport routes consistently had high p90 travel times

    • Some neighborhood-to-neighborhood routes had surprisingly long p90 durations

    • The second-longest routes from popular locations often involved cross-borough travel

Skills Development and Next Steps

This module significantly enhanced my data engineering toolbox:

  1. Technical Skills:

    • dbt project structure and best practices

    • Advanced SQL window functions and analytical techniques

    • BigQuery optimization strategies

    • Complex data transformation patterns

    • Dynamic configuration management

  2. Business Skills:

    • Translating analytical requirements into data models

    • Identifying meaningful metrics for time-series analysis

    • Understanding how to measure business impact through data

Future Enhancements

Building on this foundation, I'm planning to:

  1. Expand the Analysis:

    • Include demographic data to understand service patterns by neighborhood characteristics

    • Add weather data to analyze its impact on taxi demand and pricing

    • Incorporate COVID-19 case data to directly correlate with service changes

  2. Technical Improvements:

    • Implement incremental models to improve processing efficiency (a rough sketch appears after this list)

    • Add comprehensive data quality tests

    • Create exposures for downstream BI tools

    • Document key models with detailed descriptions

  3. Visualization Layer:

    • Develop Looker Studio dashboards to visualize the findings

    • Create interactive maps showing geographic patterns

    • Build executive summaries of key metrics
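
On the incremental-model point above, a minimal sketch of what such a model could look like (the unique key and timestamp column are assumptions):

{{ config(
    materialized='incremental',
    unique_key='tripid'
) }}

select *
from {{ ref('stg_green_tripdata') }}

{% if is_incremental() %}
  -- on incremental runs, only process trips newer than what is already loaded
  where pickup_datetime > (select max(pickup_datetime) from {{ this }})
{% endif %}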

Conclusion

Module 4 of the Data Engineering Zoomcamp provided comprehensive training in modern analytics engineering. By applying dbt to real-world NYC taxi data, I've gained practical experience with the entire analytics engineering workflow—from raw data ingestion to producing business insights.

The combination of dbt's power with advanced SQL techniques creates a robust foundation for any data transformation challenge. Understanding how to structure models logically, apply software engineering principles to data work, and extract meaningful insights from large datasets are skills that will serve me well in future data engineering projects.

#zoomcamp #DataEngineering
