Data Engineering with dbt: Mastering Analytics Engineering Through NYC Taxi Data

What I Learned in Module 4 of the Data Engineering Zoomcamp
Today I completed Module 4 of the Data Engineering Zoomcamp, focusing on Analytics Engineering with dbt (data build tool). This module took me on a comprehensive journey through transforming raw taxi data into sophisticated analytics-ready models that answer complex business questions. The process was both challenging and rewarding, pushing me to apply theoretical knowledge to real-world data problems.
Setting Up the Environment: From Raw Data to Cloud Storage
The foundation of any good data project starts with proper data engineering. I began by preparing three massive NYC Taxi & Limousine Commission (TLC) datasets in Google Cloud Platform:
Green Taxi dataset: 7,778,101 records spanning 2019-2020
Yellow Taxi dataset: 109,047,518 records spanning 2019-2020
For-Hire Vehicle dataset: 43,244,696 records from 2019
Using Python scripts, I:
Generated download URLs for all data files
Downloaded each compressed CSV file to temporary storage
Uploaded them to Google Cloud Storage buckets
Created external tables in BigQuery with appropriate schemas
Validated record counts to ensure data integrity before proceeding
This process taught me about handling large datasets efficiently, working with compressed files, and the challenges of moving data between systems without overwhelming local resources.
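For illustration, the external-table step boils down to BigQuery DDL along these lines (the bucket name is a placeholder; the project and dataset follow the defaults used later in the dbt sources configuration):

CREATE OR REPLACE EXTERNAL TABLE `dtc_zoomcamp_2025.raw_nyc_tripdata.ext_green_tripdata`
OPTIONS (
  format = 'CSV',
  compression = 'GZIP',
  -- placeholder bucket; the wildcard picks up one compressed file per month
  uris = ['gs://my-taxi-bucket/green/green_tripdata_2019-*.csv.gz',
          'gs://my-taxi-bucket/green/green_tripdata_2020-*.csv.gz']
);

-- quick record-count validation against the published totals
SELECT COUNT(*) FROM `dtc_zoomcamp_2025.raw_nyc_tripdata.ext_green_tripdata`;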
dbt Project Structure: Building a Logical Data Transformation Flow
With the raw data available in BigQuery, I designed a dbt project following modern analytics engineering principles:
Staging Layer: Created initial models (stg_green_tripdata, stg_yellow_tripdata, stg_fhv_tripdata) that:
Cleaned and standardized column names
Applied appropriate type casting
Handled data quality issues like duplicate rows
Generated surrogate keys for easier joining
Dimension Layer: Created dimension tables including:
dim_zones: Normalized representation of taxi zones
dim_fhv_trips: Clean representation of for-hire vehicle trips
Fact Layer: Built fact tables like fact_trips and specialized analytical models:
fct_taxi_trips_quarterly_revenue: For revenue trend analysis
fct_taxi_trips_monthly_fare_p95: For fare distribution analysis
fct_fhv_monthly_zone_traveltime_p90: For travel time analysis
This multi-layered approach taught me how to organize transformations logically, creating a balance between reusability, performance, and maintainability.
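To make the layering concrete, here is a simplified sketch of how a fact model like fact_trips can stitch the layers together (column names are illustrative and may differ slightly from my actual models):

-- models/core/fact_trips.sql (simplified)
WITH trips AS (
    SELECT *, 'Green' AS service_type FROM {{ ref('stg_green_tripdata') }}
    UNION ALL
    SELECT *, 'Yellow' AS service_type FROM {{ ref('stg_yellow_tripdata') }}
)
SELECT
    trips.*,
    pickup_zone.zone AS pickup_zone,
    dropoff_zone.zone AS dropoff_zone
FROM trips
INNER JOIN {{ ref('dim_zones') }} AS pickup_zone
    ON trips.pickup_locationid = pickup_zone.locationid
INNER JOIN {{ ref('dim_zones') }} AS dropoff_zone
    ON trips.dropoff_locationid = dropoff_zone.locationid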
dbt Fundamentals: Beyond Basic SQL
The heart of this module was learning how dbt enhances SQL with software engineering best practices:
Model Resolution and Configuration
I learned how dbt resolves references with environment variables and default values:
sources:
  - name: raw_nyc_tripdata
    database: "{{ env_var('DBT_BIGQUERY_PROJECT', 'dtc_zoomcamp_2025') }}"
    schema: "{{ env_var('DBT_BIGQUERY_SOURCE_DATASET', 'raw_nyc_tripdata') }}"
Understanding this resolution process is crucial for creating projects that work seamlessly across development, testing, and production environments.
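In a staging model the source is then referenced by name instead of a hard-coded project or dataset; the external table name below is illustrative:

-- e.g. models/staging/stg_green_tripdata.sql
SELECT *
FROM {{ source('raw_nyc_tripdata', 'ext_green_taxi') }}

With the defaults above and no environment variables set, this resolves to dtc_zoomcamp_2025.raw_nyc_tripdata.ext_green_taxi; setting DBT_BIGQUERY_PROJECT or DBT_BIGQUERY_SOURCE_DATASET redirects it without touching the model code.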
Jinja Templating and Macros
I worked with Jinja macros to create reusable code blocks, including:
{% macro resolve_schema_for(model_type) -%}
    {%- set target_env_var = 'DBT_BIGQUERY_TARGET_DATASET' -%}
    {%- set staging_env_var = 'DBT_BIGQUERY_STAGING_DATASET' -%}
    {%- if model_type == 'core' -%}
        {{- env_var(target_env_var) -}}
    {%- else -%}
        {{- env_var(staging_env_var, env_var(target_env_var)) -}}
    {%- endif -%}
{%- endmacro %}
This macro dynamically determines the target schema based on model type, allowing sensible defaults while maintaining flexibility.
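A model can then pull its schema from the macro in its config block; a minimal sketch of how I'd wire up a core model:

-- at the top of a core model, e.g. fact_trips.sql
{{ config(
    materialized='table',
    schema=resolve_schema_for('core')
) }}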
I also used dbt's cross-database safe_cast macro and the dbt_utils package's generate_surrogate_key to improve code quality and maintainability.
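As an example, a staging model can combine the two; the column names below come from the green taxi schema, and the source table name is illustrative:

SELECT
    {{ dbt_utils.generate_surrogate_key(['vendorid', 'lpep_pickup_datetime']) }} AS tripid,
    {{ dbt.safe_cast('passenger_count', 'integer') }} AS passenger_count,
    {{ dbt.safe_cast('trip_distance', 'numeric') }} AS trip_distance
FROM {{ source('raw_nyc_tripdata', 'ext_green_taxi') }}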
Data Lineage and Model Selection
One of the most powerful features of dbt is its ability to understand relationships between models. I explored the lineage graph showing how models depend on each other:
Raw sources feed into staging models
Staging models feed into dimension tables
Dimension tables combine to create fact tables
Fact tables power specific analytical models
This lineage affects how we select models for execution. I learned various selection patterns:
dbt run: Runs all models
dbt run --select +models/core/: Runs all core models and their dependencies
dbt run --select +models/core/model.sql: Runs a specific model and its dependencies
dbt run --select models/staging/+: Runs all staging models and their descendants
Understanding these selection patterns is crucial for efficient development and deployment workflows.
Advanced SQL Analysis: From Raw Data to Business Insights
The assignment pushed my SQL skills with complex analytical requirements:
Year-over-Year Growth Analysis
I created a quarterly revenue model to analyze the pandemic's impact:
WITH quarterly_revenue AS (
    SELECT
        service_type,
        year,
        quarter,
        year_quarter,
        SUM(total_amount) AS revenue
    FROM {{ ref('fact_trips') }}
    WHERE year IN (2019, 2020)
    GROUP BY service_type, year, quarter, year_quarter
),
revenue_with_prior_year AS (
    SELECT
        current_year.service_type,
        current_year.year,
        current_year.quarter,
        current_year.year_quarter,
        current_year.revenue AS current_revenue,
        prior_year.revenue AS prior_revenue
    FROM quarterly_revenue AS current_year
    LEFT JOIN quarterly_revenue AS prior_year
        ON current_year.service_type = prior_year.service_type
        AND current_year.quarter = prior_year.quarter
        AND prior_year.year = current_year.year - 1
    WHERE current_year.year = 2020
)
SELECT
    service_type,
    year,
    quarter,
    year_quarter,
    current_revenue AS quarterly_revenue,
    prior_revenue AS prev_year_revenue,
    CASE
        WHEN prior_revenue IS NOT NULL AND prior_revenue != 0
            THEN ROUND((current_revenue - prior_revenue) / prior_revenue * 100, 2)
        ELSE NULL
    END AS yoy_growth_percentage
FROM revenue_with_prior_year
ORDER BY service_type, quarter
This analysis revealed which quarters in 2020 performed best or worst compared to 2019, giving insight into the pandemic's impact on different taxi services.
Percentile Calculations for Fare Distribution
Using window functions, I analyzed fare amount distributions:
PERCENTILE_CONT(fare_amount, 0.97) OVER (PARTITION BY service_type, year, month) AS p97,
PERCENTILE_CONT(fare_amount, 0.95) OVER (PARTITION BY service_type, year, month) AS p95,
PERCENTILE_CONT(fare_amount, 0.90) OVER (PARTITION BY service_type, year, month) AS p90
This helped understand what passengers typically pay at the upper end of the spectrum, revealing differences between green and yellow taxis during specific periods.
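Because PERCENTILE_CONT is an analytic (window) function in BigQuery, it repeats the same value on every row of a partition, so the model collapses to one row per service type and month with a DISTINCT. A simplified sketch (the real model applies more careful filtering of invalid fares):

SELECT DISTINCT
    service_type,
    year,
    month,
    PERCENTILE_CONT(fare_amount, 0.97) OVER (PARTITION BY service_type, year, month) AS p97,
    PERCENTILE_CONT(fare_amount, 0.95) OVER (PARTITION BY service_type, year, month) AS p95,
    PERCENTILE_CONT(fare_amount, 0.90) OVER (PARTITION BY service_type, year, month) AS p90
FROM {{ ref('fact_trips') }}
WHERE fare_amount > 0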
Travel Time Analysis with Ranked Results
I built a model calculating the 90th percentile of trip durations between specific pickup and dropoff zones:
PERCENTILE_CONT(trip_duration, 0.90) OVER (
PARTITION BY year, month, pickup_locationid, dropoff_locationid
) AS p90_duration
Then applied dense ranking to find the second-longest destinations:
DENSE_RANK() OVER (
PARTITION BY year, month, pickup_zone
ORDER BY p90_duration DESC
) AS duration_rank
This revealed interesting patterns about travel times from specific neighborhoods, potentially highlighting traffic patterns or service efficiency issues.
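Picking out the second-longest destination for each pickup zone is then just a filter on that rank; ranked_durations below stands in for the CTE that holds the DENSE_RANK() result:

SELECT
    year,
    month,
    pickup_zone,
    dropoff_zone,
    p90_duration
FROM ranked_durations
WHERE duration_rank = 2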
Challenges and Solutions
Throughout the project, I encountered several challenges:
Data Volume Management: Working with over 160 million records required careful consideration of materialization strategies. I used views for staging models and tables for frequently queried fact models (see the configuration sketch after this list).
Schema Evolution: The taxi data schemas changed slightly over time. I used dbt's safe_cast macro to handle potential schema inconsistencies.
Environment Configuration: Managing environment variables for different contexts required careful planning. I learned to use nested variable resolution for maximum flexibility.
Performance Optimization: Some complex analytical queries were slow to execute. I added partitioning, clustering, and pre-aggregations to improve performance.
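A minimal sketch of that materialization split using model-level config blocks (the partition column is illustrative; the same settings can also live in dbt_project.yml):

-- models/staging/stg_green_tripdata.sql
{{ config(materialized='view') }}

-- models/core/fact_trips.sql
{{ config(
    materialized='table',
    partition_by={'field': 'pickup_datetime', 'data_type': 'timestamp'},
    cluster_by=['service_type']
) }}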
Business Insights Discovered
Beyond technical learning, the assignment revealed interesting patterns in NYC taxi data:
Pandemic Impact: The data starkly showed how COVID-19 affected taxi services in NYC:
Q1 2020 showed the beginning impacts
Q2 2020 demonstrated the most dramatic revenue drops as lockdowns took effect
Yellow and Green taxis were affected differently in various quarters
Fare Distribution Patterns: The analysis of p90/p95/p97 fare amounts showed:
In certain months, green taxis had higher upper-percentile fares than yellow taxis
The distribution tightened during pandemic months, suggesting fewer long/expensive trips
Geographic Insights: The travel time analysis revealed:
Certain airport routes consistently had high p90 travel times
Some neighborhood-to-neighborhood routes had surprisingly long p90 durations
The second-longest routes from popular locations often involved cross-borough travel
Skills Development and Next Steps
This module significantly enhanced my data engineering toolbox:
Technical Skills:
dbt project structure and best practices
Advanced SQL window functions and analytical techniques
BigQuery optimization strategies
Complex data transformation patterns
Dynamic configuration management
Business Skills:
Translating analytical requirements into data models
Identifying meaningful metrics for time-series analysis
Understanding how to measure business impact through data
Future Enhancements
Building on this foundation, I'm planning to:
Expand the Analysis:
Include demographic data to understand service patterns by neighborhood characteristics
Add weather data to analyze its impact on taxi demand and pricing
Incorporate COVID-19 case data to directly correlate with service changes
Technical Improvements:
Implement incremental models to improve processing efficiency (see the sketch after this list)
Add comprehensive data quality tests
Create exposures for downstream BI tools
Document key models with detailed descriptions
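For the incremental models, I expect the pattern to look roughly like this (the unique key and timestamp column are illustrative):

{{ config(materialized='incremental', unique_key='tripid') }}

SELECT *
FROM {{ ref('stg_green_tripdata') }}
{% if is_incremental() %}
-- on incremental runs, only process records newer than what is already in the table
WHERE pickup_datetime > (SELECT MAX(pickup_datetime) FROM {{ this }})
{% endif %}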
Visualization Layer:
Develop Looker Studio dashboards to visualize the findings
Create interactive maps showing geographic patterns
Build executive summaries of key metrics
Conclusion
Module 4 of the Data Engineering Zoomcamp provided comprehensive training in modern analytics engineering. By applying dbt to real-world NYC taxi data, I've gained practical experience with the entire analytics engineering workflow—from raw data ingestion to producing business insights.
The combination of dbt's power with advanced SQL techniques creates a robust foundation for any data transformation challenge. Understanding how to structure models logically, apply software engineering principles to data work, and extract meaningful insights from large datasets are skills that will serve me well in future data engineering projects.
#zoomcamp #DataEngineering