Building a Modern Data Pipeline for NYC Taxi Data with dbt

Abhay Ahirkar

Introduction

In data engineering, transforming raw data into actionable insights requires robust pipelines and well-structured workflows. Today, I'll share a project I've been working on that leverages dbt (data build tool) to transform NYC taxi ride data into a reliable analytics foundation.

The Project: NYC Taxi Rides Analytics

This project creates a complete analytics engineering workflow for NYC taxi data using dbt Cloud and Google BigQuery. The pipeline transforms raw taxi trip data into well-modeled, tested, and documented datasets ready for analysis.

Key Components

1. Data Source & Staging

The project starts with raw NYC taxi data (yellow, green, and FHV) stored in BigQuery. Using dbt's staging models, we:

  • Transform raw data into consistent, cleaned views

  • Apply proper data typing and field standardization

  • Create surrogate keys for reliable record identification
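A staging model along these lines might look like the sketch below. The model and column names here are illustrative assumptions, not necessarily the project's actual ones; the surrogate key uses `generate_surrogate_key` from the dbt_utils package.

```sql
-- models/staging/stg_green_tripdata.sql (illustrative sketch)
with source as (

    select * from {{ source('raw', 'green_tripdata') }}

)

select
    -- surrogate key built from the trip's natural key columns
    {{ dbt_utils.generate_surrogate_key(['vendorid', 'lpep_pickup_datetime']) }} as tripid,

    -- enforce consistent types on raw fields
    cast(vendorid as integer) as vendorid,
    cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
    cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,
    cast(pulocationid as integer) as pickup_locationid,
    cast(dolocationid as integer) as dropoff_locationid,
    cast(total_amount as numeric) as total_amount

from source
```

Materializing staging models as views keeps them cheap to rebuild while downstream models read a consistently typed interface.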

2. Core Data Models

The heart of the project lies in its core models:

  • fact_trips: A unified view of all taxi trips with additional metadata

  • dim_zones: A dimension table for NYC taxi zone information

  • dm_monthly_zone_revenue: A data mart for analyzing revenue patterns
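As a rough sketch of how fact_trips unifies the services (again with assumed column names), the model unions the staging views, tags each row with its service type, and enriches it with zone metadata:

```sql
-- models/core/fact_trips.sql (illustrative sketch)
with green as (
    select *, 'Green' as service_type
    from {{ ref('stg_green_tripdata') }}
),

yellow as (
    select *, 'Yellow' as service_type
    from {{ ref('stg_yellow_tripdata') }}
),

trips as (
    select * from green
    union all
    select * from yellow
)

select
    trips.*,
    pickup_zone.zone as pickup_zone,
    dropoff_zone.zone as dropoff_zone
from trips
inner join {{ ref('dim_zones') }} as pickup_zone
    on trips.pickup_locationid = pickup_zone.locationid
inner join {{ ref('dim_zones') }} as dropoff_zone
    on trips.dropoff_locationid = dropoff_zone.locationid
```

The inner joins against dim_zones also act as an implicit filter, dropping trips that reference unknown zones.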

3. Testing & Documentation

Data quality is ensured through:

  • Column-level tests for uniqueness, null values, and data ranges

  • Custom tests for business logic validation (e.g., positive_values test)

  • Comprehensive documentation generated automatically with dbt docs
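In dbt, column-level tests like these live in a schema YAML file next to the models. A minimal example, assuming the column names above (the custom positive_values test would be defined as a generic test in the project's tests directory):

```yaml
# models/core/schema.yml (illustrative sketch)
version: 2

models:
  - name: fact_trips
    description: "Unified yellow and green trips enriched with zone metadata."
    columns:
      - name: tripid
        tests:
          - unique
          - not_null
      - name: total_amount
        tests:
          - positive_values   # custom generic test for business logic
```

Running `dbt test` then compiles each entry into a SQL query that fails if any rows violate the rule.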

4. CI/CD & Orchestration

The project incorporates CI/CD practices with GitHub integration and scheduled job runs in dbt Cloud. This allows for:

  • Version-controlled data transformations

  • Automated testing on pull requests

  • Scheduled refreshes of production data
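As an illustration of what such a CI job might run on each pull request (exact commands depend on the dbt Cloud job configuration), dbt's state-based selection lets it build and test only the models that changed, plus everything downstream:

```
dbt deps                             # install packages (e.g. dbt_utils)
dbt build --select state:modified+   # build & test changed models and their children
```

The `state:modified+` selector compares against the production manifest, so a small PR doesn't trigger a full rebuild of the project.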

Why This Matters

This project demonstrates how modern data teams can:

  1. Move beyond SQL scripts to version-controlled, modular data transformations

  2. Implement testing that catches data quality issues before they impact analysis

  3. Create self-documenting data assets that make analytics accessible

  4. Bridge the gap between data engineering and analytics

Next Steps

To expand this project, I'm considering the following:

  • Adding machine learning models for trip duration prediction

  • Creating additional data marts for specific business questions

  • Implementing dbt metrics for standardized KPI tracking

Conclusion

Using dbt for NYC taxi data analytics showcases the power of treating data transformations as code. The resulting pipeline is maintainable, testable, and produces trusted data assets ready for business intelligence tools and further analysis.

Have you worked with dbt or similar tools for your data pipelines? I'd love to hear your experiences in the comments!


This project was built with dbt Cloud, BigQuery, and inspiration from the DataTalksClub community.

