Preprocessing NYC Taxi Trip Data with Python πŸ—½πŸš•

Martin Tran
6 min read

tags: [Python, Data Analysis, Pandas, Data Visualization, NYC]

Image credits: https://www.nyc.gov/site/tlc/about/about-tlc.page


Overview

The NYC Taxi & Limousine Commission (TLC) publishes one of the richest urban mobility datasets in the world, containing millions of taxi trips taken across New York City. Analyzing this data offers insights into urban transportation patterns, congestion hotspots, and rider behavior. In this post, we’ll begin preparing the data for machine learning tasks by merging metadata, visualizing demand, and exporting spatial features.


The Dataset

I used publicly available data from the NYC Taxi & Limousine Commission (TLC). Here's a glimpse at what the data looks like:

import pandas as pd

df = pd.read_parquet("yellow_tripdata_2020-01.parquet")
df.head()
| VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2020-01-01 00:28:15 | 2020-01-01 00:33:03 | 1.0 | 1.2 | 1.0 | N | 238 | 239 | 1 | 6.0 | 3.0 | 0.5 | 1.47 | 0.0 | 0.3 | 11.27 | 2.5 | None |
| 1 | 2020-01-01 00:35:39 | 2020-01-01 00:43:04 | 1.0 | 1.2 | 1.0 | N | 239 | 238 | 1 | 7.0 | 3.0 | 0.5 | 1.50 | 0.0 | 0.3 | 12.30 | 2.5 | None |
| 1 | 2020-01-01 00:47:41 | 2020-01-01 00:53:52 | 1.0 | 0.6 | 1.0 | N | 238 | 238 | 1 | 6.0 | 3.0 | 0.5 | 1.00 | 0.0 | 0.3 | 10.80 | 2.5 | None |
| 1 | 2020-01-01 00:55:23 | 2020-01-01 01:00:14 | 1.0 | 0.8 | 1.0 | N | 238 | 151 | 1 | 5.5 | 0.5 | 0.5 | 1.36 | 0.0 | 0.3 | 8.16 | 0.0 | None |
| 2 | 2020-01-01 00:01:58 | 2020-01-01 00:04:16 | 1.0 | 0.0 | 1.0 | N | 193 | 193 | 2 | 3.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 4.80 | 0.0 | None |
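
As a quick sanity check on this schema, trip duration can be derived from the two timestamp columns (a minimal sketch; it assumes the Parquet timestamps load as datetimes, which pandas does by default):

# Derive trip duration in minutes from pickup and drop-off timestamps
df["duration_min"] = (
    df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
).dt.total_seconds() / 60

# Summary statistics for a few numeric columns
df[["trip_distance", "duration_min", "total_amount"]].describe()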

What Do the Colors Mean?

πŸ’› Yellow Taxis

  • Traditional medallion taxis

  • Can pick up riders anywhere via street hail

  • Mostly operate in Manhattan and airports

  • Rich data: trip times, distances, fares, tips, payment types

πŸ’š Green Taxis

  • Also called Boro Taxis

  • Focus on outer boroughs and Upper Manhattan

  • Street hail only allowed outside Midtown/Downtown Manhattan

  • Data structure nearly identical to Yellow Taxis

🚘 FHV (For-Hire Vehicles)

  • Covers Uber, Lyft, Via, and other pre-arranged rides

  • Cannot be hailed from the street

  • Less granular data due to privacy: usually lacks fare and tip amounts

🚐 High Volume FHV

  • Since 2018, high-volume data includes:

    • Trip miles, time, shared ride flag

    • Pickups by zone


Step 1: Loading Lookup Metadata

We started by loading the taxi zone lookup table, which maps LocationID values to boroughs and zones:

import dask.dataframe as dd
import pandas as pd

lookup_table = pd.read_csv('./taxi_zones/taxi_zone_lookup.csv', header=0)
lookup_table.columns = ["LocationID", "Borough", "Zone", "service_zone"]
lookup_table = lookup_table.fillna("N/A")
lookup_table = dd.from_pandas(lookup_table, npartitions=1)

This mapping is essential for geographic analysis and later visualizations or aggregations by borough or zone.

Note: dask.dataframe is not strictly required here, but it makes it easy to read in the data with wildcards, since the trips are split across one file per month.

# Load all Parquet files into Dask
df = dd.read_parquet("./data/*/yellow_tripdata_*.parquet")

Step 2: Preparing Spatial Coordinates

In tandem with zone information, we worked with centroid coordinates for each taxi zone:

centroids = pd.read_csv('./taxi_zones/taxi_zone_centroids.csv')

The centroid data provides latitude and longitude per LocationID, useful for plotting pickup and drop-off locations or calculating spatial features.

We also included optional (but commented out) GeoPandas logic, sketched after this list, to:

  • Read in the official NYC taxi zone shapefile

  • Calculate geometric centroids

  • Convert the CRS to WGS84 (EPSG:4326)

  • Save those as CSV for reuse

This step is crucial for anyone wanting to enhance ML features with geographic context.
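
For reference, a minimal sketch of that commented-out logic might look like the following (assuming the shapefile lives at ./taxi_zones/taxi_zones.shp and carries a LocationID column, as the official TLC download does):

import geopandas as gpd
import pandas as pd

# Read the official NYC taxi zone shapefile (ships in a projected CRS)
zones = gpd.read_file("./taxi_zones/taxi_zones.shp")

# Compute centroids in the projected CRS, then convert to WGS84 lat/lon
centroid_points = zones.geometry.centroid.to_crs(epsg=4326)

# Save LocationID with latitude/longitude for reuse
pd.DataFrame({
    "LocationID": zones["LocationID"],
    "Latitude": centroid_points.y,
    "Longitude": centroid_points.x,
}).to_csv("./taxi_zones/taxi_zone_centroids.csv", index=False)

Computing centroids before reprojecting matters: centroid calculations in a geographic (lat/lon) CRS are geometrically distorted, so it's best done in the projected CRS the shapefile ships in.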


Step 3: Merging Data for Contextual Features

To give each trip geographic context, we merged the trip data with the lookup table (once for pickups, once for drop-offs) and then with the centroids, joining on LocationID:

# Merge trip data with zone lookup
df = df.merge(lookup_table, left_on="PULocationID", right_on="LocationID", how="left")
df = df.rename(columns={"Borough": "pickup_borough", "Zone": "pickup_zone", "service_zone": "pickup_service_zone"})
df = df.drop(columns=["LocationID"])  # Remove duplicate LocationID column

df = df.merge(lookup_table, left_on="DOLocationID", right_on="LocationID", how="left")
df = df.rename(columns={"Borough": "dropoff_borough", "Zone": "dropoff_zone", "service_zone": "dropoff_service_zone"})
df = df.drop(columns=["LocationID"])  # Remove duplicate LocationID column

# Merge centroids with trip data (ensure `PULocationID` is matched)
df = df.merge(centroids, left_on="PULocationID", right_on="LocationID", how="left")
df = df.drop(columns=["LocationID"]) # Remove duplicate LocationID column

df["hour"] = df["tpep_pickup_datetime"].dt.hour

This combined dataset now contains:

  • Zone and borough names

  • Latitude/Longitude of zone centroids

Such a table becomes critical when analyzing pickup or drop-off zones in the raw trip data.
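
If drop-off coordinates are wanted as well, the same centroid merge can be repeated for DOLocationID with renamed columns to avoid collisions (a sketch; the pipeline above only merges pickup centroids):

# Repeat the centroid merge for drop-off zones under distinct column names
dropoff_centroids = centroids.rename(columns={
    "Latitude": "dropoff_latitude",
    "Longitude": "dropoff_longitude",
})
df = df.merge(dropoff_centroids, left_on="DOLocationID",
              right_on="LocationID", how="left")
df = df.drop(columns=["LocationID"])  # Remove duplicate LocationID column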


Step 4: Understanding df.compute()

Since we used Dask to load the trip data, we eventually need to convert the lazy Dask DataFrame into a real, in-memory Pandas DataFrame using .compute(). Note that .compute() returns the result rather than modifying the DataFrame in place, so the result must be assigned back:

df = df.compute()

This command triggers Dask to execute any queued operations and return the full Pandas DataFrame. It’s necessary when you want to:

  • Perform operations that require full materialization (like merging with other Pandas data)

  • Preview results or export data

  • Move from parallel to single-machine processing when the data size is manageable

Using .compute() strategically allows us to benefit from Dask’s scalability without sacrificing compatibility with core Python tools.
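
As an aside, when the full table is too large to pull onto one machine, Dask's .persist() is the usual alternative: it executes the pending work but keeps the result as a distributed Dask DataFrame (a sketch, not something this post's pipeline relies on):

# .persist() runs the queued operations but keeps the data distributed,
# unlike .compute(), which collects everything into one pandas DataFrame
df = df.persist()

# Small aggregations can still be collected cheaply afterwards
trips_per_borough = df.groupby("pickup_borough").size().compute()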


Step 5: Computing Basic Zone Statistics

We then counted how often each zone appeared as a pickup location and attached coordinates to the counts:

# Count how many trips start in each zone
zone_counts = df.groupby("PULocationID").size().reset_index(name="trip_count")

# Merge zone counts with coordinates
zone_counts = zone_counts.merge(
    centroids,
    left_on="PULocationID",
    right_on="LocationID",
    how="left"
)

These summaries provide insight into high-traffic areas, which (as sketched after this list) can later be used to:

  • Create features for popular zones

  • Weight zones based on frequency

  • Visualize spatial activity levels
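
For example, a popularity-rank feature falls straight out of these counts (a minimal sketch, building on the zone_counts table above):

# Rank zones by pickup volume (rank 1 = busiest zone)
zone_counts["popularity_rank"] = (
    zone_counts["trip_count"].rank(ascending=False, method="dense").astype(int)
)

# Inspect the five busiest pickup zones
zone_counts.nlargest(5, "trip_count")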


Step 6: Creating Visualization File for Kepler.gl

Once the data is merged and enriched with lat/lon coordinates, it's ready to be exported and visualized in a tool like Kepler.gl.

# Export to CSV for Kepler (assumes zone_counts was also aggregated by
# pickup time, which is what produces the timestamp/hour/day/day_of_year/
# day_of_week columns selected here)
zone_counts[[
    "timestamp", "Latitude", "Longitude", "trip_count",
    "hour", "day", "day_of_year", "day_of_week"
]].to_csv("kepler_zone_heatmap_detailed.csv", index=False)

With Kepler.gl, we can:

  • Plot pickup/drop-off zone centers

  • Visualize heatmaps of high-traffic zones

  • Animate trip volume over time (in future posts)

This step is especially useful for communicating insights spatially and uncovering regional patterns before modeling.
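
For readers who prefer to stay in a notebook, the keplergl Python package can render the same CSV as an interactive widget (a sketch; the package is an extra dependency, not something this post's pipeline requires):

import pandas as pd
from keplergl import KeplerGl

heatmap_df = pd.read_csv("kepler_zone_heatmap_detailed.csv")

# Build an interactive map and save it as a standalone HTML page
zone_map = KeplerGl(height=600)
zone_map.add_data(data=heatmap_df, name="zone_heatmap")
zone_map.save_to_html(file_name="zone_heatmap.html")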


Step 7: Uncovering Demand Patterns by Hour and Borough

To better understand rider behavior, we visualized taxi demand over time and across boroughs.

Demand by Hour:

import matplotlib.pyplot as plt

df['hour'] = df['tpep_pickup_datetime'].dt.hour
hourly_demand = df.groupby('hour').size()

plt.figure(figsize=(10, 5))
plt.plot(hourly_demand.index, hourly_demand.values, marker='o')
plt.xlabel("Hour of Day")
plt.ylabel("Number of Trips")
plt.title("NYC Taxi Demand by Hour")
plt.grid()
plt.show()

This line plot shows how demand fluctuates over the course of a day, highlighting rush hours and late-night surges.

Demand by Borough:

borough_counts = df["pickup_borough"].value_counts().drop('N/A')

plt.figure(figsize=(10, 5))
borough_counts.plot(kind="bar", color="orange")
plt.xlabel("Borough")
plt.ylabel("Number of Trips")
plt.title("NYC Taxi Demand by Borough")
plt.xticks(rotation=45)
plt.show()

This bar chart reveals which boroughs are most active for pickups. Both analyses provide valuable context when building time-based or location-based features.
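
The two views can also be combined: a pivot of trip counts by hour and borough shows when each borough peaks (a sketch using the columns built above):

# Trips per (hour, borough): one line per borough across the day
demand = (
    df.groupby(["hour", "pickup_borough"])
      .size()
      .unstack(fill_value=0)
)
demand.plot(figsize=(10, 5), title="Hourly Demand by Borough")
plt.ylabel("Number of Trips")
plt.show()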


Next Steps: Feature Engineering & ML

Modeling Possibilities

With zone metadata and demand patterns now accessible, we can engineer features like the following (one example is sketched after these lists):

  • pickup_hour, pickup_zone, dropoff_zone (categorical features)

  • Trip distance (from raw data)

  • Average tip by zone or hour (aggregated features)

  • Popularity rank of pickup zones

These features could feed into models predicting:

  • Fare amount (regression)

  • Trip duration (regression)

  • Demand forecasting (time series or classification)
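
As a small preview of that feature work, hour-of-day is naturally cyclical, and a sine/cosine encoding captures that (a sketch; the actual feature set will come in a later post):

import numpy as np

# Encode hour cyclically so 23:00 and 00:00 end up close together
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)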

Stay tuned for the next post where we'll start visualizing the data with Kepler.gl.
