Preprocessing NYC Taxi Trip Data with Python

Table of contents
- Overview
- The Dataset
- What Do the Colors Mean?
- Step 1: Loading Lookup Metadata
- Step 2: Preparing Spatial Coordinates
- Step 3: Merging Data for Contextual Features
- Step 4: Understanding df.compute()
- Step 5: Computing Basic Zone Statistics
- Step 6: Creating a Visualization File for Kepler.gl
- Step 7: Uncovering Demand Patterns by Hour and Borough
- Next Steps: Feature Engineering & ML

tags: [Python, Data Analysis, Pandas, Data Visualization, NYC]
Image credits: https://www.nyc.gov/site/tlc/about/about-tlc.page
Overview
The NYC Taxi & Limousine Commission (TLC) publishes one of the richest urban mobility datasets in the world, containing millions of taxi trips taken across New York City. Analyzing this data offers insights into urban transportation patterns, congestion hotspots, and rider behavior. In this post, we'll begin preparing the data for machine learning tasks by merging metadata, visualizing demand, and exporting spatial features.
The Dataset
I used publicly available data from the NYC Taxi & Limousine Commission (TLC). Here's a glimpse at what the data looks like:
import pandas as pd

# Load one month of Yellow Taxi trip records and preview the first rows
df = pd.read_parquet("yellow_tripdata_2020-01.parquet")
df.head()
VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
1 | 2020-01-01 00:28:15 | 2020-01-01 00:33:03 | 1.0 | 1.2 | 1.0 | N | 238 | 239 | 1 | 6.0 | 3.0 | 0.5 | 1.47 | 0.0 | 0.3 | 11.27 | 2.5 | None |
1 | 2020-01-01 00:35:39 | 2020-01-01 00:43:04 | 1.0 | 1.2 | 1.0 | N | 239 | 238 | 1 | 7.0 | 3.0 | 0.5 | 1.50 | 0.0 | 0.3 | 12.30 | 2.5 | None |
1 | 2020-01-01 00:47:41 | 2020-01-01 00:53:52 | 1.0 | 0.6 | 1.0 | N | 238 | 238 | 1 | 6.0 | 3.0 | 0.5 | 1.00 | 0.0 | 0.3 | 10.80 | 2.5 | None |
1 | 2020-01-01 00:55:23 | 2020-01-01 01:00:14 | 1.0 | 0.8 | 1.0 | N | 238 | 151 | 1 | 5.5 | 0.5 | 0.5 | 1.36 | 0.0 | 0.3 | 8.16 | 0.0 | None |
2 | 2020-01-01 00:01:58 | 2020-01-01 00:04:16 | 1.0 | 0.0 | 1.0 | N | 193 | 193 | 2 | 3.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 4.80 | 0.0 | None |
What Do the Colors Mean?
Yellow Taxis
- Traditional medallion taxis
- Can pick up riders anywhere via street hail
- Mostly operate in Manhattan and at airports
- Rich data: trip times, distances, fares, tips, payment types
Green Taxis
- Also called Boro Taxis
- Focus on the outer boroughs and Upper Manhattan
- Street hails only allowed outside Midtown/Downtown Manhattan
- Data structure nearly identical to Yellow Taxis
FHV (For-Hire Vehicles)
- Covers Uber, Lyft, Via, and other pre-arranged rides
- Cannot be hailed from the street
- Less granular data due to privacy: usually lacks fare and tip amounts
High Volume FHV
Since 2018, high-volume data includes:
- Trip miles, time, shared-ride flag
- Pickups by zone
Step 1: Loading Lookup Metadata
We started by loading the taxi zone lookup table, which maps LocationID values to boroughs and zones:
import dask.dataframe as dd
import pandas as pd
# Map LocationID -> Borough/Zone/service_zone, marking missing values as "N/A"
lookup_table = pd.read_csv('./taxi_zones/taxi_zone_lookup.csv', header=0)
lookup_table.columns = ["LocationID", "Borough", "Zone", "service_zone"]
lookup_table = lookup_table.fillna("N/A")
# Convert to Dask so it can merge with the Dask trip DataFrame later
lookup_table = dd.from_pandas(lookup_table, npartitions=1)
This mapping is essential for geographic analysis and later visualizations or aggregations by borough or zone.
Note: dask.dataframe isn't strictly required here, but it lets us read the trip data with wildcards, which is convenient because the data is split across one file per month.
# Load all Parquet files into Dask
df = dd.read_parquet("./data/*/yellow_tripdata_*.parquet")
Step 2: Preparing Spatial Coordinates
In tandem with zone information, we worked with centroid coordinates for each taxi zone:
centroids = pd.read_csv('./taxi_zones/taxi_zone_centroids.csv')
The centroid data provides latitude and longitude per LocationID, useful for plotting pickup and drop-off locations or calculating spatial features.
We also included optional (but commented out) GeoPandas logic to:
- Read in the official NYC taxi zone shapefile
- Calculate geometric centroids
- Convert the CRS to WGS84 (EPSG:4326)
- Save those as CSV for reuse
This step is crucial for anyone wanting to enhance ML features with geographic context.
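For reference, here is a minimal sketch of what that commented-out GeoPandas logic could look like. The shapefile path is an assumption; the LocationID column and output column names follow the official TLC shapefile and the centroid CSV used above:
import geopandas as gpd
import pandas as pd

# Assumed path to the official TLC taxi zone shapefile
zones = gpd.read_file("./taxi_zones/taxi_zones.shp")

# Compute centroids in the shapefile's projected CRS, then reproject to WGS84
centroids_wgs84 = zones.geometry.centroid.to_crs(epsg=4326)

# Save a LocationID -> lat/lon table for reuse
pd.DataFrame({
    "LocationID": zones["LocationID"],
    "Latitude": centroids_wgs84.y,
    "Longitude": centroids_wgs84.x,
}).to_csv("./taxi_zones/taxi_zone_centroids.csv", index=False)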
Step 3: Merging Data for Contextual Features
To tie the lookup and centroid data together, we merged them into one enriched table on LocationID:
# Merge trip data with zone lookup
df = df.merge(lookup_table, left_on="PULocationID", right_on="LocationID", how="left")
df = df.rename(columns={"Borough": "pickup_borough", "Zone": "pickup_zone", "service_zone": "pickup_service_zone"})
df = df.drop(columns=["LocationID"]) # Remove duplicate LocationID column
df = df.merge(lookup_table, left_on="DOLocationID", right_on="LocationID", how="left")
df = df.rename(columns={"Borough": "dropoff_borough", "Zone": "dropoff_zone", "service_zone": "dropoff_service_zone"})
df = df.drop(columns=["LocationID"]) # Remove duplicate LocationID column
# Merge centroids with trip data (ensure `PULocationID` is matched)
df = df.merge(centroids, left_on="PULocationID", right_on="LocationID", how="left")
df = df.drop(columns=["LocationID"]) # Remove duplicate LocationID column
df["hour"] = df["tpep_pickup_datetime"].dt.hour
This combined dataset now contains:
- Zone and borough names
- Latitude/longitude of zone centroids
Such a table becomes critical when analyzing pickup or drop-off zones in the raw trip data.
Step 4: Understanding df.compute()
Since we used Dask to load the trip data (and converted the lookup table to a Dask DataFrame), we need to turn the lazy Dask DataFrame back into a real in-memory Pandas DataFrame using .compute():
df = df.compute()  # execute all queued operations; df is now a pandas DataFrame
This command triggers Dask to execute any queued operations and return the full Pandas DataFrame. It's necessary when you want to:
- Perform operations that require full materialization (like merging with other Pandas data)
- Preview results or export data
- Move from parallel to single-machine processing when the data size is manageable
Using .compute() strategically allows us to benefit from Dask's scalability without sacrificing compatibility with core Python tools.
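To make the laziness concrete, here is a small illustration (not part of the original pipeline): the groupby below only builds a task graph, and nothing runs until .compute() is called.
import dask.dataframe as dd

# Building the graph is instant; no trip data is loaded yet
lazy_means = (
    dd.read_parquet("./data/*/yellow_tripdata_*.parquet")
      .groupby("PULocationID")["trip_distance"]
      .mean()
)

# .compute() executes the graph and returns a plain pandas Series
zone_means = lazy_means.compute()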
Step 5: Computing Basic Zone Statistics
We then computed some basic counts of how often each zone appeared:
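The snippet that actually builds zone_counts didn't survive into this post, so here is a hedged reconstruction; the time columns are assumed from the export in Step 6:
# Hedged reconstruction: count trips per pickup zone and hourly time bucket
pickup = df["tpep_pickup_datetime"]
df["timestamp"] = pickup.dt.floor("h")
df["day"] = pickup.dt.day
df["day_of_year"] = pickup.dt.dayofyear
df["day_of_week"] = pickup.dt.dayofweek

zone_counts = (
    df.groupby(["PULocationID", "timestamp", "hour", "day",
                "day_of_year", "day_of_week"])
      .size()
      .rename("trip_count")
      .reset_index()
)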
# Merge zone counts with coordinates
zone_counts = zone_counts.merge(
centroids,
left_on="PULocationID",
right_on="LocationID",
how="left"
)
These summaries provide insight into high-traffic areas, which can later be used to:
- Create features for popular zones
- Weight zones based on frequency
- Visualize spatial activity levels
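For example, a frequency-based zone weight might look like this (a sketch, not from the original post):
# Aggregate to one total per pickup zone, then normalize to a 0-1 weight
zone_totals = zone_counts.groupby("PULocationID")["trip_count"].sum()
zone_weights = zone_totals / zone_totals.sum()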
Step 6: Creating a Visualization File for Kepler.gl
Once the data is merged and enriched with lat/lon coordinates, it's ready to be exported and visualized in a tool like Kepler.gl.
# Export to CSV for Kepler
zone_counts[[
"timestamp", "Latitude", "Longitude", "trip_count",
"hour", "day", "day_of_year", "day_of_week"
]].to_csv("kepler_zone_heatmap_detailed.csv", index=False)
With Kepler.gl, we can:
- Plot pickup/drop-off zone centers
- Visualize heatmaps of high-traffic zones
- Animate trip volume over time (in future posts)
This step is especially useful for communicating insights spatially and uncovering regional patterns before modeling.
Step 7: Uncovering Demand Patterns by Hour and Borough
To better understand rider behavior, we visualized taxi demand over time and across boroughs.
Demand by Hour:
import matplotlib.pyplot as plt
df['hour'] = df['tpep_pickup_datetime'].dt.hour
hourly_demand = df.groupby('hour').size()
plt.figure(figsize=(10, 5))
plt.plot(hourly_demand.index, hourly_demand.values, marker='o')
plt.xlabel("Hour of Day")
plt.ylabel("Number of Trips")
plt.title("NYC Taxi Demand by Hour")
plt.grid()
plt.show()
This line plot shows daily fluctuations in demand, highlighting rush hours and late-night surges.
Demand by Borough:
borough_counts = df["pickup_borough"].value_counts().drop('N/A')
plt.figure(figsize=(10, 5))
borough_counts.plot(kind="bar", color="orange")
plt.xlabel("Borough")
plt.ylabel("Number of Trips")
plt.title("NYC Taxi Demand by Borough")
plt.xticks(rotation=45)
plt.show()
This bar chart reveals which boroughs are most active for pickups. Both analyses provide valuable context when building time-based or location-based features.
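Combining the two dimensions is a natural next step. Here is a hedged sketch that plots demand by hour, broken out by borough:
import matplotlib.pyplot as plt

# Count trips per (hour, borough) pair and pivot boroughs into columns
hour_borough = (
    df.groupby(["hour", "pickup_borough"]).size().unstack(fill_value=0)
)
hour_borough.plot(figsize=(10, 5))
plt.xlabel("Hour of Day")
plt.ylabel("Number of Trips")
plt.title("NYC Taxi Demand by Hour and Borough")
plt.show()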
Next Steps: Feature Engineering & ML
Modeling Possibilities
With zone metadata and demand patterns now accessible, we can engineer features like:
- pickup_hour, pickup_zone, dropoff_zone (categorical features)
- Trip distance (from raw data)
- Average tip by zone or hour (aggregated features)
- Popularity rank of pickup zones
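A sketch of how those features might be assembled (the column and variable names here are illustrative, not from the original):
# Categorical and raw features, reusing the merged columns from Step 3
features = df[["pickup_zone", "dropoff_zone", "trip_distance", "hour"]].copy()
features = features.rename(columns={"hour": "pickup_hour"})

# Aggregated feature: average tip by pickup zone
avg_tip = (
    df.groupby("pickup_zone")["tip_amount"].mean()
      .rename("zone_avg_tip").reset_index()
)
features = features.merge(avg_tip, on="pickup_zone", how="left")

# Popularity rank of pickup zones (1 = busiest)
zone_rank = (
    df["pickup_zone"].value_counts()
      .rank(ascending=False, method="dense")
      .rename("zone_popularity_rank")
      .rename_axis("pickup_zone")
      .reset_index()
)
features = features.merge(zone_rank, on="pickup_zone", how="left")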
These features could feed into models predicting:
- Fare amount (regression)
- Trip duration (regression)
- Demand forecasting (time series or classification)
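As a preview of the modeling step, a minimal fare-amount regression could look like this (scikit-learn is an assumption here; it isn't used elsewhere in this post):
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Simple numeric baseline: predict fare from trip distance and pickup hour
data = df[["trip_distance", "hour", "fare_amount"]].dropna()
X = data[["trip_distance", "hour"]]
y = data["fare_amount"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
print(f"R^2 on held-out trips: {model.score(X_test, y_test):.3f}")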
Stay tuned for the next post where we'll start visualizing the data with Kepler.gl.