Preprocessing NYC Taxi Trip Data with Python πŸ—½πŸš•

Martin Tran
6 min read

tags: [Python, Data Analysis, Pandas, Data Visualization, NYC]

Image credits: https://www.nyc.gov/site/tlc/about/about-tlc.page


Overview

The NYC Taxi & Limousine Commission (TLC) publishes one of the richest urban mobility datasets in the world, containing millions of taxi trips taken across New York City. Analyzing this data offers insights into urban transportation patterns, congestion hotspots, and rider behavior. In this post, we’ll begin preparing the data for machine learning tasks by merging metadata, visualizing demand, and exporting spatial features.


The Dataset

I used publicly available data from the NYC Taxi & Limousine Commission (TLC). Here's a glimpse at what the data looks like:

import pandas as pd

df = pd.read_parquet("yellow_tripdata_2020-01.parquet")
df.head()
| VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2020-01-01 00:28:15 | 2020-01-01 00:33:03 | 1.0 | 1.2 | 1.0 | N | 238 | 239 | 1 | 6.0 | 3.0 | 0.5 | 1.47 | 0.0 | 0.3 | 11.27 | 2.5 | None |
| 1 | 2020-01-01 00:35:39 | 2020-01-01 00:43:04 | 1.0 | 1.2 | 1.0 | N | 239 | 238 | 1 | 7.0 | 3.0 | 0.5 | 1.50 | 0.0 | 0.3 | 12.30 | 2.5 | None |
| 1 | 2020-01-01 00:47:41 | 2020-01-01 00:53:52 | 1.0 | 0.6 | 1.0 | N | 238 | 238 | 1 | 6.0 | 3.0 | 0.5 | 1.00 | 0.0 | 0.3 | 10.80 | 2.5 | None |
| 1 | 2020-01-01 00:55:23 | 2020-01-01 01:00:14 | 1.0 | 0.8 | 1.0 | N | 238 | 151 | 1 | 5.5 | 0.5 | 0.5 | 1.36 | 0.0 | 0.3 | 8.16 | 0.0 | None |
| 2 | 2020-01-01 00:01:58 | 2020-01-01 00:04:16 | 1.0 | 0.0 | 1.0 | N | 193 | 193 | 2 | 3.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 4.80 | 0.0 | None |
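
As a quick sanity check on this schema, trip duration can be derived from the two timestamp columns (a minimal sketch; it assumes the Parquet timestamps load as datetimes, which pandas does by default):

# Derive trip duration in minutes from pickup and drop-off timestamps
df["duration_min"] = (
    df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
).dt.total_seconds() / 60

# Summary statistics for a few numeric columns
df[["trip_distance", "duration_min", "total_amount"]].describe()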

What Do the Colors Mean?

πŸ’› Yellow Taxis

  • Traditional medallion taxis

  • Can pick up riders anywhere via street hail

  • Mostly operate in Manhattan and airports

  • Rich data: trip times, distances, fares, tips, payment types

πŸ’š Green Taxis

  • Also called Boro Taxis

  • Focus on outer boroughs and Upper Manhattan

  • Street hail only allowed outside Midtown/Downtown Manhattan

  • Data structure nearly identical to Yellow Taxis

🚘 FHV (For-Hire Vehicles)

  • Covers Uber, Lyft, Via, and other pre-arranged rides

  • Cannot be hailed from the street

  • Less granular data due to privacy: usually lacks fare and tip amounts

🚐 High Volume FHV

  • Since 2018, high-volume data includes:

    • Trip miles, time, shared ride flag

    • Pickups by zone


Step 1: Loading Lookup Metadata

We started by loading the taxi zone lookup table, which maps LocationID values to boroughs and zones:

import dask.dataframe as dd
import pandas as pd

lookup_table = pd.read_csv('./taxi_zones/taxi_zone_lookup.csv', header=0)
lookup_table.columns = ["LocationID", "Borough", "Zone", "service_zone"]
lookup_table = lookup_table.fillna("N/A")
lookup_table = dd.from_pandas(lookup_table, npartitions=1)

This mapping is essential for geographic analysis and later visualizations or aggregations by borough or zone.

Note: dask.dataframe is not strictly required here, but it makes it easy to read in the data with wildcards, since the trips are split across one file per month.

# Load all Parquet files into Dask
df = dd.read_parquet("./data/*/yellow_tripdata_*.parquet")

Step 2: Preparing Spatial Coordinates

In tandem with zone information, we worked with centroid coordinates for each taxi zone:

centroids = pd.read_csv('./taxi_zones/taxi_zone_centroids.csv')

The centroid data provides latitude and longitude per LocationID, useful for plotting pickup and drop-off locations or calculating spatial features.

We also included optional (but commented out) GeoPandas logic, sketched after this list, to:

  • Read in the official NYC taxi zone shapefile

  • Calculate geometric centroids

  • Convert the CRS to WGS84 (EPSG:4326)

  • Save those as CSV for reuse

This step is crucial for anyone wanting to enhance ML features with geographic context.
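
For reference, a minimal sketch of that commented-out logic might look like the following (assuming the shapefile lives at ./taxi_zones/taxi_zones.shp and carries a LocationID column, as the official TLC download does):

import geopandas as gpd
import pandas as pd

# Read the official NYC taxi zone shapefile (ships in a projected CRS)
zones = gpd.read_file("./taxi_zones/taxi_zones.shp")

# Compute centroids in the projected CRS, then convert to WGS84 lat/lon
centroid_points = zones.geometry.centroid.to_crs(epsg=4326)

# Save LocationID with latitude/longitude for reuse
pd.DataFrame({
    "LocationID": zones["LocationID"],
    "Latitude": centroid_points.y,
    "Longitude": centroid_points.x,
}).to_csv("./taxi_zones/taxi_zone_centroids.csv", index=False)

Computing centroids before reprojecting matters: centroid calculations in a geographic (lat/lon) CRS are geometrically distorted, so it's best done in the projected CRS the shapefile ships in.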


Step 3: Merging Data for Contextual Features

To give each trip geographic context, we merged the trip data with the lookup table (once for pickups, once for drop-offs) and then with the centroids, joining on LocationID:

# Merge trip data with zone lookup
df = df.merge(lookup_table, left_on="PULocationID", right_on="LocationID", how="left")
df = df.rename(columns={"Borough": "pickup_borough", "Zone": "pickup_zone", "service_zone": "pickup_service_zone"})
df = df.drop(columns=["LocationID"])  # Remove duplicate LocationID column

df = df.merge(lookup_table, left_on="DOLocationID", right_on="LocationID", how="left")
df = df.rename(columns={"Borough": "dropoff_borough", "Zone": "dropoff_zone", "service_zone": "dropoff_service_zone"})
df = df.drop(columns=["LocationID"])  # Remove duplicate LocationID column

# Merge centroids with trip data (ensure `PULocationID` is matched)
df = df.merge(centroids, left_on="PULocationID", right_on="LocationID", how="left")
df = df.drop(columns=["LocationID"]) # Remove duplicate LocationID column

df["hour"] = df["tpep_pickup_datetime"].dt.hour

This combined dataset now contains:

  • Zone and borough names

  • Latitude/Longitude of zone centroids

Such a table becomes critical when analyzing pickup or drop-off zones in the raw trip data.
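
If drop-off coordinates are wanted as well, the same centroid merge can be repeated for DOLocationID with renamed columns to avoid collisions (a sketch; the pipeline above only merges pickup centroids):

# Repeat the centroid merge for drop-off zones under distinct column names
dropoff_centroids = centroids.rename(columns={
    "Latitude": "dropoff_latitude",
    "Longitude": "dropoff_longitude",
})
df = df.merge(dropoff_centroids, left_on="DOLocationID",
              right_on="LocationID", how="left")
df = df.drop(columns=["LocationID"])  # Remove duplicate LocationID column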


Step 4: Understanding df.compute()

Since we used Dask to load the trip data, we eventually need to convert the lazy Dask DataFrame into a real, in-memory Pandas DataFrame using .compute(). Note that .compute() returns the result rather than modifying the DataFrame in place, so the result must be assigned back:

df = df.compute()

This command triggers Dask to execute any queued operations and return the full Pandas DataFrame. It’s necessary when you want to:

  • Perform operations that require full materialization (like merging with other Pandas data)

  • Preview results or export data

  • Move from parallel to single-machine processing when the data size is manageable

Using .compute() strategically allows us to benefit from Dask’s scalability without sacrificing compatibility with core Python tools.
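
As an aside, when the full table is too large to pull onto one machine, Dask's .persist() is the usual alternative: it executes the pending work but keeps the result as a distributed Dask DataFrame (a sketch, not something this post's pipeline relies on):

# .persist() runs the queued operations but keeps the data distributed,
# unlike .compute(), which collects everything into one pandas DataFrame
df = df.persist()

# Small aggregations can still be collected cheaply afterwards
trips_per_borough = df.groupby("pickup_borough").size().compute()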


Step 5: Computing Basic Zone Statistics

We then counted how often each zone appeared as a pickup location and attached coordinates to the counts:

# Count how many trips start in each zone
zone_counts = df.groupby("PULocationID").size().reset_index(name="trip_count")

# Merge zone counts with coordinates
zone_counts = zone_counts.merge(
    centroids,
    left_on="PULocationID",
    right_on="LocationID",
    how="left"
)

These summaries provide insight into high-traffic areas, which (as sketched after this list) can later be used to:

  • Create features for popular zones

  • Weight zones based on frequency

  • Visualize spatial activity levels
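
For example, a popularity-rank feature falls straight out of these counts (a minimal sketch, building on the zone_counts table above):

# Rank zones by pickup volume (rank 1 = busiest zone)
zone_counts["popularity_rank"] = (
    zone_counts["trip_count"].rank(ascending=False, method="dense").astype(int)
)

# Inspect the five busiest pickup zones
zone_counts.nlargest(5, "trip_count")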


Step 6: Creating Visualization File for Kepler.gl

Once the data is merged and enriched with lat/lon coordinates, it's ready to be exported and visualized in a tool like Kepler.gl.

# Export to CSV for Kepler (assumes zone_counts was also aggregated by
# pickup time, which is what produces the timestamp/hour/day/day_of_year/
# day_of_week columns selected here)
zone_counts[[
    "timestamp", "Latitude", "Longitude", "trip_count",
    "hour", "day", "day_of_year", "day_of_week"
]].to_csv("kepler_zone_heatmap_detailed.csv", index=False)

With Kepler.gl, we can:

  • Plot pickup/drop-off zone centers

  • Visualize heatmaps of high-traffic zones

  • Animate trip volume over time (in future posts)

This step is especially useful for communicating insights spatially and uncovering regional patterns before modeling.
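
For readers who prefer to stay in a notebook, the keplergl Python package can render the same CSV as an interactive widget (a sketch; the package is an extra dependency, not something this post's pipeline requires):

import pandas as pd
from keplergl import KeplerGl

heatmap_df = pd.read_csv("kepler_zone_heatmap_detailed.csv")

# Build an interactive map and save it as a standalone HTML page
zone_map = KeplerGl(height=600)
zone_map.add_data(data=heatmap_df, name="zone_heatmap")
zone_map.save_to_html(file_name="zone_heatmap.html")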


Step 7: Uncovering Demand Patterns by Hour and Borough

To better understand rider behavior, we visualized taxi demand over time and across boroughs.

Demand by Hour:

import matplotlib.pyplot as plt

df['hour'] = df['tpep_pickup_datetime'].dt.hour
hourly_demand = df.groupby('hour').size()

plt.figure(figsize=(10, 5))
plt.plot(hourly_demand.index, hourly_demand.values, marker='o')
plt.xlabel("Hour of Day")
plt.ylabel("Number of Trips")
plt.title("NYC Taxi Demand by Hour")
plt.grid()
plt.show()

This line plot shows how demand fluctuates over the course of a day, highlighting rush hours and late-night surges.

Demand by Borough:

borough_counts = df["pickup_borough"].value_counts().drop('N/A')

plt.figure(figsize=(10, 5))
borough_counts.plot(kind="bar", color="orange")
plt.xlabel("Borough")
plt.ylabel("Number of Trips")
plt.title("NYC Taxi Demand by Borough")
plt.xticks(rotation=45)
plt.show()

This bar chart reveals which boroughs are most active for pickups. Both analyses provide valuable context when building time-based or location-based features.
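
The two views can also be combined: a pivot of trip counts by hour and borough shows when each borough peaks (a sketch using the columns built above):

# Trips per (hour, borough): one line per borough across the day
demand = (
    df.groupby(["hour", "pickup_borough"])
      .size()
      .unstack(fill_value=0)
)
demand.plot(figsize=(10, 5), title="Hourly Demand by Borough")
plt.ylabel("Number of Trips")
plt.show()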


Next Steps: Feature Engineering & ML

Modeling Possibilities

With zone metadata and demand patterns now accessible, we can engineer features like the following (one example is sketched after these lists):

  • pickup_hour, pickup_zone, dropoff_zone (categorical features)

  • Trip distance (from raw data)

  • Average tip by zone or hour (aggregated features)

  • Popularity rank of pickup zones

These features could feed into models predicting:

  • Fare amount (regression)

  • Trip duration (regression)

  • Demand forecasting (time series or classification)
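
As a small preview of that feature work, hour-of-day is naturally cyclical, and a sine/cosine encoding captures that (a sketch; the actual feature set will come in a later post):

import numpy as np

# Encode hour cyclically so 23:00 and 00:00 end up close together
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)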

Stay tuned for the next post where we'll start visualizing the data with Kepler.gl.
