Handling GTFS data with DuckDB


The General Transit Feed Specification (GTFS) is a standardized, open data format for public transportation schedules and geographic information. In practice, a GTFS feed is simply a ZIP archive of text (CSV) tables - such as stops.txt, routes.txt, and trips.txt - that together describe an agency’s routes, stops, schedules, and fare rules.
By following this common schema, transit agencies can publish their data so that any GTFS-compatible application (trip planners, mapping tools, or analytics pipelines) can consume it easily. Because GTFS is an open standard, it has become the go-to format for thousands of transit agencies worldwide.
This broad adoption means developers can integrate multiple feeds into a single pipeline without writing custom parsers. For example, many journey-planning apps and mapping services rely on GTFS feeds to deliver route and schedule information to users.
The specification even separates static and dynamic data: the Schedule portion provides planned timetables (in CSV), while the Realtime extension (using Protocol Buffers) adds live trip updates, vehicle positions, and service alerts.
The main benefits of GTFS are:
Consistent schema: It defines a common set of entities (stops, routes, trips, etc.), so users and developers only need to learn one format for all agencies
Interoperability: Any GTFS-compliant tool can read feeds from any transit provider, enabling seamless integration across different platforms and systems
Plain-text and open: The format is based on simple CSV files, which are easy to generate, ingest (e.g. via SQL or pandas, as shown in the snippet after this list), and maintain without proprietary software
Static + real-time: GTFS distinguishes fixed schedules from live updates. The static GTFS-Schedule files cover routes and timetables, while GTFS-Realtime (Protocol Buffers) provides real-time trip updates, vehicle positions, and alerts.
Ecosystem and tooling: A large community supports GTFS with open documentation and many libraries/validators (e.g. GTFS-kit in Python), so developers can leverage existing tools rather than build custom parsers
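As an example of how easily the plain CSV files can be ingested, an unzipped feed can be queried directly with DuckDB's CSV reader. This is a minimal sketch; the file path is just an example:
-- Peek at the first few stops of an unzipped GTFS feed (example path)
SELECT stop_id, stop_name, stop_lat, stop_lon
FROM read_csv('stops.txt', delim = ',', header = true)
LIMIT 5;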
Creating a DuckDB database for GTFS Schedule data
As the GTFS Schedule standard uses ZIP-compressed CSV files, it’s not possible to directly analyze these datasets out of the box with most tools.
Also, many GTFS datasets don’t include all of the dataset files, as some of them are optional; likewise, not all fields within those files are required:
File Name | Presence | Description |
agency.txt | Required | Transit agencies with service represented in this dataset. |
stops.txt | Conditionally Required | Stops where vehicles pick up or drop off riders. Also defines stations and station entrances. Optional if demand-responsive zones are defined in locations.geojson, required otherwise. |
routes.txt | Required | Transit routes. A route is a group of trips that are displayed to riders as a single service. |
trips.txt | Required | Trips for each route. A trip is a sequence of two or more stops that occur during a specific time period. |
stop_times.txt | Required | Times that a vehicle arrives at and departs from stops for each trip. |
calendar.txt | Conditionally Required | Service dates specified using a weekly schedule with start and end dates. Required unless all dates of service are defined in calendar_dates.txt, optional otherwise. |
calendar_dates.txt | Conditionally Required | Exceptions for the services defined in calendar.txt. Required if calendar.txt is omitted (must then contain all dates of service), optional otherwise. |
fare_attributes.txt | Optional | Fare information for a transit agency’s routes. |
fare_rules.txt | Optional | Rules to apply fares for itineraries. |
timeframes.txt | Optional | Date and time periods to use in fare rules for fares that depend on date/time factors. Forbidden if network_id exists in routes.txt, optional otherwise. |
shapes.txt | Optional | Rules for mapping vehicle travel paths (route alignments). |
frequencies.txt | Optional | Headway (time between trips) for headway-based service or compressed fixed-schedule service. |
transfers.txt | Optional | Rules for making connections at transfer points between routes. |
pathways.txt | Optional | Pathways linking together locations within stations. |
levels.txt | Conditionally Required | Levels within stations. Required when describing pathways with elevators (pathway_mode=5), optional otherwise. |
location_groups.txt | Optional | A group of stops that together indicate locations where a rider may request pickup or drop off. |
location_group_stops.txt | Optional | Rules to assign stops to location groups. |
locations.geojson | Optional | Zones for rider pickup/drop-off requests by on-demand services, represented as GeoJSON polygons. |
booking_rules.txt | Optional | Booking information for rider-requested services. |
translations.txt | Optional | Translations of customer-facing dataset values. |
feed_info.txt | Conditionally Required | Dataset metadata, including publisher, version, and expiration information. Required if translations.txt is provided, recommended otherwise. |
attributions.txt | Optional | Dataset attributions. |
Taking that into account, the plan for using GTFS Schedule datasets with DuckDB is: create a DuckDB database that contains all tables with all columns. The DDL for the tables is derived from the official standard’s website. We want to use the foreign key relationships outlined in the standard, and therefore need to make sure that the tables are created in the right order.
The database creation script can be found at queries/create_gtfs_database.sql.
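To give an impression of the schema, here is an abridged, illustrative excerpt (column lists shortened; the full script in the repository is authoritative). Note how referenced tables such as agency must be created before routes, and routes before trips:
-- Abridged, illustrative excerpt; see queries/create_gtfs_database.sql for the full DDL
CREATE TABLE agency (
  agency_id VARCHAR PRIMARY KEY,
  agency_name VARCHAR,
  agency_url VARCHAR,
  agency_timezone VARCHAR
);

CREATE TABLE routes (
  route_id VARCHAR PRIMARY KEY,
  agency_id VARCHAR REFERENCES agency (agency_id),
  route_short_name VARCHAR,
  route_long_name VARCHAR,
  route_type INTEGER
);

CREATE TABLE trips (
  trip_id VARCHAR PRIMARY KEY,
  route_id VARCHAR REFERENCES routes (route_id),
  service_id VARCHAR
);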
This results in a database whose Entity Relationship Diagram (ERD) mirrors the foreign key relationships outlined in the standard.
Creating a DuckDB GTFS database from an example dataset
To load a sample dataset, we need to choose one of the many available data sources. For German data, the gtfs.de website provides some interesting datasets.
There’s one dataset that contains the full train and local traffic for Germany: de_full, and that’s the one we’ll use. It contains 1.6 million trips, 663 thousand stops, and more than 32 million stop times.
The accompanying GitHub repository tobilg/duckdb-gtfs has the necessary scripts available. You’ll need a locally installed DuckDB CLI to be able to follow the next steps on your machine.
Preparations
You need to clone the repository from GitHub first:
git clone git@github.com:tobilg/duckdb-gtfs.git && cd duckdb-gtfs
Downloading the dataset
To download the example dataset (218MB zipped, around 1.4GB unzipped), you can run the download script:
scripts/providers/gtfs-de/full/download_data.sh
Loading the dataset in a new GTFS database
As outlined above, not all dataset providers use the full set of tables in their datasets, so we’ll have to write a custom data loading script for each provider. In our case, the SQL for the example dataset looks like this (see load_data.sql):
INSERT INTO agency (agency_id,agency_name,agency_url,agency_timezone,agency_lang)
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/agency.txt', delim = ',', header = true);
INSERT INTO attributions (attribution_id,organization_name,is_producer,attribution_url,attribution_email)
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/attributions.txt', delim = ',', header = true);
INSERT INTO calendar (monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date,service_id)
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/calendar.txt', delim = ',', header = true, dateformat = '%Y%m%d');
INSERT INTO calendar_dates (service_id,exception_type,"date")
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/calendar_dates.txt', delim = ',', header = true, columns = {
'service_id': 'VARCHAR',
'exception_type': 'INTEGER',
'date': 'DATE'
}, dateformat = '%Y%m%d', ignore_errors = true);
INSERT INTO feed_info (feed_publisher_name,feed_publisher_url,feed_lang,feed_version,feed_contact_email,feed_contact_url)
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/feed_info.txt', delim = ',', header = true, dateformat = '%Y%m%d');
INSERT INTO routes (route_long_name,route_short_name,agency_id,route_type,route_id)
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/routes.txt', delim = ',', header = true);
INSERT INTO trips (route_id,service_id,trip_id)
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/trips.txt', delim = ',', header = true, ignore_errors = true);
INSERT INTO stops (stop_name,parent_station,stop_id,stop_lat,stop_lon,location_type)
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/stops.txt', delim = ',', header = true);
INSERT INTO stop_times (trip_id,arrival_time,departure_time,stop_id,stop_sequence,pickup_type,drop_off_type)
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/stop_times.txt', delim = ',', header = true, ignore_errors = true);
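If you need to figure out which options a particular provider’s file needs (column order, types, date format), running DESCRIBE on a plain read_csv call is a quick way to see what DuckDB infers before writing the INSERT statement. This is an illustrative query, not part of the repository scripts:
-- Inspect the inferred schema of a source file before loading it
DESCRIBE SELECT * FROM read_csv('source-data/providers/gtfs-de/full/calendar_dates.txt', delim = ',', header = true);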
Now, run the database creation script (this will take some time depending on your machine):
scripts/providers/gtfs-de/full/create_database.sh
The resulting DuckDB database file can be found at exported-data/providers/gtfs-de/full/data.duckdb.
Querying the new GTFS database
To query the new GTFS database, you can just use the newly created database file with your local DuckDB CLI:
duckdb exported-data/providers/gtfs-de/full/data.duckdb
v1.2.2 7c039464e4
Enter ".help" for usage hints.
D show tables;
┌──────────────────────┐
│ name │
│ varchar │
├──────────────────────┤
│ agency │
│ areas │
│ attributions │
│ booking_rules │
│ calendar │
│ calendar_dates │
│ fare_attributes │
│ fare_leg_join_rules │
│ fare_leg_rules │
│ fare_media │
│ fare_products │
│ fare_rules │
│ fare_transfer_rules │
│ feed_info │
│ frequencies │
│ levels │
│ location_group_stops │
│ location_groups │
│ networks │
│ pathways │
│ rider_categories │
│ route_networks │
│ routes │
│ shapes │
│ stop_areas │
│ stop_times │
│ stops │
│ timeframes │
│ transfers │
│ translations │
│ trips │
├──────────────────────┤
│ 31 rows │
└──────────────────────┘
D SELECT count(*) FROM stop_times;
┌─────────────────┐
│ count_star() │
│ int64 │
├─────────────────┤
│ 32228711 │
│ (32.23 million) │
└─────────────────┘
D SELECT count(*) FROM stops;
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 678388 │
└──────────────┘
D SELECT count(*) FROM trips;
┌────────────────┐
│ count_star() │
│ int64 │
├────────────────┤
│ 1630671 │
│ (1.63 million) │
└────────────────┘
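Beyond simple row counts, you can of course run analytical queries directly against the database, for example to find the stop names with the most scheduled stop events. This is an illustrative query using only the columns loaded above:
SELECT s.stop_name, count(*) AS stop_events
FROM stop_times st
JOIN stops s ON st.stop_id = s.stop_id
GROUP BY s.stop_name
ORDER BY stop_events DESC
LIMIT 10;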
Exporting the data in Parquet format
If you want to export the data in an optimized format like Parquet, this is very straightforward with DuckDB (as demonstrated by the export_data.sh script):
duckdb exported-data/providers/gtfs-de/full/data.duckdb -c "EXPORT DATABASE 'exported-data/providers/gtfs-de/full' (FORMAT parquet);"
This will export each table as a Parquet file in the exported-data/providers/gtfs-de/full/ directory.
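The exported files can then be queried directly, without attaching the DuckDB database file (assuming the export produced one Parquet file per table, named after the table):
SELECT count(*) FROM 'exported-data/providers/gtfs-de/full/stop_times.parquet';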
Summary
Using DuckDB for storing and analyzing GTFS Schedule data is a great choice. For example, DuckDB’s CSV reader can handle many different error scenarios, which are quite common with real-world data providers.
We successfully created a default GTFS Schedule database that can be loaded with data from different providers via custom loading scripts. These take some effort to create, but since the result is a standardized database schema, doing so can create real value for data-driven organizations.
Being able to export the data to storage-optimized formats out-of-the-box is also a great benefit of DuckDB. We were able to shrink the 1.4GB unzipped data to 118MB in compressed Parquet files.
If you want to query the data live and in your browser, you can use SQL Workbench.