Handling GTFS data with DuckDB

Tobias Müller
8 min read

The General Transit Feed Specification (GTFS) is a standardized, open data format for public transportation schedules and geographic information. In practice, a GTFS feed is simply a ZIP archive of text (CSV) tables - such as stops.txt, routes.txt, and trips.txt - that together describe an agency’s routes, stops, schedules, and fare rules.

By following this common schema, transit agencies can publish their data so that any GTFS-compatible application (trip planners, mapping tools, or analytics pipelines) can consume it easily. Because GTFS is an open standard, it has become the go-to format for thousands of transit agencies worldwide.

This broad adoption means developers can integrate multiple feeds into a single pipeline without writing custom parsers. For example, many journey-planning apps and mapping services rely on GTFS feeds to deliver route and schedule information to users.

The specification even separates static and dynamic data: the Schedule portion provides planned timetables (in CSV), while the Realtime extension (using Protocol Buffers) adds live trip updates, vehicle positions, and service alerts.

The main benefits of GTFS are:

  • Consistent schema: It defines a common set of files (stops, routes, trips, etc.), so users and developers only need to learn one format for all agencies

  • Interoperability: Any GTFS-compliant tool can read feeds from any transit provider, enabling seamless integration across different platforms and systems

  • Plain-text and open: The format is based on simple CSV files, which are easy to generate, ingest (e.g. via SQL or pandas; see the example after this list), and maintain without proprietary software

  • Static + real-time: GTFS distinguishes fixed schedules from live updates. The static GTFS-Schedule files cover routes and timetables, while GTFS-Realtime (Protocol Buffers) provides real-time trip updates, vehicle positions, and alerts.

  • Ecosystem and tooling: A large community supports GTFS with open documentation and many libraries/validators (e.g. GTFS-kit in Python), so developers can leverage existing tools rather than build custom parsers
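As an example of that low barrier to entry: once a feed is unzipped, any of its files can be queried ad hoc with DuckDB's read_csv. The file path here is just an illustration:

SELECT stop_id, stop_name, stop_lat, stop_lon
FROM read_csv('stops.txt', delim = ',', header = true)
LIMIT 5;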

Creating a DuckDB database for GTFS Schedule data

As the GTFS Schedule standard uses ZIP-compressed CSV files, most tools can't analyze these datasets directly out of the box.

Also, many GTFS datasets don't ship all files, since several of them are optional; likewise, not all fields within each file are required:

| File Name | Presence | Description |
| --- | --- | --- |
| agency.txt | Required | Transit agencies with service represented in this dataset. |
| stops.txt | Conditionally Required | Stops where vehicles pick up or drop off riders. Also defines stations and station entrances. Optional if demand-responsive zones are defined in locations.geojson, required otherwise. |
| routes.txt | Required | Transit routes. A route is a group of trips that are displayed to riders as a single service. |
| trips.txt | Required | Trips for each route. A trip is a sequence of two or more stops that occur during a specific time period. |
| stop_times.txt | Required | Times that a vehicle arrives at and departs from stops for each trip. |
| calendar.txt | Conditionally Required | Service dates specified using a weekly schedule with start and end dates. Required unless all dates of service are defined in calendar_dates.txt, optional otherwise. |
| calendar_dates.txt | Conditionally Required | Exceptions for the services defined in calendar.txt. Required if calendar.txt is omitted (must then contain all dates of service), optional otherwise. |
| fare_attributes.txt | Optional | Fare information for a transit agency's routes. |
| fare_rules.txt | Optional | Rules to apply fares for itineraries. |
| timeframes.txt | Optional | Date and time periods to use in fare rules for fares that depend on date/time factors. Forbidden if network_id exists in routes.txt, optional otherwise. |
| shapes.txt | Optional | Rules for mapping vehicle travel paths (route alignments). |
| frequencies.txt | Optional | Headway (time between trips) for headway-based service or compressed fixed-schedule service. |
| transfers.txt | Optional | Rules for making connections at transfer points between routes. |
| pathways.txt | Optional | Pathways linking together locations within stations. |
| levels.txt | Conditionally Required | Levels within stations. Required when describing pathways with elevators (pathway_mode=5), optional otherwise. |
| location_groups.txt | Optional | A group of stops that together indicate locations where a rider may request pickup or drop off. |
| location_group_stops.txt | Optional | Rules to assign stops to location groups. |
| locations.geojson | Optional | Zones for rider pickup/drop-off requests by on-demand services, represented as GeoJSON polygons. |
| booking_rules.txt | Optional | Booking information for rider-requested services. |
| translations.txt | Optional | Translations of customer-facing dataset values. |
| feed_info.txt | Conditionally Required | Dataset metadata, including publisher, version, and expiration information. Required if translations.txt is provided, recommended otherwise. |
| attributions.txt | Optional | Dataset attributions. |

Taking that into account, the plan for using GTFS Schedule datasets with DuckDB is: create a DuckDB database that contains all tables with all columns. The DDL for the tables is derived from the official standard's website. We want to use the foreign key relationships outlined in the standard, and thus need to make sure that the tables are created in the right order (parent tables before the tables that reference them).

The database creation script can be found at queries/create_gtfs_database.sql.
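For illustration, here is an abridged sketch of the DDL style used there; the column lists are shortened, and the full definitions for all tables live in the script itself:

-- Parent tables are created first, so that child tables can reference them
CREATE TABLE agency (
    agency_id VARCHAR PRIMARY KEY,
    agency_name VARCHAR,
    agency_url VARCHAR,
    agency_timezone VARCHAR,
    agency_lang VARCHAR
);

CREATE TABLE routes (
    route_id VARCHAR PRIMARY KEY,
    agency_id VARCHAR,
    route_short_name VARCHAR,
    route_long_name VARCHAR,
    route_type INTEGER,
    FOREIGN KEY (agency_id) REFERENCES agency (agency_id)
);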

This results in a database that has the following Entity Relationship Diagram (ERD):

Creating a DuckDB GTFS database from an example dataset

To load a sample dataset, we need to choose one of the many available data sources. For German data, the gtfs.de website provides some interesting datasets.

There’s one dataset that contains the full train and local traffic for Germany: de_full, and that’s the one we’ll use. It contains 1.6 million trips, 663 thousand stops, and more than 32 million stop times.

The accompanying GitHub repository tobilg/duckdb-gtfs has the necessary scripts available. You’ll need a locally installed DuckDB CLI to be able to follow the next steps on your machine.

Preparations

You need to clone the repository from GitHub first:

git clone git@github.com:tobilg/duckdb-gtfs.git && cd duckdb-gtfs

Downloading the dataset

To download the example dataset (218MB zipped, around 1.4GB unzipped), you can run the download script:

scripts/providers/gtfs-de/full/download_data.sh

Loading the dataset in a new GTFS database

As outlined above, not all dataset providers use the full set of tables in their datasets, so we'll have to write a custom data loading script for each provider. In our case, the SQL for the example dataset looks like this (see load_data.sql):

INSERT INTO agency (agency_id,agency_name,agency_url,agency_timezone,agency_lang) 
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/agency.txt', delim = ',', header = true);

INSERT INTO attributions (attribution_id,organization_name,is_producer,attribution_url,attribution_email) 
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/attributions.txt', delim = ',', header = true);

INSERT INTO calendar (monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date,service_id) 
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/calendar.txt', delim = ',', header = true, dateformat = '%Y%m%d');

INSERT INTO calendar_dates (service_id,exception_type,"date") 
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/calendar_dates.txt', delim = ',', header = true, columns = {
  'service_id': 'VARCHAR',
  'exception_type': 'INTEGER',
  'date': 'DATE'
}, dateformat = '%Y%m%d', ignore_errors = true);

INSERT INTO feed_info (feed_publisher_name,feed_publisher_url,feed_lang,feed_version,feed_contact_email,feed_contact_url) 
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/feed_info.txt', delim = ',', header = true, dateformat = '%Y%m%d');

INSERT INTO routes (route_long_name,route_short_name,agency_id,route_type,route_id) 
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/routes.txt', delim = ',', header = true);

INSERT INTO trips (route_id,service_id,trip_id) 
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/trips.txt', delim = ',', header = true, ignore_errors = true);

INSERT INTO stops (stop_name,parent_station,stop_id,stop_lat,stop_lon,location_type) 
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/stops.txt', delim = ',', header = true);

INSERT INTO stop_times (trip_id,arrival_time,departure_time,stop_id,stop_sequence,pickup_type,drop_off_type) 
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/stop_times.txt', delim = ',', header = true, ignore_errors = true);
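A note on the ignore_errors flags above: they make DuckDB silently skip unparsable rows, which is pragmatic for messy feeds but hides data quality issues. If you want to see what was dropped, newer DuckDB versions can store rejected rows in temporary tables instead; a sketch (see DuckDB's CSV error handling documentation for details):

-- Track rejected rows instead of dropping them silently
SELECT count(*) FROM read_csv('source-data/providers/gtfs-de/full/stop_times.txt', delim = ',', header = true, store_rejects = true);

-- Inspect the lines that failed to parse, including the error message
SELECT * FROM reject_errors;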

Now, run the database creation script (this will take some time depending on your machine):

scripts/providers/gtfs-de/full/create_database.sh
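Under the hood, the script essentially pipes the DDL and load scripts shown above into the DuckDB CLI against a new database file. A hypothetical equivalent (exact paths may differ; see the repository):

duckdb exported-data/providers/gtfs-de/full/data.duckdb < queries/create_gtfs_database.sql
duckdb exported-data/providers/gtfs-de/full/data.duckdb < scripts/providers/gtfs-de/full/load_data.sql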

The resulting DuckDB database file can be found at exported-data/providers/gtfs-de/full/data.duckdb.

Querying the new GTFS database

To query the new GTFS database, you can just use the newly created database file with your local DuckDB CLI:

duckdb exported-data/providers/gtfs-de/full/data.duckdb
v1.2.2 7c039464e4
Enter ".help" for usage hints.
D show tables;
┌──────────────────────┐
│         name         │
│       varchar        │
├──────────────────────┤
│ agency               │
│ areas                │
│ attributions         │
│ booking_rules        │
│ calendar             │
│ calendar_dates       │
│ fare_attributes      │
│ fare_leg_join_rules  │
│ fare_leg_rules       │
│ fare_media           │
│ fare_products        │
│ fare_rules           │
│ fare_transfer_rules  │
│ feed_info            │
│ frequencies          │
│ levels               │
│ location_group_stops │
│ location_groups      │
│ networks             │
│ pathways             │
│ rider_categories     │
│ route_networks       │
│ routes               │
│ shapes               │
│ stop_areas           │
│ stop_times           │
│ stops                │
│ timeframes           │
│ transfers            │
│ translations         │
│ trips                │
├──────────────────────┤
│       31 rows        │
└──────────────────────┘
D SELECT count(*) FROM stop_times;
┌─────────────────┐
│  count_star()   │
│      int64      │
├─────────────────┤
│    32228711     │
│ (32.23 million) │
└─────────────────┘
D SELECT count(*) FROM stops;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    678388    │
└──────────────┘
D SELECT count(*) FROM trips;
┌────────────────┐
│  count_star()  │
│     int64      │
├────────────────┤
│    1630671     │
│ (1.63 million) │
└────────────────┘
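Counts are a start, but the real value is in joins across the schema. As an illustrative example (column names follow the load script above), the busiest stops by number of scheduled stop events:

SELECT s.stop_name, count(*) AS stop_events
FROM stop_times st
JOIN stops s ON s.stop_id = st.stop_id
GROUP BY s.stop_name
ORDER BY stop_events DESC
LIMIT 10;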

Export the data in Parquet format

If you want to export the data in an optimized format like Parquet, this is very straightforward with DuckDB, as demonstrated by the export_data.sh script:

duckdb exported-data/providers/gtfs-de/full/data.duckdb -c "EXPORT DATABASE 'exported-data/providers/gtfs-de/full' (FORMAT parquet);"

This will export each table as a Parquet file in the exported-data/providers/gtfs-de/full/ directory.
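The exported Parquet files can then be queried directly, without the original database file; for example (assuming the default export layout of one Parquet file per table):

duckdb -c "SELECT count(*) FROM 'exported-data/providers/gtfs-de/full/stop_times.parquet';"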

Summary

Using DuckDB for storing and analyzing GTFS Schedule data seems like a great choice. For example, DuckDB's CSV reader can handle many different error scenarios, which are quite common in real-world provider data.

We successfully created a default GTFS Schedule database that can be loaded with data from different providers via custom loading scripts. These scripts take some effort to create, but because the result is a standardized database schema, that effort can pay off for data-driven organizations.

Being able to export the data to storage-optimized formats out-of-the-box is also a great benefit of DuckDB. We were able to shrink the 1.4GB unzipped data to 118MB in compressed Parquet files.

If you want to query the data live and in your browser, you can use SQL Workbench.
