Handling GTFS data with DuckDB

Tobias Müller
8 min read

The General Transit Feed Specification (GTFS) is a standardized, open data format for public transportation schedules and geographic information. In practice, a GTFS feed is simply a ZIP archive of text (CSV) tables - such as stops.txt, routes.txt, and trips.txt - that together describe an agency’s routes, stops, schedules, and fare rules.

By following this common schema, transit agencies can publish their data so that any GTFS-compatible application (trip planners, mapping tools, or analytics pipelines) can consume it easily. Because GTFS is an open standard, it has become the go-to format for thousands of transit agencies worldwide.

This broad adoption means developers can integrate multiple feeds into a single pipeline without writing custom parsers. For example, many journey-planning apps and mapping services rely on GTFS feeds to deliver route and schedule information to users.

The specification even separates static and dynamic data: the Schedule portion provides planned timetables (in CSV), while the Realtime extension (using Protocol Buffers) adds live trip updates, vehicle positions, and service alerts.

The main benefits of GTFS are:

  • Consistent schema: It defines a common set of files (stops, routes, trips, etc.), so users and developers only need to learn one format for all agencies

  • Interoperability: Any GTFS-compliant tool can read feeds from any transit provider, enabling seamless integration across different platforms and systems

  • Plain-text and open: The format is based on simple CSV files, which are easy to generate, ingest (e.g. via SQL or pandas; see the example after this list), and maintain without proprietary software

  • Static + real-time: GTFS distinguishes fixed schedules from live updates. The static GTFS-Schedule files cover routes and timetables, while GTFS-Realtime (Protocol Buffers) provides real-time trip updates, vehicle positions, and alerts.

  • Ecosystem and tooling: A large community supports GTFS with open documentation and many libraries/validators (e.g. GTFS-kit in Python), so developers can leverage existing tools rather than build custom parsers
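As an example of that low barrier to entry: once a feed is unzipped, any of its files can be queried ad hoc with DuckDB's read_csv. The file path here is just an illustration:

SELECT stop_id, stop_name, stop_lat, stop_lon
FROM read_csv('stops.txt', delim = ',', header = true)
LIMIT 5;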

Creating a DuckDB database for GTFS Schedule data

As the GTFS Schedule standard uses ZIP-compressed CSV files, most tools can't analyze these datasets directly out of the box.

Also, many GTFS datasets don't ship all files, since several of them are optional; likewise, not all fields within each file are required:

| File Name | Presence | Description |
| --- | --- | --- |
| agency.txt | Required | Transit agencies with service represented in this dataset. |
| stops.txt | Conditionally Required | Stops where vehicles pick up or drop off riders. Also defines stations and station entrances. Optional if demand-responsive zones are defined in locations.geojson, required otherwise. |
| routes.txt | Required | Transit routes. A route is a group of trips that are displayed to riders as a single service. |
| trips.txt | Required | Trips for each route. A trip is a sequence of two or more stops that occur during a specific time period. |
| stop_times.txt | Required | Times that a vehicle arrives at and departs from stops for each trip. |
| calendar.txt | Conditionally Required | Service dates specified using a weekly schedule with start and end dates. Required unless all dates of service are defined in calendar_dates.txt, optional otherwise. |
| calendar_dates.txt | Conditionally Required | Exceptions for the services defined in calendar.txt. Required if calendar.txt is omitted (must then contain all dates of service), optional otherwise. |
| fare_attributes.txt | Optional | Fare information for a transit agency's routes. |
| fare_rules.txt | Optional | Rules to apply fares for itineraries. |
| timeframes.txt | Optional | Date and time periods to use in fare rules for fares that depend on date/time factors. Forbidden if network_id exists in routes.txt, optional otherwise. |
| shapes.txt | Optional | Rules for mapping vehicle travel paths (route alignments). |
| frequencies.txt | Optional | Headway (time between trips) for headway-based service or compressed fixed-schedule service. |
| transfers.txt | Optional | Rules for making connections at transfer points between routes. |
| pathways.txt | Optional | Pathways linking together locations within stations. |
| levels.txt | Conditionally Required | Levels within stations. Required when describing pathways with elevators (pathway_mode=5), optional otherwise. |
| location_groups.txt | Optional | A group of stops that together indicate locations where a rider may request pickup or drop off. |
| location_group_stops.txt | Optional | Rules to assign stops to location groups. |
| locations.geojson | Optional | Zones for rider pickup/drop-off requests by on-demand services, represented as GeoJSON polygons. |
| booking_rules.txt | Optional | Booking information for rider-requested services. |
| translations.txt | Optional | Translations of customer-facing dataset values. |
| feed_info.txt | Conditionally Required | Dataset metadata, including publisher, version, and expiration information. Required if translations.txt is provided, recommended otherwise. |
| attributions.txt | Optional | Dataset attributions. |

Taking that into account, the plan for using GTFS Schedule datasets with DuckDB is: create a DuckDB database that contains all tables with all columns. The DDL for the tables is derived from the official standard's website. We want to use the foreign key relationships outlined in the standard, and thus need to make sure that the tables are created in the right order (parent tables before the tables that reference them).

The database creation script can be found at queries/create_gtfs_database.sql.
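For illustration, here is an abridged sketch of the DDL style used there; the column lists are shortened, and the full definitions for all tables live in the script itself:

-- Parent tables are created first, so that child tables can reference them
CREATE TABLE agency (
    agency_id VARCHAR PRIMARY KEY,
    agency_name VARCHAR,
    agency_url VARCHAR,
    agency_timezone VARCHAR,
    agency_lang VARCHAR
);

CREATE TABLE routes (
    route_id VARCHAR PRIMARY KEY,
    agency_id VARCHAR,
    route_short_name VARCHAR,
    route_long_name VARCHAR,
    route_type INTEGER,
    FOREIGN KEY (agency_id) REFERENCES agency (agency_id)
);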

This results in a database that has the following Entity Relationship Diagram (ERD):

Creating a DuckDB GTFS database from an example dataset

To load a sample dataset, we need to choose one of the many available data sources. For German data, the gtfs.de website provides some interesting datasets.

There’s one dataset that contains the full train and local traffic for Germany: de_full, and that’s the one we’ll use. It contains 1.6 million trips, 663 thousand stops, and more than 32 million stop times.

The accompanying GitHub repository tobilg/duckdb-gtfs has the necessary scripts available. You’ll need a locally installed DuckDB CLI to be able to follow the next steps on your machine.

Preparations

You need to clone the repository from GitHub first:

git clone git@github.com:tobilg/duckdb-gtfs.git && cd duckdb-gtfs

Downloading the dataset

To download the example dataset (218MB zipped, around 1.4GB unzipped), you can run the download script:

scripts/providers/gtfs-de/full/download_data.sh

Loading the dataset in a new GTFS database

As outlined above, not all dataset providers use the full set of tables in their datasets, so we'll have to write a custom data loading script for each provider. In our case, the SQL for the example dataset looks like this (see load_data.sql):

INSERT INTO agency (agency_id,agency_name,agency_url,agency_timezone,agency_lang) 
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/agency.txt', delim = ',', header = true);

INSERT INTO attributions (attribution_id,organization_name,is_producer,attribution_url,attribution_email) 
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/attributions.txt', delim = ',', header = true);

INSERT INTO calendar (monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date,service_id) 
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/calendar.txt', delim = ',', header = true, dateformat = '%Y%m%d');

INSERT INTO calendar_dates (service_id,exception_type,"date") 
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/calendar_dates.txt', delim = ',', header = true, columns = {
  'service_id': 'VARCHAR',
  'exception_type': 'INTEGER',
  'date': 'DATE'
}, dateformat = '%Y%m%d', ignore_errors = true);

INSERT INTO feed_info (feed_publisher_name,feed_publisher_url,feed_lang,feed_version,feed_contact_email,feed_contact_url) 
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/feed_info.txt', delim = ',', header = true, dateformat = '%Y%m%d');

INSERT INTO routes (route_long_name,route_short_name,agency_id,route_type,route_id) 
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/routes.txt', delim = ',', header = true);

INSERT INTO trips (route_id,service_id,trip_id) 
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/trips.txt', delim = ',', header = true, ignore_errors = true);

INSERT INTO stops (stop_name,parent_station,stop_id,stop_lat,stop_lon,location_type) 
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/stops.txt', delim = ',', header = true);

INSERT INTO stop_times (trip_id,arrival_time,departure_time,stop_id,stop_sequence,pickup_type,drop_off_type) 
SELECT * FROM read_csv('source-data/providers/gtfs-de/full/stop_times.txt', delim = ',', header = true, ignore_errors = true);
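A note on the ignore_errors flags above: they make DuckDB silently skip unparsable rows, which is pragmatic for messy feeds but hides data quality issues. If you want to see what was dropped, newer DuckDB versions can store rejected rows in temporary tables instead; a sketch (see DuckDB's CSV error handling documentation for details):

-- Track rejected rows instead of dropping them silently
SELECT count(*) FROM read_csv('source-data/providers/gtfs-de/full/stop_times.txt', delim = ',', header = true, store_rejects = true);

-- Inspect the lines that failed to parse, including the error message
SELECT * FROM reject_errors;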

Now, run the database creation script (this will take some time depending on your machine):

scripts/providers/gtfs-de/full/create_database.sh
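Under the hood, the script essentially pipes the DDL and load scripts shown above into the DuckDB CLI against a new database file. A hypothetical equivalent (exact paths may differ; see the repository):

duckdb exported-data/providers/gtfs-de/full/data.duckdb < queries/create_gtfs_database.sql
duckdb exported-data/providers/gtfs-de/full/data.duckdb < scripts/providers/gtfs-de/full/load_data.sql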

The resulting DuckDB database file can be found at exported-data/providers/gtfs-de/full/data.duckdb.

Querying the new GTFS database

To query the new GTFS database, you can just use the newly created database file with your local DuckDB CLI:

duckdb exported-data/providers/gtfs-de/full/data.duckdb
v1.2.2 7c039464e4
Enter ".help" for usage hints.
D show tables;
┌──────────────────────┐
│         name         │
│       varchar        │
├──────────────────────┤
│ agency               │
│ areas                │
│ attributions         │
│ booking_rules        │
│ calendar             │
│ calendar_dates       │
│ fare_attributes      │
│ fare_leg_join_rules  │
│ fare_leg_rules       │
│ fare_media           │
│ fare_products        │
│ fare_rules           │
│ fare_transfer_rules  │
│ feed_info            │
│ frequencies          │
│ levels               │
│ location_group_stops │
│ location_groups      │
│ networks             │
│ pathways             │
│ rider_categories     │
│ route_networks       │
│ routes               │
│ shapes               │
│ stop_areas           │
│ stop_times           │
│ stops                │
│ timeframes           │
│ transfers            │
│ translations         │
│ trips                │
├──────────────────────┤
│       31 rows        │
└──────────────────────┘
D SELECT count(*) FROM stop_times;
┌─────────────────┐
│  count_star()   │
│      int64      │
├─────────────────┤
│    32228711     │
│ (32.23 million) │
└─────────────────┘
D SELECT count(*) FROM stops;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    678388    │
└──────────────┘
D SELECT count(*) FROM trips;
┌────────────────┐
│  count_star()  │
│     int64      │
├────────────────┤
│    1630671     │
│ (1.63 million) │
└────────────────┘
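Counts are a start, but the real value is in joins across the schema. As an illustrative example (column names follow the load script above), the busiest stops by number of scheduled stop events:

SELECT s.stop_name, count(*) AS stop_events
FROM stop_times st
JOIN stops s ON s.stop_id = st.stop_id
GROUP BY s.stop_name
ORDER BY stop_events DESC
LIMIT 10;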

Export the data in Parquet format

If you want to export the data in an optimized format like Parquet, this is very straightforward with DuckDB, as demonstrated by the export_data.sh script:

duckdb exported-data/providers/gtfs-de/full/data.duckdb -c "EXPORT DATABASE 'exported-data/providers/gtfs-de/full' (FORMAT parquet);"

This will export each table as a Parquet file in the exported-data/providers/gtfs-de/full/ directory.
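The exported Parquet files can then be queried directly, without the original database file; for example (assuming the default export layout of one Parquet file per table):

duckdb -c "SELECT count(*) FROM 'exported-data/providers/gtfs-de/full/stop_times.parquet';"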

Summary

Using DuckDB for storing and analyzing GTFS Schedule data seems like a great choice. For example, DuckDB's CSV reader can handle many different error scenarios, which are quite common in real-world provider data.

We successfully created a default GTFS Schedule database that can be loaded with data from different providers via custom loading scripts. These scripts take some effort to create, but because the result is a standardized database schema, that effort can pay off for data-driven organizations.

Being able to export the data to storage-optimized formats out-of-the-box is also a great benefit of DuckDB. We were able to shrink the 1.4GB unzipped data to 118MB in compressed Parquet files.

If you want to query the data live and in your browser, you can use SQL Workbench.
