Geospatial Queries in Apache Druid: A Ride-Sharing Example

Apache Druid is a high-performance, real-time analytics database designed for large-scale data processing. Among its powerful features is the ability to handle geospatial data, enabling fast and efficient queries over latitude and longitude coordinates. In this post, we’ll explore a practical scenario where geospatial queries in Druid shine: real-time analytics for a ride-sharing platform. We’ll walk through the use case, provide a sample query, and highlight why Druid is a great fit for geospatial workloads.
The Scenario: Real-Time Ride-Sharing Analytics
Imagine a ride-sharing company that operates in a bustling city like New York. Every day, the platform processes millions of events, including driver location updates and passenger ride requests. A critical operational need is to match passengers with the nearest available driver in real time. For example, when a passenger requests a ride, the system must identify all drivers within a 10-kilometer radius of the pickup location and assign the closest one—all within milliseconds.
This is where geospatial data comes into play. Driver locations are continuously ingested as latitude and longitude coordinates, and the system needs to perform proximity searches to find drivers near a passenger’s coordinates. Apache Druid’s geospatial query capabilities make this possible at scale.
Setting Up the Data in Druid
Let’s assume the ride-sharing company stores driver location data in a Druid datasource called driver_locations. The datasource has the following columns:
driver_id: Unique identifier for each driver.
latitude: The driver’s latitude coordinate.
longitude: The driver’s longitude coordinate.
__time: The timestamp of the location update.
The data is ingested in real time as drivers move, ensuring the system always has fresh location information. Druid’s columnar storage and indexing optimize this dataset for fast queries, even with millions of rows.
Ingesting Geospatial Data in Druid
To enable geospatial queries, the ride-sharing company must first ingest driver location data into the driver_locations datasource. For this example, we’ll use an inline ingestion spec to load a small dataset directly, which is useful for testing or demonstration purposes.
Here’s how the ingestion process works:
Data Source: Driver location updates are provided as CSV data within the ingestion spec. Each row contains driver_id, latitude, longitude, and timestamp. For example, the inline sample dataset below includes a few driver locations in New York City.
Ingestion Spec: The company defines a Druid ingestion spec with an inline data source in CSV format. The spec maps the CSV columns to the driver_locations datasource, with timestamp as the __time column, latitude and longitude as double dimensions, and a spatial dimension for geospatial queries. Here’s the ingestion spec:
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "driver_locations",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": {
        "dimensions": [
          { "type": "string", "name": "driver_id", "multiValueHandling": "SORTED_ARRAY", "createBitmapIndex": true },
          { "type": "double", "name": "latitude", "multiValueHandling": "SORTED_ARRAY", "createBitmapIndex": false },
          { "type": "double", "name": "longitude", "multiValueHandling": "SORTED_ARRAY", "createBitmapIndex": false },
          { "type": "spatial", "name": "coordinates", "dims": ["latitude", "longitude"], "multiValueHandling": "SORTED_ARRAY", "createBitmapIndex": true }
        ],
        "dimensionExclusions": ["__time", "timestamp"],
        "includeAllDimensions": false,
        "useSchemaDiscovery": false
      },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": { "type": "none" },
        "rollup": false
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "inline",
        "data": "driver_id,latitude,longitude,timestamp\nD123,40.7128,-74.0060,2025-05-13T17:41:00Z\nD124,40.7210,-74.0050,2025-05-13T17:41:05Z\nD125,40.7050,-74.0080,2025-05-13T17:41:10Z"
      },
      "inputFormat": { "type": "csv", "findColumnsFromHeader": true },
      "appendToExisting": false,
      "dropExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 5000000,
      "partitionsSpec": { "type": "dynamic", "maxRowsPerSegment": 5000000 },
      "indexSpec": {
        "bitmap": { "type": "roaring" },
        "dimensionCompression": "lz4",
        "stringDictionaryEncoding": { "type": "utf8" },
        "metricCompression": "lz4",
        "longEncoding": "longs"
      }
    }
  }
}
Spatial Indexing: The dimensionsSpec includes a spatial dimension named coordinates that combines latitude and longitude, enabling efficient geospatial queries.
Batch Processing: The inline spec processes the provided CSV data as a one-time batch using parallel indexing, ideal for testing geospatial queries with a small dataset before scaling to real-time ingestion.
This setup allows the ride-sharing platform to ingest a sample dataset and prepare it for geospatial querying, leveraging Druid’s spatial indexing for performance.
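As a rough illustration of how such a spec might be submitted, here is a minimal Python sketch. It assumes a Druid Router or Overlord reachable at http://localhost:8888 (an assumption, adjust for your cluster) and uses Druid's task endpoint /druid/indexer/v1/task. The helper names build_inline_csv and build_task_request are hypothetical and not part of any Druid client library.

```python
import json
import urllib.request

# Assumption: a Druid Router/Overlord is listening here; change as needed.
DRUID_BASE_URL = "http://localhost:8888"


def build_inline_csv(rows):
    """Render (driver_id, latitude, longitude, timestamp) tuples as inline CSV text."""
    header = "driver_id,latitude,longitude,timestamp"
    lines = [f"{driver_id},{lat:.4f},{lon:.4f},{ts}" for driver_id, lat, lon, ts in rows]
    return "\n".join([header] + lines)


def build_task_request(spec, base_url=DRUID_BASE_URL):
    """Build (but do not send) a POST request that submits an ingestion task."""
    return urllib.request.Request(
        url=f"{base_url}/druid/indexer/v1/task",
        data=json.dumps(spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage sketch: place the CSV text in spec["spec"]["ioConfig"]["inputSource"]["data"],
# then send the request with urllib.request.urlopen(build_task_request(spec)).
```

Building the request separately from sending it keeps the sketch easy to inspect before anything touches a live cluster.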
Querying Geospatial Data in Druid
To match a passenger with nearby drivers, the company needs to query the driver_locations datasource for drivers within a 10-kilometer radius of the passenger’s pickup point. For this example, let’s say the passenger is at coordinates (40.7128, -74.0060), roughly the location of downtown Manhattan.
Druid supports geospatial queries through its native query language, using a scan query with a spatial filter. Here’s the query to find drivers near the passenger:
{
  "queryType": "scan",
  "dataSource": {
    "type": "table",
    "name": "driver_locations"
  },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "2024-05-13T17:31:00.000Z/2026-05-13T17:42:00.000Z"
    ]
  },
  "resultFormat": "compactedList",
  "limit": 1000,
  "columns": [
    "driver_id",
    "coordinates"
  ],
  "granularity": {
    "type": "all"
  },
  "filter": {
    "type": "spatial",
    "dimension": "coordinates",
    "bound": {
      "type": "radius",
      "coords": [
        40.6678,
        -74.051
      ],
      "radius": 10,
      "radiusUnit": "kilometers"
    }
  }
}
Let’s break down the query:
Time Filter: The intervals field restricts the query to a time window. The sample query uses a wide window (2024-05-13T17:31 to 2026-05-13T17:42), which comfortably covers the inline demo rows. For real-time matching, narrow it to the last 10 minutes (e.g., from 17:31 to 17:41 on May 13, 2025) so that only recent driver locations are considered.
Geospatial Filter: The filter applies a spatial condition on the coordinates dimension, using a radius bound. The coords (40.6678, -74.0510), radius 10, and radiusUnit “kilometers” define a circle with a 10-kilometer radius around that center point. Note that this center lies roughly 6 kilometers southwest of the passenger’s location (40.7128, -74.0060); to search around the pickup point itself, set coords to the passenger’s coordinates.
Columns: The query returns driver_id and coordinates, which is enough to identify and locate matching drivers. Add __time, latitude, or longitude to the columns list if you also need them in the results.
Output: The scan query with resultFormat: "compactedList" returns up to 1000 results (the limit), which the system can use to select the closest driver.
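The query fields described above can also be generated in code for each ride request. The following is a minimal Python sketch, assuming the datasource and dimension names from this post; the function name build_nearby_drivers_query and its parameters are hypothetical, and the caller is assumed to pass a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone


def build_nearby_drivers_query(lat, lon, now, radius_km=10.0,
                               lookback_minutes=10, limit=1000):
    """Return a native scan query (as a dict) with a radius spatial filter.

    `now` must be timezone-aware; it is converted to UTC for the interval.
    """
    end = now.astimezone(timezone.utc)
    start = end - timedelta(minutes=lookback_minutes)
    fmt = "%Y-%m-%dT%H:%M:%S.000Z"
    return {
        "queryType": "scan",
        "dataSource": {"type": "table", "name": "driver_locations"},
        "intervals": {
            "type": "intervals",
            "intervals": [f"{start.strftime(fmt)}/{end.strftime(fmt)}"],
        },
        "resultFormat": "compactedList",
        "limit": limit,
        "columns": ["driver_id", "coordinates"],
        "filter": {
            "type": "spatial",
            "dimension": "coordinates",
            "bound": {
                "type": "radius",
                # Center the circle on the passenger's pickup point.
                "coords": [lat, lon],
                "radius": radius_km,
                "radiusUnit": "kilometers",
            },
        },
    }


# Usage sketch: a 10-minute window ending at 17:41 UTC on May 13, 2025.
query = build_nearby_drivers_query(
    40.7128, -74.0060, datetime(2025, 5, 13, 17, 41, tzinfo=timezone.utc)
)
```

Deriving the interval from the current time keeps the window at the last 10 minutes without hand-editing timestamps.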
Druid’s spatial indexes on the coordinates dimension ensure these proximity searches are blazing fast, even over large datasets.
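Once the candidate drivers come back, the application still has to pick the closest one. The sketch below shows one way to do that in Python with the haversine great-circle distance. It assumes the scan results have already been parsed into (driver_id, latitude, longitude) tuples; that parsing step, along with the names haversine_km and closest_driver, is hypothetical and not shown in the original post.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest_driver(pickup, drivers):
    """Return (driver_id, distance_km) for the nearest driver, or None if empty.

    `pickup` is a (lat, lon) pair; `drivers` is an iterable of
    (driver_id, lat, lon) tuples.
    """
    best = None
    for driver_id, lat, lon in drivers:
        distance = haversine_km(pickup[0], pickup[1], lat, lon)
        if best is None or distance < best[1]:
            best = (driver_id, distance)
    return best


# Usage sketch with the three sample drivers from the inline dataset.
sample_drivers = [
    ("D123", 40.7128, -74.0060),
    ("D124", 40.7210, -74.0050),
    ("D125", 40.7050, -74.0080),
]
nearest = closest_driver((40.7128, -74.0060), sample_drivers)
```

With at most 1000 candidates per query, a linear scan like this is cheap compared with the query itself.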
Why Apache Druid for Geospatial Queries?
Geospatial queries in a ride-sharing platform demand low latency and high throughput, and Druid delivers on both fronts. Here’s why it’s a great choice:
Sub-Second Query Performance: Druid’s columnar storage and spatial indexing optimize geospatial queries, enabling sub-second response times for proximity searches.
Real-Time Ingestion: Druid supports streaming ingestion, so driver locations are available for querying as soon as they’re reported.
Scalability: Whether handling thousands or millions of location updates per second, Druid scales horizontally to meet demand.
Flexible Query Interface: Druid’s native query language, with spatial filters, supports complex geospatial queries, complementing its SQL capabilities.
Conclusion
Apache Druid’s geospatial query capabilities empower businesses to unlock insights from location-based data at scale. In the ride-sharing example, we saw how Druid can ingest location data and identify nearby drivers in real time using a native scan query with a spatial filter, leveraging spatial indexes for blazing-fast performance. Whether you’re building a ride-sharing platform or analyzing location data, Druid’s ability to handle geospatial queries makes it a powerful tool for real-time analytics.
Have you tried geospatial queries in Druid? Share your use cases or questions in the comments below!
Written by

Nikhil Rao
Los Angeles