Apache Druid is a high-performance, real-time analytics database designed for large-scale data processing. Among its powerful features is the ability to handle geospatial data, enabling fast and efficient queries over latitude and longitude coordinates. In this post, we’ll explore a practical scenario where geospatial queries in Druid shine: real-time analytics for a ride-sharing platform. We’ll walk through the use case, provide a sample query, and highlight why Druid is a great fit for geospatial workloads.

The Scenario: Real-Time Ride-Sharing Analytics

Imagine a ride-sharing company that operates in a bustling city like New York. Every day, the platform processes millions of events, including driver location updates and passenger ride requests. A critical operational need is to match passengers with the nearest available driver in real time. For example, when a passenger requests a ride, the system must identify all drivers within a 10-kilometer radius of the pickup location and assign the closest one, all within milliseconds.

This is where geospatial data comes into play. Driver locations are continuously ingested as latitude and longitude coordinates, and the system needs to perform proximity searches to find drivers near a passenger’s coordinates. Apache Druid’s geospatial query capabilities make this possible at scale.
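To make the matching step concrete, here is a minimal Python sketch of the proximity logic on its own, outside Druid. It uses the haversine formula for great-circle distance; the function names are hypothetical, and the driver coordinates are the ones from the sample dataset used later in this post.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_driver(pickup, drivers, radius_km=10.0):
    """Return (driver_id, distance_km) for the closest driver within radius_km, or None."""
    candidates = []
    for driver_id, (lat, lon) in drivers.items():
        distance = haversine_km(pickup[0], pickup[1], lat, lon)
        if distance <= radius_km:
            candidates.append((distance, driver_id))
    if not candidates:
        return None
    distance, driver_id = min(candidates)
    return driver_id, distance


# Driver positions from the sample dataset in this post.
drivers = {
    "D123": (40.7128, -74.0060),
    "D124": (40.7210, -74.0050),
    "D125": (40.7050, -74.0080),
}
print(nearest_driver((40.7128, -74.0060), drivers))
```

A brute-force scan like this is fine for a handful of drivers. With millions of location updates, the radius filtering is the part we want the database to do, which is where Druid’s spatial index comes in.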

Setting Up the Data in Druid

Let’s assume the ride-sharing company stores driver location data in a Druid datasource called driver_locations. The datasource has the following columns:

driver_id: Unique identifier for each driver.
latitude: The driver’s latitude coordinate.
longitude: The driver’s longitude coordinate.
__time: The timestamp of the location update.

The data is ingested in real time as drivers move, ensuring the system always has fresh location information. Druid’s columnar storage and indexing optimize this dataset for fast queries, even with millions of rows.

Ingesting Geospatial Data in Druid

To enable geospatial queries, the ride-sharing company must first ingest driver location data into the driver_locations datasource. For this example, we’ll use an inline ingestion spec to load a small dataset directly, which is useful for testing or demonstration purposes.

Here’s how the ingestion process works:

Data Source: Driver location updates are provided as CSV data within the ingestion spec. Each row contains driver_id, latitude, longitude, and timestamp. For example, the inline sample dataset below includes a few driver locations in New York City.

Ingestion Spec: The company defines a Druid ingestion spec with an inline data source in CSV format. The spec maps the CSV columns to the driver_locations datasource, with timestamp as the __time column, latitude and longitude as double dimensions, and a spatial dimension for geospatial queries. Here’s the ingestion spec:

  {
    "type": "index_parallel",
    "spec": {
      "dataSchema": {
        "dataSource": "driver_locations",
        "timestampSpec": {
          "column": "timestamp",
          "format": "iso"
        },
        "dimensionsSpec": {
          "dimensions": [
            {
              "type": "string",
              "name": "driver_id",
              "multiValueHandling": "SORTED_ARRAY",
              "createBitmapIndex": true
            },
            {
              "type": "double",
              "name": "latitude",
              "multiValueHandling": "SORTED_ARRAY",
              "createBitmapIndex": false
            },
            {
              "type": "double",
              "name": "longitude",
              "multiValueHandling": "SORTED_ARRAY",
              "createBitmapIndex": false
            },
            {
              "type": "spatial",
              "name": "coordinates",
              "dims": ["latitude", "longitude"],
              "multiValueHandling": "SORTED_ARRAY",
              "createBitmapIndex": true
            }
          ],
          "dimensionExclusions": ["__time", "timestamp"],
          "includeAllDimensions": false,
          "useSchemaDiscovery": false
        },
        "metricsSpec": [],
        "granularitySpec": {
          "type": "uniform",
          "segmentGranularity": "DAY",
          "queryGranularity": {
            "type": "none"
          },
          "rollup": false
        }
      },
      "ioConfig": {
        "type": "index_parallel",
        "inputSource": {
          "type": "inline",
          "data": "driver_id,latitude,longitude,timestamp\nD123,40.7128,-74.0060,2025-05-13T17:41:00Z\nD124,40.7210,-74.0050,2025-05-13T17:41:05Z\nD125,40.7050,-74.0080,2025-05-13T17:41:10Z"
        },
        "inputFormat": {
          "type": "csv",
          "findColumnsFromHeader": true
        },
        "appendToExisting": false,
        "dropExisting": false
      },
      "tuningConfig": {
        "type": "index_parallel",
        "maxRowsPerSegment": 5000000,
        "partitionsSpec": {
          "type": "dynamic",
          "maxRowsPerSegment": 5000000
        },
        "indexSpec": {
          "bitmap": {
            "type": "roaring"
          },
          "dimensionCompression": "lz4",
          "stringDictionaryEncoding": {
            "type": "utf8"
          },
          "metricCompression": "lz4",
          "longEncoding": "longs"
        }
      }
    }
  }

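Spatial Indexing: The dimensionsSpec includes a spatial dimension named coordinates that combines latitude and longitude, enabling efficient geospatial queries. The order of dims (latitude first, then longitude) is also the order in which query coordinates must be given.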
Batch Processing: The inline spec processes the provided CSV data as a one-time batch using parallel indexing, ideal for testing geospatial queries with a small dataset before scaling to real-time ingestion.

This setup allows the ride-sharing platform to ingest a sample dataset and prepare it for geospatial querying, leveraging Druid’s spatial indexing for performance.
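If you prefer to submit the spec from a script instead of the web console, the sketch below shows one way to do it in Python. It assumes a local quickstart cluster with the Router at http://localhost:8888 and posts to Druid’s task endpoint, /druid/indexer/v1/task. The helper names and the ingestion_spec.json file name are hypothetical.

```python
import json
import urllib.request


def build_inline_csv(rows):
    """Render driver rows as the CSV string used by the inline input source."""
    header = "driver_id,latitude,longitude,timestamp"
    lines = [
        f"{r['driver_id']},{r['latitude']},{r['longitude']},{r['timestamp']}"
        for r in rows
    ]
    return "\n".join([header] + lines)


def submit_task(spec, base_url="http://localhost:8888"):
    """POST an ingestion spec to the task endpoint and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/druid/indexer/v1/task",  # base_url is an assumption (local quickstart)
        data=json.dumps(spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


rows = [
    {"driver_id": "D123", "latitude": 40.7128, "longitude": -74.0060,
     "timestamp": "2025-05-13T17:41:00Z"},
]
print(build_inline_csv(rows))

# Usage against a running cluster (not executed here):
#   with open("ingestion_spec.json") as f:      # hypothetical file holding the spec above
#       spec = json.load(f)
#   spec["spec"]["ioConfig"]["inputSource"]["data"] = build_inline_csv(rows)
#   print(submit_task(spec))                     # the reply contains the task ID
```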

Querying Geospatial Data in Druid

To match a passenger with nearby drivers, the company needs to query the driver_locations datasource for drivers within a 10-kilometer radius of the passenger’s pickup point. For this example, let’s say the passenger is at coordinates (40.7128, -74.0060), roughly the location of downtown Manhattan.

Druid supports geospatial queries through its native query language, using a scan query with a spatial filter. Here’s the query to find drivers near the passenger:

{
"queryType": "scan",
"dataSource": {
"type": "table",
"name": "driver_locations"
},
"intervals": {
"type": "intervals",
"intervals": [
"2025-05-13T17:31:00.000Z/2025-05-13T17:42:00.000Z"
]
},
"resultFormat": "compactedList",
"limit": 1000,
"columns": [
"driver_id",
"coordinates"
],
"granularity": {
"type": "all"
},
"filter": {
"type": "spatial",
"dimension": "coordinates",
"bound": {
"type": "radius",
"coords": [
40.7128,
-74.006
],
"radius": 10,
"radiusUnit": "kilometers"
}
}
}

Let’s break down the query:

Time Filter: The intervals field restricts the query to roughly the last 10 minutes of data (17:31 to 17:42 on May 13, 2025, with the end of the interval exclusive), so only recent driver locations are considered for real-time matching.
Geospatial Filter: The filter applies a spatial condition on the coordinates dimension, using a radius bound. The coords give the center of the circle, here the passenger’s location (40.7128, -74.0060), in the same latitude, longitude order as the dims of the spatial dimension. Together with radius 10 and radiusUnit “kilometers”, they define a 10-kilometer circle around the pickup point.
Columns: The query returns driver_id and coordinates. The coordinates column holds each driver’s latitude and longitude as a comma-separated string, which is enough to identify and locate matching drivers. Add __time, latitude, or longitude to columns if the application needs them as separate fields.
Output: The scan query with resultFormat: "compactedList" returns up to 1000 rows (the limit), from which the system can select the closest driver.
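In a live system the interval and the circle’s center change with every ride request, so the query is usually built in code. The following Python sketch shows one way to generate it. The function name and the lookback_minutes parameter are hypothetical; the query fields mirror the ones discussed above.

```python
from datetime import datetime, timedelta, timezone


def build_scan_query(pickup_lat, pickup_lon, now, radius_km=10,
                     lookback_minutes=10, limit=1000):
    """Build a native scan query for drivers seen recently near a pickup point.

    `now` must be a timezone-aware datetime; it is converted to UTC.
    """
    end = now.astimezone(timezone.utc)
    start = end - timedelta(minutes=lookback_minutes)
    fmt = "%Y-%m-%dT%H:%M:%S.000Z"
    interval = f"{start.strftime(fmt)}/{end.strftime(fmt)}"
    return {
        "queryType": "scan",
        "dataSource": {"type": "table", "name": "driver_locations"},
        "intervals": {"type": "intervals", "intervals": [interval]},
        "resultFormat": "compactedList",
        "limit": limit,
        "columns": ["driver_id", "coordinates"],
        "filter": {
            "type": "spatial",
            "dimension": "coordinates",
            "bound": {
                "type": "radius",
                # Same order as the spatial dimension's dims: latitude, longitude.
                "coords": [pickup_lat, pickup_lon],
                "radius": radius_km,
                "radiusUnit": "kilometers",
            },
        },
    }


now = datetime(2025, 5, 13, 17, 42, tzinfo=timezone.utc)
query = build_scan_query(40.7128, -74.0060, now)
print(query["intervals"]["intervals"][0])
```

Passing the current time in as an argument, instead of calling datetime.now() inside the function, keeps the query builder easy to test.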

Because the coordinates dimension is spatially indexed, Druid can narrow a proximity search to the matching rows instead of scanning every row, which keeps these queries fast even over large datasets.
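Once the query runs, the application still has to read the result. The Python sketch below posts a native query to the /druid/v2 endpoint and turns a compactedList response into a driver lookup. It assumes the Router is at http://localhost:8888 and that the coordinates column comes back as a "latitude,longitude" string. The function names, sample response, and segmentId value are made up for illustration.

```python
import json
import urllib.request


def run_native_query(query, base_url="http://localhost:8888"):
    """POST a native query to Druid and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{base_url}/druid/v2",  # base_url is an assumption (local quickstart)
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def parse_compacted_scan(response):
    """Turn a compactedList scan response into {driver_id: (lat, lon)}."""
    drivers = {}
    for batch in response:
        columns = batch["columns"]
        id_idx = columns.index("driver_id")
        coord_idx = columns.index("coordinates")
        for event in batch["events"]:
            lat_str, lon_str = event[coord_idx].split(",")
            # A driver may appear more than once; the last row seen wins.
            drivers[event[id_idx]] = (float(lat_str), float(lon_str))
    return drivers


# Hypothetical response, shaped like a compactedList scan result.
sample_response = [
    {
        "segmentId": "example-segment-id",
        "columns": ["driver_id", "coordinates"],
        "events": [["D123", "40.7128,-74.006"], ["D124", "40.721,-74.005"]],
    }
]
print(parse_compacted_scan(sample_response))
```

Since a driver can report several positions within the time window, a production version would likely also request __time and keep only the most recent row per driver before picking the closest one.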

Why Apache Druid for Geospatial Queries?

Geospatial queries in a ride-sharing platform demand low latency and high throughput, and Druid delivers on both fronts. Here’s why it’s a great choice:

Sub-Second Query Performance: Druid’s columnar storage and spatial indexing optimize geospatial queries, enabling sub-second response times for proximity searches.
Real-Time Ingestion: Druid supports streaming ingestion, so driver locations are available for querying as soon as they’re reported.
Scalability: Whether handling thousands or millions of location updates per second, Druid scales horizontally to meet demand.
Flexible Query Interface: Druid’s native query language, with spatial filters, supports complex geospatial queries, complementing its SQL capabilities.
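For the real-time ingestion point, the batch spec from earlier would be replaced by a streaming supervisor. The Python sketch below builds a minimal Kafka supervisor spec as a dictionary. The topic name and broker address are hypothetical, and it assumes the spatial dimension is declared the same way for streaming as in the batch spec, so treat it as a starting point. Such a spec is submitted to /druid/indexer/v1/supervisor.

```python
import json


def build_kafka_supervisor_spec(topic, bootstrap_servers):
    """Build a minimal Kafka supervisor spec for the driver_locations datasource."""
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": "driver_locations",
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {
                    "dimensions": [
                        "driver_id",
                        {"type": "double", "name": "latitude"},
                        {"type": "double", "name": "longitude"},
                        # Assumed to behave as in the batch spec above.
                        {"type": "spatial", "name": "coordinates",
                         "dims": ["latitude", "longitude"]},
                    ]
                },
                "granularitySpec": {
                    "segmentGranularity": "HOUR",
                    "queryGranularity": "none",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "type": "kafka",
                "topic": topic,
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
                "inputFormat": {"type": "json"},
                "useEarliestOffset": False,
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


# Hypothetical topic and broker address.
spec = build_kafka_supervisor_spec("driver-location-updates", "localhost:9092")
print(json.dumps(spec["spec"]["ioConfig"], indent=2))
```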

Conclusion

Apache Druid’s geospatial query capabilities empower businesses to unlock insights from location-based data at scale. In the ride-sharing example, we saw how Druid can ingest location data and identify nearby drivers in real time using a native scan query with a spatial filter, leveraging spatial indexes for blazing-fast performance. Whether you’re building a ride-sharing platform or analyzing location data, Druid’s ability to handle geospatial queries makes it a powerful tool for real-time analytics.

Have you tried geospatial queries in Druid? Share your use cases or questions in the comments below!

Other Resources

https://druid.apache.org/docs/latest/querying/geo/

Geospatial Queries in Apache Druid: A Ride-Sharing Example

Nikhil Rao
