Harnessing the IP Address  Data Type in Apache Druid (2025)

Nikhil RaoNikhil Rao
4 min read

Everything you need to ingest, store, and analyze billions of IPv4 & IPv6 addresses at interactive speed.


Why a native IP data type?

Storing every address as a plain string inflates segment size, hurts bitmap selectivity, and forces costly runtime casts for subnet math.

The complex ipAddress (128‑bit) and ipPrefix (136‑bit) encodings introduced in Druid 30 collapse IPv4 and IPv6 into compact binaries, enabling index‑friendly comparisons, hash‑based GROUP BY, and true subnet pruning down to the segment level.


Enabling the feature

# common.runtime.properties
druid.extensions.loadList=["imply-utility-belt"]   # ships the IP complex type

IPv6 support is still beta*—keep an eye on future releases before turning it on in production.*


Designing the ingestion spec

"dimensionsSpec": {
  "dimensions": [
    "page",
    "language",
    { "type": "ipAddress", "name": "remote_address" },  // single host
    { "type": "ipPrefix",  "name": "network_prefix" }   // CIDR block
  ]
}

ipAddress stores 128‑bit hosts, while ipPrefix keeps the host bits plus an 8‑bit prefix length so range look‑ups remain constant‑time.

Example Ingestion Spec with Data

An example ingestion spec with inline data is shown below

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "inline_data_ip_address",
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso",
        "missingValue": null
      },
      "dimensionsSpec": {
        "dimensions": [
          {
            "type": "string",
            "name": "page",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "language",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "ipAddress",
            "name": "remote_address",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "ipPrefix",
            "name": "network_prefix",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          }
        ],
        "dimensionExclusions": [
          "__time",
          "timestamp"
        ],
        "includeAllDimensions": false,
        "useSchemaDiscovery": false
      },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": {
          "type": "none"
        },
        "rollup": false,
        "intervals": []
      },
      "transformSpec": {
        "filter": null,
        "transforms": []
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "inline",
        "data": "timestamp,page,language,remote_address,network_prefix\n2025-05-17T00:00:00,/index.html,en,bc89:60a9:23b8:c1e9:3924:56de:3eb1:3b90,bc89:60a9:23b8::/48\n2025-05-17T00:05:00,/index.html,jp,108.3.17.153,108.3.17.0/24\n2025-05-17T00:10:00,/,en,59.143.170.24,59.143.170.0/24\n2025-05-17T00:15:00,/checkout,jp,50.231.6.41,50.231.6.0/24\n2025-05-17T00:20:00,/checkout,de,150.218.29.172,150.218.29.0/24\n2025-05-17T00:25:00,/login,en,571a:a876:6c30:7511:b2b9:437a:28df:6ec4,571a:a876:6c30::/48\n2025-05-17T00:30:00,/login,es,195.116.89.238,195.116.89.0/24\n2025-05-17T00:35:00,/account,en,24.194.103.151,24.194.103.0/24\n2025-05-17T00:40:00,/account,fr,759c:de66:bacf:b3d0:b1f:9163:ce9f:f57f,759c:de66:bacf::/48\n2025-05-17T00:45:00,/checkout,en,4b0d:bb41:8d52:88f1:142c:3fe8:60e7:a113,4b0d:bb41:8d52::/48\n"
      },
      "inputFormat": {
        "type": "csv",
        "findColumnsFromHeader": true
      },
      "appendToExisting": false,
      "dropExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 5000000,
      "appendableIndexSpec": {
        "type": "onheap",
        "preserveExistingMetrics": false
      },
      "maxRowsInMemory": 1000000,
      "maxBytesInMemory": 0,
      "skipBytesInMemoryOverheadCheck": false,
      "maxTotalRows": null,
      "numShards": null,
      "splitHintSpec": null,
      "partitionsSpec": {
        "type": "dynamic",
        "maxRowsPerSegment": 5000000,
        "maxTotalRows": null
      },
      "indexSpec": {
        "bitmap": {
          "type": "roaring"
        },
        "dimensionCompression": "lz4",
        "stringDictionaryEncoding": {
          "type": "utf8"
        },
        "metricCompression": "lz4",
        "longEncoding": "longs"
      },
      "indexSpecForIntermediatePersists": {
        "bitmap": {
          "type": "roaring"
        },
        "dimensionCompression": "lz4",
        "stringDictionaryEncoding": {
          "type": "utf8"
        },
        "metricCompression": "lz4",
        "longEncoding": "longs"
      },
      "maxPendingPersists": 0,
      "forceGuaranteedRollup": false,
      "reportParseExceptions": false,
      "pushTimeout": 0,
      "segmentWriteOutMediumFactory": null,
      "maxNumConcurrentSubTasks": 1,
      "maxRetry": 3,
      "taskStatusCheckPeriodMs": 1000,
      "chatHandlerTimeout": "PT10S",
      "chatHandlerNumRetries": 5,
      "maxNumSegmentsToMerge": 100,
      "totalNumMergeTasks": 10,
      "logParseExceptions": false,
      "maxParseExceptions": 2147483647,
      "maxSavedParseExceptions": 0,
      "maxColumnsToMerge": -1,
      "awaitSegmentAvailabilityTimeoutMillis": 0,
      "maxAllowedLockCount": -1,
      "numPersistThreads": 1,
      "partitionDimensions": []
    }
  },
  "context": {
    "forceTimeChunkLock": true,
    "useLineageBasedSegmentAllocation": true
  },
  "dataSource": "inline_data_ip_address"
}

Quick sanity check

SELECT 
  IPV4_STRINGIFY(remote_address) AS ip,
  IPV4_STRINGIFY(network_prefix) AS cidr
FROM traffic LIMIT 5

Querying with IP‑centric functions

Matching & searching

-- Is the client inside a zero‑trust subnet?
SELECT COUNT(*) 
FROM traffic 
WHERE IPV4_MATCH(remote_address, '108.3.0.0/16')

IPV4_MATCH needs an exact host or CIDR on the right.


Migrating from STRING → IP

  1. Backfill: Re‑ingest recent partitions with the new dimensionSpec.

  2. Realtime streams: Add a transform when your parser emits the address:

{ "type":"expression", "name":"remote_address",
  "expression":"IP_PARSE(remote_address_str)" }
  1. Join strategy: For mixed historical segments, wrap filters in COALESCE(IP_PARSE(addr_str), addr_bin) so queries hit both encodings.

  2. Once old segments age out, drop the legacy string column.


Performance tips & caveats

TipWhy
Use ipPrefix for common subnet filtersBitmap indexes & Bloom filters can prune segments before the broker fan‑out.
Keep prefix lengths ≤ 48 for IPv6Finer masks can explode the prefix space and hurt pruning.
Disable dictionary encoding on bulk host columns (dimensionCompression: "none")When cardinality ≈ row‑count, front‑load encoding cost is wasteful.
Be mindful of JSON serializationResult sets with ipAddress columns emit the binary as Base64; wrap with IPV4_STRINGIFY for dashboards.
Feature maturityIPv6 operations are marked beta—benchmark with real workloads before committing.

Wrapping up

The native IP data types radically simplify network analytics in Druid:

  • Smaller segments → lower storage and faster scans.

  • True subnet intelligence → no more ad‑hoc regex parsing.

  • Full SQL ergonomics with first‑class GROUP BY and WHERE support.

Whether you’re protecting an edge, feeding a SIEM, or building click‑stream geo dashboards, upgrading to the ipAddress/ipPrefix complex types unlocks millisecond‑level insights at cloud scale.

Happy querying! 🐉✨

0
Subscribe to my newsletter

Read articles from Nikhil Rao directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Nikhil Rao
Nikhil Rao

Los Angeles