High cardinality meets columnar time series system

Table of contents
- Introduction
- Problem: high cardinality in row oriented systems
- The time series problem: index explosion and write amplification
- We found that columnar stores are better for high cardinality time series
- Key benefits:
- High cardinality ≠ high cost anymore
- Example of columnar systems indexing each label independently
- Conclusion
- What’s Next?

Introduction
Parseable is a high-performance observability platform with a diskless, object-store-first design. Efficient storage and fast access to large volumes of telemetry data are key, and to support this we adopted Apache Parquet, a columnar storage format well known in the data ecosystem.
Traditionally, time-series engines rely on inverted indexes and store data row-wise along timelines. Columnar formats like Parquet, on the other hand, store data by column, enabling highly selective scans and predicate pushdown.
This storage model inherently sidesteps many of the scaling issues caused by high cardinality: because data in a high-cardinality column is physically isolated from other columns, the cardinality of one field doesn’t balloon memory usage or index size. In our experience, column-level compression and efficient on-disk scans are a better approach than maintaining large in-memory indexes.
In this post, we’ll share what we’ve learned about storing and querying high-cardinality observability data in a columnar format, and why this model aligns well with scalable, cloud-native log and metrics systems.
Problem: high cardinality in row oriented systems
In traditional row-based storage (e.g., MySQL, PostgreSQL), data is stored tuple wise:
[uid, name, email, status]
[1001, "Alice", "a@example.com", "active"]
[1002, "Bob", "b@example.com", "inactive"]
...
So every read and write operation deals with the entire row, regardless of whether the query involves one column or several. High-cardinality columns are those with many distinct values, for example uid or email. Such columns affect:
- Index size: B-tree or hash indexes grow with the number of unique keys (O(N)).
- Compression: less redundancy → less effective compression.
- Scan performance: queries like SELECT name FROM users WHERE uid = 1001 must process full rows or traverse large indexes.
From a complexity standpoint:
- Storage complexity: O(N), where N = number of unique values.
- Query time (index-based): O(log N), but often with random I/O.
- Compression ratio: inversely proportional to cardinality.
The time series problem: index explosion and write amplification
Time series data (e.g., metrics, logs) is typically organised by timestamp and labels. As more labels are added, especially with high cardinality, row-based time series databases suffer from severe index explosion and write amplification.
Let’s look at some simple maths. Say the labels are:
- L₁: regions, with cardinality C₁ (e.g., 5 regions)
- L₂: log levels, with cardinality C₂ (e.g., 3 levels)
- L₃: services, with cardinality C₃ (e.g., 50 services)
- L₄: user IDs, with cardinality C₄ (e.g., 10 million users)

The total number of unique combinations (i.e., the number of time-series) is given by the Cartesian product:

N_series = C₁ × C₂ × C₃ × C₄ = 5 × 3 × 50 × 10,000,000 = 7.5 billion

This results in 7.5 billion index entries, all of which need to be created, stored, maintained, and queried. The index size and query cost increase dramatically as the cardinality grows.
Problem 1: index explosion from label combinations
In traditional time-series systems, each unique combination of labels is treated as a separate time-series. This combinatorial growth of index entries (the Cartesian product of label cardinalities) leads to massive metadata overhead.
For example, a simple query filtering by user_id would require scanning through billions of label combinations.
Problem 2: cold writes and sparse data
Time-series databases are forced to allocate new series for every unique combination of label values, resulting in:
- High write amplification, as new entries are constantly added to the index (O(N)).
- Fragmentation of data across multiple blocks or files, leading to poor cache locality and inefficient writes.
Problem 3: query fragmentation
Queries in traditional systems require scanning through fragmented data blocks, leading to high overhead. Even a simple query like:
SELECT rate(requests_total{user_id="abc123"}[5m])
requires searching through billions of time-series entries, leading to high query complexity:

Query cost ≈ O(N_series × fragmentation)

where N_series is the total number of time-series (7.5 billion), and fragmentation represents the cost of accessing fragmented data blocks.
We found that columnar stores are better for high cardinality time series
Columnar storage fundamentally changes how time series data is indexed and queried. In a columnar system, each label is stored in its own independent column. This decouples the impact of high cardinality across labels.
Each of the labels from before is now its own column:
- Column 1: region, with cardinality C₁ = 5
- Column 2: log_level, with cardinality C₂ = 3
- Column 3: service, with cardinality C₃ = 50
- Column 4: user_id, with cardinality C₄ = 10,000,000
The total storage cost is the sum of the costs of each column, instead of the multiplicative blow-up O(C₁ × C₂ × C₃ × C₄):

Total cost = O(C₁ + C₂ + C₃ + C₄)

For our example:

O(5 + 3 + 50 + 10,000,000) ≈ O(10,000,058), i.e., about 10 million entries instead of 7.5 billion.

Columnar storage ensures that the cost grows linearly with the sum of individual cardinalities, drastically reducing the total storage overhead.
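To sanity-check the arithmetic, here is a tiny, self-contained Python snippet (using the example cardinalities above) that compares the two growth models:

```python
# Example label cardinalities from the scenario above
cardinalities = {"region": 5, "log_level": 3, "service": 50, "user_id": 10_000_000}

# Row/series-oriented model: one index entry per unique label combination
combinations = 1
for c in cardinalities.values():
    combinations *= c

# Columnar model: each column is indexed independently
per_column_total = sum(cardinalities.values())

print(f"Cartesian product of labels: {combinations:,}")       # 7,500,000,000
print(f"Sum of per-column entries:   {per_column_total:,}")   # 10,000,058
print(f"Reduction factor: ~{combinations // per_column_total}x")  # ~749x
```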
Columnar storage stores each column as a separate contiguous block:
Column: uid → [1001, 1002, 1003, ...]
Column: name → ["Alice", "Bob", "Charlie", ...]
Column: email → ["a@...", "b@...", ...]
Each column is compressed and indexed independently. This separation is critical for mitigating high-cardinality costs.
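As a concrete illustration (a minimal sketch using PyArrow; the file name and sample data are invented for the example), writing a Parquet file and inspecting its metadata shows that each column chunk carries its own encodings, compression, and statistics:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy log table: two low-cardinality label columns and one high-cardinality column
n = 100_000
table = pa.table({
    "region": ["us-east", "us-west", "eu-west", "eu-east", "ap-south"] * (n // 5),
    "log_level": (["info"] * 8 + ["warn", "error"]) * (n // 10),
    "user_id": [f"user_{i}" for i in range(n)],
})
pq.write_table(table, "logs.parquet", compression="zstd")

# Per-column metadata: each column is stored and compressed independently
row_group = pq.ParquetFile("logs.parquet").metadata.row_group(0)
for i in range(row_group.num_columns):
    col = row_group.column(i)
    print(col.path_in_schema, col.encodings, col.total_compressed_size)
```

The high-cardinality user_id column compresses far less than region or log_level, but its cost stays contained to that one column.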
Key benefits:
Columnar isolation
High cardinality affects only the specific column. Other columns with low cardinality (e.g., status, gender) remain highly compressible. This creates an asymptotic decoupling:
- Total storage = Σ O(Cᵢ), where Cᵢ is the cardinality of column i.

Let’s break it down:
- Σ (sigma) is the summation operator.
- Cᵢ is the cardinality of the i-th column (i.e., how many distinct values it contains).
- O(Cᵢ) is Big-O notation, describing the asymptotic cost (in storage or computation) of handling column i.
In simpler terms, if column A has 10 unique values, column B has 1000, and column C has 10 million, then the total cost is proportional to:
O(10) + O(1,000) + O(10,000,000) = O(10,001,010)
Instead of something like:
O(10 × 1000 × 10,000,000) = O(100,000,000,000)
Columnar formats contain the impact of high cardinality to just the affected column, rather than letting it explode across the dataset.
Efficient compression techniques
Columnar systems compress each column independently using techniques like:
- Dictionary encoding (great for low/medium cardinality)
- Run-Length Encoding (RLE) (ideal for repeated values)
- Bit-packing and delta encoding (for numeric IDs or timestamps)
While dictionary encoding may not help for high-cardinality columns, the engine can skip it and apply alternative schemes like delta encoding or bit-packing. This avoids compression penalties while keeping scan operations efficient. Compression is applied selectively, based on the cardinality and data distribution of each column, avoiding unnecessary overhead for high-cardinality fields.
For example, the user_id column, with 10 million unique values, can be compressed using delta encoding, whereas the region column, with just 5 values, benefits from dictionary encoding.
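As a rough sketch of how that per-column selection can look in practice (PyArrow again, with made-up data and file names), dictionary encoding can be enabled only for the low-cardinality columns:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Small illustrative table: low-cardinality labels plus a high-cardinality user_id
table = pa.table({
    "region": ["us-east", "us-west", "eu-west", "us-east"] * 25_000,
    "log_level": ["info", "info", "warn", "error"] * 25_000,
    "user_id": [f"user_{i}" for i in range(100_000)],
})

# Dictionary-encode only the low-cardinality columns; user_id is written
# without a dictionary (plain encoding by default in this writer).
pq.write_table(table, "logs_encoded.parquet", use_dictionary=["region", "log_level"])

row_group = pq.ParquetFile("logs_encoded.parquet").metadata.row_group(0)
for i in range(row_group.num_columns):
    print(row_group.column(i).path_in_schema, row_group.column(i).encodings)
```

The key design point is that the encoding decision is local to each column, so a bad fit for user_id never degrades how region or log_level are stored.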
Selective scans
Queries that touch only a few columns avoid the cost of reading high-cardinality ones. For example:
SELECT status FROM users WHERE status = 'active'
In row-based systems, this would require scanning the entire dataset (O(N)). In columnar systems, only the relevant column (status) is scanned, reducing complexity to:

O(M)

where M is the data volume of the status column alone, and N is the total number of rows.
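As a sketch (PyArrow, with a hypothetical users.parquet file using the uid/name/email/status schema shown earlier), the columnar version of that query only ever touches the status column:

```python
import pyarrow.parquet as pq
import pyarrow.compute as pc

# Column projection: only the 'status' column chunks are read from storage;
# uid, name, and email are never decoded.
users = pq.read_table("users.parquet", columns=["status"])

# Apply the predicate on the single column that was loaded
active = users.filter(pc.equal(users["status"], "active"))
print(active.num_rows, "active rows, scanned from one column")
```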
Predicate pushdown & min-max indexes
Columnar systems maintain min-max indexes per block, which allow for efficient pruning of irrelevant blocks. Even high-cardinality columns can benefit from block-level filtering. For instance:
- For numeric types (e.g., user_id), block-level min-max filtering can quickly skip over blocks that don’t match the query predicate.
- For string-based columns, Bloom filters provide probabilistic filtering, further reducing the number of blocks that need to be scanned.

The result is that only the blocks whose statistics or Bloom filters match the predicate are actually read. This helps optimize scan performance and reduces the need for full-table scans, even in the presence of high-cardinality fields.
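With Parquet readers such as PyArrow, this pruning is exposed through row-group statistics and the filters argument (a sketch, reusing the hypothetical logs.parquet file from the earlier snippet):

```python
import pyarrow.parquet as pq

# Row groups whose min/max statistics for user_id cannot contain the value
# are skipped entirely; only candidate blocks are read and decoded.
hits = pq.read_table(
    "logs.parquet",
    columns=["user_id", "log_level", "region"],
    filters=[("user_id", "==", "user_42069")],
)
print(hits.num_rows, "matching rows")
```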
High cardinality ≠ high cost anymore
In a well-designed columnar system:
- High-cardinality columns are stored, compressed, and queried in isolation.
- High-volume scans leverage vectorized execution and SIMD optimizations.
- Indexing strategies like zone maps or Bloom filters bypass full scans.
- Adaptive encoding chooses the best compression algorithm based on cardinality and entropy.
As a result:
| Metric | Row-Oriented | Columnar Format |
| --- | --- | --- |
| Storage size | O(N) × W (full rows) | Σ O(Cᵢ) (per column) |
| Query latency | O(log N) + disk seeks | O(M) with pushdowns |
| Compression ratio | Inversely proportional to cardinality | Per-column optimal |
| Write path | Tuple-based | Appends + columnar flush |
Example of columnar systems indexing each label independently
Imagine you have millions of logs. Each log has fields like:
- region = "us-east"
- log_level = "info"
- service = "api-gateway"
- user_id = unique for every user (like user_42069, user_12345, etc.)
Now, suppose you're building an index: a map that helps you find logs faster later.
Traditional time series indexing
In traditional systems, all the fields are bundled together into a single index entry. It’s like saying:
“I saw a log where region=us-east AND log_level=info AND service=api-gateway AND user_id=user_42069.”
That means every unique combination of values needs a new entry in the index. If:
- region has 5 possible values
- log_level has 3
- service has 50
- user_id has 10 million

Then the number of possible combinations = 5 × 3 × 50 × 10,000,000 = 7.5 billion.
That's 7.5 billion index entries just to cover the possibilities! You’re wasting space and time just to support search.
Columnar storage (Parseable way)
In columnar systems, each field (region, log_level, service, user_id) is stored separately. So the system says: “Here’s a list of all region values. Separately, here’s a list of all user_id values.”
That means:
- Only 5 entries for region
- 3 for log_level
- 50 for service
- 10 million for user_id
Total: 5 + 3 + 50 + 10,000,000 = ~10 million
Instead of 7.5 billion entries, you only need about 10 million. That’s roughly 750x less metadata to store and search through.
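If you want to see this effect without generating 10 million users, here is a small, scaled-down simulation in plain Python (synthetic, made-up label values) that counts index entries under both models:

```python
import random

# Scaled-down label space; the real example uses 10 million users
regions = ["us-east", "us-west", "eu-west", "eu-east", "ap-south"]
levels = ["info", "warn", "error"]
services = [f"svc_{i}" for i in range(50)]
users = [f"user_{i}" for i in range(100_000)]

random.seed(42)
logs = [
    (random.choice(regions), random.choice(levels),
     random.choice(services), random.choice(users))
    for _ in range(1_000_000)
]

# Series-per-combination index: one entry per distinct label tuple
combination_entries = len(set(logs))

# Columnar approach: each label column is tracked on its own
per_column_entries = sum(len(set(column)) for column in zip(*logs))

print(f"distinct label combinations: {combination_entries:,}")
print(f"sum of per-column entries:   {per_column_entries:,}")
# The gap widens as user cardinality grows toward the real 10M figure
```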
Why is this faster for search?
Let’s say you’re running this query → “Find all logs where user_id = user_42069”.
In row-based systems:
- The system has to scan massive index tables of combined fields.
- Filtering by user_id might require scanning billions of combinations to find matches.
In columnar systems:
- The system jumps straight to the user_id column.
- It quickly checks if "user_42069" exists.
- It never even looks at region or log_level unless you ask.
So:
- Less to scan
- More precise targeting
- Way faster response
Bottom Line
| | Row-Based | Columnar (Parseable) |
| --- | --- | --- |
| Index entries | ~7.5 billion | ~10 million |
| Search behavior | Scan combinations | Scan only 1 column |
| Query speed | Slower (bloated index) | Faster (targeted and lean) |
| Storage overhead | Very high | Much smaller, even compressed |
Conclusion
High cardinality in time-series data is not inherently problematic; it becomes a problem only when the data model fails to isolate it. By decoupling storage and indexing per column, columnar formats like Parquet can effectively address the scaling challenges presented by high cardinality, even in time-series workloads.
Instead of facing exponential growth in storage and query complexity due to the Cartesian product of labels, columnar storage systems reduce the impact of high cardinality to the sum of individual column cardinalities:

Total cost = Σ O(Cᵢ) instead of O(C₁ × C₂ × … × Cₙ)

This model decouples storage from indexing, enabling faster queries, better compression, and more efficient write paths, even in systems handling billions of time-series entries.
In contrast to row-based systems, which suffer from the explosion of metadata due to label combinations, columnar storage provides a scalable, efficient solution for handling high-cardinality time-series data.
This is why Parseable is designed with a columnar-first approach, ensuring we can handle large, high-cardinality telemetry data efficiently and at scale.
What’s Next?
We’re continuing to build from first principles, doubling down on high-cardinality performance, scalable search, and seamless observability on S3 through columnar formats. 🚀
If you like what we're building, show us some ❤️ by starring our repository, it keeps our team motivated to keep pushing for fast observability on S3!