Optimizing ClickHouse for Financial Time Series


Our product, https://marketlens.app, is a market visualization tool that receives data from exchanges’ WebSockets. Top exchanges produce gigabytes of data daily. For instance, Binance ETH/USDT perpetual futures recorded 7 million trades in a single day on February 3, 2025.
To illustrate the volume of trades, we queried the top three trading days by trade count:
SELECT symbol, toDate(ts), count(*) FROM trades GROUP BY symbol, toDate(ts) ORDER BY count(*) DESC LIMIT 3
Symbol | Date | Trade Count |
binance:futures:linear:ETHUSDT | 2021-05-19 | 8,001,640 |
bybit:futures:linear:ETHUSDT | 2025-02-03 | 7,229,811 |
bybit:futures:linear:BTCUSDT | 2025-02-03 | 6,713,708 |
Orderbook updates can significantly outnumber trades. For example, Binance BTC/USDT perpetual futures had 267 million orderbook updates on April 7, 2025, despite Binance providing only aggregated updates every 100 milliseconds.
Here are the top three days by orderbook update volume:
SELECT symbol, toDate(ts), sum(length(prices)) AS num_of_updates FROM orderbooks GROUP BY symbol, toDate(ts) ORDER BY num_of_updates DESC limit 3
Symbol | Date | Number of Updates |
binance:futures:linear:BTCUSDT | 2025-04-07 | 267,488,693 |
binance:futures:linear:BTCUSDT | 2025-03-04 | 246,896,146 |
binance:futures:linear:BTCUSDT | 2025-03-07 | 241,906,906 |
We have collected approximately 900 GB of data over the past several months for top markets like BTC and ETH on Binance, and we have optimized storage as much as possible. We would like to share our findings with you.
Both trades and orderbook updates can be naturally sorted by timestamps, which determines the order for columns and codecs. Let’s assume this sort order for our analysis.
Trade IDs
Trade IDs are used by exchanges to uniquely identify each trade. Most exchanges use sequential IDs, while some, like Bybit Futures, use UUIDs.
Sequential IDs are straightforward, and the DoubleDelta
codec performs best. The Delta
codec transforms a sequence like [2, 3, 4, 5, 6]
into [1, 1, 1, 1, 1]
, and DoubleDelta
goes further, producing [0, 0, 0, 0, 0]
. Applying ZSTD
further improves the compression ratio, as shown below.
UUIDs, on the other hand, are designed to be unique and random, making them effectively incompressible. Compression methods like NONE
, LZ4
, and ZSTD
yield a compression ratio of 1. Fortunately, Bybit Futures is the only exchange we encountered that relies on UUIDs.
Trade Timestamps
Binance provides a stream that aggregates trades with the same prices, while Bybit pushes all unique trades separately. Both claim to push data in real-time. Most exchanges provide trade timestamps with millisecond precision, with only Coinbase providing microseconds. Therefore, our datatype is DateTime64(6, 'UTC')
, which stores microseconds since epoch start (1970-01-01 00:00:00 UTC) as Int64
. It seems that trade timestamps will be quite random, but let’s see what we get.
ZSTD
effectively eliminates repetitive parts in sequential timestamps. We didn’t expect anything from T64
and were surprised that it improved plain LZ4
compression, however, worsened plain ZSTD
. Delta
improved compression for both LZ4
and ZSTD
probably just by simplifying work for them.
Trade Prices
Prices are decimal values. Binance and Bybit provide them as JSON strings, while Deribit uses the JSON number type, which is effectively a string as well, just without quotes. Precision varies significantly. The smallest precision we encountered was ZENIX/USDT on Mexc with 25(!) decimal places; most exchanges are more reasonable, with no more than 9 decimal places used. We considered several options.
Float64
The binary internal representation of floats leads to precision loss for decimal numbers; however, under certain constraints, precision loss is recoverable back to exact initial decimal values.
Trade price changes are random but changes are likely just several ticks, so the Delta
codec gives a clear improvement. Gorilla
is a codec specifically designed for floating-point numbers and their binary representation. It calculates XOR between adjacent values and writes it in compact binary form. Here we see a little improvement over plain LZ4
compression, but no improvement over plain ZSTD
.
We tried “dumb” combinations like Gorilla + Delta
and Delta + Gorilla
and we got:
Gorilla + Delta + LZ4
: 1.89Gorilla + Delta + ZSTD
: 1.88Delta + Gorilla + LZ4
: 2.08Delta + Gorilla + ZSTD
: 3.12
Gorilla
and Delta
duplicate each other’s purpose and reasonably give bad results.
Decimal(38, 19)
This format uses 38 digits, with 19 allocated to the fractional part, and is stored as UInt128
.
GCD
is an interesting codec. It calculates the greatest common divisor for values inside blocks (usually from 64KB to 1MB), then divides each value by it and stores the GCD value and values after divisions. It looks perfect for decimal prices that have implicit or explicit ticks. Decimal(38, 19)
is a 128-bit number which is 2x of Float64
, so 18.0 compression ratio above corresponds to 9.0 for Float64
.
Unfortunately, the Delta
codec cannot be applied to 16-byte numbers. It’s possible to specify Delta(8)
, which limits differences to 8 bytes, but if a difference overflows there will be no error and result values will be meaningless.
UInt64
Exchanges’ specifications for markets provide minimal price ticks. We could just store price as an integer number of minimal ticks. However, we later discovered price ticks can change from time to time. Also, Bitfinex and Hyperliquid do not use a minimal tick concept. Instead, they specify a number of significant digits for a price, making minimal ticks become dynamic.
Tuple(UInt64, Int8)
A pair of (x, p)
so that an initial price is represented as x * 10 ^ p
. p
is chosen to minimize the number of non-zero digits.
ClickHouse does not allow specifying individual codecs for tuple components. Tuple components are stored as separate streams and compressed individually. The precision component has very low cardinality, so codecs have minimal impact because their values can be easily dictionary encoded by both LZ4
and ZSTD
.
Winner
Float64
takes 8 bytes, Decimal(38, 19)
- 16, and Tuple(UInt64, Int8)
- 9. Let’s compare storage sizes in bytes of the best results.
We can see that Tuple(UInt64, Int8)
improves a bit over Decimal(38, 19)
. However, it’s quite good; probably it could benefit from the Delta
codec if Delta(16)
were possible.
Trade Sizes
Trade sizes are similar to prices. Most exchanges provide a minimal tick size that can change from time to time. Bitfinex has fixed precision of 8 decimal places. Hyperliquid relies on the same maximum significant digits concept as for prices.
However, trade sizes differ in their expected values. Prices evolve over time: one trade ticks up, another trade ticks down, and can sit in a specific range for some time. Sizes, on the other hand, appear to be more random. Let’s compare distributions of prices and sizes for Binance Perpetual Futures ETH/USDT trades on May 19, 2021. The Y-axis is the number of trades.
And let’s take a look at a more calm day like yesterday, April 19, 2025.
Unfortunately, compressing sizes is going to be more difficult. We already know that Float64
isn’t that compressible, so let’s explore Decimal(38, 19)
and (Int64, Int8)
.
Decimal(38, 19)
It’s surprising that GCD
almost doesn’t improve compression. Apparently, the reason is that we use signed sizes, where positive values indicate buys and negative - sells. GCD
doesn’t work when integers have mixed signs. So, we switched to a tuple Tuple(Decimal(38, 19), Bool)
and retested.
That’s much better!
Tuple(Int64, Int8)
Delta
worsens results, which is usually the case when series data are random. T64
improves results a bit.
Winner
We think that Decimal + GCD
performed better because it was able to find better common divisors than our code that converts decimals into (x, p)
pairs.
Conclusions
ZSTD
provides better compression thanLZ4
.ZSTD
alone, without specialized codecs, provides good results.Delta
is essential for monotonic values, such as ordered but random timestamps.DoubleDelta
is perfect for sequential IDs.T64
provides a slight improvement for values with limited rangeFloat64
is not good for decimal values. Instead, we can use a larger type likeDecimal(38, 19)
withGCD
, ensuring that prices and sizes are accurately represented without concern for type constraints.
These strategies helped us to significantly reduce storage size for MarketLens from 5.91 TiB to 897.97 GiB.
Subscribe to my newsletter
Read articles from MS Dev directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
