Optimizing ClickHouse for Financial Time Series

MS DevMS Dev
7 min read

Our product, https://marketlens.app, is a market visualization tool that receives data from exchanges’ WebSockets. Top exchanges produce gigabytes of data daily. For instance, Binance ETH/USDT perpetual futures recorded 7 million trades in a single day on February 3, 2025.

To illustrate the volume of trades, we queried the top three trading days by trade count:

SELECT symbol, toDate(ts), count(*) FROM trades GROUP BY symbol, toDate(ts) ORDER BY count(*) DESC LIMIT 3
SymbolDateTrade Count
binance:futures:linear:ETHUSDT2021-05-198,001,640
bybit:futures:linear:ETHUSDT2025-02-037,229,811
bybit:futures:linear:BTCUSDT2025-02-036,713,708

Orderbook updates can significantly outnumber trades. For example, Binance BTC/USDT perpetual futures had 267 million orderbook updates on April 7, 2025, despite Binance providing only aggregated updates every 100 milliseconds.

Here are the top three days by orderbook update volume:

SELECT symbol, toDate(ts), sum(length(prices)) AS num_of_updates FROM orderbooks GROUP BY symbol, toDate(ts) ORDER BY num_of_updates DESC limit 3
SymbolDateNumber of Updates
binance:futures:linear:BTCUSDT2025-04-07267,488,693
binance:futures:linear:BTCUSDT2025-03-04246,896,146
binance:futures:linear:BTCUSDT2025-03-07241,906,906

We have collected approximately 900 GB of data over the past several months for top markets like BTC and ETH on Binance, and we have optimized storage as much as possible. We would like to share our findings with you.

Both trades and orderbook updates can be naturally sorted by timestamps, which determines the order for columns and codecs. Let’s assume this sort order for our analysis.

Trade IDs

Trade IDs are used by exchanges to uniquely identify each trade. Most exchanges use sequential IDs, while some, like Bybit Futures, use UUIDs.

Sequential IDs are straightforward, and the DoubleDelta codec performs best. The Delta codec transforms a sequence like [2, 3, 4, 5, 6] into [1, 1, 1, 1, 1], and DoubleDelta goes further, producing [0, 0, 0, 0, 0]. Applying ZSTD further improves the compression ratio, as shown below.

UUIDs, on the other hand, are designed to be unique and random, making them effectively incompressible. Compression methods like NONE, LZ4, and ZSTD yield a compression ratio of 1. Fortunately, Bybit Futures is the only exchange we encountered that relies on UUIDs.

Trade Timestamps

Binance provides a stream that aggregates trades with the same prices, while Bybit pushes all unique trades separately. Both claim to push data in real-time. Most exchanges provide trade timestamps with millisecond precision, with only Coinbase providing microseconds. Therefore, our datatype is DateTime64(6, 'UTC'), which stores microseconds since epoch start (1970-01-01 00:00:00 UTC) as Int64. It seems that trade timestamps will be quite random, but let’s see what we get.

ZSTD effectively eliminates repetitive parts in sequential timestamps. We didn’t expect anything from T64 and were surprised that it improved plain LZ4 compression, however, worsened plain ZSTD. Delta improved compression for both LZ4 and ZSTD probably just by simplifying work for them.

Trade Prices

Prices are decimal values. Binance and Bybit provide them as JSON strings, while Deribit uses the JSON number type, which is effectively a string as well, just without quotes. Precision varies significantly. The smallest precision we encountered was ZENIX/USDT on Mexc with 25(!) decimal places; most exchanges are more reasonable, with no more than 9 decimal places used. We considered several options.

Float64

The binary internal representation of floats leads to precision loss for decimal numbers; however, under certain constraints, precision loss is recoverable back to exact initial decimal values.

Trade price changes are random but changes are likely just several ticks, so the Delta codec gives a clear improvement. Gorilla is a codec specifically designed for floating-point numbers and their binary representation. It calculates XOR between adjacent values and writes it in compact binary form. Here we see a little improvement over plain LZ4 compression, but no improvement over plain ZSTD.

We tried “dumb” combinations like Gorilla + Delta and Delta + Gorilla and we got:

  • Gorilla + Delta + LZ4: 1.89

  • Gorilla + Delta + ZSTD: 1.88

  • Delta + Gorilla + LZ4: 2.08

  • Delta + Gorilla + ZSTD: 3.12

Gorilla and Delta duplicate each other’s purpose and reasonably give bad results.

Decimal(38, 19)

This format uses 38 digits, with 19 allocated to the fractional part, and is stored as UInt128.

GCD is an interesting codec. It calculates the greatest common divisor for values inside blocks (usually from 64KB to 1MB), then divides each value by it and stores the GCD value and values after divisions. It looks perfect for decimal prices that have implicit or explicit ticks. Decimal(38, 19) is a 128-bit number which is 2x of Float64, so 18.0 compression ratio above corresponds to 9.0 for Float64.

Unfortunately, the Delta codec cannot be applied to 16-byte numbers. It’s possible to specify Delta(8), which limits differences to 8 bytes, but if a difference overflows there will be no error and result values will be meaningless.

UInt64

Exchanges’ specifications for markets provide minimal price ticks. We could just store price as an integer number of minimal ticks. However, we later discovered price ticks can change from time to time. Also, Bitfinex and Hyperliquid do not use a minimal tick concept. Instead, they specify a number of significant digits for a price, making minimal ticks become dynamic.

Tuple(UInt64, Int8)

A pair of (x, p) so that an initial price is represented as x * 10 ^ p. p is chosen to minimize the number of non-zero digits.

ClickHouse does not allow specifying individual codecs for tuple components. Tuple components are stored as separate streams and compressed individually. The precision component has very low cardinality, so codecs have minimal impact because their values can be easily dictionary encoded by both LZ4 and ZSTD.

Winner

Float64 takes 8 bytes, Decimal(38, 19) - 16, and Tuple(UInt64, Int8) - 9. Let’s compare storage sizes in bytes of the best results.

We can see that Tuple(UInt64, Int8) improves a bit over Decimal(38, 19). However, it’s quite good; probably it could benefit from the Delta codec if Delta(16) were possible.

Trade Sizes

Trade sizes are similar to prices. Most exchanges provide a minimal tick size that can change from time to time. Bitfinex has fixed precision of 8 decimal places. Hyperliquid relies on the same maximum significant digits concept as for prices.

However, trade sizes differ in their expected values. Prices evolve over time: one trade ticks up, another trade ticks down, and can sit in a specific range for some time. Sizes, on the other hand, appear to be more random. Let’s compare distributions of prices and sizes for Binance Perpetual Futures ETH/USDT trades on May 19, 2021. The Y-axis is the number of trades.

And let’s take a look at a more calm day like yesterday, April 19, 2025.

Unfortunately, compressing sizes is going to be more difficult. We already know that Float64 isn’t that compressible, so let’s explore Decimal(38, 19) and (Int64, Int8).

Decimal(38, 19)

It’s surprising that GCD almost doesn’t improve compression. Apparently, the reason is that we use signed sizes, where positive values indicate buys and negative - sells. GCD doesn’t work when integers have mixed signs. So, we switched to a tuple Tuple(Decimal(38, 19), Bool) and retested.

That’s much better!

Tuple(Int64, Int8)

Delta worsens results, which is usually the case when series data are random. T64 improves results a bit.

Winner

We think that Decimal + GCD performed better because it was able to find better common divisors than our code that converts decimals into (x, p) pairs.

Conclusions

  • ZSTD provides better compression than LZ4. ZSTD alone, without specialized codecs, provides good results.

  • Delta is essential for monotonic values, such as ordered but random timestamps. DoubleDelta is perfect for sequential IDs.

  • T64 provides a slight improvement for values with limited range

  • Float64 is not good for decimal values. Instead, we can use a larger type like Decimal(38, 19) with GCD, ensuring that prices and sizes are accurately represented without concern for type constraints.

These strategies helped us to significantly reduce storage size for MarketLens from 5.91 TiB to 897.97 GiB.

0
Subscribe to my newsletter

Read articles from MS Dev directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

MS Dev
MS Dev