An interactive guide to HyperLogLog


The problem
Imagine you’re running a large-scale online store. Thousands of users visit your website every second, and you want to know how many unique users visit each day. This sounds straightforward—just track each user by their IP address or login ID. But here’s the catch: keeping a list of every unique user requires a lot of memory, especially as the number of users grows into the billions.
How do you solve this problem without drowning in memory usage? This is where HyperLogLog, a probabilistic data structure, comes into play.
The solution
HyperLogLog (HLL) is a clever algorithm that approximates the number of unique items, counting billions of items with as little as 1.5 kB of memory, a tiny fraction of what exact methods require. It achieves this by trading a small amount of accuracy for significant space savings.
Play with HyperLogLog
I have created a fun little app that lets you play with HyperLogLog. Here is how it works:
Input IP Address:
- Click "Random IP" to generate an IP address automatically.
- Add the entered IP to the HyperLogLog by clicking "Add to HLL".
Adjust Bucket Count:
- Use the slider to adjust the number of buckets.
- This resets the HyperLogLog and clears all previous data.
Add Multiple Random IPs:
- Use the preset buttons to add 1K, 5K, 10K, 50K, or 100K random IP addresses to the HyperLogLog.
View Metrics:
- Check Actual Count, Estimated Count, Difference, Margin of Error, and Actual Error in the metrics cards.
- Metrics update as the HyperLogLog processes the inputs.
Visualize Data:
- Explore area and bar graphs for trends in Estimate vs. Actual Count, Difference, and Percentage Error.
Inspect Buckets:
- Scroll through individual buckets to observe how HyperLogLog distributes values and tracks run lengths.
Things to notice
As you increase the number of buckets, the margin of error shrinks. A few buckets may get unlucky and see a long run of leading zeros early on, but when you spread that luck over a large number of buckets, the chance of such outliers skewing the estimate drops. (Kinda like how insurance works.)
The error never actually reaches zero; this is a probabilistic algorithm, so unexpected wild swings are always possible.
See what happens to the estimate if you skimp on the number of buckets.
How it works
Hash Functions and Uniform Distribution
If you have read this far, I trust you know how hash functions work. To refresh your memory:
Hashing involves using a hash function to convert input data (like an IP address) into a fixed-size output, often a number. Good hash functions are deterministic (they always produce the same output for the same input) and uniformly distribute outputs across the possible range.
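As a concrete sketch (the function name `hash_ip` and the choice of SHA-1 are mine; any good hash function works), here is one way to turn an IP address into a fixed-size 32-bit integer:

```python
import hashlib

def hash_ip(ip: str) -> int:
    """Hash an IP string to a 32-bit integer using SHA-1 (one of many suitable choices)."""
    digest = hashlib.sha1(ip.encode()).digest()
    return int.from_bytes(digest[:4], "big")  # keep the first 32 bits

# Deterministic: the same input always hashes to the same value,
# and every output lands uniformly in [0, 2**32).
print(hash_ip("192.168.0.1") == hash_ip("192.168.0.1"))  # True
print(0 <= hash_ip("10.0.0.1") < 2**32)                  # True
```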
Leading Zeros and Cardinality
The core insight is that for a uniform random hash value:
The probability of encountering a hash value with at least \(k\) leading zeros in its binary representation is \(2^{-k}\).
Example: for \(k=3\), the binary representation must start with \(000\), which occurs \(2^{-3} = \frac{1}{8}\) of the time.
The expected maximum run of leading zeros across the hash values grows logarithmically with the number of distinct elements \(n\) in the dataset.
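You can check the \(2^{-k}\) claim empirically with a quick simulation (a sketch; `random.getrandbits` stands in for a uniform hash):

```python
import random

def leading_zeros(x: int, bits: int = 32) -> int:
    """Leading zeros in the bits-wide binary representation of x."""
    return bits - x.bit_length()  # x.bit_length() is 0 when x == 0

random.seed(1)
k, trials = 3, 100_000
hits = sum(leading_zeros(random.getrandbits(32)) >= k for _ in range(trials))
print(hits / trials)  # lands near 2**-3 = 0.125
```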
Bucketing and Parallelism
To reduce variance and improve accuracy, the HyperLogLog algorithm splits the hash values into \(m=2^p\) buckets (where \(p\) is a tunable parameter).
Each bucket is determined by the first \(p\) bits of the hash value, which serve as the bucket index.
The remaining bits of the hash value are used to compute the number of leading zeros for that bucket.
Each bucket keeps track of the maximum rank, i.e. the number of leading zeros plus one (the original paper calls this \(\rho\)), observed among the hash values assigned to it.
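A minimal sketch of that update step (assuming a 32-bit hash; the function name `add_hash` is mine):

```python
def add_hash(registers: list[int], h: int, p: int, bits: int = 32) -> None:
    """Route a bits-wide hash h into one of 2**p buckets and record its rank."""
    bucket = h >> (bits - p)                   # first p bits: bucket index
    rest = h & ((1 << (bits - p)) - 1)         # remaining bits
    rank = (bits - p) - rest.bit_length() + 1  # leading zeros in `rest`, plus one
    registers[bucket] = max(registers[bucket], rank)

p = 4
registers = [0] * (1 << p)
add_hash(registers, 0b1010_0001_0000_0000_0000_0000_0000_0001, p)
print(registers[0b1010])  # 4: three leading zeros in the remaining 28 bits, plus one
```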
Harmonic Mean of Maximum Leading Zeros
Each bucket contributes an estimate of the cardinality based on the leading zeros it observes. Since these estimates can vary significantly, HyperLogLog uses the harmonic mean of these estimates to combine the results:
$$E = \alpha_m \cdot m^2 \cdot \left(\sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}$$
Where:
\(m=2^p\): Number of buckets.
\(M[j]\): Maximum rank (leading zeros plus one) observed in the \(j\)-th bucket.
\(α_m\): A bias correction constant dependent on \(m\) derived empirically.
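In code, the raw estimate is a few lines over the registers (a sketch; the \(\alpha_m\) formula below is the paper's large-\(m\) approximation, and exact constants exist for small \(m\)):

```python
def raw_estimate(registers: list[int]) -> float:
    """Harmonic-mean estimate E = alpha_m * m^2 / sum(2^-M[j])."""
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)  # bias constant, large-m approximation
    return alpha * m * m / sum(2.0 ** -r for r in registers)

# Note the raw estimate is nonzero even for an empty sketch,
# which is exactly what the bias corrections address.
print(raw_estimate([0] * 16))  # ≈ 10.8, not 0
```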
Bias Correction and Range Adjustment
The raw estimate \(E\) can be biased for small or large cardinalities. HyperLogLog applies bias correction in the following ways:
1. Small Range Correction: If \(E\) is small ( \(E \leq \frac{5}{2}m\) ) and some buckets are still empty, the algorithm falls back to linear counting to handle underestimation:
$$E_{\text{corrected}} = m \cdot \log \left( \frac{m}{V} \right)$$
Where \(V\) is the number of empty buckets.
2. Large Range Correction: If \(E\) is very large (for 32-bit hashes, \(E > \frac{1}{30} \cdot 2^{32}\)), hash collisions become likely and the algorithm corrects for them:
$$E_{\text{corrected}} = -2^{32} \cdot \log \left( 1 - \frac{E}{2^{32}} \right)$$
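Putting the raw estimate and both corrections together, a self-contained sketch (32-bit hash range assumed; the function name `hll_estimate` is mine):

```python
import math

def hll_estimate(registers: list[int]) -> float:
    """HyperLogLog estimate with small- and large-range corrections."""
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)                # bias constant (large-m approximation)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    if raw <= 2.5 * m:                              # small range: fall back to linear counting
        empty = registers.count(0)
        if empty:
            return m * math.log(m / empty)
    elif raw > (2**32) / 30:                        # large range: correct for hash collisions
        return -(2**32) * math.log(1 - raw / 2**32)
    return raw

print(hll_estimate([0] * 1024))  # 0.0: an empty sketch now estimates zero
```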
Error and Memory Efficiency
The relative error of HyperLogLog is approximately:
$$Error \approx \frac{1.04}{\sqrt{m}}$$
Larger \(m\) (more buckets) reduces error but increases memory usage.
Memory usage is proportional to \(m \log_2 \log_2 n\) bits, making HyperLogLog extremely space-efficient.
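To get a feel for the trade-off, here is a quick loop printing error against memory, assuming 6-bit registers (enough headroom for 64-bit hashes):

```python
for p in (4, 8, 12, 16):
    m = 2 ** p
    error_pct = 100 * 1.04 / m ** 0.5   # relative error ~ 1.04 / sqrt(m)
    memory_kib = m * 6 / 8 / 1024       # 6 bits per register
    print(f"m = {m:6d}  error ~ {error_pct:6.3f}%  memory ~ {memory_kib:7.3f} KiB")
```

Even at 65,536 buckets the sketch fits in 48 KiB while keeping the error near 0.4%.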
Intuition Behind Logarithmic Behavior
The logarithmic behavior of leading zeros stems from the exponential relationship between probabilities and cardinalities:
As the cardinality \(n\) grows, the longest run of leading zeros you expect to observe grows only logarithmically, roughly like \(\log_2 n\).
HyperLogLog aggregates these local estimates (per bucket) and normalizes them using the harmonic mean, resulting in a robust global estimate.
Where is this used?
Counting unique items in a massive dataset is a fundamental problem in computer science. Consider these real-world scenarios:
Web analytics: Counting unique visitors to a website.
Database de-duplication: Estimating unique rows in a table.
Network monitoring: Counting unique IP addresses for security analysis.
A naive approach might involve storing every unique item in a set and counting the set's size. However, this requires memory proportional to the number of unique items, which will be very expensive for large datasets.
Real-world implementations
Web Analytics: Google Analytics and YouTube use algorithms similar to HyperLogLog to estimate unique visitors.
Databases: TimescaleDB and Redis implement HyperLogLog for approximate distinct counts.
Big Data Platforms: Apache Druid and Presto use HyperLogLog to provide fast, approximate query results.
Downsides
While HyperLogLog is powerful, it’s important to understand its limitations:
Approximation: The algorithm provides an estimate, not an exact count. The error rate is about \(\frac{1.04}{\sqrt{m}}\).
Hash Collisions: The accuracy depends on the quality of the hash function. Poor hashing can lead to inaccuracies.
Further reading
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm