An Interactive Guide To Count Min Sketch


Introduction
Count min sketch is a probabilistic data structure that can guess the frequency of items in a stream. It is an improvement over Hyperloglog. While hyperloglog can estimate the number of unique items in a fixed amount of data, count min sketch can do that over a stream of data. Think of hyperloglog as something that can guess the frequency of unique items on an image (something that is fixed) while count min sketch as something that can do that for a live video stream (stream of data). Meaning even when you don’t know how long the data stream is you can still guess the frequency of items in the stream
Working principle
This blog is a third installment of my probabilistic data structure series. I have written similar interactive guides of Bloom Filter and Hyperloglog. If you are unfamiliar with probabilistic data structure then reading the similar interactive guide in Hyperloglog should give you a good idea.
Count min sketch is made of
A 2D array of counters with
d
rows andw
columns.Each row has its own hash function (h1, h2, h3..).
Insert Operation (Adding an item):
For an item
x
, hash it using alld
hash functions.For each hash, increment the corresponding counter in its row:
count[i][hash_i(x)] += 1
Query Operation (Getting frequency estimate of
x
):Hash
x
with the samed
hash functions.Fetch the counts from the corresponding cells.
Return the minimum value among those
d
counters:estimate = min(count[0][h1(x)], count[1][h2(x)], ..., count[d-1][hd(x)])
I have created a fun little app that lets you see the working of count min sketch. Adjust the number of rows and columns. Click to generate to a random number. The number is hashed n times. Each time it’s hashed a location for cell whose value needs to incremented by one is found. Clicking on a number follows a similar process. Except instead of incrementing the values we take the min of all cells to get the estimate for that item.
Can you get count min sketch to always get it right?
Fun facts
It is called count min sketch because it counts the min from a sketch (sketch is an compact summary of large dataset).
It has sub-linear space complexity meaning it takes less space than storing accurate count.
The reason it never underestimates is because counters can only ever be incremented, and the minimum count is taken.
Increasing
d
(rows) means higher probability of accurate results because more independent estimates but with more time complexity.Increasing
w
(columns) means better accuracy due to less chance of collision but more memory usage.
Demo
I have create a fun little app that puts all the pieces together to show you count min sketch would work. Here our app is guessing the frequency of fruits in a stream of 5000 fruits. Hit start and see a stream of fruits appear. See how the hash table is updated in real-time. Notice that the count min sketch never underestimates the real amount.
Mathematical Relationships
Error Bounds
Error in frequency estimate ≤ ε × N with probability 1 - δ
Where:
ε = error factor (e.g., 0.001 means 0.1% error)
N = total number of items processed
δ = failure probability (e.g., 0.01 means 99% confidence)
Formula for Parameters
Width:
w = ⌈e/ε⌉ (where e ≈ 2.718)
Depth:
d = ⌈ln(1/δ)⌉
Use Cases
Finding heavy hitters in a stream.
Detecting DDoS attack.
Tracking popular search queries in search engine
References
Subscribe to my newsletter
Read articles from Sagyam Thapa directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by

Sagyam Thapa
Sagyam Thapa
Passionate about all areas of computer science, currently exploring DevOps, SRE and Linux