Probabilistic Data Structures Part 1

Amr Elhewy

Introduction

Today is all about probabilistic data structures: data structures that use randomization and approximation to store and process large-scale data efficiently. They trade perfect accuracy for lower memory usage and faster performance, which makes them useful for big data, streaming data, and distributed systems.

Typical use cases include counting occurrences across millions of records, estimating the cardinality (number of distinct items) in a data set, or checking whether something exists among millions of records. The idea is to sacrifice perfect accuracy for scalability, provided the business is okay with approximate answers.

I’ll be going through the ones I’ve used and a few more. They are as follows:

  1. Count-Min Sketch ✅

  2. Bloom Filter ✅

  3. HyperLogLog ✅

  4. Skip Lists ✅

  5. K-Minimum Values

  6. LogLog & SuperLogLog

  7. Top K & Heavy Hitters

Count-Min Sketch

Let’s say we are receiving a continuous stream of view events like {"video_id": 1}, and we’re required to count them per video. The most straightforward way of doing this is a HashMap keyed by video_id, with the total count as the value.
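For reference, a minimal sketch of that exact-counting approach (Python, with a hypothetical three-event stream) could look like this:

```python
from collections import defaultdict

# Exact counts: one entry per distinct video_id, so memory grows with the catalogue.
view_counts = defaultdict(int)

for event in [{"video_id": 1}, {"video_id": 2}, {"video_id": 1}]:  # hypothetical stream
    view_counts[event["video_id"]] += 1

print(view_counts[1])  # 2
```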

The problem is that there are a lot of videos, so the hash map grows significantly and becomes memory inefficient. If the business requirement is not strict about showing exact view counts, a Count-Min Sketch is the way to go.

Count-Min Sketch is a probabilistic data structure that provides an approximate frequency count of elements in a data stream using constant space.

Instead of storing each video’s view count in a HashMap, which grows as the number of videos increases, we use a 2D array (matrix) of counters with multiple hash functions to track approximate counts.

Step by step breakdown:

  1. Initialize a matrix (w × d)

    • w columns (width): Represents the number of counters in each row.

    • d rows (depth): Each row corresponds to a different hash function.

    • All counters start at 0.

  2. Updating the Count

    • When a new view event { "video_id": 1 } arrives:

      • The video ID is hashed using d different hash functions.

      • Each hash function maps the video ID to a column in its respective row.

      • The counters at those positions are incremented.

  3. Querying the Count

    • To estimate the view count of a specific video ID, hash it with the same d hash functions, taking each result modulo the width w to get a column index. The more columns you add, the better the accuracy, but also the more space.

    • Look at the corresponding positions in each row and take the minimum value across all rows.

    • This minimizes the effect of hash collisions (hence "count-min").

The reason we take the minimum is that hash collisions, where different video IDs land on the same row and column, can only inflate a counter, never decrease it. Every counter is therefore an overestimate, and the smallest one is the closest to the true count. We can also use 5 hash functions instead of 3 to increase accuracy, but adding more hash functions isn’t always beneficial: past a certain point, extra rows no longer offer any significant improvement in accuracy and just cost more space.
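To make the mechanics concrete, here is a minimal Count-Min Sketch in Python. It is an illustrative sketch rather than a production implementation: the width, depth, and the salted-SHA-256 trick used to simulate d independent hash functions are arbitrary choices.

```python
import hashlib

class CountMinSketch:
    """A minimal Count-Min Sketch: a d x w matrix of counters."""

    def __init__(self, width=1000, depth=3):
        self.width = width          # w: counters per row
        self.depth = depth          # d: number of hash functions / rows
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Simulate d different hash functions by salting one hash with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        # Increment one counter per row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Take the minimum across rows to dampen collisions,
        # which can only ever inflate a counter.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


# Hypothetical usage with video view events:
cms = CountMinSketch()
for event in [{"video_id": 1}, {"video_id": 2}, {"video_id": 1}]:
    cms.add(event["video_id"])
print(cms.estimate(1))  # ~2 (exact here; may overestimate under heavy collisions)
```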

Bloom Filters

Another cool probabilistic data structure: a Bloom filter is a space-efficient structure that tells you whether an element is definitely not in a set, or might be in it.

In other words, it may give false positives but never false negatives, so an element either definitely doesn’t exist or may exist.

Similar to the Count-Min Sketch, a value is passed through multiple hash functions; the results map to different indexes in a fixed-size bit array, and the bits at those indexes are set to 1.

To look up a value, we pass it through the same hash functions and check whether the bits at the resulting indexes are 1. If any of them is 0, the value is definitely not in the set; if they are all 1, it may be in the set, or those bits may simply have been set by collisions with other members.

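Here is a minimal Bloom filter sketch in Python to show these mechanics; the bit-array size, the number of hash functions, and the salted-SHA-256 hashing are illustrative assumptions.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: a fixed-size bit array plus k hash functions."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _indexes(self, item):
        # Simulate k independent hash functions by salting one hash with i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # Any zero bit means "definitely not present"; all ones means "maybe present".
        return all(self.bits[idx] for idx in self._indexes(item))


bf = BloomFilter()
bf.add("video:1")
print(bf.might_contain("video:1"))    # True  (maybe present)
print(bf.might_contain("video:999"))  # almost certainly False (definitely not in the set);
                                      # a rare false positive would print True
```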

Skip Lists

Skip lists are very efficient data structures that speed up searching, insertion, and deletion in sorted linked lists. Searching a sorted linked list takes O(n) time because you have to traverse node by node to reach the one you want. Skip lists bring this down to O(log n) on average.

They are simply extra levels stacked on top of a basic linked list; the levels are called express lanes (faster routes). For example, an express lane of 1 → 4 → 6 sitting above a base list of 1 → 2 → 3 → 4 → 5 → 6.

When traversing, we start at the top-left node and keep checking the next station in the express lane. For example, if we’re looking for the value 5:

  1. Start at the top-left 1 and look at the next station in the express lane, which is 4.

  2. Since 5 > 4, we move forward along the express lane.

  3. We check the next express-lane station, which is 6.

  4. Since 6 > 5, we drop down and traverse the basic linked list until we find 5.

This can extend to multiple levels, with even sparser lanes stacked above the first express lane. Skipping over many nodes this way speeds up search, insertion, and deletion considerably.

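As a rough sketch of how this looks in code, here is a minimal randomized skip list in Python; the maximum level, the 0.5 promotion probability, and the node layout are illustrative choices.

```python
import random

class Node:
    def __init__(self, value, level):
        self.value = value
        # forward[i] points to the next node at level i (level 0 is the base list).
        self.forward = [None] * (level + 1)

class SkipList:
    """A minimal randomized skip list supporting insert and search."""

    MAX_LEVEL = 4
    P = 0.5  # probability of promoting a node one level up

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)
        self.level = 0  # highest level currently in use

    def _random_level(self):
        lvl = 0
        while random.random() < self.P and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, value):
        update = [self.head] * (self.MAX_LEVEL + 1)
        node = self.head
        # Walk down from the top express lane, recording where we drop a level.
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].value < value:
                node = node.forward[i]
            update[i] = node
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new_node = Node(value, lvl)
        for i in range(lvl + 1):
            new_node.forward[i] = update[i].forward[i]
            update[i].forward[i] = new_node

    def search(self, value):
        node = self.head
        # Move right along each express lane while the next value is still too small,
        # then drop down a level; finish on the base linked list.
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].value < value:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.value == value


sl = SkipList()
for v in [1, 2, 3, 4, 5, 6]:
    sl.insert(v)
print(sl.search(5))  # True
print(sl.search(7))  # False
```

Each inserted value is promoted to a higher lane with probability 0.5 per level, which is what keeps the expected search cost at O(log n).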

HyperLogLog

This one is all about cardinality: it gives you an estimate of how many unique items exist in a set of elements, using very little memory.

Think of a lottery where you randomly pick a number between 1 and 100.

  • If you only draw 5 numbers, you probably won’t get anything close to 100.

  • If you draw a million numbers, chances are, you'll eventually get 100.

🔸 The more numbers you pick, the higher the chance of getting a big number.

HyperLogLog works the same way, but instead of picking numbers, it’s looking at how many leading zeroes appear in a hash.

  1. Each and every element gets hashed into a random-looking binary value (the binary representation of its hash).

  2. Once we start hashing elements, we look at where the first 1 appears in the binary hash and remember the biggest position we've seen so far.

The more unique values we hash, the higher the chance that one of them has many leading zeroes.

If you’ve seen a hash whose first 1 is at position 7, you must have seen a lot of unique elements to hit such a rare case: only about one in every 2⁷ = 128 hashes starts with six zeroes followed by a 1.

HyperLogLog tracks the largest position seen and uses math to estimate how many unique elements must exist for that to happen.
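As a quick illustration, here is one way to compute that first-1 position in Python; truncating an MD5 hash to 32 bits is just an arbitrary choice for the sketch.

```python
import hashlib

def first_one_position(item, bits=32):
    """Position (1-based, from the left) of the first 1 bit in item's hash."""
    # Hash the item and keep the first `bits` bits of the 128-bit MD5 digest.
    digest = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
    h = digest >> (128 - bits)
    for pos in range(1, bits + 1):
        if h & (1 << (bits - pos)):  # test the pos-th bit from the left
            return pos
    return bits + 1  # all zeros (extremely unlikely)

# The largest position seen across many items drives the estimate:
print(max(first_one_position(i) for i in range(1000)))  # typically around 10 for ~1000 items
```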

Once items are hashed, they are also split into different buckets. Buckets help smooth out extreme cases, for example when we’ve only seen 3 items but one of them happens to have its leftmost 1 at position 7. Let me explain.

Buckets are small memory slots where HyperLogLog stores information separately for different groups of elements.

Imagine you're counting unique people entering a stadium.

  • If you only look at one entrance, you might get a bad estimate because not all people use that entrance.

  • Instead, you track multiple entrances separately and combine the results to get a better estimate.

Each bucket only remembers the biggest value it has seen.

Now, instead of estimating based on one extreme value, we take an average of all the buckets.

For example:

  • If all buckets saw only small values (1, 1, 2), we probably have few unique elements.

  • If some buckets saw high values (like 5, 6, 7), that suggests we have a lot of unique elements.

It can be a bit confusing, but the takeaway is simple: the more buckets you use, the better the accuracy, at the cost of a little more memory.

Then we combine the values of all the buckets, using a harmonic mean rather than a plain average, and plug the result into a formula that estimates the number of unique elements.
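Putting it all together, here is a minimal HyperLogLog sketch in Python. It follows the shape described above (the first bits of the hash pick a bucket, each bucket keeps the maximum first-1 position, and the buckets are combined with a harmonic mean and a correction constant), but it is only an illustrative sketch: the hash and bucket count are arbitrary choices, and the small- and large-range corrections of the full algorithm are omitted.

```python
import hashlib

class HyperLogLog:
    """A minimal HyperLogLog sketch with 2**p buckets."""

    def __init__(self, p=10):
        self.p = p                   # first p bits of the hash pick a bucket
        self.m = 1 << p              # number of buckets
        self.buckets = [0] * self.m  # each bucket stores the max first-1 position seen

    def add(self, item):
        # Keep the low 64 bits of the MD5 hash.
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        bucket = h >> (64 - self.p)            # first p bits choose the bucket
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining bits give the "lottery draw"
        # Position of the first 1 bit in the remaining 64 - p bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.buckets[bucket] = max(self.buckets[bucket], rank)

    def estimate(self):
        # Combine buckets with a harmonic mean, scaled by a correction constant.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        harmonic = sum(2.0 ** -r for r in self.buckets)
        return alpha * self.m * self.m / harmonic


hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))  # roughly 100,000, typically within a few percent
```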
