TIL but from youtube: CVM algorithm

Introduction

This time its stuff related to data streams and how to optimise it to ensure we can get an approximation of number of distinct elements. So the video talks about the paper https://arxiv.org/abs/2301.10191. I did try to read it, some of it went over my head, some of it was fine, some of it was super cool, overall it was good, I had fun. The video helped a lot though.

Algorithm

So the paper talks about an estimator that processes the data streams and maintains a sample of the elements, we want to ensure we can estimate distinct number of elements without having to store them all which uses up a lot of memory, which as engineers, we do not like.

  • Initialisation:

    • Set the initial probability p to 1.

    • Initialise an empty set X.

    • Calculate the threshold as

$$\text{thresh} = \left\lceil \frac{12}{\epsilon^2} \log\left(\frac{8m}{\delta}\right) \right\rceil$$

where ϵ is the desired accuracy, δ is the failure probability, and m is the number of elements in the stream.

  • Processing the Stream:

    • For each element i​ in the stream:

    • Remove i from X (if it’s there).

    • With probability p, add i​ to X.

    • If the size of X reaches the threshold, halve the probability p and remove each element in X with probability 0.5 just like flipping a coin.

  • Output the Estimate:

    • At the end, the algorithm outputs

$$\frac{|X|}{p}$$

as the estimate for the number of distinct elements.

This method provides us a statistical model to estimate distinct elements.

If we take a look at accuracy once again.

  • Accuracy (ϵ): The smaller ϵ is, the more accurate the estimate needs to be, requiring a larger threshold.

  • Failure Probability (δ): The smaller δ is, the lower the probability of failure, also requiring a larger threshold.

  • Stream Length (m): The length of the data stream affects the threshold logarithmically. Longer streams mean a higher chance of reaching the threshold.

Coding time

import math
import random
def cvm(stream, epsilon, delta):
    p = 1
    X = set()
    m = len(stream)
    threshold = math.ceil(12/(epsilon**2)) * math.log(8 * m / delta)
    for i in stream:
        if i in X:
            X.remove(i)
        if random.random() < p:
            X.add(i)
        if(len(X) >= threshold):
            p /= 2
            X = {x for x in X if random.random() < 0.5}
    return len(X) / p

def generate_stream(length, range_start, range_end):
    list = [random.randint(range_start, range_end) for _ in range(length)]
    return list

# Example usage
stream_length = 10000  # Length of the random stream
range_start = 1
range_end = 8000
stream = generate_stream(stream_length, range_start, range_end)
epsilon = 0.1
delta = 0.05
estimate = cvm(stream, epsilon, delta)
print("Estimated number of distinct elements:", estimate)
0
Subscribe to my newsletter

Read articles from Sammith S Bharadwaj directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Sammith S Bharadwaj
Sammith S Bharadwaj