Why Consistent Hashing?


“Regular” Hashing
Say we’re building a contact management system - given a contact number, we want to display the contact’s name. Let’s start with a simple hash table that:
Uses a simple modulo function to hash contact numbers
Uses separate chaining to handle collisions
Starting off with ‘modulo 5’ and 3 phone numbers, this hash table would look something like:
Bucket 0 -> 3105550000
Bucket 1 ->
Bucket 2 ->
Bucket 3 -> 7028885678
Bucket 4 -> 9845551234
So far, so good, but what if more numbers are added to the map? Collisions.
Bucket 0 -> 3105550000
Bucket 1 -> 6467771111
Bucket 2 -> 9174442222
Bucket 3 -> 7028885678
Bucket 4 -> 9845551234, 4152223344 (collision)
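The table above can be sketched in a few lines of Python, using a list per bucket for separate chaining (the helper name `add_contact` is just for illustration):

```python
# Minimal hash table sketch: 'modulo 5' hashing with separate chaining.
NUM_BUCKETS = 5
buckets = [[] for _ in range(NUM_BUCKETS)]

def add_contact(number: int) -> None:
    # Chaining: colliding numbers are simply appended to the same bucket's list.
    buckets[number % NUM_BUCKETS].append(number)

for n in [3105550000, 6467771111, 9174442222, 7028885678,
          9845551234, 4152223344]:
    add_contact(n)

# Bucket 4 now holds two numbers: the collision handled by chaining.
```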
Rehashing
At this point, we might decide to use a different hash function with more buckets, since any new contact will surely cause a collision. Why don’t we expand the map to 15 buckets, using ‘modulo 15’ this time? This process is called rehashing.
Bucket 4 -> 9845551234
Bucket 7 -> 9174442222
Bucket 8 -> 7028885678
Bucket 9 -> 4152223344
Bucket 10 -> 3105550000
Bucket 11 -> 6467771111
(the other buckets remain empty)
Great! No collisions, at least for now. However, we had to reconstruct the hash table all over again, and move all elements to their new buckets. That’s an expensive operation - the time complexity of rehashing is O(n).
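As a sketch, rehashing the 'modulo 5' table into 15 buckets looks like this (the `rehash` helper is illustrative, not a standard API):

```python
def rehash(buckets: list, new_size: int) -> list:
    """Rebuild the table with new_size buckets; every key moves, so O(n)."""
    new_buckets = [[] for _ in range(new_size)]
    for bucket in buckets:
        for number in bucket:
            new_buckets[number % new_size].append(number)
    return new_buckets

# Start from the 'modulo 5' table with the collision in bucket 4...
old = [[3105550000], [6467771111], [9174442222],
       [7028885678], [9845551234, 4152223344]]
new = rehash(old, 15)
# ...and the keys spread out into the 15-bucket layout shown above.
```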
The load factor (the ratio of elements to buckets) is commonly used to decide when to rehash. For instance, when the load factor of a Java HashMap exceeds 0.75, the number of buckets is doubled, and all the keys get rehashed to new buckets.
Since rehashing occurs infrequently, adding new keys to a hash table is amortized O(1) complexity. Clearly, for regular hash tables, it’s not too big a deal. But what if we’re dealing with a scenario where the number of buckets keeps changing constantly?
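The load-factor rule is simple enough to sketch directly (the threshold mirrors Java's HashMap default; the function name is made up for illustration):

```python
LOAD_FACTOR_THRESHOLD = 0.75  # Java HashMap's default load factor

def needs_rehash(num_elements: int, num_buckets: int) -> bool:
    # Rehash once the ratio of elements to buckets crosses the threshold.
    return num_elements / num_buckets > LOAD_FACTOR_THRESHOLD

# With 4 elements in 5 buckets, the load factor is 0.8: time to grow.
```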
A Web Cache
Consider a distributed web cache with a network of servers that cache websites to reduce load times. We naturally require hashing to determine which server contains the cached copy of a particular website. Let's start with 3 servers:
import hashlib

NUMBER_OF_SERVERS = 3

def map_website_to_server(website: str):
    website_bytes = website.encode()
    sha256_hash = hashlib.sha256(website_bytes).hexdigest()
    return int(sha256_hash, 16) % NUMBER_OF_SERVERS
Here’s our cache containing a few popular websites:
Server 0 -> google.com
Server 1 -> youtube.com
Server 2 -> linkedin.com
Here comes the problem - what if a server goes down? What if we want to add a new server to scale the system up? In either case, the modulo changes, so nearly every key maps to a different server and must be rehashed. This makes our system inefficient at handling changes in the server network.
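A quick sketch makes the problem concrete, reusing the same SHA-256-modulo scheme as above (the sample hostnames are arbitrary):

```python
import hashlib

def server_for(website: str, num_servers: int) -> int:
    # Same scheme as map_website_to_server, but with the server count as a parameter.
    digest = hashlib.sha256(website.encode()).hexdigest()
    return int(digest, 16) % num_servers

sites = ["google.com", "youtube.com", "linkedin.com",
         "amazon.com", "wikipedia.org"]

# Count how many sites change servers when a fourth server is added.
moved = sum(server_for(s, 3) != server_for(s, 4) for s in sites)
# With modulo hashing, most keys typically land on a different server.
```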
The Solution: Consistent Hashing
Consistent hashing solves this problem by hashing both the websites and the cache servers to a common range of values, arranged in a circle. Here's one example that hashes all websites and servers to a value between 0 and 1:
To find the server that a website gets mapped to, simply find the server closest to the website's hash value in the clockwise direction. In this example, amazon.com gets cached in Server 1, and linkedin.com is handled by Server 3. When a server goes down, only the websites that were cached on that server get moved to the next server in the circle, leaving the other websites unaffected.
Assuming keys and servers are evenly distributed in the circle, the affected keys are 1/n of the total number of keys, where n is the number of servers.
import hashlib

class ConsistentHasher:
    def __init__(self, servers):
        self.servers = servers
        # Place each server on the ring at its hash value, sorted clockwise
        self.ring = [(self._hash(server), server) for server in self.servers]
        self.ring.sort()

    def _hash(self, key):
        hash_value = hashlib.sha256(key.encode()).hexdigest()
        int_hash = int(hash_value, 16)
        return round((int_hash % (10**8)) / 10**8, 2)  # Scale to [0.00, 1.00]

    def get_server(self, website):
        website_hash = self._hash(website)
        # Walk clockwise: the first server at or past the website's hash owns it
        for server_hash, server in self.ring:
            if website_hash <= server_hash:
                return server
        return self.ring[0][1]  # if we don't find a server, loop back to the first one
ch = ConsistentHasher(['server1', 'server2'])
ch.get_server("google.com")
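To sanity-check the 1/n claim, here's a standalone sketch (site and server names are made up) that drops one server from a three-server ring and verifies that only the keys on the removed server move:

```python
import hashlib

def ring_hash(key: str) -> float:
    # Same scaling as ConsistentHasher._hash: map into [0.00, 1.00].
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return round((h % 10**8) / 10**8, 2)

def owner(website: str, servers: list) -> str:
    # First server clockwise from the website's position owns it.
    ring = sorted((ring_hash(s), s) for s in servers)
    wh = ring_hash(website)
    for server_hash, server in ring:
        if wh <= server_hash:
            return server
    return ring[0][1]  # wrap around the circle

sites = [f"site{i}.com" for i in range(1000)]
before = {s: owner(s, ["server1", "server2", "server3"]) for s in sites}
after = {s: owner(s, ["server1", "server2"]) for s in sites}  # server3 removed
moved = sum(before[s] != after[s] for s in sites)
# Only the sites that lived on server3 move; every other site keeps its server.
```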
Boom! We now have a flexible hashing mechanism that allows us to add and remove buckets with minimal remapping of keys. Consistent hashing also finds applications in distributed databases, content delivery networks, and load balancing.
Written by Saketh Raman KS