Consistent Hashing: Explained with Implementation Steps


In distributed systems, managing data placement and load balancing efficiently is crucial. One powerful tool for addressing these challenges is consistent hashing. This blog will explore consistent hashing, provide an example, and discuss why it is superior to other approaches in certain scenarios.
What is Consistent Hashing?
To understand consistent hashing, it is helpful to first examine traditional hashing and its limitations.
Traditional Hashing:
In traditional hashing, a hash function maps keys directly to buckets (nodes).
A hash function is any function that maps values from an arbitrarily sized domain to a fixed-size domain, usually called the hash space. The values generated by these hash functions are typically used as keys to enable efficient lookups of the original entity.
For example:
Suppose you have a Distributed File Store system where users can upload and read files. The files are stored across 4 servers, and the hash function assigns files to servers using the formula
hash(fileName) % number_of_servers
If the number of servers is 4, and hash(fileName) returns 9 for a specific file (e.g., "image.png"), it is stored on server 9 % 4 = 1. Similarly, a request to read "image.png" is routed to server 1 using the same logic.
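As a minimal sketch of this scheme (the class and method names are illustrative, and String.hashCode() stands in for a real hash function), the modulo placement looks like this:

```java
public class ModuloPlacement {
    // Map a file to one of numServers servers using hash % numServers.
    // floorMod keeps the result non-negative even for negative hash codes.
    static int serverFor(String fileName, int numServers) {
        return Math.floorMod(fileName.hashCode(), numServers);
    }

    public static void main(String[] args) {
        // The same file name always routes to the same server.
        System.out.println("image.png -> server " + serverFor("image.png", 4));
    }
}
```

Because the mapping is a pure function of the file name and the server count, every upload and every read deterministically lands on the same server.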
Limitations of Traditional Hashing:
High Disruption with Node Changes:
If a new server is added or an existing server is removed, almost all keys need to be rehashed and redistributed.
- For example, consider a Distributed File Store system where users upload and read files, and files are distributed across 4 servers using hash(fileName) % 4. If a new server is added (making it 5 servers), the formula changes to hash(fileName) % 5. As a result, files previously mapped to a specific server will now likely be assigned to different servers. For instance, a file that was on server 3 with hash(fileName) % 4 = 3 might now move to server 4 with hash(fileName) % 5 = 4.
This results in significant overhead and potential performance degradation.
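A quick sketch makes the scale of the disruption concrete. This toy program (key names and String.hashCode() are stand-ins for real files and a real hash function) counts how many of 10,000 keys change servers when a fifth server is added:

```java
public class RehashCost {
    static int serverFor(String key, int numServers) {
        return Math.floorMod(key.hashCode(), numServers);
    }

    public static void main(String[] args) {
        int moved = 0, total = 10_000;
        for (int i = 0; i < total; i++) {
            String key = "file-" + i;
            // Compare placement with 4 servers vs. 5 servers.
            if (serverFor(key, 4) != serverFor(key, 5)) {
                moved++;
            }
        }
        // With a roughly uniform hash, about (n-1)/n of keys move;
        // here that is around 80% of them.
        System.out.println(moved + " of " + total + " keys were remapped");
    }
}
```

A key keeps its server only when hash % 4 equals hash % 5, which happens for just 4 of every 20 hash values, so roughly 80% of keys must be migrated.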
Load Imbalance:
- If the hash function does not distribute keys evenly, some servers may become overloaded while others remain underutilized.
Scalability Issues:
- Scaling up or down in response to load is not seamless due to the need for global rehashing.
How Consistent Hashing is Better:
Consistent hashing addresses these issues by using a different approach. Instead of directly mapping keys to nodes, both keys and nodes are placed on a virtual ring. Keys are assigned to the nearest node in the clockwise direction.
When a node is added to the ring, only the keys that fall between the new node and its predecessor are remapped (they move from the successor node to the new one). Similarly, when a node is removed, only that node's keys are remapped, to its successor. This ensures that when a node is added or removed, only a small subset of keys needs to move, making the system far more resilient to change.
The Key Idea:
The hash space is visualised as a circle (0 to 2^m - 1, where m is the number of bits in the hash).
Each node (e.g., server) is assigned a position on the circle using a hash function.
Each key is also assigned a position on the circle.
A key is assigned to the first node clockwise from its position.
Example
Suppose we have four servers (N1, N2, N3, and N4) in a distributed file storage system, where users upload and read files, and we use consistent hashing to distribute files.
We create 3 virtual nodes for each server so that load is distributed more evenly among them, reducing the possibility of cascading failure.
Step 1: Assign Nodes to the Ring
Server N1's virtual nodes N1a, N1b, and N1c are hashed to positions 10, 60, and 80.
Server N2's virtual nodes N2a, N2b, and N2c are hashed to positions 30, 90, and 110.
Server N3's virtual nodes N3a, N3b, and N3c are hashed to positions 120, 20, and 50.
Server N4's virtual nodes N4a, N4b, and N4c are hashed to positions 40, 70, and 100.
Step 2: Map Files to the Ring
File F1 ("image1.png") is hashed to position 5.
File F2 ("doc1.pdf") is hashed to position 25.
File F3 ("video1.mp4") is hashed to position 70.
Step 3: Place Files
F1 (position 5) is assigned to Server N1a (first node clockwise).
F2 (position 25) is assigned to Server N2a.
F3 (position 70) is assigned to Server N4b.
Adding a new Node
Suppose a new server, N5, is added, and one of its virtual nodes is hashed to position 27.
Only one file (F2, at position 25) is reassigned, from N2 to N5, illustrating minimal disruption.
Removing a Node
Suppose Server N4 goes down and is removed. Only the keys mapped to N4's virtual nodes are remapped.
Here, only file F3 (position 70) is reassigned, to virtual node N1c (position 80), the first node clockwise after the removal.
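The placements in this worked example can be checked with a small sketch that uses the assumed ring positions above directly, instead of a real hash function:

```java
import java.util.*;

public class RingExample {
    // Build the ring from the example's virtual-node positions.
    static TreeMap<Integer, String> buildRing() {
        TreeMap<Integer, String> ring = new TreeMap<>();
        ring.put(10, "N1"); ring.put(60, "N1"); ring.put(80, "N1");
        ring.put(30, "N2"); ring.put(90, "N2"); ring.put(110, "N2");
        ring.put(120, "N3"); ring.put(20, "N3"); ring.put(50, "N3");
        ring.put(40, "N4"); ring.put(70, "N4"); ring.put(100, "N4");
        return ring;
    }

    // First node at or clockwise after the given position (wrapping around).
    static String nodeFor(TreeMap<Integer, String> ring, int position) {
        Map.Entry<Integer, String> e = ring.ceilingEntry(position);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> ring = buildRing();
        System.out.println("F1 (5)  -> " + nodeFor(ring, 5));   // N1
        System.out.println("F2 (25) -> " + nodeFor(ring, 25));  // N2
        System.out.println("F3 (70) -> " + nodeFor(ring, 70));  // N4

        // Remove N4's virtual nodes; F3 falls through to N1 (position 80).
        ring.remove(40); ring.remove(70); ring.remove(100);
        System.out.println("F3 (70) -> " + nodeFor(ring, 70));  // N1
    }
}
```

A sorted map's ceilingEntry gives the "first node clockwise" lookup directly, with a wrap-around to the first entry when a key hashes past the last node on the ring.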
Implementation Details
Here is a basic Java implementation of consistent hashing:
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

public class ConsistentHashing {
    private final int numReplicas;
    private final SortedMap<Integer, String> ring;

    public ConsistentHashing(int numReplicas) {
        this.numReplicas = numReplicas;
        this.ring = new TreeMap<>();
    }

    // Hash function - first 4 bytes of MD5, packed into an int
    private int hash(String key) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(key.getBytes());
            return ((digest[0] & 0xFF) << 24) | ((digest[1] & 0xFF) << 16)
                 | ((digest[2] & 0xFF) << 8) | (digest[3] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    // Add a node to the ring, with one position per virtual node (replica)
    public void addNode(String node) {
        for (int i = 0; i < numReplicas; i++) {
            String replicaKey = node + ":" + i;
            ring.put(hash(replicaKey), node);
        }
    }

    // Remove a node and all of its virtual nodes from the ring
    public void removeNode(String node) {
        for (int i = 0; i < numReplicas; i++) {
            String replicaKey = node + ":" + i;
            ring.remove(hash(replicaKey));
        }
    }

    // Get the node responsible for the given key (first node clockwise)
    public String getNode(String key) {
        if (ring.isEmpty()) {
            return null;
        }
        int hashKey = hash(key);
        if (!ring.containsKey(hashKey)) {
            SortedMap<Integer, String> tailMap = ring.tailMap(hashKey);
            hashKey = tailMap.isEmpty() ? ring.firstKey() : tailMap.firstKey();
        }
        return ring.get(hashKey);
    }

    public static void main(String[] args) {
        ConsistentHashing ch = new ConsistentHashing(3);
        ch.addNode("A");
        ch.addNode("B");
        ch.addNode("C");
        System.out.println(ch.getNode("K1")); // Node responsible for K1
        System.out.println(ch.getNode("K2")); // Node responsible for K2
        ch.addNode("D"); // Add a new node
        System.out.println(ch.getNode("K2")); // Node responsible for K2 after adding D
    }
}
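As a usage sketch, the following self-contained variant of the same idea (using String.hashCode() in place of MD5 so it stands alone; class name, replica count, and key names are illustrative) measures how few keys move when a fourth node joins the ring:

```java
import java.util.*;

public class MinimalMovementDemo {
    static final int REPLICAS = 50;

    // Place one ring position per virtual node of the given server.
    static void addNode(TreeMap<Long, String> ring, String node) {
        for (int i = 0; i < REPLICAS; i++) {
            ring.put(Integer.toUnsignedLong((node + ":" + i).hashCode()), node);
        }
    }

    // First node at or clockwise after the key's position (wrapping around).
    static String getNode(TreeMap<Long, String> ring, String key) {
        Map.Entry<Long, String> e =
            ring.ceilingEntry(Integer.toUnsignedLong(key.hashCode()));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        TreeMap<Long, String> ring = new TreeMap<>();
        addNode(ring, "A"); addNode(ring, "B"); addNode(ring, "C");

        int total = 10_000;
        String[] before = new String[total];
        for (int i = 0; i < total; i++) before[i] = getNode(ring, "K" + i);

        addNode(ring, "D"); // a new node joins the ring

        int moved = 0;
        for (int i = 0; i < total; i++) {
            if (!before[i].equals(getNode(ring, "K" + i))) moved++;
        }
        // Typically around 1/4 of the keys move, all of them to D;
        // no keys shuffle between A, B, and C.
        System.out.println(moved + " of " + total + " keys moved");
    }
}
```

Adding entries to the ring can only redirect a key to one of the new node's positions, which is why every remapped key lands on D and the rest stay put.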
Benefits of Consistent Hashing
Minimal Key Movement:
- When a node joins or leaves, only a small portion of keys are remapped. This is in contrast to traditional hashing, where all keys might need to be redistributed.
Load Balancing:
- Keys are distributed across nodes more evenly, especially when using techniques like virtual nodes (assigning multiple positions for each physical node on the ring).
Scalability:
- Adding or removing nodes is seamless, making consistent hashing ideal for systems with dynamic scaling requirements, such as cloud-based applications.
Fault Tolerance:
- When a node fails, its keys are redistributed to adjacent nodes on the ring, ensuring system continuity.
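The load-balancing point above can be sketched as well. This toy program (node and key names are arbitrary, and String.hashCode() stands in for a real hash function) compares the busiest server's share of 10,000 keys across 5 servers with 1 versus 100 virtual nodes each:

```java
import java.util.*;

public class VirtualNodeBalance {
    // Place numNodes servers on the ring, each with `replicas` virtual nodes.
    static TreeMap<Long, String> buildRing(int numNodes, int replicas) {
        TreeMap<Long, String> ring = new TreeMap<>();
        for (int n = 0; n < numNodes; n++) {
            for (int r = 0; r < replicas; r++) {
                long pos = Integer.toUnsignedLong(("node" + n + ":" + r).hashCode());
                ring.put(pos, "node" + n);
            }
        }
        return ring;
    }

    static String nodeFor(TreeMap<Long, String> ring, String key) {
        Map.Entry<Long, String> e =
            ring.ceilingEntry(Integer.toUnsignedLong(key.hashCode()));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    // Busiest server's key count for 10,000 keys over 5 servers.
    static int maxLoad(int replicas) {
        TreeMap<Long, String> ring = buildRing(5, replicas);
        Map<String, Integer> load = new HashMap<>();
        for (int i = 0; i < 10_000; i++) {
            load.merge(nodeFor(ring, "key-" + i), 1, Integer::sum);
        }
        return Collections.max(load.values());
    }

    public static void main(String[] args) {
        // With more virtual nodes, the busiest server's share tends to sit
        // much closer to the ideal 10,000 / 5 = 2,000 keys.
        System.out.println("1 vnode per server:    max load " + maxLoad(1));
        System.out.println("100 vnodes per server: max load " + maxLoad(100));
    }
}
```

With a single position per server, one server can own a huge arc of the ring by bad luck; many virtual nodes chop the ring into small interleaved arcs, so each server's total share averages out.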
Comparison with Other Hashing Techniques
Feature | Traditional Hashing | Consistent Hashing |
Key Movement on Changes | High (many keys remapped) | Low (few keys remapped) |
Scalability | Poor (requires full rehash) | Excellent |
Load Balancing | Depends on hash function | Enhanced with virtual nodes |
Resilience to Failures | Limited | High |
Applications of Consistent Hashing
Distributed Caching:
- Systems like Memcached and Redis use consistent hashing to distribute keys across nodes.
Load Balancers:
- Consistent hashing helps in assigning incoming requests to servers in web applications.
Distributed Databases:
- Databases like Cassandra and DynamoDB leverage consistent hashing for data partitioning and replication.
Conclusion
Consistent hashing is a cornerstone of modern distributed systems, enabling efficient and resilient data placement. Its ability to handle dynamic changes with minimal disruption makes it a go-to strategy for scalable and fault-tolerant applications.
Whether you’re building a distributed cache, a load balancer, or a database, understanding and implementing consistent hashing can significantly enhance your system's performance and reliability.
Written by Anish Ratnawat