Exploring Hashing
Introduction
Imagine you're tasked with creating a digital English dictionary. The simplest method might be to replicate a physical dictionary into a text file. However, this makes word searches as cumbersome as flipping through the pages of a book. We need a better solution.
1. The Magic of Hashing
To elevate our digital dictionary, we think of storing words such that they can be quickly retrieved. One method is storing the words in a sorted manner and then using binary search, but this results in a time complexity based on the logarithm of the number of words.
Instead, we ponder over a 'magical transformation function' that would convert each word into a unique number and use it as an index to store the word's meaning in an array. This magic is what we call a hashing function.
2. Crafting the Ideal Hash Function
For a hash function to be efficient:
It should convert an object into a number between 0 to N, where N is the array length.
It must be pure and deterministic. The same input always yields the same output in constant time.
Consider our initial hash function:
Convert characters to numbers: a-1, b-2... z-26. Sum these numbers and take the remainder when divided by a constant N.
This raises two pertinent questions:
What's the optimal value for N?
What happens when two different words result in the same number?
3. The Collision Conundrum
When different inputs produce the same output in a hashing function, it's termed as a collision. A crucial measure of a hash function's efficiency is its ability to prevent collisions.
Enter the pigeonhole principle: If you have N pigeonholes and more than N pigeons, at least one hole will contain more than one pigeon. Similarly, with N possible output values from a hash function, hashing more than N items guarantees at least one collision.
Would increasing N solve our problem? Let's evaluate.
By converting each character to its respective number, concatenating them, and considering a word limit of 10 characters, our hash function could yield a maximum value of 2626262626... - quite enormous. An array to accommodate this would be impractical.
The conclusion? Maintain a reasonable N and devise strategies to handle collisions.
4. Tackling Collisions
A few methods to handle collisions include:
Open Addressing (like linear probing or double hashing): When a collision occurs, we look for the next available slot.
Chaining: Each array index points to a list. In case of collisions, entries are added to the respective list.
5. Cryptographic Hashing: Beyond the Basics
While the methods we've discussed are suitable for general use-cases, there's a category of hash functions built for security: cryptographic hash functions.
a. The Giants: MD-5, SHA-1, and SHA-256
Cryptographic hashing functions like MD-5, SHA-1, and SHA-256 have been at the forefront of data integrity and security.
MD-5 (Message Digest Algorithm 5): It produces a 128-bit hash value, typically rendered as a 32-digit hexadecimal number. Although widely used in the past, vulnerabilities have been discovered, making it unsuitable for further use.
SHA-1 (Secure Hash Algorithm 1): Produces a 160-bit hash value, commonly represented as a 40-digit hexadecimal number. Like MD-5, SHA-1 is no longer considered secure against well-funded attackers.
SHA-256: Part of the SHA-2 (Secure Hash Algorithm 2) family, it returns a 256-bit hash value. SHA-256 remains robust and is commonly used for securing sensitive data.
b. The Art of Finding Collisions
Finding two different inputs that produce the same hash is known as a collision. For cryptographic hash functions, this is a significant flaw. With advancements in computing, finding collisions in algorithms like MD-5 and SHA-1 has become feasible, leading to their deprecation.
However, with SHA-256, finding collisions remains computationally infeasible due to its design and the vastness of its output space.
Image prompt for DALLE-3: A maze representing the process of finding collisions, with paths leading to dead ends for SHA-256.
c. Bitcoin and the Quest for Collisions
The cryptocurrency Bitcoin employs SHA-256 for its proof-of-work mechanism. In essence, miners hunt for specific hash values. While not exactly searching for collisions, the process involves trying a vast number of inputs to find a hash that meets specific criteria. To incentivize this computationally intense task, miners are rewarded with bitcoins when they succeed.
The design of Bitcoin's protocol showcases the robustness of SHA-256 and the current impracticality of finding collisions.
6. Consistent Hashing: The Game-Changer in Distributed Systems
At its core, consistent hashing is a technique used to distribute data across multiple servers or buckets, minimizing data reshuffling when servers are added or removed. While consistent hashing can be used with a variety of hash functions, let's zoom in on its application with SHA-1.
a. How does Consistent Hashing with SHA-1 work?
The Hash Ring: Imagine a circle, termed the "hash ring". The goal is to map both data and servers onto this ring.
Mapping Data: When we have a piece of data (like a word from our dictionary), we hash it using SHA-1. The resulting 160-bit value determines its position on the hash ring.
Mapping Servers: Servers (or buckets) are also mapped onto this ring. We can hash the server's IP or a unique identifier using SHA-1 to get its position.
Data Placement: Once both data and servers are mapped, each piece of data is assigned to the first server it encounters moving clockwise on the ring.
b. The Magic of Minimal Reshuffling:
Let's illustrate this with an example:
Suppose we have 3 servers initially, and our words are distributed among them.
Later, we introduce a 4th server. In traditional hashing, we'd likely have to reshuffle many words. But with consistent hashing, only the words between the 3rd and 1st server on our ring would be affected.
This property is particularly useful in real-world distributed systems where server additions and failures are common.
c. Virtual Nodes for Load Balancing:
One challenge with consistent hashing is that it doesn’t guarantee a uniform distribution of data. To combat this, we introduce "virtual nodes". Each server can be represented by multiple virtual nodes on the hash ring, thus ensuring a more balanced distribution.
Applications in the Real World:
Distributed Caches: Popular systems like Memcached utilize consistent hashing. When a caching server goes down or is added, consistent hashing ensures that only a minimal number of keys are remapped.
Database Sharding: Large-scale databases distribute data across multiple servers or "shards". Consistent hashing aids in determining where each piece of data should reside while minimizing data movement when scaling the system.
Content Delivery Networks (CDNs): CDNs distribute content to servers across the globe. Consistent hashing ensures users are directed to the nearest server with the content they need and facilitates smooth content replication and distribution.
Distributed File Systems: Systems like the Apache Cassandra database use consistent hashing to determine which node should store a particular file or piece of data.
Further Reading and References:
Books:
"Data Structures and Algorithm Analysis in C++" by Mark A. Weiss: A comprehensive text with a dedicated section on hashing.
"Designing Data-Intensive Applications" by Martin Kleppmann: Provides insights into the challenges and solutions of large-scale data systems, including the use of consistent hashing.
Research Papers:
Karger, D. et al. (1997). Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing.
Litwin, W. (1980). Linear Hashing: A New Tool for File and Table Addressing. In Proceedings of the 6th International Conference on Very Large Data Bases.
Online Resources:
Consistent Hashing - An article on Toptal that provides a hands-on guide to the concept.
Introduction to Cryptographic Hash Functions for the Non-Techie - A beginner-friendly dive into the world of cryptographic hashing.
Courses and Lectures:
- MIT OpenCourseWare, Introduction to Algorithms - This course covers fundamental algorithms, including a section on hashing.
Embark on these pathways of knowledge, and soon, the intricate dance of numbers and algorithms will unveil its deeper mysteries. Whether you're just starting out or seeking advanced wisdom, there's always more to discover in the world of computer science.
Subscribe to my newsletter
Read articles from Siva Sagar Kalepu directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by