Collision Handling Techniques in Hashing You Need to Know
Table of contents
Introduction to Collisions
Hashing is a fundamental concept in computer science, widely used in various applications such as database indexing, caching, and data retrieval. The essence of hashing lies in mapping large datasets to smaller, fixed-size tables using a hash function. Ideally, each unique input (or key) maps to a unique location in the hash table. However, due to the finite size of the hash table and the vast number of possible keys, it's inevitable that two different keys might produce the same hash value. This situation is known as a collision.
The Frequency of Collisions: The Birthday Paradox
To understand how frequent collisions can be, let’s delve into the Birthday Paradox. This paradox illustrates a surprising probability theory result: in a room of just 23 people, there’s more than 50% chance that two people share the same birthday. This is counterintuitive because the likelihood of any two specific individuals sharing a birthday seems low. However, as the number of people increases, the number of possible pairs grows rapidly, leading to a higher probability of a shared birthday.
In hashing, a similar situation arises. Even if a hash table is large, the chances of collisions increase as more keys are hashed into the table. Just as with the Birthday Paradox, the likelihood of a collision becomes significant even when the table isn’t close to full. Understanding this frequency is crucial because it helps in choosing the right collision handling technique to maintain efficiency in data operations.
Collision Handling Techniques
Given that collisions are unavoidable, several techniques have been developed to handle them efficiently. The two most common methods are Open Addressing and Separate Chaining. Let’s explore these in detail.
Open Addressing
In Open Addressing, when a collision occurs, the algorithm searches for the next available slot in the hash table according to a specific probing sequence. The idea is to keep all the keys within the hash table itself, without needing extra storage.
Types of Open Addressing:
Linear Probing:
Description: When a collision occurs, the algorithm checks the next empty slot in a linear sequence (i.e., the next slot is
(hash + 1) % table_size
).Advantages: Simple to implement and fast cache performance due to spatial locality.
Disadvantages: Tends to cluster keys together, leading to increased search times (known as primary clustering).
Given below is an example of Open Addressing using Linear Probing
Quadratic Probing:
Description: Instead of checking the next slot linearly, quadratic probing checks slots by a quadratic function (e.g.,
(hash + i^2) % table_size
).Advantages: Reduces clustering issues found in linear probing.
Disadvantages: Can still lead to secondary clustering and may not probe all slots, leading to inefficiency.
Double Hashing:
Description: Uses a second hash function to determine the probing sequence (e.g.,
hash2(key) = 1 + (key % (table_size - 1))
).Advantages: Minimizes clustering and provides a more uniform distribution of keys.
Disadvantages: More complex to implement and requires careful selection of the second hash function to avoid patterns.
Pros of Open Addressing:
All data is stored within the hash table, leading to compact memory usage.
Efficient when the load factor (the ratio of the number of elements to the table size) is low.
Cons of Open Addressing:
Performance degrades as the table fills up.
Deletion is tricky, requiring careful handling to avoid breaking the probing sequence.
Separate Chaining
Separate Chaining is a technique where each slot in the hash table points to a linked list (or another data structure) that stores all keys that hash to that slot. When a collision occurs, the new key is simply appended to the linked list at that slot.
Given below is an example of Separate Chaining using Linked Lists:
Description:
Insertion: The new key is added to the list at the appropriate index.
Search: The hash table is accessed, and the linked list at the corresponding slot is traversed to find the key.
Deletion: The key is removed from the linked list, if present.
Advantages:
Handles collisions efficiently even when the table is heavily loaded.
Easy to implement and allows the hash table to grow dynamically (only the linked lists need to grow).
Disadvantages:
Extra memory overhead for storing pointers in the linked lists.
Slower access time due to potential traversal of linked lists.
Poor cache performance because linked lists may not be stored contiguously in memory.
Separate Chaining with Different Data Structures
Separate Chaining can be implemented using various data structures to optimize performance based on specific needs.
Linked List
Insertion: New keys are added to the head or tail of the linked list.
Search: Traverse the linked list to find the key.
Deletion: Remove the key from the linked list if it exists.
Advantages:
Simple to implement.
Efficient for small lists.
Disadvantages:
Slower search times for long lists.
Extra memory overhead for pointers.
Dynamic Arrays
Insertion: New keys are added to the end of the dynamic array, resizing if necessary.
Search: Iterate through the array to find the key.
Deletion: Remove the key and shift elements if necessary.
Advantages:
Faster access times compared to linked lists.
Better cache performance.
Disadvantages:
Overhead of resizing the array.
Potentially inefficient deletions due to shifting elements.
Self-Balancing Binary Search Trees (BSTs)
Insertion: Insert the key while maintaining the tree's balanced property.
Search: Perform a binary search to find the key.
Deletion: Remove the key while maintaining the tree's balanced property.
Advantages:
Faster search, insertion, and deletion times for large datasets.
Keeps data sorted, which can be useful for range queries.
Disadvantages:
More complex to implement.
Higher memory overhead compared to linked lists and dynamic arrays.
Some Other Collision Handling Techniques
Besides Open Addressing and Separate Chaining, several other techniques have been developed to handle collisions. While these are less commonly used, they offer unique benefits for specific scenarios.
Robin Hood Hashing
Description: In Robin Hood Hashing, when a collision occurs, the key with the greater “probe distance” (the distance from its original hash index) is swapped with the current key. This method attempts to equalize the distribution of keys across the table.
Advantages: Reduces the variance in probe lengths, leading to more predictable performance.
Disadvantages: More complex insertion process and slightly increased average probe lengths.
Cuckoo Hashing
Description: Cuckoo Hashing uses two (or more) hash functions and stores each key in one of the two possible locations. If both locations are occupied, one of the existing keys is "kicked out" and rehashed to its alternative location.
Advantages: Guarantees O(1) lookup time and provides good worst-case performance.
Disadvantages: Insertion can be complex and may require a complete rehashing of the table if cycles are formed.
Hopscotch Hashing
Description: A variant of linear probing, Hopscotch Hashing maintains a neighborhood of nearby slots for each key, allowing more flexibility in placement and improving the chance of finding an empty slot nearby.
Advantages: Improves cache performance and reduces clustering.
Disadvantages: More complex implementation, especially in ensuring that the neighborhood property is maintained during insertions and deletions.
Conclusion
Collisions in hashing are inevitable due to the nature of hash functions and finite hash tables. Understanding the frequency and handling collisions efficiently is crucial for maintaining performance in applications that rely on hashing. While Open Addressing and Separate Chaining are the most commonly used methods, alternative techniques like Robin Hood Hashing, Cuckoo Hashing, and Hopscotch Hashing offer interesting solutions to specific challenges. The choice of collision handling technique depends on the specific requirements of the application, including memory constraints, speed, and the expected load factor of the hash table.
Subscribe to my newsletter
Read articles from Darsh Patel directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by