Inside Hash Tables

Sanchit Jain

Hash tables are a fundamental data structure used widely in computer science for their efficiency in storing and retrieving data. They offer average-case constant time complexity for operations like insertion, deletion, and lookup, making them ideal for use cases such as dictionaries, caches, and symbol tables.

Despite their frequent use in high-level languages, it's easy to use them without fully understanding how they work internally. In this post, I attempt to explore the inner workings of hash tables, from the basic idea behind hashing, to how collisions are handled using different techniques.

I’ve implemented basic versions of two common collision resolution strategies: chaining and open addressing, in Python to see how they compare. While performance isn’t the main focus here, I’ve included a small benchmark at the end for curiosity.

If you’re someone who has used hash tables but wants a clearer picture of what’s going on under the hood, I hope this helps!

Also, a lot of my understanding of hash tables came from watching Arpit Bhayani's Hash Table Internals videos on the topic. He explains things very clearly and intuitively. I highly recommend checking out these videos and his channel in general if you like to learn things in depth.

What are Hash Tables?

A hash table is a data structure that stores data in the form of key-value pairs. It is designed to provide fast lookups, insertions, and deletions. Internally, it uses an array where a hash function is applied to each key to determine the index at which the corresponding value should be stored. However, different keys can sometimes produce the same hash, and therefore the same index; this situation is known as a collision. To handle this, various collision resolution techniques are used.

How are Hash Tables Implemented?

To implement a hash table, we start with an array of a certain size. When inserting a key-value pair, a hash function is applied to the key, which produces an index within the bounds of the array. The pair is then stored at that index. For lookups, the same hash function is applied to the key, and the value stored at the resulting index is returned.
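The core idea can be sketched in a few lines of Python. This is a deliberately naive version that ignores collisions entirely (the rest of the post addresses them); the class and method names are my own, not from the author's implementation:

```python
class NaiveHashTable:
    """Minimal sketch: hash a key to an array index, store the pair there.

    Illustrative only -- it does not handle collisions yet, so a second
    key hashing to the same slot would overwrite the first.
    """

    def __init__(self, size=8):
        self.size = size
        self.slots = [None] * size  # each slot holds one (key, value) pair

    def _index(self, key):
        # Apply the hash function, then fold the result into the array bounds.
        return hash(key) % self.size

    def put(self, key, value):
        self.slots[self._index(key)] = (key, value)

    def get(self, key):
        entry = self.slots[self._index(key)]
        if entry is not None and entry[0] == key:
            return entry[1]
        return None
```

Lookup applies the same `_index` computation as insertion, which is exactly why the hash function must be deterministic for a given key.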

A good hash function is crucial for performance. It should:

  • Distribute values uniformly across the array to avoid clustering.

  • Be fast to compute.

  • Have a low chance of collisions, meaning it rarely assigns the same index to different keys.

What should the size of a hash table be? A power of two is often chosen. Why? Because the modulus operation used to map a hash to an index can then be replaced with a single fast bitwise AND: instead of computing hash % size, we can compute hash & (size - 1) when size is a power of 2.
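The equivalence is easy to check: when size is a power of two, size - 1 is a mask of all one bits below that power, so ANDing with it keeps exactly the remainder bits:

```python
size = 16        # a power of two: 0b10000
mask = size - 1  # 0b01111

# For non-negative hashes, the bitwise AND equals the modulus.
for h in [0, 7, 16, 12345, 2**31 - 1]:
    assert h % size == h & mask

print(12345 % 16, 12345 & 15)  # both print the same index: 9 9
```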

When resizing, the table size should typically be doubled rather than increased by a small amount, as doubling allows for efficient rehashing and maintains good performance. To understand this in detail, I highly recommend checking out Arpit's videos, where it's explained beautifully.
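Resizing means rehashing every existing entry into the larger array, since each key's index depends on the table size. A sketch of that step, under my own assumptions (entries are (key, value) pairs, None marks an empty slot, and the new table uses linear probing):

```python
def resize(slots, new_size):
    """Rehash every entry into a table of new_size slots (typically double).

    Sketch only: entries are (key, value) tuples, None marks an empty slot.
    Indices must be recomputed because hash(key) % size changes with size.
    """
    new_slots = [None] * new_size
    for entry in slots:
        if entry is None:
            continue
        i = hash(entry[0]) % new_size
        # Linear probing: scan forward until an empty slot is found.
        while new_slots[i] is not None:
            i = (i + 1) % new_size
        new_slots[i] = entry
    return new_slots
```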

Collisions in Hash Tables

Collisions happen when two different keys produce the same hash value, causing them to be assigned to the same index in the hash table. Since the array size is fixed but the possible keys are virtually unlimited, collisions are inevitable. This makes having a collision resolution strategy essential for any effective hash table implementation.

Collision Resolution Techniques

Chaining

One common way to handle collisions is chaining, where each index of the hash table stores a linked list of key-value pairs. When a new key hashes to an index that already has one or more keys, a new node is created and added to the linked list at that index. This approach is effective in resolving collisions, but it has some downsides. For example, lookups can be slower in the worst case since you might have to traverse the entire linked list to find a key. Also, because the linked list nodes are scattered across memory, they aren’t cache-friendly, which can impact performance. Additionally, chaining requires extra memory to store the linked list nodes.
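A minimal chaining sketch in Python, using a Python list per bucket as a stand-in for the linked list described above (the behavior is the same; names are mine, not the author's code):

```python
class ChainedHashTable:
    """Chaining sketch: each bucket holds a list of (key, value) pairs."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # collision or empty bucket: append

    def get(self, key):
        for k, v in self._bucket(key):   # traverse the chain
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]            # removal is a simple unlink
                return True
        return False
```

Note how `get` must walk the whole chain in the worst case, which is where the slowdown at high load factors comes from.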

Open Addressing

Another strategy is open addressing, where all key-value pairs are stored directly inside the hash table’s array itself. If a collision occurs (meaning the calculated index is already occupied), the hash table uses a method called probing to find the next available slot. Probing systematically checks other positions in the array until it finds an empty one, and the key-value pair is inserted there. The same probing method is used during lookups to find the correct key.

When deleting keys in open addressing, you can’t simply clear the slot, because doing so may break the search chain and make some keys unreachable. Instead, the slot is soft-deleted: it is marked with a "tombstone" to indicate that it is free for reuse but still part of the probe path. If too many tombstones build up, the table can be rehashed (i.e., rebuilt) to remove these markers and free up space.
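An open addressing sketch with linear probing and tombstone deletion, to make the probe-chain argument concrete (again a sketch under my own naming, not the author's exact code):

```python
_TOMBSTONE = object()  # sentinel marking a deleted slot

class OpenAddressingTable:
    """Open addressing sketch: linear probing, tombstones for deletion."""

    def __init__(self, size=8):
        self.size = size
        self.slots = [None] * size

    def _probe(self, key):
        # Yield indices in linear-probe order, starting at the home slot.
        start = hash(key) % self.size
        for offset in range(self.size):
            yield (start + offset) % self.size

    def put(self, key, value):
        first_free = None
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is None:
                # End of the probe chain: reuse an earlier tombstone if seen.
                self.slots[first_free if first_free is not None else i] = (key, value)
                return
            if entry is _TOMBSTONE:
                if first_free is None:
                    first_free = i       # reusable, but keep probing for the key
            elif entry[0] == key:
                self.slots[i] = (key, value)  # update existing key
                return
        if first_free is not None:
            self.slots[first_free] = (key, value)
        else:
            raise RuntimeError("table full -- a real table would resize here")

    def get(self, key):
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is None:
                return None              # an empty slot ends the probe chain
            if entry is not _TOMBSTONE and entry[0] == key:
                return entry[1]
        return None

    def delete(self, key):
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is None:
                return False
            if entry is not _TOMBSTONE and entry[0] == key:
                self.slots[i] = _TOMBSTONE  # soft delete: probe chain stays intact
                return True
        return False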

There are different types of probing techniques:

  • Linear Probing, which checks the next slot sequentially (but can suffer from clustering),

  • Quadratic Probing, which jumps by increasing intervals to reduce clustering, and

  • Double Hashing, which uses a second hash function to calculate the probe step size, providing better distribution.
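The three techniques differ only in how the i-th probe index is computed. A sketch (here h is the key's home index and h2 is an assumed second hash value for double hashing):

```python
def linear_probe(h, i, size):
    # i-th probe: step forward one slot at a time (prone to clustering)
    return (h + i) % size

def quadratic_probe(h, i, size):
    # i-th probe: jump by growing quadratic offsets to break up clusters
    return (h + i * i) % size

def double_hash_probe(h, h2, i, size):
    # i-th probe: the step size comes from a second hash (h2 must be nonzero)
    return (h + i * h2) % size

size, h = 8, 3
print([linear_probe(h, i, size) for i in range(4)])          # [3, 4, 5, 6]
print([quadratic_probe(h, i, size) for i in range(4)])       # [3, 4, 7, 4]
print([double_hash_probe(h, 5, i, size) for i in range(4)])  # [3, 0, 5, 2]
```

Note how linear probing walks a contiguous run of slots (the cause of clustering), while the other two scatter their probes.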

Load Factor

The load factor of a hash table is the ratio of the number of stored elements to the total number of slots in the table. It essentially measures how full the table is.

For chaining, the load factor can exceed 1 and is in principle unbounded, because each slot stores a linked list that can grow without limit as more keys collide at the same index.

In open addressing, however, the load factor ranges from 0 to 1. Since all key-value pairs must fit inside the fixed-size array itself, once the table is full, no more elements can be added without resizing or rehashing.
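In code, the load factor is just a ratio, which makes the asymmetry between the two strategies easy to state:

```python
def load_factor(num_elements, num_slots):
    """Load factor = stored elements / total slots."""
    return num_elements / num_slots

# Chaining: can exceed 1 (e.g. 12 elements chained across 8 buckets).
assert load_factor(12, 8) == 1.5
# Open addressing: capped at 1, since the array itself holds every entry.
assert load_factor(8, 8) == 1.0
```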

Performance Analysis: Chaining vs Open Addressing Hash Tables

For this benchmark, I implemented the two collision resolution strategies described above, chaining and open addressing, and measured their performance on insertion and search operations across varying load factors.

Key Metrics Tracked

  • Execution Time for insertions and searches

  • Number of Collisions during these operations

Load Factors Tested

I tested four load factors: 0.25, 0.5, 0.75, and 0.9, where load factor is defined as the ratio of elements to the total capacity of the table.
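The original benchmark code isn't reproduced here, but a harness along these lines could produce such measurements. Everything in this sketch is my own assumption (the `table_factory`/`put` interface, the key generation), not the code behind the numbers discussed below:

```python
import random
import time

def benchmark_inserts(table_factory, capacity, load_factor):
    """Time inserting enough distinct random keys to reach load_factor.

    Hypothetical harness: table_factory(capacity) must build an empty
    table exposing put(key, value). Returns elapsed seconds.
    """
    n = int(capacity * load_factor)
    keys = random.sample(range(capacity * 10), n)  # n distinct keys
    table = table_factory(capacity)
    start = time.perf_counter()
    for k in keys:
        table.put(k, k)
    return time.perf_counter() - start

# Trivial dict-backed stand-in, just to show the harness running:
class DictTable:
    def __init__(self, capacity):
        self.data = {}
    def put(self, key, value):
        self.data[key] = value

for lf in [0.25, 0.5, 0.75, 0.9]:
    print(f"load factor {lf}: {benchmark_inserts(DictTable, 1024, lf):.6f}s")
```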

Insights from the Results

1. Insertion Time & Collisions

  • Chaining handles collisions by maintaining linked lists at each bucket. As the load factor increases, the average linked list length grows, causing more collisions, but insertion time increases gradually and remains manageable even at high load factors.

  • Open Addressing stores all elements directly in the table and resolves collisions by probing. As load factor approaches 1 (i.e., table fills up), insertion time and collisions increase sharply due to longer probe sequences and clustering effects.

2. Search Time & Collisions

  • For chaining, search time grows linearly with load factor because longer chains mean more nodes to traverse.

  • For open addressing, search time degrades faster near high load factors, since probes must check multiple slots, including deleted ones, until the key is found or an empty slot is reached.

3. Collision Behavior

  • Chaining can handle load factors greater than 1 because linked lists can grow arbitrarily long, making it flexible for dynamic data.

  • Open addressing must keep load factor below 1, otherwise, insertions fail as no empty slots remain.

Practical Considerations

  • Open addressing generally uses less memory as it stores data directly in the table array, improving cache locality and often giving better performance at low load factors.

  • Open addressing requires special “deleted” markers to avoid breaking probe chains, which can increase search time over many insert/delete operations. Chaining allows straightforward removals from linked lists.

  • Use Cases:

    • Use chaining when your dataset size can vary or you expect high load factors.

    • Use open addressing when memory efficiency and faster lookups at low load factors are priorities.

  • Adding resizing to open addressing could avoid the performance collapse at high load factors, e.g. by rehashing into a larger table once the load factor exceeds 0.5.

  • Exploring different probing techniques (quadratic, double hashing) might reduce clustering.

Conclusion

Hash tables are a foundational data structure that enable fast data access. Although many programming languages provide hash tables as built-in features, understanding how they work internally can give you a much clearer perspective on their strengths and limitations.

By implementing both chaining and open addressing in Python, I got hands-on experience with how these methods handle collisions and how performance varies with load factors. We’ve only scratched the surface here.

I hope this post helped you get a better sense of how hash tables work internally. If you're curious, I highly recommend trying to implement one yourself; it's a fun way to reinforce the concepts. And once again, a big thanks to Arpit Bhayani for his amazing explanations that helped me understand all of this better.
