Maps and fortune telling

What are maps?
I started writing about astral maps and storytelling but had to go back and delete everything because I remembered we are in the context of software development. I considered using some map analogies, but none of them are quite good enough. Now I’m starting to think that maps should be called something else in computer programming, because the name is a bit misleading. The name probably came from the act of mapping one thing to another, but that’s definitely not the first thing that comes to mind when I hear the word map.
Anyway, maps are a crucial data structure in programming. They provide a way to store and retrieve data by mapping unique keys to corresponding values. In Java, maps are fundamental for efficient lookups and are widely used in applications ranging from caching to configuration storage.
The best analogy for me would be a table with two columns. The first column is the key and the second column is the value. In a way, a map is a bunch of pairs of values. So if you want to store pieces of information that relate to each other, maps will serve you well.
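To make the two-column table concrete, here’s a minimal sketch using Java’s Map interface (the names and ages are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class MapAsTable {
    public static void main(String[] args) {
        // Each entry is one "row" of the table: key column, value column.
        Map<String, Integer> ages = new HashMap<>();
        ages.put("Edgar", 25);
        ages.put("Alice", 30);

        System.out.println(ages.get("Edgar"));      // prints 25
        System.out.println(ages.containsKey("Bob")); // prints false
    }
}
```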
In Java, in order of importance, I’d say these are the maps we should care about:
HashMap (the modern, non-synchronized replacement for the legacy Hashtable) - very important, like number 1 for the majority of use cases
TreeMap, LinkedHashMap - interesting for not so common use cases
It’s also important to know about ConcurrentHashMap, but it’s just a thread-safe HashMap, so let’s not spend too much time there.
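A quick sketch of why you’d reach for the less common ones: TreeMap keeps its keys sorted, LinkedHashMap keeps insertion order, and plain HashMap guarantees no order at all:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapOrderings {
    public static void main(String[] args) {
        String[] keys = {"banana", "apple", "cherry"};

        Map<String, Integer> tree = new TreeMap<>();       // sorted by key
        Map<String, Integer> linked = new LinkedHashMap<>(); // insertion order

        for (String k : keys) {
            tree.put(k, k.length());
            linked.put(k, k.length());
        }

        System.out.println(tree.keySet());   // [apple, banana, cherry]
        System.out.println(linked.keySet()); // [banana, apple, cherry]
    }
}
```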
Why HashMaps are so important then?
HashMaps are maybe second only to ArrayLists in how often they appear in Java code. Both of them are used all the time (we covered ArrayLists here by the way). The main benefit of HashMaps is how fast they are. Want to insert something? O(1) on average, you’re welcome. Want to retrieve something? Just give me the key and there you have it, O(1). Want to delete something? You know it: O(1). But perhaps the main drawback (this is just my opinion) is that if you don’t understand the internals of the HashMap you might not get O(1) performance… worst case scenario, you might even get O(n) for everything.
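To see how you can lose that O(1), here’s a sketch with a hypothetical Key class whose hashCode sends every instance to the same bucket. The map still works correctly, but every operation has to dig through one overcrowded bucket:

```java
import java.util.HashMap;
import java.util.Map;

public class BadHashKey {
    // Hypothetical key type: equals is fine, but hashCode is constant,
    // so every single key collides into the same bucket.
    static final class Key {
        final String name;
        Key(String name) { this.name = name; }
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).name.equals(name);
        }
        @Override public int hashCode() { return 42; } // worst possible hash
    }

    static Map<Key, Integer> fill(int n) {
        Map<Key, Integer> m = new HashMap<>();
        for (int i = 0; i < n; i++) {
            m.put(new Key("k" + i), i); // all entries pile into one bucket
        }
        return m;
    }

    public static void main(String[] args) {
        Map<Key, Integer> m = fill(10_000);
        // Still correct, just slow: each get searches one giant bucket.
        System.out.println(m.get(new Key("k123"))); // prints 123
    }
}
```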
Then how does a HashMap work internally? The HashMap contains an array of buckets. One way to visualize this is to picture an empty bucket in each slot: [ ], [ ], [ ], [ ]. When storing a key-value pair, the HashMap applies a hash function to the key to decide which bucket to store the entry in. In the happy path, this is pretty simple and everything is O(1). The problems come up when the hash function produces the same result for more than one key (a collision). When this happens, the HashMap internally uses a linked list as the bucket, so if more than one element lands in the same bucket, it just gets appended to the linked list. If the linked list grows too much, it is converted into a red-black tree.
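For the curious, here’s a sketch of how OpenJDK’s HashMap turns a key into a bucket index: it mixes the high bits of hashCode into the low bits, then masks by the power-of-two capacity. The method names below are my own; the real logic lives inside java.util.HashMap:

```java
public class BucketIndex {
    // Mimics java.util.HashMap's hash spreading: XOR the high 16 bits
    // into the low 16 bits so small tables still use the high bits.
    static int spread(Object key) {
        int h = key.hashCode();
        return h ^ (h >>> 16);
    }

    // Capacity is always a power of two, so (capacity - 1) works as a
    // bit mask, which is cheaper than the modulo operator.
    static int bucketIndex(Object key, int capacity) {
        return (capacity - 1) & spread(key);
    }

    public static void main(String[] args) {
        int capacity = 16; // the default initial number of buckets
        System.out.println(bucketIndex("Edgar", capacity)); // some value in 0..15
    }
}
```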
That was a lot to fit in a paragraph. Let’s expand on it a bit.
A nice analogy that I like is to think that I am the HashMap. I start with 26 drawers (a real HashMap usually starts with 16 buckets) that I label from A to Z. I am a HashMap of “name to age”, meaning that the key for my HashMap is a person’s name and the value is their age. So, when a new record comes into my HashMap, like “Edgar, 25”, I apply a hash function (in my case, my hash function is to take the first letter of the name, E) to decide the drawer. Then I store “Edgar, 25” in the drawer labeled E. A real HashMap would apply a hash function to the name Edgar that returns an integer between 0 and 15, and that integer decides which bucket “Edgar, 25” goes into.
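The drawer analogy can be turned into a toy class. This is not how java.util.HashMap is written, just the analogy made concrete: 26 fixed drawers, the first letter as the hash function, and a list per drawer to handle collisions:

```java
import java.util.ArrayList;
import java.util.List;

// Toy "name to age" map following the drawer analogy.
public class DrawerMap {
    private static class Record {
        final String name;
        int age;
        Record(String name, int age) { this.name = name; this.age = age; }
    }

    @SuppressWarnings("unchecked")
    private final List<Record>[] drawers = new List[26];

    // The "hash function": first letter of the name picks the drawer.
    private int drawerFor(String name) {
        return Character.toUpperCase(name.charAt(0)) - 'A';
    }

    public void put(String name, int age) {
        int i = drawerFor(name);
        if (drawers[i] == null) drawers[i] = new ArrayList<>();
        for (Record r : drawers[i]) {        // collision: same first letter
            if (r.name.equals(name)) { r.age = age; return; }
        }
        drawers[i].add(new Record(name, age));
    }

    public Integer get(String name) {
        List<Record> drawer = drawers[drawerFor(name)];
        if (drawer == null) return null;
        for (Record r : drawer) {
            if (r.name.equals(name)) return r.age;
        }
        return null;
    }

    public static void main(String[] args) {
        DrawerMap m = new DrawerMap();
        m.put("Edgar", 25);
        m.put("Emma", 31); // lands in the same E drawer
        System.out.println(m.get("Edgar")); // prints 25
    }
}
```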
With this, it’s easy to see why it’s so fast to store and retrieve information. If you want to use Edgar as key, I take the first letter of the name and I already know in which drawer the information is. If you want to retrieve the record with Edgar as a key, same thing, I know the drawer quite fast. On the other hand, if you want to know the name of the person that is 25 years old, it’s a headache because I need to open every drawer and check everyone’s age until I report to you.
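A sketch of that asymmetry: lookup by key is a direct hit, while lookup by value has no choice but to walk every entry (the helper findKeyByValue below is mine, not part of the Map API):

```java
import java.util.HashMap;
import java.util.Map;

public class KeyVsValueLookup {
    // Lookup by value: must scan every entry, O(n).
    static String findKeyByValue(Map<String, Integer> m, int target) {
        for (Map.Entry<String, Integer> e : m.entrySet()) {
            if (e.getValue() == target) return e.getKey();
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Integer> ages = new HashMap<>();
        ages.put("Edgar", 25);
        ages.put("Maria", 40);
        ages.put("Zoe", 31);

        // Lookup by key: the hash points straight at the drawer, O(1) on average.
        System.out.println(ages.get("Edgar"));           // prints 25
        // Lookup by value: opens every drawer until one matches.
        System.out.println(findKeyByValue(ages, 25));    // prints Edgar
    }
}
```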
Cool, HashMaps are pretty simple then. Is there anything else?
To be fair, a lot of things were left out. The first thing to take into account is what would happen if I had a lot of names starting with the letter E. Then my drawer would be filled with records, and it would take me time to find stuff. The HashMap has some mechanisms in place to deal with such scenarios. Ideally, the hash function should not create too many collisions: if you know you need to store 50 names starting with E and just 1 name starting with Z, it doesn’t make sense to use a hash function that takes the first letter of the name, because you would have 50 records in one drawer and just 1 record in the other. But even if a bucket accumulates a lot of records, it gets converted from a linked list into a red-black tree, which improves retrieval within that bucket from O(n) to O(log(n)).
Another thing is that the initial number of buckets can grow. To decide when to add more buckets, the HashMap has a load factor, usually 75%, which it multiplies by the number of buckets to get a threshold. If I start with 24 buckets and the load factor is 75%, then when I reach 18 records I will add more buckets (usually doubling the amount). And the final relevant thing is that when I add more buckets, I need to take all existing records and assign new buckets to them, in a process known as rehashing.
Increasing the number of buckets keeps the number of records inside each bucket small. As long as the hash function distributes the records evenly, the performance of the HashMap stays very good.
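If you already know roughly how many entries are coming, you can pre-size the map and skip the intermediate resize-and-rehash cycles along the way. A sketch (the capacity numbers are just illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class Presizing {
    public static void main(String[] args) {
        // Default: 16 buckets and a 0.75 load factor,
        // so the first resize happens after the 12th entry.
        Map<String, Integer> small = new HashMap<>();

        // Expecting ~1000 entries? 2048 buckets * 0.75 = 1536 threshold,
        // so no resizing (and no rehashing) happens while filling it.
        Map<String, Integer> big = new HashMap<>(2048, 0.75f);

        for (int i = 0; i < 1000; i++) {
            big.put("key" + i, i);
        }
        System.out.println(big.size()); // prints 1000
    }
}
```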
What are the main points?
In summary, a HashMap stores and retrieves records in O(1) average time complexity. To do that, it internally uses an array of buckets, and each bucket holds a linked list of entries. A linked list is converted into a red-black tree if it grows too much (more than 8 records, usually). The number of buckets increases once the HashMap crosses its threshold (which depends on the capacity and the load factor), and whenever the number of buckets increases, there’s rehashing.