Day 18: Hierarchical Clustering – Building a Family Tree for Data


Imagine you’re at a wedding. You don’t know many people there, so you start talking to the person closest to you. You realize you both love cricket, instant connection! A bit later, you and your new friend meet another person who also loves cricket, so now the three of you form a small group.
Over time, groups start merging, cricket lovers join with football lovers, and eventually, everyone is part of one big social group at the wedding. This is hierarchical clustering in action, starting with individuals, merging them step by step until everyone is part of a bigger cluster.
Okay, let's get a bit more technical. Hierarchical clustering is a way to group things into a tree-like structure called a dendrogram. Think of it like a family tree, but instead of parents and children, we have data points and clusters.
There are two main ways to build this tree:
Agglomerative → Bottom-up approach
Start with each point as its own cluster, then merge them gradually. (This is what we’ll focus on.)
Divisive → Top-down approach
Start with everything in one big cluster, then split it apart.
The Agglomerative Way
Since we are going to focus on the Agglomerative approach, it is important to understand how it works.
Think of it as making a friend group step-by-step:
Everyone starts alone (each person = 1 cluster).
Find the two most similar people.
Merge them into one group.
Repeat until everyone is in the same big group.
For Example:
At a conference:
First, two web developers start talking → form Group 1.
Two data scientists meet → form Group 2.
Later, Group 1 and Group 2 meet because they both love AI → merge into a bigger group.
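The step-by-step merging above can be sketched in plain Python. This is a deliberately naive, illustrative version (single linkage on a handful of made-up 1-D points), not the optimized algorithm real libraries use:

```python
# Naive agglomerative clustering sketch: single linkage on 1-D points.
points = [1.0, 1.2, 5.0, 5.1, 9.0]
clusters = [[p] for p in points]  # everyone starts alone

def single_link_dist(a, b):
    # Distance between the closest pair of members across two clusters.
    return min(abs(x - y) for x in a for y in b)

while len(clusters) > 1:
    # Find the two closest clusters...
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_link_dist(clusters[ij[0]], clusters[ij[1]]),
    )
    # ...and merge them into one.
    merged = clusters[i] + clusters[j]
    print(f"merge {clusters[i]} + {clusters[j]}")
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
```

Each pass through the loop prints one merge, and the loop ends when everyone is in the same big group.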
Measuring Closeness
Agglomerative clustering needs to measure how far apart points are. There are a few common ways to do that.
Euclidean Distance is the simplest to grasp: it is the straight-line distance, like the distance between two cities as the crow flies.
Manhattan Distance is the distance when you can only move at right angles, like walking along the streets of a city.
Cosine Distance (based on cosine similarity) measures the angle between two points, ignoring magnitude. It's like comparing two playlists to see whether they share similar taste, not similar size.
If you compare shopping habits:
Euclidean → looks at total difference in spending
Manhattan → looks at difference category by category
Cosine → looks at whether spending patterns are similar
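The shopping comparison can be made concrete with SciPy's distance functions. The two spending vectors below are made up for illustration; shopper `b` spends exactly twice as much as `a` in every category, so the pattern is identical even though the totals differ:

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

# Hypothetical monthly spending ($) across 3 categories for two shoppers.
a = np.array([100.0, 200.0, 300.0])
b = np.array([200.0, 400.0, 600.0])  # same pattern as `a`, double the amounts

print(euclidean(a, b))   # straight-line distance: large, totals differ a lot
print(cityblock(a, b))   # Manhattan: sum of per-category differences (600.0)
print(cosine(a, b))      # cosine distance: ~0.0, identical spending *pattern*
```

Notice how cosine distance says these shoppers are essentially identical while Euclidean and Manhattan say they are far apart; which one is "right" depends on the question you are asking.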
Merging Clusters - The Linkage Methods
When merging clusters, you need a rule for "how close two groups are". That's where linkage methods come into the picture. Here are a few of them, simplified:
| Method | Idea | Analogy |
| --- | --- | --- |
| Single Linkage | Take the closest pair of points | “Let’s connect if any of our members are friends” |
| Complete Linkage | Take the farthest pair of points | “We’ll merge only if everyone is close enough” |
| Average Linkage | Take average distance of all pairs | “On average, our members get along well” |
| Ward’s Method | Merge to minimize variance | “Let’s combine groups so members are as similar as possible” |
It’s all about distance, the algorithm keeps merging the closest clusters. "Closest" can mean:
Nearest point between two clusters (Single Linkage)
Farthest point (Complete Linkage)
Average of all points (Average Linkage)
Smallest increase in total variance (Ward’s Method; most common)
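You can see that these rules really do build different trees by running SciPy's `linkage` on the same small random dataset with each method and comparing the height of the final merge (the third column of the linkage matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

np.random.seed(0)
X = np.random.rand(10, 2)  # 10 random points with 2 features

# Same data, different linkage rules -> different merge heights.
for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X, method=method)
    # Z[-1, 2] is the distance at which the last two clusters merged.
    print(method, round(Z[-1, 2], 3))
```

Single linkage typically reports the smallest final merge height (one friendly pair is enough) and complete linkage a larger one (everyone must be close).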
Dendrograms
A dendrogram is just a diagram that shows:
When two clusters merged
How far apart they were when they merged
If the link joining two clusters sits low on the tree (small height) → they’re very similar.
If it sits high → they’re more different.
By “cutting” the tree at a certain height, you decide how many clusters you want.
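A tiny, self-contained sketch makes this concrete. The six 2-D points below are made up so they form two obvious groups; the dendrogram shows the two groups merging late, at a large height:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Six made-up points forming two loose groups.
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 8.2], [7.8, 8.5]])

Z = linkage(X, method='ward')

plt.figure(figsize=(6, 4))
dendrogram(Z)
plt.title("A tiny dendrogram")
plt.ylabel("Distance")
plt.show()
```

Cutting this tree anywhere in the tall gap between the last merge and the earlier ones gives you exactly the two groups we built into the data.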
Ready for Practicals?
The example above is a small one, because the only intention was to show how a dendrogram looks. Now we are going to perform hierarchical clustering on a bigger set of data. We will create our own dataset of 200 rows and see how that pans out.
import pandas as pd
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import seaborn as sns
import matplotlib.pyplot as plt
Necessary imports, don’t forget them.
# Step 1: Create the dataset
np.random.seed(42)
n_customers = 200
df = pd.DataFrame({
    'CustomerID': np.arange(1, n_customers + 1),
    'Age': np.random.randint(18, 70, size=n_customers),
    'Annual Income (k$)': np.random.randint(15, 140, size=n_customers),
    'Spending Score (1-100)': np.random.randint(1, 101, size=n_customers)
})
print(df)
This creates a dataset of 200 records, similar to the Mall Customers dataset we used in the previous article.
# Step 2: Hierarchical clustering (Ward’s method)
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]
Z = linkage(X, method='ward')
Let's begin our clustering operation. We will use Ward's method, but you may use any of the linkage methods mentioned above.
Time to plot the dendrogram.
# Step 3: Plot dendrogram
plt.figure(figsize=(12,6))
dendrogram(Z)
plt.title("Dendrogram - Mall Customers")
plt.xlabel("Customer Index")
plt.ylabel("Distance")
plt.show()
Pretty humongous, right?
The dendrogram tells you where to cut to get a desired number of clusters. So, let's cut the tree.
# Step 4: Choose clusters (cut the tree)
clusters = fcluster(Z, t=5, criterion='maxclust')
df['Cluster'] = clusters
# Step 5: Visualize clusters
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)', hue='Cluster', palette='tab10')
plt.title("Customer Segments from Hierarchical Clustering")
plt.show()
The scatterplot shows actual groups of customers based on income and spending.
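Note that `criterion='maxclust'` (asking for at most five clusters) is not the only way to cut. `fcluster` can also cut at an explicit height on the dendrogram's distance axis. Here is a self-contained sketch that rebuilds the same seeded features; the threshold of 200 is an arbitrary assumption for illustration, so read a sensible value off your own dendrogram:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Rebuild the same seeded data as above (Age is drawn first to keep
# the random stream identical to the original dataset).
np.random.seed(42)
n_customers = 200
age = np.random.randint(18, 70, size=n_customers)
income = np.random.randint(15, 140, size=n_customers)
spending = np.random.randint(1, 101, size=n_customers)

Z = linkage(np.column_stack([income, spending]), method='ward')

# Cut at an explicit height instead of asking for k clusters.
# 200 is an illustrative threshold, not a recommendation.
labels = fcluster(Z, t=200, criterion='distance')
print("clusters at height 200:", np.unique(labels).size)
```

Cutting by distance is handy when the dendrogram shows a clear gap: you pick the height inside the gap rather than guessing a cluster count up front.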
Interview Tip on Topic
If asked “Why use hierarchical over K-Means?”, you can say:
"Hierarchical clustering not only finds clusters but also reveals their relationships, and it doesn’t require specifying the number of clusters in advance. K-Means is faster, but it’s limited to predefined K and assumes clusters are spherical."
The Wrap Up
We learnt the core concepts behind hierarchical clustering, saw how closeness is measured, and looked at the linkage methods and when to use each one.
We implemented agglomerative clustering, the bottom-up approach, and used a dendrogram to decide where to "cut" the tree to form groups.
This kind of clustering can be slow on bigger datasets and tends to be sensitive to outliers. On the plus side, it reveals the relationships between clusters in depth, and there is no need to predefine the number of clusters.
So, that is a wrap for today; we learnt a lot about hierarchical clustering. Now it is time to move forward to more advanced topics. Tweak the code and play with the examples to gain a better understanding of the concepts.
Ciao!!
Written by Saket Khopkar
Developer based in India. Passionate learner and blogger. All blogs are basically Notes of Tech Learning Journey.