Day 18: Hierarchical Clustering – Building a Family Tree for Data

Saket Khopkar

Imagine you’re at a wedding. You don’t know many people there, so you start talking to the person closest to you. You realize you both love cricket: instant connection! A bit later, you and your new friend meet another person who also loves cricket, so now the three of you form a small group.

Over time, groups start merging, cricket lovers join with football lovers, and eventually, everyone is part of one big social group at the wedding. This is hierarchical clustering in action, starting with individuals, merging them step by step until everyone is part of a bigger cluster.

Okay, in slightly more technical terms: hierarchical clustering is a way to group things into a tree-like structure called a dendrogram. Think of it like a family tree, but instead of parents and children, we have data points and clusters.

There are two main ways to build this tree:

  1. Agglomerative → Bottom-up approach
    Start with each point as its own cluster, then merge them gradually. (This is what we’ll focus on.)

  2. Divisive → Top-down approach
    Start with everything in one big cluster, then split it apart.


The Agglomerative Way

Since we are going to focus on the Agglomerative approach, it is important to understand how it works:

Think of it as making a friend group step-by-step:

  1. Everyone starts alone (each person = 1 cluster).

  2. Find the two most similar people.

  3. Merge them into one group.

  4. Repeat until everyone is in the same big group.

For Example:
At a conference:

  • First, two web developers start talking → form Group 1.

  • Two data scientists meet → form Group 2.

  • Later, Group 1 and Group 2 meet because they both love AI → merge into a bigger group.
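The steps above can be sketched as a naive agglomerative loop. This is a toy illustration only (single linkage, brute-force pairwise search); in practice you would use `scipy.cluster.hierarchy.linkage`, as we do later.

```python
import numpy as np

def naive_agglomerative(points):
    """Repeatedly merge the two closest clusters until one remains.

    A toy single-linkage sketch for illustration; the cluster names and
    return format are made up for this example.
    """
    clusters = [[i] for i in range(len(points))]  # everyone starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the two clusters whose closest members are nearest (single linkage)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return merges

points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
for left, right, dist in naive_agglomerative(points):
    print(left, "+", right, f"at distance {dist:.2f}")
```

The two tight pairs merge first (at distance 1.00 each), and only then do the two small groups join into one big cluster, exactly like the conference example.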


Measuring Closeness

Agglomerative clustering needs to measure how far apart points are. There are a few common ways to determine that.

  • Euclidean Distance is the simplest to grasp: the straight-line distance between two points. Think of it as the distance between two cities as the crow flies.

  • Manhattan Distance is the distance when you can only move at right angles, like walking along the street grid of a city.

  • Cosine Similarity measures the angle between two vectors, ignoring their size. It’s like comparing two playlists to see whether they share the same taste, not the same number of songs.

If you compare shopping habits:

  • Euclidean → looks at total difference in spending

  • Manhattan → looks at difference category by category

  • Cosine → looks at whether spending patterns are similar
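The shopping-habits comparison can be made concrete with a small NumPy snippet. The spending numbers below are hypothetical; customer B spends exactly twice as much as A in every category, so the pattern is identical even though the totals differ.

```python
import numpy as np

# Monthly spending of two customers across three categories (hypothetical numbers)
a = np.array([100.0, 50.0, 10.0])   # customer A
b = np.array([200.0, 100.0, 20.0])  # customer B: exactly double in each category

euclidean = np.linalg.norm(a - b)        # straight-line distance: large, totals differ
manhattan = np.sum(np.abs(a - b))        # category-by-category gaps: 100 + 50 + 10 = 160
cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # pattern similarity

print(euclidean)
print(manhattan)   # 160.0
print(cosine_sim)  # 1.0 -> identical spending *pattern*, just scaled
```

Euclidean and Manhattan both report a big gap, while cosine similarity says the two customers are a perfect match in taste.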


Merging Clusters - The Linkage Methods

When merging clusters, you need a rule for "how close two groups are". That's where Linkage Methods come into the picture. Here are a few of them, simplified:

Method | Idea | Analogy
--- | --- | ---
Single Linkage | Take the closest pair of points | “Let’s connect if any of our members are friends”
Complete Linkage | Take the farthest pair of points | “We’ll merge only if everyone is close enough”
Average Linkage | Take the average distance over all pairs | “On average, our members get along well”
Ward’s Method | Merge to minimize variance | “Let’s combine groups so members are as similar as possible”

It’s all about distance: the algorithm keeps merging the closest clusters. "Closest" can mean:

  • Nearest point between two clusters (Single Linkage)

  • Farthest point (Complete Linkage)

  • Average of all points (Average Linkage)

  • Smallest increase in total variance (Ward’s Method; most common)
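To see that the linkage choice really matters, here is a small sketch that runs `scipy.cluster.hierarchy.linkage` with each method on the same toy points (the data is made up for this example) and prints the height of the final merge:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two tight pairs plus one point in between; the linkage rule changes merge heights
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0], [5.0, 0.0]])

for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X, method=method)
    # Each row of Z is (cluster_i, cluster_j, merge_distance, new_cluster_size)
    print(method, "-> final merge distance:", round(Z[-1, 2], 2))
```

Single linkage reports the smallest final distance (it only needs one close pair of members), complete linkage the largest, with average and Ward somewhere in between.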


Dendrograms

A dendrogram is just a diagram that shows:

  • When two clusters merged

  • How far apart they were when they merged

If the merge between two clusters happens at a low height → they’re very similar.
If it happens high up → they’re more different.

By “cutting” the tree at a certain height, you decide how many clusters you want.
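Cutting the tree is exactly what `scipy.cluster.hierarchy.fcluster` does. A minimal sketch on made-up data, showing both ways to cut: by height (`criterion='distance'`) and by a desired number of clusters (`criterion='maxclust'`):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two obvious groups of two points each (toy data for illustration)
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
Z = linkage(X, method='ward')

# Cut by height: every merge above t=5 is undone, so two clusters remain here
labels_by_height = fcluster(Z, t=5, criterion='distance')

# Or ask directly for a number of clusters
labels_by_count = fcluster(Z, t=2, criterion='maxclust')

print(labels_by_height)
print(labels_by_count)
```

Both cuts recover the same two groups: the first two points share a label, the last two share another.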


Ready for Practicals?

A tiny example is enough to show what a dendrogram looks like, but now we are going to perform hierarchical clustering on a bigger set of data. We will create our own dataset of 200 rows and see how that pans out.

import pandas as pd
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import seaborn as sns
import matplotlib.pyplot as plt

Necessary imports, don’t forget them.

# Step 1: Create the dataset
np.random.seed(42)
n_customers = 200
df = pd.DataFrame({
    'CustomerID': np.arange(1, n_customers+1),
    'Age': np.random.randint(18, 70, size=n_customers),
    'Annual Income (k$)': np.random.randint(15, 140, size=n_customers),
    'Spending Score (1-100)': np.random.randint(1, 101, size=n_customers)
})

print(df)

This will create a dataset of 200 records, similar to the Mall Customers dataset we used in the previous article.

# Step 2: Hierarchical clustering (Ward’s method)
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]
Z = linkage(X, method='ward')

Let's begin our clustering operation. We will use Ward's method, but you may use any of the linkage methods mentioned above.

Time to plot the dendrogram.

# Step 3: Plot dendrogram
plt.figure(figsize=(12,6))
dendrogram(Z)
plt.title("Dendrogram - Mall Customers")
plt.xlabel("Customer Index")
plt.ylabel("Distance")
plt.show()

Pretty humongous, right?

The dendrogram tells you where to cut to get a desired number of clusters. So, let's cut the tree.

# Step 4: Choose clusters (cut the tree)
# criterion='maxclust' with t=5 asks for exactly 5 clusters
clusters = fcluster(Z, t=5, criterion='maxclust')
df['Cluster'] = clusters

# Step 5: Visualize clusters
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)', hue='Cluster', palette='tab10')
plt.title("Customer Segments from Hierarchical Clustering")
plt.show()

The scatterplot shows actual groups of customers based on income and spending.


Interview Tip on Topic

If asked “Why use hierarchical over K-Means?”, you can say:

"Hierarchical clustering not only finds clusters but also reveals their relationships, and it doesn’t require specifying the number of clusters in advance. K-Means is faster, but it’s limited to predefined K and assumes clusters are spherical."


The Wrap Up

We learnt the core concepts of Hierarchical Clustering and saw how closeness is measured. We also looked at the linkage methods and understood when to use each one.

We implemented Agglomerative clustering, the bottom-up approach, and used a dendrogram to decide where to "cut" the tree to form groups.

However, this kind of clustering can be slow on bigger datasets (naive agglomerative clustering needs O(n²) memory and up to O(n³) time), and it tends to be sensitive to outliers. On the plus side, it shows relationships in depth, and there is no need to predefine the number of clusters.

So, that's a wrap for today; we learnt a lot about Hierarchical clustering. Now it is time to move on to more advanced topics. Tweak the code and play with the examples to gain a better understanding of the concepts.

Ciao!!

Written by

Saket Khopkar

Developer based in India. Passionate learner and blogger. All blogs are basically Notes of Tech Learning Journey.