Hierarchical Clustering – Building Dendrograms to Unveil Data Hierarchies

Tushar Pant

Introduction

Hierarchical Clustering is a popular unsupervised learning algorithm used to group data points into a hierarchy of clusters. Unlike K-Means, it doesn’t require the number of clusters to be specified in advance. Instead, it builds a hierarchy of clusters in the form of a Dendrogram, which helps in understanding the relationships between data points.

Hierarchical Clustering is particularly useful when you want to explore data structures, find hidden patterns, or identify hierarchical relationships.


1. What is Hierarchical Clustering?

Hierarchical Clustering is an unsupervised learning technique that builds a tree-like structure of clusters called a Dendrogram. It can be visualized as a tree where:

  • Leaves represent individual data points.

  • Nodes represent clusters of data points.

  • Height represents the distance or dissimilarity between clusters.

1.1 Why Use Hierarchical Clustering?

  • No Pre-defined K: Doesn’t require the number of clusters to be specified in advance.

  • Hierarchical Representation: Reveals nested patterns and hierarchical relationships.

  • Flexibility: Works well for small to medium-sized datasets.

  • Interpretability: Dendrograms provide a visual representation of clusters and their relationships.


2. Types of Hierarchical Clustering

There are two main types of Hierarchical Clustering:

2.1 Agglomerative Hierarchical Clustering

  • Bottom-Up Approach: Starts with each data point as an individual cluster and merges the closest clusters iteratively until one single cluster is formed.

  • Most Common Type: Due to its simplicity and ease of implementation.

  • Dendrogram Interpretation: The bottom of the dendrogram represents individual points, and the top represents the root cluster.

2.2 Divisive Hierarchical Clustering

  • Top-Down Approach: Starts with the entire dataset as one cluster and recursively splits it into smaller clusters until each data point is in its own cluster.

  • Less Common: More computationally expensive compared to agglomerative clustering.

  • Dendrogram Interpretation: The top of the dendrogram represents the root cluster, and the leaves represent individual points.


3. How Hierarchical Clustering Works

Hierarchical Clustering follows a series of steps to create a dendrogram:

Step 1: Calculate Distance Matrix

Compute the distance or dissimilarity between every pair of data points using a distance metric like Euclidean, Manhattan, or Cosine distance.

Step 2: Merge Closest Clusters

Identify the two closest clusters based on a linkage criterion (e.g., single linkage, complete linkage, average linkage) and merge them into one cluster.

Step 3: Update Distance Matrix

Update the distance matrix by calculating the distance between the newly formed cluster and the remaining clusters.

Step 4: Repeat Until One Cluster Remains

Continue merging clusters until all data points are merged into a single cluster.
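The four steps above map almost directly onto SciPy's hierarchy API. Below is a minimal sketch on made-up toy data; the five points and the choice of average linkage are purely illustrative.

# Toy walk-through of the four steps using SciPy (illustrative data)
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

# Five two-dimensional points (made up for illustration)
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

# Step 1: pairwise distance matrix (condensed form) using Euclidean distance
dists = pdist(X, metric='euclidean')
print(squareform(dists))       # full 5 x 5 matrix, for inspection

# Steps 2-4: linkage() repeatedly merges the two closest clusters (here with
# average linkage) and updates the distances until one cluster remains.
Z = linkage(dists, method='average')

# Each row of Z records one merge: [cluster_i, cluster_j, distance, new_size]
print(Z)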


4. Distance Metrics and Linkage Criteria

4.1 Distance Metrics

  • Euclidean Distance: Most common, suitable for continuous data.

  • Manhattan Distance: Sum of absolute coordinate differences; less sensitive to large individual differences and often used for high-dimensional or sparse data.

  • Cosine Distance: Measures dissimilarity in direction rather than magnitude (1 - cosine similarity); common for text data.

  • Hamming Distance: Fraction of positions at which two vectors differ; used for categorical data (see the SciPy sketch after this list).
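As a quick illustration of how the choice of metric changes pairwise distances, the snippet below uses SciPy's pdist on made-up vectors.

# Comparing distance metrics with SciPy's pdist (values are made up)
import numpy as np
from scipy.spatial.distance import pdist

X = np.array([[1.0, 0.0, 2.0],
              [2.0, 1.0, 0.0]])

print(pdist(X, metric='euclidean'))   # straight-line distance
print(pdist(X, metric='cityblock'))   # Manhattan (sum of absolute differences)
print(pdist(X, metric='cosine'))      # 1 - cosine similarity (direction only)

# Hamming distance operates on categorical / binary vectors
B = np.array([[1, 0, 1, 1],
              [1, 1, 0, 1]])
print(pdist(B, metric='hamming'))     # fraction of positions that differ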

4.2 Linkage Criteria

  • Single Linkage (Minimum Distance): Distance between the closest points of two clusters.

  • Complete Linkage (Maximum Distance): Distance between the farthest points of two clusters.

  • Average Linkage (Mean Distance): Average distance between all pairs of points in the two clusters.

  • Centroid Linkage: Distance between the centroids of two clusters.

  • Ward’s Method: Merges the pair of clusters that produces the smallest increase in total within-cluster variance (see the comparison sketch after this list).
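The comparison sketch below runs SciPy's linkage on the same small synthetic sample with each criterion; the data are random and only meant to show that the merge heights differ by method.

# Comparing linkage criteria on the same synthetic data
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))              # small random sample (illustrative only)

for method in ['single', 'complete', 'average', 'centroid', 'ward']:
    Z = linkage(X, method=method)
    # The third column of Z holds the merge distances; print the final one
    print(f"{method:>8}: final merge height = {Z[-1, 2]:.3f}")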


5. Dendrograms and How to Interpret Them

A Dendrogram is a tree-like diagram that represents the hierarchy of clusters.

5.1 Structure of a Dendrogram

  • Horizontal Axis: Data points.

  • Vertical Axis: Distance or dissimilarity between clusters.

  • Branches (Links): Represent the merging of clusters.

5.2 Cutting the Dendrogram

To form clusters, cut the dendrogram horizontally at a chosen height. The number of vertical branches the cut crosses equals the number of clusters.
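In code, this cut corresponds to SciPy's fcluster. Below is a minimal sketch on synthetic two-group data; the threshold of 5.0 is arbitrary and only makes sense for this toy example.

# Cutting a dendrogram into flat clusters with fcluster (synthetic data)
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),   # two well-separated
               rng.normal(3, 0.3, (20, 2))])  # groups of points

Z = linkage(X, method='ward')

# Cut at a fixed height: merges above this distance are ignored
labels_by_height = fcluster(Z, t=5.0, criterion='distance')

# Or ask directly for a fixed number of clusters
labels_by_count = fcluster(Z, t=2, criterion='maxclust')

print(np.unique(labels_by_height), np.unique(labels_by_count))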

5.3 Determining the Optimal Number of Clusters

  • Visual Inspection: Cut across the largest vertical gap between successive merges (the longest branch not crossed by any horizontal line); a sketch of this heuristic follows below.

  • Inconsistency Coefficient: Compares each merge height with the average height of the merges directly below it; unusually large values indicate natural cluster boundaries.
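The snippet below automates the largest-gap rule using the merge heights stored in the linkage matrix. It is only a rough heuristic; the synthetic three-group data here is chosen so that it works cleanly.

# Heuristic: cut at the largest gap between successive merge heights
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.5]])
X = np.vstack([rng.normal(c, 0.3, (15, 2)) for c in centers])

Z = linkage(X, method='ward')
heights = Z[:, 2]                   # merge distances, in increasing order
gaps = np.diff(heights)             # vertical spacing between successive merges

# Cutting inside the biggest gap leaves n - (index + 1) clusters
k = len(X) - (np.argmax(gaps) + 1)
print("suggested number of clusters:", k)   # 3 for this toy data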


6. Advantages and Disadvantages

6.1 Advantages:

  • No Predefined K: Number of clusters is determined by analyzing the dendrogram.

  • Hierarchical Representation: Captures hierarchical relationships.

  • Versatility: Works with different types of data and distance metrics.

6.2 Disadvantages:

  • Computational Complexity: O(n^3) time complexity and O(n^2) space complexity.

  • No Reassignment: Data points can’t be reassigned once merged.

  • Sensitivity to Noise: Outliers can distort cluster structure.


7. Limitations and Solutions

  • Limitation: High computational complexity.

    • Solution: Use optimized agglomerative algorithms (e.g., the nearest-neighbor chain approach) or scalable variants like BIRCH or CURE; a short BIRCH sketch follows this list.

  • Limitation: Sensitive to noise and outliers.

    • Solution: Preprocess data using outlier detection and noise reduction techniques.
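As one concrete scalable option, scikit-learn ships a BIRCH implementation. The sketch below is illustrative only; the threshold and cluster count are not tuned.

# BIRCH: a scalable alternative for large datasets (illustrative parameters)
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic data with more points than naive agglomerative clustering handles well
X, _ = make_blobs(n_samples=20_000, centers=3, random_state=0)

# BIRCH summarizes the data into a compact CF-tree in a single pass, then
# clusters the summaries instead of every raw point
birch = Birch(threshold=0.5, n_clusters=3)
labels = birch.fit_predict(X)
print(np.bincount(labels))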

8. Implementation of Hierarchical Clustering in Python

Let's implement Agglomerative Hierarchical Clustering on the Iris dataset, using SciPy to draw the dendrogram and Scikit-learn to assign the cluster labels:

# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data

# Standardize the features so no single measurement dominates the distances
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Build the hierarchy with Ward linkage and plot the dendrogram
linked = linkage(X_scaled, method='ward')
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top', labels=iris.target)  # leaves labelled by true species
plt.title('Dendrogram')
plt.xlabel('Data Points')
plt.ylabel('Ward Distance')
plt.show()

# Apply Agglomerative Clustering with 3 clusters (suggested by the dendrogram)
# Note: the `metric` parameter requires scikit-learn >= 1.2; older versions use `affinity`
hc = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='ward')
y_hc = hc.fit_predict(X_scaled)

# Visualize the clusters on the first two standardized features
plt.scatter(X_scaled[y_hc == 0, 0], X_scaled[y_hc == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X_scaled[y_hc == 1, 0], X_scaled[y_hc == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X_scaled[y_hc == 2, 0], X_scaled[y_hc == 2, 1], s=100, c='green', label='Cluster 3')
plt.title('Hierarchical Clustering (Iris, first two standardized features)')
plt.xlabel('Sepal length (standardized)')
plt.ylabel('Sepal width (standardized)')
plt.legend()
plt.show()
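Continuing from the script above, one optional sanity check is to cut the SciPy hierarchy into three flat clusters and compare the resulting partition with the scikit-learn labels (the cluster IDs may be permuted).

# Optional check: cut the SciPy hierarchy into 3 flat clusters and compare sizes
from scipy.cluster.hierarchy import fcluster

y_scipy = fcluster(linked, t=3, criterion='maxclust')
print(np.unique(y_scipy, return_counts=True))   # cluster sizes from SciPy
print(np.unique(y_hc, return_counts=True))      # cluster sizes from scikit-learn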


9. Real-world Applications

  • Customer Segmentation: Grouping customers based on purchase behavior.

  • Gene Expression Analysis: Identifying gene groups with similar expression patterns.

  • Document Clustering: Grouping documents based on topic similarity.

  • Image Segmentation: Segmenting images into different objects or regions.


10. Conclusion

Hierarchical Clustering is a powerful and versatile unsupervised learning algorithm. Its ability to reveal hierarchical relationships makes it a valuable tool for exploratory data analysis. Ready to explore DBSCAN next?
