Mastering Clustering: Unsupervised Learning Techniques and Best Practices
Part III
Clustering is a fundamental unsupervised learning technique used to group similar data points into clusters without predefined labels. It helps in discovering hidden patterns or structures within datasets, which is useful in various applications like customer segmentation, anomaly detection, and image compression.
Unsupervised learning contrasts with supervised learning as it doesn't rely on labeled data. Instead, it identifies natural groupings within the dataset based on similarity or distance metrics.
Real-world applications:
Market segmentation: Grouping customers based on purchasing behavior.
Image compression: Reducing image size by grouping pixels into clusters.
Anomaly detection: Identifying outliers in financial transactions.
1. K-Means Clustering
Definition:
K-Means is a centroid-based clustering algorithm that partitions the dataset into k distinct clusters. Each cluster is represented by a centroid, and data points are assigned to the nearest centroid.
Specifications:
Parameters: Number of clusters (k), distance metric (e.g., Euclidean distance).
Complexity: O(n k t), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
Real-world example:
Customer segmentation based on purchasing behavior. Companies use K-Means to group customers into segments for targeted marketing.
Code Snippet: K-Means Implementation in Python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Generate sample data
X = np.array([[1, 2], [1, 4], [1, 0],
[4, 2], [4, 4], [4, 0]])
# K-Means clustering
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
# Cluster centers and labels
print("Cluster centers:\n", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)
# Visualize the clusters
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='rainbow')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=100, c='black')
plt.title('K-Means Clustering')
plt.show()
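Once fitted, the same model can assign previously unseen points to the nearest learned centroid. A minimal sketch, reusing the kmeans object from the snippet above (the new points are made-up examples):
# Assign new observations to the nearest centroid
new_points = np.array([[0, 0], [5, 4]])  # hypothetical new data
print("Predicted clusters:", kmeans.predict(new_points))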
Visual Aid:
A scatter plot visualizing the two clusters with their centroids marked.
Use color to differentiate between clusters and show how each centroid sits at the center of the points assigned to it.
2. Hierarchical Clustering
Definition:
Hierarchical clustering either starts by treating each data point as a single cluster (agglomerative) or starts with a single cluster and splits it into smaller clusters (divisive). The result is a hierarchy of clusters, which can be visualized using a dendrogram.
Specifications:
Parameters: Linkage criteria (e.g., single, complete, average), distance metric.
Complexity: at least O(n²), where n is the number of data points (naive agglomerative implementations are O(n³)).
Real-world example:
Document categorization based on textual similarity. Companies use hierarchical clustering to organize large text corpora into topics.
Code Snippet: Hierarchical Clustering in Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
# Generate sample data
X = np.array([[1, 2], [1, 4], [1, 0],
[4, 2], [4, 4], [4, 0]])
# Perform hierarchical/agglomerative clustering
Z = linkage(X, 'ward')
# Plot dendrogram
plt.figure()
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram')
plt.show()
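The dendrogram can also be cut into a fixed number of flat clusters with SciPy's fcluster. A minimal sketch, reusing the Z linkage matrix from the snippet above and an assumed cut into 2 clusters:
from scipy.cluster.hierarchy import fcluster
# Cut the tree so that at most 2 flat clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print("Flat cluster labels:", labels)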
Visual Aid:
A dendrogram showing the hierarchical structure of clusters.
Demonstrates how data points merge into clusters step-by-step.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Definition:
DBSCAN is a density-based clustering algorithm that groups together closely packed points while marking points that lie in low-density regions as outliers.
Specifications:
Parameters: Epsilon (eps), minimum samples (min_samples).
Complexity: O(n log n) on average with a spatial index, where n is the number of points; O(n²) in the worst case.
Real-world example:
Anomaly detection in financial data. DBSCAN can detect fraudulent transactions by identifying outliers in dense transaction clusters.
Code Snippet: DBSCAN in Python
from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt
# Generate sample data
X = np.array([[1, 2], [2, 3], [2, 2],
[8, 7], [8, 8], [25, 80]])
# Perform DBSCAN clustering
db = DBSCAN(eps=3, min_samples=2).fit(X)
# Visualize clusters and noise points
plt.scatter(X[:, 0], X[:, 1], c=db.labels_, cmap='rainbow')
plt.title('DBSCAN Clustering')
plt.show()
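In scikit-learn, DBSCAN assigns the label -1 to noise points. A quick check, reusing the db object fitted above:
# Points labeled -1 are treated as noise (outliers)
noise_mask = db.labels_ == -1
print("Noise points:\n", X[noise_mask])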
Visual Aid:
A scatter plot showing clusters with distinct colors, and outliers marked differently.
Illustrates how DBSCAN handles noise effectively.
4. Gaussian Mixture Models (GMM)
Definition:
GMM assumes that data points are generated from a mixture of several Gaussian distributions, each representing a different cluster.
Specifications:
Parameters: Number of components (clusters), covariance type.
Complexity: roughly O(n k d²) per EM iteration, where n is the number of points, k is the number of components, and d is the dimensionality.
Real-world example:
GMM can be used in bioinformatics to identify subpopulations within gene expression data.
Code Snippet: GMM in Python
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
# Generate sample data
X = np.array([[1, 2], [1, 4], [1, 0],
[10, 2], [10, 4], [10, 0]])
# Fit GMM model
gmm = GaussianMixture(n_components=2).fit(X)
labels = gmm.predict(X)
# Visualize clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='rainbow')
plt.title('GMM Clustering')
plt.show()
Visual Aid:
A scatter plot showing the Gaussian distribution of clusters.
Helps visualize how GMM can model soft cluster assignments.
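Unlike K-Means, GMM exposes per-component membership probabilities (soft assignments). A minimal sketch, reusing the gmm model fitted above:
# Probability of each point belonging to each Gaussian component
probs = gmm.predict_proba(X)
print("Soft assignments:\n", np.round(probs, 3))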
Evaluation of Clustering Algorithms
To evaluate clustering performance, we use different metrics depending on whether we have ground truth labels or not:
Silhouette Score: Measures how similar a data point is to its own cluster compared to other clusters.
from sklearn.metrics import silhouette_score
score = silhouette_score(X, kmeans.labels_)
print("Silhouette Score:", score)
Elbow Method: Used to determine the optimal number of clusters by plotting the sum of squared distances and finding the "elbow point."
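A minimal sketch of the elbow method, assuming the imports and X array from the K-Means snippet above; inertia_ is K-Means' sum of squared distances to the nearest centroid:
inertias = []
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
plt.plot(range(1, 6), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Sum of squared distances (inertia)')
plt.title('Elbow Method')
plt.show()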
Adjusted Rand Index: Compares the similarity between the true labels and predicted labels.
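The Adjusted Rand Index only applies when ground-truth labels are available. A minimal sketch with made-up true labels for the six sample points, compared against the K-Means labels from above:
from sklearn.metrics import adjusted_rand_score
true_labels = [0, 0, 0, 1, 1, 1]  # hypothetical ground truth
print("Adjusted Rand Index:", adjusted_rand_score(true_labels, kmeans.labels_))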
Challenges in Clustering
Choosing the Right Number of Clusters: Algorithms like K-Means require you to specify the number of clusters beforehand, which can be challenging.
Sensitivity to Noise: Algorithms like K-Means can struggle with noisy data, while DBSCAN handles noise better.
Initial Centroid Selection: In K-Means, bad initial centroid selection can lead to poor convergence. Using K-Means++ helps address this issue.
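scikit-learn already defaults to K-Means++ initialization, but it can be requested explicitly. A minimal sketch, assuming the X array from the K-Means snippet above:
# init='k-means++' spreads the initial centroids apart;
# n_init reruns the algorithm several times and keeps the best result
kmeans_pp = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0).fit(X)
print("Centers with K-Means++ init:\n", kmeans_pp.cluster_centers_)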
Conclusion
Clustering is an essential technique for understanding and organizing unlabeled data. With the right algorithm—whether it's K-Means, Hierarchical Clustering, DBSCAN, or GMM—you can unlock insights from your data by grouping similar instances together. Experiment with different clustering techniques and parameters to see which one works best for your dataset.