K-Means Clustering – Unraveling Hidden Patterns in Data

Introduction
K-Means Clustering is one of the simplest and most widely used unsupervised learning algorithms for partitioning a dataset into K distinct, non-overlapping clusters. It groups data points by similarity, minimizing the distance between points within a cluster so that points in different clusters end up far apart.
Whether you're segmenting customers, compressing images, or detecting anomalies, K-Means is a powerful tool for discovering hidden patterns in data.
1. What is K-Means Clustering?
K-Means Clustering is an iterative clustering algorithm that partitions a dataset into K clusters, where each cluster is defined by its centroid. Data points are assigned to the cluster with the nearest centroid based on a distance metric (usually Euclidean distance).
1.1 Objective Function
The objective of K-Means is to minimize the Within-Cluster Sum of Squares (WCSS), also known as the Inertia:

$$\text{WCSS} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

Where:
K = Number of clusters
C_i = Cluster i
x = A data point belonging to cluster C_i
μ_i = Centroid of cluster C_i
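To make this concrete, here is a minimal sketch that computes WCSS by hand on a toy dataset and checks it against scikit-learn's inertia_ attribute (the make_blobs data is purely illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 300 points drawn from 3 blobs (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# WCSS by hand: squared distances from each point to its assigned centroid
wcss = sum(
    np.sum((X[kmeans.labels_ == i] - c) ** 2)
    for i, c in enumerate(kmeans.cluster_centers_)
)
print(wcss, kmeans.inertia_)  # the two values should agree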
1.2 Why Use K-Means Clustering?
Simple and Efficient: Easy to implement and computationally efficient.
Scalable: Works well with large datasets.
Interpretability: Easy to interpret results with well-separated clusters.
Versatility: Applied in various fields like customer segmentation, image compression, and anomaly detection.
2. How Does K-Means Clustering Work?
K-Means follows an iterative process that converges to a (locally) optimal set of clusters:
Step 1: Initialize Centroids
Randomly select K data points as initial centroids.
Step 2: Assign Data Points to Nearest Centroid
Assign each data point x to the cluster with the nearest centroid, based on the Euclidean distance:

$$c(x) = \arg\min_{i \in \{1,\dots,K\}} \lVert x - \mu_i \rVert$$
Step 3: Update Centroids
Recompute each centroid as the mean of all points assigned to its cluster:

$$\mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x$$
Step 4: Repeat Until Convergence
Repeat Steps 2 and 3 until centroids no longer change or a maximum number of iterations is reached.
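These four steps map almost line-for-line onto a short NumPy implementation. The sketch below (the name simple_kmeans is made up for this example) is for illustration only and assumes no cluster ever ends up empty:

import numpy as np

def simple_kmeans(X, k, max_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick K data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids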
3. Choosing the Number of Clusters (K)
Choosing the right value for K is crucial for effective clustering. Here are some common methods:
3.1 Elbow Method
Plot WCSS against different values of K. Choose the value of K at the "elbow point," where the decrease in WCSS slows down.
3.2 Silhouette Score
Measures how similar a point is to its own cluster compared to other clusters. The silhouette score ranges from -1 to 1:
Close to +1: Data point is well-matched to its cluster.
Close to 0: Data point is on or near the decision boundary.
Close to -1: Data point is misclassified.
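A common way to use this in practice is to compute the silhouette score for several values of K and pick the one with the highest score. A minimal sketch with scikit-learn (X is assumed to be an already standardized feature matrix):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X: standardized feature matrix (assumed to be prepared beforehand)
for k in range(2, 11):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))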
3.3 Gap Statistic
Compares the total within-cluster variation for different K values with their expected values under a null reference distribution.
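scikit-learn has no built-in gap-statistic helper, so the sketch below implements a simplified version of the idea, comparing the observed log-WCSS against reference datasets drawn uniformly over the bounding box of the features:

import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, seed=42):
    rng = np.random.default_rng(seed)
    # Observed within-cluster dispersion (log of the inertia)
    log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
    # Expected dispersion under a uniform null reference distribution
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ref_log_wks = [
        np.log(KMeans(n_clusters=k, n_init=10).fit(
            rng.uniform(mins, maxs, size=X.shape)).inertia_)
        for _ in range(n_refs)
    ]
    # Gap = expected reference dispersion minus observed dispersion
    return np.mean(ref_log_wks) - log_wk

A larger gap suggests a better K; the original procedure (Tibshirani et al.) additionally uses the standard error of the reference dispersions to pick the smallest adequate K.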
4. Advantages and Disadvantages
4.1 Advantages:
Simplicity and Efficiency: Easy to implement and computationally efficient.
Scalable: Efficient on large datasets.
Generalization: Once fitted, the centroids can cheaply assign new, unseen points to clusters.
4.2 Disadvantages:
Fixed Number of Clusters: Requires the number of clusters (K) to be specified in advance.
Sensitivity to Initialization: Final clusters depend on the initial choice of centroids.
Assumes Spherical Clusters: Works best when clusters are spherical and of similar size.
Sensitive to Outliers: Outliers can significantly affect the cluster centroids.
5. Limitations and Solutions
Limitation: Sensitive to initialization and may converge to a local minimum.
- Solution: Use K-Means++ initialization to select initial centroids more intelligently.
Limitation: Assumes clusters are spherical and equally sized.
- Solution: Use other clustering algorithms like DBSCAN or Gaussian Mixture Models for complex cluster shapes.
Limitation: Not suitable for categorical data.
- Solution: Use K-Modes or K-Prototypes for categorical or mixed data types.
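As an illustration of that last point, a K-Modes sketch might look like the following. It assumes the third-party kmodes package (pip install kmodes), and the categorical data is made up for this example:

import numpy as np
from kmodes.kmodes import KModes

# Hypothetical categorical data: (color, size, shape)
data = np.array([
    ['red',  'small', 'round'],
    ['red',  'small', 'square'],
    ['blue', 'large', 'round'],
    ['blue', 'large', 'square'],
])

# K-Modes replaces means with modes and uses a matching dissimilarity measure
km = KModes(n_clusters=2, init='Huang', n_init=5, random_state=42)
labels = km.fit_predict(data)
print(labels, km.cluster_centroids_)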
6. Implementation of K-Means Clustering in Python
Let's implement K-Means Clustering using Scikit-learn on the Iris dataset:
# Importing necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Determine optimal K using Elbow Method
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, max_iter=300, random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
# Applying KMeans with optimal K
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, max_iter=300, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)
# Silhouette Score
sil_score = silhouette_score(X_scaled, y_kmeans)
print(f'Silhouette Score: {sil_score:.3f}')
# Visualize the clusters (using the first two standardized features only)
plt.scatter(X_scaled[y_kmeans == 0, 0], X_scaled[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X_scaled[y_kmeans == 1, 0], X_scaled[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X_scaled[y_kmeans == 2, 0], X_scaled[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.title('K-Means Clustering')
plt.legend()
plt.show()
7. Real-world Applications
Customer Segmentation: Grouping customers based on purchasing behavior.
Image Compression: Reducing the number of colors in an image (a worked sketch follows this list).
Anomaly Detection: Identifying outliers or anomalies in data.
Recommendation Systems: Grouping users based on browsing patterns.
Social Media Analysis: Analyzing user communities or topics.
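As a worked example of the image-compression use case above, the sketch below quantizes an image down to a 16-color palette by clustering its pixels. The file name 'photo.png' is a placeholder; substitute any RGB image:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# 'photo.png' is a placeholder path; substitute any RGB image
img = plt.imread('photo.png')[:, :, :3].astype(float)
if img.max() > 1.0:  # 8-bit images load as 0-255
    img /= 255.0
pixels = img.reshape(-1, 3)  # one row per pixel

# Cluster pixel colors into 16 groups; the centroids become the palette
kmeans = KMeans(n_clusters=16, n_init=4, random_state=42).fit(pixels)
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(img.shape)

plt.imshow(np.clip(compressed, 0, 1))
plt.axis('off')
plt.show()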
8. Tips for Optimal Clustering
Standardize the Data: Scale features for better performance.
Use K-Means++: For better initialization and faster convergence.
Elbow Method and Silhouette Score: To determine the optimal number of clusters.
Multiple Runs: Run the algorithm multiple times with different initializations.
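The last tip is built into scikit-learn: the n_init parameter reruns K-Means with different initializations and keeps the solution with the lowest WCSS. A quick sketch of the effect on synthetic data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

# One random initialization vs. the best of twenty
single = KMeans(n_clusters=5, init='random', n_init=1, random_state=1).fit(X)
multi = KMeans(n_clusters=5, init='random', n_init=20, random_state=1).fit(X)
print(single.inertia_, multi.inertia_)  # multi should be no worse than single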
9. Conclusion
K-Means Clustering is a powerful unsupervised learning algorithm for partitioning data into meaningful clusters. Its simplicity, scalability, and effectiveness have made it a popular choice in various domains.