K-Means Clustering – Unraveling Hidden Patterns in Data

Tushar Pant
5 min read

Introduction

K-Means Clustering is one of the simplest and most widely used unsupervised learning algorithms for partitioning a dataset into K distinct, non-overlapping clusters. It groups data points based on their similarity, aiming to minimize the distance between points within a cluster while maximizing the distance between points in different clusters.

Whether you’re segmenting customers, compressing images, or detecting anomalies, K-Means is a powerful tool for discovering hidden patterns in data.


1. What is K-Means Clustering?

K-Means Clustering is an iterative clustering algorithm that partitions a dataset into K clusters, where each cluster is defined by its centroid. Data points are assigned to the cluster with the nearest centroid based on a distance metric (usually Euclidean distance).


1.1 Objective Function

The objective of K-Means is to minimize the Within-Cluster Sum of Squares (WCSS), also known as the Inertia:

WCSS = Σᵢ₌₁ᴷ Σₓ∈Cᵢ ‖x − μᵢ‖²

Where:

  • K = Number of clusters

  • Cᵢ = Cluster i

  • x = Data point

  • μᵢ = Centroid of cluster Cᵢ
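In scikit-learn this quantity is exposed as the inertia_ attribute of a fitted KMeans model. The following minimal sketch, using toy random data purely for illustration, recomputes WCSS by hand and confirms the two agree:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(42).random((100, 2))   # toy data for illustration
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# WCSS: sum over clusters of squared distances from points to their centroid
wcss = sum(((X[km.labels_ == i] - c) ** 2).sum()
           for i, c in enumerate(km.cluster_centers_))
assert np.isclose(wcss, km.inertia_)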

1.2 Why Use K-Means Clustering?

  • Simple and Efficient: Easy to implement and computationally efficient.

  • Scalable: Works well with large datasets.

  • Interpretability: Easy to interpret results with well-separated clusters.

  • Versatility: Applied in various fields like customer segmentation, image compression, and anomaly detection.


2. How Does K-Means Clustering Work?

K-Means iterates through the following steps until it converges to a (locally) optimal set of clusters:

Step 1: Initialize Centroids

Randomly select K data points as initial centroids.

Step 2: Assign Data Points to Nearest Centroid

Assign each data point x to the cluster with the nearest centroid, measured by the Euclidean distance:

d(x, μᵢ) = ‖x − μᵢ‖

Step 3: Update Centroids

Calculate the new centroid of each cluster as the mean of all points assigned to that cluster:

μᵢ = (1 / |Cᵢ|) Σₓ∈Cᵢ x

Step 4: Repeat Until Convergence

Repeat Steps 2 and 3 until centroids no longer change or a maximum number of iterations is reached.
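To make the loop concrete, here is a minimal from-scratch sketch of these four steps, assuming only NumPy, random-point initialization, and Euclidean distance (Section 6 shows the production-ready scikit-learn version):

import numpy as np

def kmeans(X, k, max_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points;
        # an empty cluster keeps its previous centroid
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids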


3. Choosing the Number of Clusters (K)

Choosing the right value for K is crucial for effective clustering. Here are some common methods:

3.1 Elbow Method

Plot WCSS against different values of K and choose the value of K at the "elbow point," where the decrease in WCSS slows down. The implementation in Section 6 below demonstrates this method.

3.2 Silhouette Score

The silhouette score measures how similar a point is to its own cluster compared to other clusters. It ranges from -1 to 1:

  • Close to +1: Data point is well-matched to its cluster.

  • Close to 0: Data point is on or near the decision boundary.

  • Close to -1: Data point has likely been assigned to the wrong cluster.
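A minimal sketch of using this criterion to pick K, loading and scaling the Iris data inline so the snippet stands alone (the silhouette score needs at least two clusters, so K starts at 2):

from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(datasets.load_iris().data)
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f'K={k}: silhouette = {silhouette_score(X, labels):.3f}')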

3.3 Gap Statistic

Compares the total within-cluster variation for different K values with their expected values under a null reference distribution.
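Scikit-learn does not ship a gap-statistic helper, so here is a hedged sketch of the idea, assuming a uniform reference distribution sampled from the data's bounding box and a handful of reference datasets per K (a fuller version would also apply the standard-error rule from Tibshirani et al.):

import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_max=10, n_refs=10, seed=42):
    rng = np.random.default_rng(seed)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        # log within-cluster dispersion on the real data
        log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
                        .fit(X).inertia_)
        # expected log dispersion under the uniform reference distribution
        ref = [np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
                      .fit(rng.uniform(mins, maxs, size=X.shape)).inertia_)
               for _ in range(n_refs)]
        gaps.append(np.mean(ref) - log_wk)   # larger gap = better K
    return gaps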


4. Advantages and Disadvantages

4.1 Advantages:

  • Simplicity and Efficiency: Easy to implement and computationally efficient.

  • Scalable: Efficient on large datasets.

  • Generalization Capability: Adapts to many kinds of numeric datasets with little problem-specific tuning.

4.2 Disadvantages:

  • Fixed Number of Clusters: Requires the number of clusters (K) to be specified in advance.

  • Sensitivity to Initialization: Final clusters depend on the initial choice of centroids.

  • Assumes Spherical Clusters: Works best when clusters are spherical and of similar size.

  • Sensitive to Outliers: Outliers can significantly affect the cluster centroids.


5. Limitations and Solutions

  • Limitation: Sensitive to initialization and may converge to a local minimum.

    • Solution: Use K-Means++ initialization to select initial centroids more intelligently.

  • Limitation: Assumes clusters are spherical and equally sized.

    • Solution: Use other clustering algorithms like DBSCAN or Gaussian Mixture Models for complex cluster shapes (see the sketch after this list).

  • Limitation: Not suitable for categorical data.

    • Solution: Use K-Modes or K-Prototypes for categorical or mixed data types.
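As a small illustration of the second point, the sketch below builds synthetic anisotropic (stretched) blobs, a shape that violates K-Means' spherical assumption, and fits a full-covariance Gaussian Mixture, which can model the elongated clusters:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Stretch spherical blobs into elongated clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

# A full-covariance Gaussian Mixture adapts to the elongated shapes
labels = GaussianMixture(n_components=3, covariance_type='full',
                         random_state=42).fit_predict(X)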

6. Implementation of K-Means Clustering in Python

Let's implement K-Means Clustering using Scikit-learn on the Iris dataset:

# Importing necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data

# Standardize the data so no single feature dominates the distance metric
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Determine the optimal K using the Elbow Method
wcss = []
for i in range(1, 11):
    # n_init=10 makes the number of restarts explicit across scikit-learn versions
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, max_iter=300, random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Applying KMeans with optimal K
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, max_iter=300, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)

# Silhouette Score
sil_score = silhouette_score(X_scaled, y_kmeans)
print(f'Silhouette Score: {sil_score:.3f}')

# Visualize the clusters; only the first two of the four standardized
# features (sepal length and sepal width) are plotted, while the
# clustering itself used all four
plt.scatter(X_scaled[y_kmeans == 0, 0], X_scaled[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X_scaled[y_kmeans == 1, 0], X_scaled[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X_scaled[y_kmeans == 2, 0], X_scaled[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.title('K-Means Clustering')
plt.xlabel('Sepal length (standardized)')
plt.ylabel('Sepal width (standardized)')
plt.legend()
plt.show()


7. Real-world Applications

  • Customer Segmentation: Grouping customers based on purchasing behavior.

  • Image Compression: Reducing the number of colors in an image (see the sketch after this list).

  • Anomaly Detection: Identifying outliers or anomalies in data.

  • Recommendation Systems: Grouping users based on browsing patterns.

  • Social Media Analysis: Analyzing user communities or topics.
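The image-compression use case, for example, is usually implemented as color quantization: every pixel is mapped to the nearest of K centroid colors. A minimal sketch, using a random array as a stand-in for a real RGB image:

import numpy as np
from sklearn.cluster import KMeans

image = np.random.default_rng(42).random((64, 64, 3))  # stand-in for a real RGB image
pixels = image.reshape(-1, 3)                          # one row per pixel

# 16 clusters = a 16-color palette learned from the image itself
km = KMeans(n_clusters=16, n_init=4, random_state=42).fit(pixels)
quantized = km.cluster_centers_[km.labels_].reshape(image.shape)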


8. Tips for Optimal Clustering

  • Standardize the Data: Scale features for better performance.

  • Use K-Means++: For better initialization and faster convergence.

  • Elbow Method and Silhouette Score: To determine the optimal number of clusters.

  • Multiple Runs: Run the algorithm multiple times with different initializations.
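On the last point, scikit-learn's n_init parameter already does this for you: it runs the whole algorithm several times from different initializations and keeps the solution with the lowest WCSS. A one-line sketch:

from sklearn.cluster import KMeans

# 10 independent runs; the fit with the lowest inertia (WCSS) is kept
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)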


9. Conclusion

K-Means Clustering is a powerful unsupervised learning algorithm for partitioning data into meaningful clusters. Its simplicity, scalability, and effectiveness have made it a popular choice in various domains.
