Day 17: K-Means Clustering – Discovering Hidden Groups in Data

Saket Khopkar

K-Means is one of the simplest and most popular clustering algorithms.
It finds groups in your data, and every point belongs to one of these groups (called clusters).

Before we dive into K-Means, let's first set the stage.

In supervised learning, you have:

  • Input (X)

  • Output (Y)

  • And your goal is to learn a function that maps X → Y

But in unsupervised learning, you only have X (input data).
There are no labels and no ground truth; you're basically saying: "Here's a bunch of data. Can you find patterns, structure, or groups in it?"

That is where K-Means Clustering comes into the picture. It takes a bunch of unlabelled data and groups similar points together, making the data much easier to understand.


A real-life example should simplify your understanding:

Imagine you own a chain of ice cream shops. You have hundreds of customers, and for each one, you know:

  • Age

  • Money spent per visit

  • Number of visits per month

You want to segment your customers:

  • Who are your high spenders?

  • Who are regulars?

  • Who might churn?

But you don't know these groups ahead of time; they need to be discovered from patterns. That's where clustering comes in: it helps you group similar people together.

Let’s say we want to find 3 types of ice cream customers:

  1. Budget Teens

  2. Family Buyers

  3. Rich Regulars

Here’s how K-Means does it:

  1. Choose the number of clusters (K) you want. Let's say K = 3.

  2. Randomly initialize 3 centroids (the center points of the clusters).

  3. Assign each customer to the closest centroid (based on distance).

  4. Once everyone is assigned to a group, move each centroid to the average location of its group.

  5. Repeat steps 3 and 4 until the cluster assignments stop changing (i.e., convergence).

The closeness of a point to a centroid is determined by the Euclidean distance formula: for two points (x1, y1) and (x2, y2), the distance is sqrt((x1 - x2)^2 + (y1 - y2)^2). But don't worry about complex computations; the library will do that for you.
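
If you're curious what those steps look like in code, here is a minimal from-scratch sketch in NumPy. It is an illustration only (the function name kmeans_sketch is made up for this post); in practice you would use scikit-learn's KMeans, as we do below.

import numpy as np

def kmeans_sketch(X, k=3, max_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Quick demo on 50 random 2-D points
points = np.random.default_rng(0).random((50, 2))
labels, centroids = kmeans_sketch(points, k=3)
print(centroids)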


Practical Time:

Let's try this in Python with Seaborn, Matplotlib, and a sample dataset.

Let's create a synthetic Mall Customers dataset.

import pandas as pd
import numpy as np

# Set seed for reproducibility
np.random.seed(42)

# Generate synthetic data
n_customers = 200
customer_id = np.arange(1, n_customers+1)
age = np.random.randint(18, 70, size=n_customers)
annual_income = np.random.randint(15, 140, size=n_customers)  # in k$
spending_score = np.random.randint(1, 101, size=n_customers)  # 1-100 scale

# Create DataFrame
df = pd.DataFrame({
    'CustomerID': customer_id,
    'Age': age,
    'Annual Income (k$)': annual_income,
    'Spending Score (1-100)': spending_score
})

df.head()

This will generate a sample dataset with random values. Here are the top 5 rows:

   CustomerID  Age  Annual Income (k$)  Spending Score (1-100)
0           1   56                  84                      61
1           2   69                  86                      48
2           3   46                  41                      19
3           4   32                  23                       4
4           5   60                  76                      35

Next, it's time to visualize the data in the dataset.

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,6))
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)')
plt.title("Customer Distribution by Income and Spending")
plt.grid(True)
plt.show()

Good enough; this looks like nice data to work with.

Our next step is to group the data into clusters. For this example, we will put our data into 5 clusters, i.e., K = 5. To apply K-Means, use the snippet below:

from sklearn.cluster import KMeans

# Select only two columns for simplicity
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

# Apply KMeans with 5 clusters (n_init set explicitly, since its default
# changed in recent scikit-learn versions)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df['Cluster'] = kmeans.fit_predict(X)

df.head()
   CustomerID  Age  Annual Income (k$)  Spending Score (1-100)  Cluster
0           1   56                  84                      61        4
1           2   69                  86                      48        4
2           3   46                  41                      19        1
3           4   32                  23                       4        1
4           5   60                  76                      35        1

Now that our data is grouped into clusters, let's visualize them:

plt.figure(figsize=(10,6))
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)', hue='Cluster', palette='tab10')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=250, c='black', marker='X', label='Centroids')
plt.legend()
plt.title("Customer Segments via K-Means")
plt.grid(True)
plt.show()

The centroids are marked with a big black "X". If you see six X's, the extra one comes from the legend, which just labels the clusters by their colours.
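
To see what each discovered segment actually looks like (and connect it back to intuitions like "rich regulars"), you can average the features per cluster. Here is a small sketch continuing from the df above; profile is just an illustrative variable name:

# Profile each cluster: average income/spending plus cluster size
profile = df.groupby('Cluster')[['Annual Income (k$)', 'Spending Score (1-100)']].mean()
profile['Size'] = df['Cluster'].value_counts().sort_index()
print(profile.round(1))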

💡
There is a concept called the "Elbow Method". It is a heuristic used to determine the optimal number of clusters (K), most commonly for K-Means.

A typical demonstration can be seen using the code snippet below:

inertia = []
K_range = range(1, 11)

for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertia.append(km.inertia_)

plt.plot(K_range, inertia, 'o-')
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.grid(True)
plt.show()

Look for the "elbow" in the plot: the point where adding more clusters stops producing a large drop in inertia. That value of K is usually a good choice.

The last step is to test our model. We have created everything: the data, the clusters, the centroids. Now, if a new data point arrives, which cluster will it end up in?

# Example: a customer with 60k annual income and a spending score of 80.
# Wrapping the point in a DataFrame with matching column names avoids
# scikit-learn's "X does not have valid feature names" warning,
# since the model was fitted on a DataFrame.
new_customer = pd.DataFrame([[60, 80]], columns=X.columns)
predicted_cluster = kmeans.predict(new_customer)
print("Predicted Cluster:", predicted_cluster[0])

The output I get from this snippet is: Predicted Cluster: 3. This lets you tag future customers with the discovered groups!

That means a new data point with the values in the example above belongs to cluster 3. Play around with the values to check for yourself.


A short and sweet summary

Concept         What we covered
Clustering      Grouping similar data without labels
K-Means         Finds K clusters using centroids and distance
Elbow Method    Used to select the best value of K
Real Data       Segmented mall customers into meaningful groups
Prediction      You can classify new data into clusters

Okay, let’s think about a few interview questions you may encounter around this topic:

  • How does K-Means work? → Explain the basics with examples, as we did above.

  • How do you choose K? → Elbow Method

  • What’s a weakness of K-Means? → Sensitive to scale + assumes spherical clusters

  • When to use K-Means? → When clusters are compact, data is numeric, low noise
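
On the scale-sensitivity point: K-Means works on raw Euclidean distances, so a feature measured in larger units can dominate the result. A common fix is to standardize the features before clustering. Here is a minimal sketch using scikit-learn's StandardScaler in a pipeline; it is an extra illustration on top of the example above, and the ScaledCluster column name is made up:

from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale each feature to zero mean / unit variance before clustering,
# so both columns contribute equally to the distance computation
scaled_kmeans = make_pipeline(StandardScaler(), KMeans(n_clusters=5, n_init=10, random_state=42))
df['ScaledCluster'] = scaled_kmeans.fit_predict(X)
print(df['ScaledCluster'].value_counts())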


We saw how to extract patterns from data, and how that makes analysis and prediction much easier. We used a small set of values for the example in this blog, but in real life you may encounter huge amounts of data, and that is where the real power of this algorithm shows.

For now, experiment with the number of clusters, the datasets, and so on. Ciao, and happy coding!
