Day 17: K-Means Clustering – Discovering Hidden Groups in Data

Saket Khopkar

K-Means is one of the simplest and most popular clustering algorithms.
It finds groups in your data, and every point belongs to one of these groups (called clusters).

Before we dive into K-Means, let's first set the stage.

In supervised learning, you have:

  • Input (X)

  • Output (Y)

  • And your goal is to learn a function that maps X → Y

But in unsupervised learning, you only have X (input data).
There are no labels and no ground truth; you're basically saying: "Here's a bunch of data. Can you find patterns, structure, or groups in it?"

That is where K-Means Clustering comes into the picture. It takes a bunch of unlabelled data and groups similar points together, making the data much easier to understand.


A real-life example should simplify your understanding:

Imagine you own a chain of ice cream shops. You have hundreds of customers, and for each one, you know:

  • Age

  • Money spent per visit

  • Number of visits per month

You want to segment your customers:

  • Who are your high spenders?

  • Who are regulars?

  • Who might churn?

But you don't know these groups ahead of time; they need to be discovered from patterns. That's where clustering comes in: it helps you group similar people together.

Let’s say we want to find 3 types of ice cream customers:

  1. Budget Teens

  2. Family Buyers

  3. Rich Regulars

Here’s how K-Means does it:

  1. Choose the number of clusters (K) you want. Let's say K = 3.

  2. Randomly initialize 3 centroids (the center points of the clusters).

  3. Assign each customer to the closest centroid (based on distance).

  4. Once everyone is assigned to a group, move each centroid to the average location of its group.

  5. Repeat steps 3 and 4 until the cluster assignments stop changing (i.e., convergence).

The closeness of a point to a centroid is determined by the Euclidean distance formula: for two points (x1, y1) and (x2, y2), the distance is sqrt((x1 - x2)^2 + (y1 - y2)^2). But don't worry about complex computations; the library will do that for you.
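
If you're curious what those steps look like in code, here is a minimal from-scratch sketch in NumPy. It is an illustration only (the function name kmeans_sketch is made up for this post); in practice you would use scikit-learn's KMeans, as we do below.

import numpy as np

def kmeans_sketch(X, k=3, max_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Quick demo on 50 random 2-D points
points = np.random.default_rng(0).random((50, 2))
labels, centroids = kmeans_sketch(points, k=3)
print(centroids)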


Practical Time:

Let's try this in Python with Seaborn, Matplotlib, and a sample dataset.

Let's create a synthetic Mall Customers dataset.

import pandas as pd
import numpy as np

# Set seed for reproducibility
np.random.seed(42)

# Generate synthetic data
n_customers = 200
customer_id = np.arange(1, n_customers+1)
age = np.random.randint(18, 70, size=n_customers)
annual_income = np.random.randint(15, 140, size=n_customers)  # in k$
spending_score = np.random.randint(1, 101, size=n_customers)  # 1-100 scale

# Create DataFrame
df = pd.DataFrame({
    'CustomerID': customer_id,
    'Age': age,
    'Annual Income (k$)': annual_income,
    'Spending Score (1-100)': spending_score
})

df.head()

This will generate a sample dataset with random values. Here are the top 5 rows:

   CustomerID  Age  Annual Income (k$)  Spending Score (1-100)
0           1   56                  84                      61
1           2   69                  86                      48
2           3   46                  41                      19
3           4   32                  23                       4
4           5   60                  76                      35

Next, it's time to visualize the data in the dataset.

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,6))
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)')
plt.title("Customer Distribution by Income and Spending")
plt.grid(True)
plt.show()

Good enough; this looks like nice data to work with.

Our next step is to group the data into clusters. For this example, we will put our data into 5 clusters, i.e., K = 5. To apply K-Means, use the snippet below:

from sklearn.cluster import KMeans

# Select only two columns for simplicity
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

# Apply KMeans with 5 clusters (n_init set explicitly, since its default
# changed in recent scikit-learn versions)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df['Cluster'] = kmeans.fit_predict(X)

df.head()
   CustomerID  Age  Annual Income (k$)  Spending Score (1-100)  Cluster
0           1   56                  84                      61        4
1           2   69                  86                      48        4
2           3   46                  41                      19        1
3           4   32                  23                       4        1
4           5   60                  76                      35        1

Now that our data is grouped into clusters, let's visualize them:

plt.figure(figsize=(10,6))
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)', hue='Cluster', palette='tab10')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=250, c='black', marker='X', label='Centroids')
plt.legend()
plt.title("Customer Segments via K-Means")
plt.grid(True)
plt.show()

The centroids are marked with a big black "X". If you see six X's, the extra one comes from the legend, which just labels the clusters by their colours.
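
To see what each discovered segment actually looks like (and connect it back to intuitions like "rich regulars"), you can average the features per cluster. Here is a small sketch continuing from the df above; profile is just an illustrative variable name:

# Profile each cluster: average income/spending plus cluster size
profile = df.groupby('Cluster')[['Annual Income (k$)', 'Spending Score (1-100)']].mean()
profile['Size'] = df['Cluster'].value_counts().sort_index()
print(profile.round(1))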

💡
There is a concept called the "Elbow Method". It is a heuristic used to determine the optimal number of clusters (K), most commonly for K-Means.

A typical demonstration can be seen using the code snippet below:

inertia = []
K_range = range(1, 11)

for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertia.append(km.inertia_)

plt.plot(K_range, inertia, 'o-')
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.grid(True)
plt.show()

Look for the "elbow" in the plot: the point where adding more clusters stops producing a large drop in inertia. That value of K is usually a good choice.

The last step is to test our model. We have created everything: the data, the clusters, the centroids. Now, if a new data point arrives, which cluster will it end up in?

# Example: a customer with 60k annual income and a spending score of 80.
# Wrapping the point in a DataFrame with matching column names avoids
# scikit-learn's "X does not have valid feature names" warning,
# since the model was fitted on a DataFrame.
new_customer = pd.DataFrame([[60, 80]], columns=X.columns)
predicted_cluster = kmeans.predict(new_customer)
print("Predicted Cluster:", predicted_cluster[0])

The output I get from this snippet is: Predicted Cluster: 3. This lets you tag future customers with the discovered groups!

That means a new data point with the values in the example above belongs to cluster 3. Play around with the values to check for yourself.


A short and sweet summary

Concept         What we covered
Clustering      Grouping similar data without labels
K-Means         Finds K clusters using centroids and distance
Elbow Method    Used to select the best value of K
Real Data       Segmented mall customers into meaningful groups
Prediction      You can classify new data into clusters

Okay, let’s think about a few interview questions you may encounter around this topic:

  • How does K-Means work? → Explain the basics with examples, as we did above.

  • How do you choose K? → Elbow Method

  • What’s a weakness of K-Means? → Sensitive to scale + assumes spherical clusters

  • When to use K-Means? → When clusters are compact, data is numeric, low noise
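
On the scale-sensitivity point: K-Means works on raw Euclidean distances, so a feature measured in larger units can dominate the result. A common fix is to standardize the features before clustering. Here is a minimal sketch using scikit-learn's StandardScaler in a pipeline; it is an extra illustration on top of the example above, and the ScaledCluster column name is made up:

from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale each feature to zero mean / unit variance before clustering,
# so both columns contribute equally to the distance computation
scaled_kmeans = make_pipeline(StandardScaler(), KMeans(n_clusters=5, n_init=10, random_state=42))
df['ScaledCluster'] = scaled_kmeans.fit_predict(X)
print(df['ScaledCluster'].value_counts())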


We saw how to extract patterns from data, and how that makes analysis and prediction much easier. We used a small set of values for the example in this blog, but in real life you may encounter huge amounts of data, and that is where the real power of this algorithm shows.

For now, experiment with the number of clusters, the datasets, and so on. Ciao, and happy coding!
