K-Means Doesn’t Learn — It Just Labels Without Thinking


Imagine walking into a party...
You don’t know anyone.
There are no name tags, no signs, no seating charts — but somehow, you start noticing patterns:
That group by the buffet is talking about tech.
The ones near the speaker? All dancing.
A bunch by the window? Deep in books and quiet talks.
No one told you how to sort them.
You just did — naturally, intuitively, effortlessly.
That’s what K-Means Clustering teaches machines to do.
To see structure where no labels exist — to separate the chaos into something meaningful.
A Short Origin Story
Unsupervised learning has been around as long as humans have tried to recognize patterns without labels. But in 1957, mathematician Stuart Lloyd introduced the algorithm that would evolve into K-Means — during a study on signal quantization at Bell Labs.
It wasn't about big data back then. It was about compressing information efficiently.
Fast forward, and K-Means now powers customer segmentation, market basket analysis, gene expression grouping, and even image compression.
From audio signals to Amazon recommendations — K-Means quietly powers the structure behind the scenes.
Why You Should Care
Before we dive into math or steps, ask yourself this:
How does Spotify group users with similar music tastes?
How does Netflix suggest shows to just the right kind of viewer?
How do marketers know there are 5 core customer types, not 50?
How does Google Photos recognize your cousin's face… even if you never tagged her?
None of this is done manually. There is no army of humans labeling each user, face, or customer.
So, how does it work?
Behind the scenes, machines are looking for patterns. They’re identifying groups based on similarity — without any labels at all.
And one of the most elegant ways they do this is through K-Means Clustering. Of course, there are other clustering methods — but K-Means is a great place to start.
What is K-Means?
K-Means is a clustering algorithm that splits data into K groups based on how similar the data points are to each other.
Step-by-Step: How K-Means Actually Works
Let’s walk through the algorithm — imagine we’re trying to group students based on their math and science scores.
- Choose K (Number of Clusters)
You decide how many groups you want.
Let’s say:
“I want to split my students into 3 performance groups.”
So, K = 3.
- Randomly Place K Centroids
A centroid is the center of a cluster — like the leader or the “gravity point” of a group. At first, these are randomly scattered points in space.
- Assign Each Point to the Nearest Centroid
Now each student (dot) looks at all the centroids and says:
“Who’s closest to me?” and joins that group.
Distance is usually calculated using Euclidean distance (the straight-line distance between two points).
- Update the Centroids
After assignment, the algorithm says:
“Okay, now each group has members. Let's move the centroid to the average location of all the people in that group.”
This average is called the mean — hence the name: K-MEANS.
- Repeat Until Nothing Changes
The process of assigning points and updating centroids repeats over and over until:
The centroids stop moving, or
A maximum number of loops is reached.
At that point, you have your final clusters!
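The "who's closest to me?" comparison above uses Euclidean distance. A minimal sketch with numpy — the (math, science) scores here are made-up numbers, not real data:

```python
import numpy as np

# Two students' (math, science) scores — hypothetical values
a = np.array([70.0, 80.0])
b = np.array([74.0, 77.0])

# Euclidean distance: sqrt((70-74)^2 + (80-77)^2) = sqrt(16 + 9)
dist = np.linalg.norm(a - b)
print(dist)  # → 5.0
```

The same `np.linalg.norm` call is what the full implementation later in this post uses for its distance computations.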
Let’s See the Math
Here’s a simple overview of the math behind each step.
1. Initialization
Choose K centroids:
These are initially random points in the same space as your data.
2. Assignment Step
Each data point is assigned to the cluster whose centroid is closest:
$$\text{Cluster}(x_i) = \underset{j}{\arg\min} \; ||x_i - c_j||^2$$
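The arg-min above can be computed for all points at once with numpy broadcasting. A small sketch using made-up points and centroids:

```python
import numpy as np

# Hypothetical example: 4 two-dimensional points and 2 centroids
x = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
c = np.array([[0.0, 0.0], [5.0, 5.0]])

# Squared distance from every point to every centroid: shape (4, 2)
sq_dist = ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)

# arg min over centroids gives each point's cluster label
labels = sq_dist.argmin(axis=1)
print(labels)  # → [0 0 1 1]
```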
3. Update Step
For each cluster, update the centroid by computing the mean of all data points assigned to it:
$$c_j = \frac{1}{N_j} \sum_{x_i \in S_j} x_i$$
Where:
c_j is the updated centroid of cluster j
S_j is the set of points in cluster j
N_j is the number of points in cluster j
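In code, this update is just a per-cluster mean. A sketch continuing the same hypothetical four-point example, with labels assumed to come from the assignment step:

```python
import numpy as np

# Hypothetical points and their cluster labels from the assignment step
x = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])

# c_j = (1 / N_j) * sum of all points in cluster j, for each of the 2 clusters
centroids = np.array([x[labels == j].mean(axis=0) for j in range(2)])
print(centroids)  # → [[0.1 0.05], [5.05 4.95]]
```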
4. Repeat
Repeat the assignment and update steps until the centroids no longer move significantly, or after a fixed number of iterations.
K-Means Clustering from Scratch in Python
We will implement K-Means Clustering from scratch using pure Python and numpy. We'll visualize the results using matplotlib. This is a great beginner-friendly intro to unsupervised learning.
# Step 1: Import Libraries and Generate Data
import numpy as np
import matplotlib.pyplot as plt

# Generate 2D sample data
np.random.seed(42)
data = np.random.randn(300, 2)  # 300 points, 2 features (2D)

def kmeans(data, k, max_iterations=100):
    # Step 2: Randomly select k data points as initial centroids
    indices = np.random.choice(len(data), k, replace=False)
    centroids = data[indices].copy()
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        # Step 3: Assign each data point to the nearest centroid
        for point in data:
            distances = [np.linalg.norm(point - centroid) for centroid in centroids]
            closest = distances.index(min(distances))
            clusters[closest].append(point)
        prev_centroids = centroids.copy()
        # Step 4: Recalculate each centroid as the mean of its cluster
        for i in range(k):
            if clusters[i]:  # skip empty clusters
                centroids[i] = np.mean(clusters[i], axis=0)
        # Step 5: Stop early once the centroids no longer move
        if np.allclose(prev_centroids, centroids):
            break
    return centroids, clusters

k = 3  # Number of clusters
final_centroids, final_clusters = kmeans(data, k)

# Plot the results
colors = ['red', 'green', 'blue']
for i, cluster in enumerate(final_clusters):
    cluster = np.array(cluster)
    if len(cluster) == 0:
        continue  # nothing to plot for an empty cluster
    plt.scatter(cluster[:, 0], cluster[:, 1], c=colors[i], label=f'Cluster {i+1}')

# Plot the final centroids
final_centroids = np.array(final_centroids)
plt.scatter(final_centroids[:, 0], final_centroids[:, 1],
            c='black', marker='x', s=100, label='Centroids')
plt.title('K-Means Clustering from Scratch')
plt.legend()
plt.show()
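Once you have final centroids, a natural next step is assigning a brand-new point to its nearest cluster. A minimal sketch — the centroids below are hypothetical stand-ins for the output of a k = 3 run, not actual results from the code above:

```python
import numpy as np

# Hypothetical final centroids from a k = 3 run
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])

def nearest_cluster(point, centroids):
    """Return the index of the centroid closest to `point`."""
    distances = np.linalg.norm(centroids - point, axis=1)
    return int(distances.argmin())

print(nearest_cluster(np.array([4.8, 5.2]), centroids))  # → 1
```

This is exactly the assignment step from training, reused at prediction time — which is also how library implementations such as scikit-learn's `KMeans.predict` behave.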
Written by
Precious Robert
👋 Hi everyone, I'm Ognev Robert Precious. I've been on a journey to learn data science since 2021. It wasn’t until last year that things finally started to click. Since then, I’ve been working through projects, mostly using Kaggle datasets, and building up my understanding through practice. This blog is where I share everything I’m learning — from hands-on tutorials to projects I’ve solved, and how I approached them. I’m also a neuroscientist, and I love making learning feel approachable and engaging. Sometimes, when I find a topic hard to understand on the internet, I rewrite it in a way that makes sense to me. That’s what you’ll find here: a learning space built around curiosity, clarity, and personal growth. Let’s keep learning together!