K-Means

Clustering

Grouping data into clusters of similar points based on their features, without knowing the labels in advance.

  • Example: 100 customer records (age, income, location)
  • You want to group similar customers together → That’s clustering

    Applications of clustering: image segmentation, customer segmentation, semi-supervised learning, object segmentation & detection

K-Means Geometric Intuition

Step 1: Choose the number of clusters (K)

You decide how many clusters (K) you want to find.
(If you're not sure, later you can use the Elbow Method to choose K.)

Step 2: Initialize K random centroids

Pick K random points from the dataset as the starting centroids (cluster centers).

Step 3: Assign each data point to the nearest centroid

For each data point:

  • Calculate the Euclidean distance to all K centroids

  • Assign the point to the closest one
    → This forms K initial clusters.

Step 4: Recalculate the centroids

For each cluster:

  • Compute the mean of all points in that cluster

  • Move the centroid to this new mean location

Step 5: Repeat Steps 3–4 until convergence

Repeat the process of assigning points + updating centroids until:

  • Centroids don’t move much anymore

  • Or a maximum number of iterations is reached

Step 6: Final clusters and centroids are ready

At the end, you have:

  • K clusters of data points

  • Final centroid positions representing the center of each cluster
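
To make these steps concrete, here is a minimal from-scratch sketch in NumPy (the function and variable names are my own, purely for illustration):

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-4, seed=42):
    rng = np.random.default_rng(seed)
    # Step 2: pick K random points from the dataset as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 4: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Step 5: stop once the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids

    return labels, centroids
```

Calling `kmeans(X, k=3)` on a 2D array `X` returns each point's cluster label and the final centroids. In practice you'd reach for `sklearn.cluster.KMeans`, which adds smarter initialization and multiple restarts.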

How do we find the optimal number of clusters (K)?

Elbow Method

The Elbow Method helps you find the optimal number of clusters (K) by plotting how much inertia (or "within-cluster error") decreases as K increases.

What is inertia here?

Inertia is the sum of squared distances between each point and its assigned cluster center.
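
In symbols, writing $x_i$ for a data point and $\mu_{c(i)}$ for the centroid of the cluster it is assigned to:

$$\text{Inertia} = \sum_{i=1}^{n} \lVert x_i - \mu_{c(i)} \rVert^2$$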

Lower inertia = better fit (but also more clusters = overfitting risk).

Steps to Perform the Elbow Method

  1. Try K-Means with different values of K (e.g. from 1 to 10)

  2. For each K, compute the inertia (from .inertia_ attribute in scikit-learn)

  3. Plot K vs. Inertia

  4. Look for the elbow point — the K after which inertia drops slowly

How to Interpret the Plot:

  • X-axis: Number of clusters (K)

  • Y-axis: Inertia

  • Find the "elbow point" where the curve starts flattening

  • That value of K is the best balance between performance and simplicity
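
A minimal sketch of the elbow plot with scikit-learn and matplotlib (the toy blobs here just stand in for your own data):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data for illustration: 300 points around 4 true centers
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

ks = range(1, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to closest centroid

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()
```

With 4 true centers, you should see the curve drop steeply up to K = 4 and flatten afterwards.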

Limitations of the Elbow Method


1. No Clear Elbow

Sometimes the plot doesn’t form a sharp “elbow” — it looks like a smooth curve.
In that case, it becomes subjective to decide where the elbow actually is.

Problem: You might end up guessing or choosing K arbitrarily.


2. Only works with K-Means (or similar algorithms)

The Elbow Method is based on inertia (within-cluster squared distances), which only applies to centroid-based algorithms like K-Means.
It won't work for other clustering algorithms like DBSCAN or Agglomerative Clustering.


3. Assumes spherical (convex) clusters

K-Means and the Elbow Method assume clusters are:

  • Spherical

  • Separated clearly

If clusters have irregular shapes (e.g., moons or spirals), Elbow won’t help — and K-Means will likely fail too.


4. Sensitive to scaling

The inertia values (and hence the elbow shape) can be heavily affected by feature scale.
So if you don’t standardize your data, the plot can mislead you.


5. Inertia always decreases with K

Since inertia keeps decreasing with more clusters (K), there's always a temptation to overfit by picking a larger K, even if the "elbow" is not meaningful.

Assumptions of K-Means Clustering


1. Clusters are spherical (or convex)

K-Means assumes that clusters are shaped like spheres (in 2D: circles, in 3D: balls).

Why?
It uses Euclidean distance — which works best when the data is symmetrically distributed around the center.


2. Clusters are of similar size and density

K-Means works best when all clusters have roughly the same spread and number of points.

Why?
Otherwise, it may:

  • Split a big cluster into two

  • Merge two small clusters into one


3. The number of clusters (K) is known in advance

You must provide K, the number of clusters, before running the algorithm.

Why?
It doesn’t try to figure it out on its own — if you pick the wrong K, the results may be poor.


4. Data is continuous and numerical

K-Means assumes your features are numeric and distances between them are meaningful.

Why?
It uses distance-based calculations (like Euclidean).
It doesn't work well with categorical data (e.g., colors or cities).


5. Features are scaled properly

Features should be standardized (e.g., with z-score) so that no feature dominates due to larger units.

Why?
If "Age" goes from 1–100 and "CGPA" is 0–10, age will dominate the distance calculation unless scaled.
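
For example, with scikit-learn's StandardScaler (a minimal sketch; the numbers are made up to mirror the Age/CGPA example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: [Age, CGPA], deliberately on very different scales
X = np.array([[22, 8.1],
              [45, 6.5],
              [31, 9.2]])

# z-score scaling gives every column mean 0 and std 1,
# so Age no longer dominates the Euclidean distance
X_scaled = StandardScaler().fit_transform(X)
```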

Limitations of K-Means Clustering


1. You must choose K in advance

K-Means requires you to manually specify the number of clusters (K) before starting.

Problem:

  • If you choose the wrong K, you might get poor or misleading clusters.

  • Requires techniques like the Elbow Method to guess K.

2. Assumes spherical, equally-sized clusters

It works best when clusters are circular and similar in size and density.

Problem:

  • It performs poorly on data with:

    • Irregular shapes (like spirals, moons)

    • Uneven cluster sizes

    • Varying densities


3. Sensitive to outliers

K-Means uses means to find cluster centers — which can be skewed by outliers.

Problem:
A few extreme points can pull centroids away from the real center of the cluster.
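
A tiny illustration of how a single extreme value drags the mean (and hence the centroid):

```python
import numpy as np

cluster = np.array([1.0, 2.0, 3.0])        # a tight one-dimensional cluster
with_outlier = np.append(cluster, 100.0)   # add one extreme point

print(cluster.mean())       # 2.0  (the true center)
print(with_outlier.mean())  # 26.5 (centroid dragged far away from the cluster)
```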


4. Can converge to local minima

K-Means uses random initialization of centroids.

Problem:

  • Different runs can give different results.

  • Use k-means++ initialization or multiple runs to improve reliability.
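
In scikit-learn both fixes are one line (a sketch; note that `init="k-means++"` is already the default):

```python
from sklearn.cluster import KMeans

# k-means++ spreads the starting centroids out; n_init=10 runs the
# algorithm 10 times and keeps the result with the lowest inertia
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
```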


5. Only works with numeric, continuous data

K-Means relies on distance calculations (usually Euclidean).

Problem:

  • It doesn't work with categorical features (like city names or colors).

  • You’d need to use a different method or transform data.
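
One common workaround is to one-hot encode categorical columns before clustering (a sketch with pandas; for mostly-categorical data, a dedicated algorithm such as k-modes is usually a better fit):

```python
import pandas as pd

df = pd.DataFrame({"income": [40, 85, 60],
                   "city": ["Delhi", "Mumbai", "Delhi"]})

# One-hot encode the categorical column so every feature is numeric
df_encoded = pd.get_dummies(df, columns=["city"])
```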


6. Not robust to non-globular clusters

K-Means won’t find clusters that are not convex or connected.

Example:
If you have two moons or spiral shapes, K-Means will fail — even if the data is perfectly clusterable.
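
You can see this failure for yourself with scikit-learn's make_moons (a minimal sketch):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaved half-moons: clearly two clusters, but not convex
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
# K-Means draws a straight boundary that cuts across both moons;
# a density-based method like DBSCAN recovers them instead
```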

Code example
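
A minimal end-to-end sketch with scikit-learn, tying the steps above together (synthetic blobs stand in for real data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# 1. Synthetic data: 300 points around 4 centers
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# 2. Scale features so no single one dominates the distance
X_scaled = StandardScaler().fit_transform(X)

# 3. Fit K-Means (K chosen via the Elbow Method above)
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(X_scaled)

# 4. Inspect the results
print("Cluster sizes:", np.bincount(labels))
print("Centroids:\n", km.cluster_centers_)
print("Inertia:", km.inertia_)
```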
