K-Means Session 2

Silhouette Score

Cohesion

Definition:
Cohesion measures how closely related or similar the data points within the same cluster are.

Goal:
We want points within a cluster to be as close together as possible — this means the cluster is compact and well-formed.

Ideal Scenario:

  • All points in a cluster lie very close to the centroid

  • The cluster is tight and dense

Measurement:

  • Usually measured as the average squared intra-cluster distance (squared distance between each point and its cluster centroid)

  • A lower cohesion value (smaller distance) = better, because the cluster is tighter

$$\text{Cohesion} = \frac{1}{n} \sum_{i=1}^{n} \| x_i - \mu_c \|^2$$

Where:

  • $x_i$ is a point in cluster $c$

  • $\mu_c$ is the centroid of that cluster

  • $n$ is the number of points in cluster $c$
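
As a minimal sketch of the cohesion formula, the snippet below computes the average squared distance to the centroid for two toy clusters. The helper names (`centroid`, `cohesion`) and the sample points are made up for illustration and are not part of any library.

```python
def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def cohesion(points):
    """Average squared Euclidean distance from each point to the centroid."""
    mu = centroid(points)
    total = sum(sum((x - m) ** 2 for x, m in zip(p, mu)) for p in points)
    return total / len(points)


# Two illustrative clusters: one compact, one spread out.
tight = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
loose = [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0), (4.0, 4.0)]

print(cohesion(tight))  # 0.5
print(cohesion(loose))  # 8.0
```

The compact cluster gets the smaller value, which matches the "lower = better" reading of cohesion.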

Separation

Definition:
Separation measures how distinct or well-separated different clusters are from each other.

Goal:
Clusters should be far away from one another — this ensures that they don’t overlap or mix.

Ideal Scenario:

  • The centroids of different clusters are far apart

  • No data points from one cluster are close to another cluster

Measurement:

  • Usually measured as the distance between cluster centroids; the formula below takes the smallest distance over all pairs of centroids

  • Higher separation (larger distance) = better

$$\text{Separation} = \min_{i \neq j} \| \mu_i - \mu_j \|$$

Where:

  • $\mu_i$ and $\mu_j$ are the centroids of two different clusters
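
The separation formula can be sketched in a few lines. This is only an illustration: the function name `separation` and the three centroids are assumptions chosen to keep the arithmetic easy to check.

```python
import math
from itertools import combinations


def separation(centroids):
    """Smallest Euclidean distance between any two distinct centroids."""
    return min(math.dist(a, b) for a, b in combinations(centroids, 2))


# Three illustrative centroids; the closest pair is (0, 0) and (3, 4).
centroids = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]

print(separation(centroids))  # 5.0
```

Because the minimum is taken over all pairs, one pair of nearby centroids is enough to pull the separation value down.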

Ideal Clustering:

  • High separation

  • Low cohesion value (small intra-cluster distances)

This means that clusters are internally tight and externally far apart, which is exactly what good clustering aims to achieve.

How Silhouette Score Relates to Cohesion and Separation


First, Recall:

  • Cohesion → how close a point is to its own cluster
    (We want this distance to be low)

  • Separation → how far a point is from other clusters
    (We want this distance to be high)

The Silhouette Score combines both into a single number that tells us:

“Is this point closer to its own cluster than to other clusters?”

$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$

Where:

$$a(i) = \text{average distance from point } i \text{ to all other points in the same cluster (cohesion)} $$

$$b(i) = \text{average distance from point } i \text{ to all points in the nearest other cluster (separation)}$$

🧠 Interpretation of Silhouette Score

For each point $i$, the Silhouette Score $s(i) \in [-1, 1]$ gives us:

  • $s(i) \approx 1$ →
    Point is well-clustered (close to its own cluster, far from others)

  • $s(i) \approx 0$ →
    Point is on the border between two clusters (uncertain)

  • $s(i) < 0$ →
    Point is probably misclassified (closer to a different cluster)
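
To make the formula and its interpretation concrete, here is a small worked sketch that computes $s(i)$ by hand on four one-dimensional points. The helper names (`mean_dist`, `silhouette`) and the data are illustrative assumptions. Returning 0 for a point that is alone in its cluster is a common convention, treated here as an assumption rather than something the formula itself states.

```python
import math


def mean_dist(p, others):
    """Average Euclidean distance from p to every point in others."""
    return sum(math.dist(p, q) for q in others) / len(others)


def silhouette(i, points, labels):
    """s(i) = (b - a) / max(a, b) for the point at index i."""
    own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
    if not own:
        return 0.0  # assumption: a one-point cluster gets s(i) = 0
    a = mean_dist(points[i], own)  # cohesion term
    b = min(                       # separation term: nearest other cluster
        mean_dist(points[i], [q for q, l in zip(points, labels) if l == c])
        for c in set(labels) if c != labels[i]
    )
    return (b - a) / max(a, b)


points = [(0.0,), (2.0,), (9.0,), (11.0,)]

# Sensible labelling: a = 2, b = 10, so s is close to 1.
good = [0, 0, 1, 1]
print(silhouette(0, points, good))  # 0.8

# The point at 9 is forced into the left cluster: a = 8, b = 2, so s < 0.
bad = [0, 0, 0, 1]
print(silhouette(2, points, bad))   # -0.75
```

The second case shows why a negative score signals a likely wrong assignment: the point is, on average, closer to the other cluster than to its own.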

Does Silhouette Score Help Choose K?

Yes — it is often used to choose the optimal number of clusters (K).

How?

  1. Run K-Means (or another clustering algorithm) for different values of $K$, starting from $K = 2$ because the score needs at least two clusters

  2. For each $K$, compute the average silhouette score across all points

  3. Plot K vs Silhouette Score

  4. Choose the K with the highest average silhouette score

That K gives the best balance between cohesion and separation among the values tried.
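
The steps above can be sketched end to end with the standard library alone. This is not a production implementation: the K-Means here is a bare-bones Lloyd's algorithm with a deterministic "farthest point" seeding chosen only to make the example reproducible, and all function names and the toy data are assumptions. In practice a library such as scikit-learn (`KMeans` together with `silhouette_score`) would normally do this work.

```python
import math


def farthest_point_init(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point farthest from its nearest already-chosen centroid."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iters=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Average of s(i) = (b - a) / max(a, b) over all points."""
    def mean_dist(p, others):
        return sum(math.dist(p, q) for q in others) / len(others)

    scores = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:  # assumption: a one-point cluster scores 0
            scores.append(0.0)
            continue
        a = mean_dist(p, own)
        b = min(
            mean_dist(p, [q for q, l in zip(points, labels) if l == c])
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data: three visually obvious groups of four points each.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 0.0), (10.0, 1.0), (11.0, 0.0), (11.0, 1.0),
    (5.0, 9.0), (5.0, 10.0), (6.0, 9.0), (6.0, 10.0),
]

# Steps 1-2: cluster for several K and record the average silhouette score.
scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 6)}

# Step 4: pick the K with the highest average score.
best_k = max(scores, key=scores.get)
print(best_k)  # 3
```

Plotting `scores` against K (step 3) would show the peak at 3 for this data; with real data the peak is often less pronounced, so it is worth looking at the whole curve rather than only the maximum.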

Comparison: Elbow vs Silhouette for Choosing K

| Feature | Elbow Method | Silhouette Score |
| --- | --- | --- |
| What it measures | Total inertia (within-cluster error) | Balance of cohesion and separation |
| Goal | Find the “elbow” where improvement flattens | Find the K with the highest average score |
| Range of values | No fixed range (depends on data) | Between -1 and 1 |
| Visual clue | Sharp “elbow” in plot | Highest peak in line plot |
| Best when... | You want a simple view of inertia drop | You want a quality-based K selection |
| Limitation | Elbow not always obvious | Score may plateau for close K values |
Written by

priyanshu tiwari