K-Means Session 3: Hyperparameters

Hyperparameters

The hyperparameters below are from scikit-learn’s KMeans:

class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=None, algorithm='auto')

random_state

This sets a random seed. It is useful if we want to reproduce the exact same clusters over and over again. We can set it to any number we want; I set it to random_state=1234 below.
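As a quick illustration, here is a minimal sketch (using made-up random data rather than the dftmp DataFrame from the Sample Data section) of what fixing random_state buys us: the same seed reproduces the same labels on every run.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-feature data, just for this illustration
X = np.random.RandomState(0).rand(100, 2)

labels_a = KMeans(n_clusters=3, random_state=1234).fit(X).labels_
labels_b = KMeans(n_clusters=3, random_state=1234).fit(X).labels_
print((labels_a == labels_b).all())   # True: identical clusters on every run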

n_clusters

We need to provide the algorithm with the number of clusters that we want. Standard literature suggests using the elbow method to determine how many clusters we need, and it works well on scikit-learn’s clean, textbook datasets. In reality, this only gives an initial guess. In this example, we know we have 3 clusters, so let’s try with n_clusters=3:

km = KMeans(n_clusters=3, random_state=1234).fit(dftmp.loc[:, dftmp.columns != 'group'])

What is going on? We predict 3 clusters, but they are nowhere near our original 3 clusters.

We do get 3 clusters, but they are very different from our original clusters. Originally we had 2 clusters in the bottom left, but they are grouped together into one cluster, the yellow circles. This happens because K-Means randomly chooses initial cluster centroids and then assigns each point to its nearest centroid. It repeats this process until convergence, but nothing prevents it from getting stuck in a local minimum; in fact, K-Means is notoriously dependent on the centroid initialization. And here is our first clue that we cannot go about blindly using K-Means: if we are asking for 3 clusters, then we need to have some idea where we expect the cluster centers to be in all of the features.
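To see this dependence on initialization directly, here is a small sketch using make_blobs as a stand-in for my simulated data (the blob centers and sizes below are assumptions, not the original values): a single random initialization per seed can land in different local minima with different inertia.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Two blobs close together plus one far away, loosely mimicking the simulated data
X, _ = make_blobs(n_samples=300, centers=[[5, 5], [9, 9], [25, 25]],
                  cluster_std=2.0, random_state=0)

# n_init=1 so each fit reflects exactly one random initialization
for seed in (1, 2, 3, 4, 5):
    km = KMeans(n_clusters=3, init='random', n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  inertia={km.inertia_:.1f}")
# The inertia can differ from seed to seed, i.e. different runs get stuck in different local minima.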

init

This is where you can set the initial cluster centroids. In our case we have 3 clusters, so we need 3 centroid arrays, and since we have two features, each array will be of length 2; that is, we need 3 pairs of cluster centers. We know our exact cluster centers since this is a simulation, so let’s try that.

# mu1, mu2, mu3a, mu3b are the true cluster means from the Sample Data section
centroids = np.asarray([[mu1, mu1], [mu2, mu2], [mu3a, mu3b]])
km = KMeans(n_clusters=3, init=centroids, random_state=1234).fit(dftmp.loc[:, dftmp.columns != 'group'])

Ah! There we go. By initializing our cluster centers to the original values, we finally recover our original 3 clusters for most of the data points!

But wait! That is cheating! We will never know the exact centroids for our clusters. True: in practice we can only guess approximately where our cluster centers will be. But you don’t have to have the exact centers; approximate values will already help.
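For example, here is a hedged sketch of that idea, reusing dftmp and the simulation means (mu1, mu2, mu3a, mu3b) from the Sample Data section; the offsets are arbitrary and just stand in for an imprecise guess.

import numpy as np
from sklearn.cluster import KMeans

# Rough guesses: the true centers nudged by arbitrary offsets
approx_centroids = np.asarray([[mu1 + 1.0, mu1 - 1.0],
                               [mu2 - 1.0, mu2 + 1.0],
                               [mu3a + 1.0, mu3b - 1.0]])

# With an explicit init, KMeans runs only once (n_init=1 silences the warning)
km = KMeans(n_clusters=3, init=approx_centroids, n_init=1,
            random_state=1234).fit(dftmp.loc[:, dftmp.columns != 'group'])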

What else can we do to improve our clustering, especially if we only have approximate cluster centers? Next we will look at another way to recover our original clusters.

Change the number of Clusters

I know! You think, but, but, but, we had 3 clusters originally and we set the number of clusters to be 3. What more can we do? How about we set the number of clusters to be twice the number that we expect. What? I know, bear with me, please.

numclusters = 6   # twice the 3 clusters we expect
km = KMeans(n_clusters=numclusters, random_state=1234).fit(dftmp.loc[:, dftmp.columns != 'group'])

Ah! There are the two clusters separated in the bottom left, as shown in the bottom row of plots. And we didn’t have to give initial cluster centers.

Look at the bottom right, which shows the original clusters 0 and 2: they are more or less well separated, which is a good thing. But the original cluster 1 (top right) is now split into 4 clusters, even though we know from our simulation that it is all one cluster. So it is just a matter of consolidating those clusters together afterwards: the purple, brown, dark green, and light green clusters would become a single cluster (a rough sketch of this consolidation follows below).
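One way to do that consolidation (not shown in my original code) is to map the predicted labels onto merged groups. The label numbers below are illustrative; in practice you read them off the plots.

import numpy as np

# Hypothetical mapping from the 6 predicted labels to the 3 clusters we expect
merge_map = {0: 0, 1: 0, 2: 0, 3: 0,   # the four labels covering the top-right cluster
             4: 1,                     # one bottom-left cluster
             5: 2}                     # the other bottom-left cluster
consolidated = np.array([merge_map[label] for label in km.labels_])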

So if we don’t have an idea of where our cluster centers are, which might be the case with many features, then we can use this trick of overestimating the number of clusters. One might say this is overfitting, which is true for the top-right cluster, but it is the only way we can get the bottom-left clusters to separate in the absence of good centroids. Even with approximate centroids, this method enhances the cluster separation.

Normalizing the Data

You say: “But wait! I have read that you have to normalize your data for KMeans.” That is true: you do have to normalize your data so that all the features are within the same range whenever you do anything that involves Euclidean distances, which KMeans does. I was able to get away without it in this case because the x and y ranges of the original data are about the same, 0–30 (see the first graphs in the Sample Data section above). But let’s normalize our data to see if it makes a difference.

from sklearn.preprocessing import StandardScaler
import pandas as pd

scl = StandardScaler()
dftmparray = (dftmp.loc[:, dftmp.columns != 'group']).values
dfnorm = scl.fit_transform(dftmparray)   # standardize each feature to zero mean, unit variance
dfnorm = pd.DataFrame(dfnorm)
km = KMeans(n_clusters=3, random_state=1234).fit(dfnorm)

We still don’t predict separate clusters for the bottom-left points. The top right shows the separation of the 2 clusters in the original space, but the bottom right shows that these 2 clusters are not separated very well in the predictions.

We see that we are still not successful in separating the original clusters 0 and 2 (in the bottom left of the original data), even after normalizing the data.

Others

There are other hyperparameters, like tol and max_iter, that mainly affect computational time. These parameters become more important in problems more complex than this example, so I won’t demonstrate them here.

But let’s look at what they mean:

n_init = By default this is 10, so the algorithm initializes the centroids 10 times and keeps the run that converges to the lowest inertia as the best fit. Increase this value to explore more of the initialization space. Note that if we provide the centroids ourselves, the algorithm will only run once; in fact, it warns us about this at run time. So if we set initial centroids, or if we ask for more clusters than we expect (with the intention of consolidating some of them later, as discussed above), we can leave this at the default.

tol = A higher value means we are willing to tolerate a larger change in inertia (the loss) before we declare convergence; it is sort of a measure of how strict we are about converging. If the change in inertia between iterations is less than tol, the algorithm stops iterating and declares convergence even if it has completed fewer than max_iter rounds. Keep it at a low value to let each run converge more tightly.

max_iter = There are n_init runs in general, and each run iterates up to max_iter times; i.e., within a run, points are reassigned to clusters and the loss recomputed for at most max_iter iterations. Keeping max_iter high makes it more likely that each run fully converges, but often with diminishing returns.
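Putting these together, here is a sketch of what tuning the convergence-related parameters might look like on our data; the specific values are illustrative, not recommendations.

km = KMeans(n_clusters=3,
            n_init=25,        # more random restarts, better odds of escaping a local minimum
            max_iter=500,     # allow each restart more iterations before giving up
            tol=1e-6,         # only declare convergence on a very small change in inertia
            random_state=1234).fit(dftmp.loc[:, dftmp.columns != 'group'])
print(km.n_iter_)             # iterations actually used by the best run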

The other variables decide computing efficiency — so if you have a very large dataset, it would be best that you keep them at the defaults.

Conclusion

It is usually not enough to just run the elbow method, determine the number of clusters, and run standard KMeans with that. In general, we have to explore the data and get subject-matter experts’ opinions on how many clusters there should be and what their approximate centroids might be. Once we have those, we can put them all together to tune KMeans:

  1. By providing initial cluster centers

  2. By asking for more clusters than necessary so we can consolidate some of the clusters afterwards.

  3. We can also increase the contribution of some features by weighting them more heavily than the others, in an attempt to approximate the Mahalanobis distance instead of the plain Euclidean distance. For example, in the example above, if we thought xx is 10 times more important to the separation than yy, then we would multiply xx by 10 after the normalization step (a rough sketch follows below). This is not always advisable, though; there are other ways to deal with this, like PCA for example.
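A rough sketch of that weighting idea (the factor of 10 comes from the text above; treating xx as the first feature column is an assumption):

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = dftmp.loc[:, dftmp.columns != 'group'].values
X_norm = StandardScaler().fit_transform(X)
X_norm[:, 0] *= 10    # up-weight xx (assumed to be the first column) relative to yy
km = KMeans(n_clusters=3, random_state=1234).fit(X_norm)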
