Mall Customer Segmentation Project


In this article, I explored a small project that introduced the basic concept of unsupervised learning, specifically focusing on K-Means Clustering. The main idea behind clustering is to divide a dataset into distinct groups or clusters, where each data point is assigned to the group it most closely belongs to based on similarity.
For this project, I used the Mall Customer Segmentation dataset, which contains seven features. However, I focused on just two features for simplicity and visualization purposes: "Annual Income" and "Spending Score".
The key takeaway from this project was understanding how K-Means identifies natural groupings in data without labelled outcomes. It was a valuable introduction to one of the core techniques in unsupervised machine learning.
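To make the idea of assigning each point to the group it most closely belongs to more concrete, here is a minimal NumPy sketch of the two alternating steps behind K-Means. It is only an illustration, not the scikit-learn implementation used in the project; the function name `kmeans_sketch` and the toy points are made up for this example.

```python
import numpy as np

def kmeans_sketch(points, k, n_iters=20, seed=0):
    """Toy K-Means: alternate between assigning points and moving centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random as initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        # A centroid that has no points assigned is left where it is.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the assignments are stable
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups of (income, score) pairs.
toy_points = np.array([[15.0, 80.0], [16.0, 82.0], [17.0, 79.0],
                       [90.0, 20.0], [92.0, 18.0], [91.0, 22.0]])
toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
print(toy_labels)
```

No labels are passed in anywhere: the groups emerge purely from the distances between points, which is what makes the method unsupervised.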
I also worked with a Python library that was new to me, Seaborn, which is built on top of Matplotlib and makes data visualization more intuitive. Seaborn provides high-level functions that let you create polished, readable plots with less code.
As a comparison, I plotted the same scatter plot using both Matplotlib and Seaborn to observe the visual differences. Seaborn made it easier to style the charts and highlight patterns within the data. I included the results at the end of the article, in the Evaluation section.
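A rough sketch of such a side-by-side comparison might look like the following. The column names and the handful of sample rows are assumptions for illustration; the real CSV may name its columns differently.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample rows; the real dataset has many more customers.
df = pd.DataFrame({
    "Annual Income": [15, 16, 40, 60, 62, 88, 90],
    "Spending Score": [39, 81, 55, 50, 48, 17, 86],
})

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(10, 4))

# Plain Matplotlib: pass the columns explicitly and label the axes by hand.
ax_mpl.scatter(df["Annual Income"], df["Spending Score"])
ax_mpl.set_xlabel("Annual Income")
ax_mpl.set_ylabel("Spending Score")
ax_mpl.set_title("Matplotlib")

# Seaborn: pass the DataFrame and column names; axis labels come from the columns.
sns.scatterplot(data=df, x="Annual Income", y="Spending Score", ax=ax_sns)
ax_sns.set_title("Seaborn")

fig.tight_layout()
```

In a notebook you would drop the `matplotlib.use("Agg")` line and let the figure render inline, or call `fig.savefig(...)` to write it to a file.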
After analyzing the data, I selected "Annual Income" and "Spending Score" as the features for clustering.
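Selecting the two features could be sketched as below. The DataFrame is a small hypothetical stand-in for the dataset (in practice it would be loaded with `pd.read_csv`), and the column names are assumed; the variable name `m` matches the feature matrix used in the Elbow Method code later on.

```python
import pandas as pd

# Hypothetical stand-in for the dataset; column names are assumptions.
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Age": [19, 21, 20, 23],
    "Annual Income": [15, 15, 16, 16],
    "Spending Score": [39, 81, 6, 77],
})

# Keep only the two columns used for clustering, as a 2-D NumPy array
# with one row per customer and one column per feature.
m = customers[["Annual Income", "Spending Score"]].to_numpy()
print(m.shape)  # (4, 2)
```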
Next, I needed to determine the optimal number of clusters (K) for the K-Means Clustering algorithm. In K-Means, K represents the number of groups or clusters we want to divide the data into.
To find the best value for K, I used the Elbow Method. This technique runs K-Means for a range of K values and plots the Within-Cluster Sum of Squares (WCSS), the total squared distance from each point to the centroid of its cluster, against K. WCSS generally keeps decreasing as K grows, so the goal is not to minimize it but to identify the "elbow point" in the graph: the value of K after which adding more clusters yields only minimal improvement. This gives an estimate of a suitable number of clusters by balancing tightness of fit against simplicity.
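from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# m holds the two selected features: Annual Income and Spending Score
wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters=i)
    km.fit(m)
    wcss.append(km.inertia_)  # inertia_ is the WCSS for this value of K
plt.plot(range(1, 11), wcss)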
Here is the Elbow Point graph that I plotted to determine the optimal number of clusters for K-Means.
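To see what the WCSS value behind that graph measures, the sketch below recomputes it by hand and compares it with scikit-learn's `inertia_` attribute. The three synthetic blobs are made-up toy data standing in for the two selected features, and `n_clusters=3` is chosen only because the toy data has three blobs.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data standing in for the two selected features.
rng = np.random.default_rng(42)
m = np.vstack([
    rng.normal(loc=(20, 80), scale=3, size=(30, 2)),
    rng.normal(loc=(60, 50), scale=3, size=(30, 2)),
    rng.normal(loc=(90, 15), scale=3, size=(30, 2)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(m)

# WCSS by hand: squared distance from each point to its own cluster's centroid,
# summed over all points.
own_centroids = km.cluster_centers_[km.labels_]
manual_wcss = np.sum((m - own_centroids) ** 2)

# The two numbers should agree up to floating-point rounding.
print(manual_wcss, km.inertia_)
```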
After determining the optimal value of K, I applied K-Means clustering to the dataset, which assigned each customer to one of K distinct groups. Below is the scatter plot visualizing the clustered groups.
This is the plot generated using Matplotlib. The purpose of including both Matplotlib and Seaborn visualizations is to highlight how each library displays the same data, showcasing the visual appeal and style differences between the two.
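A plot of this kind could be produced roughly as follows. This is a sketch on synthetic data, not the exact code from the repository: the blobs are invented, and `k = 3` is an assumed value that suits this toy data, so substitute the K read off your own elbow plot.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data standing in for Annual Income and Spending Score.
rng = np.random.default_rng(0)
m = np.vstack([
    rng.normal(loc=(20, 80), scale=3, size=(30, 2)),
    rng.normal(loc=(60, 50), scale=3, size=(30, 2)),
    rng.normal(loc=(90, 15), scale=3, size=(30, 2)),
])

k = 3  # assumed for this toy data; use the K suggested by your elbow plot
km = KMeans(n_clusters=k, n_init=10, random_state=0)
labels = km.fit_predict(m)  # one cluster index (0..k-1) per row of m

fig, ax = plt.subplots()
# Colour each point by the cluster it was assigned to.
ax.scatter(m[:, 0], m[:, 1], c=labels, cmap="viridis", s=30)
# Mark the cluster centres so each group's centroid is visible.
ax.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
           c="red", marker="X", s=120, label="Centroids")
ax.set_xlabel("Annual Income")
ax.set_ylabel("Spending Score")
ax.legend()
```

Fixing `random_state` keeps the cluster assignments reproducible between runs, which makes it easier to compare plots.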
GitHub Repository: https://github.com/Mehar-Zaman-Ali/mall-customer-segmentation
Written by Zaman Ali
