Customer Segmentation Using Kmeans Clustering in Python

Lucy AdhiamboLucy Adhiambo
4 min read

Table of Contents

  1. Introduction

  2. Business Scenario

  3. Exploring Dataset

  4. Data Preprocessing

  5. K-Means Segmentation

  6. K-Means visualization

  7. Conclusion

    Introduction

Customer segmentation is grouping customers into distinct categories with the same features. The process makes it easier to target a particular group of clients with tailored marketing strategies, services, and products.

Customer segmentation is important since it assists in optimizing marketing strategies that enhance sales, customer retention, satisfaction, and experience.

Business Scenario

Suppose we are working as data scientists for a supermarket and would like to assist the marketing department in introducing new products and improve sales based on customer segmentation. Therefore, we will save our money and time by marketing a particular group of clients with certain products.

How was the data collected?

The data was obtained from customers' loyalty card

We will use KMeans clustering to define the new group of customers.

Understanding data is essential.

Before working on any project, it is important to understand the business problem and the dataset.

What are the variables in the dataset

CustomerID: Uniquely identifies the customer

Gender: Biological sex of a customer. In the scenario, there are two options.

Male and Female

Age: The age of the customer in years

The minimum age is 18 years, and the maximum age is 70.

Annual income: The annual income of customers in US dollars

The highest customer income is 137 US dollars while the lowest income is 15 US dollars.

Spending Score: The spending score of customers on a scale of 1 to 100

The highest customer spending score is 99, and the lowest spending score is 1.

Importing Libraries

In this project, we will need some libraries to help us along the way

## Data Analysis
import pandas as pd
## Data Visualization
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings("ignore")

Exploring Dataset

df=pd.read_csv("C:/Users/User/Documents/DATA ANALYST/Datasets/Mall_Customers.csv")

The section entails data understanding to assist in descriptive analysis and data visualization.

df.head()

Describe method can also be applied to see the columns' descriptive statistics.

df.describe()
CustomerIDAgeAnnual Income (k$)Spending Score (1-100)
count200.000000200.000000200.000000200.000000
mean100.50000038.85000060.56000050.200000
std57.87918513.96900726.26472125.823522
min1.00000018.00000015.0000001.000000
25%50.75000028.75000041.50000034.750000
50%100.50000036.00000061.50000050.000000
75%150.25000049.00000078.00000073.000000
max200.00000070.000000137.00000099.000000

The describe method is used in numerical methods. The mean of Age and Annual Income is 38.85 and 60.56, respectively.

Correlation Analysis

An effective way of understanding the relationship between variables is through correlation analysis.

We determine the correlation between the variables using corr method.

sn.heatmap(dff.corr(),annot=True, cmap="coolwarm")

There is a weak negative correlation between age and the annual income of the customers.

The next step is segmentation, however, it is important to scale data first

Data Preprocessing

Standardization is important before applying K-Means

We want to treat every feature equally through feature transformation where the values fall within the [0:1] range.

The process is known as standardization.

from sklearn.preprocessing import StandardScaler
Scale= StandardScaler()
CustomerIDAgeAnnual Income (k$)Spending Score (1-100)Income ClusterSpending and Income ClusterGender_Male
0119153912True
1221158111True
232016612False
3423167711False
4531174012False

There are 5 clusters

KMeans Segmentation

inertia_scores =[]
for i in range(1,11):
    kmeans=KMeans(n_clusters = i)
    kmeans.fit(df[["Annual Income (k$)"]])
    inertia_scores.append(kmeans.inertia_)
inertia_scores

The elbow in the graph is the five-cluster mark. This is the only place until the graph steeply declines while smoothing out afterward.

Let’s perform K-Means clustering with 5 clusters.

KMeans Visualization

plt.figure(figsize=(10,8))
plt.scatter(x=centers["x"],y=centers["y"],s=100,c="black",marker="*")
sn.scatterplot(data=df, x="Annual Income (k$)", y="Spending Score (1-100)",hue= "Spending and Income Cluster",palette="tab10")
plt.savefig("clustering bivariate.png")

Conclusion

The target group would be cluster 1 which has a high spending score and high income.

54 percent of cluster 1 shoppers are women.

We should look for ways to attract these customers using a marketing campaign targeting popular items in this cluster.

Cluster 2 presents an interesting opportunity to market to the customers for sales event on popular items.

If you want to see the entire code in Jupyter Notebook, it can be found on my https://github.com/lucyadhiambo58/Shopping-Customer-Segmentation

Thank you for reading.

1
Subscribe to my newsletter

Read articles from Lucy Adhiambo directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Lucy Adhiambo
Lucy Adhiambo