Introduction

Hey everyone! 👋
I recently worked on an unsupervised machine learning project that focuses on clustering customers based on their purchase behavior using the Online Retail dataset.

The goal was to segment customers so businesses can better understand and target them through personalized marketing, improved retention strategies, and optimized customer experiences.

This post covers the process I followed, key concepts used, and the valuable insights I gained.

→ About the Dataset

I used the Online Retail dataset, which contains transactions from a UK-based online store between December 2010 and December 2011.

Total Rows: ~540,000
Features include:
- InvoiceNo, StockCode, Description
- Quantity, InvoiceDate, UnitPrice
- CustomerID, Country

This dataset is widely used in customer segmentation and retail analytics problems.

🔧 Data Preprocessing

Before clustering, I performed several preprocessing steps:

- Removed missing values
- Dropped canceled orders (invoice numbers starting with ‘C’)
- Created an RFM (Recency, Frequency, Monetary) dataset based on customer purchase behavior
- Standardized the RFM features using StandardScaler so that no single feature dominates the distance calculations in K-Means
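
Here's a minimal sketch of what the cleaning steps can look like in pandas. It is not the exact project code: the sample rows are made up, I'm assuming the missing-value rule means dropping rows without a CustomerID, and `clean_transactions` and the `TotalPrice` column are illustrative names.

```python
import pandas as pd

# Tiny made-up sample that reuses the dataset's column names (values are illustrative).
raw = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "C536379", "536380", "536381"],
    "StockCode": ["85123A", "22633", "D", "22961", "21730"],
    "Quantity": [6, 6, -1, 24, 2],
    "InvoiceDate": pd.to_datetime([
        "2010-12-01 08:26", "2010-12-01 08:28", "2010-12-01 09:41",
        "2010-12-01 09:45", "2010-12-02 10:03",
    ]),
    "UnitPrice": [2.55, 1.85, 27.50, 1.45, 4.25],
    "CustomerID": [17850.0, 17850.0, 14527.0, None, 13047.0],
})


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the cleaning rules described above."""
    # Assumption: "missing values" means rows without a customer identifier.
    out = df.dropna(subset=["CustomerID"]).copy()
    # Drop canceled orders: invoice numbers that start with "C".
    out = out[~out["InvoiceNo"].astype(str).str.startswith("C")]
    # Line total per row, used later for the Monetary feature.
    out["TotalPrice"] = out["Quantity"] * out["UnitPrice"]
    return out.reset_index(drop=True)


clean = clean_transactions(raw)
print(clean[["InvoiceNo", "CustomerID", "TotalPrice"]])
```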

📊 Feature Engineering: RFM Model

I used the RFM (Recency, Frequency, Monetary) model to extract customer behavior patterns:

Recency – How recently did the customer purchase?
Frequency – How often did they purchase?
Monetary – How much did they spend?

Together, these three features summarize each customer's purchase behavior in a compact form that works well for clustering.
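
To make the three definitions concrete, here's a small sketch of an RFM table built with a pandas `groupby`. The transactions are made up, the `TotalPrice` column (quantity times unit price) is an assumed name, and measuring recency from one day after the last transaction is a common convention rather than something the dataset dictates.

```python
import pandas as pd

# Made-up transactions; TotalPrice is assumed to be Quantity * UnitPrice per row.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 3, 3],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "C1", "C2"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-01-05", "2011-03-01",
        "2011-02-10", "2011-01-20", "2011-03-09",
    ]),
    "TotalPrice": [10.0, 5.0, 20.0, 7.5, 100.0, 50.0],
})

# Reference date: one day after the most recent transaction (assumed convention).
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = tx.groupby("CustomerID").agg(
    # Days since the customer's most recent purchase.
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    # Number of distinct invoices, so multi-line invoices count once.
    Frequency=("InvoiceNo", "nunique"),
    # Total amount spent.
    Monetary=("TotalPrice", "sum"),
)
print(rfm)
```

Counting distinct invoices rather than rows keeps a single large order from looking like many purchases.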

→ Clustering with K-Means

I applied K-Means Clustering to group customers into different segments:

- Used the Elbow Method to find the optimal number of clusters
- Evaluated clustering quality using the Silhouette Score
- Used PCA (Principal Component Analysis) to reduce dimensionality for 2D visualization
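
The Elbow Method and Silhouette Score steps above can be sketched with scikit-learn as follows. This runs on synthetic data standing in for an RFM table (three loose groups that I generated myself), so the numbers say nothing about the real dataset, and the range of k values tried is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for (Recency, Frequency, Monetary): three loose groups.
centers = np.array([[10, 20, 5000], [200, 2, 300], [60, 4, 3000]], dtype=float)
spread = np.array([5, 2, 300], dtype=float)
rfm_values = np.vstack([c + rng.normal(0, spread, size=(50, 3)) for c in centers])

# Put all three features on the same scale before computing distances.
X = StandardScaler().fit_transform(rfm_values)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_                         # within-cluster sum of squares (elbow curve)
    silhouettes[k] = silhouette_score(X, model.labels_)  # ranges from -1 to 1, higher is better

# Pick the k with the highest silhouette score as a cross-check on the elbow.
best_k = max(silhouettes, key=silhouettes.get)

for k in inertias:
    print(f"k={k}  inertia={inertias[k]:.1f}  silhouette={silhouettes[k]:.3f}")
print("best k by silhouette:", best_k)
```

Inertia always shrinks as k grows, so the elbow is read from where the drop flattens out; the silhouette score gives a second opinion that does not reward adding clusters for its own sake.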

📈 Visualization

I projected the standardized RFM features onto two principal components with PCA and plotted the result to see how customers are grouped.

Different colors represent different segments, and the spread helps identify high-value or infrequent customers.
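
As a rough sketch of the projection step, here's how the 2D coordinates can be computed with scikit-learn. The input is randomly generated data standing in for an RFM table, so only the mechanics carry over, not the values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Random, skewed stand-in for (Recency, Frequency, Monetary) values.
rfm_values = rng.gamma(2.0, [30.0, 3.0, 400.0], size=(120, 3))

# PCA is scale-sensitive, so standardize first.
X = StandardScaler().fit_transform(rfm_values)

pca = PCA(n_components=2)
coords = pca.fit_transform(X)  # one (x, y) point per customer

# Share of the total variance that survives the projection to 2D.
explained = pca.explained_variance_ratio_.sum()
print(coords[:5])
print(f"variance kept in 2D: {explained:.1%}")
```

The two columns of `coords` can then go straight into a scatter plot, for example matplotlib's `plt.scatter(coords[:, 0], coords[:, 1], c=labels)` with `labels` being the cluster assignments. Checking the explained variance is worthwhile because a 2D picture can hide separation that exists in the dropped component.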

💡 Key Insights

Cluster 0: High-frequency, high-spending customers (Loyal Customers)
Cluster 1: Low-frequency, low-spending customers (Potential Churn)
Cluster 2: Occasional big spenders (Opportunistic)

Such segmentation helps businesses target promotions, recommend products, and allocate marketing budgets more effectively.
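
Labels like "Loyal Customers" come from looking at the average RFM values per cluster. Here's a small sketch of that profiling step; the six customers and their cluster assignments are made up for illustration and are not the project's results.

```python
import pandas as pd

# Made-up RFM rows with an assumed "Cluster" column holding K-Means labels.
rfm = pd.DataFrame({
    "Recency": [5, 8, 200, 250, 40, 60],
    "Frequency": [30, 25, 1, 2, 3, 2],
    "Monetary": [9000.0, 7000.0, 150.0, 250.0, 4000.0, 5000.0],
    "Cluster": [0, 0, 1, 1, 2, 2],
})

# Average behavior and size of each cluster.
profile = rfm.groupby("Cluster").agg(
    Customers=("Recency", "size"),
    AvgRecency=("Recency", "mean"),
    AvgFrequency=("Frequency", "mean"),
    AvgMonetary=("Monetary", "mean"),
)
print(profile)
```

Keep in mind that K-Means assigns cluster numbers arbitrarily, so "Cluster 0" in one run may be "Cluster 2" in the next unless the random seed is fixed. The names should always be derived from the profile table, not from the label number.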

→ What I Learned

- The importance of feature selection in unsupervised learning
- Practical application of K-Means and RFM modeling
- The power of PCA for visualizing high-dimensional data
- How clustering can drive data-backed business decisions

📁 Project Links

GitHub Repository

→ Let’s Connect!

If you're also working on ML or interested in data science projects, let's connect and grow together!

LinkedIn – Lokesh Patidar

Customer Segmentation using Clustering on the Online Retail Dataset