Customer Segmentation using Clustering on the Online Retail Dataset

Lokesh Patidar

Introduction

Hey everyone! 👋
I recently worked on an unsupervised machine learning project that focuses on clustering customers based on their purchase behavior using the Online Retail dataset.

The goal was to segment customers so businesses can better understand and target them through personalized marketing, improved retention strategies, and optimized customer experiences.

This post covers the process I followed, key concepts used, and the valuable insights I gained.


→ About the Dataset

The dataset used is the Online Retail dataset, which contains transaction data for a UK-based online store from 2010 to 2011.

  • Total Rows: ~540,000

  • Features include:

    • InvoiceNo, StockCode, Description

    • Quantity, InvoiceDate, UnitPrice

    • CustomerID, Country

This dataset is widely used in customer segmentation and retail analytics problems.


🔧 Data Preprocessing

Before clustering, I performed several preprocessing steps:

  1. Removed missing values

  2. Dropped canceled orders (invoices whose InvoiceNo starts with ‘C’)

  3. Created an RFM (Recency, Frequency, Monetary) dataset based on customer purchase behavior

  4. Standardized the data using StandardScaler for uniform scale across features
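To make the first two steps concrete, here is a minimal pandas sketch. It is not the exact project code: the sample rows are made up (only the column names come from the dataset), and `clean_transactions` is a hypothetical helper name. Scaling (step 4) only makes sense once the RFM table exists, so it is left out here.

```python
import pandas as pd

# Tiny made-up sample that reuses the dataset's column names (not real data).
raw = pd.DataFrame({
    "InvoiceNo":   ["536365", "536366", "C536379", "536380", "536381"],
    "StockCode":   ["85123A", "71053", "D", "22633", "22632"],
    "Description": ["HEART T-LIGHT HOLDER", "METAL LANTERN", "Discount",
                    "HAND WARMER", "HAND WARMER RED"],
    "Quantity":    [6, 6, -1, 2, 3],
    "InvoiceDate": pd.to_datetime(["2010-12-01 08:26", "2010-12-01 08:28",
                                   "2010-12-01 09:41", "2010-12-02 10:00",
                                   "2010-12-02 11:00"]),
    "UnitPrice":   [2.55, 3.39, 27.50, 1.85, 1.85],
    "CustomerID":  [17850.0, 17850.0, 14527.0, None, 13047.0],
    "Country":     ["United Kingdom"] * 5,
})


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper covering steps 1 and 2 of the list above."""
    # Step 1: drop every row that has a missing value in any column.
    out = df.dropna()
    # Step 2: drop canceled orders, i.e. invoice numbers starting with "C".
    is_canceled = out["InvoiceNo"].astype(str).str.startswith("C")
    return out.loc[~is_canceled].copy()


clean = clean_transactions(raw)
print(f"{len(raw)} rows before cleaning, {len(clean)} rows after")
```

In this toy sample, one row is dropped for its missing CustomerID and one for being a cancellation.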


📊 Feature Engineering: RFM Model

I used the RFM (Recency, Frequency, Monetary) model to extract customer behavior patterns:

  • Recency – How recently did the customer purchase?

  • Frequency – How often did they purchase?

  • Monetary – How much did they spend?

These features are ideal for clustering customers into meaningful groups.
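As a rough illustration, the three values can be computed with a single pandas `groupby`. This is a sketch under a few assumptions of mine: the transactions below are made up, `build_rfm` and the `TotalPrice` column are hypothetical names, Recency is measured in days from a snapshot date set one day after the last invoice, and Frequency counts distinct invoices per customer.

```python
import pandas as pd

# Made-up, already-cleaned transactions (column names follow the dataset).
tx = pd.DataFrame({
    "InvoiceNo":   ["1001", "1001", "1002", "1003", "1004"],
    "InvoiceDate": pd.to_datetime(["2011-01-05", "2011-01-05", "2011-03-10",
                                   "2011-02-01", "2011-03-20"]),
    "Quantity":    [2, 1, 5, 10, 1],
    "UnitPrice":   [3.0, 4.0, 2.0, 1.5, 20.0],
    "CustomerID":  [1, 1, 1, 2, 3],
})


def build_rfm(df: pd.DataFrame, snapshot_date=None) -> pd.DataFrame:
    """Hypothetical helper: one row per customer with R, F and M columns."""
    # Line total per transaction row (assumed definition of spend).
    df = df.assign(TotalPrice=df["Quantity"] * df["UnitPrice"])
    # Assumption: recency is measured from the day after the last invoice.
    if snapshot_date is None:
        snapshot_date = df["InvoiceDate"].max() + pd.Timedelta(days=1)
    return df.groupby("CustomerID").agg(
        Recency=("InvoiceDate", lambda d: (snapshot_date - d.max()).days),
        Frequency=("InvoiceNo", "nunique"),
        Monetary=("TotalPrice", "sum"),
    )


rfm = build_rfm(tx)
print(rfm)
```

Other definitions are just as valid (for example, counting line items instead of invoices for Frequency), so treat the choices here as one option rather than the standard.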


→ Clustering with K-Means

I applied K-Means Clustering to group customers into different segments:

  • Used the Elbow Method to find the optimal number of clusters

  • Evaluated clustering quality using the Silhouette Score

  • Used PCA (Principal Component Analysis) to reduce dimensionality for 2D visualization
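Here is a small scikit-learn sketch of the Elbow Method and Silhouette Score steps. The data is synthetic (three made-up groups of RFM-like rows, not the real customers), and the range of k values and the random seeds are arbitrary choices of mine, so read it as an outline of the workflow rather than the project's actual results.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic (Recency, Frequency, Monetary) rows for three made-up groups.
rfm_values = np.vstack([
    rng.normal([10, 20, 2000], [3, 3, 200], size=(50, 3)),
    rng.normal([200, 2, 100], [30, 1, 30], size=(50, 3)),
    rng.normal([60, 4, 1500], [15, 1, 200], size=(50, 3)),
])

# Put all three features on the same scale before computing distances.
X = StandardScaler().fit_transform(rfm_values)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squares (elbow curve)
    silhouettes[k] = silhouette_score(X, model.labels_)

for k in inertias:
    print(f"k={k}  inertia={inertias[k]:.1f}  silhouette={silhouettes[k]:.3f}")

# One simple rule of thumb: pick the k with the highest silhouette score.
best_k = max(silhouettes, key=silhouettes.get)
print("k with the highest silhouette score:", best_k)
```

The elbow is usually read off a plot of inertia against k, looking for the point where the curve flattens. The silhouette score gives a second opinion, and the two do not always agree on real data.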


📈 Visualization

I visualized the clusters using PCA to get a better understanding of how customers are grouped.

Different colors represent different segments, and the spread helps identify high-value or infrequent customers.
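A minimal sketch of the projection step with scikit-learn's `PCA`, again on synthetic RFM-like data with arbitrary seeds and an assumed k of 3. It stops at the 2D coordinates and leaves the plotting itself to whichever charting library you prefer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic (Recency, Frequency, Monetary) rows for three made-up groups.
rfm_values = np.vstack([
    rng.normal([10, 20, 2000], [3, 3, 200], size=(40, 3)),
    rng.normal([200, 2, 100], [30, 1, 30], size=(40, 3)),
    rng.normal([60, 4, 1500], [15, 1, 200], size=(40, 3)),
])

X = StandardScaler().fit_transform(rfm_values)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Project the three scaled features onto the top two principal components.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)

# Share of the total variance that survives the projection to 2D.
kept = pca.explained_variance_ratio_.sum()
print(f"variance kept by 2 components: {kept:.1%}")

# Center of each cluster in the 2D plane, as a quick sanity check.
for c in np.unique(labels):
    cx, cy = coords[labels == c].mean(axis=0)
    print(f"cluster {c}: center at ({cx:.2f}, {cy:.2f})")
```

To draw the picture, pass `coords[:, 0]`, `coords[:, 1]`, and `labels` (as the color) to a scatter plot. With only three RFM features, two components usually keep most of the variance, so the 2D view tends to be a fair summary.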


💡 Key Insights

  • Cluster 0: High-frequency, high-spending customers (Loyal Customers)

  • Cluster 1: Low-frequency, low-spending (Potential Churn)

  • Cluster 2: Occasional big spenders (Opportunistic)

Such segmentation helps businesses target promotions, recommend products, and allocate marketing budgets more effectively.
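Labels like these typically come from comparing the average RFM values of each cluster. Here is a short sketch of that profiling step, with made-up numbers and hypothetical column names (`Cluster`, `AvgRecency`, and so on):

```python
import pandas as pd

# Made-up RFM rows with an already-assigned cluster label (not real results).
rfm = pd.DataFrame({
    "Recency":   [5, 8, 220, 250, 40, 55],
    "Frequency": [30, 25, 1, 2, 3, 2],
    "Monetary":  [5000.0, 4200.0, 80.0, 120.0, 3000.0, 2600.0],
    "Cluster":   [0, 0, 1, 1, 2, 2],
})

# Average behavior and size of each segment.
profile = rfm.groupby("Cluster").agg(
    Customers=("Recency", "size"),
    AvgRecency=("Recency", "mean"),
    AvgFrequency=("Frequency", "mean"),
    AvgMonetary=("Monetary", "mean"),
)
print(profile)
```

Keep in mind that K-Means assigns cluster numbers arbitrarily, so "Cluster 0" can describe a different group on another run. The names should follow the profile, not the index.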


→ What I Learned

  • Importance of feature selection in unsupervised learning

  • Practical application of K-Means and RFM modeling

  • Power of PCA in visualizing high-dimensional data

  • How clustering can drive data-backed business decisions

GitHub Repository


→ Let’s Connect!

If you're also working on ML or interested in data science projects, let's connect and grow together!

LinkedIn – Lokesh Patidar

