From Raw Data to Customer Insights: My Journey Building a Scalable Segmentation Model Using PySpark & K-Means

Intro:
In this post, I’ll walk you through a customer segmentation project I recently completed using PySpark. The goal was to identify distinct customer groups based on behavioral and engagement features to support personalized marketing strategies. Working within a secure, enterprise-grade environment, I ensured compliance and confidentiality while delivering a data science solution that scales.
1. The Business Problem
The goal was to analyze customer behavior across bookings, engagement with campaigns, and demographic information to find actionable clusters that help tailor experiences for each group.
2. Data Sources
I worked with multiple structured datasets (CSV and SQL-based) that included:
Customer demographics
Booking history
Campaign engagement
Stay types and visit categories
These were combined into a single unified DataFrame using PySpark.
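As a rough illustration of that step (the real table names, schemas, and join keys are confidential, so everything here is a placeholder), combining sources like these in PySpark looks roughly like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("customer-segmentation").getOrCreate()

# Hypothetical sources; the real project read from internal CSV files and SQL tables
demographics = spark.read.csv("demographics.csv", header=True, inferSchema=True)
bookings = spark.read.csv("bookings.csv", header=True, inferSchema=True)
engagement = spark.read.csv("campaign_engagement.csv", header=True, inferSchema=True)

# One unified DataFrame, keyed on a shared customer identifier
customers = (
    demographics
    .join(bookings, on="customer_id", how="left")
    .join(engagement, on="customer_id", how="left")
)
```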
3. Feature Engineering
To prepare the dataset for customer segmentation, I focused on transforming raw behavioral and transactional data into meaningful numerical representations suitable for machine learning.
This involved:
Cleaning and Normalization: Removing inconsistencies and handling missing or ambiguous values with domain-sensitive default values.
Encoding Categorical Data: Using both binary and multi-class encoding strategies to handle different types of qualitative attributes without overfitting.
Dimensionality Awareness: Retaining only those features that demonstrated value during exploratory analysis while minimizing noise from overly sparse or correlated inputs.
Campaign & Engagement Patterns: Without revealing specifics, I derived abstract behavioral signals from how users interacted with various business touchpoints.
Final Dataset Construction: A balanced matrix of encoded features, standardized and optimized for clustering performance across a diverse customer base.
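To make the encoding and standardization steps concrete, here is a minimal sketch of that kind of Spark ML pipeline. The column names are invented for illustration; the actual feature set isn't disclosed.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler

# Placeholder columns -- stand-ins for the real behavioral and demographic features
categorical_cols = ["stay_type", "visit_category"]
numeric_cols = ["total_bookings", "avg_spend", "campaign_clicks"]

# Index and one-hot encode the qualitative attributes
indexers = [StringIndexer(inputCol=c, outputCol=f"{c}_idx", handleInvalid="keep")
            for c in categorical_cols]
encoders = [OneHotEncoder(inputCol=f"{c}_idx", outputCol=f"{c}_vec")
            for c in categorical_cols]

# Assemble everything into a single vector and standardize it for K-Means
assembler = VectorAssembler(
    inputCols=numeric_cols + [f"{c}_vec" for c in categorical_cols],
    outputCol="features_raw",
)
scaler = StandardScaler(inputCol="features_raw", outputCol="features", withStd=True)

pipeline = Pipeline(stages=indexers + encoders + [assembler, scaler])
features_df = pipeline.fit(customers).transform(customers)
```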
4. Clustering Approach
I used the Elbow Method to determine the optimal number of clusters and experimented with both K-Means and DBSCAN.
Why K-Means?
Scalable for large datasets (mine had ~1.3M rows)
Easy to interpret
Works well with numerical + encoded features
How I Determined K:
Used the Elbow Method with inertia_ (the within-cluster sum of squared distances) to find the "elbow point"
Settled on K=4 based on the distortion plot; a PySpark sketch of this loop follows below
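In Spark ML, the equivalent of scikit-learn's inertia_ is the within-cluster sum of squared distances exposed as model.summary.trainingCost, so the elbow loop looks roughly like this (assuming the features_df produced by the pipeline sketch above):

```python
from pyspark.ml.clustering import KMeans

# Within-cluster sum of squared distances for each candidate k
costs = {}
for k in range(2, 11):
    model = KMeans(k=k, seed=42, featuresCol="features").fit(features_df)
    costs[k] = model.summary.trainingCost  # Spark ML's analogue of sklearn's inertia_

# Plot costs vs. k and look for the "elbow"; in my case the curve flattened at k=4
final_model = KMeans(k=4, seed=42, featuresCol="features").fit(features_df)
clustered_df = final_model.transform(features_df)  # adds a "prediction" column with the cluster id
```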
5. Model Insights
After applying clustering, I analyzed cluster-wise distributions:
Revenue contributions by country, property, and month
Engagement levels by campaign type
Gender and generation skew per cluster
Visualizations included:
Top-performing countries (grouped with "Other" logic for the long tail)
Cluster-wise revenue distribution
Peak revenue months extracted from booking dates
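The insights themselves came from straightforward cluster-wise aggregations. Here is a sketch of the pattern, with illustrative column names (booking_revenue, country, and booking_date are placeholders):

```python
from pyspark.sql import functions as F

# Revenue contribution per cluster and country
revenue_by_cluster = (
    clustered_df
    .groupBy("prediction", "country")
    .agg(F.sum("booking_revenue").alias("revenue"),
         F.countDistinct("customer_id").alias("customers"))
    .orderBy("prediction", F.desc("revenue"))
)

# Peak revenue months, derived from the booking date
monthly_revenue = (
    clustered_df
    .withColumn("booking_month", F.month("booking_date"))
    .groupBy("prediction", "booking_month")
    .agg(F.sum("booking_revenue").alias("revenue"))
)
```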
6. Key Learnings
Feature engineering is 70% of the work
PySpark can handle large-scale clustering with efficiency
Always validate cluster quality with internal metrics (inertia, silhouette score) as well as business sense; a quick silhouette check is shown below
Maintain security — no client info, no raw data exposure
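For the silhouette check mentioned above, Spark ML provides ClusteringEvaluator, which works directly on the clustered DataFrame:

```python
from pyspark.ml.evaluation import ClusteringEvaluator

evaluator = ClusteringEvaluator(featuresCol="features", predictionCol="prediction",
                                metricName="silhouette",
                                distanceMeasure="squaredEuclidean")
silhouette = evaluator.evaluate(clustered_df)
print(f"Silhouette score: {silhouette:.3f}")  # closer to 1.0 means tighter, better-separated clusters
```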
7. Final Thoughts
This project solidified my understanding of unsupervised learning pipelines in PySpark. From complex joins and transformations to clustering and insight generation, I got hands-on experience delivering end-to-end data science in a secure environment.
Want the code?
I’ll be open-sourcing a dummy version of the notebook on GitHub soon with fake data — drop a comment if you want access or if you're curious how to scale segmentation with Spark!
