From Raw Data to Customer Insights: My Journey Building a Scalable Segmentation Model Using PySpark & K-Means

Intro:

In this post, I’ll walk you through a customer segmentation project I recently completed using PySpark. The goal was to identify distinct customer groups based on behavioral and engagement features to support personalized marketing strategies. Working within a secure, enterprise-grade environment, I ensured compliance and confidentiality while delivering a data science solution that scales.

1. The Business Problem

The goal was to analyze customer behavior across bookings, engagement with campaigns, and demographic information to find actionable clusters that help tailor experiences for each group.

2. Data Sources

I worked with multiple structured datasets (CSV and SQL-based) that included:

  • Customer demographics

  • Booking history

  • Campaign engagement

  • Stay types and visit categories

These were combined into a single unified DataFrame using PySpark.
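To give a flavor of how the sources came together, here is a minimal sketch of the join logic. The file paths, dataset names, and the `customer_id` join key are illustrative assumptions, not the actual schema.

```python
# Minimal join sketch with hypothetical sources and column names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("customer-segmentation").getOrCreate()

# Hypothetical extracts: in practice these came from CSV files and SQL-based tables.
demographics = spark.read.csv("data/demographics.csv", header=True, inferSchema=True)
bookings = spark.read.csv("data/bookings.csv", header=True, inferSchema=True)
engagement = spark.read.csv("data/campaign_engagement.csv", header=True, inferSchema=True)

# Combine everything into one customer-level DataFrame keyed on customer_id.
unified = (
    demographics
    .join(bookings, on="customer_id", how="left")
    .join(engagement, on="customer_id", how="left")
)
```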

3. Feature Engineering

To prepare the dataset for customer segmentation, I focused on transforming raw behavioral and transactional data into meaningful numerical representations suitable for machine learning.

This involved the following steps (a minimal pipeline sketch appears after this list):

  • Cleaning and Normalization: Removing inconsistencies and handling missing or ambiguous values with domain-sensitive default values.

  • Encoding Categorical Data: Using both binary and multi-class encoding strategies to handle different types of qualitative attributes without overfitting.

  • Dimensionality Awareness: Retaining only those features that demonstrated value during exploratory analysis while minimizing noise from overly sparse or correlated inputs.

  • Campaign & Engagement Patterns: Without revealing specifics, I derived abstract behavioral signals from how users interacted with various business touchpoints.

  • Final Dataset Construction: A balanced matrix of encoded features, standardized and optimized for clustering performance across a diverse customer base.
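As a rough illustration of the encoding and scaling steps, here is a minimal Spark ML pipeline sketch. The column names (`stay_type`, `booking_count`, `total_spend`, `campaign_clicks`) are placeholders rather than the real feature set.

```python
# Sketch of categorical encoding + standardization, assuming the "unified" DataFrame above.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler

# Multi-class categorical column -> index -> one-hot vector.
indexer = StringIndexer(inputCol="stay_type", outputCol="stay_type_idx", handleInvalid="keep")
encoder = OneHotEncoder(inputCols=["stay_type_idx"], outputCols=["stay_type_vec"])

# Assemble numeric and encoded features into a single vector, then standardize it.
assembler = VectorAssembler(
    inputCols=["booking_count", "total_spend", "campaign_clicks", "stay_type_vec"],
    outputCol="features_raw",
)
scaler = StandardScaler(inputCol="features_raw", outputCol="features",
                        withMean=True, withStd=True)

pipeline = Pipeline(stages=[indexer, encoder, assembler, scaler])
prepared = pipeline.fit(unified).transform(unified)
```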

4. Clustering Approach

I experimented with both K-Means and DBSCAN, and used the Elbow Method to determine the optimal number of clusters.

Why K-Means?

  • Scalable for large datasets (mine had ~1.3M rows)

  • Easy to interpret

  • Works well with numerical + encoded features

How I Determined K:

  • Used the Elbow Method, plotting inertia (the within-cluster sum of squares) against K to find the "elbow point"

  • Settled on K=4 based on the distortion plot; a minimal sketch of this loop follows
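Below is a minimal sketch of that loop using Spark ML's KMeans, assuming the `prepared` DataFrame from the feature-engineering step. `trainingCost` (the within-cluster sum of squared distances) plays the role of the inertia value being plotted.

```python
# Elbow-method sketch: fit KMeans for a range of k and record the training cost.
from pyspark.ml.clustering import KMeans

costs = {}
for k in range(2, 11):
    kmeans = KMeans(featuresCol="features", k=k, seed=42)
    model = kmeans.fit(prepared)
    # Within-cluster sum of squared distances for this k.
    costs[k] = model.summary.trainingCost

# Plot (or print) cost vs. k and look for the bend; in this project it appeared at k = 4.
for k, cost in sorted(costs.items()):
    print(k, cost)
```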

5. Model Insights

After applying clustering, I analyzed cluster-wise distributions:

  • Revenue contributions by country, property, and month

  • Engagement levels by campaign type

  • Gender and generation skew per cluster

Visualizations included:

  • Top-performing countries (with long-tail markets grouped as “Other”)

  • Cluster-wise revenue distribution

  • Peak revenue months extracted from booking dates
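A sketch of the cluster-wise aggregation logic is below. The final fit, the `prediction` column it produces, and the column names (`country`, `total_spend`, `customer_id`) are illustrative assumptions.

```python
# Fit the final k=4 model, label each customer, and aggregate by cluster.
from pyspark.ml.clustering import KMeans
from pyspark.sql import functions as F

final_k = 4
model = KMeans(featuresCol="features", k=final_k, seed=42).fit(prepared)
clustered = model.transform(prepared)  # adds a "prediction" column with the cluster id

# Revenue contribution and customer counts per cluster and country.
revenue_by_cluster = (
    clustered
    .groupBy("prediction", "country")
    .agg(F.sum("total_spend").alias("revenue"),
         F.countDistinct("customer_id").alias("customers"))
    .orderBy("prediction", F.desc("revenue"))
)
revenue_by_cluster.show()
```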

6. Key Learnings

  • Feature engineering is 70% of the work

  • PySpark can handle large-scale clustering with efficiency

  • Always validate cluster quality with internal metrics (inertia, silhouette score) and business sense; a quick silhouette check is sketched after this list

  • Maintain security — no client info, no raw data exposure
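For reference, a silhouette check with Spark ML's ClusteringEvaluator looks roughly like this, assuming the `clustered` DataFrame and `final_k` from the previous section.

```python
# Internal validation sketch: silhouette score on the clustered DataFrame.
from pyspark.ml.evaluation import ClusteringEvaluator

evaluator = ClusteringEvaluator(featuresCol="features", predictionCol="prediction",
                                metricName="silhouette", distanceMeasure="squaredEuclidean")
silhouette = evaluator.evaluate(clustered)
print(f"Silhouette score for k={final_k}: {silhouette:.3f}")
```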

7. Final Thoughts

This project solidified my understanding of unsupervised learning pipelines in PySpark. From complex joins and transformations to clustering and insight generation, I got hands-on experience delivering end-to-end data science in a secure environment.

Want the code?

I’ll be open-sourcing a dummy version of the notebook on GitHub soon with fake data — drop a comment if you want access or if you're curious how to scale segmentation with Spark!

Written by Muhammad Atif Hafeez