Churn Prediction with CatBoost: An End-to-End Project for Telco Customers

Sanjayram

Predicting customer churn with machine learning isn’t just a portfolio project — it’s a business-critical use case. In this post, I’ll walk you through how I built a production-grade, end-to-end churn prediction system using CatBoost, Streamlit, Docker, and more.

Problem Statement: Why Does Churn Matter?

Churn — when a customer leaves a service — is one of the biggest threats to subscription-based businesses like telecom providers. Acquiring a new customer is 5–10x more expensive than retaining an existing one. My goal? Build a system that predicts which customers are likely to churn, so the business can proactively intervene and retain them.

Tools & Tech Stack

  • Data Analysis: Pandas, Matplotlib, Seaborn

  • Feature Engineering: Scikit-learn, Category Encoders

  • Model Training: CatBoost, XGBoost, Random Forest, Logistic Regression, SVM

  • Evaluation: Precision, Recall, F1, AUC

  • Deployment: Streamlit, Docker

  • Code Structure: Production-grade with config mgmt, unit tests

Project Architecture:

Raw Telco Data → EDA & Cleaning → Feature Engineering → Train 5 ML Models → Model Selection (CatBoost wins) → Streamlit Dashboard → Dockerized Deployment

EDA: Understanding the Churn Story

The Telco dataset included features like:

  • Contract, InternetService, PaymentMethod

  • tenure, MonthlyCharges, TotalCharges

  • Churn (target)

Key insights:

  • Short-tenure customers churn more

  • Month-to-month contracts have high churn

  • Fiber optic customers and those with electronic checks were more likely to churn

Feature Engineering: Turning Raw Data into Gold

Steps included:

  • Handling missing values (TotalCharges)

  • Label encoding binary features

  • One-hot encoding categorical features

  • Scaling numerical values

  • Creating interaction terms (e.g., charges_per_month = TotalCharges / tenure)
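The steps above can be sketched in pandas and scikit-learn. This is a minimal illustration on a tiny synthetic frame, not the project's actual code; the column names follow the Telco dataset described earlier:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny synthetic frame standing in for the Telco data.
df = pd.DataFrame({
    "tenure": [1, 24, 0, 60],
    "MonthlyCharges": [70.0, 55.5, 89.9, 20.0],
    "TotalCharges": ["70.0", "1332.0", " ", "1200.0"],  # blanks occur when tenure == 0
    "Partner": ["Yes", "No", "No", "Yes"],
    "Contract": ["Month-to-month", "Two year", "Month-to-month", "One year"],
})

# 1. Handle missing values: blank TotalCharges strings become NaN, then 0.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce").fillna(0.0)

# 2. Label-encode binary features.
df["Partner"] = df["Partner"].map({"No": 0, "Yes": 1})

# 3. One-hot encode multi-level categoricals.
df = pd.get_dummies(df, columns=["Contract"], drop_first=True)

# 4. Interaction term: average charge per month of tenure (guarding zero tenure).
df["charges_per_month"] = df["TotalCharges"] / df["tenure"].replace(0, 1)

# 5. Scale the numeric columns.
num_cols = ["tenure", "MonthlyCharges", "TotalCharges", "charges_per_month"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```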

Model Training & Evaluation

I trained and compared 5 models:

  1. CatBoost

  2. XGBoost

  3. Random Forest

  4. Logistic Regression

  5. SVM

Evaluation Metrics:

  • Accuracy alone wasn't enough (the data is imbalanced)

  • Used F1 Score, Precision, Recall, and AUC-ROC
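A toy example of why accuracy misleads here, using synthetic labels rather than the real dataset: a model that always predicts "no churn" looks accurate at a 10% churn rate, but F1 and AUC expose it immediately.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [0] * 90 + [1] * 10   # 10% churn rate
y_naive = [0] * 100            # always predicts "no churn"

acc = accuracy_score(y_true, y_naive)              # 0.90 — looks great
f1 = f1_score(y_true, y_naive, zero_division=0)    # 0.0 — reveals the problem
auc = roc_auc_score(y_true, [0.0] * 100)           # 0.5 — no better than chance
```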

CatBoost outperformed all others:

  • F1: 0.83

  • AUC: 0.89

  • No need for heavy encoding = simpler pipeline

Model Explainability

I used:

  • Feature importance from CatBoost

  • SHAP values to visualize individual predictions

Most important features:

  • Contract

  • tenure

  • MonthlyCharges

  • InternetService
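The importance ranking can be sketched with scikit-learn's RandomForest as a stand-in for CatBoost's `get_feature_importance` (synthetic data in which short tenure drives churn, echoing the EDA finding above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
tenure = rng.integers(1, 72, n)
monthly = rng.uniform(20, 120, n)
noise = rng.uniform(0, 1, n)

# Synthetic target: churn driven mostly by short tenure.
y = (tenure < 12).astype(int)

X = np.column_stack([tenure, monthly, noise])
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The tenure feature should dominate the importance ranking.
for name, imp in zip(["tenure", "MonthlyCharges", "noise"], model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```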

Streamlit Dashboard

I built an interactive web app using Streamlit. Key features:

  • Upload a new dataset

  • Visualize churn distribution

  • Predict churn for individual users

  • Explain predictions using SHAP plots

Dockerized Deployment

I wrapped the entire app in a Docker container:

  • Reproducible across machines

  • Easily deployable to Render / HuggingFace / Streamlit Cloud
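A minimal Dockerfile along these lines (`app.py` is an assumed entry point, not necessarily the repo's actual file name):

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
```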

Also included:

  • .env for secrets

  • config.yaml for tunables

  • pytest tests for pipelines
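A small pytest-style sketch of what a pipeline test might look like; `add_charges_per_month` is a hypothetical helper name for the interaction-term step described earlier, not the repo's actual function:

```python
import pandas as pd

def add_charges_per_month(df: pd.DataFrame) -> pd.DataFrame:
    """Interaction term from the feature-engineering step; guards zero tenure."""
    out = df.copy()
    out["charges_per_month"] = out["TotalCharges"] / out["tenure"].replace(0, 1)
    return out

def test_charges_per_month_basic():
    df = pd.DataFrame({"TotalCharges": [1200.0], "tenure": [24]})
    assert add_charges_per_month(df)["charges_per_month"].iloc[0] == 50.0

def test_zero_tenure_does_not_divide_by_zero():
    df = pd.DataFrame({"TotalCharges": [70.0], "tenure": [0]})
    assert add_charges_per_month(df)["charges_per_month"].iloc[0] == 70.0
```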

Key Takeaways

• Hands-on with real ML tools, not just theory

• Built a real product, not just a notebook

• Learned how to deploy ML apps for the real world

• Streamlit + Docker is a game-changer for ML portfolios

Project Links

• GitHub Repo: https://github.com/SANJAYRAM-DS/churn-ml-data-to-docker

• Live Demo: https://churn-ml-data-to-docker-kvrk8rh44zucasc3vcmx6n.streamlit.app

• Let's connect on LinkedIn: www.linkedin.com/in/sanjayram-data

Final Thoughts

This project took me from data wrangling to full-stack ML deployment. I now understand what end-to-end machine learning really means. Hope this inspires you to go beyond Jupyter notebooks and build real, deployable ML systems!

Let me know your thoughts — and feel free to fork the repo and try it yourself!
