Churn Prediction with CatBoost: An End-to-End Project for Telco Customers

Predicting customer churn with machine learning isn’t just a portfolio project — it’s a business-critical use case. In this post, I’ll walk you through how I built a production-grade, end-to-end churn prediction system using CatBoost, Streamlit, Docker, and more.
Problem Statement: Why Does Churn Matter?
Churn — when a customer leaves a service — is one of the biggest threats to subscription-based businesses like telecom providers. Acquiring a new customer is often estimated to cost 5–10x more than retaining an existing one. My goal? Build a system that predicts which customers are likely to churn, so the business can proactively intervene and retain them.
Tools & Tech Stack
Data Analysis: Pandas, Matplotlib, Seaborn
Feature Engineering: Scikit-learn, Category Encoders
Model Training: CatBoost, XGBoost, Random Forest, Logistic Regression, SVM
Evaluation: Precision, Recall, F1, AUC
Deployment: Streamlit, Docker
Code Structure: Production-grade with config management, unit tests
Project Architecture:
Raw Telco Data → EDA & Cleaning → Feature Engineering → Train 5 ML Models → Model Selection (CatBoost wins) → Streamlit Dashboard → Dockerized Deployment
EDA: Understanding the Churn Story
The Telco dataset included features like:
Contract, InternetService, PaymentMethod
tenure, MonthlyCharges, TotalCharges
Churn (target)
Key insights:
Short-tenure customers churn more
Month-to-month contracts have high churn
Fiber optic customers and those with electronic checks were more likely to churn
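A quick way to surface insights like these is a churn rate per category. Here's a minimal sketch with pandas, using a tiny hand-made sample in place of the real Telco CSV (column names match the standard Telco Customer Churn dataset):

```python
import pandas as pd

# Toy stand-in for the Telco dataset; the real data has ~7,000 rows
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year"],
    "Churn":    ["Yes",            "No",             "No",       "No"],
})

# Churn rate by contract type: encode Churn as 0/1, then average per group
churn_rate = (
    df.assign(churned=df["Churn"].eq("Yes").astype(int))
      .groupby("Contract")["churned"]
      .mean()
)
print(churn_rate)  # Month-to-month churns at 0.5 in this toy sample
```

Repeating the same groupby over `InternetService` and `PaymentMethod` is what exposed the fiber-optic and electronic-check patterns.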
Feature Engineering: Turning Raw Data into Gold
Steps included:
Handling missing values (TotalCharges)
Label encoding binary features
One-hot encoding categorical features
Scaling numerical values
Creating interaction terms (e.g., charges_per_month = TotalCharges / tenure)
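Two of the steps above interact: in the raw Telco CSV, `TotalCharges` is blank for brand-new customers (tenure 0), and the same customers would trigger a divide-by-zero in `charges_per_month`. A minimal sketch of handling both (toy data, not the repo's actual pipeline code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "tenure":       [0,    12,      24],
    "TotalCharges": ["",   "600.0", "1800.0"],  # blank for tenure-0 rows, as in the raw CSV
})

# Coerce the blank strings to NaN, then fill: a tenure-0 customer has paid nothing yet
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce").fillna(0.0)

# Interaction term, guarding against division by zero for new customers
df["charges_per_month"] = np.where(
    df["tenure"] > 0,
    df["TotalCharges"] / df["tenure"].replace(0, 1),
    0.0,
)
print(df["charges_per_month"].tolist())  # [0.0, 50.0, 75.0]
```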
Model Training & Evaluation
I trained and compared 5 models:
CatBoost
XGBoost
Random Forest
Logistic Regression
SVM
Evaluation Metrics:
Accuracy: not enough (imbalanced data)
Used F1 Score, Precision, Recall, and AUC-ROC
CatBoost outperformed all others:
F1: 0.83
AUC: 0.89
No need for heavy encoding = simpler pipeline
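The model comparison itself is a straightforward loop over fitted models, scoring each on held-out data. A sketch of that loop using scikit-learn with two of the five models and synthetic imbalanced data standing in for the Telco set (CatBoost/XGBoost would slot into the same dict):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: ~27% positive class, roughly matching Telco's churn rate
X, y = make_classification(n_samples=500, weights=[0.73], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    scores[name] = {
        "f1": f1_score(y_te, (proba > 0.5).astype(int)),  # threshold at 0.5
        "auc": roc_auc_score(y_te, proba),                # threshold-free
    }
```

Because the classes are imbalanced, F1 and AUC-ROC are computed here instead of plain accuracy, mirroring the metric choice above.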
Model Explainability
I used:
Feature importance from CatBoost
SHAP values to visualize individual predictions
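The project uses CatBoost's built-in importances plus SHAP for per-prediction explanations. As a library-agnostic stand-in (so the sketch runs without CatBoost or the `shap` package installed), scikit-learn's `permutation_importance` produces a similar global feature ranking:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data standing in for the Telco features
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature and measure the score drop: bigger drop = more important
result = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # feature indices, best first
```

In the real project, the same ranking step over CatBoost importances is what surfaced Contract, tenure, MonthlyCharges, and InternetService at the top.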
Most important features:
Contract
tenure
MonthlyCharges
InternetService
Streamlit Dashboard
Built an interactive web app using Streamlit. Key features:
Upload a new dataset
Visualize churn distribution
Predict churn for individual users
Explain predictions using SHAP plots
Dockerized Deployment
Wrapped the entire app in a Docker container:
Reproducible across machines
Easily deployable to Render / HuggingFace / Streamlit Cloud
Also included:
.env for secrets
config.yaml for tunables
pytest tests for pipelines
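A minimal Dockerfile for a setup like this typically looks as follows — a sketch, not the repo's actual file; `app.py` and `requirements.txt` are assumed filenames:

```dockerfile
FROM python:3.11-slim
WORKDIR /app

# Install dependencies first so Docker caches this layer between code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
```

Binding to `0.0.0.0` is what makes the app reachable from outside the container on hosts like Render or Hugging Face Spaces.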
Key Takeaways
• Hands-on with real ML tools, not just theory
• Built a real product, not just a notebook
• Learned how to deploy ML apps for the real world
• Streamlit + Docker is a game-changer for ML portfolios
Project Links
• GitHub Repo: https://github.com/SANJAYRAM-DS/churn-ml-data-to-docker
• Live Demo: https://churn-ml-data-to-docker-kvrk8rh44zucasc3vcmx6n.streamlit.app
• Let's connect on LinkedIn: www.linkedin.com/in/sanjayram-data
Final Thoughts
This project took me from data wrangling to full-stack ML deployment. I now understand what end-to-end machine learning really means. Hope this inspires you to go beyond Jupyter notebooks and build real, deployable ML systems!
Let me know your thoughts — and feel free to fork the repo to try it yourself!