Predicting Customer Churn Using XGBoost: A Comprehensive Guide
Table of Contents

1. Introduction
2. Understanding the Dataset
3. Setting Up the Environment
   3.1 Clone the GitHub Repository
   3.2 Install Dependencies
   3.3 Load the Dataset
   3.4 Run the Jupyter Notebook
4. Data Preprocessing
   4.1 Handling Missing Data and Categorical Variables
   4.2 Correcting Numerical Data Formats
   4.3 Feature Scaling
5. Model Building
   5.1 Splitting the Data
   5.2 Training the XGBoost Classifier
   5.3 Evaluating the Model
6. Hyperparameter Tuning
   6.1 Setting Up GridSearchCV
   6.2 Evaluating the Tuned Model
7. Conclusion
8. Next Steps
   Experiment with Additional Features
   Try Different Algorithms
   Deploy the Model
1. Introduction
In today’s highly competitive market, customer retention is as crucial as acquiring new customers. For subscription-based businesses, understanding and predicting customer churn — when a customer stops using a service — can significantly impact revenue. By leveraging machine learning techniques, companies can predict which customers are likely to churn and take proactive measures to retain them.
In this blog post, we’ll walk through a detailed process of building a machine learning model to predict customer churn using the XGBoost algorithm, known for its efficiency and performance in classification tasks. We will cover everything from data preprocessing, model building, and evaluation to hyperparameter tuning. The dataset used in this project is sourced from Kaggle, and by the end of this post, you’ll have a clear understanding of how to implement a churn prediction model for your own datasets.
2. Understanding the Dataset
The dataset for this project provides a rich set of features related to customer behavior, including:
Average Order Value: The average value of orders placed by the customer.
Discount Rates: The average discount the customer receives.
Product Views: The number of product pages viewed by the customer.
Session Details: Information about the customer’s interactions during their sessions.
The target variable in this dataset is Churn, a binary indicator (0 or 1) representing whether a customer has churned.
Dataset Overview:
File Name: data.csv
Number of Columns: 20
Key Features: average_order_value, discount_rate_per_visited_product, product_detail_view, location_code, etc.
Target Variable: Churn
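Before any preprocessing, it helps to sanity-check the download. Here is a minimal look at the file, assuming data.csv sits in the project root as described in section 3.3:

import pandas as pd

df = pd.read_csv('data.csv')
print(df.shape)  # expect 20 columns
print(df['Churn'].value_counts(normalize=True))  # class balance of the target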
3. Setting Up the Environment
Before we dive into the model-building process, you need to set up your Python environment. This involves installing the necessary libraries and tools required to execute the code.
3.1 Clone the GitHub Repository
The first step is to clone the repository containing all the code and data for this project.
git clone https://github.com/Gayathri-Selvaganapathi/customer_churn_prediction.git
cd customer_churn_prediction
3.2 Install Dependencies
Install the required Python packages using the requirements.txt file.
pip install -r requirements.txt
3.3 Load the Dataset
Download the dataset from Kaggle and place the data.csv file in the root directory of the project.
3.4 Run the Jupyter Notebook
Open Jupyter Notebook or JupyterLab and navigate to Customer_Churn_Prediction.ipynb. This notebook contains all the steps for data preprocessing, model building, and evaluation.
4. Data Preprocessing
Data preprocessing is a crucial step that prepares the dataset for model training. Proper preprocessing can greatly enhance model performance and ensure that the features fed into the model are relevant and correctly formatted.
4.1 Handling Missing Data and Categorical Variables
The dataset includes a variety of features, some of which are categorical and need to be converted into a format that the machine learning model can process. For example:
Location Code: Initially stored as an integer, this column represents categorical data (like postal codes). We convert it into a string and then into categorical data.
Yes/No Columns: Columns such as credit_card_info_save and push_status are binary categorical variables. These are converted to integers (0 and 1) to facilitate the model's learning process.
# Convert postal-style codes to strings, then to a categorical dtype
df['location_code'] = df['location_code'].astype(str).astype('category')

# Map binary Yes/No columns to integers
df['credit_card_info_save'] = df['credit_card_info_save'].replace({'Yes': 1, 'No': 0})
df['push_status'] = df['push_status'].replace({'Yes': 1, 'No': 0})
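The heading also mentions missing data. A common first check looks like the following (illustrative only; the right strategy, dropping versus imputing, depends on the column):

# Count missing values per column, then apply the simplest strategy
print(df.isna().sum())
df = df.dropna()  # imputation may be preferable for sparse columns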
4.2 Correcting Numerical Data Formats
Some numerical columns use commas as decimal separators (a common European convention). These need to be replaced with dots so the values can be parsed as floats and used correctly in mathematical operations during model training.
# Replace decimal commas with dots and cast to float
df['average_order_value'] = df['average_order_value'].str.replace(',', '.').astype(float)
df['discount_rate_per_visited_product'] = df['discount_rate_per_visited_product'].str.replace(',', '.').astype(float)
4.3 Feature Scaling
Feature scaling helps keep numerical values within comparable ranges, so features with larger magnitudes do not disproportionately influence the model. Here we use scikit-learn's Normalizer to scale the numerical features; note that Normalizer rescales each sample (row) to unit norm rather than scaling each column independently.
import pandas as pd
from sklearn.preprocessing import Normalizer

# Normalizer rescales each row to unit norm
scaler = Normalizer()
scaled_features = scaler.fit_transform(df[['average_order_value', 'discount_rate_per_visited_product']])
df_scaled = pd.DataFrame(scaled_features, columns=['average_order_value', 'discount_rate_per_visited_product'])
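If per-feature scaling is what you actually want, a column-wise scaler such as StandardScaler can be swapped in; a minimal sketch under that assumption:

from sklearn.preprocessing import StandardScaler

# StandardScaler centers each column at zero with unit variance,
# unlike Normalizer, which works per row
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[['average_order_value', 'discount_rate_per_visited_product']])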
5. Model Building
With our data preprocessed and ready, we can now focus on building the model. The XGBoost classifier is a powerful tool that uses gradient boosting techniques to achieve high accuracy, especially for structured data.
5.1 Splitting the Data
Before training the model, we need to split the dataset into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance.
from sklearn.model_selection import train_test_split
X = df.drop('Churn', axis=1)
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
5.2 Training the XGBoost Classifier
We initialize the XGBoost classifier and train it on the training data. After training, we evaluate the model on the test set.
import xgboost as xgb
xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(X_train, y_train)
y_pred = xgb_clf.predict(X_test)
5.3 Evaluating the Model
The model’s performance is evaluated using the accuracy score, which measures the proportion of correct predictions. Initially, the model achieves an accuracy of 91.54%.
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Initial Model Accuracy: {accuracy * 100:.2f}%")
6. Hyperparameter Tuning
Hyperparameter tuning involves adjusting the model’s parameters to optimize performance. XGBoost offers several hyperparameters that can be fine-tuned to improve the model’s accuracy.
6.1 Setting Up GridSearchCV
We use GridSearchCV to systematically test different combinations of hyperparameters. The parameters tuned include max_depth, learning_rate, gamma, and subsample.
from sklearn.model_selection import GridSearchCV
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'gamma': [0, 1, 5],
    'subsample': [0.8, 1.0]
}
grid_search = GridSearchCV(estimator=xgb_clf, param_grid=param_grid, scoring='accuracy', cv=3)
grid_search.fit(X_train, y_train)
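Once the search finishes, the winning combination can be inspected directly:

print(grid_search.best_params_)  # best hyperparameter combination found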
6.2 Evaluating the Tuned Model
After hyperparameter tuning, the best cross-validated accuracy improves to 92.72%, demonstrating the effectiveness of fine-tuning in enhancing model performance.
# best_score_ reports the mean cross-validated accuracy of the best parameter combination
final_accuracy = grid_search.best_score_
print(f"Final Model Accuracy after Tuning: {final_accuracy * 100:.2f}%")
7. Conclusion
Predicting customer churn is a vital aspect of maintaining a strong customer base in subscription-based businesses. By building a machine learning model using XGBoost, we were able to predict customer churn with an accuracy of over 92%. This project highlights the importance of data preprocessing, feature scaling, and hyperparameter tuning in developing robust machine learning models.
The techniques and methods demonstrated in this project can be applied to various business cases, making XGBoost a versatile tool for tabular classification problems.
8. Next Steps
If you’re interested in exploring this project further, consider the following:
Experiment with Additional Features: Incorporate more features from the dataset or external sources to improve model performance.
Try Different Algorithms: Compare XGBoost’s performance with other classification algorithms like Random Forest, SVM, or Neural Networks.
Deploy the Model: Once satisfied with the model's performance, deploy it into a production environment using tools like Flask, Django, or FastAPI (see the sketch below).
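To make the last step concrete, here is a minimal, hypothetical FastAPI sketch. It assumes the tuned model was saved with joblib as churn_model.joblib (a file name chosen for illustration) and that incoming requests already contain preprocessed feature values matching the training columns:

import joblib
import pandas as pd
from fastapi import FastAPI

app = FastAPI()
model = joblib.load('churn_model.joblib')  # hypothetical saved model

@app.post('/predict')
def predict(payload: dict):
    # payload keys must match the training feature columns,
    # preprocessed the same way as in the notebook
    features = pd.DataFrame([payload])
    churn_probability = model.predict_proba(features)[0, 1]
    return {'churn_probability': float(churn_probability)}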