Customer Segmentation for an Indian Bank Using Machine Learning

Table of contents
- 1. Introduction
- 2. Understanding the Dataset
- 3. Exploratory Data Analysis (EDA)
- 4. Data Cleaning and Preprocessing
- 5. Feature Engineering
- 6. Dimensionality Reduction with PCA
- 7. Clustering and Customer Segmentation
- 8. Cluster Profiling and Analysis
- 9. Insights and Business Applications
- 10. Deployment and Practical Use
- 11. Conclusion and Future Work
- Appendices

1. Introduction
In the competitive banking industry, understanding customers' behavior and needs is critical for delivering personalized services, enhancing customer satisfaction, and improving profitability. By leveraging machine learning techniques, customer segmentation can be achieved efficiently, enabling targeted marketing strategies and optimized product offerings.
This project aims to segment customers based on their demographic and transactional data. The segmentation insights will help the bank identify high-value customers, understand regional trends, and tailor services to improve customer engagement and retention.
2. Understanding the Dataset
This project uses a dataset containing over 1 million transaction records from an Indian bank. Each record includes a transaction identifier (TransactionID), customer attributes (CustomerID, CustomerDOB, CustGender, CustLocation, CustAccountBalance), and transaction details (TransactionDate, TransactionTime, TransactionAmount (INR)).
Below are the initial steps to import and explore the dataset.
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
file_path = 'bank_transactions.csv'
df = pd.read_csv(file_path)
# Display the first few rows of the dataset
print(df.head())
TransactionID CustomerID CustomerDOB CustGender CustLocation \
0 T1 C5841053 10/1/94 F JAMSHEDPUR
1 T2 C2142763 4/4/57 M JHAJJAR
2 T3 C4417068 26/11/96 F MUMBAI
3 T4 C5342380 14/9/73 F MUMBAI
4 T5 C9031234 24/3/88 F NAVI MUMBAI
CustAccountBalance TransactionDate TransactionTime \
0 17819.05 2/8/16 143207
1 2270.69 2/8/16 141858
2 17874.44 2/8/16 142712
3 866503.21 2/8/16 142714
4 6714.43 2/8/16 181156
TransactionAmount (INR)
0 25.0
1 27999.0
2 459.0
3 2060.0
4 1762.5
Next, we examine the dataset's general information to understand the structure, column types, and presence of missing values.
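The output below was produced by df.info(), wrapped in print (which accounts for the trailing None):
# Inspect column dtypes, non-null counts, and memory usage
print(df.info())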
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048567 entries, 0 to 1048566
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TransactionID 1048567 non-null object
1 CustomerID 1048567 non-null object
2 CustomerDOB 1045170 non-null object
3 CustGender 1047467 non-null object
4 CustLocation 1048416 non-null object
5 CustAccountBalance 1046198 non-null float64
6 TransactionDate 1048567 non-null object
7 TransactionTime 1048567 non-null int64
8 TransactionAmount (INR) 1048567 non-null float64
dtypes: float64(2), int64(1), object(6)
memory usage: 72.0+ MB
None
Key Points:
- The dataset includes both categorical (e.g., CustGender, CustLocation) and numerical (e.g., CustAccountBalance, TransactionAmount (INR)) variables.
- Temporal variables (CustomerDOB, TransactionDate, TransactionTime) offer opportunities for feature engineering, such as deriving Age and DaysSinceLastTransaction.
- The variables provide a combination of demographic and transactional information, making them suitable for customer segmentation.
In the next steps, we will clean and preprocess the dataset, ensuring it is ready for clustering analysis. The detailed examination of distributions, missing values, and outliers will be included in the EDA section.
3. Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a critical step in understanding the structure and distribution of data, identifying patterns, and addressing data quality issues. Here, we examine missing values, analyze distributions of key features, and detect potential outliers.
Checking Missing Values and Duplicates
# Checking for missing values in the dataset
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
# Display missing values as a percentage
print(pd.DataFrame({'Missing Values': missing_values, 'Percentage (%)': missing_percentage}))
Missing Values Percentage (%)
TransactionID 0 0.000000
CustomerID 0 0.000000
CustomerDOB 3397 0.323966
CustGender 1100 0.104905
CustLocation 151 0.014401
CustAccountBalance 2369 0.225927
TransactionDate 0 0.000000
TransactionTime 0 0.000000
TransactionAmount (INR) 0 0.000000
# Check for duplicate rows
print(f"Number of duplicate rows: {df.duplicated().sum()}")
Number of duplicate rows: 0
Insights:
- The CustomerDOB, CustGender, CustLocation, and CustAccountBalance columns have missing values, but the percentages are small (<1%). These rows can safely be dropped or imputed, depending on the context of the analysis.
- There are no duplicate rows in the dataset.
Data Distribution
We visualize the distribution of key numerical and categorical features to understand their spread and detect potential skewness or anomalies.
a. Distribution of Customer Gender
# Distribution of customer gender
# Plotting the distribution of gender
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='CustGender', palette='viridis')
# Annotate each bar with its count, placed just above the bar
for p in plt.gca().patches:
    plt.gca().text(p.get_x() + p.get_width() / 2, p.get_height() + 0.5, f'{p.get_height():.0f}', ha='center', va='bottom')
plt.title('Distribution of Customer Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()
The bar plot shows the distribution of customer genders in the dataset, with a significant majority being male (M) at 765,530 entries, followed by female (F) at 281,936 entries. A single entry labeled T is also present; it could represent either a data entry error or a rare gender classification. This anomaly warrants further investigation and possibly treating T as an error or missing value during preprocessing.
b. Distribution of Customer Account Balance
# Distribution of account balance
# Determine a reasonable x-axis range (e.g., 99th percentile)
x_max = df['CustAccountBalance'].dropna().quantile(0.99)
# Plotting the distribution of account balance with an adjusted x-axis range
# Adjusting the plot to display tick labels in plain format
plt.figure(figsize=(10, 6))
sns.histplot(df['CustAccountBalance'].dropna(), bins=50, kde=True, color='blue')
plt.title('Distribution of Customer Account Balance')
plt.xlabel('Account Balance (INR)')
plt.ylabel('Frequency')
plt.xlim(0, x_max)
plt.ticklabel_format(style='plain', axis='both') # Use plain number formatting
plt.show()
The plot shows a highly right-skewed distribution of customer account balances. Most customers have account balances clustered near zero, with a long tail representing a small number of customers with significantly higher balances. This skewness highlights the potential influence of outliers on subsequent analyses, such as clustering, and suggests the need for techniques like log transformation or handling extreme values during preprocessing to better understand the majority of the data.
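For instance, a quick log transform makes the bulk of the distribution visible; this is illustrative only and not part of the main pipeline, where the raw values are rescaled later during preprocessing:
import numpy as np
# log1p spreads out the right-skewed balances so the bulk of the
# distribution becomes visible in a single histogram
plt.figure(figsize=(10, 6))
sns.histplot(np.log1p(df['CustAccountBalance'].dropna()), bins=50, kde=True, color='blue')
plt.title('Log-Transformed Customer Account Balance')
plt.xlabel('log(1 + Account Balance)')
plt.ylabel('Frequency')
plt.show()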
c. Distribution of Transaction Amount
# Distribution of transaction amount
# Determine a reasonable x-axis range (e.g., 99th percentile)
x_max = df['TransactionAmount (INR)'].dropna().quantile(0.99)
# Plotting the distribution of transaction amount with adjusted x-axis range
plt.figure(figsize=(10, 6))
sns.histplot(df['TransactionAmount (INR)'].dropna(), bins=50, kde=True, color='green')
plt.title('Distribution of Transaction Amount (INR)', fontsize=14)
plt.xlabel('Transaction Amount (INR)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.xlim(0, x_max) # Adjust x-axis to exclude extreme outliers
plt.ticklabel_format(style='plain', axis='both') # Use plain number formatting for clarity
plt.grid(axis='y', linestyle='--', alpha=0.7) # Add gridlines for better readability
plt.show()
The plot shows the distribution of transaction amounts in INR, which is highly right-skewed. Most transactions are concentrated near lower amounts, with a sharp decline in frequency as the amount increases. The x-axis has been adjusted to display values up to the 99th percentile, excluding extreme outliers, making the distribution of the majority of transactions clearer. This pattern indicates a large number of small-value transactions, with only a few high-value transactions contributing to the long tail. This skewness highlights the need for handling outliers during preprocessing for a more balanced analysis.
d. Top Customer Locations
Understanding the geographical distribution of customers provides insights into the regions with the highest customer concentration. This information can be valuable for targeted marketing strategies and resource allocation.
The bar plot below shows the top 20 customer locations based on the frequency of occurrences in the dataset.
# Plotting the distribution of top 20 customer locations
plt.figure(figsize=(12, 8))
df['CustLocation'].value_counts().head(20).plot(kind='bar', color='blue')
plt.title('Top 20 Customer Locations', fontsize=14)
plt.xlabel('Location', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7) # Add gridlines for clarity
plt.show()
The bar plot shows the top 20 customer locations by count, with Mumbai leading significantly, followed by New Delhi, Bangalore, and Gurgaon. These cities dominate the dataset, reflecting a concentration of customers in major urban centers. This highlights potential regional trends, where densely populated or economically active areas may drive customer activity, making them key targets for segmentation and marketing strategies. Smaller cities, such as Mohali and Surat, have relatively fewer customers.
Outlier Detection
Outliers in numerical features can distort clustering results. We examine boxplots of CustAccountBalance and TransactionAmount (INR) to identify potential outliers.
# Boxplots for outlier detection
plt.figure(figsize=(15, 6))
# Subplot for account balance
plt.subplot(1, 2, 1)
sns.boxplot(x=df['CustAccountBalance'], color='lightblue')
plt.title('Boxplot of Customer Account Balance')
# Subplot for transaction amount
plt.subplot(1, 2, 2)
sns.boxplot(x=df['TransactionAmount (INR)'], color='lightgreen')
plt.title('Boxplot of Transaction Amount (INR)')
plt.tight_layout()
plt.show()
The box plots highlight the extreme skewness in both CustAccountBalance and TransactionAmount (INR). Most of the data points are concentrated near the lower end, with a significant number of outliers extending far beyond the upper whiskers. The interquartile range (IQR) is very small compared to the range of the outliers, making the box nearly invisible. This indicates the need to handle outliers, for example by capping or applying log transformations, to improve data balance and visualization.
4. Data Cleaning and Preprocessing
Data cleaning and preprocessing are crucial steps to ensure the dataset is ready for analysis. In this section, we address missing values, convert date-related columns into appropriate formats, and handle outliers to improve the data quality.
Handling Missing Values
The dataset contains missing values in a few columns: CustomerDOB, CustGender, CustLocation, and CustAccountBalance. Since only 0.66% of rows are affected, we drop these rows to simplify the preprocessing.
# Dropping rows with missing values
print(f"Before dropping missing values: {df.shape}")
df.dropna(inplace=True)
print(f"After dropping missing values: {df.shape}")
Before dropping missing values: (1048567, 9)
After dropping missing values: (1041614, 9)
Converting Date Columns to Datetime Format
Columns like CustomerDOB and TransactionDate are critical for temporal analysis but are currently stored as strings. We convert them to datetime format for consistency and to enable feature engineering.
# Converting CustomerDOB and TransactionDate to datetime format
df['CustomerDOB'] = pd.to_datetime(df['CustomerDOB'], format='%d/%m/%y', errors='coerce')
df['TransactionDate'] = pd.to_datetime(df['TransactionDate'], format='%d/%m/%y', errors='coerce')
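One caveat: with the %y format, pandas pivots two-digit years, so a DOB such as 4/4/57 parses as 2057-04-04. This is the source of the negative ages handled in Section 5, where the affected rows are dropped. An alternative sketch (not applied in this project) would roll implausible future birth dates back a century:
# Not applied here: shift DOBs that fall after the transaction period
# back 100 years instead of dropping those rows later
future_dob = df['CustomerDOB'] > df['TransactionDate'].max()
df.loc[future_dob, 'CustomerDOB'] -= pd.DateOffset(years=100)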
Handling Outliers
The CustAccountBalance and TransactionAmount (INR) columns exhibit extreme skewness and a large number of high-value data points, which reflect the natural inequality in financial data. These high values are not errors but are essential to understanding customer behavior. Rather than removing these "outliers," we use robust preprocessing methods to mitigate their impact while retaining their significance.
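The transformation code itself is not shown here, but the scaled values that appear in later outputs (e.g., balances near -0.28) are consistent with median/IQR scaling. A minimal sketch, assuming scikit-learn's RobustScaler:
from sklearn.preprocessing import RobustScaler
# RobustScaler centers each column on its median and scales by the IQR,
# dampening extreme values without discarding them
skewed_cols = ['CustAccountBalance', 'TransactionAmount (INR)']
df[skewed_cols] = RobustScaler().fit_transform(df[skewed_cols])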
One-Hot Encoding Categorical Features
Since CustLocation is not being encoded, we apply encoding only to the CustGender feature. This lets the clustering model work with this categorical variable in numerical form while keeping the data manageable.
Why CustLocation is Not Encoded
- High Cardinality: CustLocation has too many unique values, so one-hot encoding would produce a very sparse matrix.
- Focus on Financial Features: The clustering primarily focuses on CustAccountBalance and TransactionAmount (INR), making CustLocation less relevant to the segmentation goal.
- Imbalanced Distribution: The dominance of certain locations (e.g., Mumbai) could bias the clustering process, reducing the significance of smaller locations.
Encoding CustGender
The CustGender feature is essentially binary, with values M (male) and F (female). A simple binary mapping (full one-hot encoding is unnecessary for two categories) converts it into a numerical column for use in clustering.
# Encoding CustGender into a binary numerical column
# (the single 'T' entry maps to NaN)
df['CustGender'] = df['CustGender'].map({'M': 1, 'F': 0})
# Checking the updated dataset
print(df[['CustGender']].head())
CustGender
0 0.0
1 1.0
2 0.0
3 0.0
4 0.0
By encoding only CustGender, we simplify the preprocessing pipeline and stay aligned with the clustering objectives, avoiding the unnecessary complexity of encoding CustLocation.
5. Feature Engineering
Feature engineering is a critical step in creating meaningful variables that enhance the clustering process. By deriving new features from existing ones, we can better capture customer behavior and improve the segmentation.
Creation of New Features
- Age: Calculated from CustomerDOB to capture the customer's current age. Age can influence financial behavior, since older and younger customers may exhibit distinct patterns in transaction amounts or frequency, making it a key variable for segmentation.
from datetime import datetime
# Creating the Age feature (CustomerDOB is already datetime after Section 4)
df['Age'] = datetime.now().year - df['CustomerDOB'].dt.year
# Identify rows with invalid or negative ages
invalid_ages = df[df['Age'] < 0]
invalid_ages_count = invalid_ages.shape[0]
# Display the count and examples of invalid ages
invalid_ages_count, invalid_ages.head()
(54111,
TransactionID CustomerID CustomerDOB CustGender CustLocation \
1 T2 C2142763 2057-04-04 1.0 JHAJJAR
15 T16 C8334633 2068-07-10 0.0 NEW DELHI
30 T31 C7648712 2067-01-21 0.0 NEW DELHI
51 T52 C6637047 2066-07-25 1.0 PUNE
79 T80 C6612422 2067-05-25 1.0 DELHI
CustAccountBalance TransactionDate TransactionTime \
1 -0.275097 2016-08-02 141858
15 -0.293837 2016-08-01 125725
30 5.345137 2016-08-03 160642
51 -0.129259 2016-08-05 133141
79 4.576746 2016-08-01 204252
TransactionAmount (INR) Age
1 26.482212 -32.0
15 -0.199519 -43.0
30 0.040865 -42.0
51 -0.300481 -41.0
79 -0.251788 -42.0 )
The rows with an invalid CustomerDOB (54,111, about 5% of the dataset) are a by-product of the two-digit year parsing noted earlier. Dropping them is a safe approach that avoids introducing noise or skewing the clustering results, keeping the dataset clean and reliable.
# Dropping rows with invalid ages
df = df[df['Age'] >= 0]
- DaysSinceLastTransaction: Derived from TransactionDate to measure recency, an important behavioral indicator. Customers who have not transacted recently may be less engaged, which helps identify inactive or at-risk segments.
# Creating DaysSinceLastTransaction feature
df['DaysSinceLastTransaction'] = (datetime.now() - pd.to_datetime(df['TransactionDate'])).dt.days
- TransactionHour: Extracted from TransactionTime to analyze patterns in transaction timings (e.g., peak transaction hours). TransactionTime is stored as an HHMMSS integer, so a raw value of 91523 (9:15:23) zero-pads to '091523' and yields hour 9.
# Creating TransactionHour feature
df['TransactionHour'] = df['TransactionTime'].astype(str).str.zfill(6).str[:2].astype(int)
- Customer Lifetime Value (CLV): Calculated as the total transaction amount per customer, representing the overall value of a customer to the bank.
# Creating Customer Lifetime Value (CLV) feature
df['CLV'] = df.groupby('CustomerID')['TransactionAmount (INR)'].transform('sum')
- RFM Metrics (Recency, Frequency, Monetary):
  - Recency: Time since the last transaction.
  - Frequency: Number of transactions per customer.
  - Monetary: Total transaction value per customer.
RFM provides a structured approach to understanding customer behavior, segmenting customers by how recently, how often, and how much they transact with the bank.
# Calculating RFM metrics
rfm = df.groupby('CustomerID').agg({
    'TransactionDate': lambda x: (datetime.now() - pd.to_datetime(x).max()).days,
    'TransactionID': 'count',
    'TransactionAmount (INR)': 'sum'
}).reset_index()
rfm.rename(columns={
    'TransactionDate': 'Recency',
    'TransactionID': 'Frequency',
    'TransactionAmount (INR)': 'Monetary'
}, inplace=True)
print(rfm.head())
CustomerID Recency Frequency Monetary
0 C1010011 3046 2 4.029808
1 C1010012 3089 1 1.001442
2 C1010014 3096 2 0.519231
3 C1010018 3057 1 -0.411058
4 C1010024 3085 1 4.367788
6. Dimensionality Reduction with PCA
Principal Component Analysis (PCA) is a powerful technique used to reduce the dimensionality of a dataset while retaining most of the variance. This step simplifies clustering by eliminating redundant information and improving computational efficiency.
Merging RFM Metrics
The RFM metrics (Recency, Frequency, and Monetary) were created in a separate DataFrame. To ensure these features are included in the clustering process, we merge them into the main DataFrame (df).
# Merge RFM metrics into the main DataFrame
df = df.merge(rfm, on='CustomerID', how='left')
# Verify that all required columns are now present
print(df.columns)
Index(['TransactionID', 'CustomerID', 'CustomerDOB', 'CustGender',
'CustLocation', 'CustAccountBalance', 'TransactionDate',
'TransactionTime', 'TransactionAmount (INR)', 'Age',
'DaysSinceLastTransaction', 'TransactionHour', 'CLV', 'Recency',
'Frequency', 'Monetary'],
dtype='object')
Selecting Features for PCA
The features chosen for PCA are: CustAccountBalance, TransactionAmount (INR), Age, DaysSinceLastTransaction, TransactionHour, CLV, Recency, Frequency, and Monetary.
The CustGender feature was encoded as a binary variable (M → 1, F → 0) but is not included in PCA, as PCA is designed for continuous numerical features. Instead, CustGender is retained in the dataset for cluster profiling, allowing analysis of gender distribution across customer segments.
On the other hand, CustLocation was not encoded: its high cardinality makes one-hot encoding impractical, and since the clustering focuses on transactional behavior, location is not essential for segmentation.
By applying PCA only to the numerical features while keeping CustGender available for profiling, we ensure an efficient and meaningful segmentation process.
Standardizing the Selected Features
Before applying PCA, we scale the numerical features so that each contributes equally: PCA relies on variance, so features with large magnitudes (e.g., TransactionAmount (INR)) would otherwise dominate the principal components.
from sklearn.preprocessing import StandardScaler
# Features selected for PCA (continuous numerical features only)
pca_features = ['CustAccountBalance', 'TransactionAmount (INR)', 'Age',
                'DaysSinceLastTransaction', 'TransactionHour', 'CLV',
                'Recency', 'Frequency', 'Monetary']
# Standardizing the selected features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[pca_features])
Apply PCA to All Components
We first fit PCA on all components to analyze the variance explained by each.
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
# Apply PCA without specifying n_components to analyze all
pca = PCA()
pca.fit(scaled_features)
# Compute cumulative explained variance
explained_variance_ratio = np.cumsum(pca.explained_variance_ratio_)
# Plot the explained variance ratio (Elbow Method for PCA)
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio, marker='o', linestyle='--')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Elbow Method for Optimal PCA Components')
plt.grid(True)
plt.show()
Finding the Optimal Number of Principal Components (PCs)
Using the Elbow Method, we analyzed the cumulative explained variance to determine the optimal number of principal components. From the plot:
- The cumulative variance exceeds 90% by the 6th component, indicating that 6 principal components capture most of the dataset's variance.
- Beyond 6 components, the additional explained variance diminishes, making further components less impactful.
We therefore select 6 principal components for dimensionality reduction.
# Applying PCA with 6 components
pca = PCA(n_components=6)
pca_result = pca.fit_transform(scaled_features)
# Adding the PCA results to the DataFrame as PCA1..PCA6
for i in range(6):
    df[f'PCA{i + 1}'] = pca_result[:, i]
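As a quick sanity check (the value is not shown in the original output), the variance retained by the six components can be printed:
# Total variance captured by the 6 retained components
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2%}")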
7. Clustering and Customer Segmentation
This section applies K-Means clustering to segment customers into distinct groups based on their transactional and behavioral patterns.
Optimal Number of Clusters
The Elbow Method was applied to determine the optimal number of clusters. By plotting the Sum of Squared Distances (Inertia) against different values of K, we look for an "elbow point," where adding more clusters provides diminishing returns.
Code for Visualization
# Optimal number of clusters (Elbow Method)
from sklearn.cluster import KMeans
# Calculate inertia for different values of K
inertia = []
K_range = range(1, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(df[['PCA1', 'PCA2', 'PCA3', 'PCA4', 'PCA5', 'PCA6']])
    inertia.append(kmeans.inertia_)
# Plot the Elbow Method results
plt.figure(figsize=(10, 6))
plt.plot(K_range, inertia, marker='o', linestyle='--')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Sum of Squared Distances)')
plt.title('Elbow Method for Optimal K')
plt.grid(True)
plt.show()
From the plot, the optimal number of clusters is K = 6: it balances reducing the sum of squared distances (inertia) against keeping the model simple, ensuring the clustering solution is both meaningful and interpretable.
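The elbow choice can be corroborated with a silhouette check. This sketch is not part of the original analysis, and the subsample size is an arbitrary choice to keep the pairwise-distance computation tractable:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
pca_cols = ['PCA1', 'PCA2', 'PCA3', 'PCA4', 'PCA5', 'PCA6']
# Score a random subsample rather than all ~1M rows
sample = df[pca_cols].sample(n=20000, random_state=42)
for k in range(2, 9):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(sample)
    print(f"K={k}: silhouette = {silhouette_score(sample, labels):.3f}")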
K-Means Clustering
Using the optimal value of K = 6, we implemented the K-Means clustering algorithm on the PCA-transformed dataset.
# Applying K-Means clustering with 6 clusters
kmeans = KMeans(n_clusters=6, random_state=42)
cluster_labels = kmeans.fit_predict(df[['PCA1', 'PCA2', 'PCA3', 'PCA4', 'PCA5', 'PCA6']])
# Adding cluster labels to the DataFrame
df['Cluster'] = cluster_labels
Visualization of Clusters
To visualize the clustering results, we used the first two principal components (PCA1 and PCA2) to create a scatter plot. Each point represents a customer, color-coded by its assigned cluster.
# Visualization of Clusters
# Scatter plot of clusters on the first two PCA components
plt.figure(figsize=(10, 6))
sns.scatterplot(x='PCA1', y='PCA2', hue='Cluster', palette='viridis', data=df, alpha=0.7)
plt.title('Customer Segmentation using K-Means Clustering (K=6)')
plt.xlabel('PCA1')
plt.ylabel('PCA2')
plt.legend(title='Cluster')
plt.show()
The plot visualizes customer segmentation into six clusters using the first two principal components (PCA1 and PCA2). Cluster 5 (yellow) stands out significantly, indicating a group with distinct characteristics, potentially driven by extreme values in the dataset. The other clusters (0 to 4) are closely packed, suggesting overlapping behaviors or smaller variances between those segments. This clustering structure highlights potential outliers and distinct customer groups for further analysis.
Analyzing and Adding Cluster Labels
The cluster labels were added back to the original dataset for further analysis and interpretation. This allows us to explore each segment's characteristics, such as average transaction value, frequency, and customer demographics.
# Summarizing mean values of all numeric columns by cluster
cluster_summary = df.groupby('Cluster').mean(numeric_only=True)
print(cluster_summary)
CustGender CustAccountBalance TransactionTime \
Cluster
0 0.715262 4.803408e+04 154940.111993
1 0.721161 5.906237e+04 157076.590304
2 0.754320 2.754499e+05 159823.477349
3 0.719675 5.228310e+04 159952.216517
4 0.707069 3.358953e+05 156557.757392
5 0.820513 1.741221e+06 158989.025641
TransactionAmount (INR) Age DaysSinceLastTransaction \
Cluster
0 903.011808 36.348861 3093.101986
1 1047.727929 37.425236 3078.407552
2 1979.125956 48.617109 3081.610222
3 904.313536 36.678175 3065.045802
4 44886.150211 41.442921 3079.839828
5 439452.219744 42.871795 3078.948718
TransactionHour CLV Recency Frequency \
Cluster
0 15.196049 966.221687 3092.751830 1.073175
1 15.410039 2560.410767 3067.930791 2.241514
2 15.685248 2172.280829 3080.222722 1.173310
3 15.697193 920.514399 3065.038165 1.023895
4 15.363227 55314.255951 3076.099276 1.474055
5 15.615385 525905.508718 3075.666667 1.666667
Monetary PCA1 PCA2 PCA3 PCA4 PCA5 \
Cluster
0 966.221687 -0.304102 1.383238 -0.276241 0.138544 0.100448
1 2560.410767 0.237760 -0.831150 -0.399013 -1.055146 -1.005582
2 2172.280829 0.256291 0.290182 1.443423 -0.020180 0.040478
3 920.514399 -0.219011 -1.135079 -0.127184 0.571367 0.547641
4 55314.255951 13.620303 0.598597 -0.638923 0.500619 0.600280
5 525905.508718 133.878648 6.758859 -9.890442 7.291864 7.382774
PCA6
Cluster
0 0.196405
1 0.084686
2 -0.869133
3 0.155481
4 0.110740
5 2.696470
8. Cluster Profiling and Analysis
After segmenting the customers into six clusters, detailed profiling was conducted to understand the characteristics of each segment. This analysis helps to extract meaningful insights and actionable strategies.
Profiling Customer Segments
For each cluster, we analyzed key features, including demographics and transactional behavior:
Demographics:
- Age: Average age of customers in each cluster to identify age-based trends.
- CustGender: Gender distribution within each cluster to reveal potential biases or patterns.
Transactional Behavior:
- CustAccountBalance: Average account balance per cluster, reflecting financial strength.
- TransactionAmount (INR): Total and average transaction amounts to capture spending patterns.
- CLV (Customer Lifetime Value): Indicates the overall value of customers within each cluster.
Summary Statistics
Summary statistics for each cluster were computed to compare key metrics across segments.
# Calculate mean values of key metrics by cluster
cluster_summary = df.groupby('Cluster')[['Age', 'CustAccountBalance',
                                         'TransactionAmount (INR)',
                                         'CLV', 'Recency', 'Frequency', 'Monetary']].mean()
# Display summary statistics
print(cluster_summary)
Age CustAccountBalance TransactionAmount (INR) \
Cluster
0 36.348861 4.803408e+04 903.011808
1 37.425236 5.906237e+04 1047.727929
2 48.617109 2.754499e+05 1979.125956
3 36.678175 5.228310e+04 904.313536
4 41.442921 3.358953e+05 44886.150211
5 42.871795 1.741221e+06 439452.219744
CLV Recency Frequency Monetary
Cluster
0 966.221687 3092.751830 1.073175 966.221687
1 2560.410767 3067.930791 2.241514 2560.410767
2 2172.280829 3080.222722 1.173310 2172.280829
3 920.514399 3065.038165 1.023895 920.514399
4 55314.255951 3076.099276 1.474055 55314.255951
5 525905.508718 3075.666667 1.666667 525905.508718
These summaries highlight the key characteristics of each customer segment, providing insights into their demographics and transaction behaviors.
Bar Plot for Key Metrics
This bar plot compares the average CustAccountBalance across clusters.
# Bar plot for average CustAccountBalance by cluster
plt.figure(figsize=(10, 6))
sns.barplot(x=cluster_summary.index, y=cluster_summary['CustAccountBalance'], palette='viridis')
plt.title('Average Account Balance by Cluster')
plt.xlabel('Cluster')
plt.ylabel('Average Account Balance')
plt.show()
The plot shows the average account balance for each cluster. Cluster 5 stands out with an exceptionally high average account balance, indicating a group of high-value customers. Clusters 2 and 4 have moderate balances, while Clusters 0, 1, and 3 have significantly lower averages, suggesting they may consist of low-value customers or those with minimal financial activity. This segmentation highlights the financial disparity across customer groups.
Pie Chart for Cluster Proportions
A pie chart was created to show the proportion of customers in each cluster.
# Pie chart for cluster proportions
plt.figure(figsize=(8, 8))
df['Cluster'].value_counts().plot(kind='pie', autopct='%1.1f%%', colors=sns.color_palette('viridis', 6))
plt.title('Proportion of Customers in Each Cluster')
plt.ylabel('')
plt.show()
The pie chart illustrates the proportion of customers in each cluster. Cluster 0 contains the largest share of customers (33.7%), followed by Cluster 3 (30.0%) and Cluster 1 (20.7%). Cluster 2 has a smaller proportion (15.1%), while Cluster 5 (0.9%) and Cluster 4 (0.6%) are significantly underrepresented, indicating these clusters consist of unique or high-value customers with distinct characteristics.
To better understand the characteristics of each cluster, bar plots were created for key metrics such as age, account balance, transaction amount, and customer lifetime value (CLV).
Average Age by Customer Segment
This bar plot shows the average age for each cluster, revealing age-related patterns in customer segmentation.
# Bar plot for average age by cluster
plt.figure(figsize=(12, 6))
ax = sns.barplot(x='Cluster', y='Age', data=df, palette='viridis')
plt.title('Average Age by Customer Segment')
plt.xlabel('Cluster')
plt.ylabel('Average Age')
# Add the value of each bar on top
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.1f'),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center',
                xytext=(0, 9),
                textcoords='offset points')
plt.show()
The plot shows the average age of customers in each cluster. Cluster 2 has the oldest average age at 48.6 years, followed by Clusters 5 and 4 with averages of 42.9 and 41.4 years, respectively. Clusters 0, 1, and 3 represent younger customer segments, with Cluster 0 having the youngest average age at 36.3 years. This variation suggests age-related segmentation patterns in customer behavior.
Average Account Balance by Customer Segment
This bar plot highlights the average account balance across clusters, reflecting the financial strength of each segment.
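The plotting code was omitted from the original post; presumably it mirrors the other cluster bar plots, e.g.:
# Reconstructed: bar plot for average account balance by cluster
plt.figure(figsize=(12, 6))
sns.barplot(x='Cluster', y='CustAccountBalance', data=df, palette='viridis')
plt.title('Average Account Balance by Customer Segment')
plt.xlabel('Cluster')
plt.ylabel('Average Account Balance (INR)')
plt.show()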
The plot highlights the average account balance across clusters. Cluster 5 stands out with an extraordinarily high average balance of ₹1,741,221.5, indicating a group of extremely wealthy customers. However, the long black bar for Cluster 5 suggests a very high variance, indicating significant heterogeneity or the presence of outliers within this segment. Cluster 4 follows with a significantly lower average balance of ₹335,895.3, and Clusters 0, 1, 2, and 3 have much smaller average balances, all below ₹60,000.
Considerations:
- High Variance in Cluster 5: The variance in Cluster 5 implies that this group includes both extremely high-value customers and some with moderately high balances; further analysis may be needed to identify subgroups within this cluster (a sketch follows this list).
- Outlier Impact: Outliers may inflate the mean and variance for Cluster 5. Addressing them could give a clearer picture of the segment's core characteristics.
- Cluster Strategy: Clusters 4 and 5 represent high-value customers requiring specialized strategies, while Clusters 0, 1, 2, and 3 represent lower-value segments, suggesting a focus on growth and engagement strategies for these groups.
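As a hypothetical follow-up (not performed in this project), Cluster 5 could be re-clustered on its own to surface such subgroups; the choice of two sub-clusters below is arbitrary:
# Re-cluster the small, high-variance Cluster 5 to look for subgroups
pca_cols = ['PCA1', 'PCA2', 'PCA3', 'PCA4', 'PCA5', 'PCA6']
c5 = df.loc[df['Cluster'] == 5, pca_cols]
sub_labels = KMeans(n_clusters=2, random_state=42).fit_predict(c5)
print(pd.Series(sub_labels).value_counts())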
Average Transaction Amount by Customer Segment
This bar plot illustrates the average transaction amount per cluster, capturing spending behavior.
# Bar plot for average transaction amount by cluster
plt.figure(figsize=(12, 6))
sns.barplot(x='Cluster', y='TransactionAmount (INR)', data=df, palette='viridis')
plt.title('Average Transaction Amount by Customer Segment')
plt.xlabel('Cluster')
plt.ylabel('Average Transaction Amount (INR)')
plt.show()
The plot shows the average transaction amount by customer segment. Cluster 5 has an exceptionally high average transaction amount of ₹439,452.2, with a noticeable variance, indicating a segment of high-value transactions. Cluster 4 follows with a much smaller average of ₹44,886.2, while Clusters 0, 1, 2, and 3 exhibit minimal transaction amounts, all below ₹2,000. This suggests that Clusters 5 and 4 consist of premium customers, whereas the others represent lower transaction activity.
Average CLV by Customer Segment
The final bar plot depicts the average customer lifetime value (CLV) for each cluster, summarizing long-term customer contributions.
# Bar plot for average CLV by cluster
plt.figure(figsize=(12, 6))
sns.barplot(x='Cluster', y='CLV', data=df, palette='viridis')
plt.title('Average Customer Lifetime Value (CLV) by Customer Segment')
plt.xlabel('Cluster')
plt.ylabel('Average CLV (INR)')
plt.show()
The plot illustrates the average Customer Lifetime Value (CLV) across clusters. Cluster 5 stands out with an average CLV of roughly ₹525,906, coupled with noticeable variance, indicating heterogeneity within this group. Cluster 4 follows at a much lower average of about ₹55,314, while Clusters 0, 1, 2, and 3 show comparatively negligible CLV. This emphasizes that Cluster 5 consists of extremely valuable customers, requiring tailored retention and engagement strategies.
9. Insights and Business Applications
The segmentation of customers into six distinct clusters provides actionable insights for crafting tailored marketing strategies and identifying growth opportunities. Here's how each cluster aligns with specific business strategies:
Marketing Strategies for Each Customer Segment:
- Cluster 0 (Low-Value Customers):
  - Characteristics: Lowest average account balance (~48,034 INR) and transaction amount (~903 INR).
  - Focus: Retention
  - Strategies:
    - Introduce loyalty programs to maintain engagement.
    - Provide educational content to help them better utilize financial services.
    - Offer budget-friendly promotions to encourage more transactions.
- Cluster 1 (Moderate-Value Customers):
  - Characteristics: Slightly higher account balance (~59,062 INR) and transaction amount (~1,047 INR).
  - Focus: Spending Growth
  - Strategies:
    - Deploy targeted promotions to boost spending.
    - Highlight cross-sell opportunities for services like loans or savings plans.
    - Provide cashback or reward-based incentives for higher transaction volumes.
- Cluster 2 (Emerging Customers):
  - Characteristics: Moderate account balance (~275,449 INR) and transaction amount (~1,979 INR), with potential for growth.
  - Focus: Nurturing Growth
  - Strategies:
    - Offer personalized onboarding campaigns to introduce additional financial products.
    - Provide special offers to increase engagement and spending activity.
    - Communicate regularly to strengthen their relationship with the bank.
- Cluster 3 (Another Low-Value Group):
  - Characteristics: Similar to Cluster 0, with a slightly higher account balance (~52,283 INR) and transaction amount (~904 INR).
  - Focus: Retention
  - Strategies:
    - Retain engagement through loyalty programs and budget-friendly options.
    - Simplify their financial experience with easy-to-use tools and resources.
- Cluster 4 (High-Value Customers):
  - Characteristics: High account balance (~335,895 INR) and transaction amount (~44,886 INR).
  - Focus: Premium Services
  - Strategies:
    - Offer premium financial products, such as exclusive investment plans or credit cards.
    - Assign dedicated account managers for personalized attention.
    - Provide access to VIP events and memberships.
- Cluster 5 (Ultra-High-Value Customers):
  - Characteristics: Exceptional account balance (~1.74 million INR) and transaction amount (~439,452 INR), with high variance.
  - Focus: Personalized and Exclusive Services
  - Strategies:
    - Deliver one-on-one wealth management solutions.
    - Offer highly tailored financial products and private banking services.
    - Build trust through consistent and high-value customer service.
Identified Business Opportunities:
- Cross-Selling Programs: Use segmentation insights to recommend tailored products, such as investment options for Cluster 5 or insurance for Cluster 4.
- Customer Engagement Programs: Develop initiatives like targeted communication and rewards programs to deepen relationships with Clusters 1 and 2.
- Revenue Maximization: Focus on retaining high-value customers in Clusters 4 and 5 while nurturing potential growth in Cluster 2.
By leveraging these targeted strategies, the business can enhance customer satisfaction, maximize revenue, and improve overall financial performance.
10. Deployment and Practical Use
To make the insights and segmentation actionable, the project includes exporting and deploying the results for practical business use. Here's how this can be achieved:
Exporting Segmented Customer Data
The segmented customer data is exported to a CSV file for easy integration with the organization's Customer Relationship Management (CRM) system. This allows business teams to access the data and implement targeted strategies for each cluster.
# Exporting segmented customer data
# (the Cluster column was added to df during K-Means clustering in Section 7)
df.to_csv('segmented_customers.csv', index=False)
print("Segmented data exported successfully!")
Segmented data exported successfully!
This exported data can be used by marketing, sales, and customer service teams to execute the tailored strategies discussed earlier.
Deployment Idea: A Streamlit App
To facilitate interactive and dynamic visualization of the customer clusters, a Streamlit app can be developed. The app can provide the following functionalities:
- Cluster Visualization: Display 2D visualizations of the clusters using the PCA components, with interactive filters to explore specific clusters or features.
- Dynamic Insights: Summarize key metrics for each cluster, such as average account balance, transaction amounts, and customer lifetime value (CLV), and provide actionable insights based on user-selected clusters.
- Data Export Option: Allow business users to download filtered or customized segments directly from the app.
Sample Streamlit Code:
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
# Load segmented customer data
data = pd.read_csv('segmented_customers.csv')
# Sidebar for cluster selection
st.sidebar.title("Customer Segmentation")
selected_cluster = st.sidebar.selectbox("Select Cluster", sorted(data['Cluster'].unique()))
# Display cluster summary
st.title(f"Cluster {selected_cluster} Overview")
cluster_data = data[data['Cluster'] == selected_cluster]
st.write(cluster_data.describe())
# Visualize selected cluster
st.title(f"Cluster {selected_cluster} Visualization")
fig, ax = plt.subplots()
ax.scatter(cluster_data['PCA1'], cluster_data['PCA2'], alpha=0.5)
ax.set_xlabel("PCA1")
ax.set_ylabel("PCA2")
ax.set_title(f"Cluster {selected_cluster} (PCA Visualization)")
st.pyplot(fig)
Saving this script as, say, app.py and running streamlit run app.py launches an interactive dashboard with the summary table and PCA scatter plot for the selected cluster.
11. Conclusion and Future Work
Recap of Achievements
This project successfully segmented customers into six distinct clusters using K-Means clustering, enriched by Principal Component Analysis (PCA) for dimensionality reduction. The analysis provided actionable business insights tailored to each cluster, supporting personalized marketing strategies. Key achievements include:
- Identification of high-value customer clusters for premium offerings.
- Recognition of low-value clusters to focus on retention and engagement strategies.
- Development of data-driven opportunities for cross-selling and promotional activities.
Suggestions for Future Work
While this project provided valuable segmentation, there is potential to expand and refine the approach:
- Incorporating External Data: Integrating additional variables such as income levels, credit scores, or transaction types could enhance the segmentation granularity.
- Churn Prediction Models: Building predictive models to anticipate customer churn can complement segmentation and aid proactive retention strategies.
- Advanced Clustering Techniques: Methods like DBSCAN or hierarchical clustering may uncover complex patterns and better capture the diversity of customer behaviors.
- Dynamic Model Deployment: A fully interactive app with advanced visualizations and segmentation updates would allow for real-time business use.
These directions aim to strengthen customer insights and further empower data-driven decision-making.
Appendices
Code: https://github.com/Minhhoang2606/Bank-customer-segmentation
Data source: https://www.kaggle.com/datasets/shivamb/bank-customer-segmentation