Customer Segmentation for an Indian Bank Using Machine Learning

Table of contents
- 1. Introduction
- 2. Understanding the Dataset
- 3. Exploratory Data Analysis (EDA)
- 4. Data Cleaning and Preprocessing
- 5. Feature Engineering
- 6. Dimensionality Reduction with PCA
- 7. Clustering and Customer Segmentation
- 8. Cluster Profiling and Analysis
- 9. Insights and Business Applications
- 10. Deployment and Practical Use
- 11. Conclusion and Future Work
- Appendices

1. Introduction
In the competitive banking industry, understanding customers' behavior and needs is critical for delivering personalized services, enhancing customer satisfaction, and improving profitability. By leveraging machine learning techniques, customer segmentation can be achieved efficiently, enabling targeted marketing strategies and optimized product offerings.
This project aims to segment customers based on their demographic and transactional data. The segmentation insights will help the bank identify high-value customers, understand regional trends, and tailor services to improve customer engagement and retention.
2. Understanding the Dataset
This project uses a dataset containing over 1 million transaction records from an Indian bank. Each record includes a transaction identifier (TransactionID), customer attributes (CustomerID, CustomerDOB, CustGender, CustLocation, CustAccountBalance), and transaction details (TransactionDate, TransactionTime, TransactionAmount (INR)).
Below are the initial steps to import and explore the dataset.
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
file_path = 'bank_transactions.csv'
df = pd.read_csv(file_path)
# Display the first few rows of the dataset
print(df.head())
TransactionID CustomerID CustomerDOB CustGender CustLocation \
0 T1 C5841053 10/1/94 F JAMSHEDPUR
1 T2 C2142763 4/4/57 M JHAJJAR
2 T3 C4417068 26/11/96 F MUMBAI
3 T4 C5342380 14/9/73 F MUMBAI
4 T5 C9031234 24/3/88 F NAVI MUMBAI
CustAccountBalance TransactionDate TransactionTime \
0 17819.05 2/8/16 143207
1 2270.69 2/8/16 141858
2 17874.44 2/8/16 142712
3 866503.21 2/8/16 142714
4 6714.43 2/8/16 181156
TransactionAmount (INR)
0 25.0
1 27999.0
2 459.0
3 2060.0
4 1762.5
Next, we examine the dataset's general information to understand the structure, column types, and presence of missing values.
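The output below was produced by df.info(), wrapped in print (which accounts for the trailing None):
# Inspect column dtypes, non-null counts, and memory usage
print(df.info())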
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048567 entries, 0 to 1048566
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TransactionID 1048567 non-null object
1 CustomerID 1048567 non-null object
2 CustomerDOB 1045170 non-null object
3 CustGender 1047467 non-null object
4 CustLocation 1048416 non-null object
5 CustAccountBalance 1046198 non-null float64
6 TransactionDate 1048567 non-null object
7 TransactionTime 1048567 non-null int64
8 TransactionAmount (INR) 1048567 non-null float64
dtypes: float64(2), int64(1), object(6)
memory usage: 72.0+ MB
None
Key Points:
- The dataset includes both categorical (e.g., CustGender, CustLocation) and numerical (e.g., CustAccountBalance, TransactionAmount (INR)) variables.
- Temporal variables (CustomerDOB, TransactionDate, TransactionTime) offer opportunities for feature engineering, such as deriving Age and DaysSinceLastTransaction.
- The variables provide a combination of demographic and transactional information, making them suitable for customer segmentation.
In the next steps, we will clean and preprocess the dataset, ensuring it is ready for clustering analysis. The detailed examination of distributions, missing values, and outliers will be included in the EDA section.
3. Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a critical step in understanding the structure and distribution of data, identifying patterns, and addressing data quality issues. Here, we examine missing values, analyze distributions of key features, and detect potential outliers.
Checking Missing Values and Duplicates
# Checking for missing values in the dataset
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
# Display missing values as a percentage
print(pd.DataFrame({'Missing Values': missing_values, 'Percentage (%)': missing_percentage}))
Missing Values Percentage (%)
TransactionID 0 0.000000
CustomerID 0 0.000000
CustomerDOB 3397 0.323966
CustGender 1100 0.104905
CustLocation 151 0.014401
CustAccountBalance 2369 0.225927
TransactionDate 0 0.000000
TransactionTime 0 0.000000
TransactionAmount (INR) 0 0.000000
# Check for duplicate rows
print(f"Number of duplicate rows: {df.duplicated().sum()}")
Number of duplicate rows: 0
Insights:
- The CustomerDOB, CustGender, CustLocation, and CustAccountBalance columns have missing values, but the percentages are small (<1%). These rows can safely be dropped or imputed, depending on the context of the analysis.
- There are no duplicate rows in the dataset.
Data Distribution
We visualize the distribution of key numerical and categorical features to understand their spread and detect potential skewness or anomalies.
a. Distribution of Customer Gender
# Distribution of customer gender
# Plotting the distribution of gender
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='CustGender', palette='viridis')
# Annotate each bar with its count, placed just above the bar
for p in plt.gca().patches:
    plt.gca().text(p.get_x() + p.get_width() / 2, p.get_height() + 0.5, f'{p.get_height():.0f}', ha='center', va='bottom')
plt.title('Distribution of Customer Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()
The bar plot shows the distribution of customer genders in the dataset, with a significant majority being male (M) at 765,530 entries, followed by female (F) at 281,936 entries. A single entry labeled T is also present; it could represent either a data entry error or a rare gender classification. This anomaly warrants further investigation and possibly treating T as an error or missing value during preprocessing.
b. Distribution of Customer Account Balance
# Distribution of account balance
# Determine a reasonable x-axis range (e.g., 99th percentile)
x_max = df['CustAccountBalance'].dropna().quantile(0.99)
# Plotting the distribution of account balance with an adjusted x-axis range
# Adjusting the plot to display tick labels in plain format
plt.figure(figsize=(10, 6))
sns.histplot(df['CustAccountBalance'].dropna(), bins=50, kde=True, color='blue')
plt.title('Distribution of Customer Account Balance')
plt.xlabel('Account Balance (INR)')
plt.ylabel('Frequency')
plt.xlim(0, x_max)
plt.ticklabel_format(style='plain', axis='both') # Use plain number formatting
plt.show()
The plot shows a highly right-skewed distribution of customer account balances. Most customers have account balances clustered near zero, with a long tail representing a small number of customers with significantly higher balances. This skewness highlights the potential influence of outliers on subsequent analyses, such as clustering, and suggests the need for techniques like log transformation or handling extreme values during preprocessing to better understand the majority of the data.
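For instance, a quick log transform makes the bulk of the distribution visible; this is illustrative only and not part of the main pipeline, where the raw values are rescaled later during preprocessing:
import numpy as np
# log1p spreads out the right-skewed balances so the bulk of the
# distribution becomes visible in a single histogram
plt.figure(figsize=(10, 6))
sns.histplot(np.log1p(df['CustAccountBalance'].dropna()), bins=50, kde=True, color='blue')
plt.title('Log-Transformed Customer Account Balance')
plt.xlabel('log(1 + Account Balance)')
plt.ylabel('Frequency')
plt.show()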
c. Distribution of Transaction Amount
# Distribution of transaction amount
# Determine a reasonable x-axis range (e.g., 99th percentile)
x_max = df['TransactionAmount (INR)'].dropna().quantile(0.99)
# Plotting the distribution of transaction amount with adjusted x-axis range
plt.figure(figsize=(10, 6))
sns.histplot(df['TransactionAmount (INR)'].dropna(), bins=50, kde=True, color='green')
plt.title('Distribution of Transaction Amount (INR)', fontsize=14)
plt.xlabel('Transaction Amount (INR)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.xlim(0, x_max) # Adjust x-axis to exclude extreme outliers
plt.ticklabel_format(style='plain', axis='both') # Use plain number formatting for clarity
plt.grid(axis='y', linestyle='--', alpha=0.7) # Add gridlines for better readability
plt.show()
The plot shows the distribution of transaction amounts in INR, which is highly right-skewed. Most transactions are concentrated near lower amounts, with a sharp decline in frequency as the amount increases. The x-axis has been adjusted to display values up to the 99th percentile, excluding extreme outliers, making the distribution of the majority of transactions clearer. This pattern indicates a large number of small-value transactions, with only a few high-value transactions contributing to the long tail. This skewness highlights the need for handling outliers during preprocessing for a more balanced analysis.
d. Top Customer Locations
Understanding the geographical distribution of customers provides insights into the regions with the highest customer concentration. This information can be valuable for targeted marketing strategies and resource allocation.
The bar plot below shows the top 20 customer locations based on the frequency of occurrences in the dataset.
# Plotting the distribution of top 20 customer locations
plt.figure(figsize=(12, 8))
df['CustLocation'].value_counts().head(20).plot(kind='bar', color='blue')
plt.title('Top 20 Customer Locations', fontsize=14)
plt.xlabel('Location', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7) # Add gridlines for clarity
plt.show()
The bar plot shows the top 20 customer locations by count, with Mumbai leading significantly, followed by New Delhi, Bangalore, and Gurgaon. These cities dominate the dataset, reflecting a concentration of customers in major urban centers. This highlights potential regional trends, where densely populated or economically active areas may drive customer activity, making them key targets for segmentation and marketing strategies. Smaller cities, such as Mohali and Surat, have relatively fewer customers.
Outlier Detection
Outliers in numerical features can distort clustering results. We examine boxplots of CustAccountBalance and TransactionAmount (INR) to identify potential outliers.
# Boxplots for outlier detection
plt.figure(figsize=(15, 6))
# Subplot for account balance
plt.subplot(1, 2, 1)
sns.boxplot(x=df['CustAccountBalance'], color='lightblue')
plt.title('Boxplot of Customer Account Balance')
# Subplot for transaction amount
plt.subplot(1, 2, 2)
sns.boxplot(x=df['TransactionAmount (INR)'], color='lightgreen')
plt.title('Boxplot of Transaction Amount (INR)')
plt.tight_layout()
plt.show()
The box plots highlight the extreme skewness in both CustAccountBalance and TransactionAmount (INR). Most of the data points are concentrated near the lower end, with a significant number of outliers extending far beyond the upper whiskers. The interquartile range (IQR) is very small compared to the range of the outliers, making the box nearly invisible. This indicates the need to handle outliers, for example by capping or applying log transformations, to improve data balance and visualization.
4. Data Cleaning and Preprocessing
Data cleaning and preprocessing are crucial steps to ensure the dataset is ready for analysis. In this section, we address missing values, convert date-related columns into appropriate formats, and handle outliers to improve the data quality.
Handling Missing Values
The dataset contains missing values in a few columns: CustomerDOB, CustGender, CustLocation, and CustAccountBalance. Since only 0.66% of rows are affected, we drop these rows to simplify the preprocessing.
# Dropping rows with missing values
print(f"Before dropping missing values: {df.shape}")
df.dropna(inplace=True)
print(f"After dropping missing values: {df.shape}")
Before dropping missing values: (1048567, 9)
After dropping missing values: (1041614, 9)
Converting Date Columns to Datetime Format
Columns like CustomerDOB and TransactionDate are critical for temporal analysis but are currently stored as strings. We convert them to datetime format for consistency and to enable feature engineering.
# Converting CustomerDOB and TransactionDate to datetime format
df['CustomerDOB'] = pd.to_datetime(df['CustomerDOB'], format='%d/%m/%y', errors='coerce')
df['TransactionDate'] = pd.to_datetime(df['TransactionDate'], format='%d/%m/%y', errors='coerce')
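One caveat: with the %y format, pandas pivots two-digit years, so a DOB such as 4/4/57 parses as 2057-04-04. This is the source of the negative ages handled in Section 5, where the affected rows are dropped. An alternative sketch (not applied in this project) would roll implausible future birth dates back a century:
# Not applied here: shift DOBs that fall after the transaction period
# back 100 years instead of dropping those rows later
future_dob = df['CustomerDOB'] > df['TransactionDate'].max()
df.loc[future_dob, 'CustomerDOB'] -= pd.DateOffset(years=100)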
Handling Outliers
The CustAccountBalance and TransactionAmount (INR) columns exhibit extreme skewness and a large number of high-value data points, which reflect the natural inequality in financial data. These high values are not errors but are essential to understanding customer behavior. Rather than removing these "outliers," we use robust preprocessing methods to mitigate their impact while retaining their significance.
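The transformation code itself is not shown here, but the scaled values that appear in later outputs (e.g., balances near -0.28) are consistent with median/IQR scaling. A minimal sketch, assuming scikit-learn's RobustScaler:
from sklearn.preprocessing import RobustScaler
# RobustScaler centers each column on its median and scales by the IQR,
# dampening extreme values without discarding them
skewed_cols = ['CustAccountBalance', 'TransactionAmount (INR)']
df[skewed_cols] = RobustScaler().fit_transform(df[skewed_cols])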
One-Hot Encoding Categorical Features
Since CustLocation is not being encoded, we apply encoding only to the CustGender feature. This lets the clustering model work with this categorical variable in numerical form while keeping the data manageable.
Why CustLocation is Not Encoded
- High Cardinality: CustLocation has too many unique values, so one-hot encoding would produce a very sparse matrix.
- Focus on Financial Features: The clustering primarily focuses on CustAccountBalance and TransactionAmount (INR), making CustLocation less relevant to the segmentation goal.
- Imbalanced Distribution: The dominance of certain locations (e.g., Mumbai) could bias the clustering process, reducing the significance of smaller locations.
Encoding CustGender
The CustGender feature is essentially binary, with values M (male) and F (female). A simple binary mapping (full one-hot encoding is unnecessary for two categories) converts it into a numerical column for use in clustering.
# Encoding CustGender into a binary numerical column
# (the single 'T' entry maps to NaN)
df['CustGender'] = df['CustGender'].map({'M': 1, 'F': 0})
# Checking the updated dataset
print(df[['CustGender']].head())
CustGender
0 0.0
1 1.0
2 0.0
3 0.0
4 0.0
By encoding only CustGender, we simplify the preprocessing pipeline and stay aligned with the clustering objectives, avoiding the unnecessary complexity of encoding CustLocation.
5. Feature Engineering
Feature engineering is a critical step in creating meaningful variables that enhance the clustering process. By deriving new features from existing ones, we can better capture customer behavior and improve the segmentation.
Creation of New Features
- Age: Calculated from CustomerDOB to capture the customer's current age. Age can influence financial behavior, since older and younger customers may exhibit distinct patterns in transaction amounts or frequency, making it a key variable for segmentation.
from datetime import datetime
# Creating the Age feature (CustomerDOB is already datetime after Section 4)
df['Age'] = datetime.now().year - df['CustomerDOB'].dt.year
# Identify rows with invalid or negative ages
invalid_ages = df[df['Age'] < 0]
invalid_ages_count = invalid_ages.shape[0]
# Display the count and examples of invalid ages
invalid_ages_count, invalid_ages.head()
(54111,
TransactionID CustomerID CustomerDOB CustGender CustLocation \
1 T2 C2142763 2057-04-04 1.0 JHAJJAR
15 T16 C8334633 2068-07-10 0.0 NEW DELHI
30 T31 C7648712 2067-01-21 0.0 NEW DELHI
51 T52 C6637047 2066-07-25 1.0 PUNE
79 T80 C6612422 2067-05-25 1.0 DELHI
CustAccountBalance TransactionDate TransactionTime \
1 -0.275097 2016-08-02 141858
15 -0.293837 2016-08-01 125725
30 5.345137 2016-08-03 160642
51 -0.129259 2016-08-05 133141
79 4.576746 2016-08-01 204252
TransactionAmount (INR) Age
1 26.482212 -32.0
15 -0.199519 -43.0
30 0.040865 -42.0
51 -0.300481 -41.0
79 -0.251788 -42.0 )
The rows with an invalid CustomerDOB (54,111, about 5% of the dataset) are a by-product of the two-digit year parsing noted earlier. Dropping them is a safe approach that avoids introducing noise or skewing the clustering results, keeping the dataset clean and reliable.
# Dropping rows with invalid ages
df = df[df['Age'] >= 0]
- DaysSinceLastTransaction: Derived from TransactionDate to measure recency, an important behavioral indicator. Customers who have not transacted recently may be less engaged, which helps identify inactive or at-risk segments.
# Creating DaysSinceLastTransaction feature
df['DaysSinceLastTransaction'] = (datetime.now() - pd.to_datetime(df['TransactionDate'])).dt.days
- TransactionHour: Extracted from TransactionTime to analyze patterns in transaction timings (e.g., peak transaction hours). TransactionTime is stored as an HHMMSS integer, so a raw value of 91523 (9:15:23) zero-pads to '091523' and yields hour 9.
# Creating TransactionHour feature
df['TransactionHour'] = df['TransactionTime'].astype(str).str.zfill(6).str[:2].astype(int)
- Customer Lifetime Value (CLV): Calculated as the total transaction amount per customer, representing the overall value of a customer to the bank.
# Creating Customer Lifetime Value (CLV) feature
df['CLV'] = df.groupby('CustomerID')['TransactionAmount (INR)'].transform('sum')
- RFM Metrics (Recency, Frequency, Monetary):
  - Recency: Time since the last transaction.
  - Frequency: Number of transactions per customer.
  - Monetary: Total transaction value per customer.
RFM provides a structured approach to understanding customer behavior, segmenting customers by how recently, how often, and how much they transact with the bank.
# Calculating RFM metrics
rfm = df.groupby('CustomerID').agg({
    'TransactionDate': lambda x: (datetime.now() - pd.to_datetime(x).max()).days,
    'TransactionID': 'count',
    'TransactionAmount (INR)': 'sum'
}).reset_index()
rfm.rename(columns={
    'TransactionDate': 'Recency',
    'TransactionID': 'Frequency',
    'TransactionAmount (INR)': 'Monetary'
}, inplace=True)
print(rfm.head())
CustomerID Recency Frequency Monetary
0 C1010011 3046 2 4.029808
1 C1010012 3089 1 1.001442
2 C1010014 3096 2 0.519231
3 C1010018 3057 1 -0.411058
4 C1010024 3085 1 4.367788
6. Dimensionality Reduction with PCA
Principal Component Analysis (PCA) is a powerful technique used to reduce the dimensionality of a dataset while retaining most of the variance. This step simplifies clustering by eliminating redundant information and improving computational efficiency.
Merging RFM Metrics
The RFM metrics (Recency, Frequency, and Monetary) were created in a separate DataFrame. To ensure these features are included in the clustering process, we merge them into the main DataFrame (df).
# Merge RFM metrics into the main DataFrame
df = df.merge(rfm, on='CustomerID', how='left')
# Verify that all required columns are now present
print(df.columns)
Index(['TransactionID', 'CustomerID', 'CustomerDOB', 'CustGender',
'CustLocation', 'CustAccountBalance', 'TransactionDate',
'TransactionTime', 'TransactionAmount (INR)', 'Age',
'DaysSinceLastTransaction', 'TransactionHour', 'CLV', 'Recency',
'Frequency', 'Monetary'],
dtype='object')
Selecting Features for PCA
The features chosen for PCA are: CustAccountBalance, TransactionAmount (INR), Age, DaysSinceLastTransaction, TransactionHour, CLV, Recency, Frequency, and Monetary.
The CustGender feature was encoded as a binary variable (M → 1, F → 0) but is not included in PCA, as PCA is designed for continuous numerical features. Instead, CustGender is retained in the dataset for cluster profiling, allowing analysis of gender distribution across customer segments.
On the other hand, CustLocation was not encoded: its high cardinality makes one-hot encoding impractical, and since the clustering focuses on transactional behavior, location is not essential for segmentation.
By applying PCA only to the numerical features while keeping CustGender available for profiling, we ensure an efficient and meaningful segmentation process.
Standardizing the Selected Features
Before applying PCA, we scale the numerical features so that each contributes equally: PCA relies on variance, so features with large magnitudes (e.g., TransactionAmount (INR)) would otherwise dominate the principal components.
from sklearn.preprocessing import StandardScaler
# Features selected for PCA (continuous numerical features only)
pca_features = ['CustAccountBalance', 'TransactionAmount (INR)', 'Age',
                'DaysSinceLastTransaction', 'TransactionHour', 'CLV',
                'Recency', 'Frequency', 'Monetary']
# Standardizing the selected features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[pca_features])
Apply PCA to All Components
We first fit PCA on all components to analyze the variance explained by each.
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
# Apply PCA without specifying n_components to analyze all
pca = PCA()
pca.fit(scaled_features)
# Compute cumulative explained variance
explained_variance_ratio = np.cumsum(pca.explained_variance_ratio_)
# Plot the explained variance ratio (Elbow Method for PCA)
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio, marker='o', linestyle='--')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Elbow Method for Optimal PCA Components')
plt.grid(True)
plt.show()
Finding the Optimal Number of Principal Components (PCs)
Using the Elbow Method, we analyzed the cumulative explained variance to determine the optimal number of principal components. From the plot:
- The cumulative variance exceeds 90% by the 6th component, indicating that 6 principal components capture most of the dataset's variance.
- Beyond 6 components, the additional explained variance diminishes, making further components less impactful.
We therefore select 6 principal components for dimensionality reduction.
# Applying PCA with 6 components
pca = PCA(n_components=6)
pca_result = pca.fit_transform(scaled_features)
# Adding the PCA results to the DataFrame as PCA1..PCA6
for i in range(6):
    df[f'PCA{i + 1}'] = pca_result[:, i]
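As a quick sanity check (the value is not shown in the original output), the variance retained by the six components can be printed:
# Total variance captured by the 6 retained components
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2%}")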
7. Clustering and Customer Segmentation
This section applies K-Means clustering to segment customers into distinct groups based on their transactional and behavioral patterns.
Optimal Number of Clusters
The Elbow Method was applied to determine the optimal number of clusters. By plotting the Sum of Squared Distances (Inertia) against different values of K, we look for an "elbow point," where adding more clusters provides diminishing returns.
Code for Visualization
# Optimal number of clusters (Elbow Method)
from sklearn.cluster import KMeans
# Calculate inertia for different values of K
inertia = []
K_range = range(1, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(df[['PCA1', 'PCA2', 'PCA3', 'PCA4', 'PCA5', 'PCA6']])
    inertia.append(kmeans.inertia_)
# Plot the Elbow Method results
plt.figure(figsize=(10, 6))
plt.plot(K_range, inertia, marker='o', linestyle='--')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Sum of Squared Distances)')
plt.title('Elbow Method for Optimal K')
plt.grid(True)
plt.show()
From the plot, the optimal number of clusters is K = 6: it balances reducing the sum of squared distances (inertia) against keeping the model simple, ensuring the clustering solution is both meaningful and interpretable.
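The elbow choice can be corroborated with a silhouette check. This sketch is not part of the original analysis, and the subsample size is an arbitrary choice to keep the pairwise-distance computation tractable:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
pca_cols = ['PCA1', 'PCA2', 'PCA3', 'PCA4', 'PCA5', 'PCA6']
# Score a random subsample rather than all ~1M rows
sample = df[pca_cols].sample(n=20000, random_state=42)
for k in range(2, 9):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(sample)
    print(f"K={k}: silhouette = {silhouette_score(sample, labels):.3f}")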
K-Means Clustering
Using the optimal value of K = 6, we implemented the K-Means clustering algorithm on the PCA-transformed dataset.
# Applying K-Means clustering with 6 clusters
kmeans = KMeans(n_clusters=6, random_state=42)
cluster_labels = kmeans.fit_predict(df[['PCA1', 'PCA2', 'PCA3', 'PCA4', 'PCA5', 'PCA6']])
# Adding cluster labels to the DataFrame
df['Cluster'] = cluster_labels
Visualization of Clusters
To visualize the clustering results, we used the first two principal components (PCA1 and PCA2) to create a scatter plot. Each point represents a customer, color-coded by its assigned cluster.
# Visualization of Clusters
# Scatter plot of clusters on the first two PCA components
plt.figure(figsize=(10, 6))
sns.scatterplot(x='PCA1', y='PCA2', hue='Cluster', palette='viridis', data=df, alpha=0.7)
plt.title('Customer Segmentation using K-Means Clustering (K=6)')
plt.xlabel('PCA1')
plt.ylabel('PCA2')
plt.legend(title='Cluster')
plt.show()
The plot visualizes customer segmentation into six clusters using the first two principal components (PCA1 and PCA2). Cluster 5 (yellow) stands out significantly, indicating a group with distinct characteristics, potentially driven by extreme values in the dataset. The other clusters (0 to 4) are closely packed, suggesting overlapping behaviors or smaller variances between those segments. This clustering structure highlights potential outliers and distinct customer groups for further analysis.
Analyzing and Adding Cluster Labels
The cluster labels were added back to the original dataset for further analysis and interpretation. This allows us to explore each segment's characteristics, such as average transaction value, frequency, and customer demographics.
# Summarizing mean values of all numeric columns by cluster
cluster_summary = df.groupby('Cluster').mean(numeric_only=True)
print(cluster_summary)
CustGender CustAccountBalance TransactionTime \
Cluster
0 0.715262 4.803408e+04 154940.111993
1 0.721161 5.906237e+04 157076.590304
2 0.754320 2.754499e+05 159823.477349
3 0.719675 5.228310e+04 159952.216517
4 0.707069 3.358953e+05 156557.757392
5 0.820513 1.741221e+06 158989.025641
TransactionAmount (INR) Age DaysSinceLastTransaction \
Cluster
0 903.011808 36.348861 3093.101986
1 1047.727929 37.425236 3078.407552
2 1979.125956 48.617109 3081.610222
3 904.313536 36.678175 3065.045802
4 44886.150211 41.442921 3079.839828
5 439452.219744 42.871795 3078.948718
TransactionHour CLV Recency Frequency \
Cluster
0 15.196049 966.221687 3092.751830 1.073175
1 15.410039 2560.410767 3067.930791 2.241514
2 15.685248 2172.280829 3080.222722 1.173310
3 15.697193 920.514399 3065.038165 1.023895
4 15.363227 55314.255951 3076.099276 1.474055
5 15.615385 525905.508718 3075.666667 1.666667
Monetary PCA1 PCA2 PCA3 PCA4 PCA5 \
Cluster
0 966.221687 -0.304102 1.383238 -0.276241 0.138544 0.100448
1 2560.410767 0.237760 -0.831150 -0.399013 -1.055146 -1.005582
2 2172.280829 0.256291 0.290182 1.443423 -0.020180 0.040478
3 920.514399 -0.219011 -1.135079 -0.127184 0.571367 0.547641
4 55314.255951 13.620303 0.598597 -0.638923 0.500619 0.600280
5 525905.508718 133.878648 6.758859 -9.890442 7.291864 7.382774
PCA6
Cluster
0 0.196405
1 0.084686
2 -0.869133
3 0.155481
4 0.110740
5 2.696470
8. Cluster Profiling and Analysis
After segmenting the customers into six clusters, detailed profiling was conducted to understand the characteristics of each segment. This analysis helps to extract meaningful insights and actionable strategies.
Profiling Customer Segments
For each cluster, we analyzed key features, including demographics and transactional behavior:
Demographics:
- Age: Average age of customers in each cluster to identify age-based trends.
- CustGender: Gender distribution within each cluster to reveal potential biases or patterns.
Transactional Behavior:
- CustAccountBalance: Average account balance per cluster, reflecting financial strength.
- TransactionAmount (INR): Total and average transaction amounts to capture spending patterns.
- CLV (Customer Lifetime Value): Indicates the overall value of customers within each cluster.
Summary Statistics
Summary statistics for each cluster were computed to compare key metrics across segments.
# Calculate mean values of key metrics by cluster
cluster_summary = df.groupby('Cluster')[['Age', 'CustAccountBalance',
                                         'TransactionAmount (INR)',
                                         'CLV', 'Recency', 'Frequency', 'Monetary']].mean()
# Display summary statistics
print(cluster_summary)
Age CustAccountBalance TransactionAmount (INR) \
Cluster
0 36.348861 4.803408e+04 903.011808
1 37.425236 5.906237e+04 1047.727929
2 48.617109 2.754499e+05 1979.125956
3 36.678175 5.228310e+04 904.313536
4 41.442921 3.358953e+05 44886.150211
5 42.871795 1.741221e+06 439452.219744
CLV Recency Frequency Monetary
Cluster
0 966.221687 3092.751830 1.073175 966.221687
1 2560.410767 3067.930791 2.241514 2560.410767
2 2172.280829 3080.222722 1.173310 2172.280829
3 920.514399 3065.038165 1.023895 920.514399
4 55314.255951 3076.099276 1.474055 55314.255951
5 525905.508718 3075.666667 1.666667 525905.508718
These summaries highlight the key characteristics of each customer segment, providing insights into their demographics and transaction behaviors.
Bar Plot for Key Metrics
This bar plot compares the average CustAccountBalance across clusters.
# Bar plot for average CustAccountBalance by cluster
plt.figure(figsize=(10, 6))
sns.barplot(x=cluster_summary.index, y=cluster_summary['CustAccountBalance'], palette='viridis')
plt.title('Average Account Balance by Cluster')
plt.xlabel('Cluster')
plt.ylabel('Average Account Balance')
plt.show()
The plot shows the average account balance for each cluster. Cluster 5 stands out with an exceptionally high average account balance, indicating a group of high-value customers. Clusters 2 and 4 have moderate balances, while Clusters 0, 1, and 3 have significantly lower averages, suggesting they may consist of low-value customers or those with minimal financial activity. This segmentation highlights the financial disparity across customer groups.
Pie Chart for Cluster Proportions
A pie chart was created to show the proportion of customers in each cluster.
# Pie chart for cluster proportions
plt.figure(figsize=(8, 8))
df['Cluster'].value_counts().plot(kind='pie', autopct='%1.1f%%', colors=sns.color_palette('viridis', 6))
plt.title('Proportion of Customers in Each Cluster')
plt.ylabel('')
plt.show()
The pie chart illustrates the proportion of customers in each cluster. Cluster 0 contains the largest share of customers (33.7%), followed by Cluster 3 (30.0%) and Cluster 1 (20.7%). Cluster 2 has a smaller proportion (15.1%), while Cluster 5 (0.9%) and Cluster 4 (0.6%) are significantly underrepresented, indicating these clusters consist of unique or high-value customers with distinct characteristics.
To better understand the characteristics of each cluster, bar plots were created for key metrics such as age, account balance, transaction amount, and customer lifetime value (CLV).
Average Age by Customer Segment
This bar plot shows the average age for each cluster, revealing age-related patterns in customer segmentation.
# Bar plot for average age by cluster
plt.figure(figsize=(12, 6))
ax = sns.barplot(x='Cluster', y='Age', data=df, palette='viridis')
plt.title('Average Age by Customer Segment')
plt.xlabel('Cluster')
plt.ylabel('Average Age')
# Add the value of each bar on top
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.1f'),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center',
                xytext=(0, 9),
                textcoords='offset points')
plt.show()
The plot shows the average age of customers in each cluster. Cluster 2 has the oldest average age at 48.6 years, followed by Clusters 5 and 4 with averages of 42.9 and 41.4 years, respectively. Clusters 0, 1, and 3 represent younger customer segments, with Cluster 0 having the youngest average age at 36.3 years. This variation suggests age-related segmentation patterns in customer behavior.
Average Account Balance by Customer Segment
This bar plot highlights the average account balance across clusters, reflecting the financial strength of each segment.
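The plotting code was omitted from the original post; presumably it mirrors the other cluster bar plots, e.g.:
# Reconstructed: bar plot for average account balance by cluster
plt.figure(figsize=(12, 6))
sns.barplot(x='Cluster', y='CustAccountBalance', data=df, palette='viridis')
plt.title('Average Account Balance by Customer Segment')
plt.xlabel('Cluster')
plt.ylabel('Average Account Balance (INR)')
plt.show()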
The plot highlights the average account balance across clusters. Cluster 5 stands out with an extraordinarily high average balance of ₹1,741,221.5, indicating a group of extremely wealthy customers. However, the long black bar for Cluster 5 suggests a very high variance, indicating significant heterogeneity or the presence of outliers within this segment. Cluster 4 follows with a significantly lower average balance of ₹335,895.3, and Clusters 0, 1, 2, and 3 have much smaller average balances, all below ₹60,000.
Considerations:
- High Variance in Cluster 5: The variance in Cluster 5 implies that this group includes both extremely high-value customers and some with moderately high balances; further analysis may be needed to identify subgroups within this cluster (a sketch follows this list).
- Outlier Impact: Outliers may inflate the mean and variance for Cluster 5. Addressing them could give a clearer picture of the segment's core characteristics.
- Cluster Strategy: Clusters 4 and 5 represent high-value customers requiring specialized strategies, while Clusters 0, 1, 2, and 3 represent lower-value segments, suggesting a focus on growth and engagement strategies for these groups.
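As a hypothetical follow-up (not performed in this project), Cluster 5 could be re-clustered on its own to surface such subgroups; the choice of two sub-clusters below is arbitrary:
# Re-cluster the small, high-variance Cluster 5 to look for subgroups
pca_cols = ['PCA1', 'PCA2', 'PCA3', 'PCA4', 'PCA5', 'PCA6']
c5 = df.loc[df['Cluster'] == 5, pca_cols]
sub_labels = KMeans(n_clusters=2, random_state=42).fit_predict(c5)
print(pd.Series(sub_labels).value_counts())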
Average Transaction Amount by Customer Segment
This bar plot illustrates the average transaction amount per cluster, capturing spending behavior.
# Bar plot for average transaction amount by cluster
plt.figure(figsize=(12, 6))
sns.barplot(x='Cluster', y='TransactionAmount (INR)', data=df, palette='viridis')
plt.title('Average Transaction Amount by Customer Segment')
plt.xlabel('Cluster')
plt.ylabel('Average Transaction Amount (INR)')
plt.show()
The plot shows the average transaction amount by customer segment. Cluster 5 has an exceptionally high average transaction amount of ₹439,452.2, with a noticeable variance, indicating a segment of high-value transactions. Cluster 4 follows with a much smaller average of ₹44,886.2, while Clusters 0, 1, 2, and 3 exhibit minimal transaction amounts, all below ₹2,000. This suggests that Clusters 5 and 4 consist of premium customers, whereas the others represent lower transaction activity.
Average CLV by Customer Segment
The final bar plot depicts the average customer lifetime value (CLV) for each cluster, summarizing long-term customer contributions.
# Bar plot for average CLV by cluster
plt.figure(figsize=(12, 6))
sns.barplot(x='Cluster', y='CLV', data=df, palette='viridis')
plt.title('Average Customer Lifetime Value (CLV) by Customer Segment')
plt.xlabel('Cluster')
plt.ylabel('Average CLV (INR)')
plt.show()
The plot illustrates the average Customer Lifetime Value (CLV) across clusters. Cluster 5 stands out with an average CLV of roughly ₹525,906, coupled with noticeable variance, indicating heterogeneity within this group. Cluster 4 follows at a much lower average of about ₹55,314, while Clusters 0, 1, 2, and 3 show comparatively negligible CLV. This emphasizes that Cluster 5 consists of extremely valuable customers, requiring tailored retention and engagement strategies.
9. Insights and Business Applications
The segmentation of customers into six distinct clusters provides actionable insights for crafting tailored marketing strategies and identifying growth opportunities. Here's how each cluster aligns with specific business strategies:
Marketing Strategies for Each Customer Segment:
- Cluster 0 (Low-Value Customers):
  - Characteristics: Lowest average account balance (~48,034 INR) and transaction amount (~903 INR).
  - Focus: Retention
  - Strategies:
    - Introduce loyalty programs to maintain engagement.
    - Provide educational content to help them better utilize financial services.
    - Offer budget-friendly promotions to encourage more transactions.
- Cluster 1 (Moderate-Value Customers):
  - Characteristics: Slightly higher account balance (~59,062 INR) and transaction amount (~1,047 INR).
  - Focus: Spending Growth
  - Strategies:
    - Deploy targeted promotions to boost spending.
    - Highlight cross-sell opportunities for services like loans or savings plans.
    - Provide cashback or reward-based incentives for higher transaction volumes.
- Cluster 2 (Emerging Customers):
  - Characteristics: Moderate account balance (~275,449 INR) and transaction amount (~1,979 INR), with potential for growth.
  - Focus: Nurturing Growth
  - Strategies:
    - Offer personalized onboarding campaigns to introduce additional financial products.
    - Provide special offers to increase engagement and spending activity.
    - Communicate regularly to strengthen their relationship with the bank.
- Cluster 3 (Another Low-Value Group):
  - Characteristics: Similar to Cluster 0, with a slightly higher account balance (~52,283 INR) and transaction amount (~904 INR).
  - Focus: Retention
  - Strategies:
    - Retain engagement through loyalty programs and budget-friendly options.
    - Simplify their financial experience with easy-to-use tools and resources.
- Cluster 4 (High-Value Customers):
  - Characteristics: High account balance (~335,895 INR) and transaction amount (~44,886 INR).
  - Focus: Premium Services
  - Strategies:
    - Offer premium financial products, such as exclusive investment plans or credit cards.
    - Assign dedicated account managers for personalized attention.
    - Provide access to VIP events and memberships.
- Cluster 5 (Ultra-High-Value Customers):
  - Characteristics: Exceptional account balance (~1.74 million INR) and transaction amount (~439,452 INR), with high variance.
  - Focus: Personalized and Exclusive Services
  - Strategies:
    - Deliver one-on-one wealth management solutions.
    - Offer highly tailored financial products and private banking services.
    - Build trust through consistent and high-value customer service.
Identified Business Opportunities:
- Cross-Selling Programs: Use segmentation insights to recommend tailored products, such as investment options for Cluster 5 or insurance for Cluster 4.
- Customer Engagement Programs: Develop initiatives like targeted communication and rewards programs to deepen relationships with Clusters 1 and 2.
- Revenue Maximization: Focus on retaining high-value customers in Clusters 4 and 5 while nurturing potential growth in Cluster 2.
By leveraging these targeted strategies, the business can enhance customer satisfaction, maximize revenue, and improve overall financial performance.
10. Deployment and Practical Use
To make the insights and segmentation actionable, the project includes exporting and deploying the results for practical business use. Here's how this can be achieved:
Exporting Segmented Customer Data
The segmented customer data is exported to a CSV file for easy integration with the organization's Customer Relationship Management (CRM) system. This allows business teams to access the data and implement targeted strategies for each cluster.
# Exporting segmented customer data
# (the Cluster column was added to df during K-Means clustering in Section 7)
df.to_csv('segmented_customers.csv', index=False)
print("Segmented data exported successfully!")
Segmented data exported successfully!
This exported data can be used by marketing, sales, and customer service teams to execute the tailored strategies discussed earlier.
Deployment Idea: A Streamlit App
To facilitate interactive and dynamic visualization of the customer clusters, a Streamlit app can be developed. The app can provide the following functionalities:
- Cluster Visualization: Display 2D visualizations of the clusters using the PCA components, with interactive filters to explore specific clusters or features.
- Dynamic Insights: Summarize key metrics for each cluster, such as average account balance, transaction amounts, and customer lifetime value (CLV), and provide actionable insights based on user-selected clusters.
- Data Export Option: Allow business users to download filtered or customized segments directly from the app.
Sample Streamlit Code:
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
# Load segmented customer data
data = pd.read_csv('segmented_customers.csv')
# Sidebar for cluster selection
st.sidebar.title("Customer Segmentation")
selected_cluster = st.sidebar.selectbox("Select Cluster", sorted(data['Cluster'].unique()))
# Display cluster summary
st.title(f"Cluster {selected_cluster} Overview")
cluster_data = data[data['Cluster'] == selected_cluster]
st.write(cluster_data.describe())
# Visualize selected cluster
st.title(f"Cluster {selected_cluster} Visualization")
fig, ax = plt.subplots()
ax.scatter(cluster_data['PCA1'], cluster_data['PCA2'], alpha=0.5)
ax.set_xlabel("PCA1")
ax.set_ylabel("PCA2")
ax.set_title(f"Cluster {selected_cluster} (PCA Visualization)")
st.pyplot(fig)
Saving this script as, say, app.py and running streamlit run app.py launches an interactive dashboard with the summary table and PCA scatter plot for the selected cluster.
11. Conclusion and Future Work
Recap of Achievements
This project successfully segmented customers into six distinct clusters using K-Means clustering, enriched by Principal Component Analysis (PCA) for dimensionality reduction. The analysis provided actionable business insights tailored to each cluster, supporting personalized marketing strategies. Key achievements include:
- Identification of high-value customer clusters for premium offerings.
- Recognition of low-value clusters to focus on retention and engagement strategies.
- Development of data-driven opportunities for cross-selling and promotional activities.
Suggestions for Future Work
While this project provided valuable segmentation, there is potential to expand and refine the approach:
- Incorporating External Data: Integrating additional variables such as income levels, credit scores, or transaction types could enhance the segmentation granularity.
- Churn Prediction Models: Building predictive models to anticipate customer churn can complement segmentation and aid proactive retention strategies.
- Advanced Clustering Techniques: Methods like DBSCAN or hierarchical clustering may uncover complex patterns and better capture the diversity of customer behaviors.
- Dynamic Model Deployment: A fully interactive app with advanced visualizations and segmentation updates would allow for real-time business use.
These directions aim to strengthen customer insights and further empower data-driven decision-making.
Appendices
Code: https://github.com/Minhhoang2606/Bank-customer-segmentation
Data source: https://www.kaggle.com/datasets/shivamb/bank-customer-segmentation