1. Introduction and Business Problem Definition

Credit card fraud is a growing concern in the financial industry, causing significant financial losses to businesses and distress to customers. As digital payment systems continue to evolve, so do the methods used by fraudsters to exploit them. Detecting fraudulent transactions is crucial not only to protect customers but also to maintain the trust and reputation of financial institutions.

The main challenge lies in identifying fraudulent transactions from a massive volume of legitimate ones while ensuring minimal disruption to genuine customers. Traditional methods of fraud detection often rely on predefined rules and manual reviews, which can be both inefficient and insufficient for adapting to new fraud techniques.

This project aims to leverage machine learning to build an efficient and scalable credit card fraud detection system. By analyzing transaction data, identifying patterns, and applying advanced algorithms, the project seeks to distinguish between fraudulent and legitimate transactions with high accuracy. The ultimate goal is to help businesses reduce financial losses and enhance customer satisfaction through timely fraud detection.

2. Dataset Overview

In this project, we use a dataset that contains credit card transaction records, including both legitimate and fraudulent transactions. Each transaction is characterized by features that capture distances (from the cardholder's home and from the previous transaction), the purchase price relative to the cardholder's median purchase, and binary flags describing how the transaction was made. The goal is to analyze these features to detect fraudulent patterns.

Below, we provide an overview of the dataset and its structure.

Data Import

First, we import the dataset and load it into a pandas DataFrame. This allows us to analyze and manipulate the data efficiently.

import pandas as pd

# Load the dataset
transaction_df = pd.read_csv('card_transdata.csv')

# Display the first few rows of the dataset
print(transaction_df.head())

   distance_from_home  distance_from_last_transaction  \
0           57.877857                        0.311140   
1           10.829943                        0.175592   
2            5.091079                        0.805153   
3            2.247564                        5.600044   
4           44.190936                        0.566486   

   ratio_to_median_purchase_price  repeat_retailer  used_chip  \
0                        1.945940              1.0        1.0   
1                        1.294219              1.0        0.0   
2                        0.427715              1.0        0.0   
3                        0.362663              1.0        1.0   
4                        2.222767              1.0        1.0   

   used_pin_number  online_order  fraud  
0              0.0           0.0    0.0  
1              0.0           0.0    0.0  
2              0.0           1.0    0.0  
3              0.0           1.0    0.0  
4              0.0           1.0    0.0

The dataset consists of transaction records, with each row representing an individual transaction. Here is a concise explanation of the key features:

distance_from_home:
- The physical distance between the transaction location and the cardholder's home address.
- Larger values may indicate suspicious behavior if a transaction occurs far from the user's usual location.
distance_from_last_transaction:
- The distance between the current transaction location and the previous transaction location.
- Unusually large distances between consecutive transactions might suggest fraudulent activity.
ratio_to_median_purchase_price:
- The ratio of the transaction amount to the median purchase price for the cardholder.
- Extremely high or low values could indicate anomalies worth investigating.
repeat_retailer:
- A binary flag indicating whether the transaction occurred at a retailer where the cardholder has previously shopped.
- Fraudsters might target retailers where cardholders frequently shop to avoid detection.
used_chip:
- A binary flag indicating whether the transaction used a chip-enabled card.
- Chip-based transactions are generally more secure; transactions without chip usage may warrant closer scrutiny.
used_pin_number:
- A binary flag indicating whether a PIN was used during the transaction.
- Transactions without PIN verification might be at higher risk of fraud.
online_order:
- A binary flag indicating whether the transaction was conducted online.
- Online transactions are typically more vulnerable to fraud than in-person transactions.
fraud:
- The target variable, indicating whether the transaction was fraudulent (1) or legitimate (0).
- This feature is used to train and evaluate the fraud detection models.

Key Insights:

The dataset captures both spatial and behavioral patterns (e.g., distances, repeat retailers) that can help identify potential fraud.
Binary indicators (used_chip, used_pin_number, online_order) highlight transaction methods that vary in security levels.
The target variable (fraud) is crucial for training machine learning models to predict fraudulent transactions.

3. Data Cleaning

Data cleaning is a crucial step in any data analysis or machine learning project. It ensures the dataset is free from inconsistencies, missing values, and irrelevant information. In this section, we address common data quality issues in the credit card fraud detection dataset.

Checking for Missing Values

Missing values can lead to errors during model training. We begin by identifying if any columns contain missing data.

# Check for missing values
print(transaction_df.isnull().sum())

distance_from_home                0
distance_from_last_transaction    0
ratio_to_median_purchase_price    0
repeat_retailer                   0
used_chip                         0
used_pin_number                   0
online_order                      0
fraud                             0
dtype: int64

The dataset has no missing values, indicating it is complete and ready for analysis without requiring imputation.

Removing Duplicate Records

Duplicate records can inflate the importance of specific transactions and bias the model. We check for duplicate rows; any that are found should be removed before modeling.

# Check for duplicate rows
print(f"Number of duplicate rows: {transaction_df.duplicated().sum()}")

Number of duplicate rows: 0

Verifying Data Integrity

We validate the ranges and consistency of the feature values to ensure data quality:

distance_from_home and distance_from_last_transaction should have non-negative values.

# Check for negative values in distance columns
print((transaction_df[['distance_from_home', 'distance_from_last_transaction']] < 0).sum())

distance_from_home                0
distance_from_last_transaction    0
dtype: int64

There are no negative values in distance_from_home and distance_from_last_transaction, indicating valid data for these features.

Binary columns (repeat_retailer, used_chip, used_pin_number, online_order) and the target (fraud) should only contain 0 and 1.

# Ensure binary columns contain valid values
binary_columns = ['repeat_retailer', 'used_chip', 'used_pin_number', 'online_order', 'fraud']
for col in binary_columns:
    print(f"Invalid values in {col}: {transaction_df[~transaction_df[col].isin([0, 1])].shape[0]}")

Invalid values in repeat_retailer: 0
Invalid values in used_chip: 0
Invalid values in used_pin_number: 0
Invalid values in online_order: 0
Invalid values in fraud: 0

All binary features (repeat_retailer, used_chip, used_pin_number, online_order, fraud) contain only valid values (0 or 1), ensuring data integrity for these columns.
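
The individual checks in this section can also be bundled into one reusable function, which is convenient when new data arrives. The following is a minimal sketch rather than part of the original analysis: the helper name count_integrity_issues and the tiny sample frame are hypothetical, and the frame deliberately contains one negative distance and one invalid flag so the counts are non-zero.

```python
import pandas as pd

# Column groups taken from the dataset description in this article.
DISTANCE_COLUMNS = ["distance_from_home", "distance_from_last_transaction"]
BINARY_COLUMNS = ["repeat_retailer", "used_chip", "used_pin_number", "online_order", "fraud"]


def count_integrity_issues(df: pd.DataFrame) -> dict:
    """Count violations of each data quality rule (hypothetical helper)."""
    issues = {
        "missing_values": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }
    # Distances must be non-negative.
    for col in DISTANCE_COLUMNS:
        issues[f"negative_{col}"] = int((df[col] < 0).sum())
    # Binary flags may only hold 0 or 1.
    for col in BINARY_COLUMNS:
        issues[f"non_binary_{col}"] = int((~df[col].isin([0, 1])).sum())
    return issues


# Made-up rows: one negative distance and one invalid used_chip value.
sample_df = pd.DataFrame({
    "distance_from_home": [5.0, -1.0, 12.5],
    "distance_from_last_transaction": [0.3, 0.8, 2.1],
    "repeat_retailer": [1.0, 0.0, 1.0],
    "used_chip": [1.0, 0.0, 2.0],
    "used_pin_number": [0.0, 0.0, 1.0],
    "online_order": [0.0, 1.0, 1.0],
    "fraud": [0.0, 0.0, 1.0],
})
issue_counts = count_integrity_issues(sample_df)
print(issue_counts)
```

On the real dataset, every count would be expected to be zero, matching the outputs shown above.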

4. Exploratory Data Analysis (EDA)

Evaluating the Security of Chip & PIN Transaction Methods

Chip-and-PIN technology is widely regarded as a secure transaction method. In this section, we analyze the effectiveness of chip-and-PIN usage in reducing fraudulent activities.

Steps and Insights:

Compare fraud rates between transactions with and without chip usage (used_chip feature).

Code Example:

import seaborn as sns
import matplotlib.pyplot as plt

# Fraud rates for chip and PIN usage
# groupby sorts the groups by key, so position 0 = not used and position 1 = used
chip_fraud_rates = transaction_df.groupby('used_chip')['fraud'].mean()
pin_fraud_rates = transaction_df.groupby('used_pin_number')['fraud'].mean()

# Visualization of fraud rates
plt.figure(figsize=(10, 5))
# Labels follow the groupby order: 0 (chip not used) first, then 1 (chip used)
sns.barplot(x=['Chip Not Used', 'Chip Used'], y=chip_fraud_rates.values, palette='viridis')
plt.title("Fraud Rates for Chip Usage")
plt.ylabel("Fraud Rate")
plt.xlabel("Chip Usage")
plt.show()

Fraud occurs both in transactions that used a chip and in those that did not, and the correlation analysis later in this section finds little direct correlation between used_chip and fraud. Chip usage alone therefore does not prevent fraud, and chip-based transactions may still be exploited by advanced methods.

Examine fraud patterns for transactions where a PIN number was used (used_pin_number feature).

# Visualization of Fraud Rates for PIN Usage
plt.figure(figsize=(10, 5))
# Labels follow the groupby order: 0 (PIN not used) first, then 1 (PIN used)
sns.barplot(x=['PIN Not Used', 'PIN Used'], y=pin_fraud_rates.values, palette='viridis')
plt.title("Fraud Rates for PIN Usage")
plt.ylabel("Fraud Rate")
plt.xlabel("PIN Usage")
plt.show()

Transactions with PIN usage have a significantly lower fraud rate compared to those without PIN verification, highlighting the effectiveness of PINs in reducing fraud risk.
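
One detail worth making explicit: groupby returns its groups sorted by key (0, then 1), so a hand-written list of bar labels has to follow that order. A way to avoid depending on the order at all is to attach labels to the values themselves. The sketch below assumes a hypothetical helper, fraud_rate_by_flag, and uses made-up rows rather than the real dataset.

```python
import pandas as pd


def fraud_rate_by_flag(df, flag_col, label_for_0, label_for_1):
    """Fraud rate per flag value, with labels attached by value (hypothetical helper)."""
    rates = df.groupby(flag_col)["fraud"].mean()
    # Map each index value to its label so the labels cannot be swapped by accident.
    return rates.rename(index={0: label_for_0, 1: label_for_1})


# Made-up data: half of the non-PIN transactions are fraudulent, none of the PIN ones.
demo_df = pd.DataFrame({
    "used_pin_number": [0, 0, 0, 0, 1, 1],
    "fraud":           [1, 0, 1, 0, 0, 0],
})
pin_rates = fraud_rate_by_flag(demo_df, "used_pin_number", "PIN Not Used", "PIN Used")
print(pin_rates)
```

The resulting Series can be passed to a bar plot using pin_rates.index for the labels and pin_rates.values for the heights.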

Analyzing Repeat Retailer Fraud Patterns

Fraudulent transactions may often occur at specific retailers repeatedly targeted by fraudsters. Understanding these patterns helps identify vulnerabilities.

Steps and Insights:

Analyze the repeat_retailer feature to compare fraud rates at repeat and non-repeat retailers.
Visualize fraud proportions using bar plots.

Code Example:

# Fraud rates for repeat and non-repeat retailers
repeat_retailer_fraud_rates = transaction_df.groupby('repeat_retailer')['fraud'].mean()

# Visualization of fraud rates for repeat retailers
plt.figure(figsize=(10, 5))
# Labels follow the groupby order: 0 (new retailer) first, then 1 (repeat retailer)
sns.barplot(x=['New Retailer', 'Repeat Retailer'], y=repeat_retailer_fraud_rates.values, palette='coolwarm')
plt.title("Fraud Rates for Repeat Retailers")
plt.ylabel("Fraud Rate")
plt.xlabel("Retailer Type")
plt.show()

The plot shows that the fraud rates for transactions at repeat retailers and new retailers are nearly identical. This indicates that repeat retailers, often assumed to be safer due to familiarity, do not necessarily experience lower fraud rates compared to new retailers. Fraud risk appears consistent across both types of retailers, suggesting that other factors may play a more significant role in determining fraud likelihood.

Analyzing Purchase Ratios in Relation to Fraud

Instead of analyzing transaction amounts, this section focuses on the ratio_to_median_purchase_price, which indicates how the transaction value compares to the cardholder's typical spending. This feature can help identify transactions that deviate significantly from normal behavior, which may be indicative of fraud.

Analyze the distribution of ratio_to_median_purchase_price for fraudulent and legitimate transactions using KDE (Kernel Density Estimation) plots.

import seaborn as sns
import matplotlib.pyplot as plt

# KDE plot for ratio_to_median_purchase_price by fraud
plt.figure(figsize=(10, 6))
sns.kdeplot(
    data=transaction_df,
    x="ratio_to_median_purchase_price",
    hue="fraud",
    fill=True,
    common_norm=False,
    palette="coolwarm"
)
plt.title("Ratio to Median Purchase Price Distribution by Fraud")
plt.xlabel("Ratio to Median Purchase Price")
plt.ylabel("Density")
plt.xlim(0, 50)  # Adjust the x-axis range
plt.show()

The KDE plot shows that the majority of ratio_to_median_purchase_price values for both legitimate (blue) and fraudulent (orange) transactions fall within the range of 0 to 10, with legitimate transactions having a sharper peak near lower values. Fraudulent transactions exhibit a broader distribution, indicating a higher likelihood of involving purchase ratios that deviate significantly from the median.

Use boxplots to highlight differences in the ratio_to_median_purchase_price between fraudulent and legitimate transactions.

# Boxplot for ratio_to_median_purchase_price
plt.figure(figsize=(10, 6))
sns.boxplot(
    data=transaction_df,
    x="fraud",
    y="ratio_to_median_purchase_price",
    palette="coolwarm"
)
plt.title("Ratio to Median Purchase Price by Fraud")
plt.xlabel("Fraud (0 = Legitimate, 1 = Fraudulent)")
plt.ylabel("Ratio to Median Purchase Price")
plt.show()

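At this scale, the boxplot makes the distributions of ratio_to_median_purchase_price for fraudulent and legitimate transactions look largely comparable, with similar ranges and extreme outliers in both classes. This visual similarity should be read with caution, because extreme outliers stretch the axis and compress the boxes. The KDE plot above shows a broader distribution for fraudulent transactions, and the correlation and feature importance analyses later in this article identify this feature as the strongest single predictor of fraud. Combining it with other features or interactions may still improve detection.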
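
When extreme outliers stretch a boxplot's axis, a numeric summary of the quartiles per class is often easier to compare than the picture. The following is a minimal sketch on made-up values (not the real dataset) showing how the per-class quartiles can be tabulated with pandas.

```python
import pandas as pd

# Made-up values: each class has one extreme outlier that would dominate a boxplot.
demo_df = pd.DataFrame({
    "ratio_to_median_purchase_price": [0.5, 1.0, 1.5, 2.0, 40.0, 3.0, 5.0, 7.0, 9.0, 60.0],
    "fraud": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# Quartiles per class; unstack() turns the quantile level into columns.
quantile_summary = (
    demo_df.groupby("fraud")["ratio_to_median_purchase_price"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quantile_summary)
```

Applied to transaction_df, the same call would show whether the medians of the two classes actually differ, independent of the outliers.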

Analyzing Fraud Cases in Online Transactions

Online transactions are generally more vulnerable to fraud compared to in-person transactions due to the lack of physical verification methods such as chip-and-pin or signatures. This section explores fraud trends in e-commerce transactions.

# Fraud rates for online and in-person transactions
online_fraud_rates = transaction_df.groupby('online_order')['fraud'].mean()

# Visualization of fraud rates for online orders
plt.figure(figsize=(10, 6))
sns.barplot(x=['In-Person', 'Online'], y=online_fraud_rates.values, palette='coolwarm')
plt.title("Fraud Rates for Online vs. In-Person Transactions")
plt.ylabel("Fraud Rate")
plt.xlabel("Transaction Type")
plt.show()

The plot shows a significantly higher fraud rate for online transactions compared to in-person transactions. This highlights the increased vulnerability of online transactions, likely due to the absence of physical verification methods such as chip-and-pin or signature validation, making them a primary target for fraudsters.

Correlation Analysis

Correlation analysis helps identify relationships between numerical features, revealing patterns or redundancies that can impact model performance. Understanding these correlations is crucial for feature engineering and model selection.

Steps and Insights

Create a correlation matrix for all numerical features to identify relationships.
Use a heatmap to visualize correlations and highlight strongly correlated features (e.g., correlation > 0.8 or < -0.8).

# Correlation matrix
corr_matrix = transaction_df.corr()

# Plot heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix Heatmap")
plt.show()

The heatmap shows that most features have low or negligible correlations with one another, indicating minimal multicollinearity. Notably, ratio_to_median_purchase_price has the strongest positive correlation with fraud (0.46), followed by distance_from_home (0.19) and online_order (0.19). These insights suggest that these features may be significant predictors of fraud, while others like used_chip and repeat_retailer exhibit little to no direct correlation with fraud.
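
Reading the target column off a heatmap by eye is error-prone, so it can help to rank features by the absolute value of their correlation with fraud. The sketch below uses made-up data with generic feature names (feature_strong, feature_negative, feature_weak are hypothetical); on the real data, the same two lines would be applied to transaction_df.

```python
import pandas as pd

# Made-up data with one strongly, one moderately, and one weakly related feature.
demo_df = pd.DataFrame({
    "feature_strong":   [0.1, 0.2, 0.1, 0.9, 0.8, 0.9],
    "feature_negative": [5, 6, 4, 3, 5, 2],
    "feature_weak":     [1, 0, 1, 0, 1, 0],
    "fraud":            [0, 0, 0, 1, 1, 1],
})

# Correlation of every feature with the target, excluding the target itself.
target_corr = demo_df.corr()["fraud"].drop("fraud")

# Sort by magnitude so strong negative correlations are not pushed to the bottom.
ranked = target_corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Sorting by magnitude matters because a strongly negative correlation is as informative as a strongly positive one.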

Feature Distribution Analysis

The goal is to understand the distribution of key features and how they vary between fraudulent and legitimate transactions. This involves plotting histograms or KDE (Kernel Density Estimation) plots for relevant features such as distance_from_home. Using hue-based separation in these plots allows for a comparison of distributions for fraud values (0 vs. 1). Through this analysis, it becomes possible to detect anomalies, skewness, or outliers in the data, as well as gain insights into how these features differ between fraudulent and non-fraudulent transactions.

# KDE plot for distance_from_home by fraud with adjusted x-axis range
plt.figure(figsize=(8, 6))
sns.kdeplot(
    data=transaction_df,
    x="distance_from_home",
    hue="fraud",
    fill=True,
    common_norm=False,
    palette="coolwarm"
)
plt.title("Distance from Home Distribution by Fraud")
plt.xlabel("Distance from Home")
plt.ylabel("Density")
plt.xlim(0, 1000)  # Adjust x-axis range
plt.show()

The adjusted plot shows that most transactions, both fraudulent (orange) and legitimate (blue), occur within a short distance from home, with densities peaking near zero. Fraudulent transactions exhibit a slightly higher density closer to home compared to legitimate transactions, suggesting that fraudsters might attempt to mimic typical spending behavior by conducting transactions near the cardholder's residence. Beyond 200 units, the density for both categories drops significantly, highlighting that long-distance transactions are less common and might not be strong indicators of fraud.
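
Clipping the x-axis is one way to handle a long right tail. Another option, not used in the original analysis, is a log transform. The sketch below draws made-up, heavily skewed distances from a lognormal distribution (an assumption for illustration only) and compares the skewness before and after applying log1p.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed distances; the real feature would be
# transaction_df["distance_from_home"].
rng = np.random.default_rng(42)
distance = pd.Series(rng.lognormal(mean=2.0, sigma=1.5, size=5000))

# Skewness before and after a log(1 + x) transform.
skew_raw = distance.skew()
skew_log = np.log1p(distance).skew()

print(f"Skewness before log1p: {skew_raw:.2f}")
print(f"Skewness after log1p:  {skew_log:.2f}")
```

log1p is used instead of a plain logarithm so that a distance of exactly zero remains valid.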

Pairplot for Fraudulent vs. Legitimate Transactions

A pairplot provides a visual exploration of relationships between multiple features, allowing us to compare patterns between fraudulent and legitimate transactions. By coloring data points based on the fraud feature (0 for legitimate, 1 for fraudulent), we can observe clusters, overlaps, or distinct patterns.

Steps and Insights

The pair plot focuses on key numerical features such as distance_from_home, distance_from_last_transaction, and ratio_to_median_purchase_price. It helps highlight how these features interact with each other and whether fraudulent transactions form distinct patterns.

Code Example

# Select a subset of features for the pairplot
features = ['distance_from_home', 'distance_from_last_transaction', 
            'ratio_to_median_purchase_price', 'fraud']

# Create the pairplot
sns.pairplot(transaction_df[features], hue='fraud', diag_kind='kde', palette={0: "blue", 1: "red"})
plt.suptitle("Pairplot of Features by Fraud", y=1.02)
plt.show()

The pairplot reveals that most transactions, both fraudulent (red) and legitimate (blue), are concentrated near lower values for all features (distance_from_home, distance_from_last_transaction, and ratio_to_median_purchase_price). There is significant overlap between the two classes, indicating that these features alone may not be sufficient for clear fraud detection. However, fraudulent transactions exhibit slightly higher values in ratio_to_median_purchase_price, suggesting that this feature may have some discriminatory power. Additional feature interactions or engineering may be required to better distinguish fraud.
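
Pairplots draw every point for every pair of features, which can be slow on a large dataset. One common workaround is to plot a stratified sample that keeps the fraud proportion intact. The sketch below is an illustration on made-up data; the 10% fraction is an arbitrary choice, not a value from the original analysis.

```python
import pandas as pd

# Made-up frame: 1,000 rows, 10% of them fraudulent.
demo_df = pd.DataFrame({
    "distance_from_home": range(1000),
    "fraud": [1] * 100 + [0] * 900,
})

# Sample the same fraction from each class so the class ratio is preserved.
plot_sample = demo_df.groupby("fraud").sample(frac=0.1, random_state=42)

print(len(plot_sample))           # number of sampled rows
print(plot_sample["fraud"].mean())  # fraud share in the sample
```

The sample can then be passed to sns.pairplot in place of the full DataFrame.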

Fraud Proportions and Class Imbalance

Fraudulent transactions often constitute a small fraction of the overall data, leading to a significant class imbalance. Understanding the distribution of fraud values in the dataset is critical for designing effective machine learning models.

Steps and Insights

The proportion of fraudulent transactions is calculated to evaluate the dataset's imbalance. This imbalance poses challenges for classification models, which might be biased toward predicting the majority class (legitimate transactions).

# Calculate fraud proportions
fraud_rate = transaction_df['fraud'].mean() * 100
print(f"Percentage of Fraudulent Transactions: {fraud_rate:.2f}%")

Percentage of Fraudulent Transactions: 8.74%

The dataset reveals that only 8.74% of transactions are fraudulent, indicating a significant class imbalance. This imbalance can pose challenges for machine learning models, as they may become biased toward predicting the majority class (legitimate transactions). Addressing this imbalance through techniques such as oversampling (e.g., SMOTE), undersampling, or using class weights during model training will be crucial to ensure accurate fraud detection.
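
Class weights, mentioned above as one option, can be computed directly from the class counts. The sketch below uses the "balanced" heuristic n_samples / (n_classes * count), which is the formula scikit-learn documents for class_weight='balanced'. The 10,000-row sample is hypothetical and only mirrors the 8.74% fraud rate. It also shows why plain accuracy is a poor yardstick here.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}


# Hypothetical sample of 10,000 transactions with an 8.74% fraud rate.
counts = {0: 9126, 1: 874}
weights = balanced_class_weights(counts)

# Accuracy of a model that always predicts "legitimate".
baseline_accuracy = counts[0] / sum(counts.values())

print(f"Weight for legitimate (0): {weights[0]:.4f}")
print(f"Weight for fraud (1):      {weights[1]:.4f}")
print(f"Always-legitimate accuracy: {baseline_accuracy:.4f}")
```

A model that never flags fraud would still be about 91% accurate on such data, which is why precision, recall, and F1-score are the more informative metrics.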

5. Feature Engineering and Selection

Feature engineering and selection are critical steps in preparing the dataset for building an effective fraud detection model. These processes involve refining the data, creating new features where necessary, and identifying the most relevant features to optimize model performance.

Feature Selection with Random Forest

To identify the most important features, we use Random Forest's feature importance scores. This method ranks features based on their contribution to reducing impurity during the model's decision-making process. By selecting the top-ranked features, we can simplify the model, reduce overfitting, and improve interpretability.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Define features and target variable
X = transaction_df.drop(columns=['fraud'])
y = transaction_df['fraud']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest classifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Extract feature importance
importances = rf.feature_importances_
feature_names = X.columns
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances}).sort_values(by='Importance', ascending=False)

# Display feature importance
print(feature_importance_df)

                          Feature  Importance
2  ratio_to_median_purchase_price    0.527171
6                    online_order    0.169382
0              distance_from_home    0.134910
5                 used_pin_number    0.063928
4                       used_chip    0.052078
1  distance_from_last_transaction    0.045711
3                 repeat_retailer    0.006820

The feature importance scores indicate that ratio_to_median_purchase_price is the most significant predictor of fraud, accounting for about 52.7% of the total impurity-based importance. Other important features include online_order (16.9%) and distance_from_home (13.5%), which also show a strong relationship with fraudulent transactions. Less influential features, such as repeat_retailer (0.68%) and distance_from_last_transaction (4.6%), may be considered for removal or further analysis to streamline the model.
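
One simple way to turn the ranking into a selection rule is to keep features until their cumulative importance passes a threshold. The sketch below uses the importance scores printed above; the helper name and the 0.95 threshold are assumptions chosen for illustration, not part of the original project.

```python
# Importance scores copied from the output above.
importances = {
    "ratio_to_median_purchase_price": 0.527171,
    "online_order": 0.169382,
    "distance_from_home": 0.134910,
    "used_pin_number": 0.063928,
    "used_chip": 0.052078,
    "distance_from_last_transaction": 0.045711,
    "repeat_retailer": 0.006820,
}


def select_by_cumulative_importance(scores, threshold):
    """Keep the top-ranked features until their summed importance reaches the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    selected, running_total = [], 0.0
    for name, score in ranked:
        selected.append(name)
        running_total += score
        if running_total >= threshold:
            break
    return selected


# Hypothetical threshold: keep features covering at least 95% of total importance.
selected_features = select_by_cumulative_importance(importances, threshold=0.95)
dropped_features = [name for name in importances if name not in selected_features]

print("Selected:", selected_features)
print("Dropped:", dropped_features)
```

With this threshold, only repeat_retailer is dropped, since the top five features sum to about 0.947 and the sixth is needed to pass 0.95.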

6. Data Preprocessing

Data preprocessing is a critical step to ensure the dataset is ready for building robust machine learning models. This stage involves addressing class imbalance, scaling numerical features, and encoding categorical variables where necessary.

Addressing Class Imbalance

Given that only 8.74% of the transactions are fraudulent, the dataset is highly imbalanced. To mitigate this, we use Synthetic Minority Oversampling Technique (SMOTE) to oversample the minority class (fraudulent transactions). This method generates synthetic samples by interpolating between existing ones, effectively increasing the representation of the minority class without introducing duplicates.

from imblearn.over_sampling import SMOTE

# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

# Check class distribution after SMOTE
print("Class Distribution after SMOTE:")
print(pd.Series(y_train_balanced).value_counts())

Class Distribution after SMOTE:
0.0    730040
1.0    730040
Name: fraud, dtype: int64

After applying SMOTE, the class distribution is balanced, with 730,040 samples each for legitimate (0.0) and fraudulent (1.0) transactions. This ensures that the model has an equal representation of both classes during training, improving its ability to detect fraud.
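
To make the interpolation idea concrete, here is a simplified sketch of how a synthetic sample can be generated between a minority point and one of its nearest minority neighbours. It illustrates the principle only and is not imbalanced-learn's implementation; the function name, the choice of k, and the four made-up points are assumptions.

```python
import numpy as np


def interpolate_minority(X_minority, n_new, k=2, seed=0):
    """Create synthetic points on segments between minority samples (simplified sketch)."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        # Distances from the chosen sample to every minority sample.
        dists = np.linalg.norm(X_minority - X_minority[i], axis=1)
        # Nearest neighbours; position 0 is skipped because it is the sample itself
        # (assuming there are no exact duplicate points).
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        # Random position along the segment from sample i to neighbour j.
        lam = rng.random()
        synthetic.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)


# Four made-up minority samples with two features each.
minority_points = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
new_points = interpolate_minority(minority_points, n_new=5)
print(new_points)
```

Because each synthetic point is a weighted average of two real minority samples, it always lies within the range already covered by the minority class.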

7. Model Building and Comparison

In this step, we train and evaluate multiple machine learning models to identify the most effective algorithm for detecting fraudulent transactions. The models include Logistic Regression, Support Vector Machine (SVM), and Random Forest, each bringing unique strengths to the task. Their performance is assessed using metrics such as precision, recall, F1-score, and ROC-AUC, which are particularly relevant for imbalanced datasets.

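from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Scale the features: fit the scaler on the balanced training data only,
# then apply the same transformation to the test data.
# (Assumption: scaling is applied after SMOTE, matching the variable names below.)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_balanced)
X_test_scaled = scaler.transform(X_test)

# Train models on the scaled and balanced training data
log_reg = LogisticRegression(random_state=42)
svm = SVC(probability=True, random_state=42)
random_forest = RandomForestClassifier(random_state=42)

log_reg.fit(X_train_scaled, y_train_balanced)
svm.fit(X_train_scaled, y_train_balanced)
random_forest.fit(X_train_scaled, y_train_balanced)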

Evaluating Model Performance

We evaluate the models on the test set using metrics such as precision, recall, F1-score, and ROC-AUC. These metrics provide insights into how well the models handle the imbalanced dataset, particularly their ability to correctly identify fraudulent transactions.

from sklearn.metrics import classification_report, roc_auc_score

# Evaluate models on the test set
models = {'Logistic Regression': log_reg, 'SVM': svm, 'Random Forest': random_forest}

for name, model in models.items():
    y_pred = model.predict(X_test_scaled)
    y_prob = model.predict_proba(X_test_scaled)[:, 1] if hasattr(model, 'predict_proba') else None
    print(f"{name} Classification Report:\n", classification_report(y_test, y_pred))
    if y_prob is not None:
        print(f"{name} ROC-AUC Score: {roc_auc_score(y_test, y_prob):.2f}")

Output:

Logistic Regression:

Evaluating model: Logistic Regression
Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.82      0.85    146008
           1       0.84      0.90      0.87    146008

    accuracy                           0.86    292016
   macro avg       0.86      0.86      0.86    292016
weighted avg       0.86      0.86      0.86    292016

Logistic Regression ROC-AUC Score: 0.92

SVM:

Evaluating model: SVM
SVM Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.83      0.85    146008
           1       0.84      0.88      0.86    146008

    accuracy                           0.86    292016
   macro avg       0.86      0.86      0.86    292016
weighted avg       0.86      0.86      0.86    292016

SVM ROC-AUC Score: 0.91

Random Forest:

Evaluating model: Random Forest
Random Forest Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.89      0.90    146008
           1       0.89      0.91      0.90    146008

    accuracy                           0.90    292016
   macro avg       0.90      0.90      0.90    292016
weighted avg       0.90      0.90      0.90    292016

Random Forest ROC-AUC Score: 0.95
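
Each report above lists the same support (146,008) for both classes. Precision depends on how common fraud is in the evaluated data, so it is worth checking what the same per-class recall values would imply at the dataset's 8.74% fraud rate. The sketch below applies Bayes' rule to the Random Forest figures; it assumes the per-class recall carries over unchanged to the imbalanced setting, which has not been verified here.

```python
def precision_at_prevalence(recall_fraud, recall_legit, prevalence):
    """Precision for the fraud class given per-class recall and the fraud prevalence."""
    true_positives = recall_fraud * prevalence
    false_positives = (1.0 - recall_legit) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Random Forest recall values from the report above: 0.91 (fraud) and 0.89 (legitimate).
balanced = precision_at_prevalence(0.91, 0.89, prevalence=0.5)
realistic = precision_at_prevalence(0.91, 0.89, prevalence=0.0874)

print(f"Precision with balanced classes: {balanced:.2f}")   # 0.89
print(f"Precision at 8.74% fraud rate:   {realistic:.2f}")  # 0.44
```

Under that assumption, fewer than half of the flagged transactions would be actual fraud at the original class proportions, so the reported precision should be read as specific to a balanced evaluation set.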

8. Comparing Model Performance

The evaluation results are summarized below for Logistic Regression, Support Vector Machine (SVM), and Random Forest, using precision, recall, F1-score, and ROC-AUC. Note that each report lists equal support for the two classes (146,008 rows each), so these scores describe a class-balanced evaluation set. On data with the original 8.74% fraud rate, precision for the fraud class in particular can be expected to be lower.

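Model Performance Summary

Model                 Precision (fraud)  Recall (fraud)  F1-score (fraud)  ROC-AUC
Logistic Regression   0.84               0.90            0.87              0.92
SVM                   0.84               0.88            0.86              0.91
Random Forest         0.89               0.91            0.90              0.95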

Key Insights

Logistic Regression:
- Achieves good precision (0.84) and excellent recall (0.90), indicating it performs well in detecting fraudulent transactions.
- However, it slightly underperforms compared to Random Forest in overall metrics.
SVM:
- Balances precision (0.84) and recall (0.88) well, resulting in a solid F1-score (0.86).
- It is computationally more expensive than Logistic Regression and Random Forest, especially for larger datasets.
Random Forest:
- Outperforms both Logistic Regression and SVM, achieving the highest F1-score (0.90) and ROC-AUC (0.95).
- Its high precision (0.89) and recall (0.91) indicate it effectively balances identifying fraudulent transactions while minimizing false positives.
- Given its superior performance, Random Forest is selected as the best model for this project.

Conclusion

Based on the evaluation, Random Forest emerges as the most effective model for detecting fraudulent transactions, excelling in precision, recall, and ROC-AUC. Its ability to handle the imbalanced dataset and provide consistent, high-quality predictions makes it the ideal choice for further optimization and deployment.

9. Building a Machine Learning Pipeline

Building a machine learning pipeline automates and streamlines the workflow by combining all key steps—data preprocessing, feature scaling, and model training—into a single, reusable process. This approach ensures consistency and makes it easier to scale the solution for deployment or retraining on new data.

What Is a Machine Learning Pipeline?

A pipeline in machine learning organizes sequential operations such as preprocessing, feature engineering, and model training into a single object. Using scikit-learn's Pipeline (or imbalanced-learn's compatible Pipeline when a resampling step such as SMOTE is included), we can:

Automate repetitive processes.
Reduce errors by ensuring each step is executed consistently.
Simplify hyperparameter tuning by integrating the entire workflow.

Implementing the Pipeline

We will create a pipeline for the best-performing model (Random Forest) based on earlier evaluations. The pipeline will include:

Scaling: Standardizing numerical features to improve model performance.
Handling Class Imbalance: Using SMOTE to balance the dataset.
Model Training: Training a Random Forest model.

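# Use imbalanced-learn's Pipeline: scikit-learn's Pipeline rejects SMOTE as an
# intermediate step because SMOTE resamples (fit_resample) instead of transforming.
from imblearn.pipeline import Pipeline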
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import cross_val_score

# Define the pipeline steps
pipeline = Pipeline([
    ('scaler', StandardScaler()),                # Step 1: Scaling
    ('smote', SMOTE(random_state=42)),           # Step 2: Handling class imbalance
    ('model', RandomForestClassifier(random_state=42))  # Step 3: Random Forest
])

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Evaluate the pipeline using cross-validation
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1')
print(f"Cross-Validation F1 Scores: {cv_scores}")
print(f"Mean F1 Score: {cv_scores.mean():.2f}")

Cross-Validation F1 Scores: [0.89, 0.90, 0.88, 0.91, 0.89]
Mean F1 Score: 0.89

Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.89      0.90    146008
           1       0.89      0.91      0.90    146008

    accuracy                           0.90    292016
   macro avg       0.90      0.90      0.90    292016
weighted avg       0.90      0.90      0.90    292016

Extending the Pipeline

For further improvements, additional steps can be integrated, such as:

Feature Selection: Automatically selecting the most important features.
Dimensionality Reduction: Using PCA or similar techniques for high-dimensional data.
Hyperparameter Tuning: Integrating Grid Search or Randomized Search for pipeline optimization.

The machine learning pipeline ensures an efficient and scalable process for fraud detection. By integrating preprocessing, SMOTE, and Random Forest training into a single workflow, it simplifies both model development and future deployment. This pipeline can now be deployed or further refined to handle real-world data.
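
As an illustration of the hyperparameter tuning extension, the sketch below wraps a pipeline in GridSearchCV. It is a minimal example under several assumptions: the data is synthetic (make_classification, not the fraud dataset), class_weight="balanced" stands in for SMOTE so that only scikit-learn is required, and the parameter grid is deliberately tiny.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (roughly 10% positives).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, weights=[0.9, 0.1], random_state=42
)

search_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    # class_weight="balanced" replaces SMOTE in this sketch.
    ("model", RandomForestClassifier(class_weight="balanced", random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter name>.
param_grid = {
    "model__n_estimators": [20, 50],
    "model__max_depth": [3, None],
}

# Each candidate is refit on every training fold, including the scaler.
search = GridSearchCV(search_pipeline, param_grid, scoring="f1", cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
```

Because the whole pipeline is refit inside each fold, preprocessing never sees the validation data, which is the main reason to tune the pipeline rather than the model alone.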

10. Conclusion and Next Steps

Conclusion

This project successfully developed a machine learning-based fraud detection system using a comprehensive pipeline approach. By analyzing a large-scale transaction dataset, the Random Forest model was identified as the best-performing algorithm, achieving an F1-score of 0.90 and an ROC-AUC score of 0.95 on the test set. The model demonstrated a robust ability to detect fraudulent transactions while minimizing false positives, making it a practical solution for real-world deployment.

The business impact of this system is significant. Early and accurate fraud detection helps financial institutions reduce financial losses, enhance customer trust, and maintain regulatory compliance. With the ability to identify high-risk transactions promptly, businesses can focus resources on investigating genuine fraud cases, improving operational efficiency, and providing a seamless customer experience by avoiding unnecessary disruptions for legitimate users.

Next Steps

Deploy the Model:
- Integrate the pipeline into the organization’s transaction processing system to flag high-risk transactions in real-time.
- Develop an interface or API for the fraud detection system, ensuring smooth communication with existing banking platforms.
Optimize for Real-Time Use:
- Test the model on live transaction data to assess performance in a real-time environment.
- Explore model compression techniques to reduce latency during predictions.
Fine-Tune the Pipeline:
- Continuously monitor the pipeline’s performance with fresh transaction data.
- Periodically retrain the model to adapt to evolving fraud patterns and behaviors.
Enhance Security and Features:
- Incorporate additional data sources, such as device IDs, geolocation, and user behavior analytics, to improve detection accuracy.
- Implement advanced techniques like anomaly detection or ensemble methods for further refinement.
Expand Business Impact:
- Apply the system to related areas, such as e-commerce fraud detection or credit risk management.
- Develop reports and dashboards to provide actionable insights for fraud investigation teams and stakeholders.
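
As a starting point for the deployment step, a scoring function might validate the incoming fields, ask the model for a fraud probability, and apply a decision threshold. The sketch below is hypothetical throughout: score_transaction, the 0.5 threshold, and especially stand_in_model (a hard-coded rule used only so the example runs without a trained model) are illustrative assumptions, not the project's implementation.

```python
# Feature order must match the columns the model was trained on.
FEATURE_ORDER = [
    "distance_from_home",
    "distance_from_last_transaction",
    "ratio_to_median_purchase_price",
    "repeat_retailer",
    "used_chip",
    "used_pin_number",
    "online_order",
]


def score_transaction(transaction, predict_fraud_probability, threshold=0.5):
    """Return the fraud probability and a flag for one transaction (hypothetical helper)."""
    missing = [name for name in FEATURE_ORDER if name not in transaction]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    features = [float(transaction[name]) for name in FEATURE_ORDER]
    probability = predict_fraud_probability(features)
    return {"fraud_probability": probability, "flagged": probability >= threshold}


def stand_in_model(features):
    """Placeholder rule, NOT the trained model: online, no PIN, and a high price ratio."""
    ratio, used_pin, online = features[2], features[5], features[6]
    return 0.9 if (online == 1 and used_pin == 0 and ratio > 4) else 0.1


risky = {
    "distance_from_home": 120.0, "distance_from_last_transaction": 3.0,
    "ratio_to_median_purchase_price": 6.5, "repeat_retailer": 1,
    "used_chip": 0, "used_pin_number": 0, "online_order": 1,
}
routine = dict(risky, ratio_to_median_purchase_price=0.8, used_pin_number=1, online_order=0)

risky_result = score_transaction(risky, stand_in_model)
routine_result = score_transaction(routine, stand_in_model)
print(risky_result)
print(routine_result)
```

With a fitted pipeline, the stand-in could be replaced by a function that calls the pipeline's predict_proba on a single row and returns the probability of the fraud class. The threshold would then be chosen from the precision and recall trade-off rather than fixed at 0.5.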

Appendices

Code: https://github.com/Minhhoang2606/Credit-card-fraud-detection-by-Machine-learning-project

Data source: https://www.kaggle.com/datasets/dhanushnarayananr/credit-card-fraud

Credit Card Fraud Prevention with Machine Learning

Henry Ha
