Bank Marketing Classifier Comparison – A Machine Learning Project

Introduction
This project explores the performance of various classifiers—K Nearest Neighbor, Logistic Regression, Decision Trees, and Support Vector Machines (SVM)—to predict whether a client will subscribe to a term deposit using the Bank Marketing Dataset.
You can check out the work at: https://github.com/kushchhabra0/BankMarketingClassifer
Dataset
We use the Bank Marketing Dataset from the UCI Machine Learning Repository.
This dataset contains data related to direct marketing campaigns of a Portuguese banking institution. The goal is to predict whether a client will subscribe to a term deposit.
Usage
Open and run Bank_Marketing_Classifier_Comparison.ipynb. It loads the dataset and walks step-by-step through:
Preprocessing
Model building
Evaluation
Visualization
Data Understanding and Preprocessing
Feature Engineering:
Handle missing values
Encode categorical variables
Scale numerical features
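The encoding and scaling steps above can be sketched with a scikit-learn ColumnTransformer. This is a minimal illustration on a tiny made-up frame, not the notebook's actual pipeline; the column names are just a sample of the dataset's features:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative frame; the real dataset has thousands of rows and many more columns.
df = pd.DataFrame({
    "age": [30, 45, 52, 28],
    "job": ["admin.", "technician", "management", "admin."],
    "marital": ["married", "single", "married", "single"],
})

# Scale the numeric column, one-hot encode the categorical ones.
# handle_unknown="ignore" keeps transform() safe on unseen categories.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["job", "marital"]),
])

X = preprocessor.fit_transform(df)
print(X.shape)  # 1 scaled numeric column + 5 one-hot columns -> (4, 6)
```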
Train/Test Split:
Split dataset into training and testing subsets.
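Because far more clients say "no" than "yes", a stratified split keeps the class ratio stable in both subsets. A small sketch with synthetic data standing in for the bank features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative feature matrix and an imbalanced target (mostly "no" = 0).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 89 + [1] * 11)

# stratify=y ensures the rare "yes" class appears in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 80 20
```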
Baseline Model:
Use the most frequent class to set a baseline accuracy.
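A most-frequent-class baseline can be expressed directly with scikit-learn's DummyClassifier; the 89/11 label split below is illustrative, roughly mirroring the dataset's imbalance:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Imbalanced labels mimicking the dataset: ~89% "no" (0), ~11% "yes" (1).
y = np.array([0] * 89 + [1] * 11)
X = np.zeros((100, 1))  # features are irrelevant to this baseline

# Always predicts the majority class, so accuracy equals its frequency.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
print(baseline.score(X, y))  # 0.89
```

Any real model must beat this number to be worth its training cost, which is why the baseline accuracy of 0.8876 reported below matters.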
Logistic Regression:
Implement logistic regression and evaluate its performance.
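A minimal logistic-regression sketch, with make_classification standing in for the preprocessed bank features (the real notebook fits on the actual dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced stand-in for the preprocessed features.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.89], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Scaling inside the pipeline keeps test data out of the fitting step.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(round(clf.score(X_test, y_test), 3))
```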
Model Comparisons:
Compare Logistic Regression, KNN, Decision Tree, and SVM using:
Accuracy
Training time
Visualizations
To understand relationships in the dataset, we used:
Age distribution plot
Box plot of Age by Subscription Status
Categorical plots:
Job distribution
Marital status
Education levels
Subscription rates by category
Model Comparison
We compare the performance of:
K Nearest Neighbors (KNN)
Logistic Regression
Decision Tree
Support Vector Machine (SVM)
⚠️ Note: It takes approximately 5–7 minutes to run the model comparison part.
| Model | Train Time (s) | Train Accuracy | Test Accuracy |
| --- | --- | --- | --- |
| KNN | 0.0716 | 0.9143 | 0.8881 |
| Logistic Regression | 0.1950 | 0.9013 | 0.8971 |
| Decision Tree | 0.2355 | 0.9954 | 0.8361 |
| SVM | 34.3415 | 0.9048 | 0.8969 |
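The comparison loop that produces numbers like those above can be sketched as follows; synthetic data replaces the bank features, so the exact timings and accuracies will differ from the table:

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}

# Time only the fit; score on both splits to expose overfitting
# (note the Decision Tree's train/test gap in the table above).
results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    results[name] = {
        "train_time": time.perf_counter() - start,
        "train_acc": model.score(X_train, y_train),
        "test_acc": model.score(X_test, y_test),
    }
    print(name, results[name])
```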
Outcomes
Baseline Accuracy: 0.8876
Logistic Regression
Train Accuracy: 0.9013
Test Accuracy: 0.8971
Best Results
Best model overall: Logistic Regression
Best tuned Decision Tree achieved:
Max Depth: 5
Min Samples Split: 2
CV Score: 0.9012
Test Accuracy: 0.8967
Improving the Model
We used hyperparameter tuning to improve model performance:
Best parameters found:
{'classifier__max_depth': 5, 'classifier__min_samples_split': 2}
Best cross-validation score:
0.9012
Best Decision Tree Test Accuracy after tuning:
0.8967
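The parameter names reported above (prefixed with "classifier__") come from wrapping the model in a pipeline whose step is named "classifier". A minimal GridSearchCV sketch on synthetic data, assuming that pipeline layout:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# The step name "classifier" is what produces keys like
# "classifier__max_depth" in best_params_.
pipe = Pipeline([("classifier", DecisionTreeClassifier(random_state=0))])

grid = GridSearchCV(
    pipe,
    param_grid={
        "classifier__max_depth": [3, 5, 7, None],
        "classifier__min_samples_split": [2, 5, 10],
    },
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 4))
```

Capping max_depth is what reins in the overfitting seen in the untuned tree: a shallower tree trades some training accuracy for better generalization.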
Conclusion
In this project, we evaluated several machine learning models to solve a binary classification problem using the Bank Marketing Dataset. After preprocessing, visual exploration, and modeling:
Logistic Regression delivered the best balance between accuracy and training time.
SVM performed similarly in terms of test accuracy but took significantly longer to train.
Decision Trees, while achieving high training accuracy, suffered from overfitting.
Through hyperparameter tuning, the Decision Tree's test accuracy improved from 0.8361 to 0.8967, nearly matching Logistic Regression.
→ This shows the importance of model tuning and choosing the right algorithm based on dataset characteristics and computational resources.
Project Members
Kavish Jain (23UCS614)
Kshitij Sharma (23UCS628)
Kushal Chhabra (23UCS630)