Bank Marketing Classifier Comparison – A Machine Learning Project

Introduction
This project explores the performance of various classifiers—K Nearest Neighbor, Logistic Regression, Decision Trees, and Support Vector Machines (SVM)—to predict whether a client will subscribe to a term deposit using the Bank Marketing Dataset.
You can check out the work at: https://github.com/kushchhabra0/BankMarketingClassifer
Dataset
We use the Bank Marketing Dataset from the UCI Machine Learning Repository.
This dataset contains data related to direct marketing campaigns of a Portuguese banking institution. The goal is to predict whether a client will subscribe to a term deposit.
Usage
Open and run Bank_Marketing_Classifier_Comparison.ipynb. It loads the dataset and walks step-by-step through:
Preprocessing
Model building
Evaluation
Visualization
Data Understanding and Preprocessing
Feature Engineering:
Handle missing values
Encode categorical variables
Scale numerical features
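The encoding and scaling steps above can be sketched with a scikit-learn ColumnTransformer. This is a minimal illustration on a tiny made-up frame, not the notebook's actual pipeline; the column names are just a sample of the dataset's features:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative frame; the real dataset has thousands of rows and many more columns.
df = pd.DataFrame({
    "age": [30, 45, 52, 28],
    "job": ["admin.", "technician", "management", "admin."],
    "marital": ["married", "single", "married", "single"],
})

# Scale the numeric column, one-hot encode the categorical ones.
# handle_unknown="ignore" keeps transform() safe on unseen categories.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["job", "marital"]),
])

X = preprocessor.fit_transform(df)
print(X.shape)  # 1 scaled numeric column + 5 one-hot columns -> (4, 6)
```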
Train/Test Split:
Split dataset into training and testing subsets.
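Because far more clients say "no" than "yes", a stratified split keeps the class ratio stable in both subsets. A small sketch with synthetic data standing in for the bank features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative feature matrix and an imbalanced target (mostly "no" = 0).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 89 + [1] * 11)

# stratify=y ensures the rare "yes" class appears in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 80 20
```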
Baseline Model:
Use the most frequent class to set a baseline accuracy.
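A most-frequent-class baseline can be expressed directly with scikit-learn's DummyClassifier; the 89/11 label split below is illustrative, roughly mirroring the dataset's imbalance:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Imbalanced labels mimicking the dataset: ~89% "no" (0), ~11% "yes" (1).
y = np.array([0] * 89 + [1] * 11)
X = np.zeros((100, 1))  # features are irrelevant to this baseline

# Always predicts the majority class, so accuracy equals its frequency.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
print(baseline.score(X, y))  # 0.89
```

Any real model must beat this number to be worth its training cost, which is why the baseline accuracy of 0.8876 reported below matters.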
Logistic Regression:
Implement logistic regression and evaluate its performance.
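A minimal logistic-regression sketch, with make_classification standing in for the preprocessed bank features (the real notebook fits on the actual dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced stand-in for the preprocessed features.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.89], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Scaling inside the pipeline keeps test data out of the fitting step.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(round(clf.score(X_test, y_test), 3))
```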
Model Comparisons:
Compare Logistic Regression, KNN, Decision Tree, and SVM using:
Accuracy
Training time
Visualizations
To understand relationships in the dataset, we used:
Age distribution plot
Box plot of Age by Subscription Status
Categorical plots:
Job distribution
Marital status
Education levels
Subscription rates by category
Model Comparison
We compare the performance of:
K Nearest Neighbors (KNN)
Logistic Regression
Decision Tree
Support Vector Machine (SVM)
⚠️ Note: It takes approximately 5–7 minutes to run the model comparison part.
| Model | Train Time (s) | Train Accuracy | Test Accuracy |
| --- | --- | --- | --- |
| KNN | 0.0716 | 0.9143 | 0.8881 |
| Logistic Regression | 0.1950 | 0.9013 | 0.8971 |
| Decision Tree | 0.2355 | 0.9954 | 0.8361 |
| SVM | 34.3415 | 0.9048 | 0.8969 |
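The comparison loop that produces numbers like those above can be sketched as follows; synthetic data replaces the bank features, so the exact timings and accuracies will differ from the table:

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}

# Time only the fit; score on both splits to expose overfitting
# (note the Decision Tree's train/test gap in the table above).
results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    results[name] = {
        "train_time": time.perf_counter() - start,
        "train_acc": model.score(X_train, y_train),
        "test_acc": model.score(X_test, y_test),
    }
    print(name, results[name])
```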
Outcomes
Baseline Accuracy: 0.8876
Logistic Regression
Train Accuracy: 0.9013
Test Accuracy: 0.8971
Best Results
Best model overall: Logistic Regression
Best tuned Decision Tree achieved:
Max Depth: 5
Min Samples Split: 2
CV Score: 0.9012
Test Accuracy: 0.8967
Improving the Model
We used hyperparameter tuning to improve model performance:
Best parameters found:
{'classifier__max_depth': 5, 'classifier__min_samples_split': 2}
Best cross-validation score:
0.9012
Best Decision Tree Test Accuracy after tuning:
0.8967
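The parameter names reported above (prefixed with "classifier__") come from wrapping the model in a pipeline whose step is named "classifier". A minimal GridSearchCV sketch on synthetic data, assuming that pipeline layout:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# The step name "classifier" is what produces keys like
# "classifier__max_depth" in best_params_.
pipe = Pipeline([("classifier", DecisionTreeClassifier(random_state=0))])

grid = GridSearchCV(
    pipe,
    param_grid={
        "classifier__max_depth": [3, 5, 7, None],
        "classifier__min_samples_split": [2, 5, 10],
    },
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 4))
```

Capping max_depth is what reins in the overfitting seen in the untuned tree: a shallower tree trades some training accuracy for better generalization.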
Conclusion
In this project, we evaluated several machine learning models to solve a binary classification problem using the Bank Marketing Dataset. After preprocessing, visual exploration, and modeling:
Logistic Regression delivered the best balance between accuracy and training time.
SVM performed similarly in terms of test accuracy but took significantly longer to train.
Decision Trees, while achieving high training accuracy, suffered from overfitting.
Through hyperparameter tuning, the Decision Tree's test accuracy improved from 0.8361 to 0.8967, nearly matching Logistic Regression.
→ This shows the importance of model tuning and choosing the right algorithm based on dataset characteristics and computational resources.
Project Members
Kavish Jain (23UCS614)
Kshitij Sharma (23UCS628)
Kushal Chhabra (23UCS630)