Building a Bagging Model: A Complete Guide from Data to Predictions


1. Getting the Jargon Out of the Way
Two technical terms come up throughout this guide: Bagging and Pipeline.
Before we dive into the details, let’s quickly understand what each means, so that the article stays beginner friendly. Feel free to skip this part if you already know these terms.
🔹 Bagging (Bootstrap Aggregating)
Bagging is an ensemble learning technique in machine learning where multiple models are trained on different subsets of the training data (created by sampling with replacement). The predictions of these models are then combined — by majority voting for classification or averaging for regression. The main goal is to reduce variance, avoid overfitting, and improve model stability.
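To make "sampling with replacement" concrete, here is a minimal sketch that draws one bootstrap sample from a set of row indices. The row count and the seed are arbitrary assumptions for illustration. On average, roughly 63% of the original rows end up in a bootstrap sample of the same size; the rest are left out ("out-of-bag").

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
n_rows = 1000                    # assumed size of the training set
row_ids = np.arange(n_rows)      # stand-in for the training row indices

# One bootstrap sample: draw n_rows indices WITH replacement.
bootstrap_ids = rng.choice(row_ids, size=n_rows, replace=True)

# Some rows appear several times, others not at all ("out-of-bag" rows).
unique_fraction = np.unique(bootstrap_ids).size / n_rows
oob_ids = np.setdiff1d(row_ids, bootstrap_ids)

print(f"distinct rows in the sample: {unique_fraction:.1%}")
print(f"out-of-bag rows: {oob_ids.size}")
```

Each model in the ensemble gets its own sample like this one, which is why the individual models differ from each other.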
🔹 Pipeline
In data science, a pipeline is a systematic, step-by-step process that takes raw data and transforms it into useful predictions or insights. Think of a machine learning pipeline like a water supply system.
Just like in a water system, every step in a data science pipeline must be connected and well-maintained. If any section leaks, clogs, or adds impurities, the quality of water — or in our case, predictions — will suffer.
A pipeline in data science ensures the journey from raw data to final prediction is smooth, repeatable, and reliable.
We will now walk through a sample data science pipeline for a Bagging classifier.
2. Step-by-Step Bagging Pipeline with an Example Real-World Use Case: Bank Loan Default Prediction
Overview:
| Category | Steps Included | Purpose |
| --- | --- | --- |
| 1. Data Preparation | Collect data; preprocess (clean, encode, scale if needed); split into train-test | Ensure the data is clean, relevant, and ready for modeling. |
| 2. Model Building | Choose base model; set up bagging; train | Select the right algorithm, configure bagging, and fit the model on training data. |
| 3. Model Evaluation & Optimization | Evaluate; tune | Measure model performance and improve it through hyperparameter tuning. |
| 4. Deployment | Deploy | Integrate the trained model into a real-world application for use. |
Objective: Predict whether a loan applicant will default on their loan.
Problem Type: Classification
1. Problem Definition
The bank wants to reduce loan defaults.
Input: Customer details (income, credit score, employment history, existing debts, etc.).
Output: 1 = Will default, 0 = Will not default.
2. Data Collection
Source: the bank’s internal database, covering the past 5 years of loan records.
Dataset columns (suggestive):
3. Data Preprocessing
a. Handle Missing Values
- Impute (mean, median, mode) or remove rows.
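# Assign the result back: fillna(..., inplace=True) on a selected column is deprecated in recent pandas versions
data['Income'] = data['Income'].fillna(data['Income'].median())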
b. Encode Categorical Variables
Use Label Encoding or One-Hot Encoding.
This step converts categorical (non-numeric) data into numbers so that machine learning models can work with it.
Why is this needed?
Most ML algorithms (especially mathematical ones like logistic regression, SVM, etc.) can’t work directly with text labels like "Male", "Female" or "Red", "Blue", "Green". They need numeric representations.
Two common methods:
1. Label Encoding
Assigns an integer to each category.
Pros: Simple and memory-efficient.
Cons: Can accidentally imply an order (0 < 1), which might mislead models.
2. One-Hot Encoding
Creates a new column for each category with binary values (0 or 1).
Pros: No false order implication.
Cons: More memory usage (especially with many categories).
💡 In Bagging with Decision Trees, the choice of encoding method is flexible: trees usually cope with label-encoded features, but in pipelines with other models, One-Hot Encoding is often safer.
| Aspect | Label Encoding | One-Hot Encoding |
| --- | --- | --- |
| Definition | Assigns a unique integer to each category. | Creates a new binary column for each category. |
| Example Input | Gender: Male, Female | Color: Red, Blue, Green |
| Example Output | Male → 0, Female → 1 | Red → (1,0,0), Blue → (0,1,0), Green → (0,0,1) |
| Pros | Simple, memory-efficient. | Avoids implying any order between categories. |
| Cons | Implies order (0 < 1) even if none exists. | Can increase dimensionality (more columns). |
| When to Use | When categories have a natural order or when using tree-based models. | When categories have no order and the model is sensitive to numeric magnitude (e.g., Linear Regression, Logistic Regression). |
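# Let us use Label Encoding here for illustration.
# Note: scikit-learn documents LabelEncoder for target labels; OrdinalEncoder is its counterpart for feature columns.
from sklearn.preprocessing import LabelEncoder
data['Employment_Type'] = LabelEncoder().fit_transform(data['Employment_Type'])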
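For comparison with the label-encoded column above, here is a minimal sketch of One-Hot Encoding using pandas. The toy DataFrame and its category values are made up for illustration; they are not taken from the bank dataset.

```python
import pandas as pd

# Hypothetical toy column; the category values are illustrative only.
toy = pd.DataFrame(
    {"Employment_Type": ["Salaried", "Self-Employed", "Contract", "Salaried"]}
)

# One new 0/1 column per category; the original text column is dropped.
one_hot = pd.get_dummies(toy, columns=["Employment_Type"], dtype=int)
print(one_hot)
```

Every row has exactly one 1 across the new columns, so no ordering between categories is implied.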
c. Feature Scaling (if required)
- Standardization or Min-Max scaling (often not needed for tree-based bagging models, but needed for scale-sensitive base estimators such as kNN or SVM).
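If scaling is needed, a common pattern is to learn the scaling statistics on the training rows only and reuse them on the test rows. The sketch below assumes two hypothetical numeric features (income and credit score) with made-up values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features (income, credit score), for illustration only.
X_train_num = np.array([[30000.0, 620.0], [52000.0, 700.0], [88000.0, 760.0]])
X_test_num = np.array([[45000.0, 680.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_num)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test_num)        # reuse the training statistics

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
```

Fitting the scaler on the full dataset before splitting would leak information from the test set into training.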
d. Feature Selection
- Remove redundant columns.
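The preprocessing steps above can also be chained into a single scikit-learn Pipeline object, so that imputation, encoding, and the model are fitted together and applied consistently to new data. This is a sketch under assumptions: the tiny DataFrame, its values, and the choice of median imputation plus One-Hot Encoding are illustrative, not the bank's actual setup.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import BaggingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; column names mirror the article, values are made up.
toy = pd.DataFrame({
    "Income": [30000, np.nan, 52000, 88000, 41000, 67000, np.nan, 25000],
    "Employment_Type": ["Salaried", "Contract", "Salaried", "Self-Employed",
                        "Contract", "Salaried", "Self-Employed", "Contract"],
    "Default": [1, 1, 0, 0, 1, 0, 0, 1],
})
X_toy, y_toy = toy.drop(columns="Default"), toy["Default"]

# Route each column type to its own preprocessing step.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["Income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Employment_Type"]),
])

# Preprocessing and the bagging model fitted as one unit.
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                                n_estimators=10, random_state=42)),
])
pipe.fit(X_toy, y_toy)
toy_predictions = pipe.predict(X_toy)
print(toy_predictions)
```

Because the preprocessing lives inside the pipeline, the same object can be cross-validated, tuned, and saved without repeating the cleaning code by hand.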
4. Data Splitting
Train-test split (e.g., 80-20)
from sklearn.model_selection import train_test_split
X = data.drop('Default', axis=1)
y = data['Default']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
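Loan defaults are usually a minority class, so it can help to keep the default rate the same in both splits. The sketch below shows the optional `stratify` argument of `train_test_split` on a made-up target with an assumed 10% default rate; it is a variation on the split above, not a replacement for it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 100 defaults out of 1000 loans.
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(1000).reshape(-1, 1)   # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both splits keep a default rate close to the overall 10%.
print("train default rate:", y_tr.mean())
print("test default rate:", y_te.mean())
```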
5. Base Model Selection
Choose a Decision Tree because it is a high-variance model, and Bagging works well to stabilize such models.
Common choices:
Decision Tree
k-Nearest Neighbors
SVM (less common with bagging)
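One way to see the stabilizing effect is to cross-validate a single tree and a bagged ensemble of the same tree side by side. The sketch below uses synthetic data from `make_classification` as a stand-in for the loan data; the sample size, feature counts, and number of estimators are arbitrary assumptions, and the exact scores will differ on real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan dataset.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0
)

# 5-fold cross-validated accuracy for each model.
tree_scores = cross_val_score(single_tree, X_syn, y_syn, cv=5)
bag_scores = cross_val_score(bagged_trees, X_syn, y_syn, cv=5)

print(f"single tree : mean={tree_scores.mean():.3f}, std={tree_scores.std():.3f}")
print(f"bagged trees: mean={bag_scores.mean():.3f}, std={bag_scores.std():.3f}")
```

Comparing the mean and the spread across folds gives a rough picture of how much the ensemble smooths out the single tree's instability.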
6. Bagging Model Setup
Bagging Principle:
Create multiple bootstrap samples from training data.
Train a base model on each sample independently.
Aggregate predictions by:
Majority vote (classification)
Averaging (regression)
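The three steps above can be sketched by hand before handing the work to a library. This is a simplified illustration on synthetic data, with an assumed 25 trees and hard majority voting; scikit-learn's BaggingClassifier averages predicted probabilities instead when the base estimator supports them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=1)
rng = np.random.default_rng(1)

# Steps 1 and 2: draw a bootstrap sample and train one tree on it, 25 times.
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X_syn), size=len(X_syn))   # sampling with replacement
    trees.append(DecisionTreeClassifier(max_depth=4).fit(X_syn[idx], y_syn[idx]))

# Step 3: majority vote across the trees (an odd tree count avoids ties).
all_votes = np.stack([tree.predict(X_syn) for tree in trees])   # shape (25, 400)
majority = (all_votes.mean(axis=0) >= 0.5).astype(int)

print("training accuracy of the vote:", (majority == y_syn).mean())
```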
Example with Scikit-learn:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
base_tree = DecisionTreeClassifier(max_depth=6, random_state=42)
bag_model = BaggingClassifier(
    estimator=base_tree,   # this argument was named base_estimator in scikit-learn versions before 1.2
    n_estimators=100,      # number of models
    max_samples=0.8,       # fraction of training rows drawn for each model
    bootstrap=True,        # sampling with replacement
    random_state=42
)
7. Model Training
bag_model.fit(X_train, y_train)
8. Model Evaluation
a. Predictions
y_pred = bag_model.predict(X_test)
b. Metrics
Classification: Accuracy, Precision, Recall, F1-score, ROC-AUC.
Regression: RMSE, MAE, R².
from sklearn.metrics import accuracy_score, classification_report
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Example output (illustrative; actual values depend on the data):
Accuracy: 0.91 Precision: 0.88 Recall: 0.85 F1-score: 0.86
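ROC-AUC from the metrics list needs predicted probabilities rather than hard labels. The following sketch shows one way to compute it, together with a confusion matrix, on synthetic imbalanced data. The data, the 15% minority share, and the model settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with roughly 15% positives ("defaults").
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

demo_model = BaggingClassifier(
    DecisionTreeClassifier(max_depth=6), n_estimators=50, random_state=0
).fit(X_tr, y_tr)

# Probability of class 1 for each test row, used for ROC-AUC.
default_proba = demo_model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, default_proba)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, demo_model.predict(X_te))

print("ROC-AUC:", round(auc, 3))
print(cm)
```

With imbalanced classes, the confusion matrix and recall on the default class are often more informative than accuracy alone.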
9. Hyperparameter Tuning
Hyperparameter tuning is the process of finding the best set of hyperparameters for a machine learning model so it performs optimally.
Hyperparameters
These are parameters set before training (not learned from data).
Examples:
In Decision Trees: max_depth, min_samples_split
In Bagging: n_estimators, max_samples
In kNN: n_neighbors
In short:
Hyperparameter tuning is like adjusting the settings of a machine before you start — better settings = better performance.
- Common methods: GridSearchCV or RandomizedSearchCV:
| Aspect | Grid Search CV | Randomized Search CV |
| --- | --- | --- |
| Definition | Tests all possible combinations of hyperparameters from the given grid. | Tests a fixed number of random combinations from the given hyperparameter space. |
| Search Space Coverage | Exhaustive: covers every combination. | Partial: explores only a random subset. |
| Speed | Slow for large search spaces (can be very time-consuming). | Faster: the number of iterations can be controlled. |
| Best For | Small search spaces where all combinations can be tested. | Large search spaces where exhaustive search is impractical. |
| Risk | Risk of overfitting to CV set if space is large. | May miss the absolute best combination but finds a good one faster. |
| Control Parameter | param_grid (dictionary of all possible values). | param_distributions + n_iter (number of random trials). |
| Example Usage | GridSearchCV(estimator, param_grid, cv=5) | RandomizedSearchCV(estimator, param_distributions, n_iter=20, cv=5) |
from sklearn.model_selection import GridSearchCV
params = {
    'n_estimators': [50, 100, 150],
    'max_samples': [0.6, 0.8, 1.0]
}
# estimator= was named base_estimator= in scikit-learn versions before 1.2
grid = GridSearchCV(BaggingClassifier(estimator=DecisionTreeClassifier()), params, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
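For the randomized alternative, a comparable sketch could look like the following. It runs on synthetic data, and the distribution ranges, `n_iter=5`, and `cv=3` are small assumed values chosen to keep the example quick, not tuned recommendations.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

# Distributions to sample from, instead of a fixed grid.
param_distributions = {
    "n_estimators": randint(10, 60),     # integers from 10 up to 59
    "max_samples": uniform(0.5, 0.5),    # floats between 0.5 and 1.0
}

search = RandomizedSearchCV(
    BaggingClassifier(DecisionTreeClassifier(), random_state=0),
    param_distributions,
    n_iter=5,          # number of random combinations to try
    cv=3,
    random_state=0,
)
search.fit(X_syn, y_syn)
print(search.best_params_)
```

The number of model fits is fixed by `n_iter` times `cv`, regardless of how wide the distributions are.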
10. Deployment
- Save the model and integrate it into the bank’s loan approval system.
import joblib
joblib.dump(bag_model, 'loan_default_bagging.pkl')
- When a new loan application comes in, the system predicts:
model = joblib.load('loan_default_bagging.pkl')
new_applicant = [[35, 50000, 700, 5, 2000, 10000]]  # Example input; values must follow the training column order
print(model.predict(new_applicant))  # e.g. [0] means: will not default
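A quick way to check the save-and-load step is a round trip: dump the model, load it back, and confirm that both copies give the same predictions. The sketch below does this on synthetic data with hypothetical feature names (`feature_0` to `feature_5`) and a temporary file. Passing new rows as a DataFrame with the training column names keeps the feature order explicit.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical column names.
feature_names = [f"feature_{i}" for i in range(6)]
X_arr, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
X_df = pd.DataFrame(X_arr, columns=feature_names)

trained = BaggingClassifier(
    DecisionTreeClassifier(max_depth=4), n_estimators=20, random_state=0
).fit(X_df, y_syn)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_bagging.pkl")
    joblib.dump(trained, path)
    restored = joblib.load(path)

# Score one "new" row, passed with the same column names used in training.
new_row = X_df.iloc[[0]]
proba = restored.predict_proba(new_row)[0, 1]
print("probability of class 1:", round(proba, 3))
```

Files saved this way should be loaded with the same scikit-learn version that produced them, and only from sources you trust.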
Summary - Illustration of Bagging in This Use Case
Conclusion
A Bagging Pipeline streamlines the journey from raw data to robust predictions by combining the structure of an ML pipeline with the power of Bagging. The pipeline keeps every step (cleaning and encoding data, model training, evaluation, and deployment) organized and repeatable, while Bagging reduces variance and often improves accuracy over a single unstable model.
When executed together, they turn messy data and unstable models into a reliable, production-ready ML solution.
Written by Krishna Dwivedi