Building a Bagging Model: A Complete Guide from Data to Predictions

Krishna Dwivedi

1. Getting the Jargon Out of the Way:

Two technical terms run through this article: Bagging and Pipeline.
Before we dive into the details, let’s quickly understand what each means, so that the article stays beginner friendly. Feel free to skip this part if you already know these terms.

🔹 Bagging (Bootstrap Aggregating)
Bagging is an ensemble learning technique in machine learning where multiple models are trained on different subsets of the training data (bootstrap samples, created by sampling with replacement). The predictions of these models are then combined: by majority voting for classification or by averaging for regression. The main goal is to reduce variance, which limits overfitting and improves model stability.
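
To make "sampling with replacement" concrete, here is a minimal sketch that draws one bootstrap sample of row indices with NumPy. The row count and the seed are arbitrary choices for illustration, not values from the loan dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 10_000                      # arbitrary size, for illustration only
row_ids = np.arange(n_rows)

# One bootstrap sample: draw n_rows indices WITH replacement,
# so some rows appear several times and others not at all.
sample = rng.choice(row_ids, size=n_rows, replace=True)

unique_fraction = np.unique(sample).size / n_rows
out_of_bag = np.setdiff1d(row_ids, sample)   # rows never drawn

print(f"Unique rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"Rows left out (out-of-bag): {out_of_bag.size}")
```

For large datasets, roughly 63% of the rows end up in each bootstrap sample, so every model sees a slightly different view of the data. That difference is what lets the combined prediction be more stable than any single model.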

🔹 Pipeline
In data science, a pipeline is a systematic, step-by-step process that takes raw data and transforms it into useful predictions or insights. Think of a machine learning pipeline like a water supply system.

Just like in a water system, every step in a data science pipeline must be connected and well-maintained. If any section leaks, clogs, or adds impurities, the quality of water — or in our case, predictions — will suffer.

A pipeline in data science ensures the journey from raw data to final prediction is smooth, repeatable, and reliable.
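
In scikit-learn, this idea maps onto the `Pipeline` class, which chains preprocessing steps and a model into one object. The sketch below is only an illustration on a tiny made-up array; the step names `impute` and `model` are arbitrary labels.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset: two numeric features, one missing value.
X = np.array([[25.0, 40000.0],
              [40.0, np.nan],
              [35.0, 52000.0],
              [50.0, 90000.0]])
y = np.array([1, 0, 0, 0])

# Each step is a (name, transformer_or_model) pair; data flows top to bottom.
pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),      # fill missing values
    ("model", DecisionTreeClassifier(random_state=0)), # final estimator
])

pipe.fit(X, y)           # fits the imputer, then the tree
preds = pipe.predict(X)  # applies the same imputation before predicting
print(preds)
```

Because the fitted pipeline remembers its preprocessing, the same transformations are applied at prediction time, which is what makes the process repeatable.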

We will now walk through a sample data science pipeline for the Bagging Classifier.

2. Step-by-Step Bagging Pipeline with an Example Real-World Use Case: Bank Loan Default Prediction

Overview:

| Category | Steps Included | Purpose |
|---|---|---|
| 1. Data Preparation | Collect data; preprocess (clean, encode, scale if needed); split into train-test | Ensure the data is clean, relevant, and ready for modeling. |
| 2. Model Building | Choose base model; set up bagging; train | Select the right algorithm, configure bagging, and fit the model on training data. |
| 3. Model Evaluation & Optimization | Evaluate; tune | Measure model performance and improve it through hyperparameter tuning. |
| 4. Deployment | Deploy | Integrate the trained model into a real-world application for use. |

Objective: Predict whether a loan applicant will default on their loan.
Problem Type: Classification

1. Problem Definition

  • Bank wants to reduce loan defaults.

  • Input: Customer details (income, credit score, employment history, existing debts, etc.).

  • Output: 1 = Will default, 0 = Will not default.

2. Data Collection

  • Source = Bank’s internal database: past 5 years of loan records.

  • Dataset columns (suggestive):

3. Data Preprocessing

a. Handle Missing Values

  • Impute (mean, median, mode) or remove rows.
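# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
data['Income'] = data['Income'].fillna(data['Income'].median())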
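
One cautious variant, sketched below on made-up numbers, computes the median from the training rows only and reuses it for the test rows, so that no information from the test set leaks into preprocessing. The `train` and `test` frames here are hypothetical stand-ins, not the bank data.

```python
import numpy as np
import pandas as pd

# Hypothetical training and test rows with missing incomes.
train = pd.DataFrame({"Income": [30000.0, np.nan, 50000.0, 70000.0]})
test = pd.DataFrame({"Income": [np.nan, 45000.0]})

# Learn the fill value from the training data only.
train_median = train["Income"].median()

# Apply the same value to both sets.
train["Income"] = train["Income"].fillna(train_median)
test["Income"] = test["Income"].fillna(train_median)

print(train_median)
print(test)
```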

b. Encode Categorical Variables

  • Label Encoding or One-Hot Encoding.

  • Encoding converts categorical (non-numeric) data into numbers so machine learning models can work with it.

    Why is this needed?

    Most ML algorithms (especially mathematical ones like logistic regression, SVM, etc.) can’t work directly with text labels like "Male", "Female" or "Red", "Blue", "Green".
    They need numeric representations.

    Two common methods:

    1. Label Encoding

    • Assigns an integer to each category.

    • Pros: Simple and memory-efficient.
      Cons: Can accidentally imply an order (0 < 1), which might mislead models.

    2. One-Hot Encoding

    • Creates a new column for each category with binary values (0 or 1).

    • Pros: No false order implication.
      Cons: More memory usage (especially with many categories).

    💡 In Bagging with Decision Trees, the choice of encoding method is flexible: trees can usually handle label encoding without issues, but in pipelines with other models, One-Hot Encoding is often safer.

| Aspect | Label Encoding | One-Hot Encoding |
|---|---|---|
| Definition | Assigns a unique integer to each category. | Creates a new binary column for each category. |
| Example Input | Gender: Male, Female | Color: Red, Blue, Green |
| Example Output | Male → 0, Female → 1 | Red → (1,0,0), Blue → (0,1,0), Green → (0,0,1) |
| Pros | Simple, memory-efficient. | Avoids implying any order between categories. |
| Cons | Implies order (0 < 1) even if none exists. | Can increase dimensionality (more columns). |
| When to Use | When categories have a natural order or when using tree-based models. | When categories have no order and the model is sensitive to numeric magnitude (e.g., Linear Regression, Logistic Regression). |

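# Label encoding is used here for illustration.
# Note: scikit-learn documents LabelEncoder as intended for the target y;
# OrdinalEncoder is its counterpart for feature columns.
from sklearn.preprocessing import LabelEncoder
data['Employment_Type'] = LabelEncoder().fit_transform(data['Employment_Type'])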
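
To see the two encodings side by side, here is a small sketch on a made-up `Employment_Type` column. The category values are hypothetical, and `OrdinalEncoder` is used as the feature-column form of label encoding.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical category values, for illustration only.
df = pd.DataFrame(
    {"Employment_Type": ["Salaried", "Self-Employed", "Unemployed", "Salaried"]}
)

# Label-style encoding: one integer per category (sorted alphabetically by default).
ordinal = OrdinalEncoder()
codes = ordinal.fit_transform(df[["Employment_Type"]]).ravel()

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["Employment_Type"], dtype=int)

print(codes)
print(one_hot)
```

Each one-hot row contains exactly one 1, so no ordering between categories is implied, at the cost of one extra column per category.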

c. Feature Scaling (if required)

  • Standardization or Min-Max scaling (often not needed for tree-based bagging models, but needed for other estimators).
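
If the base estimator does need scaling (k-Nearest Neighbors or SVM, for example), a minimal standardization sketch looks like this. The income and credit-score values are made up, and the scaler is fitted on the training rows only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: [income, credit score].
X_train = np.array([[30000.0, 600.0],
                    [50000.0, 700.0],
                    [70000.0, 800.0]])
X_test = np.array([[50000.0, 650.0]])

# Learn mean and standard deviation from the training data only.
scaler = StandardScaler().fit(X_train)

X_train_scaled = scaler.transform(X_train)  # each column: mean 0, std 1
X_test_scaled = scaler.transform(X_test)    # reuses the training statistics

print(X_train_scaled)
print(X_test_scaled)
```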

d. Feature Selection

  • Remove redundant columns.

4. Data Splitting

Train-test split (e.g., 80-20)

from sklearn.model_selection import train_test_split
X = data.drop('Default', axis=1)
y = data['Default']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
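
Loan defaults are usually the minority class, so it can help to keep the default rate the same in both splits. The sketch below shows this with the `stratify` argument on synthetic data; the 10% default rate is an assumption made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))        # synthetic features
y = np.array([1] * 100 + [0] * 900)   # assumed 10% default rate

# stratify=y keeps the class proportions identical in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Train default rate:", y_train.mean())
print("Test default rate:", y_test.mean())
```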

5. Base Model Selection

  • Choose a Decision Tree because it is a high-variance model, and Bagging works well to stabilize it.

  • Common choices:

    • Decision Tree

    • k-Nearest Neighbors

    • SVM (less common with bagging)

6. Bagging Model Setup

  • Bagging Principle:

    • Create multiple bootstrap samples from training data.

    • Train a base model on each sample independently.

    • Aggregate predictions by:

      • Majority vote (classification)

      • Averaging (regression)
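
The three steps above can be sketched by hand in a few lines. This is a simplified illustration on synthetic data, not how scikit-learn implements bagging internally; the number of trees (25) is an arbitrary odd number chosen to avoid tied votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Step 1: bootstrap sample (row indices drawn with replacement).
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: train one base model on that sample, independently of the others.
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    trees.append(tree)

# Step 3: aggregate by majority vote across the 25 trees.
votes = np.stack([t.predict(X) for t in trees])      # shape (25, 500)
majority = (votes.mean(axis=0) > 0.5).astype(int)

print("Training accuracy of the vote:", (majority == y).mean())
```

Note that scikit-learn's `BaggingClassifier` averages predicted class probabilities when the base estimator provides them and falls back to voting otherwise, so its output can differ slightly from a hard majority vote.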

Example with Scikit-learn:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

base_tree = DecisionTreeClassifier(max_depth=6, random_state=42)
bag_model = BaggingClassifier(
    estimator=base_tree,   # named base_estimator before scikit-learn 1.2
    n_estimators=100,      # number of models
    max_samples=0.8,       # fraction of training rows drawn for each model
    bootstrap=True,        # sampling with replacement
    random_state=42
)

7. Model Training

bag_model.fit(X_train, y_train)
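
Because each model is trained on a bootstrap sample, the rows it never saw can serve as a built-in validation set. The sketch below enables this with `oob_score=True` on synthetic data. The base tree is passed positionally so the code does not depend on the keyword name, which changed between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data.
X, y = make_classification(n_samples=600, n_features=6, random_state=42)

bag = BaggingClassifier(
    DecisionTreeClassifier(max_depth=6, random_state=42),  # base model
    n_estimators=50,
    max_samples=0.8,
    bootstrap=True,
    oob_score=True,      # score each row using only models that did not train on it
    random_state=42,
)
bag.fit(X, y)

print("Out-of-bag accuracy:", bag.oob_score_)
```

The out-of-bag score gives a quick estimate of generalization without touching the test set, which can be handy before a full evaluation.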

8. Model Evaluation

a. Predictions

y_pred = bag_model.predict(X_test)

b. Metrics

  • Classification: Accuracy, Precision, Recall, F1-score, ROC-AUC.

  • Regression: RMSE, MAE, R².

from sklearn.metrics import accuracy_score, classification_report
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
  • Example output:

      Accuracy: 0.91
      Precision: 0.88
      Recall: 0.85
      F1-score: 0.86
    

9. Hyperparameter Tuning

Hyperparameter tuning is the process of finding the best set of hyperparameters for a machine learning model so it performs optimally.

Hyperparameters

  • These are parameters set before training (not learned from data).

  • Examples:

    • In Decision Trees: max_depth, min_samples_split

    • In Bagging: n_estimators, max_samples

    • In kNN: n_neighbors

In short:
Hyperparameter tuning is like adjusting the settings of a machine before you start — better settings = better performance.

  • Common methods: GridSearchCV and RandomizedSearchCV.

| Aspect | Grid Search CV | Randomized Search CV |
|---|---|---|
| Definition | Tests all possible combinations of hyperparameters from the given grid. | Tests a fixed number of random combinations from the given hyperparameter space. |
| Search Space Coverage | Exhaustive: covers every combination. | Partial: explores only a random subset. |
| Speed | Slow for large search spaces (can be very time-consuming). | Faster: number of iterations can be controlled. |
| Best For | Small search spaces where all combinations can be tested. | Large search spaces where exhaustive search is impractical. |
| Risk | Risk of overfitting to CV set if space is large. | May miss the absolute best combination but finds a good one faster. |
| Control Parameter | param_grid (dictionary of all possible values). | param_distributions + n_iter (number of random trials). |
| Example Usage | GridSearchCV(estimator, param_grid, cv=5) | RandomizedSearchCV(estimator, param_distributions, n_iter=20, cv=5) |

from sklearn.model_selection import GridSearchCV

params = {
    'n_estimators': [50, 100, 150],
    'max_samples': [0.6, 0.8, 1.0]
}
# `estimator` was called `base_estimator` before scikit-learn 1.2
grid = GridSearchCV(BaggingClassifier(estimator=DecisionTreeClassifier()), params, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
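
For comparison, here is a minimal sketch of the randomized alternative on synthetic data. The ranges, `n_iter=5`, `cv=3`, and the `roc_auc` scoring choice are assumptions picked to keep the example fast, not recommended settings.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Distributions to sample from, instead of a fixed grid.
param_distributions = {
    "n_estimators": randint(10, 40),    # integers from 10 to 39
    "max_samples": uniform(0.5, 0.5),   # floats between 0.5 and 1.0
    "max_features": uniform(0.5, 0.5),  # floats between 0.5 and 1.0
}

search = RandomizedSearchCV(
    BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0),
    param_distributions,
    n_iter=5,            # only 5 random combinations are tried
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```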

10. Deployment

  • Save the model and integrate it into the bank’s loan approval system.
import joblib
joblib.dump(bag_model, 'loan_default_bagging.pkl')
  • When a new loan application comes in, the system predicts:
model = joblib.load('loan_default_bagging.pkl')
new_applicant = [[35, 50000, 700, 5, 2000, 10000]]  # Example input; values must follow the training column order
print(model.predict(new_applicant))  # e.g. [0] means "will not default"
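
In practice it is often useful to return a default probability rather than a bare 0/1, and to pass new applications with the same column names used in training. The sketch below shows one way to do that. The column names are hypothetical, the data is synthetic, and the model is saved to a temporary folder rather than a real deployment path.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature names; the real ones depend on the bank's dataset.
columns = ["Age", "Income", "Credit_Score",
           "Years_Employed", "Existing_Debt", "Loan_Amount"]
X_arr, y = make_classification(n_samples=300, n_features=6, random_state=0)
X = pd.DataFrame(X_arr, columns=columns)

model = BaggingClassifier(
    DecisionTreeClassifier(max_depth=6), n_estimators=20, random_state=0
).fit(X, y)

# Save and reload, as the loan approval system would.
path = os.path.join(tempfile.mkdtemp(), "loan_default_bagging.pkl")
joblib.dump(model, path)
loaded = joblib.load(path)

# A one-row DataFrame stands in for an incoming application.
new_applicant = X.iloc[[0]]
default_probability = loaded.predict_proba(new_applicant)[0, 1]

print(f"Estimated probability of default: {default_probability:.2f}")
```

A probability lets the bank choose its own decision threshold instead of relying on the default 0.5 cut-off.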

Summary - Illustration of Bagging in This Use Case

Conclusion

A Bagging Pipeline streamlines the journey from raw data to robust predictions by combining the structure of an ML pipeline with the power of Bagging. The pipeline keeps every step, from cleaning and encoding data to model training, evaluation, and deployment, organized and repeatable, while Bagging reduces variance and often improves accuracy.

Used together, they help turn messy data and unstable models into a reliable, production-ready ML solution.
