Bagging Ensemble Learning
Bagging, which stands for Bootstrap Aggregating, is a popular ensemble learning method in machine learning.
In bagging, the ensemble is created by training multiple models on different subsets of the training data. These subsets are obtained through a process called bootstrapping, where random samples are drawn with replacement from the original training set. Each model in the ensemble is trained independently on its respective bootstrap sample. At prediction time, every model makes its own prediction, and the final result is obtained by aggregating them, typically majority voting for classification or averaging for regression.
Given a training dataset with N examples, bootstrap sampling involves randomly selecting N examples from the dataset, with replacement.
Sampling with replacement means that each example can be chosen multiple times, and some examples may not be selected at all.
This process creates a bootstrap sample: a new dataset of the same size as the original training data, in which some examples appear more than once and others are missing.
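To make the idea concrete, here is a minimal sketch of bootstrap sampling using NumPy; the toy array and seed are purely illustrative and separate from the iris example used later.
import numpy as np

# A tiny, hypothetical dataset of 6 values
data = np.array([10, 20, 30, 40, 50, 60])

# Draw a bootstrap sample of the same size, with replacement
rng = np.random.default_rng(0)
sample = rng.choice(data, size=len(data), replace=True)

print("Bootstrap sample:", sample)                    # some values appear more than once
print("Examples never drawn:", np.setdiff1d(data, sample))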
Intuition
Bagging helps address the bias-variance tradeoff by reducing the variance of the model. Since bagging trains multiple models on different subsets of the data, each model captures different patterns and noise. When these models are aggregated, the noise cancels out to some extent, reducing the overall variance. This approach helps improve the model's generalization ability and reduces the risk of overfitting.
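As a rough, illustrative check of this intuition (the simulation below is a sketch with made-up numbers, not part of the original example), the average of several independent noisy estimates fluctuates far less than any single one:
import numpy as np

rng = np.random.default_rng(42)
true_value = 5.0

# Simulate 10 "models", each producing a noisy estimate over 1000 repeated trials
n_models, n_trials = 10, 1000
estimates = true_value + rng.normal(scale=1.0, size=(n_models, n_trials))

single_model_variance = estimates[0].var()        # variance of one model's estimates
ensemble_variance = estimates.mean(axis=0).var()  # variance of the averaged estimates

print("Single model variance:", round(single_model_variance, 3))
print("Averaged ensemble variance:", round(ensemble_variance, 3))  # roughly 10x smaller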
Use Case
When deciding whether to use bagging or not for a machine learning problem, you should consider the following factors:
Dataset Size: Bagging is particularly useful when the dataset is large enough that bootstrap samples stay representative while still differing from one another; this introduces diversity among the models and improves generalization.
Model Complexity: Bagging can be beneficial when using complex models that have a high variance, such as decision trees or neural networks.
Implementation of the Bagging Classifier
Import all necessary libraries and load the iris dataset
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
Bootstrap the dataset
Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Perform bootstrap sampling on the training data
bootstrap_indices = np.random.choice(len(X_train), size=len(X_train), replace=True)
X_train_bootstrapped = X_train[bootstrap_indices]
y_train_bootstrapped = y_train[bootstrap_indices]
To create a bagging classifier, we start by defining the base estimator and then train the ensemble on the bootstrapped training data with the fit
method. Internally, BaggingClassifier draws a fresh bootstrap sample of the data it is given for each decision tree, so every base estimator is trained on a different subset of the bootstrapped training data (the manual sampling step above mainly illustrates the mechanism). Finally, we predict on the test set and check the accuracy.
base_estimator = DecisionTreeClassifier()
bagging = BaggingClassifier(estimator=base_estimator, n_estimators=10, random_state=42)
bagging.fit(X_train_bootstrapped, y_train_bootstrapped)
y_pred = bagging.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Fine-tuning the hyperparameters
estimator: The individual model used as the base learner in the ensemble. It can be any supervised learning algorithm that supports the fit and predict methods; choose it based on the problem at hand.
n_estimators: The number of base estimators in the ensemble. Increasing it generally improves performance but also increases computational cost.
max_samples: The number (or fraction) of samples drawn from the training set for each base estimator. This usually requires some experimentation.
max_features: The number (or fraction) of features each base estimator considers. Limiting it means each model in the committee only looks at a subset of the available features, which adds diversity and prevents any single feature from dominating.
bootstrap: If True, each base estimator is trained on a bootstrap sample drawn with replacement from the training set. If the dataset is highly imbalanced, it is worth evaluating whether setting it to False works better.
random_state: The seed for the random number generator, which makes the results reproducible when the same seed is used. A short sketch combining several of these hyperparameters is shown below.
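Here is a sketch of how several of these hyperparameters can be set together; the specific values are illustrative rather than tuned, and the snippet assumes the X_train, y_train, X_test and y_test variables defined in the classifier example above.
# Illustrative hyperparameter choices -- not tuned for the Iris data
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=50,      # more trees: more stable predictions, more compute
    max_samples=0.8,      # each tree sees 80% of the training rows
    max_features=0.75,    # ...and 75% of the features
    bootstrap=True,       # draw rows with replacement
    random_state=42,      # reproducible results
)
bagging.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, bagging.predict(X_test)))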
Implementation of the Bagging Regressor
When using the bagging technique for regression tasks, you can employ the BaggingRegressor class from scikit-learn's ensemble module; its interface mirrors BaggingClassifier.
# For regression, the base estimator should be a regressor, e.g. DecisionTreeRegressor()
bagging = BaggingRegressor(estimator=base_estimator, n_estimators=10, random_state=42)
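For completeness, here is a minimal end-to-end sketch of the regression case; the diabetes dataset and the DecisionTreeRegressor base learner are illustrative choices rather than part of the original walkthrough.
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load a small regression dataset
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bag 10 regression trees, each trained internally on its own bootstrap sample
bagging = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=10, random_state=42)
bagging.fit(X_train, y_train)

y_pred = bagging.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, y_pred))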
Conclusion
Bagging is a valuable tool in the ensemble learning toolbox: by aggregating many models trained on bootstrap samples, it reduces variance, makes high-variance learners such as decision trees more reliable, and applies to both classification and regression tasks.