Bagging Ensemble Learning
Bagging, which stands for Bootstrap Aggregating, is a popular ensemble learning method in machine learning.
In bagging, the ensemble is created by training multiple models on different subsets of the training data. These subsets are obtained through a process called bootstrapping, where random samples are drawn with replacement from the original training set. Each model in the ensemble is trained independently on its respective bootstrap sample. At prediction time, every model makes its own prediction, and the final result is obtained by aggregating them, typically majority voting for classification or averaging for regression.
Given a training dataset with N examples, bootstrap sampling involves randomly selecting N examples from the dataset, with replacement.
Sampling with replacement means that each example can be chosen multiple times, and some examples may not be selected at all.
This process creates a bootstrap sample: a new dataset of the same size as the original training data, in which some examples appear more than once and others are missing.
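To make the idea concrete, here is a minimal sketch of bootstrap sampling using NumPy; the toy array and seed are purely illustrative and separate from the iris example used later.
import numpy as np

# A tiny, hypothetical dataset of 6 values
data = np.array([10, 20, 30, 40, 50, 60])

# Draw a bootstrap sample of the same size, with replacement
rng = np.random.default_rng(0)
sample = rng.choice(data, size=len(data), replace=True)

print("Bootstrap sample:", sample)                    # some values appear more than once
print("Examples never drawn:", np.setdiff1d(data, sample))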
Intuition
Bagging helps address the bias-variance tradeoff by reducing the variance of the model. Since bagging trains multiple models on different subsets of the data, each model captures different patterns and noise. When these models are aggregated, the noise cancels out to some extent, reducing the overall variance. This approach helps improve the model's generalization ability and reduces the risk of overfitting.
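As a rough, illustrative check of this intuition (the simulation below is a sketch with made-up numbers, not part of the original example), the average of several independent noisy estimates fluctuates far less than any single one:
import numpy as np

rng = np.random.default_rng(42)
true_value = 5.0

# Simulate 10 "models", each producing a noisy estimate over 1000 repeated trials
n_models, n_trials = 10, 1000
estimates = true_value + rng.normal(scale=1.0, size=(n_models, n_trials))

single_model_variance = estimates[0].var()        # variance of one model's estimates
ensemble_variance = estimates.mean(axis=0).var()  # variance of the averaged estimates

print("Single model variance:", round(single_model_variance, 3))
print("Averaged ensemble variance:", round(ensemble_variance, 3))  # roughly 10x smaller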
Use Case
When deciding whether to use bagging or not for a machine learning problem, you should consider the following factors:
Dataset Size: Bagging is particularly useful when the dataset is large enough that bootstrap samples stay representative while still differing from one another; this introduces diversity among the models and improves generalization.
Model Complexity: Bagging can be beneficial when using complex models that have a high variance, such as decision trees or neural networks.
Implementation of the Bagging Classifier
Import all necessary libraries and load the iris dataset
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
Bootstrap the dataset
Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Perform bootstrap sampling on the training data
bootstrap_indices = np.random.choice(len(X_train), size=len(X_train), replace=True)
X_train_bootstrapped = X_train[bootstrap_indices]
y_train_bootstrapped = y_train[bootstrap_indices]
To create a bagging classifier, we start by defining the base estimator and then train the ensemble on the bootstrapped training data with the fit
method. Internally, BaggingClassifier draws a fresh bootstrap sample of the data it is given for each decision tree, so every base estimator is trained on a different subset of the bootstrapped training data (the manual sampling step above mainly illustrates the mechanism). Finally, we predict on the test set and check the accuracy.
base_estimator = DecisionTreeClassifier()
bagging = BaggingClassifier(estimator=base_estimator, n_estimators=10, random_state=42)
bagging.fit(X_train_bootstrapped, y_train_bootstrapped)
y_pred = bagging.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Fine-tuning the hyperparameters
estimator: The individual model used as the base learner in the ensemble. It can be any supervised learning algorithm that supports the fit and predict methods; choose it based on the problem at hand.
n_estimators: The number of base estimators in the ensemble. Increasing it generally improves performance but also increases computational cost.
max_samples: The number (or fraction) of samples drawn from the training set for each base estimator. This usually requires some experimentation.
max_features: The number (or fraction) of features each base estimator considers. Limiting it means each model in the committee only looks at a subset of the available features, which adds diversity and prevents any single feature from dominating.
bootstrap: If True, each base estimator is trained on a bootstrap sample drawn with replacement from the training set. If the dataset is highly imbalanced, it is worth evaluating whether setting it to False works better.
random_state: The seed for the random number generator, which makes the results reproducible when the same seed is used. A short sketch combining several of these hyperparameters is shown below.
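Here is a sketch of how several of these hyperparameters can be set together; the specific values are illustrative rather than tuned, and the snippet assumes the X_train, y_train, X_test and y_test variables defined in the classifier example above.
# Illustrative hyperparameter choices -- not tuned for the Iris data
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=50,      # more trees: more stable predictions, more compute
    max_samples=0.8,      # each tree sees 80% of the training rows
    max_features=0.75,    # ...and 75% of the features
    bootstrap=True,       # draw rows with replacement
    random_state=42,      # reproducible results
)
bagging.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, bagging.predict(X_test)))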
Implementation of the Bagging Regressor
When using the bagging technique for regression tasks, you can employ the BaggingRegressor class from scikit-learn's ensemble module; its interface mirrors BaggingClassifier.
# For regression, the base estimator should be a regressor, e.g. DecisionTreeRegressor()
bagging = BaggingRegressor(estimator=base_estimator, n_estimators=10, random_state=42)
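For completeness, here is a minimal end-to-end sketch of the regression case; the diabetes dataset and the DecisionTreeRegressor base learner are illustrative choices rather than part of the original walkthrough.
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load a small regression dataset
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bag 10 regression trees, each trained internally on its own bootstrap sample
bagging = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=10, random_state=42)
bagging.fit(X_train, y_train)

y_pred = bagging.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, y_pred))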
Conclusion
Bagging is a valuable tool in the ensemble learning toolbox: by aggregating many models trained on bootstrap samples, it reduces variance, makes high-variance learners such as decision trees more reliable, and applies to both classification and regression tasks.