Implementing XGBoost using Scikit-learn

Nitin Sharma

XGBoost, which stands for Extreme Gradient Boosting, is an advanced machine learning algorithm that is widely used for regression, classification, and ranking tasks. It is particularly known for its speed and performance, making it a popular choice in data science competitions and practical applications.

At its core, XGBoost is based on boosting, an ensemble learning technique that combines the predictions of many weak learners, typically shallow decision trees, into a strong predictive model. Trees are added sequentially: in gradient boosting, each new tree is fit to the gradient of the loss with respect to the current ensemble's predictions (for squared error, simply the residuals), so every round corrects the errors that remain after the previous ones.
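To make the sequential idea concrete, here is a minimal sketch of boosting for squared error, written in plain Python with one-split "stumps" on a single feature. It is an illustration only: the helper names `fit_stump` and `boost` and the toy data are assumptions made for this example, and XGBoost itself additionally uses second-order gradient information and regularization.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (single feature)."""
    best = None
    # Try every threshold that leaves at least one point on each side.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Boosting for squared error: each stump is fit to the current residuals."""
    base = sum(ys) / len(ys)  # initial prediction: the mean of the targets
    preds = [base] * len(ys)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add a shrunken version of the new stump to the ensemble prediction.
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, mse_history


xs = list(range(10))
ys = [x ** 2 for x in xs]  # toy regression target
base, stumps, mse_history = boost(xs, ys)
print(f"MSE after round 1:  {mse_history[0]:.2f}")
print(f"MSE after round 20: {mse_history[-1]:.2f}")
```

Each round lowers the training error a little, and the learning rate controls how much of each new stump is added.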

Key Features of XGBoost:

  1. Regularization: XGBoost includes a regularization term in its objective function, which helps to prevent overfitting. This feature differentiates it from many other boosting algorithms, as it adds both L1 (Lasso) and L2 (Ridge) penalties. This allows for more flexibility in managing model complexity and improves generalization on unseen data.

  2. Handling Missing Values: One of the standout features of XGBoost is its ability to handle missing data internally. At each split it learns a default direction (left or right branch) for rows whose value is missing, choosing whichever direction reduces the training loss more. This makes it robust to incomplete datasets without a separate imputation step.

  3. Parallel Processing: Boosting itself is sequential, since each tree depends on the ones before it, and XGBoost is no exception. What it parallelizes is the work inside each tree: the search for the best split is distributed across features and CPU threads, using pre-sorted or histogram-binned feature values. This makes training considerably faster than a single-threaded gradient boosting implementation.

  4. Tree Pruning: XGBoost grows each tree up to the configured maximum depth (max_depth) and then prunes it backwards, removing splits whose loss reduction does not exceed the gamma (min_split_loss) threshold. Because a split with a small gain can enable a valuable split below it, this post-pruning can keep branches that a rule which stops at the first weak split would discard.

  5. Scalability: XGBoost is designed to be highly scalable. It can handle large datasets and can be run on distributed systems, making it suitable for modern data processing needs.
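The regularization and pruning features above can be summarized with the regularized objective commonly given for XGBoost. The notation below is an assumption chosen for this sketch: T is the number of leaves in a tree, w_j are the leaf weights, G and H are the sums of first and second derivatives of the loss over the rows in a node, and gamma, lambda, and alpha correspond to the gamma, reg_lambda, and reg_alpha parameters. The gain formula is shown for the case alpha = 0.

```latex
\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \lvert w_j \rvert

\text{Gain} = \tfrac{1}{2}\left[
  \frac{G_L^{2}}{H_L + \lambda}
  + \frac{G_R^{2}}{H_R + \lambda}
  - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda}
\right] - \gamma
```

A split whose gain is negative costs more in complexity than it saves in loss, which is why such splits are candidates for pruning.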

Usage:

To use XGBoost, data scientists typically follow these steps:

  1. Data Preparation: Clean and preprocess the dataset, addressing missing values and converting categorical variables as necessary.

  2. Model Configuration: Set parameters for the XGBoost model. This includes specifying the learning rate, the number of trees to create, maximum depth, regularization parameters, and evaluation metrics.

  3. Training: Train the model on the training dataset while monitoring performance on a validation dataset to avoid overfitting.

  4. Prediction: After training, the model is used to make predictions on new data.

  5. Evaluation: Finally, the model’s predictions are evaluated using appropriate metrics (e.g., accuracy, RMSE, F1 score).

XGBoost has become a go-to algorithm due to its impressive performance across a variety of tasks, and its ability to produce predictive models that are both accurate and efficient. Given its flexibility and robustness, it has gained immense popularity in the machine learning community, making it a critical tool for practitioners.

Adult Dataset

The Adult dataset, frequently referred to as the Census Income dataset, is a widely recognized collection of data utilized for binary classification tasks in machine learning. This dataset contains information from the U.S. Census and includes various attributes such as age, education level, occupation, and marital status, among others. The primary objective when working with this dataset is to predict whether an individual's income exceeds $50,000 per year based on these features. Its rich variety of demographic information and clear binary target variable make it an excellent resource for testing algorithms and exploring concepts in classification and predictive modeling.

Let's start by importing the libraries we will use.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier
import numpy as np
from collections import Counter

We will obtain the Adult dataset by utilizing the fetch_openml function from the scikit-learn library. This function allows us to easily download and load the dataset from OpenML, a platform that hosts various machine learning datasets. By using this method, we can access the data in a structured format, making it convenient for further analysis and modeling tasks.

Goal

Predicting whether an individual’s income exceeds $50,000 per year.

Load the Adult dataset

adult = fetch_openml('adult', as_frame=True)
X, y = adult.data, adult.target

Let's print the shape and contents of the loaded data.

Print key information about the dataset


print(f"Dataset shape: {X.shape}")
print(f"Features: {adult.feature_names}")
print(f"Target variable: {adult.target_names}")
print(f"Class distributions: {Counter(y)}")

Output:

Dataset shape: (48842, 14)
Features: ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'native-country']
Target variable: ['class']
Class distributions: Counter({'<=50K': 37155, '>50K': 11687})
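The class distribution is imbalanced, so it helps to know what accuracy a trivial model would reach. The short sketch below uses only the counts printed above and computes the accuracy of always predicting the majority class; the variable names are chosen for this example.

```python
from collections import Counter

# Class counts copied from the output above.
class_counts = Counter({'<=50K': 37155, '>50K': 11687})

total = sum(class_counts.values())
majority_class, majority_count = class_counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = majority_count / total

print(f"Total rows: {total}")
print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Any trained model should be judged against this baseline of roughly 0.761 rather than against 0.5.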

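Let's look at some of the data. Note that in this version of the dataset several numeric attributes, such as age, capitalgain, capitalloss, and hoursperweek, appear as small binned codes rather than raw values.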

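X.head(5)

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capitalgain | capitalloss | hoursperweek | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 1 | 0 | 2 | United-States |
| 1 | 3 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 0 | United-States |
| 2 | 2 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 2 | United-States |
| 3 | 3 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 2 | United-States |
| 4 | 1 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 2 | Cuba |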

By default, XGBoost expects numeric input, so we will use scikit-learn to convert the categorical string features into integer codes. Common options are ordinal (integer) encoding and one-hot encoding; here we use ordinal encoding, which maps each distinct category string to an integer. Tree-based models such as XGBoost split on thresholds, so they cope reasonably well with integer-coded categories, and the feature matrix stays compact.

From the data above, we can see that the following columns are categorical:

nominal = ['workclass', 'education', 'marital-status', 'occupation', 'relationship',
 'race', 'sex', 'native-country']

We will use the ColumnTransformer from scikit-learn to build the preprocessing step. The categorical columns are processed by the OrdinalEncoder, which replaces each category with an integer code. By default the codes follow the sorted order of the category values, so for nominal features such as occupation or race the numeric order carries no real meaning; it is simply a compact numeric representation. The remaining columns are kept unchanged through the 'passthrough' option. Note that the transformer outputs the encoded columns first, followed by the passthrough columns, so the column order differs from the original DataFrame.

transformer = ColumnTransformer(transformers=[('ordinal', OrdinalEncoder(), nominal)],
 remainder='passthrough')

Perform ordinal encoding

X = transformer.fit_transform(X)
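To see what this transformation does, here is a small sketch on a toy array instead of the Adult data. The rows and the use of a column index in place of column names are assumptions made for this example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Toy rows: column 0 is numeric, column 1 is a categorical string.
X_toy = np.array([[39, 'State-gov'],
                  [50, 'Private'],
                  [38, 'Private']], dtype=object)

toy_transformer = ColumnTransformer(
    transformers=[('ordinal', OrdinalEncoder(), [1])],  # encode column 1 only
    remainder='passthrough')                            # keep column 0 as-is
X_encoded = toy_transformer.fit_transform(X_toy)

# Categories are sorted, and each one is replaced by its position in that order.
categories = toy_transformer.named_transformers_['ordinal'].categories_[0]
print(list(categories))
print(X_encoded.tolist())
```

In the result, the encoded column comes first and the passthrough column second, and 'Private' receives code 0 only because it sorts before 'State-gov'.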

LabelEncoder

The LabelEncoder is a preprocessing utility that converts categorical target labels into integers. Each unique label is mapped to a value from 0 to n_classes - 1, where n_classes is the number of distinct categories in the target variable. This is useful for classification, because many estimators, XGBClassifier included, expect numeric class labels rather than strings.

Apply the LabelEncoder only to the target labels, not to the features; for features, use OrdinalEncoder or OneHotEncoder instead. Classes are numbered in sorted order, so here '<=50K' becomes 0 and '>50K' becomes 1.

y = LabelEncoder().fit_transform(y)
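The mapping can be checked on a handful of labels. In this sketch the label list is made up for illustration, and the encoder is kept in a variable (unlike the one-liner above) so that classes_ and inverse_transform remain available.

```python
from sklearn.preprocessing import LabelEncoder

# A few example labels in the same format as the Adult target.
labels = ['<=50K', '>50K', '<=50K', '>50K', '<=50K']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(list(encoder.classes_))                       # sorted class names
print(encoded.tolist())                             # integer codes per label
print(encoder.inverse_transform([1, 0]).tolist())   # codes back to names
```

Keeping the fitted encoder around is handy later, when integer predictions need to be turned back into readable class names.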

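We will split the data into training and test sets, holding out 20% for testing and stratifying on the target so that both sets keep the same class proportions.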


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
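The effect of stratify can be verified on a small imbalanced toy target. The arrays below are hypothetical and exist only to show that the class ratio is preserved in both parts of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 80 negatives and 20 positives (20% positive rate).
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

print(f"Positive rate overall: {y_toy.mean():.2f}")
print(f"Positive rate in train: {y_tr.mean():.2f}")
print(f"Positive rate in test:  {y_te.mean():.2f}")
```

Without stratify, a random split of a small or imbalanced dataset can leave the test set with a noticeably different class ratio.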

In order to optimize the performance of an XGBoost model, it's essential to establish a detailed parameter grid that encompasses a variety of hyperparameters. This grid will allow for a systematic exploration of different configurations to identify the best combination for our specific dataset.

  1. Learning Rate (learning_rate, also known as eta): This shrinks the contribution of each tree, so smaller values usually need more trees. Values typically range from 0.01 to 0.3.

  2. Maximum Depth (max_depth): Defines the maximum depth of a tree in the ensemble. Common values are between 3 and 10.

  3. Subsample: This parameter represents the fraction of samples to be used for each tree. It usually takes values between 0.5 and 1.0.

  4. Colsample_bytree: The fraction of features to consider when building each tree, typically ranging from 0.3 to 1.

  5. Number of Estimators (n_estimators): The number of trees to be created in the boosting process, commonly set between 100 and 1000.

By methodically defining this parameter grid, we can employ techniques like grid search or random search to uncover the optimal hyperparameter settings that enhance the model's predictive capabilities.

param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}
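Before running a grid search it is worth counting how much work the grid implies. The sketch below repeats the grid from above and multiplies out the number of candidates; the fold count of 3 is an assumption matching the cross-validation setting used later in this article.

```python
from itertools import product
from math import prod

param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Every combination of one value per parameter is a candidate.
n_candidates = prod(len(values) for values in param_grid.values())

cv_folds = 3  # assumption: three-fold cross-validation
total_fits = n_candidates * cv_folds

# Build one candidate explicitly to show what a single combination looks like.
first_candidate = dict(zip(param_grid, next(product(*param_grid.values()))))

print(f"Candidates: {n_candidates}")
print(f"Model fits with {cv_folds}-fold CV: {total_fits}")
print(f"Example candidate: {first_candidate}")
```

The count grows multiplicatively with every added parameter or value, which is the main reason to keep grids small or to switch to random search.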

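We create the XGBClassifier. Setting n_jobs=1 keeps each individual model single-threaded, which avoids oversubscribing the CPU when GridSearchCV later runs many fits in parallel.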

model = XGBClassifier(objective='binary:logistic', random_state=42, n_jobs=1)

The "binary:logistic" objective function in XGBoost is specifically designed for binary classification tasks, where the target variable consists of two distinct classes or outcomes. In this context, it focuses on predicting which of the two classes a given instance belongs to. The optimization process targets the log loss function, which measures the performance of a classification model whose output is a probability value between 0 and 1.

By utilizing the log loss function, this objective effectively quantifies how far off the predicted probabilities are from the actual class labels. This makes "binary:logistic" particularly suitable for applications where understanding the likelihood of an instance belonging to a specific class is crucial.
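A small plain-Python sketch shows how log loss behaves. The function name binary_log_loss and the toy probabilities are assumptions made for this example; it mirrors the standard definition rather than XGBoost's internal implementation.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]  # high probability on the correct class
confident_wrong = [0.1, 0.9, 0.2, 0.8]  # high probability on the wrong class

loss_right = binary_log_loss(y_true, confident_right)
loss_wrong = binary_log_loss(y_true, confident_wrong)

print(f"Log loss, confident and right: {loss_right:.3f}")
print(f"Log loss, confident and wrong: {loss_wrong:.3f}")
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded, which pushes the model toward well-calibrated probabilities.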

To search the hyperparameter space systematically, we will use GridSearchCV from scikit-learn, which evaluates every combination of values in the param_grid dictionary. We pass our model as the estimator together with the parameter grid, and set cv to 3 for three-fold cross-validation. Because no scoring argument is given, candidates are ranked by the estimator's default score, which for a classifier is accuracy. Setting n_jobs to -1 uses all available CPU cores. Finally, we fit the search on the training data, X_train and y_train, so the best combination is selected on cross-validated performance without touching the test set.

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)
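The per-candidate scores behind a grid search can be inspected through cv_results_. The sketch below is a stand-in, not the search above: it uses a DecisionTreeClassifier on synthetic data so that it runs in a few seconds, and it passes scoring='accuracy' explicitly to make the ranking metric visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a small tree keep this demonstration fast.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 4]},
    cv=3,
    scoring='accuracy',  # explicit here; classifiers are ranked by accuracy by default
)
search.fit(X_demo, y_demo)

# One row per candidate: its parameters and mean cross-validated score.
results = search.cv_results_
for params, mean_score in zip(results['params'], results['mean_test_score']):
    print(params, f"{mean_score:.3f}")

print(f"Best parameters: {search.best_params_}")
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

The best_score_ value is simply the highest mean_test_score among the candidates, so it is a cross-validated estimate and not a test-set result.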

Print best score and parameters


print(f"Best score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")

Output:

Best score: 0.859
Best parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100, 'subsample': 0.8}

Access the best model from grid_search

To obtain the best model from the grid search, we read the best_estimator_ attribute of the grid_search object. With the default refit=True, this is the model refit on the full training set using the parameter combination that achieved the highest mean cross-validated score.


best_model = grid_search.best_estimator_

Save the best model


best_model.save_model('best_model_adult.ubj')

Now we load the saved model


loaded_model = XGBClassifier()
loaded_model.load_model('best_model_adult.ubj')

To generate predictions, we call the predict method on the loaded model and pass in the test features, X_test. The output is an array of predicted class labels (0 or 1), one per test row.

# Use loaded model for predictions
predictions = loaded_model.predict(X_test)

Print the accuracy score

To evaluate the model, we compute its accuracy on the test set. The score method of loaded_model takes X_test and the true labels y_test, predicts a label for each row, and returns the proportion of rows predicted correctly.


accuracy = loaded_model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.3f}")

Output:

Accuracy: 0.862

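This is a pretty good accuracy: always predicting the majority class '<=50K' would score only about 0.761 (37155 of 48842), so the model is clearly learning from the features.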
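Because the classes are imbalanced, accuracy is worth complementing with precision, recall, and the F1 score mentioned earlier. The plain-Python sketch below uses hypothetical toy labels, not the results of this article, and the function name precision_recall_f1 is chosen for this example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical toy labels: 6 negatives and 4 positives.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

toy_accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
precision, recall, f1 = precision_recall_f1(y_true_toy, y_pred_toy)

print(f"Accuracy:  {toy_accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 score:  {f1:.3f}")
```

In this toy case the accuracy looks acceptable while recall shows that half of the positive class is missed, which is the kind of gap accuracy alone can hide.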
