Hyperparameter Tuning of an XGBRegressor Model Using Randomized Search

Tobias Anyigor

I. Introduction

Some machine learning models contain hyperparameters that must be tuned to achieve the desired result. Hyperparameters are settings that are external to the model and cannot be estimated from the training data. Hyperparameter tuning is the process of varying the selected hyperparameters, either manually or automatically. The metric used to decide which model is best for a given use case depends on the data scientist's understanding of what that metric implies in the business domain or experiment. This article does not go deeply into the difference between parameters and hyperparameters; here is an article on that difference.

Instead, this article focuses on hyperparameter tuning with randomized search, using a practical example. The selected dataset is a public dataset on Nigerian car prices obtained from Kaggle. Note that the terms "parameters" and "hyperparameters" are used interchangeably in most places in this article.

In the discussed example, XGBRegressor was used to build a prediction model. The model is to estimate the market value of cars when "make", "year of manufacture", "mileage", "condition", "transmission", and "fuel" are entered. There are many other algorithms for building a regression model, such as Ridge, Lasso, and Linear Regression. XGBRegressor was chosen because it combines the strengths of gradient boosting and decision trees.

II. Importance of hyperparameter tuning

Tuning the hyperparameters of a machine learning model is crucial for optimizing its performance. This process involves selecting relevant parameters based on the characteristics of the training and target data and experimenting with different parameter values to find the best configuration. While the model's documentation typically specifies the parameters to be tuned, manually tuning them can be a daunting task. To simplify this process, we can leverage powerful techniques like GridSearchCV or RandomizedSearchCV.

III. Introducing the Randomized Search for Hyperparameter Tuning

GridSearchCV is an alternative to RandomizedSearchCV. GridSearchCV takes the hyperparameters and a list of candidate values for each, then performs an exhaustive search: it builds every possible combination of the listed values, evaluates each one with cross-validation on the training data, and reports the best combination. This is time-consuming, and the cost grows quickly as more hyperparameters are tuned, because the number of combinations is the product of the number of values listed for each hyperparameter. Here is a guide on GridSearchCV. Randomized search is often a more practical alternative: it evaluates only a fixed number of randomly selected combinations, so the cost is set by the chosen budget rather than by the size of the grid.
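To make the difference concrete, here is a minimal sketch in plain Python. The small search space below is hypothetical and only for illustration, and the sampling is a simplification of the idea, not scikit-learn's implementation.

```python
import itertools
import random

# Hypothetical, deliberately small search space (illustration only).
space = {
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [3, 5, 7],
    "n_estimators": [50, 100, 200],
}

# Grid search: every combination of the listed values.
names = list(space)
grid = [dict(zip(names, values)) for values in itertools.product(*space.values())]

# Randomized search: a fixed budget of combinations drawn at random.
rng = random.Random(42)  # seeded so the draw is repeatable
sampled = rng.sample(grid, k=5)

print(len(grid))     # 27 combinations (3 * 3 * 3)
print(len(sampled))  # 5 combinations
```

Adding a fourth hyperparameter with three values would triple the grid to 81 combinations, while the random budget would stay at 5.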

IV. Implementation

Importing requirements for data cleaning and visualization

#loading requirements for data cleaning and visualization

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (20,10)
import warnings
warnings.filterwarnings('ignore')

Loading the dataset

#loading dataset
df = pd.read_csv(
r'c:\Users\anuel\OneDrive\Desktop\Price_with_xgboost\Nigerian_Car_Prices.csv')

Download the dataset to your local machine and load it manually into your Jupyter Notebook environment.

Preparation of dataset

#dataset shape
df.shape
# first ten rows in dataset
df.head(10)
#information on dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4095 entries, 0 to 4094
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           4095 non-null   int64  
 1   Make                 4095 non-null   object 
 2   Year of manufacture  3617 non-null   float64
 3   Condition            3616 non-null   object 
 4   Mileage              4024 non-null   float64
 5   Engine Size          3584 non-null   float64
 6   Fuel                 3607 non-null   object 
 7   Transmission         4075 non-null   object 
 8   Price                4095 non-null   object 
 9   Build                1127 non-null   object 
dtypes: float64(3), int64(1), object(6)
memory usage: 320.0+ KB
#describe dataset
df.describe()
    Unnamed: 0    Year of manufacture    Mileage    Engine Size
count    4095.000000    3617.000000    4.024000e+03    3584.000000
mean    2047.000000    2007.898535    1.825337e+05    3274.976562
std    1182.269005    4.300126    2.109233e+05    7693.489588
min    0.000000    1992.000000    1.000000e+00    3.000000
25%    1023.500000    2005.000000    1.020640e+05    2000.000000
50%    2047.000000    2008.000000    1.613525e+05    2500.000000
75%    3070.500000    2011.000000    2.319522e+05    3500.000000
max    4094.000000    2021.000000    9.976050e+06    371000.000000

Price does not appear in the summary because it is stored as an object (string) column: the values in the price column contain ",". One way out is to strip the commas and convert the column to float.

#remove "," from the prices and convert them to float
df.Price = df.Price.str.replace(',', '').astype(float)
#view all rows with engine size greater than or equal to 5700
df.loc[df["Engine Size"] >= 5700]

Unnamed: 0    Make    Year of manufacture    Condition    Mileage    Engine Size    Fuel    Transmission    Price    Build
54    54    Lexus    2016.0    Nigerian Used    107355.0    5700.0    Petrol    Automatic    42000000.0    SUV
81    81    Fiat    2000.0    Foreign Used    286241.0    24000.0    Diesel    Manual    3675000.0    NaN
95    95    Fiat    2000.0    Foreign Used    286241.0    24000.0    Diesel    Manual    3675000.0    NaN
122    122    Tata    2008.0    Foreign Used    NaN    371000.0    Diesel    Manual    17850000.0    NaN
176    176    Lexus    2008.0    Foreign Used    200262.0    35000.0    Petrol    Automatic    4680000.0    NaN
...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
3415    3415    Honda    2012.0    Nigerian Used    180989.0    16000.0    Petrol    Automatic    2415000.0    NaN
3440    3440    Honda    2012.0    Nigerian Used    180989.0    16000.0    Petrol    Automatic    2415000.0    NaN
3870    3870    Honda    1996.0    Nigerian Used    234412.0    22000.0    Petrol    Automatic    735000.0    NaN
3996    3996    Acura    2002.0    Nigerian Used    236451.0    35000.0    Petrol    Automatic    1462500.0    NaN
4003    4003    Toyota    2012.0    Foreign Used    296796.0    184421.0    Petrol    Automatic    5287500.0    NaN
62 rows × 10 columns

The engine size of cars is given either in cubic centimetres (cc) or in liters, and the values in the engine size column are not consistent. For consistency, it is necessary to choose a single unit of measurement. Here, everything is converted to cubic centimetres.

# a function to convert engine sizes to cc
def convert_to_cc(val):
    if len(str(val))-2 == 1:
        return val * 1000
    elif len(str(val)) -2 ==2:
        return val*100
    elif len(str(val))-2 == 3:
        return val * 10
    elif len(str(val)) - 2 ==5:
        return val / 10
    elif len(str(val)) - 2 == 6:
        return val/ 100
    else:
        return val
# apply the 'convert_to_cc' function on the 'Engine Size' column
df['Engine Size'] = df['Engine Size'].apply(convert_to_cc)
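The length-based checks above depend on how each value prints as a string. A slightly more general sketch, under the assumption that every engine size should end up as a four-digit cc value, counts the digits of the integer part instead. The function name engine_size_to_cc is hypothetical and not part of the original notebook.

```python
import math

def engine_size_to_cc(val):
    """Scale an engine size so its integer part has four digits (assumed cc)."""
    # Leave missing values untouched so they can be handled later.
    if val is None or (isinstance(val, float) and math.isnan(val)):
        return val
    if val <= 0:
        return val
    digits = len(str(int(val)))  # number of digits in the integer part
    if digits <= 4:
        return val * 10 ** (4 - digits)  # e.g. 2.0 -> 2000.0
    return val / 10 ** (digits - 4)      # e.g. 24000.0 -> 2400.0

print(engine_size_to_cc(2.0))       # 2000.0
print(engine_size_to_cc(24000.0))   # 2400.0
print(engine_size_to_cc(371000.0))  # 3710.0
```

Unlike the string-length version, this sketch also rescales values with seven or more digits, which may or may not be what the data needs.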
#checking for null values
df.isnull().sum()
Unnamed: 0                0
Make                      0
Year of manufacture     478
Condition               479
Mileage                  71
Engine Size             511
Fuel                    488
Transmission             20
Price                     0
Build                  2968
dtype: int64

Removing null values in the dataset

#drop na
df = df.dropna()
df.isnull().sum()

For more on the data preprocessing, view the original notebook here

The cleaned dataset will be used for the prediction. Download the cleaned dataset.

Loading Practice Dataset


#loading dataset
df = pd.read_csv('https://raw.githubusercontent.com/AnyigorTobias/Car_price_prediction_with_xgboost/main/Cleaned_Car_dataset.csv')
Y = df["Price"] #target 
del df["Price"] #features

Preprocessing the data using DictVectorizer.

The DictVectorizer transforms lists of feature-value dictionaries into numeric vectors. The given dataset contains categorical variables, which must be encoded as numbers before most estimators can use them. DictVectorizer one-hot encodes string-valued features and passes numeric features through unchanged. Refer to the sklearn documentation on DictVectorizer.

from sklearn.feature_extraction import DictVectorizer
#transform to dictionary

data_dict= df.to_dict(orient = "records")

dv = DictVectorizer(sparse=False)
model_data = dv.fit_transform(data_dict)
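To see what this encoding produces, here is a rough plain-Python sketch of the idea. It is an illustration only, not scikit-learn's implementation; the vectorize helper and the two sample records are made up for this example.

```python
def vectorize(records):
    """One-hot encode string features and pass numeric features through."""
    # Column names: "feature=value" for strings, "feature" for numbers.
    columns = sorted({
        f"{key}={value}" if isinstance(value, str) else key
        for record in records
        for key, value in record.items()
    })
    index = {name: i for i, name in enumerate(columns)}
    rows = []
    for record in records:
        row = [0.0] * len(columns)
        for key, value in record.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0  # mark the matching category
            else:
                row[index[key]] = float(value)      # keep the numeric value
        rows.append(row)
    return columns, rows

# Made-up records in the same shape as df.to_dict(orient="records").
records = [
    {"Make": "Toyota", "Fuel": "Petrol", "Mileage": 120000.0},
    {"Make": "Honda", "Fuel": "Diesel", "Mileage": 98000.0},
]
columns, rows = vectorize(records)
print(columns)  # ['Fuel=Diesel', 'Fuel=Petrol', 'Make=Honda', 'Make=Toyota', 'Mileage']
print(rows[0])  # [0.0, 1.0, 0.0, 1.0, 120000.0]
```

Each distinct category becomes its own 0/1 column, which is why a column with many distinct values, such as "Make", widens the feature matrix.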

Splitting dataset


from sklearn.model_selection import train_test_split
#train test split on the vectorized features
X_train,X_test,Y_train,Y_test=train_test_split(model_data,Y,test_size=0.30, random_state = 42)
print("train data length:",len(X_train))
print("test data length:",len(X_test))
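As a rough picture of what a seeded 70/30 split does, the sketch below shuffles row indices with a fixed seed and holds out 30% of them. It is a simplified illustration, not scikit-learn's algorithm; the split_indices helper is hypothetical and its rounding rule is an assumption.

```python
import random

def split_indices(n_samples, test_size=0.30, seed=42):
    """Shuffle row indices with a fixed seed and hold out a test fraction."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)  # assumed rounding rule
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10)
print("train data length:", len(train_idx))  # 7
print("test data length:", len(test_idx))    # 3
```

Because the seed is fixed, calling split_indices(10) again returns the same two lists, which is the role random_state plays in the split above.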

Defining the XGBRegressor model

import xgboost as xgb

# Create an instance of XGBRegressor
xgb_model = xgb.XGBRegressor()

To view the parameters of the XGBRegressor

for parameter in xgb_model.get_params():
    print(parameter)

Here is the output

objective
base_score
booster
callbacks
colsample_bylevel
colsample_bynode
colsample_bytree
early_stopping_rounds
enable_categorical
eval_metric
feature_types
gamma
gpu_id
grow_policy
importance_type
interaction_constraints
learning_rate
max_bin
max_cat_threshold
max_cat_to_onehot
max_delta_step
max_depth
max_leaves
min_child_weight
missing
monotone_constraints
n_estimators
n_jobs
num_parallel_tree
predictor
random_state
reg_alpha
reg_lambda
sampling_method
scale_pos_weight
subsample
tree_method
validate_parameters
verbosity

To learn what each parameter signifies, read the documentation on XGBRegressor.

Hyperparameter Tuning

Here is a list of parameters to be tuned and their ranges:

# Define the parameter space
param_space = {
    'learning_rate': [0.01, 0.1, 0.3],
    'max_depth': [3, 5, 7],
    'n_estimators': [50, 100, 200],
    'gamma': [0, 0.1, 0.3],
    'subsample': [0.5, 0.7, 1.0],
    'colsample_bytree': [0.5, 0.7, 1.0],
    'reg_lambda': [0.1, 1.0, 10.0],
    'reg_alpha': [0, 0.1, 1.0]
}

The parameters and parameter values were chosen based on the guidance given in the documentation of XGBRegressor.

#model parameter selection
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

# Define the parameter space
param_space = {
    'learning_rate': [0.01, 0.1, 0.3],
    'max_depth': [3, 5, 7],
    'n_estimators': [50, 100, 200],
    'gamma': [0, 0.1, 0.3],
    'subsample': [0.5, 0.7, 1.0],
    'colsample_bytree': [0.5, 0.7, 1.0],
    'reg_lambda': [0.1, 1.0, 10.0],
    'reg_alpha': [0, 0.1, 1.0]
}

# Create an instance of XGBRegressor
xgb_model = xgb.XGBRegressor()

# Create an instance of RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_space,
    n_iter=50,  # Number of random samples to test
    scoring='neg_mean_squared_error',
    n_jobs=-1,  # Use all available cores
    cv=5,  # 5-fold cross-validation
    verbose=3,  # Print progress
    random_state=42
)

# Train the model using RandomizedSearchCV
random_search.fit(X_train, Y_train)

# Print the best hyperparameters
print(random_search.best_params_)

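n_iter = 50. This means that 50 random combinations of the hyperparameters will be sampled and evaluated.

scoring = 'neg_mean_squared_error'. Candidates are ranked by mean squared error; scikit-learn negates it because the search always maximizes the score.

n_jobs = -1. This ensures that all available CPU cores are used during the computation.

cv = 5. With 5-fold cross-validation, the training data is divided into 5 subsets, and each candidate is trained and evaluated 5 times, each time holding out a different subset for validation.

random_state = 42. This fixes the random seed used to sample the combinations, so the same candidates are drawn on every run.

verbose = 3. This prints progress messages while the search runs.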

Here is the output after running the code.

Fitting 5 folds for each of 50 candidates, totaling 250 fits
{'subsample': 1.0, 'reg_lambda': 0.1, 'reg_alpha': 1.0, 'n_estimators': 200, 'max_depth': 3, 'learning_rate': 0.3, 'gamma': 0, 'colsample_bytree': 0.7}
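The "250 fits" in the log follows from 50 candidates times 5 folds. The sketch below shows, in plain Python, how 5 contiguous folds could be carved out of 10 samples and how the fit count compares with an exhaustive grid over the same eight 3-value lists. The k_fold_indices helper is hypothetical and only illustrates the idea of unshuffled k-fold splitting.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous validation folds."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, val))
        start += size
    return folds

folds = k_fold_indices(n_samples=10, k=5)
for train, val in folds:
    print("train:", train, "validate:", val)

n_candidates, n_folds = 50, 5
random_fits = n_candidates * n_folds  # fits done by the randomized search
grid_fits = 3 ** 8 * n_folds          # fits an exhaustive grid would need
print(random_fits)  # 250
print(grid_fits)    # 32805
```

Every sample is used for validation exactly once, and the randomized search needs well under 1% of the fits that the full grid would.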

Training the model

#model training

from xgboost import XGBRegressor

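# initialize the model with the values found by the search
# (n_estimators is set to 50 here, although the search reported 200)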
model = XGBRegressor(n_estimators=50, reg_alpha = 1.0, 
                     reg_lambda = 0.1,subsample = 1.0, 
                     learning_rate=0.3, max_depth=3, gamma= 0,
                     colsample_bytree=0.7)


# train the model on the training set
model.fit(X_train, Y_train)

Evaluating model performance based on R^2 score

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Predict on the test data
y_pred = model.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(Y_test, y_pred)
print("Mean Squared Error:", mse)
# calculate the RMSE
rmse = np.sqrt(mse)
print("RMSE:", rmse)

# calculate the R^2 score
r2 = r2_score(Y_test, y_pred)
print('R^2 score:', r2)
Mean Squared Error: 3964045308043.7437
RMSE: 1990991.0366558016
R^2 score: 0.831779765151191
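For readers who want to see what these three numbers mean, here is a minimal sketch that computes them by hand in plain Python. The regression_metrics helper and the four sample values are made up for illustration; the last lines only re-derive the RMSE from the MSE reported above.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R^2) computed from first principles."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse = ss_res / n
    return mse, math.sqrt(mse), 1 - ss_res / ss_tot

# Made-up values, chosen so the arithmetic is easy to follow.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.5, 2.5]
mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(mse)  # 0.625
print(r2)   # 0.5

# RMSE is the square root of MSE, so it is in the same units as Price.
reported_mse = 3964045308043.7437
reported_rmse = math.sqrt(reported_mse)
print(round(reported_rmse, 2))  # 1990991.04
```

An R^2 of about 0.83 means the model accounts for roughly 83% of the variance of the test prices around their mean, while the RMSE says that a typical error is on the order of 2 million in the units of the Price column.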

V. Conclusion

As the example shows, hyperparameter tuning with randomized search is straightforward to set up and can improve a model's performance; here the tuned model reached an R^2 score of about 0.83 on the test set. Here is a quick recap of how to make use of randomized search:

  • understand the parameters of the model you want to use

  • select the parameters that will best suit your data (read the documentation of your preferred model to know more about this)

  • select suitable values for the RandomizedSearchCV arguments such as n_iter, n_jobs, cv, and scoring (read the documentation of RandomizedSearchCV)

  • instantiate the model, pass it to the randomized search, and fit the search on the training data.

Note that once the RandomizedSearchCV is fitted, it can be used like any other predictor: with the default refit=True, calling 'predict' (and the other methods supported by the model used within the RandomizedSearchCV) delegates to the best model found.

