Hyperparameter Tuning of XGBRegressor Model Using Randomized Search
I. Introduction
Some machine learning models have hyperparameters that must be tuned to achieve the desired result. Hyperparameters are settings that are external to the model and cannot be estimated from the training data. Hyperparameter tuning is the process of varying the selected hyperparameters, either manually or automatically, until the model performs well on a chosen metric. Which metric defines the "best" model for a given use case depends on the data scientist's understanding of what that metric implies in the business domain or experiment. This article does not go deeply into the distinction between parameters and hyperparameters; here is an article on the difference between parameters and hyperparameters.
Instead, this article focuses on randomized search hyperparameter tuning with a practical example. The selected dataset is a public dataset on Nigerian car prices obtained from Kaggle. Note that the terms "parameters" and "hyperparameters" are used interchangeably in most places in this article.
In the example discussed here, XGBRegressor is used to build a prediction model. The model estimates the market value of a car when "make", "year of manufacture", "mileage", "condition", "transmission", and "fuel" are entered. There are many other algorithms for building a regression model, such as Ridge, Lasso, and Linear Regression. XGBRegressor was chosen because it applies gradient boosting to decision trees, which lets it capture non-linear relationships that these linear models cannot.
II. Importance of hyperparameter tuning
Tuning the hyperparameters of a machine learning model is crucial for optimizing its performance. This process involves selecting relevant parameters based on the characteristics of the training and target data and experimenting with different parameter values to find the best configuration. While the model's documentation typically specifies the parameters to be tuned, manually tuning them can be a daunting task. To simplify this process, we can leverage powerful techniques like GridSearchCV or RandomizedSearchCV.
III. Introducing the Randomized Search for Hyperparameter Tuning
GridSearchCV and RandomizedSearchCV are alternatives to each other. GridSearchCV takes the hyperparameters and a list of values for each, and then performs an exhaustive search: it builds every possible combination of the listed values, evaluates each combination with cross-validation on the training data, and reports the best one. This process is time-consuming, and it becomes very expensive as the number of hyperparameters grows, because the number of combinations multiplies with every hyperparameter added. Here is a guide on GridSearchCV. A randomized search is often a more practical alternative because it evaluates only a fixed number of randomly sampled combinations, so its cost is set by the number of samples rather than by the size of the search space.
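To make the cost difference concrete, here is a minimal sketch in plain Python. The three hyperparameters and their candidate values are illustrative assumptions, not the search space used later in this article; the point is only that a grid search enumerates every combination while a randomized search draws a fixed number of them.

```python
import itertools
import random

# Illustrative search space (an assumption for this sketch)
space = {
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [3, 5, 7],
    "n_estimators": [50, 100, 200],
}

# Grid search: every combination of the listed values
names = list(space)
grid = [dict(zip(names, values)) for values in itertools.product(*space.values())]

# Randomized search: a fixed number of combinations drawn at random
rng = random.Random(42)
n_iter = 5
sampled = rng.sample(grid, n_iter)

print("grid combinations:", len(grid))        # 3 * 3 * 3 = 27
print("random combinations:", len(sampled))   # n_iter = 5
```

Adding a fourth hyperparameter with three values would triple the grid to 81 combinations, while the randomized budget would stay at n_iter.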
IV. Implementation
Importing requirements for data cleaning and visualization
#loading requirements for data cleaning and visualization
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (20,10)
import warnings
warnings.filterwarnings('ignore')
Loading the dataset
#loading dataset
df = pd.read_csv(
r'c:\Users\anuel\OneDrive\Desktop\Price_with_xgboost\Nigerian_Car_Prices.csv')
Download the dataset to your local machine and load it manually into your Jupyter Notebook environment.
Preparation of dataset
#dataset shape
df.shape
# first ten rows in dataset
df.head(10)
#information on dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4095 entries, 0 to 4094
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 4095 non-null int64
1 Make 4095 non-null object
2 Year of manufacture 3617 non-null float64
3 Condition 3616 non-null object
4 Mileage 4024 non-null float64
5 Engine Size 3584 non-null float64
6 Fuel 3607 non-null object
7 Transmission 4075 non-null object
8 Price 4095 non-null object
9 Build 1127 non-null object
dtypes: float64(3), int64(1), object(6)
memory usage: 320.0+ KB
#describe dataset
df.describe()
Unnamed: 0 Year of manufacture Mileage Engine Size
count 4095.000000 3617.000000 4.024000e+03 3584.000000
mean 2047.000000 2007.898535 1.825337e+05 3274.976562
std 1182.269005 4.300126 2.109233e+05 7693.489588
min 0.000000 1992.000000 1.000000e+00 3.000000
25% 1023.500000 2005.000000 1.020640e+05 2000.000000
50% 2047.000000 2008.000000 1.613525e+05 2500.000000
75% 3070.500000 2011.000000 2.319522e+05 3500.000000
max 4094.000000 2021.000000 9.976050e+06 371000.000000
Price does not appear in the summary because pandas stores it as an object (string) column: the values contain "," as a thousands separator. Removing the separator and casting the column to float fixes this.
#to remove "," in the prices
df.Price = df.Price.str.replace(',', '').astype(float)
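As a quick check of this step, the sketch below applies the same cleaning to a small hand-made Series. The values are illustrative and are not rows taken from the dataset. Passing regex=False makes pandas treat "," as a literal character.

```python
import pandas as pd

# Illustrative price strings with thousands separators (not from the dataset)
prices = pd.Series(["3,675,000", "42,000,000", "735,000"])

# Remove the separators, then cast the strings to float
cleaned = prices.str.replace(",", "", regex=False).astype(float)

print(cleaned.dtype)
print(cleaned.tolist())
```

Once the column is numeric, df.describe() includes it in the summary.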
#view all cases with engine size greater than or equal to 5700.
df.loc[df["Engine Size"] >= 5700]
Unnamed: 0 Make Year of manufacture Condition Mileage Engine Size Fuel Transmission Price Build
54 54 Lexus 2016.0 Nigerian Used 107355.0 5700.0 Petrol Automatic 42000000.0 SUV
81 81 Fiat 2000.0 Foreign Used 286241.0 24000.0 Diesel Manual 3675000.0 NaN
95 95 Fiat 2000.0 Foreign Used 286241.0 24000.0 Diesel Manual 3675000.0 NaN
122 122 Tata 2008.0 Foreign Used NaN 371000.0 Diesel Manual 17850000.0 NaN
176 176 Lexus 2008.0 Foreign Used 200262.0 35000.0 Petrol Automatic 4680000.0 NaN
... ... ... ... ... ... ... ... ... ... ...
3415 3415 Honda 2012.0 Nigerian Used 180989.0 16000.0 Petrol Automatic 2415000.0 NaN
3440 3440 Honda 2012.0 Nigerian Used 180989.0 16000.0 Petrol Automatic 2415000.0 NaN
3870 3870 Honda 1996.0 Nigerian Used 234412.0 22000.0 Petrol Automatic 735000.0 NaN
3996 3996 Acura 2002.0 Nigerian Used 236451.0 35000.0 Petrol Automatic 1462500.0 NaN
4003 4003 Toyota 2012.0 Foreign Used 296796.0 184421.0 Petrol Automatic 5287500.0 NaN
62 rows × 10 columns
The engine size of cars is given either in cubic centimeters or in liters. The values in the Engine Size column are not consistent. For consistency, it is necessary to choose a specific unit of measurement. We are going to maintain consistency by converting everything to cubic centimeters.
# a function to convert engine sizes to cc
def convert_to_cc(val):
    # for whole-number floats such as 3.0 or 24000.0, len(str(val)) - 2
    # is the number of digits before the decimal point
    digits = len(str(val)) - 2
    if digits == 1:
        return val * 1000
    elif digits == 2:
        return val * 100
    elif digits == 3:
        return val * 10
    elif digits == 5:
        return val / 10
    elif digits == 6:
        return val / 100
    else:
        return val

# apply the 'convert_to_cc' function to the 'Engine Size' column
df['Engine Size'] = df['Engine Size'].apply(convert_to_cc)
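The conversion above relies on the length of each value's string form, so it assumes whole-number floats such as 3.0 or 24000.0. As an alternative sketch, and not the method used in the notebook, the digit count can be taken from the base-10 logarithm instead. The helper name scale_to_cc is hypothetical, and the sketch keeps the same assumption as the original: every valid engine size has four digits before the decimal point when expressed in cc.

```python
import math

def scale_to_cc(val):
    """Shift val by powers of ten so it has four digits before the decimal point."""
    # Leave missing or non-positive values untouched
    if val is None or math.isnan(val) or val <= 0:
        return val
    digits = math.floor(math.log10(val)) + 1  # digits before the decimal point
    shift = 4 - digits
    if shift >= 0:
        return val * 10 ** shift
    return val / 10 ** (-shift)

# Illustrative inputs: liters, mis-scaled cc values, and an already valid value
for raw in [3.0, 2.5, 24000.0, 371000.0, 2000.0]:
    print(raw, "->", scale_to_cc(raw))
```

Because it does not depend on how the number is printed, this version also handles values with more decimal places, such as 1.25.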
#checking for null values
df.isnull().sum()
Unnamed: 0 0
Make 0
Year of manufacture 478
Condition 479
Mileage 71
Engine Size 511
Fuel 488
Transmission 20
Price 0
Build 2968
dtype: int64
Removing null values in the dataset
#drop na
df = df.dropna()
df.isnull().sum()
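Because dropna() with no arguments removes every row that has a missing value in any column, a sparsely populated column such as Build can remove most of the data. The sketch below uses a small made-up frame to show the effect, along with one possible alternative: restricting the check to chosen columns with the subset argument. Which columns to require is an assumption of this sketch, not a choice made in the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Lexus", "Acura"],
    "Mileage": [120000.0, np.nan, 98000.0, 236451.0],
    "Build": ["SUV", np.nan, np.nan, np.nan],
})

all_cols = toy.dropna()                     # drops rows missing any value
some_cols = toy.dropna(subset=["Mileage"])  # drops rows missing Mileage only

print(len(toy), len(all_cols), len(some_cols))
```

Comparing df.shape before and after the call is a simple way to see how many rows a given strategy keeps.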
For more on the data preprocessing, view the original notebook here
The cleaned dataset will be used for the prediction. Download the cleaned dataset.
Loading Practice Dataset
#loading dataset
df = pd.read_csv('https://raw.githubusercontent.com/AnyigorTobias/Car_price_prediction_with_xgboost/main/Cleaned_Car_dataset.csv')
Y = df["Price"] #target
del df["Price"] #features
Preprocessing the data using DictVectorizer.
The DictVectorizer transforms lists of feature dictionaries into numeric vectors. The dataset contains categorical variables, which must be encoded as numbers before most estimators can use them. DictVectorizer one-hot encodes string-valued features and passes numeric features through unchanged. Refer to the sklearn documentation on DictVectorizer.
from sklearn.feature_extraction import DictVectorizer
#transform to dictionary
data_dict= df.to_dict(orient = "records")
dv = DictVectorizer(sparse=False)
model_data = dv.fit_transform(data_dict)
Splitting dataset
from sklearn.model_selection import train_test_split
#train test split on the vectorized features produced by DictVectorizer
X_train,X_test,Y_train,Y_test=train_test_split(model_data,Y,test_size=0.30, random_state = 42)
print("train data length:",len(X_train))
print("test data length:",len(X_test))
Defining the XGBRegressor model
import xgboost as xgb
# Create an instance of XGBRegressor
xgb_model = xgb.XGBRegressor()
To view the parameters of the XGBRegressor:
for parameter in xgb_model.get_params():
    print(parameter)
Here is the output
objective
base_score
booster
callbacks
colsample_bylevel
colsample_bynode
colsample_bytree
early_stopping_rounds
enable_categorical
eval_metric
feature_types
gamma
gpu_id
grow_policy
importance_type
interaction_constraints
learning_rate
max_bin
max_cat_threshold
max_cat_to_onehot
max_delta_step
max_depth
max_leaves
min_child_weight
missing
monotone_constraints
n_estimators
n_jobs
num_parallel_tree
predictor
random_state
reg_alpha
reg_lambda
sampling_method
scale_pos_weight
subsample
tree_method
validate_parameters
verbosity
To learn more about what each parameter signifies, read the documentation on XGBRegressor.
Hyperparameter Tuning
Here is a list of parameters to be tuned and their ranges:
# Define the parameter space
param_space = {
'learning_rate': [0.01, 0.1, 0.3],
'max_depth': [3, 5, 7],
'n_estimators': [50, 100, 200],
'gamma': [0, 0.1, 0.3],
'subsample': [0.5, 0.7, 1.0],
'colsample_bytree': [0.5, 0.7, 1.0],
'reg_lambda': [0.1, 1.0, 10.0],
'reg_alpha': [0, 0.1, 1.0]
}
The parameters and parameter values were chosen based on the guidance given in the documentation of XGBRegressor.
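RandomizedSearchCV also accepts distributions in place of lists: any object with an rvs method, such as those in scipy.stats, can be used as a value in param_distributions. The sketch below is an optional variation and not the search space used in this article. The ranges are illustrative assumptions, and ParameterSampler is used only to preview the kind of candidates such a space would generate.

```python
from scipy.stats import loguniform, randint, uniform
from sklearn.model_selection import ParameterSampler

# Illustrative distributions (assumptions, not tuned recommendations)
param_distributions = {
    "learning_rate": loguniform(0.01, 0.3),  # log-uniform on [0.01, 0.3]
    "max_depth": randint(3, 8),              # integers 3 to 7
    "subsample": uniform(0.5, 0.5),          # uniform on [0.5, 1.0]
}

# Draw three candidate settings from the distributions
samples = list(ParameterSampler(param_distributions, n_iter=3, random_state=42))
for candidate in samples:
    print(candidate)
```

With distributions, the search is not limited to a handful of preset values, which can help when a good learning rate lies between the listed ones.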
#model parameter selection
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb
# Define the parameter space
param_space = {
'learning_rate': [0.01, 0.1, 0.3],
'max_depth': [3, 5, 7],
'n_estimators': [50, 100, 200],
'gamma': [0, 0.1, 0.3],
'subsample': [0.5, 0.7, 1.0],
'colsample_bytree': [0.5, 0.7, 1.0],
'reg_lambda': [0.1, 1.0, 10.0],
'reg_alpha': [0, 0.1, 1.0]
}
# Create an instance of XGBRegressor
xgb_model = xgb.XGBRegressor()
# Create an instance of RandomizedSearchCV
random_search = RandomizedSearchCV(
estimator=xgb_model,
param_distributions=param_space,
n_iter=50, # Number of random samples to test
scoring='neg_mean_squared_error',
n_jobs=-1, # Use all available cores
cv=5, # 5-fold cross-validation
verbose=3, # Print progress
random_state=42
)
# Train the model using RandomizedSearchCV
random_search.fit(X_train, Y_train)
# Print the best hyperparameters
print(random_search.best_params_)
n_iter = 50. This means that 50 random combinations of the hyperparameters will be sampled and evaluated.
n_jobs = -1. This ensures that all available CPU cores are used during the computation.
cv = 5. With 5-fold cross-validation, the training data is divided into 5 subsets. Each candidate is trained 5 times, each time on 4 of the subsets, and validated on the remaining one.
random_state = 42. This fixes the random seed so that the same combinations are sampled on every run.
verbose = 3. This prints progress messages during the search.
Here is the output after running the code.
Fitting 5 folds for each of 50 candidates, totaling 250 fits
{'subsample': 1.0, 'reg_lambda': 0.1, 'reg_alpha': 1.0, 'n_estimators': 200, 'max_depth': 3, 'learning_rate': 0.3, 'gamma': 0, 'colsample_bytree': 0.7}
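The "250 fits" in the output is n_iter multiplied by cv (50 × 5). The sketch below shows how the score of the winning candidate can be read back after a search. It uses synthetic data and a small decision tree as stand-ins so that it runs quickly; neither is part of the original example. Because the scoring is neg_mean_squared_error, best_score_ is a negated MSE and its sign has to be flipped.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data and a small estimator keep the sketch fast (both are stand-ins)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

search = RandomizedSearchCV(
    estimator=DecisionTreeRegressor(random_state=42),
    param_distributions={"max_depth": [2, 3, 5, 7], "min_samples_leaf": [1, 5, 10]},
    n_iter=6,
    scoring="neg_mean_squared_error",
    cv=5,
    random_state=42,
)
search.fit(X, y)

# best_score_ is the mean negated MSE across the folds, so flip the sign
best_mse = -search.best_score_
print("best params:", search.best_params_)
print("RMSE from the mean cross-validated MSE:", np.sqrt(best_mse))
print("candidates evaluated:", len(search.cv_results_["params"]))
```

The cv_results_ attribute holds the scores of every sampled candidate, which is useful for checking whether the best one is clearly ahead or only marginally better than the rest.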
Training the model
#model training
from xgboost import XGBRegressor
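# initialize the model with the hyperparameters from the search
# (n_estimators is set to 50 here, while the search above reported 200)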
model = XGBRegressor(n_estimators=50, reg_alpha = 1.0,
reg_lambda = 0.1,subsample = 1.0,
learning_rate=0.3, max_depth=3, gamma= 0,
colsample_bytree=0.7)
# train the model on the training set
model.fit(X_train, Y_train)
Evaluating model performance based on R^2 score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
# Predict on the test data
y_pred = model.predict(X_test)
# Calculate the mean squared error
mse = mean_squared_error(Y_test, y_pred)
print("Mean Squared Error:", mse)
# calculate the RMSE
rmse = np.sqrt(mse)
print("RMSE:", rmse)
# calculate the R^2 score
r2 = r2_score(Y_test, y_pred)
print('R^2 score:', r2)
Mean Squared Error: 3964045308043.7437
RMSE: 1990991.0366558016
R^2 score: 0.831779765151191
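For readers who want to see what these three numbers measure, the sketch below computes them by hand on four made-up values using only the standard library. The numbers are illustrative and unrelated to the car dataset.

```python
import math

# Made-up targets and predictions for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.0, 5.0, 8.0, 10.0]

n = len(y_true)
# Sum of squared residuals between targets and predictions
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
# Total sum of squares around the mean of the targets
mean_true = sum(y_true) / n
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

mse = ss_res / n            # mean squared error
rmse = math.sqrt(mse)       # same units as the target
r2 = 1 - ss_res / ss_tot    # share of variance explained

print("MSE:", mse)
print("RMSE:", rmse)
print("R^2:", r2)
```

RMSE is expressed in the same units as Price, so the value reported above means that predictions are off by roughly 1.99 million on a typical car, while the R^2 score says the model explains about 83% of the variance in the test prices.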
V. Conclusion
As the example shows, randomized search offers a systematic way to choose hyperparameters; here the tuned model reached an R^2 score of about 0.83 on the test set. Here is a quick recap of how to use randomized search:
Understand the parameters of the model you want to use.
Select the parameters that best suit your data (read the documentation of your preferred model to know more about this).
Select suitable values for the RandomizedSearchCV arguments such as n_iter, n_jobs, cv, and scoring (read the documentation of RandomizedSearchCV).
Instantiate the model, pass it to the randomized search, and fit the search on the training data.
Note that once the RandomizedSearchCV is fitted, it can be used like any other predictor by calling 'predict', 'score', and other methods of the model used within the RandomizedSearchCV. These calls are delegated to the best estimator found, which by default is refit on the whole training set.
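The note above can be checked with a minimal sketch. It uses synthetic data and a Ridge model as stand-ins so that it runs quickly; both are assumptions of this sketch and not part of the car price example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic regression data (a stand-in for a real dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

search = RandomizedSearchCV(
    Ridge(),
    param_distributions={"alpha": [0.01, 0.1, 1.0, 10.0]},
    n_iter=4,
    cv=5,
    random_state=0,
)
search.fit(X_tr, y_tr)

# With the default refit=True, predict() is delegated to best_estimator_
direct = search.predict(X_te)
via_best = search.best_estimator_.predict(X_te)
print(np.allclose(direct, via_best))
```

In this article's code, the same idea means that random_search.predict or random_search.best_estimator_ could be used directly, which avoids retyping the best hyperparameters into a new XGBRegressor.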
VI. References
Here is a list of helpful articles for further reading:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
https://inria.github.io/scikit-learn-mooc/python_scripts/parameter_tuning_randomized_search.html
https://inria.github.io/scikit-learn-mooc/python_scripts/metrics_regression.html
https://snyk.io/advisor/python/xgboost/functions/xgboost.XGBRegressor
Written by Tobias Anyigor