Understanding XGBoost Algorithm

Aditya Jaiswal

XGBoost (Extreme Gradient Boosting) is one of the most powerful and widely used techniques in the data science community. It dominates competitive machine learning platforms like Kaggle and is used extensively in industrial applications ranging from healthcare to finance. The algorithm is an ensemble technique that implements gradient boosting with advanced optimizations.

Specialty of XGBoost

Several key advantages make it stand out from other algorithms:

  1. High performance - The algorithm often delivers strong accuracy across a wide range of datasets, particularly structured (tabular) data.

  2. Speed - The algorithm is optimized for efficient computation with parallel processing.

  3. Flexibility - It handles regression, classification, and ranking problems.

  4. Regularization - XGBoost has built-in mechanisms to prevent overfitting.

  5. Feature Importance - XGBoost gives insight into which features drive the predictions.

Mathematics behind XGBoost

XGBoost combines the predictions of multiple decision trees sequentially. The final prediction is the sum of the predictions from all trees:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$

Where:

  1. $\hat{y}_i$ = Final prediction for sample i

  2. K = Total number of trees

  3. $f_k(x_i)$ = Prediction of the kth tree for sample i

The Process

  1. Initial Prediction - The first prediction is usually the average of all the target values.

  2. Residual Calculation - Residuals (errors) are the differences between the actual values and the predicted values.

    Residuals = Actual Value - Predicted Value

  3. Build New Tree - A new decision tree is built to predict these residuals instead of the target value.

  4. Update Predictions - The predictions of the new tree are added to the existing predictions. The new tree's contribution is scaled by a learning rate, which helps prevent overfitting.

    New Prediction = Old Prediction + (Learning Rate × New Tree Prediction)

  5. Repeat - The process continues until the specified number of trees is reached or the model stops improving.
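
The five steps above can be sketched in plain Python. This is a minimal illustration of gradient boosting with squared-error loss, not XGBoost's actual implementation: it uses one-split trees (stumps) on a single feature and leaves out regularization and second-order information. The function names `fit_stump` and `boost` and the toy data are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # the largest value cannot split
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_trees=50, learning_rate=0.1):
    # Step 1: the initial prediction is the mean of the targets
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    trees = []
    for _ in range(n_trees):  # Step 5: repeat for a fixed number of trees
        # Step 2: residuals = actual - predicted
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        # Step 3: fit a new tree to the residuals
        tree = fit_stump(x, residuals)
        trees.append(tree)
        # Step 4: new prediction = old prediction + learning rate * tree output
        predictions = [pi + learning_rate * tree(xi)
                       for xi, pi in zip(x, predictions)]
    return base, trees, predictions


# Toy data (made up for illustration): area in square feet vs. price in $1000s
area = [800, 1000, 1200, 1500, 1800, 2200]
price = [150, 180, 210, 260, 300, 370]

base, trees, fitted = boost(area, price)
mse_start = sum((p - base) ** 2 for p in price) / len(price)
mse_end = sum((p - f) ** 2 for p, f in zip(price, fitted)) / len(price)
print(f"MSE with mean only: {mse_start:.1f}")
print(f"MSE after boosting: {mse_end:.1f}")
```

Each round only moves the predictions a fraction of the way toward the residuals, which is why a smaller learning rate generally needs more trees to reach the same fit.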

Objective Function

XGBoost minimizes an objective function that consists of two main parts:

  1. Loss Function

  2. Regularization Term

Objective = Loss Function + Regularization Term

Loss Function - Measures how far the predicted values are from the actual values.

  • For Classification: Log loss (negative log-likelihood)

  • For Regression: Mean Squared Error
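
To make the two loss functions concrete, the following sketch computes each by hand for a couple of made-up predictions. The helper names and the sample values are illustrative assumptions, and the log loss shown is the binary case with labels in {0, 1}.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, p_pred):
    """Average negative log-likelihood for binary labels in {0, 1}."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, p_pred)) / len(y_true)


# Regression: errors of 1 and 2 give (1 + 4) / 2 = 2.5
reg_loss = mean_squared_error([3.0, 5.0], [2.0, 7.0])

# Classification: confident, correct probabilities give a small loss
clf_loss = log_loss([1, 0], [0.9, 0.2])

print(reg_loss, round(clf_loss, 4))
```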

Regularization - Prevents overfitting by penalizing complex trees.

$$\text{Regularization} = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

where:

  • γ = Complexity penalty for the number of leaves (T)

  • λ = L2 regularization parameter

  • $w_j$ = Weight of leaf j
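
As a worked example of the penalty, the snippet below evaluates the regularization term for a small tree, assuming the standard form γT + ½λ Σ w_j². The leaf weights and the values of γ and λ are made-up numbers chosen for illustration, and `tree_penalty` is a hypothetical helper name.

```python
def tree_penalty(leaf_weights, gamma, lam):
    """Regularization = gamma * T + 0.5 * lam * sum of squared leaf weights."""
    n_leaves = len(leaf_weights)  # T
    return gamma * n_leaves + 0.5 * lam * sum(w ** 2 for w in leaf_weights)


# A tree with three leaves (illustrative values)
weights = [2.0, -1.0, 0.5]
penalty = tree_penalty(weights, gamma=1.0, lam=1.0)
print(penalty)  # 1.0 * 3 + 0.5 * 1.0 * (4.0 + 1.0 + 0.25) = 5.625
```

Adding a leaf always costs γ, and large leaf weights are penalized quadratically, so the term favors small trees with modest outputs.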

Gradient & Hessian

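Unlike basic gradient boosting, XGBoost uses second-order derivatives of the loss function for faster convergence.

First Derivative (Gradient)

$$g_i = \frac{\partial L(y_i, \hat{y}_i)}{\partial \hat{y}_i}$$

Second Derivative (Hessian)

$$h_i = \frac{\partial^2 L(y_i, \hat{y}_i)}{\partial \hat{y}_i^2}$$

Here L is the loss function, $y_i$ is the actual value, and $\hat{y}_i$ is the current prediction for sample i. These derivatives help the algorithm find the optimal splits and leaf values more efficiently than traditional gradient boosting methods.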
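
To make the role of the two derivatives concrete, here is a small sketch for squared-error loss written as L = ½(y − ŷ)², where the gradient is ŷ − y and the Hessian is 1. It uses the leaf-weight formula w* = −G / (H + λ) and the split-gain formula from the XGBoost paper, with G and H the sums of gradients and Hessians in a leaf. The data, the candidate split, and the helper names are illustrative assumptions.

```python
def grad_hess_squared_error(y_true, y_pred):
    """Gradient and Hessian of 0.5 * (y - y_hat)^2 with respect to y_hat."""
    grads = [p - t for t, p in zip(y_true, y_pred)]
    hess = [1.0] * len(y_true)
    return grads, hess


def leaf_weight(grads, hess, lam):
    """Optimal leaf value: w* = -G / (H + lambda)."""
    return -sum(grads) / (sum(hess) + lam)


def leaf_score(grads, hess, lam):
    """Quality of a leaf: G^2 / (H + lambda)."""
    return sum(grads) ** 2 / (sum(hess) + lam)


def split_gain(grads, hess, left_idx, right_idx, lam, gamma):
    """Gain = 0.5 * (score_left + score_right - score_parent) - gamma."""
    g_left = [grads[i] for i in left_idx]
    h_left = [hess[i] for i in left_idx]
    g_right = [grads[i] for i in right_idx]
    h_right = [hess[i] for i in right_idx]
    return 0.5 * (leaf_score(g_left, h_left, lam)
                  + leaf_score(g_right, h_right, lam)
                  - leaf_score(grads, hess, lam)) - gamma


# Four samples, all currently predicted as 245 (illustrative values)
y_true = [150.0, 180.0, 300.0, 370.0]
y_pred = [245.0] * 4
grads, hess = grad_hess_squared_error(y_true, y_pred)

# Candidate split: the two cheaper houses go left, the two pricier go right
gain = split_gain(grads, hess, [0, 1], [2, 3], lam=1.0, gamma=0.0)
w_left = leaf_weight(grads[:2], hess[:2], lam=1.0)
w_right = leaf_weight(grads[2:], hess[2:], lam=1.0)
print(f"gain={gain:.2f}, left leaf={w_left:.2f}, right leaf={w_right:.2f}")
```

A split is worth keeping only when its gain is positive, which is how γ acts as a pruning threshold, while λ shrinks the leaf values toward zero.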

Real World Application Example: House Price Prediction

XGBoost House Price Prediction Example: Visual walkthrough of the algorithm process.

Dataset Overview

The synthetic dataset includes features that realistically affect house prices:

| Feature | Description | Impact on Price |
| --- | --- | --- |
| Area | House size in square feet | Strong positive correlation (0.748) |
| Location Premium | Premium location indicator | Significant premium (+$50,000) |
| School Rating | Local school rating (1-10) | Moderate positive impact |
| Bathrooms | Number of bathrooms | Positive correlation |
| Bedrooms | Number of bedrooms | Positive correlation |
| Garage | Number of garage spaces | Added value per space |
| Age | Age of house in years | Negative correlation (depreciation) |

House Price Dataset Analysis: Correlation heatmap, feature distributions, and area vs price relationship.

Why use XGBoost in this problem?

  1. Non-Linear Relationships - House prices don’t follow simple linear patterns.

  2. Feature Interactions - Location and area might interact to determine the price.

  3. Mixed Data Types - The dataset contains both continuous and categorical features.

  4. Robustness - Real estate data often contains noise and outliers.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import xgboost as xgb

# Load the house price dataset
data = pd.read_csv('house_price_data.csv')

# Separate features and target
X = data.drop('price', axis=1)
y = data['price']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create XGBoost regressor with common starting parameters (not yet tuned)
model = xgb.XGBRegressor(
    objective='reg:squarederror',  # For regression tasks
    n_estimators=100,              # Number of trees
    max_depth=6,                   # Maximum tree depth
    learning_rate=0.1,             # Step size shrinkage
    subsample=0.8,                 # Fraction of samples used
    colsample_bytree=0.8,          # Fraction of features used
    random_state=42
)

# Fit the model to training data
model.fit(X_train, y_train)

# Make predictions on test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error: ${mae:,.2f}")
print(f"Mean Squared Error: ${mse:,.2f}")
print(f"R² Score: {r2:.3f}")

# Feature importance analysis
importance_scores = model.feature_importances_
feature_names = X.columns

importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importance_scores
}).sort_values('importance', ascending=False)

print("\nFeature Importance Ranking:")
for idx, row in importance_df.iterrows():
    print(f"{row['feature']}: {row['importance']:.3f}")

Output:

Mean Absolute Error: $19,480.69
Mean Squared Error: $600,705,017.13
R² Score: 0.821

Feature Importance Ranking:
location_premium: 0.434
area: 0.253
garage: 0.088
bathrooms: 0.066
age: 0.059
bedrooms: 0.052
school_rating: 0.048
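
Note that mean squared error is expressed in squared dollars, so it is not directly comparable to the MAE. Taking its square root gives the root mean squared error (RMSE), which is back in dollars. A minimal sketch using the MSE value reported above:

```python
import math

mse = 600_705_017.13   # Mean Squared Error from the output above, in dollars squared
rmse = math.sqrt(mse)  # back in dollars, comparable to the MAE
print(f"Root Mean Squared Error: ${rmse:,.2f}")
```

The RMSE is larger than the MAE here, which is expected because squaring weights large errors more heavily.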

XGBoost HyperParameters

These hyperparameters typically have the largest effect on XGBoost performance:

| Parameter | Description | Typical Range | Impact |
| --- | --- | --- | --- |
| max_depth | Maximum depth of trees | 3-10 | Controls model complexity |
| learning_rate | Step size shrinkage | 0.01-0.3 | Affects convergence speed |
| n_estimators | Number of boosting rounds | 100-1000 | More trees = better fit |
| subsample | Fraction of samples per tree | 0.6-1.0 | Prevents overfitting |
| colsample_bytree | Fraction of features per tree | 0.6-1.0 | Adds randomness |

HyperParameter Tuning Example

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5, 6],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200]
}

# Perform grid search
grid_search = GridSearchCV(
    xgb.XGBRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='neg_mean_absolute_error',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)

Output:

Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200}
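
Grid search cost grows multiplicatively with each parameter added. The short sketch below counts the work implied by the grid above, using only the standard library. The variable names are illustrative.

```python
from itertools import product

param_grid = {
    'max_depth': [3, 4, 5, 6],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200],
}
cv_folds = 5

# Every combination of values is evaluated once per cross-validation fold
combinations = list(product(*param_grid.values()))
total_fits = len(combinations) * cv_folds
print(len(combinations), total_fits)  # 36 180
```

With 36 combinations and 5 folds, the search trains 180 models during cross-validation, so widening the grid is worth doing one parameter at a time.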

Advanced Features and Best Practices

Early Stopping for Optimal Performance

model2 = xgb.XGBRegressor(
    objective='reg:squarederror',
    n_estimators=1000,  # Set high number
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    early_stopping_rounds=10,  # Stop if the eval metric has not improved for 10 rounds
    random_state=42
)

# Fit with an evaluation set (in practice, prefer a separate validation split over the test set)
model2.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)
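
The stopping rule itself is simple to sketch. The function below is a hypothetical pure-Python illustration of patience-based early stopping, not XGBoost's internal code: it scans a list of per-round validation errors and stops once the best error has not improved for `patience` rounds. The error values are made up.

```python
def early_stopping_round(val_errors, patience):
    """Return (best_round, stopped_at) for a sequence of validation errors."""
    best_round, best_error = 0, float("inf")
    for round_idx, error in enumerate(val_errors):
        if error < best_error:
            best_round, best_error = round_idx, error
        elif round_idx - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop training
            return best_round, round_idx
    return best_round, len(val_errors) - 1  # ran through every round


# Validation error falls, then starts rising as the model overfits
errors = [30.0, 24.0, 21.0, 20.5, 20.7, 20.6, 20.9, 21.3]
best, stopped = early_stopping_round(errors, patience=3)
print(best, stopped)  # 3 6
```

After fitting with early stopping, the XGBoost model exposes the corresponding round as its `best_iteration` attribute.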

Visualization of Feature Importance

import matplotlib.pyplot as plt

xgb.plot_importance(model, max_num_features=10)
plt.tight_layout()
plt.show()

Other Real World Applications

XGBoost is used extensively across various domains:

| Domain | Applications | Benefits |
| --- | --- | --- |
| Finance | Credit risk assessment, fraud detection, stock prediction | High accuracy with complex financial data |
| E-commerce | Recommendation systems, customer churn prediction | Handles large-scale user behavior data |
| Healthcare | Disease diagnosis, patient risk prediction, drug discovery | Manages complex medical relationships |
| Marketing | Customer segmentation, conversion prediction | Effective with mixed data types |

Common Pitfalls and their Solutions

  1. Overfitting Prevention

  • Problem: Model performs well on training data but poorly on test data.

  • Solution: Reduce max_depth, increase regularization, use cross-validation.

  2. Performance Optimization

  • Slow Training: Use subsample < 1.0, increase learning_rate, enable GPU training.

  • Memory Issues: Reduce max_depth, use XGBoost's DMatrix format.
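
As a starting point, the suggestions above map to constructor arguments roughly as follows. The specific values are illustrative assumptions rather than tuned recommendations, and `device="cuda"` assumes XGBoost 2.0 or later with a GPU-enabled build.

```python
# Illustrative settings; pass as xgb.XGBRegressor(**regularized_params)
regularized_params = {
    "max_depth": 4,            # shallower trees reduce model complexity
    "reg_lambda": 2.0,         # L2 penalty on leaf weights (lambda)
    "gamma": 1.0,              # minimum loss reduction required to split
    "min_child_weight": 5,     # minimum sum of Hessians in a leaf
    "subsample": 0.8,          # row sampling per tree
    "colsample_bytree": 0.8,   # feature sampling per tree
    "tree_method": "hist",     # histogram-based split finding
}

# Optional GPU training (assumes XGBoost >= 2.0 and a CUDA build)
gpu_params = {**regularized_params, "device": "cuda"}
```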

Conclusion

XGBoost represents a significant advance in gradient boosting algorithms, combining mathematical sophistication with practical usability. Its success stems from:

  1. Mathematical Foundation: Uses second-order derivatives for faster optimization.

  2. Regularization: Built-in overfitting prevention while maintaining performance.

  3. Efficiency: Optimized implementation for large datasets.

  4. Flexibility: Adapts to various problem types and data characteristics.

  5. Interpretability: Provides feature importance and model explanation tools.
