Day 16: XGBoost (Extreme Gradient Boosting) : Ultimate Deep Dive

Saket Khopkar

XGBoost → the same idea as GBM, but a much faster version!!

XGBoost is an upgraded, battle-hardened, production-ready version of Gradient Boosting. Just imagine:

  • GBM is a smart but slow marathon runner.

  • XGBoost is that same runner after years of training, with rocket shoes, optimized diet, and laser focus.

In short: XGBoost learns from mistakes (like GBM) but much faster, with smarter optimizations, and with fewer errors.


The Need for XGBoost

No doubt GBM was strong, but it had its flaws, such as:

| Problem | Real Issue |
| --- | --- |
| Slow Training | Trees were built sequentially and inefficiently |
| Overfitting Risk | No regularization; it just kept fitting harder and harder |
| Poor Handling of Missing Data | Needed manual cleaning and filling |

So, if the dataset was huge (millions of rows, 100+ features), classic GBM would collapse:

  • It took too long

  • It overfit

  • It couldn’t handle sparse or missing data

There was a need for a change, and that change was XGBoost.

💡 A large share of winning Kaggle solutions between 2015 and 2020 used XGBoost.

XGBoost was revolutionary. It fixed all the flaws we discussed above by:

| Problem | How XGBoost Solved It |
| --- | --- |
| Slow Training | Parallelized tree construction across multiple CPU cores |
| Overfitting | Added regularization (L1 + L2 penalties) |
| Missing Data | Handles missing values natively by learning the best split direction |

That's why XGBoost = "Extreme" Gradient Boosting.
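To make the table above concrete, here is a minimal sketch (not from the original article) of the knobs that map to each fix: n_jobs for parallel tree construction, reg_alpha/reg_lambda for L1/L2 regularization, and gamma for split pruning. The values are illustrative, not tuned recommendations.

import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

model = xgb.XGBClassifier(
    n_estimators=100,
    n_jobs=-1,         # parallel tree construction across all available CPU cores
    reg_alpha=0.1,     # L1 regularization on leaf weights
    reg_lambda=1.0,    # L2 regularization on leaf weights
    gamma=0.1,         # minimum loss reduction required to make a further split
    eval_metric='logloss'
)
model.fit(X, y)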


A Practical view

We will get a better sense of why XGBoost is preferred over GBM by looking at the code example below.

import time
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb

Yeah, don’t forget this line before:

pip install xgboost

We will generate a synthetic dataset for our demonstration.

from sklearn.datasets import make_classification

# Generate synthetic classification data
X, y = make_classification(n_samples=10000, n_features=20, 
                            n_informative=15, n_redundant=5, 
                            random_state=42)

# Split into train/test
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

Now, time to begin the show. We will first train a GBM model, then an XGBoost model, and compare their accuracy and training time.

# Train GBM
gbm = GradientBoostingClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=5
)

start_time = time.time()
gbm.fit(X_train, y_train)
gbm_time = time.time() - start_time

# Predict
y_pred_gbm = gbm.predict(X_val)
gbm_acc = accuracy_score(y_val, y_pred_gbm)

print(f"GBM Accuracy: {gbm_acc:.4f}")
print(f"GBM Training Time: {gbm_time:.2f} seconds")
GBM Accuracy: 0.9543
GBM Training Time: 28.75 seconds

Now time for the other one.

# Train XGBoost
xgb_model = xgb.XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss'   # use_label_encoder is deprecated/removed in recent xgboost versions, so it is omitted here
)

start_time = time.time()
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)
xgb_time = time.time() - start_time

# Predict
y_pred_xgb = xgb_model.predict(X_val)
xgb_acc = accuracy_score(y_val, y_pred_xgb)

print(f"XGBoost Accuracy: {xgb_acc:.4f}")
print(f"XGBoost Training Time: {xgb_time:.2f} seconds")
XGBoost Accuracy: 0.9153
XGBoost Training Time: 0.61 seconds

Drastic difference, isn’t it? Well, you will notice it even better visually.

# Plotting comparison
models = ['GBM', 'XGBoost']
accuracy = [gbm_acc, xgb_acc]
training_time = [gbm_time, xgb_time]

fig, ax1 = plt.subplots(figsize=(10, 5))

color = 'tab:blue'
ax1.set_xlabel('Model')
ax1.set_ylabel('Accuracy', color=color)
ax1.bar(models, accuracy, color=color, alpha=0.6, label='Accuracy')
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx()
color = 'tab:red'
ax2.set_ylabel('Training Time (seconds)', color=color)
ax2.plot(models, training_time, color=color, marker='o', label='Training Time')
ax2.tick_params(axis='y', labelcolor=color)

plt.title('GBM vs XGBoost: Accuracy and Training Time')
fig.tight_layout()
plt.grid(True)
plt.show()

Even though XGBoost’s accuracy is a bit lower in this particular run (likely because of the subsample and colsample_bytree settings, which the GBM was not using), its training time is drastically lower; that speed is one of the biggest reasons this algorithm is preferred over GBM.

Here is the final interpretation:

| Aspect | GBM | XGBoost |
| --- | --- | --- |
| Training Speed | Slower | Faster (parallelized) |
| Accuracy | Good | Equal or slightly better |
| Overfitting | Higher risk | Controlled with regularization |
| Missing Data Handling | Manual | Automatic (demonstrated below) |

💡 XGBoost was invented because GBM was too slow and fragile for modern large datasets.
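That last row deserves a quick demonstration. Below is a minimal sketch (not the article's original code) showing that XGBoost accepts NaN values directly and learns a default split direction for them, with no manual imputation:

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic data with ~10% of the values knocked out to simulate missing entries
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.1] = np.nan

# No imputation step: XGBoost learns a default direction for missing values at each split
model = xgb.XGBClassifier(n_estimators=100, eval_metric='logloss')
model.fit(X, y)
print("Trained directly on data containing NaN values")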

Early Stopping and Learning Curves

Imagine you're preparing for an exam:

  • At first, studying more improves your knowledge.

  • But after 6–8 hours, you’re tired and studying more doesn’t help anymore.

Smart students stop early and rest; the same goes for smart models.

If we put the concept of early stopping in one line, it’s simply:

Stop training the model before overfitting begins.

Normally, if you keep training XGBoost:

  • It keeps adding more trees

  • But after a point, it starts fitting the noise, not the real patterns → Overfitting

How does early stopping save you from this? By:

  • Watching a validation set

  • If the validation error doesn’t improve after N rounds → it stops automatically

Let’s have a look at what a learning curve is. A learning curve typically plots:

  • X-axis = Number of trees (iterations)

  • Y-axis = Error (loss) on training set and validation set

It shows how your model is learning:

  • If both errors are decreasing → Great! Keep going.

  • If training error decreases but validation error increases → Overfitting!

We will get a better idea of this by looking at a practical example.

First, we will set up the necessary things:

import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Sample dataset
df = pd.DataFrame({
    'Square_Feet': [1000, 1500, 2000, 1200, 2500, 1800],
    'Bedrooms': [2, 3, 4, 2, 5, 3],
    'Location_Score': [3, 4, 5, 3, 5, 4],
    'High_Price': [0, 0, 1, 0, 1, 1]
})

X = df[['Square_Feet', 'Bedrooms', 'Location_Score']]
y = df['High_Price']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)
print(df)
   Square_Feet  Bedrooms  Location_Score  High_Price
0         1000         2               3           0
1         1500         3               4           0
2         2000         4               5           1
3         1200         2               3           0
4         2500         5               5           1
5         1800         3               4           1

Now we will train the model with early stopping enabled:

model = xgb.XGBClassifier(
    n_estimators=1000,         # Very large number on purpose
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss',
    early_stopping_rounds=10   # Stop if no improvement for 10 rounds
)

# Train with early stopping
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],   # early stopping is judged on this validation set
    # Note: in recent xgboost versions, early_stopping_rounds must be passed to the
    # constructor as above; passing it to fit() no longer works.
    verbose=True
)
[0]    validation_0-logloss:1.38629
[1]    validation_0-logloss:1.38629
[2]    validation_0-logloss:1.38629
[3]    validation_0-logloss:1.38629
[4]    validation_0-logloss:1.38629
[5]    validation_0-logloss:1.38629
[6]    validation_0-logloss:1.38629
[7]    validation_0-logloss:1.38629
[8]    validation_0-logloss:1.38629
[9]    validation_0-logloss:1.38629

At this point, XGBoost is adding tree after tree. But after 10 rounds in a row with no improvement on the validation set, it stops automatically. This saves a great amount of time and also helps avoid overfitting; otherwise it would have built all 1000 trees!
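Once training stops, you can check exactly where it stopped. A short sketch (attribute names as exposed by recent xgboost versions; not part of the original article):

# Inspect where early stopping landed (available after fitting with early_stopping_rounds)
print("Best iteration:", model.best_iteration)      # tree index with the lowest validation loss
print("Best validation score:", model.best_score)   # the corresponding logloss value

# predict() automatically uses the trees up to the best iteration
y_pred = model.predict(X_val)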

Here is how our learning curve looks for the example above:

results = model.evals_result()

# Plot
plt.figure(figsize=(10,6))
epochs = len(results['validation_0']['logloss'])
x_axis = range(0, epochs)

plt.plot(x_axis, results['validation_0']['logloss'], label='Validation Log Loss')
plt.xlabel('Number of Trees')
plt.ylabel('Log Loss')
plt.title('XGBoost Validation Error Over Trees')
plt.legend()
plt.grid(True)
plt.show()

Uh! You may point out that it does not really resemble a curve, right? Well, it is not necessarily bent every time; here is a detailed interpretation:

| Behavior | What it Means |
| --- | --- |
| Loss decreases steadily | Model is learning well |
| Loss flattens | Model has stopped learning; better to stop |
| Loss increases | Overfitting danger! |

In our case, the loss is flat, so it is better to stop. Early stopping saves huge training time (instead of blindly training 500–1000 trees). Always remember: a "good enough" model beats an overtrained model in production settings.
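The toy 6-row dataset gives an almost flat line, so the curve is not very exciting. As a sketch (my own extension, not from the original article), you can rerun the same early-stopping setup on the larger make_classification data from the GBM comparison earlier; there the validation loss typically drops steeply and then flattens near the stopping point:

# Sketch: early stopping on the larger synthetic dataset used in the GBM vs XGBoost comparison
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import xgboost as xgb

X_big, y_big = make_classification(n_samples=10000, n_features=20,
                                   n_informative=15, n_redundant=5,
                                   random_state=42)
Xb_train, Xb_val, yb_train, yb_val = train_test_split(X_big, y_big, test_size=0.3, random_state=42)

big_model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.1,
    max_depth=3,
    eval_metric='logloss',
    early_stopping_rounds=10
)
big_model.fit(Xb_train, yb_train, eval_set=[(Xb_val, yb_val)], verbose=False)
print("Stopped at iteration:", big_model.best_iteration)

# Plot the validation log loss over boosting rounds
loss = big_model.evals_result()['validation_0']['logloss']
plt.plot(range(len(loss)), loss, label='Validation Log Loss')
plt.xlabel('Number of Trees')
plt.ylabel('Log Loss')
plt.title('Early Stopping on a Larger Dataset')
plt.legend()
plt.grid(True)
plt.show()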


Wrap Up

Okay, it is time to wrap things up. We have covered the use cases of XGBoost and why it is known as the advanced version of Gradient Boosting. Here are a few key insights, tabulated:

| Topic | Key Insights |
| --- | --- |
| Gradient Boosting | An ensemble method that builds weak learners (usually decision trees) sequentially, correcting previous errors using gradients of a loss function. |
| Loss Functions | You explored MSE, MAE, and Log Loss, learned how each one behaves, and saw gradient visualizations to understand how they guide learning. |
| Why Gradient Boosting Works | It focuses on "what went wrong" in earlier trees and makes future trees better, like a teacher correcting homework line by line. |
| XGBoost | An optimized, scalable, regularized version of Gradient Boosting: much faster, more accurate, and more tunable. |

Simply put, it works like this:

  • Gradient = the direction that minimizes the error

  • Boosting = build trees that follow this gradient (see the tiny sketch below)
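For squared-error loss this connection is very concrete: the negative gradient of the loss with respect to the current prediction is exactly the residual, which is what each new tree is fitted to. A tiny illustrative sketch (not the article's code):

import numpy as np

# For squared error L = 0.5 * (y - F(x))^2, the negative gradient w.r.t. F(x) is the residual y - F(x)
y_true = np.array([3.0, 5.0, 7.0])
current_prediction = np.array([2.5, 5.5, 6.0])

negative_gradient = y_true - current_prediction
print(negative_gradient)   # [ 0.5 -0.5  1. ]

# Each boosting round fits the next tree to these residuals and nudges F(x) in that direction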

Ahh, I almost forgot about a bonus. From now on, the following days will also focus on being interview ready, so at the end of each post expect bonuses like these:

| Question | What to Focus On |
| --- | --- |
| What is Gradient Boosting? | Talk about sequential learning and residual correction |
| How does it differ from Random Forest? | Sequential vs parallel trees |
| What is the role of learning rate? | Smaller rate = slower but more accurate |
| Why is XGBoost so powerful? | Regularization, speed, missing value handling, built-in CV |
| How does early stopping work? | Validation-based halting of training to prevent overfitting |

Ciao!!
