Day 16: XGBoost (Extreme Gradient Boosting): Ultimate Deep Dive


XGBoost → a GBM but a faster version!!
XGBoost is an upgraded, battle-hardened, production-ready version of Gradient Boosting. Just imagine:
GBM is a smart but slow marathon runner.
XGBoost is that same runner after years of training, with rocket shoes, optimized diet, and laser focus.
In short: XGBoost learns from mistakes (like GBM) but 10x faster, 10x smarter, and with fewer errors.
The Need for XGBoost
No doubt GBM was strong, but it had its flaws, such as:
| Problem | Real Issue |
| --- | --- |
| Slow Training | Trees were built sequentially and inefficiently |
| Overfitting Risk | No regularization; it just kept fitting harder and harder |
| Poor Handling of Missing Data | Needed manual cleaning and filling |
So, if the dataset was huge (millions of rows, 100+ features), classic GBM would collapse:
It took too long
It overfit
It couldn’t handle sparse or missing data
There was a need for a change, and that change was XGBoost.
XGBoost was revolutionary. It fixed all the flaws we discussed above by:
| Problem | How XGBoost Solved It |
| --- | --- |
| Slow Training | Parallelize tree construction using multiple CPU cores |
| Overfitting | Add regularization (L1 + L2 penalties) |
| Missing Data | Smartly handle missing values by learning the best split direction |
That's why XGBoost = "Extreme" Gradient Boosting.
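To make that table concrete, here is a minimal sketch of how those fixes show up as XGBClassifier parameters (the values below are illustrative assumptions, not tuned recommendations):

import xgboost as xgb
from sklearn.datasets import make_classification

# Small synthetic dataset just to make the sketch runnable
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)

model = xgb.XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=5,
    n_jobs=-1,        # parallelism: use all CPU cores for tree construction
    reg_alpha=0.1,    # L1 regularization on leaf weights
    reg_lambda=1.0,   # L2 regularization on leaf weights
    gamma=0.1,        # minimum loss reduction required to make a further split
)
model.fit(X_demo, y_demo)

Missing-value handling needs no parameter at all; we will demonstrate it a bit later.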
A Practical view
We can better understand the need for XGBoost in place of GBM by looking at the code example below.
import time
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
Yeah, don’t forget to run this first if XGBoost isn’t installed yet:
pip install xgboost
We will generate a fake dataset for our demonstration.
from sklearn.datasets import make_classification
# Generate synthetic classification data
X, y = make_classification(n_samples=10000, n_features=20,
                           n_informative=15, n_redundant=5,
                           random_state=42)
# Split into train/test
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)
Now, time to begin the show. We will first train a GBM, then XGBoost, and compare their accuracy as well as training time.
# Train GBM
gbm = GradientBoostingClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=5
)
start_time = time.time()
gbm.fit(X_train, y_train)
gbm_time = time.time() - start_time
# Predict
y_pred_gbm = gbm.predict(X_val)
gbm_acc = accuracy_score(y_val, y_pred_gbm)
print(f"GBM Accuracy: {gbm_acc:.4f}")
print(f"GBM Training Time: {gbm_time:.2f} seconds")
GBM Accuracy: 0.9543
GBM Training Time: 28.75 seconds
Now time for the other one.
# Train XGBoost
xgb_model = xgb.XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    use_label_encoder=False,    # deprecated; safe to drop on recent XGBoost versions
    eval_metric='logloss'
)
start_time = time.time()
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)
xgb_time = time.time() - start_time
# Predict
y_pred_xgb = xgb_model.predict(X_val)
xgb_acc = accuracy_score(y_val, y_pred_xgb)
print(f"XGBoost Accuracy: {xgb_acc:.4f}")
print(f"XGBoost Training Time: {xgb_time:.2f} seconds")
XGBoost Accuracy: 0.9153
XGBoost Training Time: 0.61 seconds
Drastic difference, isn’t it? Well, you will notice it even better visually.
# Plotting comparison
models = ['GBM', 'XGBoost']
accuracy = [gbm_acc, xgb_acc]
training_time = [gbm_time, xgb_time]
fig, ax1 = plt.subplots(figsize=(10, 5))
color = 'tab:blue'
ax1.set_xlabel('Model')
ax1.set_ylabel('Accuracy', color=color)
ax1.bar(models, accuracy, color=color, alpha=0.6, label='Accuracy')
ax1.tick_params(axis='y', labelcolor=color)
ax2 = ax1.twinx()
color = 'tab:red'
ax2.set_ylabel('Training Time (seconds)', color=color)
ax2.plot(models, training_time, color=color, marker='o', label='Training Time')
ax2.tick_params(axis='y', labelcolor=color)
plt.title('GBM vs XGBoost: Accuracy and Training Time')
fig.tight_layout()
plt.grid(True)
plt.show()
XGBoost reaches comparable accuracy here, but its training time is dramatically lower, and that speed is one of the biggest reasons it is preferred over GBM.
Here is the final interpretation:
| Aspect | GBM | XGBoost |
| --- | --- | --- |
| Training Speed | Slower | Faster (parallelized) |
| Accuracy | Good | Comparable; often equal or slightly better with tuning |
| Overfitting | Higher risk | Controlled with regularization |
| Missing Data Handling | Manual | Automatic |
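The last row deserves a quick demonstration. Here is a minimal sketch (toy data made up for illustration) showing that XGBoost trains directly on features containing NaN, with no imputation step:

import numpy as np
import xgboost as xgb

# Toy data (made up for illustration): note the NaN entries.
X_missing = np.array([
    [1200.0, 2.0],
    [1800.0, np.nan],   # missing bedroom count
    [np.nan, 4.0],      # missing square footage
    [2500.0, 5.0],
])
y_missing = np.array([0, 0, 1, 1])

# No imputation needed: XGBoost learns a default direction for missing
# values at each split and routes NaN rows accordingly.
clf = xgb.XGBClassifier(n_estimators=50, max_depth=2, eval_metric='logloss')
clf.fit(X_missing, y_missing)
print(clf.predict(X_missing))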
Early Stopping and Learning Curves
Imagine you're preparing for an exam:
At first, studying more improves your knowledge.
But after 6–8 hours you’re tired, and studying more doesn’t help anymore.
Smart students stop early and rest; smart models do the same.
The concept of early stopping is simply this:
Stop training the model before overfitting begins.
Normally, if you keep training XGBoost:
It keeps adding more trees
But after a point, it starts fitting the noise, not the real patterns → Overfitting
So how does early stopping save you? It works by:
Watching a validation set
Stopping automatically if the validation error doesn’t improve for N rounds in a row (a conceptual sketch follows below)
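Conceptually, the mechanism looks something like this (a hand-written sketch of the idea, not XGBoost’s actual internals; the loss values are made up):

patience = 3                      # allow 3 rounds without improvement (small, just for the sketch)
best_loss = float('inf')
rounds_without_improvement = 0

# Pretend these are validation losses measured after each new tree.
validation_losses = [0.60, 0.48, 0.41, 0.40, 0.40, 0.41, 0.40]

for round_num, loss in enumerate(validation_losses):
    if loss < best_loss:
        best_loss = loss          # new best score: reset the counter
        rounds_without_improvement = 0
    else:
        rounds_without_improvement += 1
    if rounds_without_improvement >= patience:
        print(f"Stopping at round {round_num}, best loss was {best_loss}")
        break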
Now, what is a learning curve? A learning curve typically plots:
X-axis = Number of trees (iterations)
Y-axis = Error (loss) on training set and validation set
It shows how your model is learning:
If both errors are decreasing → Great! Keep going.
If training error decreases but validation error increases → Overfitting!
We will get a better idea of both concepts by looking at the practical example below.
First, we will set up the necessary imports and data:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
# Sample dataset
df = pd.DataFrame({
    'Square_Feet': [1000, 1500, 2000, 1200, 2500, 1800],
    'Bedrooms': [2, 3, 4, 2, 5, 3],
    'Location_Score': [3, 4, 5, 3, 5, 4],
    'High_Price': [0, 0, 1, 0, 1, 1]
})
X = df[['Square_Feet', 'Bedrooms', 'Location_Score']]
y = df['High_Price']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)
print(df)
Square_Feet Bedrooms Location_Score High_Price
0 1000 2 3 0
1 1500 3 4 0
2 2000 4 5 1
3 1200 2 3 0
4 2500 5 5 1
5 1800 3 4 1
Now here we will train the model with early stopping enabled:
model = xgb.XGBClassifier(
    n_estimators=1000,            # deliberately large; early stopping will cut this short
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,
    colsample_bytree=0.8,
    use_label_encoder=False,      # deprecated; safe to drop on recent XGBoost versions
    eval_metric='logloss',
    early_stopping_rounds=10      # stop if validation loss doesn't improve for 10 rounds
)
# Train with early stopping. In recent XGBoost versions, early_stopping_rounds
# is set on the estimator (as above) rather than passed to fit().
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=True
)
[0] validation_0-logloss:1.38629
[1] validation_0-logloss:1.38629
[2] validation_0-logloss:1.38629
[3] validation_0-logloss:1.38629
[4] validation_0-logloss:1.38629
[5] validation_0-logloss:1.38629
[6] validation_0-logloss:1.38629
[7] validation_0-logloss:1.38629
[8] validation_0-logloss:1.38629
[9] validation_0-logloss:1.38629
At this point, XGBoost keeps adding tree after tree. But after 10 rounds in a row with no improvement on the validation set, it stops. This saves a great amount of time and also avoids overfitting; otherwise it would have built all 1000 trees!!
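Once training has stopped, you can check where it stopped. In recent XGBoost versions the fitted scikit-learn wrapper exposes these attributes whenever early stopping is enabled (a small sketch, reusing the model from above):

# Where did training actually stop, and how good was the best round?
print("Best iteration:", model.best_iteration)
print("Best validation score:", model.best_score)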
Here is how our learning curve looks for the example above:
results = model.evals_result()
# Plot
plt.figure(figsize=(10,6))
epochs = len(results['validation_0']['logloss'])
x_axis = range(0, epochs)
plt.plot(x_axis, results['validation_0']['logloss'], label='Validation Log Loss')
plt.xlabel('Number of Trees')
plt.ylabel('Log Loss')
plt.title('XGBoost Validation Error Over Trees')
plt.legend()
plt.grid(True)
plt.show()
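Note that we only plotted the validation loss above. To also see the training curve described earlier (and spot the divergence that signals overfitting), you can pass the training set into eval_set as well. Here is a minimal sketch reusing the variables defined above; with multiple eval sets, XGBoost uses the last one for early stopping:

# Track both training and validation loss (reuses X_train, y_train, X_val, y_val).
model2 = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.1,
    max_depth=3,
    eval_metric='logloss',
    early_stopping_rounds=10,
)
model2.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_val, y_val)],  # index 0 = train, 1 = validation
    verbose=False,
)

results2 = model2.evals_result()
plt.figure(figsize=(10, 6))
plt.plot(results2['validation_0']['logloss'], label='Training Log Loss')
plt.plot(results2['validation_1']['logloss'], label='Validation Log Loss')
plt.xlabel('Number of Trees')
plt.ylabel('Log Loss')
plt.title('Training vs Validation Loss')
plt.legend()
plt.grid(True)
plt.show()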
Uh! You may object that this doesn’t look much like a curve, right? Well, it isn’t necessarily bent every time; here is a detailed interpretation:
| Behavior | What it Means |
| --- | --- |
| Loss decreases steadily | Model is learning well |
| Loss flattens | Model stopped learning; better to stop |
| Loss increases | Overfitting danger! |
In our case, the loss has flattened, so it is better to stop. Early stopping saves a huge amount of training time (instead of blindly training 500–1000 trees). Always remember: in production settings, a "good enough" model beats an overtrained one.
Wrap Up
Okay, it is time to wrap things up. We have covered the use cases of XGBoost and why it is known as an advanced version of Gradient Boosting. Here are a few key insights, tabulated:
| Topic | Key Insights |
| --- | --- |
| Gradient Boosting | An ensemble method that builds weak learners (usually decision trees) sequentially, correcting previous errors using gradients of a loss function. |
| Loss Functions | You explored MSE, MAE, and Log Loss, learned how each one behaves, and saw gradient visualizations to understand how they guide learning. |
| Why Gradient Boosting Works | It focuses on "what went wrong" in earlier trees and makes future trees better, like a teacher correcting homework line by line. |
| XGBoost | An optimized, scalable, regularized version of Gradient Boosting; much faster, more accurate, and more tunable. |
Simply put:
Gradient = direction to minimize error
Boosting = build trees that follow this gradient
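In standard gradient boosting notation (a generic textbook formulation, not code from this post), those two lines become:

$$
r_i^{(m)} = -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F = F_{m-1}}
\qquad\qquad
F_m(x) = F_{m-1}(x) + \nu \, h_m(x)
$$

Each new tree $h_m$ is fitted to the negative-gradient "residuals" $r^{(m)}$, and the learning rate $\nu$ scales its contribution.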
Ahh, I almost forgot about a bonus. From the following days onward, we will also focus on being interview ready, so at the end of each day expect a bonus like this:
| Question | What to Focus On |
| --- | --- |
| What is Gradient Boosting? | Talk about sequential learning and residual correction |
| How does it differ from Random Forest? | Sequential vs parallel trees |
| What is the role of learning rate? | Smaller rate = slower but more accurate |
| Why is XGBoost so powerful? | Regularization, speed, missing value handling, built-in CV |
| How does early stopping work? | Validation-based halting of training to prevent overfitting |
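The table above mentions XGBoost’s built-in cross-validation. As a small parting sketch, here is one way to use the native xgb.cv helper (fresh synthetic data and illustrative parameter values, so it stands on its own):

import xgboost as xgb
from sklearn.datasets import make_classification

# Fresh synthetic data so this snippet is self-contained.
X_cv, y_cv = make_classification(n_samples=2000, n_features=20, random_state=42)
dtrain = xgb.DMatrix(X_cv, label=y_cv)

params = {'objective': 'binary:logistic', 'eval_metric': 'logloss',
          'max_depth': 5, 'eta': 0.1}

cv_results = xgb.cv(
    params, dtrain,
    num_boost_round=300,
    nfold=5,
    early_stopping_rounds=10,   # stop when the mean validation loss stalls
    seed=42,
)
print(cv_results.tail())        # per-round train/test logloss (mean and std)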
Ciao!!