Day 16: XGBoost (Extreme Gradient Boosting): Ultimate Deep Dive


XGBoost → a GBM but a faster version!!
XGBoost is an upgraded, battle-hardened, production-ready version of Gradient Boosting. Just imagine:
GBM is a smart but slow marathon runner.
XGBoost is that same runner after years of training, with rocket shoes, optimized diet, and laser focus.
In short: XGBoost learns from mistakes (like GBM) but 10x faster, 10x smarter, and with fewer errors.
The Need for XGBoost
No doubt GBM was strong, but it had its flaws, such as:
| Problem | Real Issue |
| --- | --- |
| Slow Training | Trees were built sequentially and inefficiently |
| Overfitting Risk | No regularization; it just kept fitting harder and harder |
| Poor Handling of Missing Data | Needed manual cleaning and filling |
So, if the dataset was huge (millions of rows, 100+ features), classic GBM would collapse:
It took too long
It overfit
It couldn’t handle sparse or missing data
There was a need for a change, and that change was XGBoost.
XGBoost was revolutionary. It fixed all the flaws we discussed above by:
| Problem | How XGBoost Solved It |
| --- | --- |
| Slow Training | Parallelize tree construction using multiple CPU cores |
| Overfitting | Add regularization (L1 + L2 penalties) |
| Missing Data | Smartly handle missing values by learning the best split direction |
That's why XGBoost = "Extreme" Gradient Boosting.
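To make that table concrete, here is a minimal sketch of how those fixes show up as XGBClassifier parameters (the values below are illustrative assumptions, not tuned recommendations):

import xgboost as xgb
from sklearn.datasets import make_classification

# Small synthetic dataset just to make the sketch runnable
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)

model = xgb.XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=5,
    n_jobs=-1,        # parallelism: use all CPU cores for tree construction
    reg_alpha=0.1,    # L1 regularization on leaf weights
    reg_lambda=1.0,   # L2 regularization on leaf weights
    gamma=0.1,        # minimum loss reduction required to make a further split
)
model.fit(X_demo, y_demo)

Missing-value handling needs no parameter at all; we will demonstrate it a bit later.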
A Practical view
We can better understand the need for XGBoost in place of GBM by looking at the code example below.
import time
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
Yeah, don’t forget to run this first if XGBoost isn’t installed yet:
pip install xgboost
We will generate a fake dataset for our demonstration.
from sklearn.datasets import make_classification
# Generate synthetic classification data
X, y = make_classification(n_samples=10000, n_features=20,
                           n_informative=15, n_redundant=5,
                           random_state=42)
# Split into train/test
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)
Now, time to begin the show. We will first train a GBM, then XGBoost, and compare their accuracy as well as training time.
# Train GBM
gbm = GradientBoostingClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=5
)
start_time = time.time()
gbm.fit(X_train, y_train)
gbm_time = time.time() - start_time
# Predict
y_pred_gbm = gbm.predict(X_val)
gbm_acc = accuracy_score(y_val, y_pred_gbm)
print(f"GBM Accuracy: {gbm_acc:.4f}")
print(f"GBM Training Time: {gbm_time:.2f} seconds")
GBM Accuracy: 0.9543
GBM Training Time: 28.75 seconds
Now time for the other one.
# Train XGBoost
xgb_model = xgb.XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    use_label_encoder=False,    # deprecated; safe to drop on recent XGBoost versions
    eval_metric='logloss'
)
start_time = time.time()
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)
xgb_time = time.time() - start_time
# Predict
y_pred_xgb = xgb_model.predict(X_val)
xgb_acc = accuracy_score(y_val, y_pred_xgb)
print(f"XGBoost Accuracy: {xgb_acc:.4f}")
print(f"XGBoost Training Time: {xgb_time:.2f} seconds")
XGBoost Accuracy: 0.9153
XGBoost Training Time: 0.61 seconds
Drastic difference, isn’t it? Well, you will notice it even better visually.
# Plotting comparison
models = ['GBM', 'XGBoost']
accuracy = [gbm_acc, xgb_acc]
training_time = [gbm_time, xgb_time]
fig, ax1 = plt.subplots(figsize=(10, 5))
color = 'tab:blue'
ax1.set_xlabel('Model')
ax1.set_ylabel('Accuracy', color=color)
ax1.bar(models, accuracy, color=color, alpha=0.6, label='Accuracy')
ax1.tick_params(axis='y', labelcolor=color)
ax2 = ax1.twinx()
color = 'tab:red'
ax2.set_ylabel('Training Time (seconds)', color=color)
ax2.plot(models, training_time, color=color, marker='o', label='Training Time')
ax2.tick_params(axis='y', labelcolor=color)
plt.title('GBM vs XGBoost: Accuracy and Training Time')
fig.tight_layout()
plt.grid(True)
plt.show()
XGBoost reaches comparable accuracy here, but its training time is dramatically lower, and that speed is one of the biggest reasons it is preferred over GBM.
Here is the final interpretation:
| Aspect | GBM | XGBoost |
| --- | --- | --- |
| Training Speed | Slower | Faster (parallelized) |
| Accuracy | Good | Comparable; often equal or slightly better with tuning |
| Overfitting | Higher risk | Controlled with regularization |
| Missing Data Handling | Manual | Automatic |
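The last row deserves a quick demonstration. Here is a minimal sketch (toy data made up for illustration) showing that XGBoost trains directly on features containing NaN, with no imputation step:

import numpy as np
import xgboost as xgb

# Toy data (made up for illustration): note the NaN entries.
X_missing = np.array([
    [1200.0, 2.0],
    [1800.0, np.nan],   # missing bedroom count
    [np.nan, 4.0],      # missing square footage
    [2500.0, 5.0],
])
y_missing = np.array([0, 0, 1, 1])

# No imputation needed: XGBoost learns a default direction for missing
# values at each split and routes NaN rows accordingly.
clf = xgb.XGBClassifier(n_estimators=50, max_depth=2, eval_metric='logloss')
clf.fit(X_missing, y_missing)
print(clf.predict(X_missing))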
Early Stopping and Learning Curves
Imagine you're preparing for an exam:
At first, studying more improves your knowledge.
But after 6–8 hours you’re tired, and studying more doesn’t help anymore.
Smart students stop early and rest; smart models do the same.
The concept of early stopping is simply this:
Stop training the model before overfitting begins.
Normally, if you keep training XGBoost:
It keeps adding more trees
But after a point, it starts fitting the noise, not the real patterns → Overfitting
So how does early stopping save you? It works by:
Watching a validation set
Stopping automatically if the validation error doesn’t improve for N rounds in a row (a conceptual sketch follows below)
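Conceptually, the mechanism looks something like this (a hand-written sketch of the idea, not XGBoost’s actual internals; the loss values are made up):

patience = 3                      # allow 3 rounds without improvement (small, just for the sketch)
best_loss = float('inf')
rounds_without_improvement = 0

# Pretend these are validation losses measured after each new tree.
validation_losses = [0.60, 0.48, 0.41, 0.40, 0.40, 0.41, 0.40]

for round_num, loss in enumerate(validation_losses):
    if loss < best_loss:
        best_loss = loss          # new best score: reset the counter
        rounds_without_improvement = 0
    else:
        rounds_without_improvement += 1
    if rounds_without_improvement >= patience:
        print(f"Stopping at round {round_num}, best loss was {best_loss}")
        break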
Now, what is a learning curve? A learning curve typically plots:
X-axis = Number of trees (iterations)
Y-axis = Error (loss) on training set and validation set
It shows how your model is learning:
If both errors are decreasing → Great! Keep going.
If training error decreases but validation error increases → Overfitting!
We will get a better idea of both concepts by looking at the practical example below.
First, we will set up the necessary imports and data:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
# Sample dataset
df = pd.DataFrame({
    'Square_Feet': [1000, 1500, 2000, 1200, 2500, 1800],
    'Bedrooms': [2, 3, 4, 2, 5, 3],
    'Location_Score': [3, 4, 5, 3, 5, 4],
    'High_Price': [0, 0, 1, 0, 1, 1]
})
X = df[['Square_Feet', 'Bedrooms', 'Location_Score']]
y = df['High_Price']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)
print(df)
Square_Feet Bedrooms Location_Score High_Price
0 1000 2 3 0
1 1500 3 4 0
2 2000 4 5 1
3 1200 2 3 0
4 2500 5 5 1
5 1800 3 4 1
Now here we will train the model with early stopping enabled:
model = xgb.XGBClassifier(
    n_estimators=1000,            # deliberately large; early stopping will cut this short
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,
    colsample_bytree=0.8,
    use_label_encoder=False,      # deprecated; safe to drop on recent XGBoost versions
    eval_metric='logloss',
    early_stopping_rounds=10      # stop if validation loss doesn't improve for 10 rounds
)
# Train with early stopping. In recent XGBoost versions, early_stopping_rounds
# is set on the estimator (as above) rather than passed to fit().
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=True
)
[0] validation_0-logloss:1.38629
[1] validation_0-logloss:1.38629
[2] validation_0-logloss:1.38629
[3] validation_0-logloss:1.38629
[4] validation_0-logloss:1.38629
[5] validation_0-logloss:1.38629
[6] validation_0-logloss:1.38629
[7] validation_0-logloss:1.38629
[8] validation_0-logloss:1.38629
[9] validation_0-logloss:1.38629
At this point, XGBoost keeps adding tree after tree. But after 10 rounds in a row with no improvement on the validation set, it stops. This saves a great amount of time and also avoids overfitting; otherwise it would have built all 1000 trees!!
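Once training has stopped, you can check where it stopped. In recent XGBoost versions the fitted scikit-learn wrapper exposes these attributes whenever early stopping is enabled (a small sketch, reusing the model from above):

# Where did training actually stop, and how good was the best round?
print("Best iteration:", model.best_iteration)
print("Best validation score:", model.best_score)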
Here is how our learning curve looks for the example above:
results = model.evals_result()
# Plot
plt.figure(figsize=(10,6))
epochs = len(results['validation_0']['logloss'])
x_axis = range(0, epochs)
plt.plot(x_axis, results['validation_0']['logloss'], label='Validation Log Loss')
plt.xlabel('Number of Trees')
plt.ylabel('Log Loss')
plt.title('XGBoost Validation Error Over Trees')
plt.legend()
plt.grid(True)
plt.show()
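Note that we only plotted the validation loss above. To also see the training curve described earlier (and spot the divergence that signals overfitting), you can pass the training set into eval_set as well. Here is a minimal sketch reusing the variables defined above; with multiple eval sets, XGBoost uses the last one for early stopping:

# Track both training and validation loss (reuses X_train, y_train, X_val, y_val).
model2 = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.1,
    max_depth=3,
    eval_metric='logloss',
    early_stopping_rounds=10,
)
model2.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_val, y_val)],  # index 0 = train, 1 = validation
    verbose=False,
)

results2 = model2.evals_result()
plt.figure(figsize=(10, 6))
plt.plot(results2['validation_0']['logloss'], label='Training Log Loss')
plt.plot(results2['validation_1']['logloss'], label='Validation Log Loss')
plt.xlabel('Number of Trees')
plt.ylabel('Log Loss')
plt.title('Training vs Validation Loss')
plt.legend()
plt.grid(True)
plt.show()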
Uh! You may object that this doesn’t look much like a curve, right? Well, it isn’t necessarily bent every time; here is a detailed interpretation:
| Behavior | What it Means |
| --- | --- |
| Loss decreases steadily | Model is learning well |
| Loss flattens | Model stopped learning; better to stop |
| Loss increases | Overfitting danger! |
In our case, the loss has flattened, so it is better to stop. Early stopping saves a huge amount of training time (instead of blindly training 500–1000 trees). Always remember: in production settings, a "good enough" model beats an overtrained one.
Wrap Up
Okay, it is time to wrap things up. We have covered the use cases of XGBoost and why it is known as an advanced version of Gradient Boosting. Here are a few key insights, tabulated:
| Topic | Key Insights |
| --- | --- |
| Gradient Boosting | An ensemble method that builds weak learners (usually decision trees) sequentially, correcting previous errors using gradients of a loss function. |
| Loss Functions | You explored MSE, MAE, and Log Loss, learned how each one behaves, and saw gradient visualizations to understand how they guide learning. |
| Why Gradient Boosting Works | It focuses on "what went wrong" in earlier trees and makes future trees better, like a teacher correcting homework line by line. |
| XGBoost | An optimized, scalable, regularized version of Gradient Boosting; much faster, more accurate, and more tunable. |
Simply put:
Gradient = direction to minimize error
Boosting = build trees that follow this gradient
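In standard gradient boosting notation (a generic textbook formulation, not code from this post), those two lines become:

$$
r_i^{(m)} = -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F = F_{m-1}}
\qquad\qquad
F_m(x) = F_{m-1}(x) + \nu \, h_m(x)
$$

Each new tree $h_m$ is fitted to the negative-gradient "residuals" $r^{(m)}$, and the learning rate $\nu$ scales its contribution.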
Ahh, I almost forgot about a bonus. From the following days onward, we will also focus on being interview ready, so at the end of each day expect a bonus like this:
| Question | What to Focus On |
| --- | --- |
| What is Gradient Boosting? | Talk about sequential learning and residual correction |
| How does it differ from Random Forest? | Sequential vs parallel trees |
| What is the role of learning rate? | Smaller rate = slower but more accurate |
| Why is XGBoost so powerful? | Regularization, speed, missing value handling, built-in CV |
| How does early stopping work? | Validation-based halting of training to prevent overfitting |
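The table above mentions XGBoost’s built-in cross-validation. As a small parting sketch, here is one way to use the native xgb.cv helper (fresh synthetic data and illustrative parameter values, so it stands on its own):

import xgboost as xgb
from sklearn.datasets import make_classification

# Fresh synthetic data so this snippet is self-contained.
X_cv, y_cv = make_classification(n_samples=2000, n_features=20, random_state=42)
dtrain = xgb.DMatrix(X_cv, label=y_cv)

params = {'objective': 'binary:logistic', 'eval_metric': 'logloss',
          'max_depth': 5, 'eta': 0.1}

cv_results = xgb.cv(
    params, dtrain,
    num_boost_round=300,
    nfold=5,
    early_stopping_rounds=10,   # stop when the mean validation loss stalls
    seed=42,
)
print(cv_results.tail())        # per-round train/test logloss (mean and std)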
Ciao!!