Day 15: Gradient Boosting (GBM) – The Layered Learner

Saket Khopkar

Nothing turns out perfect on the first attempt, right? There will be mistakes at every step along the way, and correcting those mistakes, one by one, is what eventually leads to success. That idea is exactly what today’s topic, Gradient Boosting, revolves around.

Imagine you’re a chef perfecting a recipe.

  • Attempt 1: It's okay but bland.

  • Attempt 2: You fix the flavor by adding lemon.

  • Attempt 3: Still not spicy, so you add chili.

  • Attempt 4: Still needs freshness, so you add herbs.

Each attempt corrects the specific flaws of the last; that’s exactly how Gradient Boosting works.

Now imagine instead of fixing soup, you're fixing predictions. Gradient Boosting:

  1. Starts with a weak model (bad soup).

  2. Measures where it went wrong (missing flavor).

  3. Builds another model to fix just those mistakes (adds chili).

  4. Repeats until the dish (or prediction) is excellent.

In machine learning terms:

  • You want to reduce the loss function (error) as much as possible.

  • But instead of fixing everything at once, you fix it in small pieces.

  • Each model (tree) focuses on what the previous model did wrong, as sketched in the code just below.
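To see this loop in action, here is a minimal sketch in Python. It uses small scikit-learn regression trees as the weak learners; the toy data, tree depth, learning rate and number of rounds are all assumptions made purely for illustration, not the full algorithm.

# A minimal, hand-rolled gradient boosting loop for regression (squared error).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)  # noisy toy target

learning_rate = 0.1
n_rounds = 50

# 1. Start with a weak model: just predict the mean everywhere.
prediction = np.full_like(y, y.mean())

for _ in range(n_rounds):
    # 2. Measure where we went wrong. For squared error, the negative
    #    gradient is simply the residual: actual - current prediction.
    residuals = y - prediction

    # 3. Build a small tree that fixes just those mistakes.
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)

    # 4. Take a small, controlled step in the corrective direction.
    prediction += learning_rate * tree.predict(X)

print("Training MSE after boosting:", np.mean((y - prediction) ** 2))

Each round only nudges the prediction by learning_rate times the new tree's output, which is why many small sequential corrections are needed instead of one big fix.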


What’s a Gradient?

A gradient is just a fancy name for:

“The direction and rate of fastest increase of a function.”

But in machine learning, we often flip it:

We use the negative gradient to find the fastest way to decrease error.

Think of it like walking down a hill:

  • The slope tells you which way is downhill.

  • If you keep taking steps in that direction, you’ll get closer to the bottom, which means lower error.

In Gradient Boosting, the idea is:

  • You start with a bad model.

  • You look at the mistakes it made.

  • Then you ask:
    “How can I move closer to the correct answers?”

The gradient of the loss function tells you exactly that: how much, and in which direction, you should adjust your prediction to reduce the error.
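As a tiny, made-up illustration, you can follow the negative gradient of the squared error step by step and watch a single prediction walk toward the correct answer.

# Gradient descent on a single prediction under squared error.
# Loss: L(pred) = (pred - actual)^2, gradient: dL/dpred = 2 * (pred - actual)
actual = 85.0
pred = 70.0            # start with a bad guess
learning_rate = 0.1

for step in range(5):
    gradient = 2 * (pred - actual)          # direction of steepest increase
    pred = pred - learning_rate * gradient  # move the opposite way
    loss = (pred - actual) ** 2
    print(f"step {step + 1}: prediction = {pred:.2f}, loss = {loss:.2f}")

Each step moves the prediction closer to the correct answer and the loss keeps shrinking; GBM does the same thing, except each “step” is a whole new tree.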

Why does GBM stand out?

Feature of GBM      | Why It’s Special
Sequential          | Learners are added one after another
Gradient-based      | Uses gradients to minimize error
Residual-driven     | Each learner learns the “mistakes” (residuals) of the previous one
Controlled learning | Uses a learning rate to prevent overfitting
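In practice you rarely write this loop yourself; scikit-learn’s GradientBoostingRegressor exposes these same knobs directly. The dataset below is a synthetic stand-in, and the parameter names follow recent scikit-learn versions.

# The table above maps directly onto scikit-learn's GBM parameters.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingRegressor(
    n_estimators=200,      # sequential: trees are added one after another
    learning_rate=0.05,    # controlled learning: shrink each tree's contribution
    max_depth=3,           # keep each learner weak
    loss="squared_error",  # gradient-based: residuals come from this loss
)
gbm.fit(X_train, y_train)
print("R^2 on the test set:", gbm.score(X_test, y_test))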

Coded example of a Gradient

Let us look at a code example of what a gradient looks like:

# Concept of Gradient
import numpy as np
import matplotlib.pyplot as plt

# A single true value and a range of candidate predictions
y_true = 1.0  # the true value
y_preds = np.linspace(-2, 4, 100)  # range of model predictions

# Calculate Mean Squared Error (MSE) and its gradient
mse_loss = (y_preds - y_true)**2
mse_gradient = 2 * (y_preds - y_true)

# Plotting
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(y_preds, mse_loss, label='MSE Loss')
plt.title("Loss Curve (MSE)")
plt.xlabel("Predicted Value")
plt.ylabel("Loss")
plt.grid(True)
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(y_preds, mse_gradient, label='Gradient', color='red')
plt.axhline(0, color='gray', linestyle='--')
plt.title("Gradient Direction")
plt.xlabel("Predicted Value")
plt.ylabel("Gradient (slope)")
plt.grid(True)
plt.legend()

plt.tight_layout()
plt.show()

  • The left plot shows how the loss depends on your prediction.

  • The right plot shows the gradient (slope), i.e. how wrong you are and in which direction to move to reduce the loss.

  • GBM learns in the opposite direction of the gradient.


A simple example

Suppose you are predicting someone’s exam score. Your prediction is 70. But the actual result came out to be 85.

Therefore, error = 85 - 70 = 15 (Residual).

That 15 is the mistake: our model was too low by 15 points. For a single prediction, the squared error loss is L = (prediction − actual)², so its gradient with respect to the prediction is:

gradient = 2 × (prediction − actual) = 2 × (70 − 85) = −30

Gradient Boosting uses the negative of this gradient (here +30, which points in the same direction as the residual of +15) as the target for the next weak model to learn.


Loss Functions: Their Types and Their Gradients

A loss function is a mathematical way of measuring how bad your model’s predictions are. It points out where the model is lagging behind and, with the help of the gradient, suggests how it should improve. You can think of it as the compass that guides learning: without it, your model has no idea whether it’s improving or not.

Every time your model makes a prediction, the loss function:

  1. Measures how wrong the prediction was

  2. Uses that info to adjust the model (via gradients)

  3. Repeats until predictions improve

This is called “optimization”, and the goal is to find the model parameters that minimize the loss.
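Here is a bare-bones sketch of that measure-adjust-repeat loop, fitting a single parameter of a straight line to toy data; the data, step size and number of epochs are assumptions chosen just to keep the example readable.

# Minimizing MSE over a single model parameter w, where prediction = w * x.
import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + rng.normal(0, 1.0, size=100)  # the true slope is 3

w = 0.0               # start with a clueless model
learning_rate = 0.01

for epoch in range(200):
    pred = w * x
    loss = np.mean((pred - y) ** 2)        # 1. measure how wrong we are
    grad = np.mean(2 * (pred - y) * x)     # 2. gradient of the loss w.r.t. w
    w -= learning_rate * grad              # 3. adjust, then repeat

print("learned slope:", round(w, 3), "| final MSE:", round(loss, 3))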

Here are the types of loss functions:

Loss Function | Shape              | Behavior
MSE           | U-shaped curve     | Penalizes big errors heavily
MAE           | V-shaped line      | Penalizes all errors equally
Log Loss      | Sharp, steep curve | Punishes confident wrong answers

The Mean Squared Error (MSE) is the blue line in the first graph. It punishes big errors harshly (because the error is squared): the further your prediction is from the actual value, the higher the loss. It is usually best for continuous regression problems. To make things simple, consider MSE an angry coach: if you mess up badly, you get a loud shout.

The Mean Absolute Error (MAE) is the yellow line in the first graph. The punishment here is linear, i.e. every step away from the true value increases the loss at the same rate, although the curve is not as smooth. As a real-world analogy, think of MAE as a strict teacher: fair, but not too harsh.

Log Loss, however, is stricter still. It is the red curve in the second graph and only works with probabilities between 0 and 1. If your model is very confident but wrong (predicts p = 0.01 when the true label is 1), the penalty is HUGE; if it is confident and right (predicts p = 0.99), the loss is tiny. Think of Log Loss as a poker coach: it doesn’t mind small mistakes, but if you bet big and lose, it really punishes you.
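These personalities are easy to see numerically. The snippet below compares the three losses on a handful of made-up predictions; the numbers are purely illustrative.

# Comparing how MSE, MAE and Log Loss punish different kinds of mistakes.
import numpy as np

# Regression: the true value is 10; predictions range from slightly to badly off.
y_true = 10.0
for pred in [11.0, 15.0, 30.0]:
    mse = (pred - y_true) ** 2
    mae = abs(pred - y_true)
    print(f"pred = {pred:5.1f}   MSE = {mse:7.1f}   MAE = {mae:5.1f}")

# Classification: the true label is 1; p is the predicted probability of class 1.
for p in [0.99, 0.6, 0.01]:
    log_loss = -np.log(p)  # loss when the true label is 1
    print(f"p = {p:.2f}   Log Loss = {log_loss:.2f}")

MSE explodes for the wildly wrong prediction, MAE grows only linearly, and Log Loss barely notices the confident correct guess but hammers the confident wrong one.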

But have you ever wondered what the gradient looks like for each of these losses? Here is your answer.

Plot 1 : MSE

As you can see, the gradient grows larger as you move away from the correct answer, so the model corrects more aggressively the further off it is. This is used in GBM when you want smooth corrections.

Plot 2 : MAE

Here, the gradient is always +1 or -1, no matter how far off the prediction is. This makes MAE more robust to outliers but less smooth for gradient-based methods; GBM can use it with some tricks, but it’s not as common as MSE.

Plot 3 : Log Loss

Okay, so now things get interesting: notice how small the margin for error is with the log loss function. As your predicted probability approaches 0 (while the true label is 1), the gradient becomes massive, because you were confidently wrong. As your prediction approaches 1 (correct), the gradient becomes tiny, so there is little to correct. This is ideal for GBM in classification, because it gently adjusts correct guesses and harshly punishes bad ones.
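If you want to reproduce plots like these yourself, the sketch below computes the three gradients over a range of predictions; the exact ranges are arbitrary choices made for plotting.

# Gradients of the three losses with respect to the prediction.
import numpy as np
import matplotlib.pyplot as plt

y_true = 1.0
preds = np.linspace(-2, 4, 200)         # predictions for MSE and MAE
probs = np.linspace(0.01, 0.99, 200)    # predicted probability of class 1

mse_grad = 2 * (preds - y_true)         # grows with the size of the error
mae_grad = np.sign(preds - y_true)      # always +1 or -1
logloss_grad = -1.0 / probs             # explodes as p -> 0 when the true label is 1

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, x_vals, grad, title in zip(
    axes,
    [preds, preds, probs],
    [mse_grad, mae_grad, logloss_grad],
    ["MSE gradient", "MAE gradient", "Log Loss gradient (true label = 1)"],
):
    ax.plot(x_vals, grad, color="red")
    ax.axhline(0, color="gray", linestyle="--")
    ax.set_title(title)
    ax.grid(True)

plt.tight_layout()
plt.show()

Note that the MAE gradient is technically undefined exactly at the true value, which is part of why MAE needs “some tricks” in gradient-based methods.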


Finishing things off

In Gradient Boosting, the model learns where to go next by following these gradients; each tree is literally trained to "follow the arrows" of these curves. We analysed different loss functions and saw how gradients play a key role in each of them: the gradient shows us where our model is going wrong and how we can correct it in future rounds. In the end, your goal should be minimizing the loss, shouldn’t it?

Play around with the values a bit and work with bigger datasets to experience more realistic scenarios. That way, you will gain much more practical knowledge; and as I always believe, practical learning is the best way to etch a concept in your mind for a much longer period.

Happy Coding, Ciao!!
