3. Gradient Descent: Exploring Momentum


Implementing Momentum-based Optimizer

🚀 Context

Last time, we implemented a simple learning rate scheduler and identified a major flaw: over time, it becomes unreliable for optimizing the training process. So, we decided to upgrade it.

I had always heard that the Adam optimizer (which I used blindly before) uses momentum to overcome local minima. It seemed genius, but I never really explored how it worked under the hood.

So, I dedicated a full day to discovering this concept myself. Instead of researching Adam in detail, I wanted to build something with a similar working principle on my own.

💡 Main Idea

Momentum-based optimization can be pictured as a ball rolling down a hill: it carries enough speed to roll through small pits and tends to settle in a deeper valley.

With plain gradient descent and a suitably small learning rate, updates rarely overshoot because each iteration pulls the parameters a little closer to the minimum. Adding momentum introduces a velocity effect, allowing:
✅ Faster descent on steep slopes (gains speed going downhill)
✅ Slower ascent when moving against the gradient (reduces unnecessary oscillations)

The formula is simple:

$$M = \beta \cdot M_{\text{prev}} + \alpha \cdot G$$


$$\beta - \text{Momentum coefficient}$$

$$M_{\text{prev}} - \text{Previous momentum}$$

$$\alpha - \text{Learning rate}$$

$$G - \text{Gradient}$$

β (beta) controls how much momentum is retained. A higher β keeps more past information, leading to smoother updates.
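To make "retained" concrete, the recursion can be unrolled. The step index t and the starting value M_0 = 0 are notation assumed for this sketch, not part of the formula above:

$$M_t = \beta \cdot M_{t-1} + \alpha \cdot G_t = \alpha \sum_{i=0}^{t-1} \beta^{i} \cdot G_{t-i}$$

A gradient from i steps ago is weighted by β^i, so older gradients fade geometrically. If the gradient stayed constant at G, the sum would approach:

$$M_{\infty} = \frac{\alpha \cdot G}{1 - \beta}$$

With β = 0.93 (the value used in the implementation in this post) that limit is roughly 14 times a single plain step α · G.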

Instead of updating parameters directly with the gradient, we use momentum:

$$k = k - M$$

This smooths the updates: momentum acts like an exponentially weighted moving average of past gradients, so a single noisy gradient cannot cause a wild swing.
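The smoothing effect can be checked numerically. The following is a minimal sketch, not the code used for this post: the noisy gradients are made up (a slope of 1.0 plus unit noise), and `momentum_trace` is a hypothetical helper that simply applies the momentum formula in a loop.

```python
import random
import statistics


def momentum_trace(gradients, lr=0.01, beta=0.93):
    """Apply M = beta * M_prev + lr * G to each gradient and return every M."""
    momentum = 0.0
    trace = []
    for gradient in gradients:
        momentum = beta * momentum + lr * gradient
        trace.append(momentum)
    return trace


random.seed(0)  # fixed seed so the run is repeatable
# Hypothetical noisy gradients: a true slope of 1.0 plus unit-variance noise.
noisy_gradients = [1.0 + random.gauss(0.0, 1.0) for _ in range(1000)]

lr, beta = 0.01, 0.93
raw_steps = [lr * g for g in noisy_gradients]  # plain update: lr * G
momentum_steps = momentum_trace(noisy_gradients, lr, beta)

# Rescale by (1 - beta) so both series have the same average step size,
# then skip the warm-up phase before comparing their spread.
rescaled = [(1 - beta) * m for m in momentum_steps]
raw_spread = statistics.stdev(raw_steps[100:])
smooth_spread = statistics.stdev(rescaled[100:])

print(f"spread of plain steps:    {raw_spread:.5f}")
print(f"spread of momentum steps: {smooth_spread:.5f}")
```

The rescaling is only there to make the comparison fair. Without it the momentum steps are larger on average, which is the "gains speed" part of the idea.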

📊 Momentum and Gradient Behavior

[Table columns: Epoch, Loss, k Gradient, k Momentum]

Notice how momentum smooths updates, even after the gradient changes sign at epoch 13. The optimizer still moves in the previous direction until epoch 22, making transitions smoother.
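The lag between the gradient changing sign and the momentum following it can be reproduced with a toy sequence. This sketch assumes a gradient of +1.0 that flips to -1.0 at step 13. Those values are illustrative and are not the training data behind the table, so the step at which momentum flips need not match the epochs quoted above.

```python
def first_negative_momentum_step(flip_step=13, lr=0.01, beta=0.93, max_steps=200):
    """Return the first step at which momentum turns negative.

    The gradient is +1.0 before flip_step and -1.0 from flip_step onward.
    """
    momentum = 0.0
    for step in range(1, max_steps + 1):
        gradient = 1.0 if step < flip_step else -1.0
        momentum = beta * momentum + lr * gradient
        if momentum < 0:
            return step
    return None


momentum_flip = first_negative_momentum_step()
print(f"gradient flips at step 13, momentum flips at step {momentum_flip}")
```

Setting `beta=0.0` removes the lag entirely, since the update then depends only on the current gradient.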

🛠️ Implementation

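class MomentumOptimizer:
    """Gradient descent with momentum for a single scalar parameter."""

    def __init__(self, param, lr=0.001, beta=0.93):
        self.param = param          # current parameter value (k or b)
        self.lr = lr                # learning rate (alpha)
        self.beta = beta            # momentum coefficient
        self.momentum = 0.0         # M, starts at rest
        self.prev_momentum = 0.0    # M_prev, kept for inspection

    def step(self, gradient, epoch=None):
        # epoch is accepted so callers can pass it, but it is not used here.
        self.prev_momentum = self.momentum
        # M = beta * M_prev + alpha * G
        self.momentum = self.beta * self.prev_momentum + self.lr * gradient
        # k = k - M
        self.param -= self.momentum

    def get_param(self):
        return self.param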

I tested this on a simple linear function:

$$f(x) = kx + b$$

Of course, fitting a linear function doesn't produce a loss surface with complex local minima, but it's a good starting point.
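A test like this could look roughly as follows. This is a sketch under stated assumptions, not the original experiment: the loss is assumed to be mean squared error, the targets `TRUE_K` and `TRUE_B`, the sample size, and the x range are made up, and the optimizer is a compact copy of the class above with the unused `epoch` argument dropped. The hyperparameters (450 epochs, learning rate 0.01, beta 0.93) and the noise (mean=0, std=1) follow the values listed in this post.

```python
import random


class MomentumOptimizer:
    """Compact copy of the optimizer above so this sketch runs on its own."""

    def __init__(self, param, lr=0.001, beta=0.93):
        self.param, self.lr, self.beta = param, lr, beta
        self.momentum = 0.0

    def step(self, gradient):
        self.momentum = self.beta * self.momentum + self.lr * gradient
        self.param -= self.momentum


def fit_line(xs, ys, epochs=450, lr=0.01, beta=0.93):
    """Fit y = k * x + b by minimizing mean squared error (assumed loss)."""
    k_opt = MomentumOptimizer(0.0, lr=lr, beta=beta)
    b_opt = MomentumOptimizer(0.0, lr=lr, beta=beta)
    n = len(xs)
    for _ in range(epochs):
        k, b = k_opt.param, b_opt.param
        errors = [k * x + b - y for x, y in zip(xs, ys)]
        # Gradients of mean((k*x + b - y)^2) with respect to k and b.
        grad_k = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        k_opt.step(grad_k)
        b_opt.step(grad_b)
    return k_opt.param, b_opt.param


random.seed(0)
TRUE_K, TRUE_B = 2.0, 1.0  # hypothetical targets, not the post's data
xs = [random.uniform(-2.0, 2.0) for _ in range(200)]
ys = [TRUE_K * x + TRUE_B + random.gauss(0.0, 1.0) for x in xs]

k, b = fit_line(xs, ys)
print(f"k = {k:.3f}, b = {b:.3f}")
```

Each parameter gets its own optimizer instance, so k and b each keep their own momentum.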

⚠️ Challenges

This approach improved stability, but it’s not perfect:

  • Momentum can overshoot if β is too high.

  • Incorrect learning rates (too large) make gradients explode, leading to instability.
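Both failure modes show up even on a one-parameter toy problem. The sketch below assumes the loss k² (gradient 2k), which is not the model from this post, and uses hypothetical helpers `run_momentum` and `zero_crossings`. For this update rule on a quadratic with curvature λ, the iteration stays stable only while α · λ < 2 · (1 + β). Here λ = 2, so with β = 0.93 a learning rate of 2.0 sits just past that limit.

```python
def run_momentum(lr, beta, steps, start=5.0):
    """Minimize the toy loss k**2 (gradient 2*k) and return every k visited."""
    k, momentum = start, 0.0
    path = [k]
    for _ in range(steps):
        gradient = 2.0 * k
        momentum = beta * momentum + lr * gradient
        k -= momentum
        path.append(k)
    return path


def zero_crossings(path):
    """Count how often k jumps over the minimum at k = 0."""
    return sum(1 for a, b in zip(path, path[1:]) if a * b < 0)


no_momentum = run_momentum(lr=0.01, beta=0.0, steps=200)
with_momentum = run_momentum(lr=0.01, beta=0.93, steps=200)
too_large_lr = run_momentum(lr=2.0, beta=0.93, steps=100)

print("overshoots with beta=0.00:", zero_crossings(no_momentum))
print("overshoots with beta=0.93:", zero_crossings(with_momentum))
print("final |k| with lr=0.01:", abs(with_momentum[-1]))
print("final |k| with lr=2.0: ", abs(too_large_lr[-1]))
```

With the small learning rate, momentum overshoots the minimum several times but still settles. With the large one, every step makes things worse.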

Both issues can be seen in the training process:

📊 Results

Training converged, but the loss still fluctuates, as expected, because of the noise intentionally added to the data (mean=0, std=1).


  • MAX_EPOCHS = 450

  • k_lr = 0.01

  • b_lr = 0.01

  • beta = 0.93

Final Loss:
Train Loss: 0.8027
Test Loss: 0.6844

📉 Training Progress:

📊 Loss Function with Respect to Parameters

🔜 What’s Next?

While this is a step forward, momentum alone isn’t enough.

🚧 Problems to fix:

  • Unstable gradients when learning rate is too large

  • Exploding updates in complex functions

So, in the next post, we’ll refine this further. Stay tuned! 🎯

