3. Gradient Descent: Exploring Momentum


Implementing a Momentum-Based Optimizer
🚀 Context
Last time, we implemented a simple learning rate scheduler and identified a major flaw: as training progresses, it becomes unreliable at steering the optimization. So, we decided to upgrade it.
I had always heard that the Adam optimizer (which I used blindly before) uses momentum to overcome local minima. It seemed genius, but I never really explored how it worked under the hood.
So, I dedicated a full day to discovering this concept myself. Instead of researching Adam in detail, I wanted to build something with a similar working principle on my own.
💡 Main Idea
Momentum-based optimization can be imagined as a ball rolling down a hill, overcoming small pits and stopping at the deepest point.
With a reasonable learning rate, plain gradient descent doesn't overshoot: each iteration pulls the parameters closer to the minimum. Adding momentum, however, introduces a velocity effect, allowing:
✅ Faster descent on steep slopes (gains speed going downhill)
✅ Slower ascent when moving against the gradient (reduces unnecessary oscillations)
The formula is simple:
$$M = \beta \cdot M_{\text{prev}} + \alpha \cdot G$$
Where:

- $\beta$: momentum coefficient
- $M_{\text{prev}}$: previous momentum
- $\alpha$: learning rate
- $G$: gradient
β (beta) controls how much momentum is retained. A higher β keeps more past information, leading to smoother updates. For instance, under a constant gradient $G$ the momentum settles at $\alpha G / (1 - \beta)$, so β = 0.9 effectively amplifies each step about tenfold.
Instead of updating parameters directly with the gradient, we use momentum:
$$k \leftarrow k - M$$
This smooths the gradient updates, acting like a moving average that prevents wild swings.
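To see the smoothing in action, here is a tiny sketch (plain Python, with made-up gradient values and a hypothetical β = 0.9, α = 0.1 chosen purely for illustration) that applies the update formula to a gradient sequence that flips sign, just like the trace in the table below:

```python
# Toy demo of the momentum update M = beta * M_prev + alpha * G.
# The gradient values below are invented for illustration only.
beta, alpha = 0.9, 0.1
M = 0.0
gradients = [1.5, 0.6, -0.2, -1.0, -1.7]  # sign flips at the third step

for G in gradients:
    M = beta * M + alpha * G
    print(f"G = {G:+.1f}  ->  M = {M:+.4f}")

# M stays positive for a couple of steps after G turns negative,
# so the parameter keeps moving in its old direction before reversing.
```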
📊 Momentum and Gradient Behavior
| Epoch | Loss | k Gradient | k Momentum |
| --- | --- | --- | --- |
| 11 | 1.503 | 1.495 | 0.45 |
| 12 | 0.914 | 0.609 | 0.425 |
| 13 | 0.849 | -0.226 | 0.393 |
| 14 | 1.239 | -0.998 | 0.355 |
| 15 | 1.763 | -1.697 | 0.314 |
| 16 | 2.279 | -2.313 | 0.268 |
| 17 | 2.743 | -2.841 | 0.221 |
| 18 | 3.133 | -3.276 | 0.173 |
| 19 | 3.437 | -3.616 | 0.125 |
| 20 | 3.656 | -3.862 | 0.077 |
| 21 | 3.790 | -4.014 | 0.032 |
| 22 | 3.841 | -4.077 | -0.011 |
| 23 | 3.814 | -4.055 | -0.051 |
Notice how momentum smooths the updates: the gradient changes sign at epoch 13, yet the momentum stays positive until epoch 22, so the optimizer keeps moving in its previous direction and the transition is gradual rather than abrupt.
🛠️ Implementation
```python
class MomentumOptimizer:
    def __init__(self, param, lr=0.001, beta=0.93):
        self.param = param
        self.lr = lr           # learning rate (alpha)
        self.beta = beta       # momentum coefficient
        self.momentum = 0.0    # running momentum M, starts at zero

    def step(self, gradient):
        # M = beta * M_prev + alpha * G
        self.momentum = self.beta * self.momentum + self.lr * gradient
        # k = k - M
        self.param -= self.momentum

    def get_param(self):
        return self.param
```
I tested this on a simple linear function:
$$f(x) = kx + b$$
Of course, a linear function doesn’t have complex minima, but it's a good starting point.
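For completeness, here is a minimal usage sketch. It is my reconstruction, not the exact training script: the true values k = 2, b = 1 and the data range are assumptions, while the noise (mean = 0, std = 1) and the hyperparameters match the ones reported below.

```python
import numpy as np

# Hypothetical reconstruction: fit f(x) = k*x + b to noisy data with one
# MomentumOptimizer per parameter. True k=2, b=1 are made-up values.
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=200)  # noise: mean=0, std=1

k_opt = MomentumOptimizer(param=0.0, lr=0.01, beta=0.93)
b_opt = MomentumOptimizer(param=0.0, lr=0.01, beta=0.93)

for epoch in range(450):  # MAX_EPOCHS
    k, b = k_opt.get_param(), b_opt.get_param()
    err = k * x + b - y
    loss = np.mean(err ** 2)          # MSE
    k_opt.step(np.mean(2 * err * x))  # dLoss/dk
    b_opt.step(np.mean(2 * err))      # dLoss/db

print(k_opt.get_param(), b_opt.get_param())  # approaches ~2.0 and ~1.0
```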
⚠️ Challenges
This approach improved stability, but it's not perfect:

- Momentum can overshoot the minimum if β is too high.
- A learning rate that is too large makes the updates explode, leading to instability.

The overshoot problem is easy to reproduce, as shown in the sketch below.
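Here is a small illustrative example of that first failure mode. It uses f(x) = x², not the linear model from this post, with a deliberately high β = 0.99:

```python
# Overshoot demo on f(x) = x^2, whose gradient is 2x and minimum is x = 0.
# With beta close to 1, accumulated momentum carries x past the minimum,
# and the iterate oscillates instead of settling.
x, M = 5.0, 0.0
lr, beta = 0.01, 0.99

for i in range(100):
    M = beta * M + lr * (2 * x)  # momentum update
    x -= M                       # parameter update
    if i % 20 == 0:
        print(f"step {i:3d}: x = {x:+.3f}")
```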
📊 Results
Training was successful; the loss fluctuates as expected because of the noise intentionally added to the data (mean = 0, std = 1).
Hyperparameters:
- `MAX_EPOCHS = 450`
- `k_lr = 0.01`
- `b_lr = 0.01`
- `beta = 0.93`
Final Loss:
✅ Train Loss: 0.8027
✅ Test Loss: 0.6844
📉 Training Progress:
📊 Loss Function with Respect to Parameters
🔜 What’s Next?
While this is a step forward, momentum alone isn’t enough.
🚧 Problems to fix:
- Unstable gradients when the learning rate is too large
- Exploding updates in complex functions
So, in the next post, we’ll refine this further—stay tuned! 🎯