3. Gradient Descent: Exploring Momentum


Implementing a Momentum-Based Optimizer
🚀 Context
Last time, we implemented a simple learning rate scheduler and identified a major flaw: as training progresses, it becomes unreliable at steering the optimization. So, we decided to upgrade it.
I had always heard that the Adam optimizer (which I used blindly before) uses momentum to overcome local minima. It seemed genius, but I never really explored how it worked under the hood.
So, I dedicated a full day to discovering this concept myself. Instead of researching Adam in detail, I wanted to build something with a similar working principle on my own.
💡 Main Idea
Momentum-based optimization can be imagined as a ball rolling down a hill, overcoming small pits and stopping at the deepest point.
With a reasonable learning rate, plain gradient descent doesn't overshoot: each iteration pulls the parameters closer to the minimum. Adding momentum, however, introduces a velocity effect, allowing:
✅ Faster descent on steep slopes (gains speed going downhill)
✅ Slower ascent when moving against the gradient (reduces unnecessary oscillations)
The formula is simple:
$$M = \beta \cdot M_{\text{prev}} + \alpha \cdot G$$
Where:

- $\beta$: momentum coefficient
- $M_{\text{prev}}$: previous momentum
- $\alpha$: learning rate
- $G$: gradient
β (beta) controls how much momentum is retained. A higher β keeps more past information, leading to smoother updates. For instance, under a constant gradient $G$ the momentum settles at $\alpha G / (1 - \beta)$, so β = 0.9 effectively amplifies each step about tenfold.
Instead of updating parameters directly with the gradient, we use momentum:
$$k \leftarrow k - M$$
This smooths the gradient updates, acting like a moving average that prevents wild swings.
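To see the smoothing in action, here is a tiny sketch (plain Python, with made-up gradient values and a hypothetical β = 0.9, α = 0.1 chosen purely for illustration) that applies the update formula to a gradient sequence that flips sign, just like the trace in the table below:

```python
# Toy demo of the momentum update M = beta * M_prev + alpha * G.
# The gradient values below are invented for illustration only.
beta, alpha = 0.9, 0.1
M = 0.0
gradients = [1.5, 0.6, -0.2, -1.0, -1.7]  # sign flips at the third step

for G in gradients:
    M = beta * M + alpha * G
    print(f"G = {G:+.1f}  ->  M = {M:+.4f}")

# M stays positive for a couple of steps after G turns negative,
# so the parameter keeps moving in its old direction before reversing.
```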
📊 Momentum and Gradient Behavior
| Epoch | Loss | k Gradient | k Momentum |
| --- | --- | --- | --- |
| 11 | 1.503 | 1.495 | 0.45 |
| 12 | 0.914 | 0.609 | 0.425 |
| 13 | 0.849 | -0.226 | 0.393 |
| 14 | 1.239 | -0.998 | 0.355 |
| 15 | 1.763 | -1.697 | 0.314 |
| 16 | 2.279 | -2.313 | 0.268 |
| 17 | 2.743 | -2.841 | 0.221 |
| 18 | 3.133 | -3.276 | 0.173 |
| 19 | 3.437 | -3.616 | 0.125 |
| 20 | 3.656 | -3.862 | 0.077 |
| 21 | 3.790 | -4.014 | 0.032 |
| 22 | 3.841 | -4.077 | -0.011 |
| 23 | 3.814 | -4.055 | -0.051 |
Notice how momentum smooths the updates: the gradient changes sign at epoch 13, yet the momentum stays positive until epoch 22, so the optimizer keeps moving in its previous direction and the transition is gradual rather than abrupt.
🛠️ Implementation
```python
class MomentumOptimizer:
    def __init__(self, param, lr=0.001, beta=0.93):
        self.param = param
        self.lr = lr           # learning rate (alpha)
        self.beta = beta       # momentum coefficient
        self.momentum = 0.0    # running momentum M, starts at zero

    def step(self, gradient):
        # M = beta * M_prev + alpha * G
        self.momentum = self.beta * self.momentum + self.lr * gradient
        # k = k - M
        self.param -= self.momentum

    def get_param(self):
        return self.param
```
I tested this on a simple linear function:
$$f(x) = kx + b$$
Of course, a linear function doesn’t have complex minima, but it's a good starting point.
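For completeness, here is a minimal usage sketch. It is my reconstruction, not the exact training script: the true values k = 2, b = 1 and the data range are assumptions, while the noise (mean = 0, std = 1) and the hyperparameters match the ones reported below.

```python
import numpy as np

# Hypothetical reconstruction: fit f(x) = k*x + b to noisy data with one
# MomentumOptimizer per parameter. True k=2, b=1 are made-up values.
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=200)  # noise: mean=0, std=1

k_opt = MomentumOptimizer(param=0.0, lr=0.01, beta=0.93)
b_opt = MomentumOptimizer(param=0.0, lr=0.01, beta=0.93)

for epoch in range(450):  # MAX_EPOCHS
    k, b = k_opt.get_param(), b_opt.get_param()
    err = k * x + b - y
    loss = np.mean(err ** 2)          # MSE
    k_opt.step(np.mean(2 * err * x))  # dLoss/dk
    b_opt.step(np.mean(2 * err))      # dLoss/db

print(k_opt.get_param(), b_opt.get_param())  # approaches ~2.0 and ~1.0
```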
⚠️ Challenges
This approach improved stability, but it's not perfect:

- Momentum can overshoot the minimum if β is too high.
- A learning rate that is too large makes the updates explode, leading to instability.

The overshoot problem is easy to reproduce, as shown in the sketch below.
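Here is a small illustrative example of that first failure mode. It uses f(x) = x², not the linear model from this post, with a deliberately high β = 0.99:

```python
# Overshoot demo on f(x) = x^2, whose gradient is 2x and minimum is x = 0.
# With beta close to 1, accumulated momentum carries x past the minimum,
# and the iterate oscillates instead of settling.
x, M = 5.0, 0.0
lr, beta = 0.01, 0.99

for i in range(100):
    M = beta * M + lr * (2 * x)  # momentum update
    x -= M                       # parameter update
    if i % 20 == 0:
        print(f"step {i:3d}: x = {x:+.3f}")
```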
📊 Results
Training was successful; the loss fluctuates as expected because of the noise intentionally added to the data (mean = 0, std = 1).
Hyperparameters:
- `MAX_EPOCHS = 450`
- `k_lr = 0.01`
- `b_lr = 0.01`
- `beta = 0.93`
Final Loss:
✅ Train Loss: 0.8027
✅ Test Loss: 0.6844
📉 Training Progress:
📊 Loss Function with Respect to Parameters
🔜 What’s Next?
While this is a step forward, momentum alone isn’t enough.
🚧 Problems to fix:
- Unstable gradients when the learning rate is too large
- Exploding updates in complex functions
So, in the next post, we’ll refine this further—stay tuned! 🎯