3. Gradient Descent: Exploring Momentum


Implementing a Momentum-Based Optimizer

🚀 Context

Last time, we implemented a simple learning rate scheduler and identified a major flaw: over time, it becomes unreliable for optimizing the training process. So, we decided to upgrade it.

I had always heard that the Adam optimizer (which I used blindly before) uses momentum to overcome local minima. It seemed genius, but I never really explored how it worked under the hood.

So, I dedicated a full day to discovering this concept myself. Instead of researching Adam in detail, I wanted to build something with a similar working principle on my own.

💡 Main Idea

Momentum-based optimization can be imagined as a ball rolling down a hill, overcoming small pits and stopping at the deepest point.

Normally, gradient updates don’t overshoot, because each iteration pulls the parameters closer to the minimum. Adding momentum, however, introduces a velocity effect, allowing:
✅ Faster descent on steep slopes (gains speed going downhill)
✅ Slower ascent when moving against the gradient (reduces unnecessary oscillations)

The formula is simple:

$$M = \beta \cdot M_{\text{prev}} + \alpha \cdot G$$

Where:

  • $\beta$: momentum coefficient

  • $M_{\text{prev}}$: previous momentum

  • $\alpha$: learning rate

  • $G$: current gradient

β (beta) controls how much momentum is retained. A higher β keeps more past information, leading to smoother updates: with β = 0.93, for example, past gradients contribute over an effective horizon of roughly 1/(1 − β) ≈ 14 steps.

Instead of updating a parameter directly with the gradient, we subtract the momentum term (here $k$ is the parameter being optimized):

$$k = k - M$$

This smooths the gradient updates, acting like a moving average that prevents wild swings.

📊 Momentum and Gradient Behavior

| Epoch | Loss  | k Gradient | k Momentum |
| ----- | ----- | ---------- | ---------- |
| 11    | 1.503 | 1.495      | 0.45       |
| 12    | 0.914 | 0.609      | 0.425      |
| 13    | 0.849 | -0.226     | 0.393      |
| 14    | 1.239 | -0.998     | 0.355      |
| 15    | 1.763 | -1.697     | 0.314      |
| 16    | 2.279 | -2.313     | 0.268      |
| 17    | 2.743 | -2.841     | 0.221      |
| 18    | 3.133 | -3.276     | 0.173      |
| 19    | 3.437 | -3.616     | 0.125      |
| 20    | 3.656 | -3.862     | 0.077      |
| 21    | 3.790 | -4.014     | 0.032      |
| 22    | 3.841 | -4.077     | -0.011     |
| 23    | 3.814 | -4.055     | -0.051     |

Notice how momentum smooths updates, even after the gradient changes sign at epoch 13. The optimizer still moves in the previous direction until epoch 22, making transitions smoother.
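
To see this lag in isolation, here is a toy sketch: the bare update rule applied to a hand-picked gradient sequence whose sign flips midway. (This is not the run from the table; the numbers are made up purely for illustration.)

```python
# Momentum lag in isolation: the gradient flips sign at step 6,
# but the accumulated momentum stays positive through step 15.
beta, lr = 0.93, 0.01
momentum = 0.0
gradients = [2.0] * 5 + [-0.5] * 10  # sign flips after step 5

for step, g in enumerate(gradients, start=1):
    momentum = beta * momentum + lr * g  # M = beta * M_prev + alpha * G
    print(f"step {step:2d}  gradient {g:+.2f}  momentum {momentum:+.4f}")
```

Just as in the table, the update keeps pointing the old way long after the raw gradient has turned around.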

🛠️ Implementation

```python
class MomentumOptimizer:
    def __init__(self, param, lr=0.001, beta=0.93):
        self.param = param
        self.lr = lr          # learning rate (alpha)
        self.beta = beta      # momentum coefficient
        self.momentum = 0.0   # running momentum M

    def step(self, gradient):
        # M = beta * M_prev + alpha * G
        self.momentum = self.beta * self.momentum + self.lr * gradient
        # k = k - M
        self.param -= self.momentum

    def get_param(self):
        return self.param
```

I tested this on a simple linear function:

$$f(x) = kx + b$$

Of course, a linear function doesn’t have complex minima, but it's a good starting point.
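
For reference, here is roughly how the optimizer plugs into a training loop. This is a minimal sketch assuming a mean-squared-error loss; the ground-truth slope and intercept (2 and 1) and the data range are made up for illustration, while the noise (mean = 0, std = 1) and the hyperparameters match the Results section below:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy linear data: y = 2x + 1 + noise (mean=0, std=1)
x = rng.uniform(-5, 5, size=200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=x.shape)

# One optimizer per parameter
k_opt = MomentumOptimizer(param=0.0, lr=0.01, beta=0.93)
b_opt = MomentumOptimizer(param=0.0, lr=0.01, beta=0.93)

for epoch in range(450):  # MAX_EPOCHS
    k, b = k_opt.get_param(), b_opt.get_param()
    err = k * x + b - y

    loss = np.mean(err ** 2)       # MSE
    k_grad = np.mean(2 * err * x)  # dLoss/dk
    b_grad = np.mean(2 * err)      # dLoss/db

    k_opt.step(k_grad)
    b_opt.step(b_grad)

    if epoch % 50 == 0:
        print(f"epoch {epoch:3d}  loss {loss:.4f}")
```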

⚠️ Challenges

This approach improved stability, but it’s not perfect:

  • Momentum can overshoot if β is too high.

  • A learning rate that is too large makes the updates explode, destabilizing training.

Both effects showed up during my training runs.
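
Here is a quick sketch of both failure modes on the quadratic $f(k) = k^2$ (a stand-in toy function I chose for illustration, not my actual test setup):

```python
def run(lr, beta=0.93, steps=200, k=5.0):
    """Minimize f(k) = k^2 (gradient 2k) with the momentum update rule."""
    momentum = 0.0
    for _ in range(steps):
        momentum = beta * momentum + lr * (2 * k)
        k -= momentum
    return k

print(run(lr=0.01))              # converges: k decays toward 0
print(run(lr=0.01, beta=0.999))  # beta too high: still oscillating far from 0
print(run(lr=2.0))               # lr too large: updates explode
```

With β too high, the “ball” carries so much speed that it keeps rolling back and forth past the minimum; with the learning rate too large, each bounce is bigger than the last and the iterates diverge.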

📊 Results

Training was successful; the loss fluctuates as expected because of the noise intentionally added to the data (mean = 0, std = 1).

Hyperparameters:

  • MAX_EPOCHS = 450

  • k_lr = 0.01

  • b_lr = 0.01

  • beta = 0.93

Final Loss:
Train Loss: 0.8027
Test Loss: 0.6844


📉 Training Progress (plot)

📊 Loss Function with Respect to Parameters (plot)

🔜 What’s Next?

While this is a step forward, momentum alone isn’t enough.

🚧 Problems to fix:

  • Unstable gradients when learning rate is too large

  • Exploding updates in complex functions

So, in the next post, we’ll refine this further—stay tuned! 🎯

