Adam Optimizer


You’ve probably used Adam as your go-to optimizer. But do you know why it works? In this article, we’ll unpack the Adam optimizer as introduced in the original paper. This post is for anyone who wants to deeply understand what is happening when they use Adam.
Adam is an optimizer that helps us efficiently converge on a set of parameters that minimizes a stochastic objective function. In machine learning, that objective is typically a loss function, and the parameters are those of the model we are training.
The motivation
We already have many variations of optimizers, so it is fair to ask why we need another. Here we will outline the core problems Adam addresses and the consequences of leaving them unaddressed.
Problem 1: One global learning rate
Classic optimizers such as Stochastic Gradient Descent (SGD) have a single learning rate. When a model has many parameters, the gradient of one parameter at a given point in time can be vastly different from the gradients of the others, yet with one global learning rate we take the same step size for every parameter regardless of its gradient. Parameters that could receive a confident, larger update are held back by the need to keep updates small for more sensitive parameters. The step is of course scaled by the gradient itself, since the gradient is multiplied by the learning rate, but this still places an upper limit on the global learning rate: it must account for the sensitive parameters that should receive smaller updates.
Problem 2: A fixed learning rate over time
Whilst problem 1 highlights the desire for multiple learning rates, so that each parameter can update more confidently or more cautiously depending on its gradient at a point in time, there is also a related issue over time for a single parameter. The gradient for each parameter changes over the training run, and we want to take large steps when the gradient is large and smaller steps when approaching a minimum. Having per-parameter learning rates lets the step size vary across the model, but it does not vary over time within one parameter, and we still have an upper bound on the individual learning rates.
Problem 3: Manual learning rate tuning
As previously noted, with optimizers like SGD we have to choose a learning rate, and its upper bound is dictated by the parameters that need small steps. We can select learning rates by trial and error, by grid search, or with learning rate schedules, which adjust the learning rate over time according to a predetermined plan. All of these require some thought and manual effort.
Problem 4: Small batches are noisy
Due to memory constraints, we usually cannot load an entire dataset at once and instead use batches of data, as in SGD. However, the gradient of any single batch can be a noisy outlier rather than representative of the gradient over the full training set.
Problem 5: Sparse gradients
Optimizers like Adagrad help with sparse gradients by retaining a high learning rate for infrequently updated parameters. Sparse gradients are common in NLP tasks, where rarely used words should still receive meaningful updates when they do appear.
Problem 6: Flat regions
SGD slows down in flat regions of the loss surface: updates become small exactly where we might want to speed up and push through. Other optimizers adjust for this, such as SGD with momentum, which accumulates past gradients. Once we are in a flat region, the accumulated momentum carries us through rather than relying only on the current, near-zero gradient.
Step by step: inside the Adam optimizer
Now that we've seen the key challenges, let's walk through how Adam actually works step by step using the original algorithm from the paper.
$$\begin{alignedat}{2} (1) \quad & \textbf{Require: } \alpha \text{ (Stepsize)} \\ (2) \quad & \textbf{Require: } \beta_1, \beta_2 \in [0, 1) \text{ (Exponential decay rates)} \\ (3) \quad & \textbf{Require: } f(\theta) \text{ (Stochastic objective function)} \\ (4) \quad & \textbf{Require: } \theta_0 \text{ (Initial parameter vector)} \\ (5) \quad & m_0 \leftarrow 0 \quad \text{(Initialize 1st moment vector)} \\ (6) \quad & v_0 \leftarrow 0 \quad \text{(Initialize 2nd moment vector)} \\ (7) \quad & t \leftarrow 0 \quad \text{(Initialize timestep)} \\ (8) \quad & \textbf{while } \theta_t \text{ not converged do} \\ (9) \quad & \quad t \leftarrow t + 1 \\ (10) \quad & \quad g_t \leftarrow \nabla_\theta f_t(\theta_{t-1}) \quad \text{(Compute gradients)} \\ (11) \quad & \quad m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t \quad \text{(Update biased 1st moment)} \\ (12) \quad & \quad v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2 \quad \text{(Update biased 2nd moment)} \\ (13) \quad & \quad \hat{m}_t \leftarrow m_t / (1 - \beta_1^t) \quad \text{(Bias-corrected 1st moment)} \\ (14) \quad & \quad \hat{v}_t \leftarrow v_t / (1 - \beta_2^t) \quad \text{(Bias-corrected 2nd moment)} \\ (15) \quad & \quad \theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) \quad \text{(Update parameters)} \\ (16) \quad & \textbf{end while} \\ (17) \quad & \textbf{return } \theta_t \quad \text{(Resulting parameters)} \end{alignedat}$$
Initialization
Lines 1-2 set the hyperparameters of the algorithm. α is the usual learning rate, as seen in SGD, and β1 and β2 are decay rates, both between 0 and 1. β1 controls how much of the past gradient history is retained at each update step, creating an exponentially weighted average: if β1 is set to 0.9, we keep 90% of the previous moment estimate and blend in 10% of the current gradient. β2 plays the same role for the second moment, an accumulated penalty term that we will see later, along with a deeper explanation of the weighted average of the gradient.
Line 3 specifies the stochastic objective function to optimize, e.g. a loss such as mean squared error.
Line 4 initializes the parameter vector, just like other optimizers.
Line 5 creates a vector to store the first moment (the running mean) of every parameter’s gradient, so it has the same length as the number of parameters we need to optimize. Each value is initialized to zero. When these values are updated, β1 controls how much of the previously accumulated estimate is retained versus how much of the current gradient is blended in.
Line 6 creates a similar vector, also initialized to zero, but it stores the second moment for each parameter as the algorithm progresses. The second moment is the running average of the squared gradient, which reflects the (squared) magnitude of the gradient over time and helps assess how stable or noisy the gradient signal is. Just as β1 scales the first moment, β2 scales the second moment at each update.
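As a rough picture of this setup in code (the variable names here are illustrative, not from the paper), lines 4-7 amount to allocating zero tensors that mirror each parameter’s shape:

```python
import torch

# Illustrative setup matching lines 4-7: one zero-initialized moment tensor
# per parameter, with the same shape as that parameter.
params = [torch.randn(3, 4), torch.randn(4)]   # θ₀: initial parameters
m = [torch.zeros_like(p) for p in params]      # first moment vectors (line 5)
v = [torch.zeros_like(p) for p in params]      # second moment vectors (line 6)
t = 0                                          # timestep (line 7)
```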
Training loop
Lines 7-9 initialize the timestep t and start the training loop, which runs until the parameters converge.
Line 10 computes the gradients of the loss function with respect to the model parameters θ from the previous timestep, t-1. These gradients are stored in the vector g.
Moment updates
Line 11 updates our vector of first moments. Since it was initialized to zero, the first update simply adds in (1-β1) of each parameter’s current gradient.
e.g. Suppose t=1, β1=0.9, and the gradient of the current parameter is 1.5.
We keep 90% (when β1=0.9) of the previous moment estimate, which is still zero, and blend in 10% (1-β1=0.1) of the current gradient (1.5), so 0.15 is stored in our vector for this parameter.
On the next loop, t=2 and the gradient of the current parameter is now 1.4. We update the first moment with 90% of its current value (0.15 from the last update), which gives 0.135, and add 10% of the current gradient (0.1 × 1.4 = 0.14). The result is 0.135 + 0.14 = 0.275.
This process of blending in previous gradients, rather than taking only the current gradient, builds an average gradient value and signals whether we are on a consistent gradient or a fluctuating one that might require us to back off from large updates.
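Here is a tiny, self-contained Python sketch of that update, reproducing the numbers from the example above:

```python
beta1 = 0.9
m = 0.0                     # first moment, initialized to zero (line 5)

for g in [1.5, 1.4]:        # the gradients from the example at t=1 and t=2
    m = beta1 * m + (1 - beta1) * g   # line 11 of the pseudocode
    print(m)
# t=1: 0.9*0.0  + 0.1*1.5 = 0.15
# t=2: 0.9*0.15 + 0.1*1.4 = 0.275
```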
Line 12 uses a similar mechanism to line 11, but it stores the second moment: the exponential moving average of the squared gradients. Since the gradients are squared, the values are always non-negative, so the second moment keeps accumulating regardless of the gradient’s direction, acting as a penalty term. The first moment does not detect oscillating gradients well: as they flip around a local minimum, the negative and positive gradients cancel each other out. The second moment, in contrast, keeps growing while the gradient oscillates around a local minimum.
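To see the difference between the two moments, here is a small illustrative sketch; the alternating ±2 gradient is an assumption chosen only to mimic oscillation around a minimum, and the bias correction on the last lines is explained next:

```python
beta1, beta2 = 0.9, 0.999
m = v = 0.0

for t in range(1, 11):
    g = 2.0 if t % 2 else -2.0            # gradient flips sign every step
    m = beta1 * m + (1 - beta1) * g       # first moment: signs largely cancel
    v = beta2 * v + (1 - beta2) * g ** 2  # second moment: always accumulates g^2

m_hat = m / (1 - beta1 ** 10)             # bias-corrected estimates (lines 13-14)
v_hat = v / (1 - beta2 ** 10)
print(m_hat, v_hat ** 0.5)                # ≈ -0.11 vs 2.0: small signal, large noise
```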
Lines 13-14 are bias correction steps for the first and second moments. Since we initialize both moment vectors to zero, the early estimates are biased toward zero, even if the true gradients are not.
To correct for this, we divide by a factor that compensates for how little history we've accumulated so far. In the first few steps this correction has a large effect; later on it fades away as the moment estimates become more accurate on their own.
For example, at time step t = 1, the correction for the first moment is:
$$\frac{m_t}{1-\beta_1^1}$$
And at t = 10:
$$\frac{m_t}{1-\beta_1^{10}}$$
Since β1 is a number less than 1 (e.g., 0.9), raising it to higher powers brings it closer to 0, so the denominator grows closer to 1 over time. That means in early steps we divide by a small number (amplifying the estimate), and later we divide by something close to 1 (leaving it mostly unchanged).
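As a quick check on those numbers, the correction factor 1/(1-β1^t) shrinks toward 1 as t grows:

```python
beta1 = 0.9
for t in [1, 2, 10, 100]:
    print(t, 1 / (1 - beta1 ** t))
# t=1   -> 10.0   (strongly amplifies the near-zero early estimate)
# t=2   -> ~5.26
# t=10  -> ~1.54
# t=100 -> ~1.00  (the correction has essentially faded away)
```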
Line 15 is where we update our parameters. As in SGD, we adjust by a learning rate (or step size, as it is called here), but with Adam the step is scaled by the first and second moments. We can think of the first moment as our signal and the second moment as our noise. When the signal-to-noise ratio is high, we are confident and can take a large step.
e.g. If our first moment = 5, second moment = 25
$$\frac{5}{\sqrt{25}}=1$$
This results in 1, which, when multiplied by our step size, returns the full step size, so we take a full-size parameter update.
If our second moment is high compared to the first moment, this signals that we are not confident in the average gradient reported by the first moment. This could happen for a few reasons, such as being on an oscillating surface: if we frequently flip between negative and positive gradients, the second moment captures all of these as positive values and keeps building up, rather than negative values cancelling out previous positive values as in the first moment.
e.g. If our first moment = 5, second moment = 100
$$\frac{5}{\sqrt{100}}=0.5$$
This reduces our step size by half, signalling a lack of confidence, so we take smaller steps.
ε is a small constant added to prevent division by zero.
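Putting line 15 together with the two examples above, here is a small sketch of the scaled update (α = 0.001 is just an assumed step size):

```python
import math

alpha, eps = 0.001, 1e-8

for m_hat, v_hat in [(5.0, 25.0), (5.0, 100.0)]:
    step = alpha * m_hat / (math.sqrt(v_hat) + eps)   # line 15 of the pseudocode
    print(step)
# (5, 25)  -> ≈ alpha * 1.0: high signal-to-noise, take the full step size
# (5, 100) -> ≈ alpha * 0.5: lower confidence, take half the step size
```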
Algorithm Overview
Zooming out from the pseudocode, we can see how the various steps help to address the problems listed earlier. First, we track the first and second moments for every parameter we want to train. Whilst this uses more memory than a single global learning rate, it should pay off through faster convergence.
Because the first moment is an average of gradients, we become resistant to sudden gradient spikes, which smooths out convergence. On a stable downward path toward a local minimum, the gradients are large at first, so the first moment retains a high value; setting the second moment aside for a moment, this large first moment does not scale the step size back much, signalling that we are confident and can take these large steps. As we get closer to the local minimum, the first moment converges towards zero, which scales back our step size and prevents overshooting the minimum and oscillating around it.
However, not all regions of the loss surface are smooth, and in more chaotic areas the second moment plays a larger role in stabilizing updates. If the parameter repeatedly enters small troughs, the gradient flips between negative and positive values. With only the first moment we have smoothed gradients, but these swings can still shift the estimate and scale the step size erratically. With the second moment, every gradient increases the value regardless of sign, because squaring always yields a non-negative result. As the second moment grows, dividing by this larger value gives a smaller scaling factor, and the algorithm becomes more conservative due to low confidence.
Ultimately, Adam determines how much to update a parameter by looking at the ratio between the first moment (our directional signal) and the square root of the second moment (our measure of noise or instability). Adam adapts its learning rate dynamically, growing cautious in noisy or uncertain regions and moving decisively when gradients are stable.
Here’s how Adam addresses the challenges introduced earlier:
One global learning rate
Adam maintains per-parameter learning rates using first and second moment estimates, allowing different update magnitudes for each parameter.
Fixed learning rate over time
Moment estimates adapt over time, allowing the step sizes to shrink or grow as gradients evolve.
Manual learning rate tuning
Step sizes are adjusted automatically, often reducing the need for manual learning rate schedules or tuning.
Small batches are noisy
Exponential moving averages smooth gradient estimates, helping Adam stay stable even with noisy mini-batch gradients.
Sparse gradients
Like Adagrad, Adam reduces updates for frequently active parameters, while allowing relatively larger updates for rarely used ones, making it well-suited for sparse gradients.
Flat regions and momentum
The first moment acts like momentum, helping push through flat or ambiguous regions of the loss surface.
Summary
Optimizers preceding Adam used parts of the concepts it brings together. For example, SGD was extended with momentum, which averages gradients over time. This helps reduce oscillation and allows the optimizer to follow a more stable path, which is conceptually similar to Adam’s first moment estimate.
Adagrad introduced per-parameter learning rates by accumulating the square of past gradients. This allows large updates for infrequently updated parameters and smaller updates for frequently updated ones. However, because the accumulation grows without decay, the learning rate can become excessively small later in training.
RMSProp improved on this by using an exponential moving average of squared gradients instead of a cumulative sum. This enabled the learning rate to adapt more flexibly to recent gradient behavior rather than shrinking predictably over time.
Adam combines these ideas. It uses momentum-like first moment estimates and RMSProp-style second moment estimates to scale the step size based on both direction and the reliability of the gradients. Adam also introduces bias correction, which improves early training by compensating for the initial zero values in the moving averages.
Adam is widely used because of its robustness and adaptability. It often converges faster than optimizers like SGD. However, it does not always generalize as well. Because Adam closely follows the gradient signal for each parameter, especially in models with many parameters, it can overfit or settle into sharp minima. In contrast, SGD with momentum tends to average out gradient noise more effectively, which helps it find flatter minima that often lead to better generalization.
Below is a PyTorch implementation of Adam's core logic.
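The version here is a minimal sketch that follows the pseudocode above line by line; it is not the full torch.optim.Adam (no weight decay, AMSGrad, or fused kernels), and the function and variable names are my own.

```python
import torch

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place, following lines 9-15 of the pseudocode."""
    t += 1                                                 # line 9
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i.mul_(beta1).add_(g, alpha=1 - beta1)           # line 11: biased 1st moment
        v_i.mul_(beta2).addcmul_(g, g, value=1 - beta2)    # line 12: biased 2nd moment
        m_hat = m_i / (1 - beta1 ** t)                     # line 13: bias correction
        v_hat = v_i / (1 - beta2 ** t)                     # line 14: bias correction
        p.sub_(lr * m_hat / (v_hat.sqrt() + eps))          # line 15: parameter update
    return t

# Example: minimize f(θ) = (θ - 3)^2 for a single parameter.
theta = torch.tensor([0.0])
m, v, t = [torch.zeros_like(theta)], [torch.zeros_like(theta)], 0
for _ in range(3000):
    grad = 2 * (theta - 3.0)             # analytic gradient of the loss
    t = adam_step([theta], [grad], m, v, t, lr=0.01)
print(theta)                             # approaches tensor([3.0])
```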