Logits and Likelihoods


At the heart of modern neural networks capable of understanding sentence semantics and generating thousands of words per second, Transformers, as we call them, lie two core mathematical operations that often go unnoticed: Softmax and Negative Log-Likelihood. While the overall Transformer architecture and attention mechanism get most of the spotlight, it's these foundational functions that translate raw model outputs, which can't be interpreted as probabilities on their own, into structured probability distributions and that guide the model's learning through gradient-based optimization.

This blog aims to demystify the roles of Softmax and NLL, tracing how they work, why they matter, and how they affect the behavior of models, specifically autoregressive models.


The Softmax activation function

Deriving Softmax

Softmax, put simply, takes the raw outputs of a neural network, often called logits, and converts them into a probability distribution. This means each value is transformed into a probability between 0 and 1, and all the resulting probabilities add up to 1. In that sense, Softmax creates a normalized distribution over the possible output classes.

But here's the catch: logits can be positive, negative, or close to zero. So how do we convert these arbitrary values into something non-negative and meaningful as a probability? The answer lies in the exponential function. No matter what number you input, negative, zero, or positive, the exponential function always outputs a positive value.

Now that we’ve used the exponential function to make all the logits positive, the next step is to normalize them.

In the formula for Softmax, the denominator is a summation term that adds up the exponentiated values across all classes. This step is what turns a bunch of unbounded numbers into a proper probability distribution. Hence, it is safe to think of the denominator as a normalization constant: it rescales the scores so that even the largest exponentiated value doesn't simply dominate, and every class gets its fair share in the final distribution.

If you plot the exponentiated logits and then the normalized values, the difference shows up in the y-coordinates: the y-coordinates on the normalized plot add up to 1, while those on the raw exponential plot do not.

Now, we can construct the Softmax function from the above points

$$\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{k} e^{x_j}}$$

where x_i is the logit at index i, x_j is the logit at index j, and k is the number of classes.
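To make this concrete, here is a minimal NumPy sketch of the formula above. The max-subtraction step is an extra detail added for numerical stability; it doesn't change the result, because Softmax is invariant to adding the same constant to every logit.

```python
import numpy as np

def softmax(logits):
    # Subtract the largest logit before exponentiating; Softmax is
    # unchanged by this shift, but it prevents overflow for large logits.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, -1.0, 0.5])
probs = softmax(logits)
print(probs)        # roughly [0.79, 0.04, 0.18]
print(probs.sum())  # 1.0
```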


Temperature

Since Softmax only cares about the relative differences between logits, we can rescale the logits before applying it, typically by dividing every logit by a scaling factor called the temperature. This scaling changes the model's behaviour by controlling how deterministic or random its outputs are.

Mathematically, the Softmax with temperature T is written as:

$$\text{Softmax}(x_i; T) = \frac{e^{x_i / T}}{\sum_{j=1}^{k} e^{x_j / T}}$$

When T<1, the logits are divided by a small number, making the exponentials more extreme. This sharpens the probability distribution, making the model more confident and deterministic.

When T>1 the logits are divided by a larger number, flattening the distribution. The model becomes less confident and more random in its choices.

This technique is commonly used in generative models like LLMs, where temperature helps balance creativity and accuracy during generation.
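As a quick illustration, reusing the earlier sketch with an assumed toy logit vector, dividing the logits by T before exponentiating sharpens or flattens the resulting distribution:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    scaled = logits / T
    scaled = scaled - np.max(scaled)   # numerical stability
    exps = np.exp(scaled)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])
print(softmax_with_temperature(logits, T=0.5))  # sharper, close to one-hot
print(softmax_with_temperature(logits, T=1.0))  # standard Softmax
print(softmax_with_temperature(logits, T=2.0))  # flatter, more random sampling
```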


Why Softmax?

You might wonder: why not just use the raw logits directly? Why bother with Softmax at all?

The reason is tied to what a classification task actually requires. To decide which class the model believes is the most likely, we need a way to compare the relative importance of each output. Raw logits don't give us a clear sense of how confident the model is. Softmax helps by scaling the values in a way that reflects their importance, turning them into probabilities that are easy to interpret and compare.

In a Transformer, the self-attention mechanism needs a probability distribution that says, at each position, how much importance or attention should be paid to each of the input words. Hence, Softmax is applied to the raw attention scores obtained from the dot product of the Query and Key vectors in the self-attention mechanism.
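For intuition, here is a rough sketch of where Softmax sits inside scaled dot-product attention. The 1/√d_k scaling and the shapes follow the standard formulation; the random Q, K, V matrices below are just placeholders.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # raw attention scores
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise Softmax
    return weights @ V                                     # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 positions, dimension 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```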


Negative Log-Likelihood Loss

In classification or token-prediction tasks, we pair the Softmax output with a loss function, Negative Log-Likelihood, that penalizes incorrect probabilities: it pushes up the probability of the true class while pushing down the others. This amounts to maximizing the likelihood of the model producing the correct output.

Deriving Negative Log-Likelihood Loss

Likelihood refers to how probable the observed data is under a particular set of model parameters.

We can define Likelihood l(θ) as

$$l(\theta) = p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^n p(x_i \mid \theta).$$

where x1, x2, … , xn are individual observations and n is the number of independent observations.

To maintain efficiency in computation and ease of differentiation, we take the logarithm of l(θ)

$$\log l(\theta) = \sum_{i=1}^n \log p(x_i \mid \theta)$$

Now, our goal is to maximize this log-likelihood for each observation.

It is safe to say that our goal of maximizing log-likelihood is equivalent to minimizing the loss.

Why? Because maximizing a function f(θ) is the same as minimizing its negation, -f(θ), or more generally any monotonically decreasing transformation of it.

Hence, we can define our Loss, L(θ) as

$$L(\theta) = -\log l(\theta) = -\sum_{i=1}^n \log p(x_i \mid \theta)$$

Now, we require the θ that makes this loss, L(θ), as small as possible

$$\hat{\theta} = \arg \min_{\theta} L(\theta)$$

Because L(θ) = -log l(θ) and the logarithm is monotonically increasing, flipping the sign swaps max for min

$$\arg \max_{\theta} l(\theta) = \arg \max_{\theta} \log l(\theta) = \arg \min_\theta[-\log l(\theta)] = \arg \min_\theta L(\theta)$$

Hence, finding the θ that maximizes our Likelihood, or Log-Likelihood, is exactly the same as finding the θ that minimizes NLL loss.

Therefore, the final expression for Negative Log-Likelihood Loss is

$$L(\theta) = - \sum_{i=1}^{n} \log p(\text{label}_i \mid \text{input}_i, \theta)$$
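To see the equivalence in action, here is a tiny example with a Bernoulli model (coin flips). The data and the grid search over θ are assumptions for illustration; the θ that minimizes the NLL should match the closed-form maximum-likelihood estimate, which is simply the sample mean.

```python
import numpy as np

# Assumed toy dataset of coin flips (1 = heads).
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

def nll(theta, x):
    # Negative log-likelihood of i.i.d. Bernoulli observations.
    return -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

thetas = np.linspace(0.01, 0.99, 99)
losses = np.array([nll(t, x) for t in thetas])
theta_hat = thetas[np.argmin(losses)]

print(theta_hat)  # approximately 0.75
print(x.mean())   # 0.75, the closed-form maximum-likelihood estimate
```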

For a Softmax setting where true labels are one-hot encoded vectors, we can derive NLL loss as follows

Let the true class label be y and the logit vector be z; then the probability the model assigns to the correct class is

$$p_y = \frac{e^{z_y}}{\sum_je^{z_j}}$$

$$ \log p_y = z_y - \log\sum_je^{z_j}$$

Hence, our loss L(θ) for a Softmax setting will be

$$L(z, y) = -z_y + \log\sum_je^{z_j}$$

Minimizing L(z, y), the NLL loss, simultaneously raises z_y and lowers the other logits, so as to maximize the likelihood of the desired output.
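In code, that last expression gives a numerically stable way to compute the loss directly from the logits via the log-sum-exp trick. The toy logits and label below are assumptions for illustration; the result equals -log of the Softmax probability of the true class.

```python
import numpy as np

def nll_from_logits(z, y):
    # L(z, y) = -z_y + log(sum_j exp(z_j)), computed stably via log-sum-exp.
    m = np.max(z)
    log_sum_exp = m + np.log(np.sum(np.exp(z - m)))
    return -z[y] + log_sum_exp

z = np.array([2.0, -1.0, 0.5])   # logits
y = 0                            # index of the true class
print(nll_from_logits(z, y))     # about 0.24, equal to -log(softmax(z)[0])
```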


Binding Softmax and NLL together

Derivative of NLL with respect to Logit(s)

We know that

$$L = -\sum_{i=1}^{C}y_i\log p_i$$

Therefore

$$\frac{\partial L}{\partial p_i}=-\frac{y_i}{p_i}$$

Our end goal is to find how much the loss changes when the logits change by an infinitesimal amount, so we apply the chain rule as follows

$$\frac{\partial L}{\partial z_j}=\sum^C_{i=1}\frac{\partial L}{\partial p_i}\frac{\partial p_i}{\partial z_j}$$

From the Softmax Jacobian, ∂p_i/∂z_j = p_i(δ_ij - p_j), we substitute to get

$$= \sum^C_{i=1}(-\frac{y_i}{p_i})p_i(\delta_{ij}-p_j)$$

$$= -\sum_iy_i\delta_{ij} + \sum_iy_ip_j$$

$$= -y_j + p_j\sum_{i}y_i$$

$$ = p_j - y_j$$

where i and j are class indices from 1 to C (the total number of classes), z_j is the logit for class j, p_j is the Softmax probability of class j, and y is the true label under one-hot encoding.

When true class j = y, y_j = y_y = 1

$$\frac{\partial L}{\partial z_j} = -1 +p_y$$

Since p_y < 1, this gradient is negative, so gradient descent will increase the logit z_y, pushing up the predicted probability of the true class.

When other classes j ≠ y, y_j = 0

$$\frac{\partial L}{\partial z_j}=p_j -0 =p_j$$

Since p_j > 0, these gradients are positive, so gradient descent will decrease non-true logits z_j, pushing down the predicted probability for non-true classes.
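As a sanity check on the result ∂L/∂z_j = p_j - y_j, here is a small finite-difference comparison on assumed toy logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nll(z, y):
    return -np.log(softmax(z)[y])

z = np.array([2.0, -1.0, 0.5])
y = 0

# Analytic gradient: Softmax probabilities minus the one-hot true label.
analytic = softmax(z) - np.eye(len(z))[y]

# Central finite differences, perturbing one logit at a time.
eps = 1e-6
numeric = np.array([
    (nll(z + eps * np.eye(len(z))[j], y) - nll(z - eps * np.eye(len(z))[j], y)) / (2 * eps)
    for j in range(len(z))
])

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```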


Conclusion

Together, Softmax and Negative Log-Likelihood form the mathematical backbone of almost every modern classifier and autoregressive model. By understanding not just what Softmax + NLL does, but why it works, from the probabilistic foundations to the gradient derivation, you’ll be equipped to diagnose training issues, innovate new loss formulations, and appreciate the statistical guarantees underpinning your models.

Happy modeling!
