Softmax Activation Function

Kartavay
3 min read

This is a cool, gentle intro to the Softmax Activation function.

Softmax in machine learning is used for multi-class classification. Before it, we used logistic regression to classify an example into one of two classes.

But logistic regression is not enough on its own, because we can have more than two classes, and an example can belong to any of them. With logistic regression we had only two choices; here we have more than just 2.

The High-Level View and the Depth of Softmax

You might already know the high-level view of how an input vector passes through the layers of a neural network: we pass in the input vector, each layer's weight matrix transforms it into some other vector, and that vector is then passed through an activation function, which modifies the transformed values.
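
To make this concrete, here is a minimal sketch of a single layer's forward pass in NumPy (the shapes and values are made-up illustrations, not from the article):

```python
import numpy as np

# Toy layer: 4 input features -> 3 units (made-up sizes).
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))  # the layer's weight matrix
b = rng.standard_normal(3)       # the layer's bias vector
x = rng.standard_normal(4)       # the input vector

z = W @ x + b          # transformed by the weight matrix into some other vector
a = np.maximum(0, z)   # an activation function (ReLU here) modifies that vector
```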

Before this, we had some simple activation functions like linear activation, sigmoid, or ReLU, but now we are dealing with this new activation function.

It is generally used in the last layer of the neural network, which has as many units as the number of classes you want to classify your input into.
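
In a framework, this just means giving the last layer one unit per class. Here is a sketch in Keras (the article doesn't name a framework, so this is only one common choice, and the layer sizes are assumptions):

```python
from tensorflow import keras

# Hypothetical 10-class classifier, e.g. digits 0-9.
model = keras.Sequential([
    keras.Input(shape=(784,)),                     # e.g. a flattened 28x28 image
    keras.layers.Dense(64, activation="relu"),     # hidden layer
    keras.layers.Dense(10, activation="softmax"),  # last layer: one unit per class
])
```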

Don’t get confused by the math here.

Focus on the neural network. Look at its last layer: it has units (denoted by circles) equal to the number of output classes. In this image, it is 10.

So, when the input vector, transformed through the different layers of the NN, reaches the last layer and is turned into a new vector, that new vector has as many elements as there are units in the output layer.

Now that vector is put into the softmax activation function, and you get another vector with transformed values.

This vector is the output of the neural network.

Let’s see how this vector gets its elements.

Focus on the last layer of the net, where you get the transformed vector that has not yet been passed through the activation function.

To calculate the activation values, we use this formula.

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

This is simple: the transformed output vector has different values z(i), where i runs from 1 to n, the length of the column vector.

Here, softmax calculates a(i) for each z(i) using the formula given above.
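
A small NumPy version of that formula may help (a sketch; subtracting max(z) before exponentiating is a standard numerical-stability trick and cancels out in the ratio):

```python
import numpy as np

def softmax(z):
    # Shift by the max so np.exp never overflows; the shift cancels in the ratio.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])  # made-up raw outputs of the last layer
a = softmax(z)
print(a)        # ~[0.659, 0.242, 0.099]
print(a.sum())  # 1.0 -- the a(i) values form a probability distribution
```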

The cool thing about this formula is that it gives the probability of the example belonging to some class. For example, if you are getting a(1) = 0.6, then it means that P(y=1 | x) = 0.6. To elaborate further, it means that the probability of the example belonging to class 1 is 0.6, which is quite high.

Now, when we get the final predictions, we put them through the loss function, which calculates the error in our predictions for that example. All these errors are aggregated by the cost function, and the goal of our optimization algorithm (like gradient descent) is to minimize these errors.

The optimization algorithm tweaks the weight and bias matrices of our neural net to minimize this error, making our predictions more accurate.
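
For softmax outputs, the loss usually paired with them is categorical cross-entropy: minus the log of the probability assigned to the true class. A minimal sketch with made-up numbers:

```python
import numpy as np

def cross_entropy(a, y):
    # a: softmax output vector, y: index of the true class.
    return -np.log(a[y])

a = np.array([0.659, 0.242, 0.099])  # softmax output from the earlier example
print(cross_entropy(a, y=0))  # ~0.417: low loss, the true class got 0.659
print(cross_entropy(a, y=2))  # ~2.31: high loss, the true class got only 0.099
```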

In Short

The softmax function is like any other activation function. In some sense, it's just logistic regression extended to more than two classes.

The mathematical aspect of this function might seem a bit hard at the beginning, but you will understand it over time.

This was it for today.
