Understanding Gradients, Optimization Techniques, and Loss Functions in Neural Networks

Ashok Vanga

What is a Gradient?

A gradient measures how much a small change in a parameter (like a weight) affects the model's output. In neural networks, gradients guide how to adjust weights to minimize the error (loss).

Intuition: Imagine standing on a hill—gradients tell you which direction to walk to reach the bottom (minimum loss).

Mathematics Behind Gradients (Chain Rule in Backpropagation)

In backpropagation, gradients are calculated using the chain rule from calculus: the derivative of the loss with respect to each weight is the product of the derivatives along the path from that weight to the output (for example, dL/dw = dL/dŷ · dŷ/dw).
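
Below is a minimal sketch of this idea using TensorFlow's GradientTape; the single weight, input, and target are made-up numbers for illustration, not part of a real model.

import tensorflow as tf

w = tf.Variable(2.0)        # a single trainable weight
x = tf.constant(3.0)        # one input sample
y_true = tf.constant(10.0)  # the target value

with tf.GradientTape() as tape:
    y_pred = w * x                   # forward pass
    loss = (y_true - y_pred) ** 2    # squared-error loss

# Chain rule: dL/dw = dL/dy_pred * dy_pred/dw = 2 * (y_pred - y_true) * x
grad = tape.gradient(loss, w)
print(grad.numpy())  # -24.0, so gradient descent would increase w to reduce the loss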

How to Set the Learning Rate

  1. Manual Setting: Choose a learning rate through experimentation (e.g., 0.001, 0.01, 0.1).

  2. Adaptive Optimizers: Use advanced optimizers like Adam, RMSprop, or Adagrad which adjust the learning rate during training.

  3. Learning Rate Schedulers: Gradually decrease the learning rate during training for better convergence (a scheduler sketch follows the code below).

Effects of Different Learning Rates

| Learning Rate | Effect |
| --- | --- |
| Too Small | Slow convergence, longer training time |
| Too Large | Overshooting the minimum, unstable training |
| Optimal | Balanced speed and stability for best results |

# Keras: set the learning rate when compiling the model
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy')

# PyTorch: set the learning rate when constructing the optimizer
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01)
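
A minimal sketch of a learning rate schedule in Keras, assuming the same model object as above; the decay values here are illustrative, not a recommendation:

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import ExponentialDecay

# Multiply the learning rate by 0.96 every 1,000 optimizer steps
lr_schedule = ExponentialDecay(initial_learning_rate=0.001, decay_steps=1000, decay_rate=0.96)
model.compile(optimizer=Adam(learning_rate=lr_schedule), loss='binary_crossentropy')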

Gradient Descent Optimization Techniques

Stochastic Gradient Descent (SGD)

  • Updates weights after each training sample (in its pure form).

  • Faster per update but noisier; works well for large datasets.

from tensorflow.keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=0.01), loss='binary_crossentropy', metrics=['accuracy'])
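
In Keras, the per-sample behaviour comes from the batch size passed to model.fit rather than from the optimizer itself; a quick sketch, assuming X_train and y_train are already defined:

model.fit(X_train, y_train, epochs=100, batch_size=1)  # batch_size=1 gives per-sample (stochastic) updates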

Mini-Batch Gradient Descent

  • Updates weights after a small batch of samples (e.g., 32 or 64).

  • Balanced approach—faster convergence with less noise.

model.fit(X_train, y_train, epochs=100, batch_size=32)

Adam Optimizer (Adaptive Moment Estimation)

  • Combines momentum and adaptive learning rates.

  • Most popular—faster convergence and less hyperparameter tuning.

How it works:

  1. Momentum: Keeps track of past gradients for smoother updates.

  2. Adaptive Learning Rate: Adjusts learning rates for each parameter.
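
For intuition, here is a minimal sketch of a single Adam update on one scalar parameter; it is illustrative only, not the vectorized implementation used by Keras or PyTorch.

import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # momentum: moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return param, m, v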

from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

RMSprop (Root Mean Square Propagation)

  • Adapts learning rates for each parameter using the moving average of squared gradients.

  • Works well for recurrent neural networks (RNNs).

from tensorflow.keras.optimizers import RMSprop

model.compile(optimizer=RMSprop(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])
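
As with Adam above, here is a minimal scalar sketch of the RMSprop update, illustrative only rather than the library implementation:

import numpy as np

def rmsprop_step(param, grad, v, lr=0.001, rho=0.9, eps=1e-7):
    v = rho * v + (1 - rho) * grad ** 2      # moving average of squared gradients
    param -= lr * grad / (np.sqrt(v) + eps)  # larger recent gradients lead to smaller steps
    return param, v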

Which Optimizer Should You Choose?

| Optimizer | Best Use Case | Pros | Cons |
| --- | --- | --- | --- |
| SGD | Large datasets, simple models | Efficient for large-scale problems | Slow convergence, noisy |
| Mini-Batch SGD | General-purpose, balanced datasets | Faster convergence, less noise | Requires tuning batch size |
| Adam | Most tasks (CNNs, Transformers, etc.) | Adaptive, works well out-of-the-box | May overfit in some cases |
| RMSprop | Recurrent networks (RNN, LSTM) | Handles non-stationary data well | Sensitive to learning rate |

To recap:

  • Forward Propagation: Predicts outputs by passing inputs through layers.

  • Backward Propagation: Calculates gradients and updates weights.

  • Gradient Descent Variants:

    • SGD: Simple, noisy but works.

    • Adam: Best for most models.

    • RMSprop: Good for recurrent models.

Understanding Loss Functions

What is a Loss Function?

A loss function measures how well a neural network’s predictions match the actual labels. The goal during training is to minimize the loss using gradient descent.

  • Low loss → Good model predictions

  • High loss → Poor model predictions

Types of Loss Functions

For Classification Problems (Discrete Outputs)

  1. Binary Cross-Entropy

  2. Categorical Cross-Entropy

For Regression Problems (Continuous Outputs)

  1. Mean Squared Error (MSE)

  2. Mean Absolute Error (MAE)
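
A minimal sketch of how these losses behave on toy values in Keras; the numbers are made up for illustration:

import tensorflow as tf
from tensorflow.keras import losses

y_true = tf.constant([1.0, 0.0, 1.0])
y_pred = tf.constant([0.9, 0.2, 0.7])

print(float(losses.BinaryCrossentropy()(y_true, y_pred)))  # classification loss on probabilities
print(float(losses.MeanSquaredError()(y_true, y_pred)))    # regression loss (MSE)
print(float(losses.MeanAbsoluteError()(y_true, y_pred)))   # regression loss (MAE)
# CategoricalCrossentropy would expect one-hot labels over multiple classes instead.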

Practical Example:

In a neural network, the goal is to make accurate predictions. However, when the network is first trained, it makes errors. To measure how far off these predictions are from the true values, we use a loss function. Minimizing the loss helps the model improve and make better predictions.

Task

The objective is to find the optimal weights (W) that minimize the loss function, meaning we want the model to be as accurate as possible.

In practice, during model training:

  1. Forward Propagation: The input data flows through the network to produce a prediction.

  2. Loss Calculation: The loss function computes the difference between predicted output and actual labels.

  3. Backward Propagation: The model adjusts the weights using optimization algorithms (like Gradient Descent) to reduce the loss iteratively.
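
A minimal end-to-end sketch tying these three steps together in Keras; the toy data, layer sizes, and epoch count are illustrative assumptions rather than a recipe:

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Toy binary classification data
X_train = np.random.rand(200, 10).astype('float32')
y_train = (X_train.sum(axis=1) > 5).astype('float32')

model = Sequential([
    Dense(16, activation='relu', input_shape=(10,)),
    Dense(1, activation='sigmoid'),
])

# The loss measures the error; the optimizer updates the weights to reduce it
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Each epoch runs forward propagation, loss calculation, and backpropagation
model.fit(X_train, y_train, epochs=10, batch_size=32)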


Result

By continuously updating the weights to minimize the loss function:

  1. The model becomes more accurate.

  2. Lower loss means better predictions.

  3. This optimization helps the network generalize well to new data.


Key Takeaway: The loss function is a mathematical measure of model error. Minimizing it through weight optimization improves the model's performance.

