Understanding Gradients, Optimization Techniques, and Loss Functions in Neural Networks

What is a Gradient?
A gradient measures how much a small change in a parameter (like a weight) affects the model's output. In neural networks, gradients guide how to adjust weights to minimize the error (loss).
Intuition: Imagine standing on a hill—gradients tell you which direction to walk to reach the bottom (minimum loss).
Mathematics Behind Gradients (Chain Rule in Backpropagation)
In backpropagation, gradients are computed with the chain rule from calculus, which expresses how the loss changes with respect to each weight by multiplying the local derivatives along the path from the weight to the output.
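For a concrete, hypothetical one-weight example, PyTorch's autograd applies the chain rule for us; the numbers below are only for illustration:

```python
import torch

# Toy setup: a single weight w, one input x, and a squared-error loss.
# loss = (w * x - y)^2, so by the chain rule d(loss)/dw = 2 * (w * x - y) * x
w = torch.tensor(2.0, requires_grad=True)
x, y = torch.tensor(3.0), torch.tensor(7.0)

loss = (w * x - y) ** 2
loss.backward()   # autograd applies the chain rule for us

print(w.grad)     # tensor(-6.), i.e. 2 * (2*3 - 7) * 3 computed by hand
```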
How to Set the Learning Rate
Manual Setting: Choose a learning rate through experimentation (e.g., 0.001, 0.01, 0.1).
Adaptive Optimizers: Use advanced optimizers like Adam, RMSprop, or Adagrad which adjust the learning rate during training.
Learning Rate Schedulers: Gradually decrease the learning rate during training for better convergence.
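As a quick sketch of the scheduler idea in Keras (the decay numbers here are placeholders, not recommendations):

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import ExponentialDecay

# Start at 0.01 and multiply the learning rate by 0.96 every 1,000 training steps
lr_schedule = ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=1000,
    decay_rate=0.96,
)
optimizer = Adam(learning_rate=lr_schedule)
```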
Effects of Different Learning Rates
| Learning Rate | Effect |
| --- | --- |
| Too Small | Slow convergence, longer training time |
| Too Large | Overshooting the minimum, unstable training |
| Optimal | Balanced speed and stability for best results |
```python
# Keras: the learning rate is set when the optimizer is created
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy')
```

```python
# PyTorch: the same idea, passed as lr when the optimizer is created
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01)
```
Gradient Descent Optimization Techniques
Stochastic Gradient Descent (SGD)
Updates weights after each sample.
Faster but noisier; works well for large datasets.
```python
from tensorflow.keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=0.01), loss='binary_crossentropy', metrics=['accuracy'])
```
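Note that in Keras the "update after each sample" behaviour comes from the batch size passed to fit, not from the optimizer itself; a minimal sketch:

```python
# batch_size=1 means the weights are updated after every single sample (true SGD)
model.fit(X_train, y_train, epochs=10, batch_size=1)
```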
Mini-Batch Gradient Descent
Updates weights after a small batch of samples (e.g., 32 or 64).
Balanced approach—faster convergence with less noise.
```python
# Weights are updated once per batch of 32 samples
model.fit(X_train, y_train, epochs=100, batch_size=32)
```
Adam Optimizer (Adaptive Moment Estimation)
Combines momentum and adaptive learning rates.
Most popular—faster convergence and less hyperparameter tuning.
How it works:
Momentum: Keeps track of past gradients for smoother updates.
Adaptive Learning Rate: Adjusts learning rates for each parameter.
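To make that concrete, here is an illustrative single-parameter version of the Adam update (a sketch of the idea, not the Keras or PyTorch internals):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # momentum: running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)     # per-parameter adaptive step
    return w, m, v
```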
```python
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])
```
RMSprop (Root Mean Square Propagation)
Adapts learning rates for each parameter using the moving average of squared gradients.
Works well for recurrent neural networks (RNNs).
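In the same illustrative style, a single-parameter RMSprop update looks roughly like this (a sketch, not the library implementation):

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-8):
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2   # moving average of squared gradients
    w -= lr * grad / (np.sqrt(avg_sq) + eps)        # larger recent gradients -> smaller step
    return w, avg_sq
```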
```python
from tensorflow.keras.optimizers import RMSprop

model.compile(optimizer=RMSprop(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])
```
Which Optimizer Should You Choose?
| Optimizer | Best Use Case | Pros | Cons |
| --- | --- | --- | --- |
| SGD | Large datasets, simple models | Efficient for large-scale problems | Slow convergence, noisy |
| Mini-Batch SGD | General-purpose, balanced datasets | Faster convergence, less noise | Requires tuning batch size |
| Adam | Most tasks (CNNs, Transformers, etc.) | Adaptive, works well out of the box | May overfit in some cases |
| RMSprop | Recurrent networks (RNN, LSTM) | Handles non-stationary data well | Sensitive to learning rate |
Forward Propagation: Predicts outputs by passing inputs through layers.
Backward Propagation: Calculates gradients and updates weights.
Gradient Descent Variants:
SGD: Simple, noisy but works.
Adam: Best for most models.
RMSprop: Good for recurrent models.
Understanding Loss Functions
What is a Loss Function?
A loss function measures how well a neural network’s predictions match the actual labels. The goal during training is to minimize the loss using gradient descent.
Low loss → Good model predictions
High loss → Poor model predictions
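A toy example makes this concrete; the prediction values below are made up purely to show the low-loss versus high-loss contrast:

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0])
good_pred = np.array([0.9, 0.1, 0.8])   # close to the labels
bad_pred  = np.array([0.4, 0.7, 0.3])   # far from the labels

def binary_cross_entropy(y, p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(binary_cross_entropy(y_true, good_pred))  # ~0.14 -> low loss, good predictions
print(binary_cross_entropy(y_true, bad_pred))   # ~1.11 -> high loss, poor predictions
```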
Types of Loss Functions
For Classification Problems (Discrete Outputs)
Binary Cross-Entropy
Categorical Cross-Entropy
For Regression Problems (Continuous Outputs)
Mean Squared Error (MSE)
Mean Absolute Error (MAE)
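In Keras, this choice simply maps to the loss argument of compile; the toy models below are placeholders to show how the output layer and the loss are paired:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

# Binary classification: sigmoid output + binary cross-entropy
binary_model = Sequential([Input(shape=(10,)), Dense(16, activation='relu'),
                           Dense(1, activation='sigmoid')])
binary_model.compile(optimizer='adam', loss='binary_crossentropy')

# Multi-class classification: softmax output + categorical cross-entropy
multi_model = Sequential([Input(shape=(10,)), Dense(16, activation='relu'),
                          Dense(3, activation='softmax')])
multi_model.compile(optimizer='adam', loss='categorical_crossentropy')

# Regression: linear output + mean squared error (or 'mae' for mean absolute error)
reg_model = Sequential([Input(shape=(10,)), Dense(16, activation='relu'),
                        Dense(1)])
reg_model.compile(optimizer='adam', loss='mse')
```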
Putting It into Practice
In a neural network, the goal is to make accurate predictions. However, when the network is first trained, it makes errors. To measure how far off these predictions are from the true values, we use a loss function. Minimizing the loss helps the model improve and make better predictions.
Task
The objective is to find the optimal weights (W) that minimize the loss function, meaning we want the model to be as accurate as possible.
In practice, during model training:
Forward Propagation: The input data flows through the network to produce a prediction.
Loss Calculation: The loss function computes the difference between predicted output and actual labels.
Backward Propagation: The model adjusts the weights using optimization algorithms (like Gradient Descent) to reduce the loss iteratively.
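These three steps map directly onto a basic PyTorch training loop; model, X_train, and y_train are assumed to already exist (for example, a small binary classifier and its data tensors):

```python
import torch.nn as nn
import torch.optim as optim

criterion = nn.BCELoss()                         # loss function
optimizer = optim.SGD(model.parameters(), lr=0.01)

for epoch in range(100):
    y_pred = model(X_train)                      # 1. forward propagation
    loss = criterion(y_pred, y_train)            # 2. loss calculation

    optimizer.zero_grad()                        # clear old gradients
    loss.backward()                              # 3. backward propagation (chain rule)
    optimizer.step()                             # update weights to reduce the loss
```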
Result
By continuously updating the weights to minimize the loss function:
The model becomes more accurate.
Lower loss means better predictions.
This optimization helps the network generalize well to new data.
✅ Key Takeaway: The loss function is a mathematical measure of model error. Minimizing it through weight optimization improves the model's performance.