Gradient Descent Simplified

What is Gradient Descent?
Batch Gradient Descent
Stochastic Gradient Descent
Mini-batch Gradient Descent
What is Gradient Descent?
At its core, gradient descent is an optimization algorithm. It helps our machine learning model improve by minimizing the loss.
Imagine you’re on a hill and it’s foggy. You want to get to the bottom (lowest point). But you can’t see very far. So you feel the slope under your feet and take one step in the direction that goes downhill.
Do this repeatedly, and eventually — you reach the bottom. That’s gradient descent!
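The hiker analogy can be sketched in one dimension. Here we minimize f(x) = (x − 3)², whose slope at any point is 2(x − 3); the starting point, learning rate, and step count are just illustrative choices.

```python
# Minimal 1-D gradient descent: minimize f(x) = (x - 3)^2.

def gradient_descent_1d(start, learning_rate=0.1, steps=100):
    x = start
    for _ in range(steps):
        slope = 2 * (x - 3)         # feel the slope under your feet
        x -= learning_rate * slope  # take one step downhill
    return x

print(gradient_descent_1d(start=10.0))  # ends up very close to 3.0, the minimum
```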
1. Batch Gradient Descent
Imagine our hiker is very careful. Before taking a single step, they survey the entire valley.
How it works: Batch Gradient Descent calculates the loss for all the training examples in our dataset. It then averages these losses to get a single, precise gradient. Only then does it update the model's parameters.
Pros:
Stable and precise: Since it uses all data, the updates are very stable and lead to a smooth convergence to the optimal solution.
Great for parallel processing: The gradient calculation can be done in parallel across all data points.
Cons:
Slow: It can be extremely slow and computationally expensive, especially with large datasets, because it needs to process the entire dataset for just one single update.
Memory-hungry: It requires storing the entire dataset in memory to perform the calculations.
The implementation of Batch Gradient Descent (BGD) begins with initializing the parameters, often with weights set to zero or small random values. This initial guess is the first step on the journey toward the lowest possible error.
Step 1: Initialize the model parameters, typically weights w and bias b.
Step 2: Choose a learning rate α that is neither too large (to avoid overshooting) nor too small (to prevent slow convergence).
Step 3: Determine the convergence criteria, which could be a threshold for the cost function decrease or a maximum number of iterations.
The gradient, denoted as ∇C, is the vector of all partial derivatives of the cost function C with respect to each parameter.
To find this gradient, one must calculate the average rate of change of the cost function across the entire dataset for each parameter.
Python Implementation:
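Here is a minimal NumPy sketch of the three steps above, using linear regression with a mean-squared-error cost. The model, learning rate, and convergence threshold are illustrative choices, not prescriptions:

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.02, epochs=5000, tol=1e-10):
    """Batch GD for linear regression. X: (n, d) features, y: (n,) targets."""
    n, d = X.shape
    w = np.zeros(d)                  # Step 1: initialize weights...
    b = 0.0                          # ...and bias
    prev_cost = np.inf
    for _ in range(epochs):          # Step 3: stop at max iterations...
        error = X @ w + b - y
        cost = (error ** 2).mean() / 2
        # Gradient averaged over the ENTIRE dataset -- the defining trait of BGD
        grad_w = X.T @ error / n
        grad_b = error.mean()
        w -= lr * grad_w             # Step 2: update using learning rate lr
        b -= lr * grad_b
        if prev_cost - cost < tol:   # ...or when the cost stops decreasing
            break
        prev_cost = cost
    return w, b

# Fit y = 2x + 1 on a tiny toy dataset
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1
w, b = batch_gradient_descent(X, y)
```

Because every update uses all n examples, the cost decreases smoothly at each step, which is exactly the "stable and precise" behavior described above.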
2. Stochastic Gradient Descent (SGD)
Our second hiker is the complete opposite. They're impatient and take a step after seeing just a tiny piece of the landscape.
How it works: Stochastic Gradient Descent (SGD) calculates the loss and updates the model's parameters for one single training example at a time.
Pros:
Super fast: It makes updates very quickly, which is great for huge datasets where a batch approach would be too slow.
Less memory-intensive: It only needs to store one data point at a time.
Helps escape local minima: The "noisy" updates can sometimes help the model jump out of a suboptimal solution (a small dip in the valley) and find a better one.
Cons:
Noisy and erratic: The path to the bottom is not smooth. It can be very bumpy and jumpy, making it harder to converge to the exact minimum.
Harder to parallelize: Updates are sequential, making it less suitable for parallel computation.
Pseudocode:
Choose an initial parameter vector w and a learning rate η.
Repeat until an approximate minimum is reached:
    Randomly shuffle the samples in the training set.
    For i = 1, 2, ..., n do:
        w := w − η ∇Qi(w)
Trade-offs Between BGD and SGD
The comparison between BGD and SGD highlights a trade-off between computational efficiency and convergence quality.
Training time: BGD often requires more time due to processing the entire dataset in each iteration, whereas SGD updates parameters more frequently using individual examples.
Convergence quality: BGD offers a more precise and consistent convergence, albeit at the cost of increased computational load.
Python Implementation:
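The pseudocode above translates into a short NumPy sketch. Reusing the same toy linear-regression setup (the learning rate and epoch count are illustrative), note how each parameter update now comes from a single shuffled example:

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=500, seed=0):
    """SGD for linear regression: one example per update, reshuffled each epoch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):       # randomly shuffle the samples
            error = X[i] @ w + b - y[i]    # gradient from ONE example only
            w -= lr * error * X[i]         # w := w - η ∇Qi(w)
            b -= lr * error
    return w, b

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1
w, b = sgd(X, y)
```

Each epoch makes n small, noisy updates instead of one averaged update, which is why the path to the minimum is bumpy but each step is cheap.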
3. Mini-Batch Gradient Descent
This is the most popular and practical approach, balancing the best of both worlds. Our third hiker surveys a small patch of the landscape before taking a step.
How it works: Mini-Batch Gradient Descent is a compromise. It calculates the loss and updates the parameters using a small, randomly selected subset of the data called a mini-batch. The size of this mini-batch is a hyperparameter we can tune (e.g., 32, 64, 128 data points).
Pros:
Balanced speed and stability: It's much faster than Batch GD and more stable than SGD. It gives a good trade-off.
Efficient and practical: Modern hardware (like GPUs) is highly optimized for matrix operations, which mini-batches are perfect for.
Less prone to local minima: The noise from the mini-batch updates helps it avoid getting stuck.
Cons:
Requires careful tuning: The mini-batch size is a hyperparameter that needs to be chosen carefully.
It inherits some of the drawbacks of both Batch GD and SGD, but it is typically the most effective approach in practice.
Python Implementation:
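A minimal NumPy sketch, again on the toy linear-regression problem; the batch size of 4 and the learning rate are illustrative hyperparameters you would tune in practice:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.02, epochs=300, batch_size=4, seed=0):
    """Mini-batch GD: each update averages the gradient over a small batch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        idx = rng.permutation(n)                  # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            error = X[batch] @ w + b - y[batch]
            # Gradient averaged over just this mini-batch
            w -= lr * X[batch].T @ error / len(batch)
            b -= lr * error.mean()
    return w, b

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1
w, b = minibatch_gd(X, y)
```

With batch_size=1 this reduces to SGD, and with batch_size=n it reduces to Batch GD, which is exactly why mini-batch sits between the two.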
🗣️ Final Thoughts
Gradient descent is like your personal trainer — guiding your model step-by-step to become better. Whether it's taking advice from everyone (batch), from individuals (SGD), or from small groups (mini-batch), the goal is the same: reduce errors and get smarter.
Let me know in the comments:
➡️ Which gradient descent type did you use last?
➡️ Want me to explain backpropagation next?
Thanks for reading! 😊
If this helped, follow me for more machine learning made simple.