Optimizing Machine Learning: The Many Flavors of Gradient Descent


"Ever wondered why your machine learning model is slow to train—or worse, stuck going nowhere?"
It’s not just the data or the architecture. It might be your optimizer’s dirty little secret: how it descends.
Welcome to the world of gradient descent—the engine that drives learning in most ML models. But did you know there’s not just one gradient descent? In fact, how you descend—in batches, one step at a time, or somewhere in between—can massively impact your model’s speed, accuracy, and stability.
🧠 Why are we doing all this?
At its core, training a machine learning model means finding the best possible parameters (like weights and bias) that minimize error — or in math terms, minimize a loss function.
But here's the twist:
We’re often dealing with complex, high-dimensional loss landscapes — full of hills, valleys, and tricky curves — and we don’t know where the lowest point is.
Enter Gradient Descent.
It’s our compass in this landscape. It tells us:
“Hey, if you move in this direction (the negative gradient), the loss will decrease.”
So we take small steps — guided by the slope of the loss — and gradually descend to the lowest valley we can find. That’s where the model performs best.
It's not just optimization.
👉 It's how your model learns.
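Here's a tiny worked example of one such step (a toy one-dimensional loss, used purely for illustration): suppose the loss is $\mathcal{L}(w) = (w - 3)^2$, the current weight is $w = 5$, and the learning rate is $\alpha = 0.1$.
$$\begin{aligned} \frac{d\mathcal{L}}{dw} &= 2(w - 3) = 2(5 - 3) = 4 \\ w &:= w - \alpha \cdot \frac{d\mathcal{L}}{dw} = 5 - 0.1 \cdot 4 = 4.6 \end{aligned}$$
The loss drops from $(5 - 3)^2 = 4$ to $(4.6 - 3)^2 = 2.56$. Take enough of these small steps and $w$ settles at the minimum, $w = 3$.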
Let’s go over the most popular forms of gradient descent:
🧱 1. Batch Gradient Descent
Batch Gradient Descent is the old-school, all-in kind of learner. It looks at the entire dataset at once to compute the gradient and then updates the model. Think of it as your super thorough friend who refuses to take a step before checking every single detail.
✅ How it works:
Calculate the gradient using all training examples.
Perform one big, carefully calculated update per iteration.
⚙️ Formula:
$$\begin{aligned} w &:= w - \alpha \cdot dw \\ b &:= b - \alpha \cdot db \end{aligned}$$
Here, dw and db are computed over the entire dataset of m training examples, which is where the "batch" in the name comes from:
$$\begin{aligned} \hat{y}^{(i)} &= wx^{(i)} + b \\ dw &= \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)} \\ db &= \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) \end{aligned}$$
🔍 Pros:
🧠 Stable convergence — each update is based on the true gradient.
📉 Smooth loss curve — no noisy jumps in training.
⚠️ Cons:
🐢 Slow with large datasets — one update = one full pass.
🧠 Memory-heavy — needs the whole dataset in memory.
❌ Not ideal for real-time or streaming data
🧪 Best for:
Small to medium datasets
Offline training scenarios
When precision matters more than speed
💬 "Talk is cheap. Show me the code."
Alright, alright — let’s walk the walk. Here's how you can implement Batch Gradient Descent from scratch in Python, using nothing but NumPy and a little math magic.
import numpy as np

# === Step 1: Sample Dataset ===
# 3 samples, 2 features each
X = np.array([
    [1, 2],
    [2, 3],
    [3, 4]
], dtype=float)

# Target values (y = 2*x1 + 1*x2 + 1)
y = np.array([5, 8, 11], dtype=float)

m, n = X.shape  # m = number of samples, n = number of features

# === Step 2: Initialize Parameters ===
w = np.zeros(n)   # Weight vector (n-dim)
b = 0.0           # Bias term
alpha = 0.01      # Learning rate
epochs = 1000     # Number of iterations

# === Step 3: Gradient Descent Loop ===
for epoch in range(epochs):
    # Step 3.1: Make predictions using current weights
    y_pred = np.dot(X, w) + b  # shape: (m,)

    # Step 3.2: Compute gradients
    dw = (1 / m) * np.dot(X.T, (y_pred - y))  # shape: (n,)
    db = (1 / m) * np.sum(y_pred - y)         # scalar

    # Step 3.3: Update weights and bias
    w -= alpha * dw
    b -= alpha * db

    # Step 3.4: Print loss every 100 epochs
    if epoch % 100 == 0:
        loss = (1 / (2 * m)) * np.sum((y_pred - y) ** 2)
        print(f"Epoch {epoch}: Loss = {loss:.4f}, w = {w}, b = {b:.4f}")

# === Step 4: Final Output ===
print("\nFinal learned weights and bias:")
print(f"w = {w}")
print(f"b = {b:.4f}")
⚡ 2. Stochastic Gradient Descent (SGD)
If Batch Gradient Descent is the methodical student, SGD is the caffeine-fueled speed-runner. Instead of waiting to see the entire dataset, SGD updates the model using just one training example at a time.
✅ How it works:
Randomly shuffle the dataset
For each training example, compute the gradient and update the weights immediately
⚙️ Formula (Per Sample):
$$\begin{aligned} w &:= w - \alpha \cdot dw^{(i)} \\ b &:= b - \alpha \cdot db^{(i)} \end{aligned}$$
Where:
$$\begin{aligned} \hat{y}^{(i)} &= wx^{(i)} + b \\ dw^{(i)} &= \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)} \\ db^{(i)} &= \left( \hat{y}^{(i)} - y^{(i)} \right) \end{aligned}$$
Gradients are computed on just one sample at a time.
🔍 Pros:
⚡️ Very fast updates — ideal for large or streaming datasets
🌪️ Can escape local minima due to its noisy path
🔄 Works well in online learning scenarios
⚠️ Cons:
🌊 Noisy convergence — loss fluctuates a lot
❌ May overshoot the minimum or bounce around it
🛠️ Needs techniques like momentum or learning rate decay to stabilize
🧪 Best for:
Extremely large datasets
Real-time / online learning
When fast, rough updates are acceptable
import numpy as np

# === Step 1: Sample Dataset ===
X = np.array([
    [1, 2],
    [2, 3],
    [3, 4]
], dtype=float)

# Target: y = 2*x1 + 1*x2 + 1
y = np.array([5, 8, 11], dtype=float)

m, n = X.shape  # m = samples, n = features

# === Step 2: Initialize Parameters ===
w = np.zeros(n)  # Weight vector: shape (n,)
b = 0.0          # Bias
alpha = 0.01     # Learning rate
epochs = 10      # Number of full passes over the dataset

# === Step 3: Stochastic Gradient Descent Loop ===
for epoch in range(epochs):
    # Visit the samples in a random order each epoch (the "shuffle" step above)
    for i in np.random.permutation(m):
        xi = X[i]  # Feature vector for one sample
        yi = y[i]  # True label for one sample

        # Step 3.1: Predict using current parameters
        y_pred = np.dot(xi, w) + b

        # Step 3.2: Compute gradients
        error = y_pred - yi
        dw = error * xi  # gradient w.r.t. w
        db = error       # gradient w.r.t. b

        # Step 3.3: Update weights
        w -= alpha * dw
        b -= alpha * db

    # Optional: Print loss after each epoch
    y_all_pred = np.dot(X, w) + b
    loss = (1 / (2 * m)) * np.sum((y_all_pred - y) ** 2)
    print(f"Epoch {epoch + 1}: Loss = {loss:.4f}, w = {w}, b = {b:.4f}")

# === Final Output ===
print("\nFinal learned weights and bias:")
print(f"w = {w}")
print(f"b = {b:.4f}")
⚖️ 3. Mini-Batch Gradient Descent
When Batch is too slow and Stochastic is too chaotic, you go for the Goldilocks method — Mini-Batch Gradient Descent.
Instead of using all the data (like Batch) or just one sample (like SGD), Mini-Batch takes a small batch of samples (say, 16, 32, or 64) and performs one parameter update per mini-batch.
It gives us the best of both worlds — faster training and more stable convergence.
🧾 Formula (Per Mini-Batch):
Let the batch size be k (using k rather than b so it isn't confused with the bias). The update rule becomes:
$$\begin{aligned} \hat{y}^{(i)} &= w \cdot x^{(i)} + b \\ dw &= \frac{1}{k} \sum_{i=1}^{k} \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)} \\ db &= \frac{1}{k} \sum_{i=1}^{k} \left( \hat{y}^{(i)} - y^{(i)} \right) \\ w &:= w - \alpha \cdot dw \\ b &:= b - \alpha \cdot db \end{aligned}$$
The gradients are averaged over the mini-batch before updating. You get stability and efficiency.
✅ Pros:
⚡️ Faster than full-batch gradient descent
🌊 Less noisy than SGD updates
🧠 Generalizes well to unseen data
💻 Perfect for GPUs — vectorized batch processing
⚠️ Cons:
🎯 Needs careful tuning of batch size (too small = jittery, too big = slow)
🔀 Requires data shuffling for good performance
🧩 Slightly more complex to implement than batch or SGD
🧪 Best for:
🚀 Training deep learning models
📊 Medium to large datasets
⚖️ When you want a sweet spot between convergence speed and quality
import numpy as np

# === Step 1: Prepare Sample Dataset ===
X = np.array([
    [1, 2],
    [2, 3],
    [3, 4],
    [4, 5],
    [5, 6],
    [6, 7]
], dtype=float)

# Target: y = 2*x1 + 1*x2 + 2
y = np.array([6, 9, 12, 15, 18, 21], dtype=float)

m, n = X.shape  # m = number of samples, n = number of features

# === Step 2: Initialize Parameters ===
w = np.zeros(n)   # weight vector
b = 0.0           # bias
alpha = 0.01      # learning rate
epochs = 50
batch_size = 2

# === Step 3: Mini-Batch Gradient Descent ===
for epoch in range(epochs):
    # Step 3.1: Shuffle data each epoch
    indices = np.arange(m)
    np.random.shuffle(indices)
    X_shuffled = X[indices]
    y_shuffled = y[indices]

    # Step 3.2: Process each mini-batch
    for i in range(0, m, batch_size):
        X_batch = X_shuffled[i:i+batch_size]
        y_batch = y_shuffled[i:i+batch_size]
        b_size = len(X_batch)

        # Step 3.3: Predict
        y_pred = np.dot(X_batch, w) + b

        # Step 3.4: Compute gradients
        dw = (1 / b_size) * np.dot(X_batch.T, (y_pred - y_batch))
        db = (1 / b_size) * np.sum(y_pred - y_batch)

        # Step 3.5: Update weights
        w -= alpha * dw
        b -= alpha * db

    # Step 3.6: Optionally print loss
    if epoch % 10 == 0 or epoch == epochs - 1:
        full_pred = np.dot(X, w) + b
        loss = (1 / (2 * m)) * np.sum((full_pred - y) ** 2)
        print(f"Epoch {epoch}: Loss = {loss:.4f}, w = {w}, b = {b:.4f}")

# === Step 4: Final Output ===
print("\nFinal learned weights and bias:")
print(f"w = {w}")
print(f"b = {b:.4f}")
🔍 "Keen observers might have noticed..."
...that in every flavor of Gradient Descent — Batch, Stochastic, or Mini-Batch — there's this mysterious little number sneaking around called alpha ($\alpha$), a.k.a. the learning rate.
It may look small, but it’s the single most influential hyperparameter in how your model learns.
🚀 What is alpha (the learning rate)?
It controls how big a step your model takes in the direction of the gradient:
Too small? The model learns painfully slowly 🐢
Too large? It overshoots and diverges into oblivion 🚀💥
Just right? Converges smoothly to the minimum 🎯
$$w := w - \alpha \cdot \frac{\partial \mathcal{L}}{\partial w}$$
Think of it like tuning the volume on your model's learning speed. Too quiet — you hear nothing. Too loud — it’s all noise.
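To see this in action, here's a tiny, self-contained experiment (a toy example added for illustration, using a made-up one-dimensional loss $L(w) = (w - 3)^2$ rather than the datasets above) that runs plain gradient descent with three different learning rates:

import numpy as np

def descend(alpha, w0=5.0, steps=50):
    """Run gradient descent on L(w) = (w - 3)^2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)   # dL/dw
        w -= alpha * grad    # the gradient descent step
    return w

for alpha in [0.001, 0.1, 1.05]:   # too small, sensible, too large
    print(f"alpha = {alpha:<6} -> final w = {descend(alpha):.4f}  (minimum is at w = 3)")

With the tiny learning rate the weight barely moves after 50 steps, with the moderate one it settles right at the minimum, and with the overly large one the updates overshoot and the weight blows up.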
🧠 “So… which one should I use?”
At this point, you might be wondering:
"Okay, I get how Batch, Stochastic, and Mini-Batch Gradient Descent work — but when do I use which?"
Great question. Here's a quick side-by-side comparison to help you decide:
📊 Gradient Descent Variants at a Glance
| Feature | Batch GD | Stochastic GD (SGD) | Mini-Batch GD |
| --- | --- | --- | --- |
| Update Frequency | Once per epoch | Every sample | Every mini-batch |
| Speed | 🐢 Slow | ⚡ Fast | ⚖️ Balanced |
| Convergence | 🧘♂️ Stable | 🎢 Noisy | 🎯 Smooth & fast |
| Memory Usage | 🔺 High (all data) | 🔻 Low (1 sample) | 🔁 Moderate (batches) |
| Best For | Small datasets | Online/streaming data | Deep learning (default) |
🤔 Did You Know Gradient Descent Has Smarter Cousins?
Yep — the version we just learned is just the starting point.
As your models get deeper and your data gets heavier, vanilla gradient descent might start to:
🚶♂️ Converge slowly
🎯 Miss the mark (or bounce around it)
💤 Learn inefficiently
That’s why the ML world came up with clever upgrades — optimizers like Momentum, RMSProp, and Adam — designed to make learning faster, smoother, and smarter.
Let’s meet them. 🚀
⚡️ 1. Momentum
“It’s like GD with a memory.”
Instead of blindly following the current slope, Momentum builds up speed in directions it consistently travels. Think of rolling a ball down a hill — it picks up speed and keeps going even if the slope flattens briefly.
Update Rule:
$$\begin{aligned} v_t &= \beta \cdot v_{t-1} + (1 - \beta) \cdot \nabla \mathcal{L}(\theta) \\ \theta &= \theta - \alpha \cdot v_t \end{aligned}$$
✅ Pros:
Speeds up convergence
Smooths out noisy gradients
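To connect the formula to code, here's a minimal sketch of Momentum applied to the small dataset from the Batch GD example above. It follows the $(1 - \beta)$ form of the update rule shown here; the velocity variables vw and vb are names chosen for this sketch, and the hyperparameters are illustrative choices, not prescriptions.

import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4]], dtype=float)
y = np.array([5, 8, 11], dtype=float)   # y = 2*x1 + 1*x2 + 1
m, n = X.shape

w, b = np.zeros(n), 0.0
vw, vb = np.zeros(n), 0.0    # velocity terms: the optimizer's "memory"
alpha, beta = 0.1, 0.9       # illustrative hyperparameters

for epoch in range(1000):
    y_pred = X @ w + b
    dw = (1 / m) * X.T @ (y_pred - y)
    db = (1 / m) * np.sum(y_pred - y)

    # Exponentially weighted average of past gradients
    vw = beta * vw + (1 - beta) * dw
    vb = beta * vb + (1 - beta) * db

    # Step in the direction of the accumulated velocity
    w -= alpha * vw
    b -= alpha * vb

loss = (1 / (2 * m)) * np.sum((X @ w + b - y) ** 2)
print(f"Final loss = {loss:.6f}, w = {w}, b = {b:.4f}")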
🔍 2. RMSProp
“It learns how fast to learn.”
RMSProp scales the learning rate for each parameter individually, based on how much the gradients vary. If a parameter has been bouncing around, RMSProp slows it down.
Update Rule:
$$\begin{aligned} E[g^2]_t &= \beta \cdot E[g^2]_{t-1} + (1 - \beta) \cdot g_t^2 \\ \theta &= \theta - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_t \end{aligned}$$
✅ Pros:
Handles non-stationary objectives
Excellent for RNNs and noisy gradients
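Likewise, here's a minimal sketch of the RMSProp update on the same toy dataset. The running-average variables sw and sb and the hyperparameter values are illustrative choices for this sketch, not the canonical implementation.

import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4]], dtype=float)
y = np.array([5, 8, 11], dtype=float)
m, n = X.shape

w, b = np.zeros(n), 0.0
sw, sb = np.zeros(n), 0.0            # running averages of squared gradients
alpha, beta, eps = 0.01, 0.9, 1e-8   # illustrative hyperparameters

for epoch in range(1000):
    y_pred = X @ w + b
    dw = (1 / m) * X.T @ (y_pred - y)
    db = (1 / m) * np.sum(y_pred - y)

    # Track how large each gradient has been recently
    sw = beta * sw + (1 - beta) * dw ** 2
    sb = beta * sb + (1 - beta) * db ** 2

    # Scale each parameter's step by its own gradient history
    w -= alpha * dw / np.sqrt(sw + eps)
    b -= alpha * db / np.sqrt(sb + eps)

print(f"w = {w}, b = {b:.4f}")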
🧠 3. Adam (Adaptive Moment Estimation)
“The Swiss Army knife of optimizers.”
Adam combines Momentum and RMSProp into one powerhouse. It keeps track of both the mean and variance of gradients and corrects for their bias in early stages.
Update Rule:
$$\begin{aligned} m_t &= \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t \\ v_t &= \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2 \\ \hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\ \theta &= \theta - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t \end{aligned}$$
✅ Pros:
Usually works great out of the box
Handles sparse gradients well
Default choice in deep learning
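And a compact sketch of Adam's update rule on the same toy problem, bias correction included. The values beta1 = 0.9, beta2 = 0.999, eps = 1e-8 are the commonly cited defaults; everything else here is an illustrative choice.

import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4]], dtype=float)
y = np.array([5, 8, 11], dtype=float)
m, n = X.shape

w, b = np.zeros(n), 0.0
mw, vw = np.zeros(n), np.zeros(n)   # 1st and 2nd moment estimates for w
mb, vb = 0.0, 0.0                   # ... and for b
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 1001):            # t starts at 1 for bias correction
    y_pred = X @ w + b
    dw = (1 / m) * X.T @ (y_pred - y)
    db = (1 / m) * np.sum(y_pred - y)

    # Update biased moment estimates
    mw = beta1 * mw + (1 - beta1) * dw
    vw = beta2 * vw + (1 - beta2) * dw ** 2
    mb = beta1 * mb + (1 - beta1) * db
    vb = beta2 * vb + (1 - beta2) * db ** 2

    # Bias correction (matters most in early iterations)
    mw_hat, vw_hat = mw / (1 - beta1 ** t), vw / (1 - beta2 ** t)
    mb_hat, vb_hat = mb / (1 - beta1 ** t), vb / (1 - beta2 ** t)

    # Parameter update
    w -= alpha * mw_hat / (np.sqrt(vw_hat) + eps)
    b -= alpha * mb_hat / (np.sqrt(vb_hat) + eps)

print(f"w = {w}, b = {b:.4f}")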
📌 Note: Going deep into the inner workings of these optimizers — like the math behind bias correction in Adam or why RMSProp works so well on noisy data — is a bit beyond the scope of this article.
🔍 But don’t worry! I’ll be exploring each of them in detail (with clear visuals and from-scratch code) in upcoming posts.
For now, it’s enough to understand what they do and why they matter — you’ve already leveled up your gradient game!
✅ Wrapping Up
By now, you’ve not only grasped how Gradient Descent works — but also explored its most powerful variants and the secret sauce behind training smarter models.
🧠 From batch updates to adaptive optimizers, you’ve taken your first big step into the world of ML optimization!
But this is just the beginning. In the next post, we’ll apply everything you've learned to build a real-world model —
🧪 Logistic Regression from scratch, with Gradient Descent guiding every step.
Written by
Sagar Tewari
ML Engineer in the making — focused on explainability, fairness, and production-grade AI. Sharing my journey, one post at a time.