[AI #2] Understanding Gradient Descent: The Heart of Machine Learning

Table of contents
- Master Gradient Descent: The Complete Guide to Machine Learning's Most Important Optimization Algorithm
- Understanding Gradient Descent: The Core Concept
- The Critical Role of Learning Rate
- How Gradient Descent Actually Works
- Convergence Properties and Characteristics
- Implementation Details and Best Practices
- Variants of Gradient Descent
- Limitations and Advanced Considerations
- Conclusion

Master Gradient Descent: The Complete Guide to Machine Learning's Most Important Optimization Algorithm
Gradient descent is the backbone of machine learning optimization, powering everything from simple linear regression to complex neural networks. This comprehensive guide will take you from beginner to expert, covering Andrew Ng's essential teachings and practical implementation strategies.
Learning Objective: Master Gradient Descent from Andrew Ng's Machine Learning course - the most crucial optimization algorithm in ML - and reach a level where you can apply it confidently in real-world projects.
Understanding Gradient Descent: The Core Concept
The Big Picture
Gradient descent is an algorithm that minimizes the cost function \(J(w,b)\) by iteratively finding the optimal parameters.
Intuitive Understanding: Think of it like hiking down a mountain with fog. You can only see your immediate surroundings, so you look around 360 degrees, find the steepest downward slope, take one step in that direction, and repeat this process until you reach the bottom (local minimum).
Mathematical Foundation
The cost function we're trying to minimize is:
\(J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)})^2\)
The gradient descent update rules are:
\(w = w - \alpha \times \frac{\partial J}{\partial w}\)
\(b = b - \alpha \times \frac{\partial J}{\partial b}\)
(Note: = means assignment, not equality)
Intuitive breakdown:
We square the difference between predicted and actual values so that positive and negative errors don't cancel out
We average across all m data points so the cost doesn't grow with the size of the dataset
The factor of \(\frac{1}{2}\) cancels the 2 produced by differentiation, simplifying the derivative calculations
Key Components:
α (alpha): Learning rate - determines how big a step we take at each update
\( \frac{\partial J}{\partial w}, \frac{\partial J}{\partial b}\): Partial derivatives - act like a compass showing which direction to move
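To make these components concrete, here is a minimal NumPy sketch of the cost function above (the function and variable names are illustrative, not from the course materials):
```python
import numpy as np

def compute_cost(x, y, w, b):
    """Mean squared error cost J(w, b), including the 1/(2m) factor."""
    m = len(x)
    predictions = w * x + b      # f_{w,b}(x^(i)) for every sample
    errors = predictions - y     # predicted minus actual
    return np.sum(errors ** 2) / (2 * m)
```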
The Critical Role of Learning Rate
Learning rate α is the most important hyperparameter in gradient descent.
Hyperparameter: a value you must set manually before training begins - the algorithm doesn't learn it from the data
When Learning Rate is Too Large
Problem: We might overshoot the minimum and diverge
Result: Cost function keeps increasing or oscillating wildly
When Learning Rate is Too Small
Problem: Takes forever to reach the minimum
Result: Training becomes impractically slow
Finding the Right Learning Rate
Step 1: Try values like 0.001, 0.01, 0.1, 1.0
Step 2: Monitor if the cost function decreases
Step 3: Reduce α if the cost diverges or oscillates; increase it if convergence is too slow (see the sketch below)
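A minimal sketch of that search loop, assuming training arrays `x` and `y` and a `gradient_descent` helper like the one sketched in the next section that returns the cost history:
```python
# Try a few candidate learning rates and watch the cost trend (illustrative values).
for alpha in [0.001, 0.01, 0.1, 1.0]:
    w, b, cost_history = gradient_descent(x, y, alpha=alpha, num_iters=100)
    trend = "decreasing" if cost_history[-1] < cost_history[0] else "increasing/diverging"
    print(f"alpha={alpha}: final cost = {cost_history[-1]:.4f} ({trend})")
```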
How Gradient Descent Actually Works
Step-by-Step Algorithm
Step 1: Initialize Parameters
Set w and b to initial values (often 0, or small random numbers)
Set learning rate α
Step 2: Calculate Partial Derivatives
Compute the gradients at the current position:
\(\frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)}) \cdot x^{(i)}\)
\(\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)})\)
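These two formulas translate directly into code (a sketch; `compute_gradients` is an illustrative name):
```python
import numpy as np

def compute_gradients(x, y, w, b):
    """Partial derivatives of J(w, b) with respect to w and b."""
    m = len(x)
    errors = (w * x + b) - y           # f_{w,b}(x^(i)) - y^(i)
    dj_dw = np.sum(errors * x) / m     # (1/m) * sum of error * x
    dj_db = np.sum(errors) / m         # (1/m) * sum of error
    return dj_dw, dj_db
```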
Step 3: Update Parameters
\(w = w - \alpha \frac{\partial J}{\partial w}\)
\(b = b - \alpha \frac{\partial J}{\partial b}\)
Step 4: Repeat
Continue steps 2-3 until convergence.
⚠️ Critical Point
Simultaneous Update: Always update w and b using the same set of partial derivatives
Convergence Condition: Stop when partial derivatives approach zero or cost function change becomes negligible
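Putting the four steps together with both critical points baked in, a minimal training loop might look like this (a sketch building on the `compute_cost` and `compute_gradients` helpers above):
```python
def gradient_descent(x, y, alpha=0.01, num_iters=1000, epsilon=1e-4):
    w, b = 0.0, 0.0                                   # Step 1: initialize parameters
    cost_history = []
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradients(x, y, w, b)  # Step 2: gradients at current (w, b)
        w = w - alpha * dj_dw                         # Step 3: simultaneous update -
        b = b - alpha * dj_db                         # both use the same gradients
        cost_history.append(compute_cost(x, y, w, b))
        if abs(dj_dw) < epsilon and abs(dj_db) < epsilon:
            break                                     # Step 4: stop once converged
    return w, b, cost_history
```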
Convergence Properties and Characteristics
Automatic Step Size Adjustment
Key Insight: As we approach the minimum, the gradient (slope) automatically becomes smaller.
| Distance from Minimum | Gradient Size | Step Size | Result |
| --- | --- | --- | --- |
| Far away | Large | Large steps | Fast movement |
| Getting closer | Medium | Medium steps | Steady progress |
| Very close | Small | Small steps | Fine-tuning |
| At minimum | Zero | No movement | Converged |
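A toy example makes this concrete: for the one-parameter cost \(J(w) = w^2\) (an illustration, not from the course), the gradient is \(2w\), so each update is
\(w \leftarrow w - \alpha \cdot 2w = (1 - 2\alpha)\,w\)
The step \(\alpha \cdot 2w\) is proportional to the distance from the minimum at \(w = 0\), so the steps shrink automatically as we approach it, even though \(\alpha\) itself never changes.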
Linear Regression's Special Property
Convex Function Property
Linear Regression's cost function \(J(w,b)\) is perfectly convex (bowl-shaped)
Result: Local minimum = Global minimum (guaranteed!)
Comparison with Other Algorithms
Important Note: The "bowl shape" is only guaranteed for Linear Regression with MSE (Mean Squared Error)
Other scenarios:
Neural Networks: Non-convex - multiple local minima exist
Logistic Regression: Convex but different shape
Other cost functions: Each has unique characteristics
Implementation Details and Best Practices
The Importance of Simultaneous Updates
Correct Implementation
```python
# Simultaneous update: both gradients are evaluated at the SAME (w, b)
temp_w = w - alpha * dj_dw
temp_b = b - alpha * dj_db
w = temp_w
b = temp_b
```
Incorrect Implementation
```python
w = w - alpha * dj_dw                          # w is updated first...
dj_dw, dj_db = compute_gradients(x, y, w, b)   # ...so b's gradient uses the NEW w - WRONG!
b = b - alpha * dj_db
```
Why simultaneous updates matter:
Mathematical reason: Gradient represents direction at a specific point
Problem: Updating one parameter first changes the gradient for the other
Convergence Criteria: Theory vs Practice
Theoretical Convergence
Stop when partial derivatives are exactly zero:
\(\frac{\partial J}{\partial w} = 0 \text{ and } \frac{\partial J}{\partial b} = 0\)
Practical Implementation
Stop when both gradient magnitudes fall below a small threshold \(\varepsilon\):
\(\left|\frac{\partial J}{\partial w}\right| < \varepsilon \text{ and } \left|\frac{\partial J}{\partial b}\right| < \varepsilon\)
Why use epsilon?
Floating-point limitations: gradients computed in floating point will almost never be exactly zero
Numerical error accumulation: Tiny errors build up over iterations
Infinite loop prevention: Exact zero is practically impossible
| Problem Type | Recommended Epsilon | Use Case |
| --- | --- | --- |
| High precision required | 1e-6 ~ 1e-8 | Scientific computing, finance |
| General machine learning | 1e-3 ~ 1e-4 | Most practical applications |
| Fast approximation | 1e-2 ~ 1e-1 | Prototyping, real-time systems |
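The training loop sketched earlier used the gradient-magnitude rule; the cost-change rule from the convergence condition can be written as a small helper (again a sketch):
```python
def has_converged(cost_history, epsilon=1e-4):
    """Cost-change stopping rule: converged once successive costs differ by less than epsilon."""
    if len(cost_history) < 2:
        return False
    return abs(cost_history[-1] - cost_history[-2]) < epsilon
```
Pick epsilon from the table above; a tighter threshold buys precision at the cost of more iterations.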
Variants of Gradient Descent
What we've covered so far is Batch Gradient Descent (uses entire dataset). In practice, more efficient variants are commonly used.
Three Main Variants Compared
| Method | Data Usage | Speed | Memory | Convergence Stability | Practical Usage |
| --- | --- | --- | --- | --- | --- |
| Batch | Entire dataset | Slow | High | High | Low |
| SGD | Single sample | Fast | Low | Low (noisy) | Medium |
| Mini-batch | Small batches | Medium | Medium | Medium | High |
When to Use Each Variant
Batch Gradient Descent
Example: Medical research with rare disease patients (500 samples)
Characteristics: Small dataset fits in memory
Reason: Accuracy is crucial (medical field requires high precision)
Priority: Stability over speed
Stochastic Gradient Descent (SGD)
Example: Netflix recommendation system (real-time learning)
Characteristics: Update model immediately as users watch movies
Reason: One-by-one processing needed, real-time response critical
Advantages: Minimal memory usage, helps escape local minima
Mini-batch Gradient Descent (Most Common)
Example: Self-driving car object recognition (1M ImageNet images)
Batch size: Process 32-128 images at once
Reason: Leverages GPU parallel processing capabilities
Advantages: Faster than Batch, more stable than SGD
Sweet spot: Balance between memory efficiency and speed
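A minimal sketch of that mini-batch loop, reusing the `compute_gradients` helper from earlier (batch size 32 is just one common choice):
```python
import numpy as np

def minibatch_gradient_descent(x, y, alpha=0.01, batch_size=32, num_epochs=10):
    w, b = 0.0, 0.0
    m = len(x)
    for _ in range(num_epochs):
        indices = np.random.permutation(m)             # reshuffle every epoch
        for start in range(0, m, batch_size):
            batch = indices[start:start + batch_size]  # one small batch of samples
            dj_dw, dj_db = compute_gradients(x[batch], y[batch], w, b)
            w = w - alpha * dj_dw                      # same simultaneous update as before
            b = b - alpha * dj_db
    return w, b
```
Setting batch_size=1 recovers SGD, and batch_size=m recovers Batch Gradient Descent.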
Limitations and Advanced Considerations
Major Limitations
1️⃣ Learning Rate Selection Challenge
Problems:
Too large = divergence, too small = slow convergence
Fixed learning rate isn't optimal for all situations
Different learning rates needed for early vs late training
2️⃣ Feature Scaling Issues
Problem scenario: When features have vastly different ranges
Example: House price prediction - bedrooms (1-10) vs price ($100K-$10M)
Result: Creates elliptical contours leading to inefficient zigzag paths
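Standardization (z-score scaling) is a common fix; a minimal sketch with made-up feature values:
```python
import numpy as np

def standardize(feature):
    """Rescale one feature column to zero mean and unit variance."""
    return (feature - feature.mean()) / feature.std()

# Hypothetical feature columns with very different ranges:
bedrooms = np.array([1.0, 2.0, 3.0, 5.0, 10.0])
square_feet = np.array([600.0, 900.0, 1400.0, 2200.0, 5000.0])
x_scaled = np.column_stack([standardize(bedrooms), standardize(square_feet)])
```
After scaling, the cost contours become roughly circular, so gradient descent heads more directly toward the minimum instead of zigzagging.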
3️⃣ Local Minima and Saddle Points
Saddle Points:
Minimum in some directions, maximum in others
Gradient is zero but not a true minimum
More common in high-dimensional spaces
Can trap gradient descent algorithms
Modern Solutions
While basic gradient descent has limitations, modern variants address these issues:
Adaptive learning rates: Adam, RMSprop, AdaGrad
Momentum methods: Help escape saddle points
Feature normalization: Solves scaling problems
Learning rate scheduling: Automatic adjustment over time
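As a taste of one of these extensions, here is a minimal sketch of the classic momentum update (a simplified illustration, not any particular library's implementation):
```python
def momentum_step(w, velocity, gradient, alpha=0.01, beta=0.9):
    """One momentum update: the velocity accumulates past gradients, smoothing
    zigzag paths and carrying the parameters through flat saddle regions."""
    velocity = beta * velocity + gradient
    w = w - alpha * velocity
    return w, velocity
```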
Conclusion
Gradient descent is the foundation of machine learning optimization. Understanding its mechanics, variants, and limitations is crucial for any ML practitioner. While we've focused on the basic algorithm, remember that modern deep learning frameworks implement sophisticated variants that address most of the limitations discussed.
Key takeaways:
Start with proper learning rate selection
Always use simultaneous updates
Choose the right variant for your data size
Understand convergence criteria for practical implementation
Be aware of limitations and when to use advanced techniques
The principles you've learned here apply to everything from simple linear regression to complex neural networks. Master these fundamentals, and you'll have a solid foundation for tackling any optimization challenge in machine learning.
This guide is based on Andrew Ng's Machine Learning course materials and practical industry experience. For hands-on practice, try implementing gradient descent from scratch before using library implementations.
Note: I thought `$$` was the common and correct syntax for LaTeX in Markdown, but at least on Hashnode the correct syntax is a double backslash with parentheses (`\\(` ... `\\)`).