[AI #2] Understanding Gradient Descent: The Heart of Machine Learning

Daniel LimDaniel Lim
7 min read

Master Gradient Descent: The Complete Guide to Machine Learning's Most Important Optimization Algorithm

Gradient descent is the backbone of machine learning optimization, powering everything from simple linear regression to complex neural networks. This comprehensive guide will take you from beginner to expert, covering Andrew Ng's essential teachings and practical implementation strategies.

Learning Objective: Master Gradient Descent from Andrew Ng's Machine Learning course - the most crucial optimization algorithm in ML - and reach a level where you can apply it confidently in real-world projects.

Understanding Gradient Descent: The Core Concept

The Big Picture

Gradient descent is an algorithm that minimizes the cost function \(\\J(w,b)\) by iteratively finding the optimal parameters.

Intuitive Understanding: Think of it like hiking down a mountain with fog. You can only see your immediate surroundings, so you look around 360 degrees, find the steepest downward slope, take one step in that direction, and repeat this process until you reach the bottom (local minimum).

Mathematical Foundation

The cost function we're trying to minimize is:

\(J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)})^2\)

The gradient descent update rules are:

\(w = w - \alpha \times \frac{\partial J}{\partial w}\)

\(b = b - \alpha \times \frac{\partial J}{\partial b}\)

(Note: = means assignment, not equality)

Intuitive breakdown:

  • We square the difference between predicted and actual values to eliminate negative values

  • We average across all data points for generalization

  • The factor of \(\frac{1}{2}\) simplifies derivative calculations

Key Components:

  • α (alpha): Learning rate - determines how big steps we take

  • \( \frac{\partial J}{\partial w}, \frac{\partial J}{\partial b}\): Partial derivatives - act like a compass showing which direction to move

The Critical Role of Learning Rate

Learning rate α is the most important hyper-parameter in gradient descent.

Hyper-parameter: A value that you must set manually before training

When Learning Rate is Too Large

  • Problem: We might overshoot the minimum and diverge

  • Result: Cost function keeps increasing or oscillating wildly

When Learning Rate is Too Small

  • Problem: Takes forever to reach the minimum

  • Result: Training becomes impractically slow

Finding the Right Learning Rate

  1. Step 1: Try values like 0.001, 0.01, 0.1, 1.0

  2. Step 2: Monitor if the cost function decreases

  3. Step 3: Reduce if too fast, increase if too slow

How Gradient Descent Actually Works

Step-by-Step Algorithm

Step 1: Initialize Parameters

  • Set w and b to random values (usually 0 or small random numbers)

  • Set learning rate α

Step 2: Calculate Partial Derivatives

Compute the gradients at the current position:

\(\frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)}) \cdot x^{(i)}\)

\(\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)})\)

Step 3: Update Parameters

\(w = w - \alpha \frac{\partial J}{\partial w}\)

\(b= b - \alpha \frac{\partial J}{\partial b}\)

Step 4: Repeat

Continue steps 2-3 until convergence.

⚠️ Critical Point

  • Simultaneous Update: Always update w and b using the same set of partial derivatives

  • Convergence Condition: Stop when partial derivatives approach zero or cost function change becomes negligible

Convergence Properties and Characteristics

Automatic Step Size Adjustment

Key Insight: As we approach the minimum, the gradient (slope) automatically becomes smaller.

Distance from MinimumGradient SizeStep SizeResult
Far awayLargeLarge stepsFast movement
Getting closerMediumMedium stepsSteady progress
Very closeSmallSmall stepsFine-tuning
At minimumZeroNo movementConverged

Linear Regression's Special Property

Convex Function Property

  • Linear Regression's cost function J(w,b) is perfectly convex (bowl-shaped)

  • Result: Local minimum = Global minimum (guaranteed!)

Comparison with Other Algorithms

Important Note: The "bowl shape" is only guaranteed for Linear Regression with MSE (Mean Squared Error)

Other scenarios:

  • Neural Networks: Non-convex - multiple local minima exist

  • Logistic Regression: Convex but different shape

  • Other cost functions: Each has unique characteristics

Implementation Details and Best Practices

The Importance of Simultaneous Updates

Correct Implementation (Pseudocode)

temp_w = w - α × (∂J/∂w)
temp_b = b - α × (∂J/∂b)
w = temp_w
b = temp_b

Incorrect Implementation

w = w - α × (∂J/∂w)  // Update w first
b = b - α × (∂J/∂b)  // Use already changed w - WRONG!

Why simultaneous updates matter:

  • Mathematical reason: Gradient represents direction at a specific point

  • Problem: Updating one parameter first changes the gradient for the other

Convergence Criteria: Theory vs Practice

Theoretical Convergence

Stop when partial derivatives are exactly zero:

\(\frac{\partial J}{\partial w} = 0 \text{ and } \frac{\partial J}{\partial b} = 0\)

Practical Implementation

Stop when smaller than epsilon \(\epsilon\):

\(\left|\frac{\partial J}{\partial w}\right| < \varepsilon \text{ and } \left|\frac{\partial J}{\partial b}\right| < \varepsilon\)

Why use epsilon?

  • Floating-point limitations: Computers can't represent exact zero

  • Numerical error accumulation: Tiny errors build up over iterations

  • Infinite loop prevention: Exact zero is practically impossible

Problem TypeRecommended EpsilonUse Case
High precision required1e-6 ~ 1e-8Scientific computing, finance
General machine learning1e-3 ~ 1e-4Most practical applications
Fast approximation1e-2 ~ 1e-1Prototyping, real-time systems

Variants of Gradient Descent

What we've covered so far is Batch Gradient Descent (uses entire dataset). In practice, more efficient variants are commonly used.

Three Main Variants Compared

MethodData UsageSpeedMemoryConvergence StabilityPractical Usage
BatchEntire datasetSlowHighHighLow
SGDSingle sampleFastLowLow (noisy)Medium
Mini-batchSmall batchesMediumMediumMediumHigh

When to Use Each Variant

Batch Gradient Descent

  • Example: Medical research with rare disease patients (500 samples)

  • Characteristics: Small dataset fits in memory

  • Reason: Accuracy is crucial (medical field requires high precision)

  • Priority: Stability over speed

Stochastic Gradient Descent (SGD)

  • Example: Netflix recommendation system (real-time learning)

  • Characteristics: Update model immediately as users watch movies

  • Reason: One-by-one processing needed, real-time response critical

  • Advantages: Minimal memory usage, helps escape local minima

Mini-batch Gradient Descent (Most Common)

  • Example: Self-driving car object recognition (1M ImageNet images)

  • Batch size: Process 32-128 images at once

  • Reason: Leverages GPU parallel processing capabilities

  • Advantages: Faster than Batch, more stable than SGD

  • Sweet spot: Balance between memory efficiency and speed

Limitations and Advanced Considerations

Major Limitations

1️⃣ Learning Rate Selection Challenge

Problems:

  • Too large = divergence, too small = slow convergence

  • Fixed learning rate isn't optimal for all situations

  • Different learning rates needed for early vs late training

2️⃣ Feature Scaling Issues

  • Problem scenario: When features have vastly different ranges

  • Example: House price prediction - bedrooms (1-10) vs price ($100K-$10M)

  • Result: Creates elliptical contours leading to inefficient zigzag paths

3️⃣ Local Minima and Saddle Points

Saddle Points:

  • Minimum in some directions, maximum in others

  • Gradient is zero but not a true minimum

  • More common in high-dimensional spaces

  • Can trap gradient descent algorithms

Modern Solutions

While basic gradient descent has limitations, modern variants address these issues:

  • Adaptive learning rates: Adam, RMSprop, AdaGrad

  • Momentum methods: Help escape saddle points

  • Feature normalization: Solves scaling problems

  • Learning rate scheduling: Automatic adjustment over time

Conclusion

Gradient descent is the foundation of machine learning optimization. Understanding its mechanics, variants, and limitations is crucial for any ML practitioner. While we've focused on the basic algorithm, remember that modern deep learning frameworks implement sophisticated variants that address most of the limitations discussed.

Key takeaways:

  • Start with proper learning rate selection

  • Always use simultaneous updates

  • Choose the right variant for your data size

  • Understand convergence criteria for practical implementation

  • Be aware of limitations and when to use advanced techniques

The principles you've learned here apply to everything from simple linear regression to complex neural networks. Master these fundamentals, and you'll have a solid foundation for tackling any optimization challenge in machine learning.


This guide is based on Andrew Ng's Machine Learning course materials and practical industry experience. For hands-on practice, try implementing gradient descent from scratch before using library implementations.

Note : I thought ‘$$’ was the common way and correct syntax for latex in markdown , but at least in Hashnode the correct syntax was double backslash and parenthesis.

0
Subscribe to my newsletter

Read articles from Daniel Lim directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Daniel Lim
Daniel Lim