From Spikes to Smoothness: Differentiability and Its Role in Real-World Optimization


In the world of artificial intelligence and optimization, differentiability is not just a mathematical nicety — it’s a critical enabler of nearly everything we train, tune, and learn.
But why does smoothness matter so much? And what happens when our problem isn't smooth?
Let’s unpack this through intuition, real-world applications, and the brilliant concept of function relaxation.
1. Differentiability: The Engine of Iterative Optimization
What is Differentiability?
A function is differentiable if you can compute its derivative — that is, if it has a well-defined slope at every point. Geometrically, this means the function is smooth and has no sharp corners or jumps.
Mathematically:
$$f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$$
In practice, this allows us to locally approximate a function linearly — the foundation of nearly all learning algorithms.
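As a quick sanity check, here is a tiny Python sketch of that limit: a forward finite difference with a small step h (the functions and step size are purely illustrative choices) recovers the slope of a smooth function, but gives conflicting answers at the kink of |x|.

```python
# A minimal sketch: approximating f'(x) with a forward finite difference,
# mirroring the limit definition above (the step h stands in for Δx).
def numerical_derivative(f, x, h=1e-5):
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 2   # smooth everywhere
g = lambda x: abs(x)   # sharp corner at x = 0

print(numerical_derivative(f, 3.0))   # ~6.0, matches f'(x) = 2x
print(numerical_derivative(g, 0.0))   # 1.0 with a forward step, -1.0 with a backward
                                      # step: no single well-defined slope at the kink
```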
Why Does It Matter?
Most machine learning models are trained using iterative gradient-based optimization like gradient descent or Adam. These algorithms require:
A loss function L(θ)
The ability to compute gradients
$$\frac{\partial \mathcal{L}}{\partial \theta}$$
No gradient? No direction to move. No learning.
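To make that loop concrete, here is a minimal gradient descent sketch on a toy quadratic loss (the loss, learning rate, and iteration count are illustrative assumptions, not anything from the article):

```python
# A minimal sketch of gradient descent on a toy quadratic loss.
def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    # dL/dtheta exists everywhere because the loss is differentiable
    return 2.0 * (theta - 3.0)

theta, lr = 0.0, 0.1
for _ in range(100):
    theta -= lr * grad(theta)       # step against the gradient

print(theta, loss(theta))           # theta converges to ~3.0, the minimizer
```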
2. When the World Isn’t Smooth: Discrete and Non-Differentiable Functions
Real-world examples of non-differentiable problems:
Binary classification outputs (hard 0/1 decisions)
Argmax operations (e.g., choosing a label)
Sorting or ranking (like top-k search)
Program execution paths
Reinforcement learning with discrete actions
These problems are often:
Discontinuous
Non-differentiable
Combinatorial
All of which makes them incompatible with gradient-based learning out of the box.
3. The Elegant Trick: Function Relaxation
When faced with a non-differentiable function, we often relax it — meaning we replace it with a smooth, differentiable approximation that’s close enough for optimization, but still meaningful.
This is a cornerstone trick in modern machine learning.
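As a toy illustration of the idea (the functions and the temperature tau below are my own example, not from the article): a hard 0/1 step carries no useful gradient, while a sigmoid with a small temperature behaves almost the same but stays differentiable.

```python
import math

# Toy illustration: a hard 0/1 step vs. a temperature-controlled sigmoid.
def hard_step(x):
    return 1.0 if x > 0 else 0.0              # flat everywhere, jump at 0: no useful gradient

def soft_step(x, tau=0.1):
    return 1.0 / (1.0 + math.exp(-x / tau))   # smooth and differentiable everywhere

# As tau shrinks toward 0, soft_step approaches hard_step,
# but for any tau > 0 gradients can still flow through it.
print(hard_step(0.2), soft_step(0.2))
```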
Example 1: Softmax as a Relaxed Argmax
Instead of:
$$\text{argmax}(x_1, x_2, ..., x_n)$$
We use:
$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
Unlike the hard argmax, softmax:
Outputs a smooth probability distribution
Is fully differentiable
Allows learning through the cross-entropy loss
In deep learning, softmax is the smooth bridge that lets us back-propagate through classification tasks.
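A small NumPy sketch makes the contrast concrete (the logits are arbitrary example values):

```python
import numpy as np

# Sketch: argmax picks a single index (non-differentiable),
# softmax spreads the same scores into a smooth distribution.
def softmax(x):
    z = x - x.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print(np.argmax(logits))       # 0: a hard, non-differentiable choice
print(softmax(logits))         # roughly [0.66, 0.24, 0.10]: differentiable everywhere
```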
Example 2: Straight-Through Estimator (STE)
In models such as binary neural networks, we want hard 0/1 weights, but a hard threshold has zero gradient almost everywhere (and none at the jump).
So we:
Use a hard threshold in the forward pass
Use a smooth surrogate (often simply the identity) in the backward pass
This is called the Straight-Through Estimator (STE) — a practical hack that makes training discrete models possible.
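A minimal sketch of this pattern, assuming PyTorch (the article names no framework): a custom autograd function that thresholds on the forward pass and passes gradients straight through on the backward pass.

```python
import torch

# Hard 0/1 threshold on the forward pass, identity gradient on the backward pass.
class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return (x > 0).float()      # hard, non-differentiable decision

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output          # pretend the forward pass was the identity

x = torch.randn(4, requires_grad=True)
y = BinarizeSTE.apply(x)
y.sum().backward()
print(x.grad)                       # gradients flow despite the hard threshold
```

Many practical variants also clip the passed-through gradient (for example, zeroing it where |x| > 1), but the core trick is this deliberate mismatch between the forward and backward passes.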
Example 3: Gumbel-Softmax for Discrete Latent Variables
In variational autoencoders (VAEs) with categorical latent variables, gradients cannot flow through the discrete sampling step.
Gumbel-Softmax provides a reparameterizable, differentiable approximation of sampling from a categorical distribution.
It enables:
End-to-end differentiable training
Learning of complex, structured generative models
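A rough NumPy sketch of the sampling step (the temperature and logits are illustrative; a real implementation would live inside an autodiff framework so gradients can flow):

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=0.5):
    u = np.random.uniform(1e-9, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))          # Gumbel(0, 1) noise
    z = (logits + g) / tau           # perturb the logits, then sharpen with temperature tau
    e = np.exp(z - z.max())
    return e / e.sum()               # a soft, differentiable "almost one-hot" vector

probs = np.array([0.2, 0.5, 0.3])
print(gumbel_softmax_sample(np.log(probs)))
```

As the temperature tau approaches zero, the samples approach true one-hot draws from the underlying categorical distribution; in practice tau is often annealed during training.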
Real-World Applications
| Domain | Challenge | Relaxation Strategy |
| --- | --- | --- |
| Deep Learning | argmax in classification | softmax + cross-entropy |
| Reinforcement Learning | discrete actions | policy gradients, soft Q-learning |
| Neural Architecture Search | binary architecture choices | sigmoid relaxations, STE |
| NLP & Graphs | discrete tokens, structures | differentiable surrogates, Gumbel-Softmax |
| Sorting & Ranking | top-k indices | differentiable approximations to sort (e.g., SoftSort) |
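To give a flavor of the last row, here is a rough NumPy sketch in the spirit of SoftSort: the hard permutation produced by sorting is replaced with a row-wise softmax over negative distances to the sorted values (the temperature and distance choice are assumptions for illustration, not the exact published method).

```python
import numpy as np

def soft_sort(scores, tau=0.1):
    s_sorted = np.sort(scores)[::-1]                              # target: descending order
    logits = -np.abs(s_sorted[:, None] - scores[None, :]) / tau   # closeness to each sorted value
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    P = e / e.sum(axis=1, keepdims=True)                          # row-wise softmax: a "soft" permutation
    return P @ scores                                             # smooth stand-in for the sorted vector

print(soft_sort(np.array([0.3, 2.0, 1.1])))   # close to [2.0, 1.1, 0.3] for small tau
```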
Why This Matters for You as an AI Engineer
Mastering differentiability and relaxation:
Opens up hard, discrete domains to gradient-based optimization
Allows training of models with symbolic or combinatorial structure
Enables end-to-end learning even in traditionally non-smooth spaces
It’s a bridge between pure math and pragmatic engineering.
This ability to convert "impossible to learn" into "trainable with a trick" is what separates a great AI systems designer from a merely strong practitioner.
Final Thought
Differentiability isn’t just about calculus. It’s about making learning possible. In every neural net, loss function, policy optimizer, and differentiable simulator, the same principle holds:
If you can smooth it, you can optimize it.
And that’s one of the most elegant tricks in the entire field.