From Spikes to Smoothness: Differentiability and Its Role in Real-World Optimization

Sudhin Karki

In the world of artificial intelligence and optimization, differentiability is not just a mathematical nicety — it’s a critical enabler of nearly everything we train, tune, and learn.

But why does smoothness matter so much? And what happens when our problem isn't smooth?

Let’s unpack this through intuition, real-world applications, and the brilliant concept of function relaxation.

1. Differentiability: The Engine of Iterative Optimization

What is Differentiability?

A function is differentiable if you can compute its derivative — that is, if it has a well-defined slope at every point. Geometrically, this means the function is smooth and has no sharp corners or jumps.

Mathematically:

$$f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$$

In practice, this allows us to locally approximate a function linearly — the foundation of nearly all learning algorithms.
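To make the limit concrete, here is a minimal NumPy sketch (the helper name `numerical_derivative` is just for illustration): it approximates the slope with a finite difference, and shows why a kinked function like |x| has no single well-defined slope at the kink.

```python
import numpy as np

def numerical_derivative(f, x, dx=1e-6):
    # Forward-difference approximation of the limit definition of f'(x)
    return (f(x + dx) - f(x)) / dx

smooth = lambda x: x ** 2        # differentiable everywhere, f'(x) = 2x
kinked = lambda x: np.abs(x)     # sharp corner at x = 0

print(numerical_derivative(smooth, 3.0))             # ~6.0, matches 2x
print(numerical_derivative(kinked, 0.0))             # ~1.0 approaching from the right
print(numerical_derivative(kinked, 0.0, dx=-1e-6))   # ~-1.0 from the left: the two sides disagree
```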

Why Does It Matter?

Most machine learning models are trained using iterative gradient-based optimization like gradient descent or Adam. These algorithms require:

  • A loss function $\mathcal{L}(\theta)$

  • The ability to compute gradients

$$\frac{\partial \mathcal{L}}{\partial \theta}$$

No gradient? No direction to move. No learning.
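As a toy illustration of those two requirements, here is a minimal PyTorch sketch of gradient descent on a one-parameter least-squares loss (the numbers are arbitrary, chosen only so the loop has a known answer):

```python
import torch

# Toy one-parameter least-squares problem: we want theta * x ≈ y, i.e. theta → 3.
theta = torch.tensor(0.0, requires_grad=True)
x, y = torch.tensor(2.0), torch.tensor(6.0)
optimizer = torch.optim.SGD([theta], lr=0.1)

for _ in range(100):
    loss = (theta * x - y) ** 2    # differentiable loss L(theta)
    optimizer.zero_grad()
    loss.backward()                # autograd computes dL/dtheta
    optimizer.step()               # move theta against the gradient

print(theta.item())                # ≈ 3.0
```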

2. When the World Isn’t Smooth: Discrete and Non-Differentiable Functions

Real-world examples of non-differentiable problems:

  • Binary classification outputs (hard 0/1 decisions)

  • Argmax operations (e.g., choosing a label)

  • Sorting or ranking (like top-k search)

  • Program execution paths

  • Reinforcement learning with discrete actions

These problems are often:

  • Discontinuous

  • Non-differentiable

  • Combinatorial

Together, these properties make them incompatible with gradient-based learning out of the box.
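To see the failure mode, here is a small PyTorch sketch: a hard decision is piecewise constant, so autograd either has nothing to differentiate through or returns a gradient of zero everywhere.

```python
import torch

logits = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)

# argmax returns a discrete index with no grad_fn: nothing to backpropagate through.
print(torch.argmax(logits))          # tensor(1), not connected to the graph

# A hard 0/1 decision via rounding is piecewise constant,
# so its gradient is zero almost everywhere: no learning signal.
hard = torch.round(torch.sigmoid(logits))
hard.sum().backward()
print(logits.grad)                   # tensor([0., 0., 0.])
```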

3. The Elegant Trick: Function Relaxation

When faced with a non-differentiable function, we often relax it — meaning we replace it with a smooth, differentiable approximation that’s close enough for optimization, but still meaningful.

This is a cornerstone trick in modern machine learning.

Example 1: Softmax as a Relaxed Argmax

Instead of:

$$\text{argmax}(x_1, x_2, ..., x_n)$$

We use:

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

  • Outputs a smooth probability distribution

  • Is fully differentiable

  • Allows learning through cross-entropy loss

In deep learning, softmax is the smooth bridge that lets us back-propagate through classification tasks.
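A minimal PyTorch sketch of that bridge: the same logits that break under argmax produce a smooth distribution under softmax, and cross-entropy sends a non-zero gradient to every logit.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)

# Softmax turns the raw scores into a smooth probability distribution.
probs = F.softmax(logits, dim=0)
print(probs)                               # ≈ [0.23, 0.63, 0.14]

# Cross-entropy against the true class gives a non-zero gradient for every logit.
target = torch.tensor([1])                 # "class 1" is the correct label here
loss = F.cross_entropy(logits.unsqueeze(0), target)
loss.backward()
print(logits.grad)                         # ≈ [0.23, -0.37, 0.14]: a usable learning signal
```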

Example 2: Straight-Through Estimator (STE)

In models like binary neural networks, we want 0/1 weights or activations, but the hard thresholding that produces them is not differentiable.

So we:

  1. Use a hard threshold in the forward pass

  2. Use a soft approximation in the backward pass

This is called the Straight-Through Estimator (STE) — a practical hack that makes training discrete models possible.
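A minimal sketch of the STE trick in PyTorch (`binarize_ste` is an illustrative name, not a library function): the detach trick keeps the hard values in the forward pass while routing gradients through the underlying real-valued weights as if the threshold were the identity.

```python
import torch

def binarize_ste(w):
    # Forward: hard 0/1 threshold (non-differentiable on its own).
    # Backward: (w - w.detach()) is zero in value but has gradient 1 w.r.t. w,
    # so gradients pass through as if the threshold were the identity.
    hard = (w > 0).float()
    return hard + (w - w.detach())

w = torch.randn(4, requires_grad=True)
b = binarize_ste(w)
b.sum().backward()
print(b)        # hard 0/1 values in the forward pass
print(w.grad)   # all ones: the gradient passed "straight through" the threshold
```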

Example 3: Gumbel-Softmax for Discrete Latent Variables

In variational autoencoders (VAEs) with categorical latent variables, gradients cannot flow through the discrete sampling step.

Gumbel-Softmax provides a reparameterizable, differentiable approximation of sampling from a categorical distribution.

It enables:

  • End-to-end differentiable training

  • Learning of complex, structured generative models
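A minimal sketch of the idea in PyTorch (`gumbel_softmax_sample` and the toy loss below are illustrative; PyTorch also ships `torch.nn.functional.gumbel_softmax`, which can additionally produce a hard one-hot output via the straight-through trick):

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0):
    # Add Gumbel(0, 1) noise to the logits, then take a temperature-scaled softmax.
    # As tau → 0 the output approaches a one-hot sample; for tau > 0 it stays smooth.
    u = torch.rand_like(logits)
    gumbel = -torch.log(-torch.log(u + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel) / tau, dim=-1)

logits = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)
sample = gumbel_softmax_sample(logits, tau=0.5)
print(sample)                                   # close to one-hot, but every entry is smooth

# Any downstream loss now sends gradients back to the logits (a toy loss for illustration).
loss = ((sample - torch.tensor([0.0, 0.0, 1.0])) ** 2).sum()
loss.backward()
print(logits.grad)                              # non-zero: the "sampling" step is trainable
```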

Real-World Applications

| Domain | Challenge | Relaxation Strategy |
| --- | --- | --- |
| Deep Learning | argmax in classification | softmax + cross-entropy |
| Reinforcement Learning | discrete actions | policy gradients, soft Q-learning |
| Neural Architecture Search | binary architecture choices | sigmoid relaxations, STE |
| NLP & Graphs | discrete tokens, structures | differentiable surrogates, Gumbel-softmax |
| Sorting & Ranking | top-k indices | differentiable approximations to sort (e.g., SoftSort) |

Why This Matters for You as an AI Engineer

Mastering differentiability and relaxation:

  • Opens up hard, discrete domains to gradient-based optimization

  • Allows training of models with symbolic or combinatorial structure

  • Enables end-to-end learning even in traditionally non-smooth spaces

It’s a bridge between pure math and pragmatic engineering.

This ability to convert "impossible-to-learn" into "trainable with a trick" is what separates a strong practitioner from a great AI systems designer.

Final Thought

Differentiability isn’t just about calculus. It’s about making learning possible. In every neural net, loss function, policy optimizer, and differentiable simulator, the same principle holds:

If you can smooth it, you can optimize it.

And that’s one of the most elegant tricks in the entire field.
