KL Divergence


Remember when we chatted about how AI models like Generative Adversarial Networks (GANs) learn to create? Well, there's another class of AI magicians, Variational Autoencoders (VAEs) and Diffusion Models, that achieves incredible feats of generation using a slightly different, but equally fascinating, approach. And at the heart of their magic lies a concept called Kullback-Leibler (KL) Divergence.

Now, if you've seen the formula for KL Divergence or the equations behind VAEs, they can look like a wall of Greek letters. But what if I told you that these intimidating formulas aren't just plucked out of thin air? What if we could actually build them together, step by logical step, to understand why they exist and what problem they solve?

Grab a coffee, because we're about to embark on a little mathematical adventure. We're going to uncover the core reason KL Divergence shows up in these models, especially how it's baked right into something called the Evidence Lower Bound (ELBO), which is crucial for training VAEs. But first let’s understand why VAEs were needed when we already had Autoencoders.


The Problem with Autoencoders: A Broken Blueprint

Before we had VAEs, there were regular Autoencoders. Imagine an Autoencoder as a diligent student tasked with compressing and then reconstructing an image. It has two parts: an encoder that shrinks an image down into a compact "code" (a vector of numbers), and a decoder that tries to rebuild the original image from that code. The goal? To make the reconstructed image look exactly like the original.

Sounds great, right? It's excellent for tasks like image compression or noise reduction. But here's the catch: the latent space (that "code" or compressed representation) that a standard Autoencoder learns is often discontinuous and disorganized.

t-SNE (t-distributed Stochastic Neighbor Embedding) is just a way to visualize the latent space of an autoencoder: it maps the learned codes into a lower-dimensional space so that similar data points land close together.

If you try to pick a random point in this learned latent space and ask the decoder to generate an image, you'll often get garbage. It's like having a blueprint where all the good designs are scattered in weird, disconnected spots, and most areas of the blueprint are just static. You can reconstruct existing designs perfectly, but you can't generate new, meaningful ones by simply exploring the space. This severely limits their creative power.
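To make this concrete, here's a minimal sketch of such an autoencoder in PyTorch. The layer sizes, the flattened-image input, and the MSE objective are illustrative assumptions on my part; the point is simply that the loss rewards reconstruction and puts no constraint whatsoever on where the codes land.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """A vanilla autoencoder: image -> code -> reconstruction."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: shrink the input down to a compact code.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: rebuild the input from that code.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        code = self.encoder(x)      # the "blueprint" point for this image
        recon = self.decoder(code)  # attempt to rebuild x from the code
        return recon, code

model = Autoencoder()
x = torch.rand(16, 784)                  # a dummy batch of flattened images
recon, code = model(x)
loss = nn.functional.mse_loss(recon, x)  # reconstruction error only -- nothing organizes the latent space
```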

VAEs: Making the Blueprint Usable for Creation

This is where Variational Autoencoders (VAEs) step in as a crucial upgrade. VAEs don't just learn a single "code" for an image; they learn a distribution (like a bell curve) for that image's latent code. And more importantly, they use KL Divergence (as we will derive!) to force this latent space to be continuous and well-structured, often resembling a simple, well-behaved distribution like a standard Gaussian. By imposing this "smoothness" and "regularity" on the latent space, VAEs create a usable blueprint. Now, you can pick any random point from that smooth, organized space, feed it to the decoder, and reliably generate a new, believable image. Think of a plain autoencoder as a robot that can replicate a "dish" from its "recipe book" but can't invent an edible recipe of its own when handed a list of ingredients; the VAE's structured latent space is exactly what removes that limitation. That structure is the key to its generative magic.

The Big AI Problem: The "Chicken and Egg" of Probabilities

Imagine we want our AI model to truly understand the world of, say, cat photos. We want it to learn the true probability distribution of cat photos, let's call it \(p(x)\). If our model could truly learn \(p(x)\), it could generate endless, perfect cat photos.

But here's the catch: \(p(x)\) is incredibly complex and high-dimensional. It's practically impossible to calculate directly. It's like trying to perfectly map every single grain of sand on every beach in the world.

So, how do VAEs approach this? They introduce a clever idea: latent variables, usually denoted \(z\), drawn from a distribution. Think of \(z\) as a compressed, abstract "code" for a cat photo. Maybe one part of \(z\) controls fluffiness, another controls eye color, and so on.

A VAE is designed to learn two things:

  1. A Decoder (\(\mathbf{p_{\theta}(x | z)}\)): This part learns how to take a latent code z and generate a cat photo x. (Think of it as creating a “dish" from the "ingredients")

  2. An Encoder (\(\mathbf{q_{\phi}(z|x)}\)): This part learns how to take a cat photo \(x\) and figure out what its latent code \(z\) should be. (Think of it as detecting "ingredients" from the “dish” itself.)

And here's where we hit our first roadblock – our "chicken and egg problem," as a wise person once put it:

To fully understand the distribution of photos \(p(x)\), we'd ideally know the true posterior distribution \(p(z|x)\), which tells us the exact latent code for a given photo.

But to know \(p(z|x)\), we'd need to know \(p(x)\).

Why? Because of Bayes' Rule. It's a fundamental theorem in probability that relates conditional probabilities. It tells us that:

$$p(z \mid x) = \frac{p(x \mid z) \cdot p(z)} {p(x)}$$

Also, for those unaware of the jargon, in Bayesian statistics:

  • Prior: Your belief about a parameter before seeing any data.

  • Posterior: Your updated belief about a parameter after incorporating the evidence from the data.

Let's look at the components of this equation in our VAE context:

  • \(\mathbf{p(z|x)}\): This is the true posterior we want to know, the probability of a latent code \(z\) given a specific image \(x\).

  • \(\mathbf{p(x|z)}\): This is what our decoder models, the probability of an image \(x\) given a latent code \(z\). We can train this part.

  • \(\mathbf{p(z)}\): This is the prior distribution over our latent space, a simple distribution (like a standard Gaussian) that we choose ourselves. So, we know this.

  • \(\mathbf{p(x)}\): This is the true data distribution – the very thing we initially identified as intractable :( and impossible to compute!

Do you see the circular dependency? To calculate the true \(p(z|x)\) for our encoder, we need \(p(x)\), which we can't compute. So, not only is \(p(x)\) intractable, but its intractability directly makes the true posterior \(\boldsymbol{p(z|x)}\) intractable as well.

We can't directly get our hands on either the true data distribution or the true latent code distribution given data. This is the core chicken and egg problem VAEs elegantly sidestep.
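To see the roles of these pieces concretely, here's a toy example (the numbers are invented purely for illustration) with a discrete latent variable. With only three latent values the evidence \(p(x)\) is a trivial sum; for the continuous, high-dimensional \(z\) of a real VAE, that sum becomes an integral over the entire latent space, which is exactly the intractable quantity we just ran into.

```python
import numpy as np

# Toy world: 3 possible latent "codes" z and 2 possible images x.
p_z = np.array([0.5, 0.3, 0.2])              # prior p(z), chosen by us
p_x_given_z = np.array([[0.9, 0.1],           # decoder p(x|z): rows index z, columns index x
                        [0.4, 0.6],
                        [0.2, 0.8]])

# Marginal likelihood ("evidence"): p(x) = sum over z of p(x|z) p(z)
p_x = p_z @ p_x_given_z
print("p(x):", p_x)                           # [0.61, 0.39]

# Bayes' rule: p(z|x) = p(x|z) p(z) / p(x), here for the first image x=0
p_z_given_x0 = p_x_given_z[:, 0] * p_z / p_x[0]
print("p(z|x=0):", p_z_given_x0)

# With 3 latent values this sum is trivial. For a VAE, z is a continuous,
# high-dimensional vector, so the "sum" becomes an integral over all of z --
# that integral is the intractable p(x) blocking the true posterior.
```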


The Variational "Workaround": Our Best Guess

Since we can't get the true posterior \(p(z|x)\), VAEs do something very clever: they learn an approximate posterior distribution, let's call it \(q_{\phi}(z|x)\).

This \(q_{\phi}(z|x)\) is what our encoder outputs – it's our model's best guess at what the latent code \(z\) for a given photo \(x\) should be. The \(\phi\) just reminds us that this distribution is controlled by our encoder's parameters, just as \(\theta\) denotes our decoder's parameters.

Our goal now becomes to make \(\boldsymbol{q_{\phi}(z|x)}\) as close as possible to the true but unknown \(p(z|x)\). If we can do that, then our encoder is doing a good job of finding meaningful latent codes.

But wait, how do we measure "close" if we don't know \(p(z|x)\)? This is where the Evidence Lower Bound (ELBO) comes into play. It's a brilliant trick that allows us to train our model without needing to calculate the elusive \(p(x)\) or \(p(z|x)\) directly.


Building the ELBO: Our Mathematical Journey Begins

Our ultimate goal in training a generative model is to make it learn the true data distribution, \(p(x)\). We want to maximize the likelihood of our observed data under our model. So, let's start with what we want to maximize: \(\log p_\theta(x)\) (the logarithm makes the math nicer, especially when probabilities get multiplied together, and the \(\theta\) reminds us that these are the decoder's parameters, which we train so the model gets as close as possible to the true distribution).

Here's the first magic trick: "Multiply by 1" (the clever way).

We know that if we integrate any probability distribution over all its possible values, the result is 1. So, we can "multiply"

$$\log p_\theta(x) \cdot \int q_{\phi}(z \mid x)\, dz$$

because

$$\int_{-\infty}^{\infty} q_{\phi}(z \mid x)dz = 1$$

This seems random, but trust me, it's a key step.

$$\log p_\theta(x) = \log p_\theta(x) \int_{-\infty}^{\infty} q_\phi(z|x) dz$$

Now, let's bring \(log \, p_\theta(x)\) inside the integral. Since \(log \, p_\theta(x)\) doesn't depend on \(z\), it can slide right in:

$$\log p_\theta(x) = \int_{-\infty}^{\infty} q_\phi(z|x) \log p_\theta(x) dz$$

The expression above is actually the expected value of \(\log p_{\theta}(x)\) with respect to our approximate posterior \(q_{\phi}(z|x)\).

Think of an expected value as a weighted average. When we write \(\mathbb{E}_A[B]\), it means "the average value of B, where each possible value of B is weighted by its probability according to distribution A."

In our specific case:

  • The distribution we're averaging over is \(q_{\phi}(z \mid x)\). This is our encoder's learned probability distribution over the latent variable \(z\) given an input \(x\). The subscript \(q_{\phi}(z \mid x)\) under the \(\mathbb{E}\) tells us that \(z\) is the random variable we're integrating over, and its probability is given by \(q_{\phi}(z|x)\).

  • The quantity we're calculating the average of is \(log \, p_\theta(x)\). Even though \(log \, p_\theta(x)\) itself doesn't explicitly contain \(z\), we're integrating it with respect to \(z\) (multiplied by \(q_{\phi}(z|x)\)) over the entire latent space. This step is a mathematical maneuver that allows us to eventually introduce terms that do depend on \(z\) and relate them to the intractable \(p_\theta(x)\).

In fancy math notation, it's:

$$\log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x)]$$

So far, so good. We haven't changed anything, just rewritten it.
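If the "weighted average" framing feels abstract, here's a tiny Monte Carlo sanity check (the Gaussian \(q\) and the function being averaged are arbitrary choices of mine): an expectation under \(q\) is just the average of the quantity over samples drawn from \(q\), which is also how VAEs estimate these expectations in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend q(z|x) is a Gaussian the encoder produced for one image x.
mu, sigma = 1.5, 0.5

def f(z):
    # Any quantity we might want to average under q -- later this role is
    # played by log-ratios like log p(x,z) / q(z|x).
    return np.log1p(z ** 2)

# E_q[f(z)] approximated as a plain average over samples drawn from q.
z_samples = rng.normal(mu, sigma, size=100_000)
print("Monte Carlo estimate of E_q[f(z)]:", f(z_samples).mean())
```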


Introducing the Bayes' Rule Twist

Remember how we said \(p_{\theta}(x)\) is hard to calculate? Let's use Bayes' Rule to relate \(p_\theta(x)\) to things we CAN work with:

We know that the joint probability of \(x\) and \(z\) can be written in two ways:

$$p_\theta(x,z) = p_\theta(x|z)p(z)$$

(This is our decoder's job, and \(p(z)\) is our simple prior)

$$p_\theta(x,z) = p_\theta(z|x)p_\theta(x)$$

(This would be our true posterior, which is hard to compute!)

From the second line, we can rearrange to get

$$p_\theta(x) = \frac{p_\theta(x,z)}{p_\theta(z|x)}$$

Let's plug this into our expectation:

$$\log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{p_\theta(z|x)}\right]$$

Okay, now for another clever "multiply by 1" trick. This time, we're going to multiply the numerator and denominator inside the logarithm by our approximate posterior \(q_{\phi}(z|x)\):

$$\log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{p_\theta(z|x)} \cdot \frac{q_\phi(z|x)}{q_\phi(z|x)}\right]$$

Rearranging the terms a bit inside the \(log\) to get something familiar:

$$\log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}\left[\log \left( \frac{p_\theta(x,z)}{q_\phi(z|x)} \cdot \frac{q_\phi(z|x)}{p_\theta(z|x)} \right)\right]$$

Using the logarithm property \(log(AB) = log(A) + log(B)\), we can split this expectation into two parts:

$$\log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right] + \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right]$$

And look at that second term!


The Grand Reveal: KL Divergence Appears!

The second term:

$$\mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right]$$

is exactly the definition of KL Divergence between \(\boldsymbol{q_{\phi}(z|x)}\) and \(\boldsymbol{p_{\theta}(z|x)}\) – the approximate and true posteriors, which we want to bring as close to each other as possible (in practice, \(q_\phi(z|x)\) is modelled as a Gaussian).

$$KL(q_\phi(z|x) || p_\theta(z|x)) = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right]$$
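To make that expectation tangible, here's a quick numerical check (my own illustration, not tied to any particular library or paper): estimate \(\mathbb{E}_{q}[\log q(z) - \log p(z)]\) by sampling from \(q\), and compare it against the known closed-form KL between two univariate Gaussians.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Two Gaussians: q (our approximation) and p (the target).
mu_q, sigma_q = 1.0, 0.8
mu_p, sigma_p = 0.0, 1.0

# Monte Carlo estimate of KL(q || p) = E_q[ log q(z) - log p(z) ]
z = rng.normal(mu_q, sigma_q, size=200_000)
kl_mc = np.mean(norm.logpdf(z, mu_q, sigma_q) - norm.logpdf(z, mu_p, sigma_p))

# Closed form for two univariate Gaussians.
kl_exact = (np.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2) - 0.5)

print(f"Monte Carlo: {kl_mc:.4f}, closed form: {kl_exact:.4f}")  # the two agree closely
```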

Think of it like this: you're a translator tasked with converting messages from a native language (distribution P, the true distribution) to a foreign language (distribution Q, the model's approximation). KL divergence, specifically \(KL(P||Q)\), represents the extra effort or misunderstanding that occurs when you use the foreign language to express ideas that are naturally spoken in the native language. If the two languages are very similar (P and Q are close), the translation is easy, and the KL divergence is small. If they're very different, the translation is cumbersome, and the KL divergence is large.

Now, to emphasize the asymmetry of KL divergence, suppose you're fluent in the native vernacular (P), but you're forced to express your deepest thoughts in a clunky, half-learned foreign language (Q). Every word feels like walking through a field full of landmines, meanings lost in the stutter and fumble. That's \(KL(P || Q)\), the hefty toll of forcing the true story into a mismatched mold.

But now, reverse the canvas. You're a foreigner (Q) trying to chat in the native vernacular (P). Picture a well-educated person going abroad: you were born and brought up in India with Hindi speakers all around you, yet you'll have a far easier time conversing with a visitor from America than they would have learning Hindi just to converse with you. It would still be a kerfuffle, only a less brutal one.

This lopsidedness is the soul of KL divergence's charm. It isn't a fair-and-square distance between two points. It's more like trying to pour room-temperature water into a stencil shaped like the ice you want: you can never convert one distribution exactly into the other, only approximate it as well as possible.
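The asymmetry is easy to see numerically. Reusing the closed-form Gaussian KL from above, take a broad "native" distribution P and a narrow "foreign" one Q (the numbers are arbitrary); the two directions come out very different:

```python
import numpy as np

def gaussian_kl(mu_a, sigma_a, mu_b, sigma_b):
    """Closed-form KL( N(mu_a, sigma_a^2) || N(mu_b, sigma_b^2) )."""
    return (np.log(sigma_b / sigma_a)
            + (sigma_a**2 + (mu_a - mu_b)**2) / (2 * sigma_b**2) - 0.5)

# P: broad "native language", Q: narrow "foreign language".
mu_p, sigma_p = 0.0, 2.0
mu_q, sigma_q = 0.0, 0.5

print("KL(P || Q):", gaussian_kl(mu_p, sigma_p, mu_q, sigma_q))  # ~6.11: Q struggles to cover P
print("KL(Q || P):", gaussian_kl(mu_q, sigma_q, mu_p, sigma_p))  # ~0.92: much milder
```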

So, we can rewrite our expression for \(log \, p_{\theta}(x)\) as:

$$\log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right] + KL(q_\phi(z|x) || p_\theta(z|x))$$

This is a beautiful identity! It shows that the true data likelihood, \(\log p_\theta(x)\), is equal to the first term PLUS the KL Divergence. What might the first term be? Let's find out.


The Evidence Lower Bound (ELBO): Our Trainable Objective

Now, remember a crucial property of KL Divergence: it's always non-negative!

$$KL(P || Q) \ge 0$$

Since \(KL(q_{\phi}(z|x) || p_{\theta}(z|x))\) is always greater than or equal to zero, we can deduce something very important:

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right]$$

That term on the right is our Evidence Lower Bound (ELBO)!

$$\text{ELBO} = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right]$$

Since the KL divergence is at the very least 0, dropping it can only make the right-hand side smaller or leave it unchanged, so this first term sits at or below \(\log p_\theta(x)\) – hence the "Lower Bound".

Now, why is it called "evidence"?

Imagine you have a model \(p_\theta(x)\) and you observe some data \(x\). The probability \(p_\theta (x)\) tells you how likely it is to observe the data \(x\), given your model. If \(p_\theta(x)\) is high, it means your data is "good evidence" for your model, suggesting your model is a good fit for the data generating process.

If \(p_\theta(x)\) is low, the data isn't strong "evidence" for your model. It's akin to how likely a crime scene (data) is, given a particular suspect (model).

Fun Reference: If you've watched the show "The Mentalist", there's a special "smiley face"; if it's the first thing you see at the crime scene, it drives the likelihood of Red John being behind the murder close to one. :)

This means the ELBO is always a lower bound on the true log-likelihood of our data. If we maximize the ELBO, we are also maximizing the true log-likelihood (or at least, we're pushing it up as much as possible).

Let's expand that ELBO term a little further, using the property \(log(A/B) = log(A) - log(B)\):

$$\text{ELBO} = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x,z) - \log q_\phi(z|x)\right]$$

And using \(p_\theta(x,z) = p_\theta(x \mid z)\,p(z)\):

$$\text{ELBO} = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z) + \log p(z) - \log q_\phi(z|x)\right]$$

Rearranging the terms within the expectation:

$$\text{ELBO} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + \mathbb{E}_{q_\phi(z|x)}[\log p(z) - \log q_\phi(z|x)]$$

And that second expectation? It's simply the negative of a KL Divergence! (Specifically, \(KL(q_\phi(z|x) || p(z))\). This is the second KL divergence term we've encountered; notice which distributions are being compared in each term, so you can tell the two apart.)

This term represents the Kullback-Leibler (KL) divergence between two probability distributions:

  • \(\boldsymbol{q_{\phi}(z|x)}\): This is the approximate posterior distribution. It's the distribution over the latent variables \(z\) that the encoder learns for a given input data point \(x\). This is our learned belief about what \(z\) should be, given \(x\).

  • \(\boldsymbol{p(z)}\): This is the prior distribution over the latent variables \(z\). It's a simple, predefined distribution (usually a standard normal distribution, i.e., a Gaussian with mean 0 and variance 1, sometimes also written as \(\mathcal{N}(0, I)\)) that we assume the latent space should generally follow. This is our initial belief about \(z\) before seeing any data.

The \(KL(q_{\phi}(z|x) || p(z))\) term in the VAE's loss function acts as a regularization penalty. Its job is to force the approximate posterior \(\boldsymbol{q_{\phi}(z|x)}\) to be as close as possible to our chosen prior \(\boldsymbol{p(z)}\).

So, the ELBO ultimately boils down to:

$$\text{ELBO} = \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - KL(q_\phi(z|x) || p(z))$$

This is the exact form of the ELBO you often see in VAE papers! We built it, piece by piece!🎉🎉🎉
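In code, it's the negative of this expression that gets minimized. Below is a hedged sketch of the loss, assuming the textbook setup of a diagonal-Gaussian \(q_\phi(z|x)\), a standard-normal prior, and a Bernoulli decoder; nothing here is specific to any particular implementation.

```python
import torch
import torch.nn.functional as F

def negative_elbo(recon_logits, x, mu, logvar):
    """-ELBO = reconstruction loss + KL(q_phi(z|x) || N(0, I)).

    recon_logits : decoder output p_theta(x|z), as logits
    x            : original input with values in [0, 1]
    mu, logvar   : parameters of q_phi(z|x) produced by the encoder
    """
    # E_q[log p_theta(x|z)], approximated with the single sampled z that
    # produced recon_logits (the standard one-sample estimate).
    recon = F.binary_cross_entropy_with_logits(recon_logits, x, reduction="sum")

    # Closed-form KL between N(mu, diag(sigma^2)) and N(0, I):
    # -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon + kl  # minimizing this maximizes the ELBO
```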

The Infamous Reparameterization Trick

Now, you might be wondering: if the encoder outputs a distribution \(q_{\phi}(z | x)\), how do we actually sample a latent vector \(z\) from it to feed into the decoder, and more importantly, how do we backpropagate gradients through that sampling process to train the encoder?

This is where standard autoencoders had it easy, they just output a fixed latent vector. But for VAEs, if we simply sampled \(z\) directly from \(q_{\phi}(z|x)\), the sampling operation itself is non-differentiable. You can't pass gradients through a random "coin flip"! This would completely block the backpropagation of errors from the decoder back to the encoder, making the whole system untrainable.

Enter the brilliant solution proposed by Kingma and Welling (2013): the Reparameterization Trick.

Here's how it works, step by logical step:

  1. Encoder Outputs Parameters, Not Samples: Instead of the encoder directly outputting a sample \(z\), it outputs the parameters that define the distribution \(q_{\phi}(z|x)\). If we assume \(q_{\phi}(z|x)\) is a Gaussian, the encoder outputs a mean vector \(\mu\) and a variance vector \(\sigma^2\) (or, more commonly, its logarithm \(\log \sigma^2\), to ensure positivity). You can simulate this yourself in Desmos by taking the Gaussian density \(\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}\) and varying \(\mu\) and \(\sigma\) to see how its shape changes.

  2. Introduce an Auxiliary Random Variable: We then introduce a new, independent random variable, \(\epsilon\), which is sampled from a simple, fixed distribution that doesn't depend on our network parameters. For a Gaussian, \(\epsilon\) is sampled from a standard normal distribution, \(\mathcal{N}(0, I)\) (mean 0, variance 1).

  3. Construct z Deterministically: We then compute our latent vector \(z\) using a deterministic function of the encoder's outputs (\(\mu\) and \(\sigma\)) and our auxiliary random variable \(\epsilon\):

    So now, \(z = \mu + \epsilon \cdot\sigma\) (Add mean to a scaled standard deviation)

    Here, \(\sigma\) is the standard deviation.

    Let's look at the components:

    • \(\mu\): This is the mean output from your encoder for a given input \(x\).

    • \(\sigma\): This is the standard deviation output from your encoder for a given input \(x\).

    • \(\epsilon\): A random value sampled from a standard normal distribution \(\mathcal{N}(0, I)\).

  4. Why This is Genius: The key insight is that the randomness (the part we can't backpropagate through) is now isolated in \(\epsilon\), which is outside the computation graph that depends on the network's trainable parameters (\(\mu\) and \(\sigma\)).

    The operations of adding \(\mu\) and scaling by \(\sigma\) are deterministic and fully differentiable🎉🎉. This allows gradients to flow smoothly through \(\mu\) and \(\sigma\) back to the encoder's weights.

In essence, the reparameterization trick effectively "moves" the sampling operation out of the direct path of backpropagation, making the entire VAE model end-to-end trainable with gradient descent, which is fundamental to their success.
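Here's what the trick looks like as a few lines of PyTorch (a minimal sketch; the tensor shapes are placeholders). The final lines confirm that gradients really do reach \(\mu\) and \(\log \sigma^2\) despite the sampling.

```python
import torch

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, sigma^2) in a differentiable way."""
    std = torch.exp(0.5 * logvar)   # sigma = exp(log sigma^2 / 2)
    eps = torch.randn_like(std)     # epsilon ~ N(0, I): the only random part
    return mu + eps * std           # deterministic in mu and sigma -> gradients flow

# Gradients reach mu and logvar even though z is random:
mu = torch.zeros(4, requires_grad=True)
logvar = torch.zeros(4, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()
print(mu.grad)      # all ones: dz/dmu = 1
print(logvar.grad)  # nonzero: gradient also flows through sigma
```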


The ELBO: A Balancing Act of Two Goals

Now that we've derived it, let's understand what maximizing this ELBO actually means for our VAE:

$$\text{Maximize ELBO} \equiv \text{Maximize } \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] \ \text{ while Minimizing } \ KL(q_\phi(z|x) || p(z))$$

Think of it like managing a budget to maximize your "profit":

$$\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] \;\; \textbf{(Maximize Reconstruction Likelihood)}:$$

This is your "Revenue" term. It encourages the decoder to reconstruct the original input \(x\) as accurately as possible from the latent code \(z\) provided by the encoder. We want the probability of reconstructing the correct \(x\) to be very high. This is the VAE's "fidelity" component.

$$KL(q_\phi(z|x) || p(z)) \, \, (\textbf{Minimize Latent Space Divergence}):$$

This is your "Cost" term. It measures how much our encoder's learned latent distribution \(q_{\phi}(z|x)\) diverges from our desired prior distribution \(p(z)\) (e.g., a simple Gaussian). We want this cost to be as small as possible.

Why is this a "cost"? Because a large divergence means our encoder is putting latent codes into weird, scattered places that don't match our nice, smooth prior. If we want to generate new samples by drawing from the prior \(p(z)\), our encoder needs to be mapping real data into a space that looks like that prior. This KL term is the "regularizer" that keeps the latent space organized and prevents the model from just memorizing data. It ensures the latent space is smooth and continuous, making it truly generative.

The approximate distribution is sometimes also referred to as the surrogate distribution. If you plot two Gaussians and vary their means and variances, you'll notice that the closer the surrogate gets to the target, the smaller the KL term becomes and the higher the ELBO climbs; the further apart they drift, the larger the KL term and the lower the ELBO.
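You can reproduce the gist of that behaviour in a few lines. Here the "true posterior" is just a stand-in Gaussian and \(\log p(x)\) is an arbitrary constant I picked for illustration; as the surrogate \(q\) slides toward the target, the KL shrinks and the ELBO climbs toward \(\log p(x)\).

```python
import numpy as np

def gaussian_kl(mu_a, sigma_a, mu_b, sigma_b):
    """Closed-form KL( N(mu_a, sigma_a^2) || N(mu_b, sigma_b^2) )."""
    return (np.log(sigma_b / sigma_a)
            + (sigma_a**2 + (mu_a - mu_b)**2) / (2 * sigma_b**2) - 0.5)

log_px = -3.0                      # pretend evidence: fixed, no matter what we do to q
mu_true, sigma_true = 0.0, 1.0     # stand-in for the true posterior p(z|x)

for mu_q in [3.0, 2.0, 1.0, 0.0]:  # slide the surrogate q toward the posterior
    kl = gaussian_kl(mu_q, 1.0, mu_true, sigma_true)
    print(f"mu_q={mu_q:4.1f}  KL={kl:5.2f}  ELBO={log_px - kl:6.2f}")
```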

So, training a VAE means finding the sweet spot: reconstruct images well, and keep your latent codes organized and similar to your chosen prior. Notice also that the \(\log p(x)\) term itself doesn't change: it's a constant, the log-likelihood of the data, which stays fixed no matter how we adjust the surrogate.


A Quick Peek at Diffusion Models: KL in Every Step

While the full ELBO derivation for Diffusion Models is more complex, the core idea of KL Divergence persists.

Diffusion models learn to reverse a gradual noising process (denoising). They go from pure noise back to a clean image over many small steps. At each step, the model learns to transform a noisy image \(x_t\) into a slightly cleaner one \(x_{t-1}\).

The training objective for these models essentially involves minimizing KL Divergence terms that look like this at each step:

$$KL(q(x_{t-1} | x_t, x_0) || p_\theta(x_{t-1} | x_t))$$

  • \(\boldsymbol{q(x_{t-1} | x_t, x_0)}\): This is the mathematically derivable "true" way to denoise one step, knowing the original image \(x_0\).

  • \(\boldsymbol{p_\theta(x_{t-1} | x_t)}\): This is what our neural network (our model) learns to do, its best guess at how to denoise \(x_t\) to get \(x_{t-1}\).

By minimizing this KL term at every single step, the diffusion model learns to mimic the true denoising process as closely as possible. It's like teaching a sculptor to make the perfect tiny chisel stroke at each stage, so the final masterpiece comes out flawless.
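Both distributions in that per-step KL are Gaussians, and in the common setting where they share a fixed variance, the KL collapses to a scaled squared distance between their means, which is why diffusion training losses end up looking like a simple mean-squared error on the predicted noise. A tiny sketch under that assumption:

```python
import numpy as np

def gaussian_kl_same_var(mu_q, mu_p, sigma):
    """KL between two Gaussians that share variance sigma^2:
    only the squared distance between the means survives."""
    return np.sum((mu_q - mu_p) ** 2) / (2 * sigma ** 2)

# mu_q: mean of the "true" denoising step q(x_{t-1} | x_t, x_0)
# mu_p: mean predicted by the network p_theta(x_{t-1} | x_t)
mu_q = np.array([0.3, -0.1, 0.7])
mu_p = np.array([0.25, -0.05, 0.9])
print(gaussian_kl_same_var(mu_q, mu_p, sigma=0.1))  # shrink this at every step
```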


The Takeaway: KL Divergence is AI's Quality Control

We've peeled back the layers of KL Divergence, not just defined it, but seen it emerge from the fundamental problem of training generative models. It's the mathematical tool that allows VAEs to learn meaningful latent spaces and Diffusion Models to master the art of sequential denoising.

So, the next time you see a stunning AI-generated image, remember the humble KL Divergence – the silent quality control expert 🧐 working behind the scenes.


Written by

Kunal Nayyar