I'll Never Forget Bayes's Theorem Again


Mathematical intuition is, as the LLMs say, crucial to technologists and engineers. We can "chi by eye," "guesstimate," or, better yet, solve problems with far less effort and error because we know how things work and why they work that way. Bayes's Theorem was a concept I struggled to grasp intuitively for a long time. I could apply the formula, but I didn't get it. The common teaching methods, like confusion matrices and Venn diagrams, helped with the mechanics, but they didn't provide the deeper insight I was looking for. So, I memorized it and applied it, but I needed a more intuitive way to think about it. Last week, I found an approach that finally made it click, and this post shares that method.
The Gist of Bayes's Theorem
Bayes's Theorem, at its core, is about two key ideas.
The First Key Idea: Prior Information
First, you start with some prior knowledge—background information or assumptions about a situation. Then, you gather new evidence that updates your understanding.
Let’s use a simple weather example to illustrate this. Suppose you’re in Austin, TX, and you’ve analyzed historical weather data. You find that:
- It’s sunny 300 days out of the year, or about 82% of the time.
Nothing too complicated here. Austin is in the sunbelt. We get lots of sun.
What we know at this point is that, overall, you have an 82% chance of sun and therefore an 18% chance of clouds. You can imagine standing on the "100% of days" box and looking down at your options: sunny or not, 82% or 18%.
Now, I want to know about rain. Can I expect rain or not?
The second half of the first idea, still relating to this prior knowledge, is the likelihood of rain given clouds or not. We can just go look this up in the records, too. Was it cloudy or not? Did it rain or not? From some simple bookkeeping, we see:
On sunny days, it rains only 5% of the time (so 95% of the time it doesn't rain). This makes intuitive sense. If it's sunny, it's pretty unlikely to rain.
On cloudy days, it rains 40% of the time (so 60% of the time it doesn't rain). This also makes intuitive sense: clouds don't guarantee rain, but they do make rain more likely than sun does.
On sunny days, there's a 5% chance of rain. On cloudy days, that chance increases to 40%. This all makes up our prior knowledge.
The Second Key Idea: New Information Brings New Beliefs
But now, you look up and notice that the sky is cloudy. This is your new observation—the fresh piece of information that changes the situation. You use this new observation to update your beliefs.
Remember we just learned it's not sunny? So, intuitively, imagine walking from "100% of days" down to "18% Not Sunny" and looking at your options of rain or no rain. Notice that from here, the chance of rain differs from when you stand on the "82% Sunny" box and look down at your options over there. This should hopefully all just make sense, as it's pretty intuitive. Initially, you knew there was an 82% chance of sun on any given day. But now that you've seen clouds, and you know rain is more likely on cloudy days, you should adjust your expectations. Bayes's Theorem helps you put an actual number to this updated scenario.
In short, Bayes's Theorem is about starting with what you know, incorporating new evidence, and updating your beliefs accordingly. It’s a powerful tool for making sense of uncertainty in a logical, data-driven way.
Here’s where Bayes's Theorem shines. It lets us flip the question and learn something new.
Flipping the Question with Bayes's Theorem
The key idea is that Bayes's Theorem combines prior knowledge (what you knew before) with new evidence (what you just observed) to update your understanding. It’s a powerful tool for making sense of uncertainty and answering questions that seem backward at first glance. I'll show you this explicitly, later on. First, let's talk about why Bayes's Theorem can stump people.
Where Lots of Folks Get Tripped Up
Let's revisit the scenario from above real quick. Here are some places where people go astray.
If it's sunny, notice we have a 5% chance of rain and a 95% chance of no rain. Those percentages total 100%; however, it's 100% of the 82% of days that are sunny; it is NOT 100% of all days. Similarly for not sunny: 40% + 60% = 100% of the 18% of the time it's cloudy. I think this confusion arises from the notation of conditional probabilities.
Now, let's think about chances of rain overall. Just because you know it's sunny does not mean you suddenly have a flat 5% chance of rain overall. The 5% chance of rain only applies when it is sunny, which happens 82% of the time. If you ignore the fact that rain also happens on cloudy days (40% chance when not sunny), you are effectively ignoring the contribution of the 18% of the time when it is not sunny.
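To make that concrete, here's a quick sketch in Python (the variable names are my own) that weights each branch's rain chance by how often that branch happens:

```python
# Overall chance of rain: weight each branch's rain chance by how
# often that branch occurs (the law of total probability).
p_sunny = 0.82
p_cloudy = 1 - p_sunny              # 0.18
p_rain_given_sunny = 0.05
p_rain_given_cloudy = 0.40

p_rain = p_sunny * p_rain_given_sunny + p_cloudy * p_rain_given_cloudy
print(round(p_rain, 3))  # 0.041 + 0.072 = 0.113, about 11.3% overall
```

Notice the overall rain chance, 11.3%, sits between the sunny-day 5% and the cloudy-day 40%, pulled strongly toward 5% because sunny days dominate.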
Why I Wrote This
Bayes's Theorem formalizes the process of updating probabilistic beliefs in light of new evidence. It states that the posterior probability of a hypothesis H, given observed evidence E, is proportional to the product of the prior probability of H and the likelihood of observing E under H. Mathematically, this is expressed as:

P(H | E) = P(E | H) × P(H) / P(E)

where:
P(H | E) is the posterior probability—the updated belief in H after observing E,
P(E | H) is the likelihood—the probability of observing E if H is true,
P(H) is the prior probability—the initial belief in H before observing E,
P(E) is the marginal likelihood—the total probability of observing E under all possible hypotheses.
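The formula translates directly into code. Here's a generic sketch (the function name is my own), with P(E) expanded via the law of total probability:

```python
def posterior(p_h, p_e_given_h, p_e_given_not_h):
    """P(H | E) from the prior P(H) and the two likelihoods."""
    # P(E) = P(E | H) * P(H) + P(E | not H) * P(not H)
    p_e = p_h * p_e_given_h + (1 - p_h) * p_e_given_not_h
    return p_h * p_e_given_h / p_e

# Flipping the weather question: P(cloudy | rain)?
# Prior P(cloudy) = 0.18; rain is 40% likely given clouds, 5% given sun.
print(round(posterior(0.18, 0.40, 0.05), 3))  # ~0.637
```

In other words, seeing rain moves your belief in clouds from the 18% prior all the way up to about 64%.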
In essence, Bayes's Theorem provides a rigorous framework for integrating prior knowledge with new data to refine our understanding of uncertain events.
To many, the above is inscrutable, or at least not intuitive. That's why I'm writing this: I finally found a way that makes it easy for me to think about this and thus calculate the answers we seek. I already alluded to it above, but here I'll show you exactly what I do.
The Method
The realization for me was that I could draw this out simply as a small tree. To make the numbers easy, just start with 100% and break down the 100%. It’s like having a starting point, gathering fresh clues, and then refining our understanding. Let me explain this with a simple example: your dog’s happiness and tail wagging.
Starting with What We Know
Imagine you’ve observed your dog 100 times. Out of those 100 observations, your dog is happy 90 times and not happy 10 times. This is your prior information—the background knowledge you start with before considering any new evidence.
Good dog.
Adding New Information
Now, let’s add what you know about your dog’s tail wagging. When your dog is happy, he wags his tail 70% of the time. Note, happiness and tail wagging are two very different things. Doing very simple math, that’s 63 times out of those 90 happy observations (70% wagging × 90 happy = 63 wagging given happiness). The other 30% of the time, he doesn’t wag his tail—that’s 27 times. Notice now that 63 + 27 = 90. It does NOT equal 100.
Next, your dog's not always happy. When your dog is not happy, he wags his tail only 10% of the time. That’s 1 time out of those 10 not-happy observations. The other 90% of the time, he doesn’t wag his tail—that’s 9 times. You now know how tail wagging relates to your dog’s actual happiness.
Here we add the "likelihoods." E.g., how "likely" is the dog to be wagging his tail given he's happy? Clearly, more likely than when he's not happy.
Observing the Key Points
Before diving into calculations, let’s notice a few important things:
First, your dog can only be in one state at a time: happy or not happy. These two states cover all possibilities. If you add up the happy and not-happy observations (90 + 10), you get 100—every observation is accounted for. This is what we call mutually exclusive, collectively exhaustive (MECE).
Second, tail wagging overlaps between the two states. Your dog can wag his tail when he’s happy and when he’s not happy. He can not wag his tail in both states, too. This overlap is critical and often trips people up. It’s why we need Bayes's Theorem to help us sort things out!
Again, this is where Bayes's Theorem really shines! For example, suppose you see your dog not wagging his tail. What’s the probability he’s happy? From the numbers, we know there are 36 times your dog doesn’t wag his tail: 27 times when he’s happy and 9 times when he’s not happy. So, the probability your dog is happy given he’s not wagging is 27 divided by 36, or 75%. Wait, what—oh, exactly. Even though he's not wagging his tail, he's happy so often!
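The whole dog tree fits in a few lines of Python, using the counts worked out above:

```python
# 100 observations split into the four leaves of the tree.
wag_happy, no_wag_happy = 63, 27        # 70% / 30% of the 90 happy days
wag_not_happy, no_wag_not_happy = 1, 9  # 10% / 90% of the 10 not-happy days

# P(happy | not wagging): happy-and-not-wagging over ALL not-wagging.
p_happy_given_no_wag = no_wag_happy / (no_wag_happy + no_wag_not_happy)
print(p_happy_given_no_wag)  # 27 / 36 = 0.75
```

The denominator is the key move: it pools the not-wagging counts from both sides of the tree, which is exactly the overlap that trips people up.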
Example: Predicting Computer Downtime Based on CPU Usage
You’re managing a server and want to predict if it will experience downtime based on its CPU usage. You know:
On any given day, the probability the server goes down is 2%. (100 days = 2 downtime + 98 no downtime)
If the server is about to experience downtime, there's an 85% chance that CPU usage will spike above 90%. (Of the 2 downtime days: 2 × 85% = 1.7 with high CPU, and 2 × 15% = 0.3 with normal CPU.)
If the server is not about to experience downtime, there's still a 10% chance that CPU usage will spike above 90%. (Of the 98 no-downtime days: 98 × 10% = 9.8 with high CPU, and 98 × 90% = 88.2 with normal CPU.)
You notice that CPU usage has spiked above 90%. What’s the probability that the server will experience downtime?
Start by drawing the graph:
Start at the top, break down what you know.
And, now the math is really easy. The probability that the server will experience downtime given a spike in CPU is just the amount of time the server goes down with high CPU divided by the total time the CPU spikes: (1.7 / (1.7 + 9.8)) = 14.78%. So, the CPU can spike and the machine goes down (or not). Because it's so rare for the machine to go down, even if the CPU spikes, it's still rather unlikely the machine goes down.
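Here's that arithmetic in Python; note that the denominator is all the days the CPU spikes, drawn from both sides of the tree:

```python
# Per 100 days: 2 downtime days and 98 healthy days.
downtime_high_cpu = 2 * 0.85      # 1.7 days: downtime with a CPU spike
healthy_high_cpu = 98 * 0.10      # 9.8 days: CPU spike, but no downtime

p_down_given_spike = downtime_high_cpu / (downtime_high_cpu + healthy_high_cpu)
print(round(p_down_given_spike, 4))  # ~0.1478
```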
Example: Sports Analytics
Scenario: Suppose a player is "in the zone" 40% of the time (prior probability). Analysis shows they make 70% of their shots when "in the zone" but only 20% when not in the zone. If the player makes a shot, what is the probability they were "in the zone"?
Bayesian Insight: This shows how prior performance and situational data update the probability of being "in the zone."
So, again, start by drawing the graph:
Same thing. Start at the top, and work your way down.
Let's calculate the probability that he's in the zone given he made the shot. Remember, the player can make shots even when not in the zone. So, the probability is just the number of shots made in the zone divided by the total number of shots made: (28 / (28 + 12)) = 70%.
Let's do a similar calculation. What's the probability that the player wasn't in the zone given he missed the shot? It's just the number of shots missed when not in the zone divided by the total number of shots missed: (48 / (48 + 12)) = 80%.
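The shot tree in code, starting from 100 shots with 40 of them in the zone:

```python
# Of 100 shots: 40 in the zone, 60 not in the zone.
made_in_zone, missed_in_zone = 28, 12   # 70% / 30% of 40
made_out_of_zone, missed_out_of_zone = 12, 48  # 20% / 80% of 60

# P(in the zone | made the shot): made-in-zone over ALL made shots.
p_zone_given_made = made_in_zone / (made_in_zone + made_out_of_zone)
print(p_zone_given_made)  # 28 / 40 = 0.7
```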
Example: Airport Security Screening
Scenario: Suppose 1% of passengers carry prohibited items. The scanner detects prohibited items 99% of the time but falsely flags 5% of clean luggage. If the scanner flags a bag, what is the probability it actually contains a prohibited item?
Bayesian Insight: This highlights how rare events (low prior probability) and scanner accuracy affect the probability.
Same thing. Start at the top. I chose 10,000 passengers here because the small percentages would give me fractions of a person if I started from 100 people.
Let's calculate the probability that the bag actually contains a prohibited item given the scanner went off. That's just the number of times the scanner goes off because of a prohibited item divided by the total number of times the scanner goes off: (99 / (99 + 495)) = 16.67%. Think about that! Even though the detector catches 99% of prohibited items, a flag still probably isn't a real weapon. That's because of the prior: the 9,900 passengers doing the right thing generate more false flags (495) than the 100 passengers with prohibited items generate true flags (99).
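And the same tree for the scanner, starting from 10,000 passengers:

```python
# Of 10,000 passengers: 100 carry prohibited items, 9,900 are clean.
true_flags = 100 * 0.99     # 99 carriers flagged
false_flags = 9_900 * 0.05  # 495 clean passengers flagged

p_item_given_flag = true_flags / (true_flags + false_flags)
print(round(p_item_given_flag, 3))  # ~0.167
```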
Exploring the Extremes of the Formula
Notice that if the prior is 100% certain, one half of the tree will always be zero. Let's say no one ever brings a prohibited item. Then the prohibited-item side of the tree is empty: there's no one there to flag or not flag. So, in the above example, the probability that the bag actually contains a prohibited item given the scanner went off would be (0 / (0 + 495)) = 0%.
Similarly, if one of the likelihoods is perfectly known – for example, the machine is 100% accurate at detecting weapons when they're present – then you get 100 flagged and 0 not flagged on that side of the tree. If the scanner doesn't go off, you're 100% sure there's no weapon: (0 / (0 + 9,405)) = 0%. And even when the scanner does go off, there's still a chance some innocent person is getting flagged.
In math, I find it very useful to understand how equations change based on their inputs. Play with your own numbers and give it a shot.
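If you'd rather experiment in code than on paper, here's a tiny helper (my own sketch, not anything standard) you can feed your own numbers, including the extreme cases above:

```python
def p_item_given_flag(p_item, p_flag_given_item, p_flag_given_clean):
    """P(prohibited item | scanner flags), for any prior and likelihoods."""
    flagged_carriers = p_item * p_flag_given_item
    flagged_clean = (1 - p_item) * p_flag_given_clean
    total_flags = flagged_carriers + flagged_clean
    # Guard against 0/0 when nothing ever gets flagged.
    return flagged_carriers / total_flags if total_flags else 0.0

print(round(p_item_given_flag(0.01, 0.99, 0.05), 3))  # the example above: ~0.167
print(p_item_given_flag(0.0, 0.99, 0.05))   # nobody carries anything: 0.0
print(p_item_given_flag(0.01, 0.99, 0.0))   # no false flags: a flag means 1.0
```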
Wrapping Up
After writing this, having known Bayes's Theorem for so many years, it really made me think more about modes of learning than anything else. Once I devised this technique, all the ambiguity I had before, all the lack of clarity, all the clumsiness, just disappeared. Really amazing. It's still the same thing I learned years ago, just considered from a different perspective.
I hope someone out there finds this useful.
Written by Jason Vertrees.