How Well Is My Model Doing? Let’s Decode AUC-ROC Curve with an Email Spam Example

You’ve built a machine learning model. Great!
Now comes the big question:
"How good is it, really?"
Accuracy alone might not always be enough to answer that.
Let’s take a fun, real-life scenario and break it down—step by step, graph by graph, mistake by mistake.
Ready? Let’s get our hands dirty with a bit of spam.
The Spam Filter in Action
Imagine you are building a spam filter, a model whose job is to look at an incoming email and decide: is this spam or is it safe and important? This is a classic binary classification problem, where there are only two possible outcomes—Spam or Not Spam.
You collect a dataset of 100 emails to test your model. Out of these 100 emails:
50 are actually spam.
50 are actually not spam.
You feed them into your model, and here’s what it predicts:
Out of the 50 actual spam emails, it correctly catches 40 as spam.
The remaining 10 spam emails slip through and are wrongly marked as not spam.
Out of the 50 emails that are not spam, the model correctly classifies 45 as clean.
But 5 of those good emails are wrongly labeled as spam and get dumped into the spam folder.
Let’s put that information into a simple table called the Confusion Matrix, which is one of the most important tools in evaluating classification models.
Understanding the Confusion Matrix
The Confusion Matrix is basically a square table that helps you understand how well your classification model is performing. It compares what the model predicted to what the actual answers were. In our spam email example, the Confusion Matrix would look like this:
| | Predicted Spam | Predicted Not Spam |
| --- | --- | --- |
| Actual Spam | 40 (True Positives) | 10 (False Negatives) |
| Actual Not Spam | 5 (False Positives) | 45 (True Negatives) |
Now let’s break these terms down one by one because they’ll come up a lot as we dig deeper:
True Positives (TP): These are emails that are actually spam, and the model correctly labeled them as spam. In our case, there are 40 of these.
False Negatives (FN): These are spam emails that the model mistakenly thought were not spam, so they sneak into your inbox. There are 10 of these in our example.
False Positives (FP): These are the annoying cases where good emails get wrongly marked as spam and go to the spam folder. Our model made 5 such mistakes.
True Negatives (TN): These are good, clean emails that the model correctly identified as not spam. We have 45 of these.
So out of 100 emails, 40 were caught as spam correctly, 10 spam emails were missed, 5 good emails were wrongly marked, and 45 were handled perfectly.
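If you want to build this table in code rather than by hand, here is a minimal sketch using scikit-learn. The label arrays below are made up purely to match the counts in our example (1 = spam, 0 = not spam):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical label arrays, constructed to match the example: 1 = spam, 0 = not spam
y_true = np.array([1] * 50 + [0] * 50)             # 50 actual spam, 50 actual not spam
y_pred = np.array([1] * 40 + [0] * 10              # 40 spam caught, 10 spam missed
                  + [1] * 5 + [0] * 45)            # 5 false alarms, 45 clean emails kept clean

# With labels=[1, 0], rows are actual classes and columns are predicted classes,
# so the layout matches our table: [[TP, FN], [FP, TN]]
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)
# [[40 10]
#  [ 5 45]]
```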
Let’s Talk Metrics – Accuracy, Precision, and Recall
Now that we understand what happened, let’s calculate how well the model is doing using some basic metrics.
Accuracy
Accuracy tells us what percentage of total predictions were correct. In this case, the model got 40 spam emails right and 45 not-spam emails right, so:
Accuracy = (TP + TN) / Total = (40 + 45) / 100 = 0.85, or 85%
That sounds pretty good. But here’s where things get tricky. What if only 5% of all emails were spam and 95% were not? A model could just say “not spam” every time and still be 95% accurate—while completely failing its job of catching spam. That’s why accuracy is sometimes misleading, and we need other metrics.
Precision
Precision tells us, out of all the emails the model predicted as spam, how many were actually spam. It answers the question: When the model says “this is spam,” how often is it right?
Precision = TP / (TP + FP) = 40 / (40 + 5) ≈ 0.89
So, about 89% of the time, the model is correct when it predicts something as spam.
Recall (also called Sensitivity or True Positive Rate)
Recall tells us, out of all the actual spam emails, how many the model was able to catch. It answers: How good is the model at catching spam?
Recall = TP / (TP + FN) = 40 / (40 + 10) = 0.80
So, the model is catching 80% of the spam emails, but it’s letting 20% of them slip through. That could be a problem depending on how strict you want the spam detection to be.
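To see all three metrics side by side, here is a small sketch using scikit-learn’s built-in scorers, again with made-up label arrays that match our 40/10/5/45 example:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Same hypothetical labels as in the confusion matrix sketch: 1 = spam, 0 = not spam
y_true = np.array([1] * 50 + [0] * 50)
y_pred = np.array([1] * 40 + [0] * 10 + [1] * 5 + [0] * 45)

print(accuracy_score(y_true, y_pred))   # 0.85  -> (40 + 45) / 100
print(precision_score(y_true, y_pred))  # ~0.89 -> 40 / (40 + 5)
print(recall_score(y_true, y_pred))     # 0.80  -> 40 / (40 + 10)
```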
Why Accuracy Isn’t Enough
Let’s say tomorrow you deploy this model to filter a million emails, and only 5% of them are spam. That’s 50,000 spam emails. If your model decides to label everything as "not spam," it would still be 95% accurate—because 950,000 emails were correctly marked. But it would miss all the actual spam and do nothing helpful.
This is why we need more reliable tools to evaluate model performance—especially in cases where one type of mistake (like letting spam into an inbox) is more costly than another (like occasionally misplacing a good email in spam).
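Here is a quick sketch of that imbalanced scenario, with a lazy “model” that labels every email as not spam. The numbers are hypothetical, chosen to match the one-million-email example above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced dataset: 1,000,000 emails, only 5% spam (1 = spam, 0 = not spam)
y_true = np.array([1] * 50_000 + [0] * 950_000)

# A "model" that labels every single email as not spam
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.95 -> looks impressive...
print(recall_score(y_true, y_pred))    # 0.0  -> ...but it catches zero spam
```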
The ROC Curve – A Better Way to Measure Performance
Now we step into something called the ROC Curve, which stands for Receiver Operating Characteristic. It sounds technical, but the idea is actually pretty simple once you understand what it does.
Most machine learning classifiers, especially probabilistic ones like logistic regression, don’t just say “yes” or “no” to a prediction. Instead, they say something like, “I think this email is 90% likely to be spam.”
You, as the developer, choose a threshold, say 0.5, and decide that anything above 0.5 will be labeled as spam. But what if you changed that threshold? What if you made it 0.7? Or 0.3?
Lowering the threshold means the model becomes more aggressive and labels more emails as spam. Raising it makes the model more cautious.
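In code, applying a threshold is just a comparison. Here is a tiny sketch with invented probabilities, showing how the same scores produce different labels as the threshold moves:

```python
import numpy as np

# Invented spam probabilities for five emails
probs = np.array([0.92, 0.65, 0.48, 0.30, 0.05])

for threshold in (0.3, 0.5, 0.7):
    labels = (probs >= threshold).astype(int)  # 1 = spam, 0 = not spam
    print(f"threshold={threshold}: {labels}")

# threshold=0.3: [1 1 1 1 0]   <- aggressive: flags more emails as spam
# threshold=0.5: [1 1 0 0 0]
# threshold=0.7: [1 0 0 0 0]   <- cautious: flags fewer emails as spam
```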
The ROC Curve helps you visualize what happens to your model’s performance as you vary this threshold.
What Does the ROC Curve Show?
The ROC Curve is a plot between two important values:
True Positive Rate (TPR), which you already know as Recall.
False Positive Rate (FPR), which tells you how many not-spam emails were wrongly marked as spam, out of all actual not-spam emails. In our example, FPR = FP / (FP + TN) = 5 / (5 + 45) = 0.10.
Now imagine you gradually change the threshold and measure the TPR and FPR each time. You then plot these values on a graph, with FPR on the x-axis and TPR on the y-axis. That’s your ROC Curve.
A perfect model would quickly jump to the top-left corner of the graph, which means it catches all the spam (high TPR) and rarely mislabels good emails (low FPR).
A model that guesses randomly would make a diagonal line from the bottom-left to the top-right of the graph.
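In practice you don’t compute these points by hand. scikit-learn’s roc_curve sweeps the threshold for you; here is a minimal sketch using invented scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Invented ground truth (1 = spam) and predicted spam probabilities
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_score = np.array([0.95, 0.80, 0.60, 0.40, 0.55, 0.30, 0.20, 0.05])

# roc_curve sweeps the threshold and returns one (FPR, TPR) pair per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```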
What Is AUC and Why Does It Matter?
AUC stands for Area Under the Curve. It is a single number that tells you how much of the graph lies underneath the ROC Curve. In simple terms, AUC measures how well your model can separate the two classes: spam and not spam.
Here’s how to interpret AUC scores:
AUC = 1.0 → The model is perfect. It always ranks spam higher than not spam.
AUC = 0.9 or higher → Excellent model.
AUC = 0.8 to 0.9 → Good model.
AUC = 0.7 to 0.8 → Fair model.
AUC = 0.6 to 0.7 → Poor model.
AUC = 0.5 → Model is guessing randomly.
AUC less than 0.5 → Model is worse than random. It might be doing the opposite of what it should.
So, if your spam email model has an AUC of 0.95, that means it’s doing a very good job of telling spam from not spam.
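Computing AUC in code is a one-liner. Here is a sketch using scikit-learn’s roc_auc_score on the same invented scores as before:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Same invented ground truth and scores as in the roc_curve sketch
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_score = np.array([0.95, 0.80, 0.60, 0.40, 0.55, 0.30, 0.20, 0.05])

# AUC = probability that a randomly chosen spam email scores higher
# than a randomly chosen non-spam email
print(roc_auc_score(y_true, y_score))  # 0.9375 for these invented scores
```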
Visualizing the Curve with Thresholds
Here’s a quick table showing what happens when we vary the threshold:
| Threshold | TPR (Recall) | FPR |
| --- | --- | --- |
| 0.2 | 0.95 | 0.40 |
| 0.4 | 0.90 | 0.20 |
| 0.6 | 0.85 | 0.10 |
| 0.8 | 0.75 | 0.05 |
| 1.0 | 0.00 | 0.00 |
By plotting these points, we can draw the ROC Curve. The area under it becomes our AUC score. The more the curve hugs the top-left side of the graph, the better the model.
Here’s the ROC Curve based on the threshold table above. Each point represents the model's performance at a different threshold, and the curve clearly shows how True Positive Rate and False Positive Rate trade off. The better the model, the more this curve bends toward the top-left corner of the graph.
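If you’d like to reproduce that plot yourself, here is a minimal matplotlib sketch built from the table above. Note that it adds the (1, 1) endpoint (threshold 0) to close the curve, and the AUC shown is only a rough trapezoidal estimate from these few points:

```python
import numpy as np
import matplotlib.pyplot as plt

# (FPR, TPR) points from the threshold table above, sorted by increasing FPR,
# plus the (1, 1) endpoint at threshold 0 to close the curve
fpr = np.array([0.00, 0.05, 0.10, 0.20, 0.40, 1.00])
tpr = np.array([0.00, 0.75, 0.85, 0.90, 0.95, 1.00])

# Rough AUC estimate: trapezoidal rule over these few points (about 0.92 here)
auc_estimate = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)

plt.plot(fpr, tpr, marker="o", label=f"Model (AUC ~ {auc_estimate:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guessing")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve from the threshold table")
plt.legend()
plt.show()
```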
Why You Should Care About AUC-ROC
If you’re working with classification problems—whether it’s spam detection, medical diagnosis, fraud detection, or anything where two outcomes exist—you’ll want your model to be evaluated thoroughly. Accuracy may look impressive, but it can hide some big problems, especially in real-world scenarios where one type of mistake is more dangerous than another.
The Confusion Matrix helps you break things down. The ROC Curve shows you the full picture across thresholds. And the AUC gives you a solid summary in one number.
So next time you build a classifier, don’t stop at accuracy. Dig deeper. Plot the curve. Calculate the AUC. And make sure your model doesn’t just work—it works well for the real-world decisions it’s meant to support.