Is the Model making right predictions? - Part 3 of 5 on Evaluation of Machine Learning Models

Japkeerat Singh
5 min read

When it comes to evaluating machine learning models, picking the right metric is like picking the right outfit for an occasion: it can make or break the impression. Sure, accuracy and precision-recall are great, but they sometimes fall short. The AUC-ROC curve is the metric that tells a deeper story than the rest.

Most machine learning models, even the classification ones, actually calculate a probability for each class and apply a predefined decision boundary of 0.5. That means any probability greater than 0.5 is treated as the positive class and anything lower as the negative class. This may look good in theory, but it is not always practical. There are scenarios where you need a model to be extremely sure before making a positive classification (again, the rare disease classification example from the previous article), and the AUC-ROC curve makes a whole lot of sense for finding that decision boundary.

AUC-ROC Curve: A Quick Overview

Let’s break it down:

  • ROC stands for Receiver Operating Characteristic. Fancy name, but all it means is a graph that shows how well your model separates the positive and negative classes as you tweak the decision threshold.

  • AUC stands for Area Under the Curve. This is the number that summarizes the ROC curve into one handy score. (Remember the area-under-the-curve concept from integration class, the one everyone questioned ever needing and your teacher probably couldn’t explain why? Yeah, that area under the curve.)

In plain English, the ROC curve tells you how good your model is at making the right calls, while the AUC is the gold star rating—higher is better.

Why Should You Care About the AUC-ROC Curve?

Let’s face it: accuracy can be a real jerk sometimes. It looks good on paper but doesn’t always tell you the full story. Imagine a dataset where 99% of the cases are “No” and 1% are “Yes.” A model that just says “No” all the time scores a 99% accuracy. Impressive? Not really.

The AUC-ROC Curve doesn’t fall for such tricks. It doesn’t just focus on getting answers right—it checks if your model can tell the difference between the two classes. It’s like asking, “Does this model have good instincts?”

Breaking Down the ROC Curve

Here’s the gist:

  1. True Positive Rate (TPR), also known as Recall: This measures how good the model is at catching actual positives.

$$TPR = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

  2. False Positive Rate (FPR): This checks how often the model cries wolf when there isn’t one.

$$FPR = \frac{\text{FP}}{\text{FP} + \text{TN}}$$

At every threshold, you plot these values to create the ROC curve. A perfect model would hit the top-left corner of the graph (TPR = 1, FPR = 0), which means it catches every positive while raising no false alarms on the negatives. The AUC quantifies this; higher is better!
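To make these two rates concrete, here is a tiny sketch with made-up labels and probabilities (so the numbers are purely illustrative) that computes TPR and FPR by hand at a single threshold of 0.5. Repeat it at every threshold and you have the points of a ROC curve:

import numpy as np

# Illustrative ground-truth labels and predicted probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.8, 0.1])

threshold = 0.5
y_pred = (y_prob >= threshold).astype(int)  # apply the decision boundary

# Count the four confusion-matrix cells
tp = np.sum((y_pred == 1) & (y_true == 1))
fn = np.sum((y_pred == 0) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
tn = np.sum((y_pred == 0) & (y_true == 0))

tpr = tp / (tp + fn)  # share of actual positives we caught
fpr = fp / (fp + tn)  # share of actual negatives we falsely flagged
print(f"At threshold {threshold}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")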

So, How Do You Use It?

You didn’t come here just for theory, right? Let’s get our hands dirty with some code. We’ll use Python and the trusty Scikit-learn library to show how it’s done.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Generate some synthetic data
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a model
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Get predicted probabilities
y_scores = model.predict_proba(X_test)[:, 1]

# Calculate the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores)

# Calculate AUC
auc = roc_auc_score(y_test, y_scores)
print(f"AUC Score: {auc:.2f}")

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"ROC Curve (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], 'k--', label="Random Guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()

What Do the Results Mean?

  1. AUC Score:

    • 1.0: Your model’s a rockstar!

    • 0.5: Your model’s as good as flipping a coin.

    • < 0.5: Uh-oh, something’s seriously wrong.

  2. ROC Curve:

    • The closer it hugs the top-left corner, the better.

    • A straight diagonal line? That’s random guessing.

When to Use AUC-ROC vs. Precision-Recall

Here’s a quick tip: AUC-ROC is great for balanced datasets. But when your data is heavily imbalanced (like fraud detection or rare disease diagnosis), the precision-recall curve often gives more meaningful insights. Why choose one when you can compare both? (A quick sketch right after the list below does exactly that.)

The AUC-ROC Curve can be used for imbalanced datasets, but it’s not always the best tool. Here’s why:

  • Balanced Datasets: The ROC curve works well because both the True Positive Rate (TPR) and False Positive Rate (FPR) are meaningful metrics. The model's ability to distinguish between classes is clear and reliable.

  • Imbalanced Datasets: When there’s a severe imbalance, the False Positive Rate (FPR) can become misleading. This is because the negative class dominates, making the FPR very small even for a model that’s just guessing. In these cases, the Precision-Recall (PR) curve becomes more informative since it focuses on the positive class (which is usually the minority class in imbalanced datasets).
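If you want to see both views side by side, here is a minimal sketch that assumes the y_test and y_scores variables from the snippet above are still in scope. Scikit-learn’s average_precision_score summarizes the precision-recall curve in one number, much like roc_auc_score does for the ROC curve:

from sklearn.metrics import roc_auc_score, average_precision_score, precision_recall_curve
import matplotlib.pyplot as plt

# One-number summaries of both curves
print(f"ROC AUC:           {roc_auc_score(y_test, y_scores):.2f}")
print(f"Average precision: {average_precision_score(y_test, y_scores):.2f}")

# Plot the precision-recall curve for the positive (minority) class
precision, recall, _ = precision_recall_curve(y_test, y_scores)
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, label="Precision-Recall Curve")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.show()

On a heavily imbalanced dataset like the synthetic one above, the ROC AUC tends to look flattering while the average precision stays more modest; that gap is exactly what the precision-recall view is meant to expose.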

Using the ROC Curve to Set Decision Boundaries

The AUC-ROC Curve isn’t just about evaluating performance; it’s also a handy tool for finding the optimal decision boundary for your model.

By default, many binary classifiers use 0.5 as the threshold for assigning a positive or negative class. However, this one-size-fits-all approach might not work in all cases (a short sketch after the list below shows how to apply a custom threshold). For instance:

  • In medical diagnostics, false negatives (missing an actual disease) can be catastrophic. You might want to set a threshold closer to 0.3 or 0.4 to ensure fewer false negatives, even if it means slightly more false positives.

  • In spam detection, false positives (marking a legitimate email as spam) are annoying. You might prefer a threshold closer to 0.7 or 0.8 to minimize those errors.
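Changing the threshold doesn’t require retraining anything; you simply compare the predicted probabilities against your own cut-off instead of 0.5. Here is a minimal sketch, again reusing y_test and y_scores from the snippet above (the 0.3 value is just an illustration of the medical-diagnostics case):

from sklearn.metrics import confusion_matrix

# Label as positive whenever the predicted probability clears a custom threshold
custom_threshold = 0.3  # illustrative: fewer false negatives, at the cost of more false positives
y_pred_custom = (y_scores >= custom_threshold).astype(int)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred_custom).ravel()
print(f"Threshold {custom_threshold}: TP={tp}, FP={fp}, FN={fn}, TN={tn}")

Try a few different thresholds and watch how the false-negative and false-positive counts trade off against each other.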

How the ROC Curve Helps

The ROC curve gives you a visual way to assess these trade-offs:

  1. At different thresholds, calculate the True Positive Rate (TPR) and False Positive Rate (FPR).

  2. Choose the threshold where the balance between TPR and FPR aligns with your business goals.

For example:

  • A steeper curve at the top-left corner means the model achieves high TPR with minimal FPR. This is a good spot to consider your threshold.

  • A threshold closer to 0.8 may prioritize precision over recall, useful for high-stakes scenarios where false positives are costly.

How to find the optimal decision boundary from TPR & FPR?

You can calculate Youden’s Index for that. It is just another fancy term for the difference between TPR and FPR (J = TPR - FPR). The threshold at which this index is highest is the decision boundary that gives you the best balance between the two.
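Since roc_curve already hands you the TPR, FPR, and the threshold behind every point, computing Youden’s Index takes only a couple of lines. A minimal sketch, assuming the fpr, tpr, and thresholds arrays from the earlier snippet:

import numpy as np

# Youden's Index J = TPR - FPR; the threshold that maximizes it gives the best balance
youden_j = tpr - fpr
best_idx = np.argmax(youden_j)
best_threshold = thresholds[best_idx]
print(f"Best threshold by Youden's Index: {best_threshold:.2f} "
      f"(TPR = {tpr[best_idx]:.2f}, FPR = {fpr[best_idx]:.2f})")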

