Complete Guide to AIML Basics: Understanding Cross-Validation, Confusion Matrix, Sensitivity, and Specificity

Cross-Validation:

Cross-validation is a technique used in machine learning to compare different methods and understand which of them perform well in practice. It addresses the challenge of needing data for both training and testing an algorithm.

The Problems it Solves:
1. When developing a machine learning model, you need data to train the algorithm (i.e., estimate its parameters, such as the shape of the curve in logistic regression) and data to test how well the method works on new, unseen data.
2. A naive approach might be to use, for example, the first 75% of the data for training and the last 25% for testing. However, there is no guarantee that this particular split is the best one, or that another block of data wouldn't make a better test set.

3. To avoid this issue, cross-validation provides a systematic way to ensure that all data is used for both training and testing in different combinations, offering a more reliable performance estimate.

How Cross-Validation Works:
1. Instead of worrying about which single block of data is best for testing, cross-validation uses every block for testing, one at a time, and then summarizes the results.
2. This process typically involves dividing the entire dataset into several “blocks”.

For Example,
If data is divided into four blocks, cross-validation would:
1. Use the first three blocks to train the method and the last block to test it, keeping track of how well the method performs on the test data.
2. Then use a different combination of blocks for training and a different block for testing, again recording the performance.
3. Continue until every block of data has been used for testing, as the sketch below illustrates.
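
A minimal sketch of this four-block rotation using scikit-learn's KFold. The synthetic dataset and the choice of logistic regression here are illustrative assumptions, not part of the example above:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Illustrative synthetic dataset (an assumption for this sketch)
X, y = make_classification(n_samples=100, n_features=5, random_state=42)

kf = KFold(n_splits=4, shuffle=True, random_state=42)  # 4 "blocks"
scores = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])          # train on three blocks
    score = model.score(X[test_idx], y[test_idx])  # test on the held-out block
    scores.append(score)
    print(f"Fold {fold}: accuracy = {score:.2f}")

print(f"Mean accuracy across folds: {np.mean(scores):.2f}")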

Types of Cross-Validation:
→ k-fold cross-validation: When the data is divided into a specific number of blocks, say ‘k’ blocks, it is called k-fold cross-validation. For example, dividing the data into 10 blocks is known as 10-fold cross-validation, a method that is very common in practice.
→ Leave-One-Out Cross-Validation: In the extreme case, each individual sample (e.g., each patient) is treated as a ‘block’: the model is trained on all the other samples and tested on the one left out. Both variants are sketched below.
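
A minimal sketch of both variants using scikit-learn's cross_val_score; as above, the synthetic data and logistic regression model are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, LeaveOneOut

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression()

# 10-fold cross-validation: 10 blocks, each used once for testing
scores_10fold = cross_val_score(model, X, y, cv=10)
print(f"10-fold mean accuracy: {scores_10fold.mean():.2f}")

# Leave-one-out: every single sample is its own test "block"
scores_loo = cross_val_score(model, X, y, cv=LeaveOneOut())
print(f"Leave-one-out mean accuracy: {scores_loo.mean():.2f}")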

—————————————————————————————————

Confusion Matrix:

A Confusion Matrix is a simple, clear table that helps you see exactly how well a machine learning model performed when making predictions on test data.
It shows:
1. What the model predicted correctly
2. What the model predicted wrongly

Structure of a 2×2 Confusion Matrix (Binary Classification)

                      Actual: Positive      Actual: Negative
Predicted: Positive   True Positive (TP)    False Positive (FP)
Predicted: Negative   False Negative (FN)   True Negative (TN)

Example Confusion Matrix:

In this example, the model predicts whether the patient will get heart disease or not.

Program:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Sample data: True labels vs Predicted labels
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]  # 0 = No Heart Disease, 1 = Heart Disease
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]

# Generate the confusion matrix
cm = confusion_matrix(y_true, y_pred)

# Display the confusion matrix with clearer formatting
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["No Heart Disease", "Heart Disease"])

fig, ax = plt.subplots(figsize=(4, 4))  # Compact square figure for a 2x2 matrix
disp.plot(cmap="Blues", values_format='d', ax=ax)

plt.title("Confusion Matrix: Heart Disease Prediction", fontsize=12)
plt.xlabel("Predicted Label", fontsize=10)
plt.ylabel("Actual Label", fontsize=10)
plt.tight_layout()
plt.show()

What you see in the output:
The resulting matrix visualization will clearly show:
→ Correct predictions (true positives and true negatives) on the diagonal.
→ Mistakes (false positives and false negatives) off the diagonal.
→ This helps you quickly assess whether your model is confusing positive and negative predictions and where improvement might be needed.
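
To pull the four counts out of the matrix for later calculations, you can flatten it. Note that scikit-learn lays the matrix out with actual labels as rows and predicted labels as columns, so for labels [0, 1] the flattened order is TN, FP, FN, TP (a minimal sketch continuing from the program above):

# Continuing from the confusion matrix computed above
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=4, FP=1, FN=1, TP=4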

Why is it Useful?

1. Find Weaknesses: Helps you see where your model makes mistakes (false positives / false negatives).
2. Compare Models: Easily compares different models side-by-side.
3. Improve Performance: Gives deeper insight than just looking at overall accuracy.

Connection to Cross-Validation:
→ Use Cross-Validation first to ensure fair training/testing splits.
→ Then use a Confusion Matrix to evaluate how well your model performs, as sketched below.
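
A minimal sketch of this combination, using scikit-learn's cross_val_predict to collect out-of-fold predictions and then building a confusion matrix from them; the synthetic data and model are again illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Each sample is predicted by a model that never saw it during training
y_oof = cross_val_predict(LogisticRegression(), X, y, cv=5)
print(confusion_matrix(y, y_oof))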


Sensitivity

What it tells us:
How well our model detects actual positives.
In health: “Of all the people with the disease, how many did the model catch?”
Formula:
Sensitivity = TP / (TP + FN)
When to prioritize:
When missing a positive is dangerous (ex: disease detection, fraud detection).
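
A quick check against the heart disease example above, where TP = 4 and FN = 1; in scikit-learn, sensitivity is the same as recall for the positive class:

from sklearn.metrics import recall_score

# Manual calculation: TP / (TP + FN) = 4 / (4 + 1)
sensitivity = tp / (tp + fn)
print(f"Sensitivity: {sensitivity:.2f}")  # 0.80

# Same result via scikit-learn's recall_score
print(f"Recall: {recall_score(y_true, y_pred):.2f}")  # 0.80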

Specificity

What it tells you:
How well your model avoids false positives.
In health: “Of all the healthy people, how many did the model correctly say are healthy?”
Formula:
Specificity = TN / (TN + FP)
When to prioritize:
When wrongly labeling someone positive is harmful (ex: expensive treatment, legal consequences).
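
The matching check for specificity, with TN = 4 and FP = 1; passing pos_label=0 to recall_score gives the recall of the negative class, which is exactly specificity:

from sklearn.metrics import recall_score

# Manual calculation: TN / (TN + FP) = 4 / (4 + 1)
specificity = tn / (tn + fp)
print(f"Specificity: {specificity:.2f}")  # 0.80

# Same result via recall_score on the negative class
print(f"Specificity via recall_score: {recall_score(y_true, y_pred, pos_label=0):.2f}")  # 0.80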

Easy Example To Remember:

                      Actually Positive     Actually Negative
Predicted Positive    True Positive (TP)    False Positive (FP)
Predicted Negative    False Negative (FN)   True Negative (TN)

Quick Decision Rule

What matters more?                   Metric to Prioritize
Catching every possible positive     Sensitivity
Avoiding false alarms on negatives   Specificity

Why it matters:
1. Sensitivity helps avoid missing a real problem.
2. Specificity helps avoid creating a false problem.
