Machine Learning Fundamentals: Bias, Variance, ROC (Receiver Operating Characteristic) and AUC (Area Under the Curve)

Why This Article?

This article explains fundamentals of machine learning, covering core concepts every AI/ML engineer should know. It also serves as a place where I practice, learn, and recall concepts through blogging, with a focus on how we understand the concepts we learn.

AIML Concepts

The fundamental concepts of machine learning are numerous, but I won’t explain all of them here. Instead, we'll look at some key concepts:

  1. Bias and Variance

  2. ROC (Receiver Operating Characteristic)

  3. AUC (Area Under the Curve)

Bias and Variance

What is Bias?

Bias occurs when a model is too simplistic to represent the real problem accurately.

Example:
If the true relationship is curved but we use a straight line to model it, the model will make errors. This is called high bias because the model lacks flexibility.
Imagine predicting a mouse’s height from its weight. The actual relationship might be curved, since mice don’t keep getting taller at a fixed rate as their weight increases. If you use a straight line for predictions, your model is too simple and misses the true curve.

High bias results in poor performance on both training and test data because the model doesn’t capture the real pattern.

Example of High Bias:

Applying linear regression (a straight line) to data requiring a curve.
The model makes incorrect predictions because it oversimplifies the problem.

Summary:

Bias occurs when a model is too simple to accurately represent the true relationship in the data.
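
To make this concrete, here is a minimal sketch in Python (NumPy and scikit-learn) that fits a straight line to synthetic curved data. The quadratic relationship, noise level, and random seed are made up purely for illustration.

```python
# A minimal sketch of high bias: a straight line fit to curved data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))                     # e.g. "weight"
y = 0.5 * X.ravel() ** 2 + rng.normal(0, 0.2, size=200)  # curved "height"

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

line = LinearRegression().fit(X_train, y_train)          # too simple for a curve

print("train MSE:", mean_squared_error(y_train, line.predict(X_train)))
print("test MSE:", mean_squared_error(y_test, line.predict(X_test)))
# Both errors stay high (well above the noise level) because the straight
# line misses the curve on training and test data alike -- high bias.
```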

What is Overfitting?

Variance measures how much a model’s predictions change when it is trained on different samples of the data; a high-variance model is overly sensitive to the particular training set it sees.
Overfitting is a direct result of high variance.
When a model performs extremely well on the training data but badly on new, unseen data, it’s said to be overfitting.

Why Overfitting Happens

The model is too complex: instead of learning only the underlying pattern, it also learns the random noise, mistakes, and outliers in the training data.

Signs of Overfitting:

  • High accuracy on training data

  • Poor accuracy on testing data

  • Very wavy or complicated model curves
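
The snippet below is a rough sketch of how these signs show up in practice: an overly flexible (degree-15) polynomial is fit to a small, noisy dataset, and the training and test scores are compared. The sine-shaped data, noise level, and polynomial degree are arbitrary illustrative choices.

```python
# A sketch of spotting overfitting via the gap between train and test scores.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=(40, 1))                  # small dataset
y = np.sin(X.ravel()) + rng.normal(0, 0.3, size=40)  # true pattern plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

wavy = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
wavy.fit(X_train, y_train)

print("train R^2:", wavy.score(X_train, y_train))    # typically high
print("test R^2:", wavy.score(X_test, y_test))       # typically much lower
# A large gap between training and test scores is the classic sign of
# overfitting: the wavy curve has learned the noise, not just the pattern.
```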

What is Underfitting?

Underfitting is a direct result of high bias.
When a model performs poorly on both training data and new, unseen data, it's said to be underfitting.

Signs of Underfitting:

  • Low accuracy on training data

  • Low accuracy on testing data

  • Model is too simple (straight lines, not enough flexibility)

  • Fails to capture important patterns

What is the Ideal Model?

The goal is to build a model with both low bias and low variance.

  • Low bias: Understands the real pattern.

  • Low variance: Gives stable, reliable predictions even on new data.

Such a model captures the true relationship without being distracted by noise.


Quick Comparison: Simple vs. Complex Models

| Model Type | Bias | Variance | Performance on New Data |
| --- | --- | --- | --- |
| Simple (Linear) | High | Low | Underfits (misses patterns) |
| Complex (Wavy) | Low | High | Overfits (catches noise) |
| Balanced | Low | Low | Performs well consistently |
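
The table’s pattern can be reproduced with a small experiment: sweep the model’s complexity and compare training and test error. This is only a sketch; the quadratic data, noise level, and polynomial degrees are illustrative choices, not a recipe.

```python
# Sweep polynomial degree (model complexity) and watch train vs. test error.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(0, 5, size=(80, 1))
y = 0.5 * X.ravel() ** 2 + rng.normal(0, 0.5, size=80)   # true pattern is quadratic

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

for degree in [1, 2, 15]:                                # simple, balanced, complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# Typical outcome: degree 1 is high on both (underfits), degree 15 has the
# lowest training error but a worse test error (overfits), and degree 2,
# which matches the true shape, does best on the test set.
```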

ROC (Receiver Operating Characteristic) & AUC (Area Under the Curve)

What is ROC?

The ROC curve is a graph used to evaluate the performance of a classification model, especially in binary classification problems (Yes/No, 0/1, True/False).
It shows how well the model can separate the positive class from the negative class as the classification threshold changes.

How ROC Works:

The ROC curve is plotted between two important metrics:

| Axis | Meaning | Formula |
| --- | --- | --- |
| X-axis | False Positive Rate (FPR) | FPR = FP / (FP + TN) |
| Y-axis | True Positive Rate (TPR) | TPR = TP / (TP + FN) |

  • TPR (Sensitivity) tells us:
    “Of all actual positives, how many did we correctly classify as positive?”

  • FPR tells us:
    “Of all actual negatives, how many did we wrongly classify as positives?”
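
Here is a minimal sketch of those two formulas in code, using scikit-learn on a synthetic binary dataset. The data generator, the logistic regression classifier, and the 0.5 threshold are assumptions made just for illustration.

```python
# TPR and FPR at one threshold, then the full ROC curve via roc_curve.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]   # predicted probability of the positive class

# TPR and FPR at a single threshold (0.5), straight from the formulas above
tn, fp, fn, tp = confusion_matrix(y_test, scores >= 0.5).ravel()
print("TPR =", tp / (tp + fn), "FPR =", fp / (fp + tn))

# roc_curve sweeps the threshold for us and returns the whole curve
fpr, tpr, thresholds = roc_curve(y_test, scores)
print(len(thresholds), "thresholds evaluated")
```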

How to Read the ROC Curve:

  • Top-left corner (0,1) is the best possible spot:
    → 0% False Positives, 100% True Positives.

  • Diagonal line from (0, 0) to (1, 1) represents random guessing.
    If your ROC curve is close to this diagonal, your model is not better than guessing.

  • The higher the ROC curve (closer to top-left), the better the model.

What is AUC?

AUC stands for Area Under the ROC Curve.
It provides a single number to summarize the ROC curve’s performance.

How to Interpret AUC:

| AUC Score | Meaning |
| --- | --- |
| 1.0 | Perfect model |
| 0.9 – 1.0 | Excellent model |
| 0.8 – 0.9 | Good model |
| 0.7 – 0.8 | Fair model |
| 0.5 | Same as guessing |
| < 0.5 | Worse than guessing |

Higher AUC = Better model
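
As a small illustration, scikit-learn’s roc_auc_score computes this area directly from true labels and predicted scores. The handful of labels and scores below are invented numbers.

```python
# A tiny hand-made AUC example using scikit-learn's roc_auc_score.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                   # actual classes
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]  # model's predicted probabilities

print("AUC =", roc_auc_score(y_true, scores))       # 0.875 for these numbers
# Equivalently: the probability that a randomly chosen positive example
# gets a higher score than a randomly chosen negative one.
```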

Why ROC and AUC are Useful:

  1. Helps compare models at all thresholds, not just at 0.5

  2. Shows trade-off between sensitivity (TPR) and specificity (1 - FPR)

  3. AUC gives a single score to compare models easily

Example:

Imagine predicting if a patient has a disease.
We adjust the threshold for saying “Yes, disease”.
ROC shows how False Positives and True Positives change.
AUC tells us how good our model is overall.
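
A tiny sketch of that disease example, with made-up patients, made-up predicted probabilities, and three candidate thresholds, shows the trade-off directly:

```python
# Watch TPR and FPR change as the "Yes, disease" threshold moves.
import numpy as np

has_disease = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])                       # actual labels
p_disease   = np.array([0.9, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1, 0.8, 0.5, 0.05])  # model output

for threshold in [0.3, 0.5, 0.7]:
    predicted = p_disease >= threshold
    tp = np.sum(predicted & (has_disease == 1))
    fn = np.sum(~predicted & (has_disease == 1))
    fp = np.sum(predicted & (has_disease == 0))
    tn = np.sum(~predicted & (has_disease == 0))
    print(f"threshold {threshold}: TPR = {tp / (tp + fn):.2f}, FPR = {fp / (fp + tn):.2f}")

# Lowering the threshold catches more real cases (higher TPR) but also flags
# more healthy patients (higher FPR). The ROC curve traces this trade-off,
# and AUC summarizes it in one number.
```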

Written by

Sam Anirudh Malarvannan