Machine Learning Fundamentals: Bias, Variance, ROC (Receiver Operating Characteristic) and AUC (Area Under the Curve)

Why This Article?

This article explains fundamentals of machine learning, covering core concepts every AI/ML engineer should know. It also serves as a place where I practice, learn, and recall concepts through blogging, with a focus on how we understand the concepts we learn.

AIML Concepts

The fundamental concepts of machine learning are numerous, but I won’t explain all of them here. Instead, we'll look at some key concepts:

  1. Bias and Variance

  2. ROC (Receiver Operating Characteristic)

  3. AUC (Area Under the Curve)

Bias and Variance

What is Bias?

Bias occurs when a model is too simplistic to represent the real problem accurately.

Example:
If the true relationship is curved but we use a straight line to model it, the model will make errors. This is called high bias because the model lacks flexibility.
Imagine predicting a mouse’s height from its weight. The actual relationship might be curved, since mice don’t keep getting taller at a fixed rate as their weight increases. If you use a straight line for predictions, your model is too simple and misses the true curve.

High bias results in poor performance on both training and test data because the model doesn’t capture the real pattern.

Example of High Bias:

Applying linear regression (a straight line) to data requiring a curve.
The model makes incorrect predictions because it oversimplifies the problem.

Summary:

Bias occurs when a model is too simple to accurately represent the true relationship in the data.
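
To make this concrete, here is a minimal sketch in Python (NumPy and scikit-learn) that fits a straight line to synthetic curved data. The quadratic relationship, noise level, and random seed are made up purely for illustration.

```python
# A minimal sketch of high bias: a straight line fit to curved data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))                     # e.g. "weight"
y = 0.5 * X.ravel() ** 2 + rng.normal(0, 0.2, size=200)  # curved "height"

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

line = LinearRegression().fit(X_train, y_train)          # too simple for a curve

print("train MSE:", mean_squared_error(y_train, line.predict(X_train)))
print("test MSE:", mean_squared_error(y_test, line.predict(X_test)))
# Both errors stay high (well above the noise level) because the straight
# line misses the curve on training and test data alike -- high bias.
```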

What is Overfitting?

Variance measures how much a model’s predictions change when it is trained on different samples of the data; a high-variance model is overly sensitive to the particular training set it sees.
Overfitting is a direct result of high variance.
When a model performs extremely well on the training data but badly on new, unseen data, it’s said to be overfitting.

Why Overfitting Happens

The model is too complex: instead of learning only the underlying pattern, it also learns the random noise, mistakes, and outliers in the training data.

Signs of Overfitting:

  • High accuracy on training data

  • Poor accuracy on testing data

  • Very wavy or complicated model curves
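
The snippet below is a rough sketch of how these signs show up in practice: an overly flexible (degree-15) polynomial is fit to a small, noisy dataset, and the training and test scores are compared. The sine-shaped data, noise level, and polynomial degree are arbitrary illustrative choices.

```python
# A sketch of spotting overfitting via the gap between train and test scores.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=(40, 1))                  # small dataset
y = np.sin(X.ravel()) + rng.normal(0, 0.3, size=40)  # true pattern plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

wavy = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
wavy.fit(X_train, y_train)

print("train R^2:", wavy.score(X_train, y_train))    # typically high
print("test R^2:", wavy.score(X_test, y_test))       # typically much lower
# A large gap between training and test scores is the classic sign of
# overfitting: the wavy curve has learned the noise, not just the pattern.
```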

What is Underfitting?

Underfitting is a direct result of high bias.
When a model performs poorly on both training data and new, unseen data, it's said to be underfitting.

Signs of Underfitting:

  • Low accuracy on training data

  • Low accuracy on testing data

  • Model is too simple (straight lines, not enough flexibility)

  • Fails to capture important patterns

What is the Ideal Model?

The goal is to build a model with both low bias and low variance.

  • Low bias: Understands the real pattern.

  • Low variance: Gives stable, reliable predictions even on new data.

Such a model captures the true relationship without being distracted by noise.


Quick Comparison: Simple vs. Complex Models

| Model Type | Bias | Variance | Performance on New Data |
| --- | --- | --- | --- |
| Simple (Linear) | High | Low | Underfits (misses patterns) |
| Complex (Wavy) | Low | High | Overfits (catches noise) |
| Balanced | Low | Low | Performs well consistently |
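
The table’s pattern can be reproduced with a small experiment: sweep the model’s complexity and compare training and test error. This is only a sketch; the quadratic data, noise level, and polynomial degrees are illustrative choices, not a recipe.

```python
# Sweep polynomial degree (model complexity) and watch train vs. test error.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(0, 5, size=(80, 1))
y = 0.5 * X.ravel() ** 2 + rng.normal(0, 0.5, size=80)   # true pattern is quadratic

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

for degree in [1, 2, 15]:                                # simple, balanced, complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# Typical outcome: degree 1 is high on both (underfits), degree 15 has the
# lowest training error but a worse test error (overfits), and degree 2,
# which matches the true shape, does best on the test set.
```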

ROC (Receiver Operating Characteristic) & AUC (Area Under the Curve)

What is ROC?

The ROC curve is a graph used to evaluate the performance of a classification model, especially in binary classification problems (Yes/No, 0/1, True/False).
It shows how well the model can separate the positive class from the negative class as the classification threshold changes.

How ROC Works:

The ROC curve is plotted between two important metrics:

| Axis | Meaning | Formula |
| --- | --- | --- |
| X-axis | False Positive Rate (FPR) | FPR = FP / (FP + TN) |
| Y-axis | True Positive Rate (TPR) | TPR = TP / (TP + FN) |

  • TPR (Sensitivity) tells us:
    “Of all actual positives, how many did we correctly classify as positive?”

  • FPR tells us:
    “Of all actual negatives, how many did we wrongly classify as positives?”
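
Here is a minimal sketch of those two formulas in code, using scikit-learn on a synthetic binary dataset. The data generator, the logistic regression classifier, and the 0.5 threshold are assumptions made just for illustration.

```python
# TPR and FPR at one threshold, then the full ROC curve via roc_curve.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]   # predicted probability of the positive class

# TPR and FPR at a single threshold (0.5), straight from the formulas above
tn, fp, fn, tp = confusion_matrix(y_test, scores >= 0.5).ravel()
print("TPR =", tp / (tp + fn), "FPR =", fp / (fp + tn))

# roc_curve sweeps the threshold for us and returns the whole curve
fpr, tpr, thresholds = roc_curve(y_test, scores)
print(len(thresholds), "thresholds evaluated")
```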

How to Read the ROC Curve:

  • Top-left corner (0,1) is the best possible spot:
    → 0% False Positives, 100% True Positives.

  • Diagonal line from (0, 0) to (1, 1) represents random guessing.
    If your ROC curve is close to this diagonal, your model is not better than guessing.

  • The higher the ROC curve (closer to top-left), the better the model.

What is AUC?

AUC stands for Area Under the ROC Curve.
It provides a single number to summarize the ROC curve’s performance.

How to Interpret AUC:

| AUC Score | Meaning |
| --- | --- |
| 1.0 | Perfect model |
| 0.9 – 1.0 | Excellent model |
| 0.8 – 0.9 | Good model |
| 0.7 – 0.8 | Fair model |
| 0.5 | Same as guessing |
| < 0.5 | Worse than guessing |

Higher AUC = Better model
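
As a small illustration, scikit-learn’s roc_auc_score computes this area directly from true labels and predicted scores. The handful of labels and scores below are invented numbers.

```python
# A tiny hand-made AUC example using scikit-learn's roc_auc_score.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                   # actual classes
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]  # model's predicted probabilities

print("AUC =", roc_auc_score(y_true, scores))       # 0.875 for these numbers
# Equivalently: the probability that a randomly chosen positive example
# gets a higher score than a randomly chosen negative one.
```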

Why ROC and AUC are Useful:

  1. Helps compare models at all thresholds, not just at 0.5

  2. Shows trade-off between sensitivity (TPR) and specificity (1 - FPR)

  3. AUC gives a single score to compare models easily

Example:

Imagine predicting if a patient has a disease.
We adjust the threshold for saying “Yes, disease”.
ROC shows how False Positives and True Positives change.
AUC tells us how good our model is overall.
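
A tiny sketch of that disease example, with made-up patients, made-up predicted probabilities, and three candidate thresholds, shows the trade-off directly:

```python
# Watch TPR and FPR change as the "Yes, disease" threshold moves.
import numpy as np

has_disease = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])                       # actual labels
p_disease   = np.array([0.9, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1, 0.8, 0.5, 0.05])  # model output

for threshold in [0.3, 0.5, 0.7]:
    predicted = p_disease >= threshold
    tp = np.sum(predicted & (has_disease == 1))
    fn = np.sum(~predicted & (has_disease == 1))
    fp = np.sum(predicted & (has_disease == 0))
    tn = np.sum(~predicted & (has_disease == 0))
    print(f"threshold {threshold}: TPR = {tp / (tp + fn):.2f}, FPR = {fp / (fp + tn):.2f}")

# Lowering the threshold catches more real cases (higher TPR) but also flags
# more healthy patients (higher FPR). The ROC curve traces this trade-off,
# and AUC summarizes it in one number.
```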

Written by

Sam Anirudh Malarvannan