📏 Model Evaluation Metrics in ML: From Accuracy to AUC

Tilak Savani

“A good model isn't just accurate — it's understood.”
Tilak Savani



🧠 Introduction

After training a model, we can’t just stop at accuracy — especially when dealing with imbalanced datasets or real-world problems like spam detection, medical diagnosis, or credit scoring.

This blog will help you understand, calculate, and apply the most important model evaluation metrics.


🧪 Confusion Matrix

A confusion matrix gives a full picture of model performance for classification tasks.

                     Predicted Positive      Predicted Negative
Actual Positive      True Positive (TP)      False Negative (FN)
Actual Negative      False Positive (FP)     True Negative (TN)
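
Note that scikit-learn's confusion_matrix orders binary 0/1 labels with the negative class first, so the returned array is [[TN, FP], [FN, TP]] (the mirror of the table above). A minimal sketch, with toy labels, for unpacking the four counts:

from sklearn.metrics import confusion_matrix

# Toy labels, purely for illustration
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

# For binary 0/1 labels, scikit-learn returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")  # TP=2, FP=0, FN=1, TN=2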

🎯 Classification Metrics

✅ Accuracy

    Accuracy = (TP + TN) / (TP + FP + TN + FN)

Good when classes are balanced.

🎯 Precision

    Precision = TP / (TP + FP)

How many predicted positives were actually correct?
High precision = fewer false positives.

🧲 Recall (Sensitivity)

    Recall = TP / (TP + FN)

How many actual positives did we catch?
High recall = fewer false negatives.

🔁 F1-Score

    F1 = 2 * (Precision * Recall) / (Precision + Recall)

Balances precision and recall. Great for imbalanced datasets.
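
As a quick sanity check, here is a small sketch that plugs made-up counts into the four formulas above by hand:

# Made-up counts, purely for illustration
tp, fp, fn, tn = 40, 10, 5, 45

accuracy  = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy={accuracy:.3f}")    # 0.850
print(f"Precision={precision:.3f}")  # 0.800
print(f"Recall={recall:.3f}")        # 0.889
print(f"F1={f1:.3f}")                # 0.842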


📉 ROC Curve & AUC

The ROC curve plots the True Positive Rate (recall) against the False Positive Rate as the classification threshold varies.
AUC (Area Under the Curve) summarizes how well the model separates the classes across all thresholds.

  • AUC = 1: perfect

  • AUC = 0.5: random guessing
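
Unlike the metrics above, AUC is computed from predicted scores (e.g. class probabilities), not hard 0/1 predictions. A minimal sketch with made-up scores:

from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted probabilities, purely for illustration
y_true   = [0, 0, 1, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

print("AUC:", roc_auc_score(y_true, y_scores))  # ~0.889

# roc_curve sweeps the threshold and returns the (FPR, TPR) pairs
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print("FPR:", fpr)
print("TPR:", tpr)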


📊 Regression Metrics

🔸 MAE (Mean Absolute Error)

    MAE = (1/n) * Σ |yᵢ - ŷᵢ|

Average absolute difference between predicted and actual values.

🔸 MSE (Mean Squared Error)

    MSE = (1/n) * Σ (yᵢ - ŷᵢ)²

Squares errors — penalizes large errors more.

🔸 RMSE (Root Mean Squared Error)

    RMSE = √MSE

Same units as the target variable, which makes it easy to interpret. The most commonly reported regression error metric.

🔸 R² Score (Coefficient of Determination)

    R² = 1 - (Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²)

Measures how much of the variance in the target is explained by the model.
Ranges from −∞ to 1. Closer to 1 = better fit.
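
All four regression formulas are easy to verify by hand with NumPy; here is a short sketch with toy numbers:

import numpy as np

# Toy actuals and predictions, purely for illustration
y_actual = np.array([3.0, 5.0, 2.5, 7.0])
y_pred   = np.array([2.8, 5.3, 2.9, 6.5])

mae  = np.mean(np.abs(y_actual - y_pred))   # 0.350
mse  = np.mean((y_actual - y_pred) ** 2)    # 0.135
rmse = np.sqrt(mse)                         # 0.367
r2   = 1 - np.sum((y_actual - y_pred) ** 2) / np.sum((y_actual - y_actual.mean()) ** 2)

print(f"MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}, R²={r2:.3f}")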


🧪 Python Code Example

To bring these concepts to life, let’s implement everything using real-world-style examples in Python with scikit-learn — the go-to ML library used by professionals.

We’ll evaluate both:

  • A classification task (e.g., spam detection)

  • A regression task (e.g., predicting house prices)

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, roc_auc_score,
    mean_absolute_error, mean_squared_error, r2_score
)

# Classification example (toy labels)
y_true_cls = [1, 0, 1, 1, 0]
y_pred_cls = [1, 0, 1, 0, 0]

print("Accuracy:", accuracy_score(y_true_cls, y_pred_cls))
print("Precision:", precision_score(y_true_cls, y_pred_cls))
print("Recall:", recall_score(y_true_cls, y_pred_cls))
print("F1 Score:", f1_score(y_true_cls, y_pred_cls))
print("Confusion Matrix:\n", confusion_matrix(y_true_cls, y_pred_cls))
# roc_auc_score expects predicted scores/probabilities rather than
# hard labels; see the ROC/AUC sketch above.

# Regression example (toy values)
y_true_reg = [3.5, 2.0, 4.5, 5.0]
y_pred_reg = [3.7, 2.1, 4.0, 5.3]

print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))
# The squared=False argument was removed in recent scikit-learn
# versions, so take the square root of MSE directly
print("RMSE:", mean_squared_error(y_true_reg, y_pred_reg) ** 0.5)
print("R² Score:", r2_score(y_true_reg, y_pred_reg))

✅ When to Use What

Task Type                       Use These Metrics
Classification (balanced)       Accuracy, F1-Score
Classification (imbalanced)     Precision, Recall, ROC-AUC
Regression                      RMSE, MAE, R² Score

🧩 Final Thoughts

Don’t stop at accuracy. Knowing how to evaluate your model using the right metric for the right problem is what makes you a smart ML engineer.

“A model’s worth lies in how it's measured.”
Tilak Savani


📬 Subscribe

If this helped you understand ML evaluation better, follow me on Hashnode for more blogs that combine math + code + practical wisdom.

Thanks for reading! 🙌
