📏 Model Evaluation Metrics in ML: From Accuracy to AUC


“A good model isn't just accurate — it's understood.”
— Tilak Savani
🧠 Introduction
After training a model, we can’t just stop at accuracy — especially when dealing with imbalanced datasets or real-world problems like spam detection, medical diagnosis, or credit scoring.
This blog will help you understand, calculate, and apply the most important model evaluation metrics.
🧪 Confusion Matrix
A confusion matrix gives a full picture of model performance for classification tasks.
|                 | Predicted Positive  | Predicted Negative  |
| --------------- | ------------------- | ------------------- |
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |
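To make the formulas below concrete, suppose a hypothetical classifier is evaluated on 20 samples and produces TP = 8, FP = 2, FN = 1, TN = 9. The worked examples in the next section reuse these counts.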
🎯 Classification Metrics
✅ Accuracy
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Good when classes are balanced.
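With the hypothetical counts above: Accuracy = (8 + 9) / (8 + 2 + 9 + 1) = 17/20 = 0.85.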
🎯 Precision
Precision = TP / (TP + FP)
How many predicted positives were actually correct?
High precision = fewer false positives.
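With the hypothetical counts above: Precision = 8 / (8 + 2) = 0.80.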
🧲 Recall (Sensitivity)
Recall = TP / (TP + FN)
How many actual positives did we catch?
High recall = fewer false negatives.
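With the hypothetical counts above: Recall = 8 / (8 + 1) ≈ 0.89.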
🔁 F1-Score
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Balances precision and recall. Great for imbalanced datasets.
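With the hypothetical counts above: F1 = 2 * (0.80 * 0.89) / (0.80 + 0.89) ≈ 0.84.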
📉 ROC Curve & AUC
The ROC Curve plots the True Positive Rate against the False Positive Rate across all classification thresholds.
AUC (Area Under the Curve) tells how well the model separates the classes.
AUC = 1: perfect
AUC = 0.5: random guessing
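Because the ROC curve is traced out by sweeping a threshold, AUC is computed from predicted scores or probabilities rather than hard labels. Here is a minimal sketch with scikit-learn; the labels and probabilities are made-up toy values:

from sklearn.metrics import roc_auc_score

# Made-up ground-truth labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1, 1]
y_scores = [0.10, 0.40, 0.35, 0.80, 0.90]

# Fraction of (positive, negative) pairs ranked correctly; prints ≈ 0.83 here
print("ROC-AUC:", roc_auc_score(y_true, y_scores))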
📊 Regression Metrics
🔸 MAE (Mean Absolute Error)
MAE = (1/n) * Σ |yᵢ - ŷᵢ|
Average absolute difference between predicted and actual values.
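For example, with hypothetical values y = [10, 20, 30] and predictions ŷ = [12, 18, 33]: MAE = (2 + 2 + 3) / 3 ≈ 2.33.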
🔸 MSE (Mean Squared Error)
MSE = (1/n) * Σ (yᵢ - ŷᵢ)²
Squares errors — penalizes large errors more.
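Same hypothetical values: MSE = (4 + 4 + 9) / 3 ≈ 5.67. The single error of 3 contributes more than half of the total.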
🔸 RMSE (Root Mean Squared Error)
RMSE = √MSE
Same units as the target variable, which makes it easy to interpret. One of the most commonly reported regression metrics.
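Same hypothetical values: RMSE = √5.67 ≈ 2.38, back in the units of y.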
🔸 R² Score (Coefficient of Determination)
R² = 1 - (Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²)
Measures how much of the variance in the target is explained by the model.
Ranges from −∞ to 1. Closer to 1 = better fit.
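Same hypothetical values: ȳ = 20, so R² = 1 − (4 + 4 + 9) / (100 + 0 + 100) = 1 − 17/200 ≈ 0.92.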
🧪 Python Code Example
To bring these concepts to life, let’s implement everything using real-world-style examples in Python with scikit-learn — the go-to ML library used by professionals.
We’ll evaluate both:
A classification task (e.g., spam detection)
A regression task (e.g., predicting house prices)
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
confusion_matrix, roc_auc_score,
mean_absolute_error, mean_squared_error, r2_score
)
# Classification example (toy labels)
y_true_cls = [1, 0, 1, 1, 0]
y_pred_cls = [1, 0, 1, 0, 0]

print("Accuracy:", accuracy_score(y_true_cls, y_pred_cls))
print("Precision:", precision_score(y_true_cls, y_pred_cls))
print("Recall:", recall_score(y_true_cls, y_pred_cls))
print("F1 Score:", f1_score(y_true_cls, y_pred_cls))
print("Confusion Matrix:\n", confusion_matrix(y_true_cls, y_pred_cls))
# ROC-AUC normally takes predicted probabilities; hard labels are used here for simplicity
print("ROC-AUC:", roc_auc_score(y_true_cls, y_pred_cls))
# Regression example (toy values)
y_true_reg = [3.5, 2.0, 4.5, 5.0]
y_pred_reg = [3.7, 2.1, 4.0, 5.3]

print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))
# RMSE is just the square root of MSE (works across scikit-learn versions)
print("RMSE:", mean_squared_error(y_true_reg, y_pred_reg) ** 0.5)
print("R² Score:", r2_score(y_true_reg, y_pred_reg))
✅ When to Use What
| Task Type                   | Use These Metrics          |
| --------------------------- | -------------------------- |
| Classification (Balanced)   | Accuracy, F1-Score         |
| Classification (Imbalanced) | Precision, Recall, ROC-AUC |
| Regression                  | RMSE, MAE, R² Score        |
🧩 Final Thoughts
Don’t stop at accuracy. Knowing how to evaluate your model using the right metric for the right problem is what makes you a smart ML engineer.
“A model’s worth lies in how it's measured.”
— Tilak Savani
📬 Subscribe
If this helped you understand ML evaluation better, follow me on Hashnode for more blogs that combine math + code + practical wisdom.
Thanks for reading! 🙌