📊 10 Metrics You’re Not Logging — But Absolutely Should


Gradient Descent Weekly — Issue #21
You’re logging loss, accuracy, maybe F1.
Cool.
But the stuff that breaks your model in prod?
That’s hidden in the metrics you forgot to track.
In this issue, we’re diving into 10 often-overlooked metrics that high-performing ML teams log religiously — because they want fewer incidents, faster iteration, and more trustworthy models.
Let’s expose the blind spots.
⚙️ Why Metrics Matter More Than You Think
Metrics aren’t just for dashboards. They’re your:
Early warning system
Debugging trail
Retraining signal
Business alignment translator
Justification for promotions and funding
But most teams only log model-centric metrics. The elite teams?
They log systemic, behavioral, and longitudinal metrics too.
🧩 The 10 Metrics You're Not Logging (But Should Be)
1. Data Coverage per Segment
📉 “Our model performs great… on users aged 30–45 from Tier 1 cities.”
✅ What to log:
% of test/train data per key segment (age, region, language, device, etc.)
Distribution drift over time
Class balance across groups
🎯 Why: Helps uncover model bias, lack of generalization, and potential fairness issues.
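A minimal sketch of what this can look like with pandas. Here train_df and prod_df are placeholder DataFrames and "region" is a stand-in segment column; swap in whatever segments matter to you:

```python
import pandas as pd

def segment_coverage(df: pd.DataFrame, segment_col: str) -> pd.Series:
    """Share of rows per segment value, as a fraction of the whole dataset."""
    return df[segment_col].value_counts(normalize=True).sort_index()

# Hypothetical DataFrames: training data vs. last week's production traffic
train_cov = segment_coverage(train_df, "region")
prod_cov = segment_coverage(prod_df, "region")

# Positive values = segments over-represented in prod relative to training
coverage_gap = prod_cov.subtract(train_cov, fill_value=0.0)
print(coverage_gap.sort_values())
```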
2. Confidence Calibration (Not Just Accuracy)
🤖 “Your model is 99% confident — and dead wrong.”
✅ What to log:
Reliability diagrams
Expected Calibration Error (ECE)
Average confidence on incorrect predictions (overconfidence under error)
🎯 Why: Overconfident wrong predictions in high-stakes use cases (healthcare, finance, fraud) are silent killers.
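If you only log one of these, make it ECE. A minimal sketch for binary classifiers, assuming y_prob is the predicted probability of the positive class:

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    """Binary ECE: confidence-vs-accuracy gap per bin, weighted by bin size."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    preds = (y_prob >= 0.5).astype(int)
    conf = np.where(preds == 1, y_prob, 1.0 - y_prob)   # confidence in the predicted class
    correct = (preds == y_true).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```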
3. Input Feature Null/NaN Rate
🚨 “Production data has 35% missing features. Your training data had 0%.”
✅ What to log:
Per-feature null rate (daily/weekly)
Time-series chart of NaN evolution
Impact of missingness on prediction quality
🎯 Why: Missing data = hidden drift. Catch it before your model quietly derails.
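A minimal sketch with pandas; todays_batch is a placeholder for whatever scoring batch you pull each day:

```python
import pandas as pd

def null_rate_report(batch: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per feature, worst offenders first."""
    return batch.isna().mean().sort_values(ascending=False)

# Emit one record per feature per day so you can chart NaN evolution over time
for feature, rate in null_rate_report(todays_batch).items():
    print(f"null_rate feature={feature} value={rate:.3f}")
```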
4. Inference Latency (P50, P95, P99)
🐢 “Model is accurate, but inference takes 8 seconds.”
✅ What to log:
Median and tail latency by endpoint/model version
Cold start vs warm latency
Per-hardware profile latency
🎯 Why: Speed is UX. Track it like uptime.
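A minimal sketch of wrapping inference with a timer and summarizing tail latency. model is a placeholder for anything exposing predict:

```python
import time
import numpy as np

latencies_ms: list[float] = []   # one entry per request

def timed_predict(model, features):
    """Call model.predict and record wall-clock latency in milliseconds."""
    start = time.perf_counter()
    result = model.predict(features)
    latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return result

def latency_summary(samples) -> dict:
    """P50 / P95 / P99 over the collected samples."""
    p50, p95, p99 = np.percentile(samples, [50, 95, 99])
    return {"p50_ms": p50, "p95_ms": p95, "p99_ms": p99}
```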
5. Feature Importance Over Time
🔄 “Why is this model suddenly obsessed with account_created_month?”
✅ What to log:
Top N feature importance per training iteration
Shifts in top contributors
SHAP value distributions by segment
🎯 Why: Detects model logic drift or bugs from changed feature distributions.
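A minimal sketch, assuming a scikit-learn-style model that exposes feature_importances_; each training run appends one record to a JSONL history you can diff later:

```python
import json
from datetime import datetime, timezone

def log_top_features(model, feature_names, top_n=10, path="feature_importance.jsonl"):
    """Append the top-N feature importances for this training run to a JSONL log."""
    importances = dict(zip(feature_names, model.feature_importances_))
    top = dict(sorted(importances.items(), key=lambda kv: kv[1], reverse=True)[:top_n])
    record = {"trained_at": datetime.now(timezone.utc).isoformat(), "top_features": top}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```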
6. Drift Score (Daily/Weekly)
🌊 “It’s not a data tsunami. It’s slow erosion.”
✅ What to log:
Population Stability Index (PSI)
JS Divergence / KL Divergence / KS Test
Drift score per feature
🎯 Why: Tracks how different production input is from your training baseline. Early signal for retraining.
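A minimal PSI sketch with numpy. Bin edges come from the training baseline; a common rule of thumb treats PSI above 0.2 as meaningful shift:

```python
import numpy as np

def psi(expected, actual, n_bins: int = 10) -> float:
    """Population Stability Index between a training baseline and production values."""
    edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf           # catch out-of-range production values
    exp_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    act_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    exp_frac = np.clip(exp_frac, 1e-6, None)        # avoid log(0) on empty bins
    act_frac = np.clip(act_frac, 1e-6, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))
```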
7. Model Version Adoption Rate
🚧 “We deployed the new model, but 80% of traffic is still hitting the old one.”
✅ What to log:
% of traffic per model version
Error rate delta across versions
Rollout progression chart
🎯 Why: Model upgrades are only real if traffic flows to them.
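A minimal sketch, assuming a request log with one row per prediction that records the serving model version and whether the request errored:

```python
import pandas as pd

# Hypothetical request log pulled from your serving layer
requests = pd.DataFrame({
    "model_version": ["v2", "v1", "v1", "v2", "v1"],
    "error":         [0,    0,    1,    0,    0],
})

adoption = requests["model_version"].value_counts(normalize=True)  # % of traffic per version
error_rate = requests.groupby("model_version")["error"].mean()     # error rate per version
print(adoption.to_dict(), error_rate.to_dict())
```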
8. Feedback Loop Latency
⏱ “How long does it take to get new labels after a prediction?”
✅ What to log:
Time between prediction → ground truth feedback
Label pipeline SLA
Retrain lag after data is labeled
🎯 Why: Determines how stale your model becomes over time. Critical in fraud, recommendation, and real-time systems.
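A minimal sketch, assuming you can join predictions to the ground-truth labels that arrive later via a shared prediction_id:

```python
import pandas as pd

# Hypothetical tables: what you predicted, and when the true label finally showed up
preds = pd.DataFrame({"prediction_id": [1, 2],
                      "predicted_at": pd.to_datetime(["2024-06-01", "2024-06-02"])})
labels = pd.DataFrame({"prediction_id": [1, 2],
                       "labeled_at": pd.to_datetime(["2024-06-04", "2024-06-09"])})

lag = preds.merge(labels, on="prediction_id")
lag["feedback_lag"] = lag["labeled_at"] - lag["predicted_at"]
# Typical vs. worst-case label delay
print(lag["feedback_lag"].median(), lag["feedback_lag"].quantile(0.95))
```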
9. Prediction Volatility
🌀 “Same input. Same model. Different outputs.”
✅ What to log:
Prediction deltas for repeated inputs
Model determinism checks
Stochasticity in ensemble/LLM output
🎯 Why: Useful for debugging flakiness, random seeds, and unreliable behavior.
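A minimal sketch: score the same input N times and log the spread. predict_fn is a placeholder for anything that returns a scalar score:

```python
import numpy as np

def prediction_volatility(predict_fn, features, n_runs: int = 20) -> dict:
    """Run the same input repeatedly and report the spread of the outputs."""
    outputs = np.array([predict_fn(features) for _ in range(n_runs)], dtype=float)
    return {
        "mean": float(outputs.mean()),
        "std": float(outputs.std()),
        "range": float(outputs.max() - outputs.min()),  # 0.0 for a fully deterministic model
    }
```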
10. Business Outcome Proxy
📈 “Did better predictions actually help revenue, engagement, conversion?”
✅ What to log:
Post-prediction user behavior (clicked, converted, churned?)
Uplift vs control group
Downstream KPI improvement
🎯 Why: This is the real metric that leadership cares about. And it helps prioritize what to improve.
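A minimal uplift sketch, assuming an experiment log with one row per user, a treatment/control assignment, and a downstream outcome:

```python
import pandas as pd

# Hypothetical experiment log joined from predictions and downstream events
events = pd.DataFrame({
    "group":     ["model", "control", "model", "control"],
    "converted": [1,        0,         1,        1],
})

rates = events.groupby("group")["converted"].mean()
uplift = rates["model"] - rates["control"]   # absolute uplift of the model arm over control
print(f"conversion uplift: {uplift:+.1%}")
```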
🧠 Bonus: Metrics You Can Skip (Until Needed)
| Metric | Skip it unless… |
| --- | --- |
| FLOPs or param count | You're optimizing for mobile or edge deployment |
| Fairness metrics | You're in a regulated domain |
| Gradients, activations | You're debugging deep learning internals |
| LLM token counts | You're optimizing cost or latency |
Don’t log everything. Log what matters to performance, trust, and cost.
🔚 Final Thoughts: Metrics = Muscle
You can’t improve what you don’t measure.
But worse — you can’t trust what you don’t track.
The difference between a hobby ML project and a production-grade system?
About 10 metrics that tell you what’s happening when no one is looking.
Log smarter. Not more, just better.
🔮 Up Next on Gradient Descent Weekly:
- Feature Stores: Do You Actually Need One?