📊 10 Metrics You’re Not Logging — But Absolutely Should

Bikram Sarkar

Gradient Descent Weekly — Issue #21

You’re logging loss, accuracy, maybe F1.
Cool.
But the stuff that breaks your model in prod?
That’s hidden in the metrics you forgot to track.

In this issue, we’re diving into 10 often-overlooked metrics that high-performing ML teams log religiously — because they want fewer incidents, faster iteration, and more trustworthy models.

Let’s expose the blind spots.

⚙️ Why Metrics Matter More Than You Think

Metrics aren’t just for dashboards. They’re your:

  • Early warning system

  • Debugging trail

  • Retraining signal

  • Business alignment translator

  • Justification for promotions and funding

But most teams only log model-centric metrics. The elite teams?
They log systemic, behavioral, and longitudinal metrics too.

🧩 The 10 Metrics You're Not Logging (But Should Be)

1. Data Coverage per Segment

📉 “Our model performs great… on users aged 30–45 from Tier 1 cities.”

✅ What to log:

  • % of test/train data per key segment (age, region, language, device, etc.)

  • Distribution drift over time

  • Class balance across groups

🎯 Why: Helps uncover model bias, lack of generalization, and potential fairness issues.
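
To make this concrete, here's a minimal sketch with pandas; the columns `age_band`, `region`, and `label` are hypothetical stand-ins for whatever segmentation keys and target your dataset actually uses:

```python
import pandas as pd

def segment_coverage(df: pd.DataFrame, segment_cols, label_col="label"):
    """Log the share of rows and the class balance for each key segment."""
    report = {}
    for col in segment_cols:
        # % of rows per segment value
        coverage = df[col].value_counts(normalize=True).mul(100).round(2)
        # class balance within each segment value
        balance = (
            df.groupby(col)[label_col]
              .value_counts(normalize=True)
              .unstack(fill_value=0)
              .round(3)
        )
        report[col] = {
            "coverage_pct": coverage.to_dict(),
            "class_balance": balance.to_dict(),
        }
    return report

# Toy data with hypothetical segment columns
df = pd.DataFrame({
    "age_band": ["18-29", "30-45", "30-45", "46+"],
    "region":   ["tier1", "tier1", "tier2", "tier3"],
    "label":    [1, 0, 1, 0],
})
print(segment_coverage(df, ["age_band", "region"]))
```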

2. Confidence Calibration (Not Just Accuracy)

🤖 “Your model is 99% confident — and dead wrong.”

✅ What to log:

  • Reliability diagrams

  • Expected Calibration Error (ECE)

  • Average confidence on incorrect predictions (overconfidence under error)

🎯 Why: Overconfident wrong predictions in high-stakes use cases (healthcare, finance, fraud) are silent killers.
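
Here's a minimal sketch of Expected Calibration Error for a binary classifier, assuming you have a predicted probability per example; the equal-width binning used here is the common convention, not the only option:

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: population-weighted gap between confidence and accuracy per bin."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    # Confidence of the predicted class for a binary model: max(p, 1 - p)
    confidence = np.where(y_prob >= 0.5, y_prob, 1 - y_prob)
    predictions = (y_prob >= 0.5).astype(int)
    correct = (predictions == y_true).astype(float)

    bins = np.linspace(0.5, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # weight the gap by the bin's share of samples
    return ece

# Overconfident toy example: high stated confidence, mediocre accuracy
print(expected_calibration_error([1, 0, 1, 0, 1], [0.95, 0.9, 0.92, 0.88, 0.4]))
```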

3. Input Feature Null/NaN Rate

🚨 “Production data has 35% missing features. Your training data had 0%.”

✅ What to log:

  • Per-feature null rate (daily/weekly)

  • Time-series chart of NaN evolution

  • Impact of missingness on prediction quality

🎯 Why: Missing data = hidden drift. Catch it before your model quietly derails.
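
One way this could look, assuming your daily inference inputs land in a pandas DataFrame (the `income`, `device`, and `age` columns are made up for illustration):

```python
import pandas as pd

def null_rate_report(df: pd.DataFrame) -> pd.Series:
    """Per-feature null rate (%) for one batch of production inputs."""
    return df.isna().mean().mul(100).round(2).sort_values(ascending=False)

# Hypothetical daily batch of inference inputs
batch = pd.DataFrame({
    "income": [52000, None, None, 81000],
    "device": ["ios", "android", None, "ios"],
    "age":    [34, 41, 29, 38],
})
daily_nulls = null_rate_report(batch)
print(daily_nulls)  # income 50.0, device 25.0, age 0.0

# Emit one record per feature per day so a time series builds up in your metric store
for feature, pct in daily_nulls.items():
    print({"metric": "null_rate_pct", "feature": feature, "value": pct})
```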

4. Inference Latency (P50, P95, P99)

🐢 “Model is accurate, but inference takes 8 seconds.”

✅ What to log:

  • Median and tail latency by endpoint/model version

  • Cold start vs warm latency

  • Per-hardware profile latency

🎯 Why: Speed is UX. Track it like uptime.
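
A rough sketch of how you might wrap any prediction callable to collect P50/P95/P99; `fake_model` is a placeholder for your real endpoint or model object:

```python
import time
import numpy as np

def timed_predict(predict_fn, batches):
    """Wrap an inference callable and collect per-call latency in milliseconds."""
    latencies_ms = []
    for batch in batches:
        start = time.perf_counter()
        predict_fn(batch)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return np.asarray(latencies_ms)

# Hypothetical stand-in for a real model: latency scales with batch size
def fake_model(batch):
    time.sleep(0.002 * len(batch))

latencies = timed_predict(fake_model, [[0] * n for n in (1, 1, 4, 1, 16)])
p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
print(f"P50={p50:.1f}ms  P95={p95:.1f}ms  P99={p99:.1f}ms")
```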

5. Feature Importance Over Time

🔄 “Why is this model suddenly obsessed with account_created_month?”

✅ What to log:

  • Top N feature importance per training iteration

  • Shifts in top contributors

  • SHAP value distributions by segment

🎯 Why: Detects model logic drift or bugs from changed feature distributions.
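
A sketch of what per-run importance logging could look like, using scikit-learn's `feature_importances_` as a stand-in for whatever attribution method (SHAP, permutation importance) you actually use; the feature names and `run_id` are hypothetical:

```python
import json
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def log_top_features(model, feature_names, top_n=5, run_id="run-001"):
    """Record the top-N feature importances for one training iteration."""
    importances = model.feature_importances_
    order = np.argsort(importances)[::-1][:top_n]
    record = {
        "run_id": run_id,
        "logged_at": time.time(),
        "top_features": {feature_names[i]: round(float(importances[i]), 4) for i in order},
    }
    print(json.dumps(record))  # ship this to your metrics/experiment store instead
    return record

# Toy training run with hypothetical feature names
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.2 * X[:, 3] > 0).astype(int)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
log_top_features(model, ["tenure", "account_created_month", "spend", "logins"], top_n=3)
```

Comparing these records run over run is what surfaces the "suddenly obsessed with account_created_month" moment early.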

6. Drift Score (Daily/Weekly)

🌊 “It’s not a data tsunami. It’s slow erosion.”

✅ What to log:

  • Population Stability Index (PSI)

  • JS Divergence / KL Divergence / KS Test

  • Drift score per feature

🎯 Why: Tracks how different production input is from your training baseline. Early signal for retraining.
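
Here's one common way to compute PSI from scratch with NumPy; the bin count and the 0.2 "investigate" threshold in the comment are conventions, not hard rules:

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a training baseline and production data."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges come from the training baseline so both distributions share bins
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid division by zero / log(0) on empty bins
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(42)
baseline = rng.normal(0, 1, 10_000)        # training distribution
production = rng.normal(0.3, 1.2, 10_000)  # shifted production distribution
print(f"PSI = {psi(baseline, production):.3f}")  # > 0.2 is a common 'investigate' threshold
```

Run this per feature on a daily or weekly schedule and log the score; the trend matters more than any single value.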

7. Model Version Adoption Rate

🚧 “We deployed the new model, but 80% of traffic is still hitting the old one.”

✅ What to log:

  • % of traffic per model version

  • Error rate delta across versions

  • Rollout progression chart

🎯 Why: Model upgrades are only real if traffic flows to them.
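
A tiny sketch assuming you can pull a request log with a model version and an error flag per prediction (both column names are hypothetical):

```python
import pandas as pd

# Hypothetical request log: one row per prediction request
requests = pd.DataFrame({
    "model_version": ["v1", "v1", "v2", "v1", "v2", "v1", "v1", "v1"],
    "is_error":      [0,    0,    1,    0,    0,    0,    1,    0],
})

# % of traffic per model version
adoption = requests["model_version"].value_counts(normalize=True).mul(100).round(1)
print(adoption)  # v1 75.0, v2 25.0: the "new" version isn't really adopted yet

# Error rate delta across versions
error_rates = requests.groupby("model_version")["is_error"].mean().mul(100).round(1)
print(error_rates)
```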

8. Feedback Loop Latency

⏱ “How long does it take to get new labels after a prediction?”

✅ What to log:

  • Time between prediction → ground truth feedback

  • Label pipeline SLA

  • Retrain lag after data is labeled

🎯 Why: Determines how stale your model becomes over time. Critical in fraud, recommendation, and real-time systems.
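
A sketch assuming you can join prediction logs to your label store on some request ID; the timestamps below are fabricated purely to show the calculation:

```python
import pandas as pd

# Hypothetical join of prediction logs with the label store
events = pd.DataFrame({
    "prediction_ts": pd.to_datetime(["2024-05-01 10:00", "2024-05-01 11:30", "2024-05-02 09:15"]),
    "label_ts":      pd.to_datetime(["2024-05-01 16:00", "2024-05-03 11:30", "2024-05-02 10:45"]),
})

# Hours between prediction and ground-truth arrival
events["feedback_lag_h"] = (events["label_ts"] - events["prediction_ts"]).dt.total_seconds() / 3600

print(events["feedback_lag_h"].describe()[["mean", "50%", "max"]])

# Alert if the median lag blows past your label-pipeline SLA (say, 24h)
sla_hours = 24
print("SLA breached:", events["feedback_lag_h"].median() > sla_hours)
```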

9. Prediction Volatility

🌀 “Same input. Same model. Different outputs.”

✅ What to log:

  • Prediction deltas for repeated inputs

  • Model determinism checks

  • Stochasticity in ensemble/LLM output

🎯 Why: Useful for debugging flakiness, random seeds, and unreliable behavior.
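
A minimal sketch: re-score the same payload N times and log the spread. `flaky_score` stands in for whatever model or LLM call you're actually checking:

```python
import numpy as np

def prediction_volatility(predict_fn, x, n_runs=20):
    """Re-run the same input and measure how much the output moves."""
    outputs = np.array([predict_fn(x) for _ in range(n_runs)])
    return {
        "mean": float(outputs.mean()),
        "std": float(outputs.std()),
        "range": float(outputs.max() - outputs.min()),
        "deterministic": bool(np.allclose(outputs, outputs[0])),
    }

# Hypothetical non-deterministic scorer (e.g., dropout left on, or sampling in an LLM)
rng = np.random.default_rng()
def flaky_score(x):
    return 0.7 + rng.normal(0, 0.05)

print(prediction_volatility(flaky_score, x={"user_id": 42}))
```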

10. Business Outcome Proxy

📈 “Did better predictions actually help revenue, engagement, conversion?”

✅ What to log:

  • Post-prediction user behavior (clicked, converted, churned?)

  • Uplift vs control group

  • Downstream KPI improvement

🎯 Why: This is the real metric that leadership cares about. And it helps prioritize what to improve.
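
A sketch of the simplest uplift readout, assuming you ran the new model as a treatment arm against a control; the `group` and `converted` columns are hypothetical:

```python
import pandas as pd

# Hypothetical experiment log: treatment saw the new model, control saw the old one
outcomes = pd.DataFrame({
    "group":     ["treatment"] * 5 + ["control"] * 5,
    "converted": [1, 0, 1, 1, 0,   0, 0, 1, 0, 0],
})

rates = outcomes.groupby("group")["converted"].mean()
uplift = rates["treatment"] - rates["control"]
print(rates)
print(f"Absolute conversion uplift: {uplift:.1%}")  # the number leadership actually asks about
```

In practice you'd want proper randomization and a significance test before claiming the win, but even this crude readout beats having no downstream signal at all.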

🧠 Bonus: Metrics You Can Skip (Until Needed)

| Metric | Delay it... |
| --- | --- |
| FLOPs or param count | Unless you're optimizing for mobile |
| Fairness metrics | Unless you're in regulated domains |
| Gradients, activations | Unless you're debugging deep learning internals |
| LLM token counts | Unless you're optimizing cost or latency |

Don’t log everything. Log what matters to performance, trust, and cost.

🔚 Final Thoughts: Metrics = Muscle

You can’t improve what you don’t measure.
But worse — you can’t trust what you don’t track.

The difference between a hobby ML project and a production-grade system?
About 10 metrics that tell you what’s happening when no one is looking.

Log smarter. Not more. But better.

🔮 Up Next on Gradient Descent Weekly:

  • Feature Stores: Do You Actually Need One?