📊 10 Metrics You’re Not Logging — But Absolutely Should


Gradient Descent Weekly — Issue #21
You’re logging loss, accuracy, maybe F1.
Cool.
But the stuff that breaks your model in prod?
That’s hidden in the metrics you forgot to track.
In this issue, we’re diving into 10 often-overlooked metrics that high-performing ML teams log religiously — because they want fewer incidents, faster iteration, and more trustworthy models.
Let’s expose the blind spots.
⚙️ Why Metrics Matter More Than You Think
Metrics aren’t just for dashboards. They’re your:
Early warning system
Debugging trail
Retraining signal
Business alignment translator
Justification for promotions and funding
But most teams only log model-centric metrics. The elite teams?
They log systemic, behavioral, and longitudinal metrics too.
🧩 The 10 Metrics You're Not Logging (But Should Be)
1. Data Coverage per Segment
📉 “Our model performs great… on users aged 30–45 from Tier 1 cities.”
✅ What to log:
% of test/train data per key segment (age, region, language, device, etc.)
Distribution drift over time
Class balance across groups
🎯 Why: Helps uncover model bias, lack of generalization, and potential fairness issues.
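A minimal sketch of what this can look like with pandas. Here train_df and prod_df are placeholder DataFrames and "region" is a stand-in segment column; swap in whatever segments matter to you:

```python
import pandas as pd

def segment_coverage(df: pd.DataFrame, segment_col: str) -> pd.Series:
    """Share of rows per segment value, as a fraction of the whole dataset."""
    return df[segment_col].value_counts(normalize=True).sort_index()

# Hypothetical DataFrames: training data vs. last week's production traffic
train_cov = segment_coverage(train_df, "region")
prod_cov = segment_coverage(prod_df, "region")

# Positive values = segments over-represented in prod relative to training
coverage_gap = prod_cov.subtract(train_cov, fill_value=0.0)
print(coverage_gap.sort_values())
```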
2. Confidence Calibration (Not Just Accuracy)
🤖 “Your model is 99% confident — and dead wrong.”
✅ What to log:
Reliability diagrams
Expected Calibration Error (ECE)
Average confidence on incorrect predictions (overconfidence under error)
🎯 Why: Overconfident wrong predictions in high-stakes use cases (healthcare, finance, fraud) are silent killers.
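If you only log one of these, make it ECE. A minimal sketch for binary classifiers, assuming y_prob is the predicted probability of the positive class:

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    """Binary ECE: confidence-vs-accuracy gap per bin, weighted by bin size."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    preds = (y_prob >= 0.5).astype(int)
    conf = np.where(preds == 1, y_prob, 1.0 - y_prob)   # confidence in the predicted class
    correct = (preds == y_true).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```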
3. Input Feature Null/NaN Rate
🚨 “Production data has 35% missing features. Your training data had 0%.”
✅ What to log:
Per-feature null rate (daily/weekly)
Time-series chart of NaN evolution
Impact of missingness on prediction quality
🎯 Why: Missing data = hidden drift. Catch it before your model quietly derails.
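A minimal sketch with pandas; todays_batch is a placeholder for whatever scoring batch you pull each day:

```python
import pandas as pd

def null_rate_report(batch: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per feature, worst offenders first."""
    return batch.isna().mean().sort_values(ascending=False)

# Emit one record per feature per day so you can chart NaN evolution over time
for feature, rate in null_rate_report(todays_batch).items():
    print(f"null_rate feature={feature} value={rate:.3f}")
```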
4. Inference Latency (P50, P95, P99)
🐢 “Model is accurate, but inference takes 8 seconds.”
✅ What to log:
Median and tail latency by endpoint/model version
Cold start vs warm latency
Per-hardware profile latency
🎯 Why: Speed is UX. Track it like uptime.
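A minimal sketch of wrapping inference with a timer and summarizing tail latency. model is a placeholder for anything exposing predict:

```python
import time
import numpy as np

latencies_ms: list[float] = []   # one entry per request

def timed_predict(model, features):
    """Call model.predict and record wall-clock latency in milliseconds."""
    start = time.perf_counter()
    result = model.predict(features)
    latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return result

def latency_summary(samples) -> dict:
    """P50 / P95 / P99 over the collected samples."""
    p50, p95, p99 = np.percentile(samples, [50, 95, 99])
    return {"p50_ms": p50, "p95_ms": p95, "p99_ms": p99}
```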
5. Feature Importance Over Time
🔄 “Why is this model suddenly obsessed with account_created_month?”
✅ What to log:
Top N feature importance per training iteration
Shifts in top contributors
SHAP value distributions by segment
🎯 Why: Detects model logic drift or bugs from changed feature distributions.
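A minimal sketch, assuming a scikit-learn-style model that exposes feature_importances_; each training run appends one record to a JSONL history you can diff later:

```python
import json
from datetime import datetime, timezone

def log_top_features(model, feature_names, top_n=10, path="feature_importance.jsonl"):
    """Append the top-N feature importances for this training run to a JSONL log."""
    importances = dict(zip(feature_names, model.feature_importances_))
    top = dict(sorted(importances.items(), key=lambda kv: kv[1], reverse=True)[:top_n])
    record = {"trained_at": datetime.now(timezone.utc).isoformat(), "top_features": top}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```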
6. Drift Score (Daily/Weekly)
🌊 “It’s not a data tsunami. It’s slow erosion.”
✅ What to log:
Population Stability Index (PSI)
JS Divergence / KL Divergence / KS Test
Drift score per feature
🎯 Why: Tracks how different production input is from your training baseline. Early signal for retraining.
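A minimal PSI sketch with numpy. Bin edges come from the training baseline; a common rule of thumb treats PSI above 0.2 as meaningful shift:

```python
import numpy as np

def psi(expected, actual, n_bins: int = 10) -> float:
    """Population Stability Index between a training baseline and production values."""
    edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf           # catch out-of-range production values
    exp_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    act_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    exp_frac = np.clip(exp_frac, 1e-6, None)        # avoid log(0) on empty bins
    act_frac = np.clip(act_frac, 1e-6, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))
```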
7. Model Version Adoption Rate
🚧 “We deployed the new model, but 80% of traffic is still hitting the old one.”
✅ What to log:
% of traffic per model version
Error rate delta across versions
Rollout progression chart
🎯 Why: Model upgrades are only real if traffic flows to them.
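A minimal sketch, assuming a request log with one row per prediction that records the serving model version and whether the request errored:

```python
import pandas as pd

# Hypothetical request log pulled from your serving layer
requests = pd.DataFrame({
    "model_version": ["v2", "v1", "v1", "v2", "v1"],
    "error":         [0,    0,    1,    0,    0],
})

adoption = requests["model_version"].value_counts(normalize=True)  # % of traffic per version
error_rate = requests.groupby("model_version")["error"].mean()     # error rate per version
print(adoption.to_dict(), error_rate.to_dict())
```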
8. Feedback Loop Latency
⏱ “How long does it take to get new labels after a prediction?”
✅ What to log:
Time between prediction → ground truth feedback
Label pipeline SLA
Retrain lag after data is labeled
🎯 Why: Determines how stale your model becomes over time. Critical in fraud, recommendation, and real-time systems.
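A minimal sketch, assuming you can join predictions to the ground-truth labels that arrive later via a shared prediction_id:

```python
import pandas as pd

# Hypothetical tables: what you predicted, and when the true label finally showed up
preds = pd.DataFrame({"prediction_id": [1, 2],
                      "predicted_at": pd.to_datetime(["2024-06-01", "2024-06-02"])})
labels = pd.DataFrame({"prediction_id": [1, 2],
                       "labeled_at": pd.to_datetime(["2024-06-04", "2024-06-09"])})

lag = preds.merge(labels, on="prediction_id")
lag["feedback_lag"] = lag["labeled_at"] - lag["predicted_at"]
# Typical vs. worst-case label delay
print(lag["feedback_lag"].median(), lag["feedback_lag"].quantile(0.95))
```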
9. Prediction Volatility
🌀 “Same input. Same model. Different outputs.”
✅ What to log:
Prediction deltas for repeated inputs
Model determinism checks
Stochasticity in ensemble/LLM output
🎯 Why: Useful for debugging flakiness, random seeds, and unreliable behavior.
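A minimal sketch: score the same input N times and log the spread. predict_fn is a placeholder for anything that returns a scalar score:

```python
import numpy as np

def prediction_volatility(predict_fn, features, n_runs: int = 20) -> dict:
    """Run the same input repeatedly and report the spread of the outputs."""
    outputs = np.array([predict_fn(features) for _ in range(n_runs)], dtype=float)
    return {
        "mean": float(outputs.mean()),
        "std": float(outputs.std()),
        "range": float(outputs.max() - outputs.min()),  # 0.0 for a fully deterministic model
    }
```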
10. Business Outcome Proxy
📈 “Did better predictions actually help revenue, engagement, conversion?”
✅ What to log:
Post-prediction user behavior (clicked, converted, churned?)
Uplift vs control group
Downstream KPI improvement
🎯 Why: This is the real metric that leadership cares about. And it helps prioritize what to improve.
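A minimal uplift sketch, assuming an experiment log with one row per user, a treatment/control assignment, and a downstream outcome:

```python
import pandas as pd

# Hypothetical experiment log joined from predictions and downstream events
events = pd.DataFrame({
    "group":     ["model", "control", "model", "control"],
    "converted": [1,        0,         1,        1],
})

rates = events.groupby("group")["converted"].mean()
uplift = rates["model"] - rates["control"]   # absolute uplift of the model arm over control
print(f"conversion uplift: {uplift:+.1%}")
```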
🧠 Bonus: Metrics You Can Skip (Until Needed)
| Metric | Skip it unless… |
| --- | --- |
| FLOPs or param count | You're optimizing for mobile or edge deployment |
| Fairness metrics | You're in a regulated domain |
| Gradients, activations | You're debugging deep learning internals |
| LLM token counts | You're optimizing cost or latency |
Don’t log everything. Log what matters to performance, trust, and cost.
🔚 Final Thoughts: Metrics = Muscle
You can’t improve what you don’t measure.
But worse — you can’t trust what you don’t track.
The difference between a hobby ML project and a production-grade system?
About 10 metrics that tell you what’s happening when no one is looking.
Log smarter. Not more, just better.
🔮 Up Next on Gradient Descent Weekly:
- Feature Stores: Do You Actually Need One?