3 reasons traces are better than metrics for debugging your application

Introduction

Today I'll talk about why traces are superior for understanding your system's state and investigating problems. Traces excel at debugging your application thanks to several inherent properties. The granularity of traces provides a distinctive advantage. Their rich cross-system context is critical for understanding issues in a world dominated by micro-services. And cardinality limitations hinder metrics' investigative power. Metrics are great for many other things, but they can't compete with traces for investigating problems.

Requests (traces) are more valuable than summarized artifacts (metrics)

When debugging an application issue or investigating system state, a request (trace) is a more useful granule than a metric. Ideally, you want to take a request and examine it carefully. You want to understand what came in, what went out, what system interactions took place, identify component failures, etc... Once you've identified interesting patterns, you want to zoom out to understand how widely a pattern applies. Perhaps you got a high latency alert. You dig in and realize calls to the Identity service are delayed. Then you can zoom out and see if all calls to said service are plagued by elevated latencies. Traces provide the perfect resolution for theory construction and hypothesis testing.

Metrics lack much of the fidelity and granularity traces provide. They compress many requests into a few buckets, destroying fidelity and leaving us with coarse information. You can get an alert about high latency and see the geographic zone, host, or a few other parameters, but that's as far as the metric will get you. Without traces it's anybody's best guess, or up to the team hero to magically interpret the graph, abracadabra'ing the root cause. As a retiring magician, I fondly recall the days of predicting root causes for one of our services. We had one particularly unreliable vendor. If the latency for certain endpoints went off, I didn't need to investigate further. I was pretty sure it was them! Newbies couldn't be expected to make the same deductions from the graphs. The metrics alone were insufficient. The same logic applies to custom metrics too! We want to make debugging accessible to everyone. It's better to have more context available for debugging: traces provide that granularity, metrics don't.
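To make that granularity gap concrete, here's a minimal sketch using OpenTelemetry's Python API. The service name, attribute keys, and the Identity call are hypothetical stand-ins, not anyone's real instrumentation; the point is that the histogram only keeps a handful of coarse tags, while the span keeps the full per-request context.

```python
import time

from opentelemetry import metrics, trace

tracer = trace.get_tracer("checkout-service")  # hypothetical service name
meter = metrics.get_meter("checkout-service")
identity_latency = meter.create_histogram(
    "identity.call.duration",
    unit="ms",
    description="Latency of calls to the Identity service",
)

def call_identity(user_id: str) -> dict:
    # Stand-in for a real HTTP call to the Identity service.
    return {"status": 200, "user_id": user_id}

def fetch_identity(user_id: str) -> dict:
    start = time.monotonic()
    # The span keeps per-request context: who called, which dependency, how it responded.
    with tracer.start_as_current_span("identity.get_user") as span:
        span.set_attribute("app.user_id", user_id)      # full fidelity, per request
        span.set_attribute("peer.service", "identity")
        result = call_identity(user_id)
        span.set_attribute("identity.response_code", result["status"])

    elapsed_ms = (time.monotonic() - start) * 1000
    # The metric compresses the same request into a bucket keyed by a few coarse tags.
    identity_latency.record(elapsed_ms, {"zone": "us-east-1", "endpoint": "get_user"})
    return result
```

Both signals come from the same request, but only the span lets you later pull up that exact request and see who triggered it and how the dependency behaved.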

Traces provide rich cross-system context

Understanding the full picture and the role of every service in processing a request is critical for debugging. Micro-services are a staple of cloud architectures today. Multiple services coordinate to satisfy a user's request. While they have great advantages, micro-services also introduce multiple points of failure. Maybe a service we depend on is broken. Maybe our service is broken. Possibly it's a cloud dependency that's degraded. Or all of those things! A trace captures the full picture of the entire transaction and its fulfillment in our system.
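As an illustration of how that full picture gets assembled, here's a minimal sketch of context propagation with OpenTelemetry's Python API and the requests library. The service names, span names, and downstream URL are made up; the key idea is that inject() copies the current trace context into the outgoing headers, so the downstream service's spans land in the same trace instead of starting a new one.

```python
import requests

from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("orders-service")  # hypothetical upstream service

def create_order(payload: dict) -> requests.Response:
    with tracer.start_as_current_span("orders.create") as span:
        span.set_attribute("order.item_count", len(payload.get("items", [])))

        # Propagate the current trace context (e.g. the W3C traceparent header)
        # so the Payments service's spans join this transaction's trace.
        headers: dict = {}
        inject(headers)

        return requests.post(
            "https://payments.internal/api/charge",  # hypothetical downstream URL
            json=payload,
            headers=headers,
            timeout=5,
        )
```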

With great tools, we can interrogate the entire web of dependencies to determine the root cause. How many callers are impacted by a failing dependency on our side? Is a single consumer causing the high latency calls? What's the latency of requests that call our Payments micro-service? What's the user impact? Let's group the errors by user ID. Is there a problematic service upstream? Is our public API degraded? There are so many questions we can ask! A trace organically assembles the context required to get answers.

Metrics do provide some cross-system context, but it's not the same. Many APM providers offer standard service metrics (e.g. DataDog APM provides latency, errors, throughput, etc...). Cloud components also expose a trove of metrics (e.g. caches, databases, queues, etc...). Outside of these well-established metrics, teams add custom metrics to their services. Those team metrics must first be discovered before they can be correlated against other metrics. Even then, differences in tags mean we aren't comparing apples to apples. With cloud components, there are no guarantees all teams use the same building blocks (e.g. different databases, Lambda vs EC2, Python vs Ruby, etc...). With metrics, we have to carefully and surgically construct a picture of the system from arbitrary puzzle pieces. Traces, by virtue of their structure and their spans, provide a detailed picture ready for use in investigations. Even when there are differences between services, the visual structure provides more value than metrics do.

Metrics place hard limits on what you can investigate

When pursuing a root cause, everything's a suspect until proven innocent. We might want to verify whether a traffic surge originates from one user or many. We've had our API attacked by a single user before. We may want to determine whether elevated invoice generation latency is specific to a particular invoice, customer, business, or any other domain entity. I've experienced login failures that only impacted users who logged in during a failed framework upgrade we later rolled back. There are no guarantees in the wild west called production. We must leave no stone unturned to discover root causes.

Metrics limit our investigations in that they prohibit certain kinds of questions. Specifically, questions involving high cardinality attributes. High cardinality attributes have many possible values. I mentioned several of them earlier. Many identifiers are high cardinality: the ID of a user hitting our API, an invoice ID, a business ID, or a customer ID. There are others too, like a timestamp. I mentioned the timestamp of users logging in; users unlucky enough to log in during a failed upgrade had a broken experience. All of these attributes are potential keys to solving an application mystery. Without support for high cardinality attributes, we can't ask targeted, granular questions.
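Traces sidestep that limit because span attributes can hold arbitrary, high cardinality values. Here's a minimal sketch, again assuming OpenTelemetry's Python API; the attribute keys and the invoice workflow steps are hypothetical, but any tracing backend that indexes span attributes would let you group or filter by them later.

```python
from opentelemetry import trace

tracer = trace.get_tracer("billing-service")  # hypothetical service name

def generate_invoice(invoice_id: str, customer_id: str, business_id: str) -> None:
    with tracer.start_as_current_span("billing.generate_invoice") as span:
        # High cardinality identifiers are fine on spans; a metrics backend would
        # either reject these as tags or explode its time-series count.
        span.set_attribute("app.invoice_id", invoice_id)
        span.set_attribute("app.customer_id", customer_id)
        span.set_attribute("app.business_id", business_id)

        render_pdf(invoice_id)         # hypothetical workflow steps
        send_to_customer(customer_id)

def render_pdf(invoice_id: str) -> None:
    ...

def send_to_customer(customer_id: str) -> None:
    ...
```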

Traces don't prohibit questions based on cardinality. I can slice and dice the data however I like, in support of whatever hypothesis I'm concocting. Maybe I'll group requests by user ID to understand the customer impact. Perhaps I'll break down subscription purchase latency by subscription to see if all purchases are equally impacted. Or I'll break down logins by user ID because I want to see whether any single user is experiencing pronounced degradation. Sure... with metrics I can ask some questions of coarse tags, but do I really want to be limited in what I can ask?

Conclusion

Closing out this one... traces are superior for understanding system state and investigating issues in modern, complex environments. Their granularity, rich cross-system context, and freedom from cardinality limitations make them invaluable for debugging and root cause analysis. Metrics have their place, but they fall short when it comes to the depth of insight required for thorough investigations. Traces provide a comprehensive, detailed view that empowers teams to ask the right questions, discover hidden patterns, and resolve issues more effectively.


Written by

Jean-Mark Wright

Jean-Mark is an articulate, customer-focused visionary with extensive industry experience encompassing front and back end work in all phases of software development. He lives "outside the box" in pursuit of architectural elegance and relevant, tangible solutions. He is a strong believer that team empowerment fosters genius and is well versed in working solo.