Concrete Applications of Purposeful Instrumentation


In our Purposeful Instrumentation blog post, we laid the groundwork for a more disciplined approach to observability. We argued that the goal isn't to merely amass data, but to cultivate high-quality telemetry signals – focusing on quality over quantity. The aim is to transform our experience during high-pressure incidents from frantically searching through a "dense thicket of irrelevant data" to confidently navigating a "well-lit path to the root cause."
Many of us have experienced the pitfalls of the "instrument everything" mantra. While well-intentioned, it often leads to an "overgrown jungle of telemetry data," where critical signals are drowned out by noise. Purposeful instrumentation, in contrast, encourages us to strategically gather the right data. This isn't just about digital decluttering; it yields tangible benefits: reduced noise, faster troubleshooting, and improved clarity and maintainability in our systems.
This post moves from philosophy to practice. We'll dive into concrete examples and techniques, showcasing how to apply purposeful instrumentation in real-world scenarios—from initial telemetry design to ongoing pipeline adjustments and even code-level optimizations.
Designing Telemetry with NASA's Rigor
When we think about systems operating under the most severe limitations, spacecraft telemetry, particularly from missions like NASA's Mars rovers, offers profound inspiration. The extreme constraints of space exploration—limited bandwidth, power, and processing capabilities—force engineers to meticulously justify and optimize every single bit of data transmitted. For observability engineers on Earth, even without such stark limitations, these practices offer invaluable lessons in cultivating efficiency.
Here are some key takeaways:
Data Type Optimization: Spacecraft systems often convert 64-bit floating-point numbers to 32-bit or even 16-bit integers. Sometimes, scaled integers (like centi-degrees Celsius) are used to preserve essential precision while drastically reducing data volume. For our enterprise systems, this prompts a critical question: Do we really need microsecond precision for every timer, or would seconds suffice for certain metrics, thereby reducing storage and processing overhead?
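To make that concrete, here is a small Go sketch (the helper names are ours, not from any flight software) that stores a temperature as centi-degrees Celsius in an int16 instead of a float64, keeping 0.01 °C resolution in a quarter of the space:
package main

import (
    "fmt"
    "math"
)

// encodeCentiC stores a Celsius reading as centi-degrees in an int16:
// 0.01 °C resolution, a range of roughly ±327 °C, and 2 bytes instead of 8.
func encodeCentiC(tempC float64) int16 {
    return int16(math.Round(tempC * 100))
}

// decodeCentiC converts the packed value back to degrees Celsius.
func decodeCentiC(v int16) float64 {
    return float64(v) / 100
}

func main() {
    raw := 23.4567
    packed := encodeCentiC(raw)
    fmt.Printf("raw=%v packed=%d decoded=%.2f\n", raw, packed, decodeCentiC(packed))
    // raw=23.4567 packed=2346 decoded=23.46
}
The same question applies to timers and durations in our own metrics: the narrower representation is often more than enough.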
Bit Packing and Enumerated Types: To save space, boolean flags and enumerated values with a limited set of states are often packed into smaller integer types on spacecraft. For example, 15 distinct safety checks might be encoded into a single 16-bit integer. This principle is directly applicable to software telemetry, particularly in how we design attributes to reduce cardinality and data volume. Instead of verbose string representations for statuses, can an enumerated integer suffice?
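A minimal Go sketch of the same idea (the function is hypothetical, not spacecraft code): fold a set of boolean safety checks into a single uint16 instead of emitting fifteen separate attributes:
package main

import "fmt"

// packChecks folds up to 16 boolean flags into one uint16,
// setting bit i when check i passed.
func packChecks(results []bool) uint16 {
    var packed uint16
    for i, ok := range results {
        if i >= 16 {
            break
        }
        if ok {
            packed |= 1 << i
        }
    }
    return packed
}

func main() {
    // e.g., power, thermal, comms, attitude
    checks := []bool{true, false, true, true}
    p := packChecks(checks)
    fmt.Printf("packed=%d (binary %04b)\n", p, p) // packed=13 (binary 1101)
}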
Configurable Data Collection: Spacecraft aren't static in their data collection. They possess "knobs" that allow operators to increase data verbosity for anomaly investigations, switching between "Brief records" for nominal operations and "Verbose records" when digging deeper. This mirrors the need in our systems for dynamic control over telemetry, perhaps adjusting log levels or sampling rates based on operational context rather than maintaining a constant, high-volume stream.
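In application code, that knob can be as simple as a log level operators can flip at runtime. A minimal Go sketch using the standard library's log/slog (the HTTP endpoint and its query parameter are illustrative assumptions, not a standard interface):
package main

import (
    "log/slog"
    "net/http"
    "os"
)

func main() {
    // level acts as the "brief vs. verbose records" knob; Info is the nominal default.
    var level slog.LevelVar
    logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: &level}))
    slog.SetDefault(logger)

    // Flip to Debug during an investigation, then back to Info, without a restart.
    http.HandleFunc("/debug/verbose", func(w http.ResponseWriter, r *http.Request) {
        if r.URL.Query().Get("on") == "true" {
            level.Set(slog.LevelDebug)
            slog.Debug("switched to verbose records") // now visible
        } else {
            level.Set(slog.LevelInfo)
        }
    })

    slog.Info("service started")
    if err := http.ListenAndServe(":8080", nil); err != nil {
        slog.Error("server stopped", "error", err)
    }
}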
Summary Data and Compression: Reporting small, high-level summary data packets independently from detailed diagnostic data products allows for quick operational decision-making. If summaries are nominal, large, detailed data products might even be discarded to save precious bandwidth. Lossless compression is also a standard practice, always balancing the CPU cost of compression/decompression against bandwidth savings.
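The same CPU-versus-bandwidth trade-off shows up in our exporters. As one example, the OTLP gRPC trace exporter in the OpenTelemetry Go SDK can be asked to gzip its payloads; a sketch, with the endpoint as a placeholder:
package main

import (
    "context"
    "log"

    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
)

func main() {
    ctx := context.Background()
    // Trade a little CPU for gzip-compressed payloads on the wire.
    exp, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("collector:4317"), // placeholder endpoint
        otlptracegrpc.WithInsecure(),
        otlptracegrpc.WithCompressor("gzip"),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer exp.Shutdown(ctx)
    // ... register exp with a TracerProvider as usual.
}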
The "Very Small Products" Problem: Interestingly, generating a multitude of tiny data products can be inefficient, consuming storage slots and impacting system performance, as was observed with the Mars 2020 rover's packetizer. This highlights the importance of batching and aggregation not just for network efficiency but also for processing and storage optimization within our telemetry pipelines. The OpenTelemetry Collector’s batch processor is a prime example of applying this principle.
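The same batching principle applies at the SDK level too, not only in the Collector: spans are normally buffered by a batch span processor instead of being exported one at a time. A Go sketch with illustrative batch settings:
package main

import (
    "context"
    "log"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
    ctx := context.Background()
    exp, err := otlptracegrpc.New(ctx)
    if err != nil {
        log.Fatal(err)
    }

    // Batch spans before export instead of emitting many tiny "products".
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exp,
            sdktrace.WithMaxExportBatchSize(512),     // illustrative value
            sdktrace.WithBatchTimeout(5*time.Second), // flush at least every 5s
        ),
    )
    defer tp.Shutdown(ctx)
    otel.SetTracerProvider(tp)
}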
These extreme examples from NASA underscore a fundamental discipline: diligently asking, "What data do I really need?" and "What is the cost versus the value?" This scrutiny is crucial for building sustainable and effective telemetry strategies, ensuring we're not just collecting data, but harvesting actionable insights.
Tuning Automatic Instrumentation for Precision with OpenTelemetry
OpenTelemetry's auto-instrumentation agents are a massive boon, offering broad telemetry coverage for popular libraries and frameworks with minimal upfront effort. It’s tempting to see this as "zero code, zero thought." However, this convenience doesn't absolve us from the need for purposeful configuration. Blindly enabling instrumentation for every conceivable library can quickly lead back to that "overgrown jungle of telemetry data," swamping your systems with noise and incurring unnecessary costs.
Review Default Configurations: Auto-instrumentation defaults are often tuned for maximum coverage, which might not align with your specific observability goals or the critical paths of your application. As Elena Kovalenko of Delivery Hero noted, unconfigured auto-instrumentation can generate extremely high cardinality and massive data volumes, potentially overloading collectors and backend systems. It’s vital to treat the default settings as a starting point, not a final destination.
Selectively Disable Unnecessary Instrumentation: Most OpenTelemetry auto-instrumentation agents allow for fine-grained control, enabling you to disable instrumentation for components that are irrelevant to your critical diagnostic paths or those known to produce excessive, low-value data.
- Concrete Example: Suppressing JDBC Telemetry: If your primary diagnostic focus is at the service interaction level, the verbose telemetry generated by JDBC instrumentation (tracing every database call) might be more noise than signal. With the OpenTelemetry Java agent, for instance, you can easily disable this by setting the environment variable OTEL_INSTRUMENTATION_JDBC_ENABLED=false. This targeted pruning ensures that resources aren't wasted collecting, processing, and storing data that doesn't contribute significantly to your understanding of system health.
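- Going Further with an Opt-In Model: When only a handful of libraries actually matter for your diagnostic paths, the Java agent can also be flipped to an opt-in model: disable everything by default and re-enable only what you need. The selection below is a sketch; check the agent's documentation for the exact instrumentation names supported by your version.
OTEL_INSTRUMENTATION_COMMON_DEFAULT_ENABLED=false
OTEL_INSTRUMENTATION_SPRING_WEBMVC_ENABLED=true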
Auto-instrumentation plants the seeds of visibility; purposeful configuration helps you cultivate the desired crop, ensuring a healthy yield of actionable insights rather than a field of weeds.
Optimizing Data Flow with the OpenTelemetry Collector: Pipeline Adjustments
The OpenTelemetry Collector is more than just a telemetry forwarder; it's a powerful, vendor-agnostic control plane. It’s a great place to implement purposeful telemetry strategies by filtering, sampling, enriching, and transforming data before it even reaches your observability backends. Let's look at how sophisticated organizations are leveraging the Collector.
eBay's Journey: Scaling Distributed Tracing with Cost Optimization
Handling telemetry at eBay's scale—ingesting 6.5 million spans per second—necessitates highly judicious instrumentation and aggressive optimization. They faced challenges with broken call chains due to context propagation issues and the difficulty of applying uniform sampling across APIs with vastly different traffic volumes.
Their approach to sampling evolved:
Initial Strategy: They started with head sampling at the client (e.g., 2% of requests) combined with parent-based sampling to ensure entire traces were captured if any part was sampled.
Adding Tail Sampling: After that, they employed a tail-sampling strategy to retain "interesting" traces—those with errors, high latency, or specific critical attributes—along with a baseline 1% of successful traces, storing these for 14 days. This allowed them to focus retention on the most valuable diagnostic data.
Evolving Tail Sampling with OTel Collector: Recognizing the significant memory and complexity challenges of performing in-memory tail sampling within the OpenTelemetry Collector for long-duration traces or requests spanning multiple clusters, eBay pivoted. They now leverage exemplars from metrics to identify traces of interest. These traces are then copied from a raw trace table to a sampled table after a 10-15 minute delay. This innovative, storage-based tail sampling approach demonstrates a mature balance between comprehensive diagnostic capability and cost control.
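For reference, the first stage of that evolution (client-side head sampling combined with parent-based decisions) maps directly onto samplers that ship with the OpenTelemetry SDKs. A minimal Go sketch, not eBay's actual code:
package main

import (
    "context"

    "go.opentelemetry.io/otel"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
    // Sample roughly 2% of new root traces; child spans follow the parent's
    // decision so that a sampled trace is captured end to end.
    sampler := sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.02))
    tp := sdktrace.NewTracerProvider(sdktrace.WithSampler(sampler))
    defer tp.Shutdown(context.Background())
    otel.SetTracerProvider(tp)
}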
TomTom's Centralized Control: Enforcing Governance and Flexibility
TomTom implemented a centralized OpenTelemetry Collector Service that acts as a gateway between their internal applications and various SaaS observability platforms. This central hub provides several advantages:
Governance and Standardization: It allows them to enforce authentication, manage general configurations like batching and encryption consistently, and, crucially, handle data enrichment and manipulation centrally.
Filtering and PII Redaction: They use the Collector's filter processor to drop noisy or irrelevant logs (e.g., from specific Kubernetes namespaces). For sensitive data, a combination of the transform and attributes processors is used to redact Personally Identifiable Information (PII) before telemetry leaves their trust boundary.
Telemetry Enrichment: Data is enriched with valuable metadata, such as an "owner" label, which provides better context during troubleshooting and improves accountability.
Strategic Benefits: This centralized model offers flexibility in switching telemetry backends, enforces data governance policies, and has proven critical for cost control and maintaining data quality at an enterprise scale.
These real-world examples illustrate the power of the OpenTelemetry Collector as a central point for cultivating telemetry quality.
Crafting Intentional Manual Instrumentation: OllyGarden’s Example
While auto-instrumentation provides breadth, manual instrumentation offers depth and precision. But even here, more isn't always better. A common pitfall is "over-spanning": creating an excessive number of highly granular spans for minor, sequential internal operations within a single logical unit of work. This can obscure the true flow of a request, add unnecessary overhead, and make traces harder to interpret—akin to "wandering aimlessly in the woods" instead of following a clear path. For example, a single logical onTraces operation might be fragmented into several child spans for processResourceSpans, cluttering the trace view and inflating span counts unnecessarily.
Here’s the original Go code we wrote and landed in production:
ctx, span := telemetry.Tracer().Start(ctx, "tendril.processResourceSpans")
defer span.End()
// Extract service information from resource
svcName := getResourceString(rs.Resource(), attrServiceName)
span.SetAttributes(attribute.String("service.name", svcName))
svcVersion := getResourceString(rs.Resource(), attrServiceVersion)
span.SetAttributes(attribute.String("service.version", svcVersion))
svcEnv := getResourceString(rs.Resource(), attrEnvironmentName)
span.SetAttributes(attribute.String("deployment.environment.name", svcEnv))
The Purposeful Solution: Leveraging Span Events
Instead of creating distinct child spans for every micro-step, it's often far more effective to consolidate these internal milestones as span events within a single, overarching span that represents the larger logical operation. This aligns with the core principle of choosing the "most effective signal type for your defined purpose." Logs provide detailed context for discrete occurrences, metrics track aggregatable trends, and traces show flow; span events offer a way to add rich, contextual markers to a span without creating new ones.
And here’s the code after the fine-tuning:
span := trace.SpanFromContext(ctx)
// Extract service information from resource
svcName := getResourceString(rs.Resource(), attrServiceName)
svcVersion := getResourceString(rs.Resource(), attrServiceVersion)
svcEnv := getResourceString(rs.Resource(), attrEnvironmentName)
span.AddEvent("processing resource spans for service", trace.WithAttributes(
attribute.String("service.name", svcName),
attribute.String("service.version", svcVersion),
attribute.String("deployment.environment.name", svcEnv),
))
Benefits of Using Span Events Over Excessive Child Spans:
Clearer Trace Representation: A single span with well-defined events provides a cleaner, more focused view of a component's internal workings within the context of the larger trace. This gives a "well-lit path" to understanding that component's behavior.
Reduced Overhead and Cost: Span events are generally lighter-weight than full spans. This translates to reduced data volume and consequently lower processing and storage costs in your observability backend.
Enhanced Context: Events, with their associated attributes, allow you to capture crucial details (e.g., input size, processing duration for a specific sub-task, success/failure flags) at precise points within the operation, without fragmenting the trace into many tiny pieces.
Conclusion: Towards Insightful and Economical Observability
Moving from indiscriminate data collection to purposeful software telemetry is more than an engineering exercise; it's a strategic imperative. It ensures that our substantial investments in observability deliver tangible business value—faster incident resolution, optimized performance, and controlled costs—rather than just overwhelming data lakes.
This journey of continuous cultivation is not a one-off task. It requires ongoing review, governance, and a feedback loop where insights from incidents, performance anomalies, and cost reports are fed back into your instrumentation design and data pipeline policies. As your systems evolve, so too must your telemetry strategy.
The guiding questions we discussed in our previous post remain your most valuable tools:
"What question are we trying to answer with this data?"
"What data do we truly need, and at what precision?"
"Why this specific signal type (metric, log, trace, event)?"
"How will this data actually be used and by whom?"
"And critically, what is its ongoing cost versus its value?"
By consistently applying this critical lens, engineering teams can cultivate an observability practice that is not only powerful and insightful but also sustainable and economically sound. This deliberate, adaptive, and insight-driven approach is the future of effective software observability. OllyGarden is committed to being a neutral and valuable partner in this ecosystem, helping you analyze, optimize, and manage your OpenTelemetry pipelines to harvest the richest insights efficiently.