Concrete Applications of Purposeful Instrumentation


In our Purposeful Instrumentation blog post, we laid the groundwork for a more disciplined approach to observability. We argued that the goal isn't to merely amass data, but to cultivate high-quality telemetry signals – focusing on quality over quantity. The aim is to transform our experience during high-pressure incidents from frantically searching through a "dense thicket of irrelevant data" to confidently navigating a "well-lit path to the root cause."
Many of us have experienced the pitfalls of the "instrument everything" mantra. While well-intentioned, it often leads to an "overgrown jungle of telemetry data," where critical signals are drowned out by noise. Purposeful instrumentation, in contrast, encourages us to strategically gather the right data. This isn't just about digital decluttering; it yields tangible benefits: reduced noise, faster troubleshooting, and improved clarity and maintainability in our systems.
This post moves from philosophy to practice. We'll dive into concrete examples and techniques, showcasing how to apply purposeful instrumentation in real-world scenarios—from initial telemetry design to ongoing pipeline adjustments and even code-level optimizations.
Designing Telemetry with NASA's Rigor
When we think about systems operating under the most severe limitations, spacecraft telemetry, particularly from missions like NASA's Mars rovers, offers profound inspiration. The extreme constraints of space exploration—limited bandwidth, power, and processing capabilities—force engineers to meticulously justify and optimize every single bit of data transmitted. For observability engineers on Earth, even without such stark limitations, these practices offer invaluable lessons in cultivating efficiency.
Here are some key takeaways:
Data Type Optimization: Spacecraft systems often convert 64-bit floating-point numbers to 32-bit or even 16-bit integers. Sometimes, scaled integers (like centi-degrees Celsius) are used to preserve essential precision while drastically reducing data volume. For our enterprise systems, this prompts a critical question: Do we really need microsecond precision for every timer, or would seconds suffice for certain metrics, thereby reducing storage and processing overhead?
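To make that concrete, here is a small Go sketch (the helper names are ours, not from any flight software) that stores a temperature as centi-degrees Celsius in an int16 instead of a float64, keeping 0.01 °C resolution in a quarter of the space:
package main

import (
    "fmt"
    "math"
)

// encodeCentiC stores a Celsius reading as centi-degrees in an int16:
// 0.01 °C resolution, a range of roughly ±327 °C, and 2 bytes instead of 8.
func encodeCentiC(tempC float64) int16 {
    return int16(math.Round(tempC * 100))
}

// decodeCentiC converts the packed value back to degrees Celsius.
func decodeCentiC(v int16) float64 {
    return float64(v) / 100
}

func main() {
    raw := 23.4567
    packed := encodeCentiC(raw)
    fmt.Printf("raw=%v packed=%d decoded=%.2f\n", raw, packed, decodeCentiC(packed))
    // raw=23.4567 packed=2346 decoded=23.46
}
The same question applies to timers and durations in our own metrics: the narrower representation is often more than enough.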
Bit Packing and Enumerated Types: To save space, boolean flags and enumerated values with a limited set of states are often packed into smaller integer types on spacecraft. For example, 15 distinct safety checks might be encoded into a single 16-bit integer. This principle is directly applicable to software telemetry, particularly in how we design attributes to reduce cardinality and data volume. Instead of verbose string representations for statuses, can an enumerated integer suffice?
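A minimal Go sketch of the same idea (the function is hypothetical, not spacecraft code): fold a set of boolean safety checks into a single uint16 instead of emitting fifteen separate attributes:
package main

import "fmt"

// packChecks folds up to 16 boolean flags into one uint16,
// setting bit i when check i passed.
func packChecks(results []bool) uint16 {
    var packed uint16
    for i, ok := range results {
        if i >= 16 {
            break
        }
        if ok {
            packed |= 1 << i
        }
    }
    return packed
}

func main() {
    // e.g., power, thermal, comms, attitude
    checks := []bool{true, false, true, true}
    p := packChecks(checks)
    fmt.Printf("packed=%d (binary %04b)\n", p, p) // packed=13 (binary 1101)
}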
Configurable Data Collection: Spacecraft aren't static in their data collection. They possess "knobs" that allow operators to increase data verbosity for anomaly investigations, switching between "Brief records" for nominal operations and "Verbose records" when digging deeper. This mirrors the need in our systems for dynamic control over telemetry, perhaps adjusting log levels or sampling rates based on operational context rather than maintaining a constant, high-volume stream.
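In application code, that knob can be as simple as a log level operators can flip at runtime. A minimal Go sketch using the standard library's log/slog (the HTTP endpoint and its query parameter are illustrative assumptions, not a standard interface):
package main

import (
    "log/slog"
    "net/http"
    "os"
)

func main() {
    // level acts as the "brief vs. verbose records" knob; Info is the nominal default.
    var level slog.LevelVar
    logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: &level}))
    slog.SetDefault(logger)

    // Flip to Debug during an investigation, then back to Info, without a restart.
    http.HandleFunc("/debug/verbose", func(w http.ResponseWriter, r *http.Request) {
        if r.URL.Query().Get("on") == "true" {
            level.Set(slog.LevelDebug)
            slog.Debug("switched to verbose records") // now visible
        } else {
            level.Set(slog.LevelInfo)
        }
    })

    slog.Info("service started")
    if err := http.ListenAndServe(":8080", nil); err != nil {
        slog.Error("server stopped", "error", err)
    }
}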
Summary Data and Compression: Reporting small, high-level summary data packets independently from detailed diagnostic data products allows for quick operational decision-making. If summaries are nominal, large, detailed data products might even be discarded to save precious bandwidth. Lossless compression is also a standard practice, always balancing the CPU cost of compression/decompression against bandwidth savings.
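The same CPU-versus-bandwidth trade-off shows up in our exporters. As one example, the OTLP gRPC trace exporter in the OpenTelemetry Go SDK can be asked to gzip its payloads; a sketch, with the endpoint as a placeholder:
package main

import (
    "context"
    "log"

    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
)

func main() {
    ctx := context.Background()
    // Trade a little CPU for gzip-compressed payloads on the wire.
    exp, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("collector:4317"), // placeholder endpoint
        otlptracegrpc.WithInsecure(),
        otlptracegrpc.WithCompressor("gzip"),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer exp.Shutdown(ctx)
    // ... register exp with a TracerProvider as usual.
}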
The "Very Small Products" Problem: Interestingly, generating a multitude of tiny data products can be inefficient, consuming storage slots and impacting system performance, as was observed with the Mars 2020 rover's packetizer. This highlights the importance of batching and aggregation not just for network efficiency but also for processing and storage optimization within our telemetry pipelines. The OpenTelemetry Collector’s batch processor is a prime example of applying this principle.
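The same batching principle applies at the SDK level too, not only in the Collector: spans are normally buffered by a batch span processor instead of being exported one at a time. A Go sketch with illustrative batch settings:
package main

import (
    "context"
    "log"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
    ctx := context.Background()
    exp, err := otlptracegrpc.New(ctx)
    if err != nil {
        log.Fatal(err)
    }

    // Batch spans before export instead of emitting many tiny "products".
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exp,
            sdktrace.WithMaxExportBatchSize(512),     // illustrative value
            sdktrace.WithBatchTimeout(5*time.Second), // flush at least every 5s
        ),
    )
    defer tp.Shutdown(ctx)
    otel.SetTracerProvider(tp)
}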
These extreme examples from NASA underscore a fundamental discipline: diligently asking, "What data do I really need?" and "What is the cost versus the value?" This scrutiny is crucial for building sustainable and effective telemetry strategies, ensuring we're not just collecting data, but harvesting actionable insights.
Tuning Automatic Instrumentation for Precision with OpenTelemetry
OpenTelemetry's auto-instrumentation agents are a massive boon, offering broad telemetry coverage for popular libraries and frameworks with minimal upfront effort. It’s tempting to see this as "zero code, zero thought." However, this convenience doesn't absolve us from the need for purposeful configuration. Blindly enabling instrumentation for every conceivable library can quickly lead back to that "overgrown jungle of telemetry data," swamping your systems with noise and incurring unnecessary costs.
Review Default Configurations: Auto-instrumentation defaults are often tuned for maximum coverage, which might not align with your specific observability goals or the critical paths of your application. As Elena Kovalenko of Delivery Hero noted, unconfigured auto-instrumentation can generate extremely high cardinality and massive data volumes, potentially overloading collectors and backend systems. It’s vital to treat the default settings as a starting point, not a final destination.
Selectively Disable Unnecessary Instrumentation: Most OpenTelemetry auto-instrumentation agents allow for fine-grained control, enabling you to disable instrumentation for components that are irrelevant to your critical diagnostic paths or those known to produce excessive, low-value data.
- Concrete Example: Suppressing JDBC Telemetry: If your primary diagnostic focus is at the service interaction level, the verbose telemetry generated by JDBC instrumentation (tracing every database call) might be more noise than signal. With the OpenTelemetry Java agent, for instance, you can easily disable this by setting the environment variable OTEL_INSTRUMENTATION_JDBC_ENABLED=false. This targeted pruning ensures that resources aren't wasted collecting, processing, and storing data that doesn't contribute significantly to your understanding of system health.
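- Going Further with an Opt-In Model: When only a handful of libraries actually matter for your diagnostic paths, the Java agent can also be flipped to an opt-in model: disable everything by default and re-enable only what you need. The selection below is a sketch; check the agent's documentation for the exact instrumentation names supported by your version.
OTEL_INSTRUMENTATION_COMMON_DEFAULT_ENABLED=false
OTEL_INSTRUMENTATION_SPRING_WEBMVC_ENABLED=true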
Auto-instrumentation plants the seeds of visibility; purposeful configuration helps you cultivate the desired crop, ensuring a healthy yield of actionable insights rather than a field of weeds.
Optimizing Data Flow with the OpenTelemetry Collector: Pipeline Adjustments
The OpenTelemetry Collector is more than just a telemetry forwarder; it's a powerful, vendor-agnostic control plane. It’s a great place to implement purposeful telemetry strategies by filtering, sampling, enriching, and transforming data before it even reaches your observability backends. Let's look at how sophisticated organizations are leveraging the Collector.
eBay's Journey: Scaling Distributed Tracing with Cost Optimization
Handling telemetry at eBay's scale—ingesting 6.5 million spans per second—necessitates highly judicious instrumentation and aggressive optimization. They faced challenges with broken call chains due to context propagation issues and the difficulty of applying uniform sampling across APIs with vastly different traffic volumes.
Their approach to sampling evolved:
Initial Strategy: They started with head sampling at the client (e.g., 2% of requests) combined with parent-based sampling to ensure entire traces were captured if any part was sampled.
Adding Tail Sampling: After that, they employed a tail-sampling strategy to retain "interesting" traces—those with errors, high latency, or specific critical attributes—along with a baseline 1% of successful traces, storing these for 14 days. This allowed them to focus retention on the most valuable diagnostic data.
Evolving Tail Sampling with OTel Collector: Recognizing the significant memory and complexity challenges of performing in-memory tail sampling within the OpenTelemetry Collector for long-duration traces or requests spanning multiple clusters, eBay pivoted. They now leverage exemplars from metrics to identify traces of interest. These traces are then copied from a raw trace table to a sampled table after a 10-15 minute delay. This innovative, storage-based tail sampling approach demonstrates a mature balance between comprehensive diagnostic capability and cost control.
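For reference, the first stage of that evolution (client-side head sampling combined with parent-based decisions) maps directly onto samplers that ship with the OpenTelemetry SDKs. A minimal Go sketch, not eBay's actual code:
package main

import (
    "context"

    "go.opentelemetry.io/otel"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
    // Sample roughly 2% of new root traces; child spans follow the parent's
    // decision so that a sampled trace is captured end to end.
    sampler := sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.02))
    tp := sdktrace.NewTracerProvider(sdktrace.WithSampler(sampler))
    defer tp.Shutdown(context.Background())
    otel.SetTracerProvider(tp)
}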
TomTom's Centralized Control: Enforcing Governance and Flexibility
TomTom implemented a centralized OpenTelemetry Collector Service that acts as a gateway between their internal applications and various SaaS observability platforms. This central hub provides several advantages:
Governance and Standardization: It allows them to enforce authentication, manage general configurations like batching and encryption consistently, and, crucially, handle data enrichment and manipulation centrally.
Filtering and PII Redaction: They use the Collector's filter processor to drop noisy or irrelevant logs (e.g., from specific Kubernetes namespaces). For sensitive data, a combination of the transform and attributes processors is used to redact Personally Identifiable Information (PII) before telemetry leaves their trust boundary.
Telemetry Enrichment: Data is enriched with valuable metadata, such as an "owner" label, which provides better context during troubleshooting and improves accountability.
Strategic Benefits: This centralized model offers flexibility in switching telemetry backends, enforces data governance policies, and has proven critical for cost control and maintaining data quality at an enterprise scale.
These real-world examples illustrate the power of the OpenTelemetry Collector as a central point for cultivating telemetry quality.
Crafting Intentional Manual Instrumentation: OllyGarden’s Example
While auto-instrumentation provides breadth, manual instrumentation offers depth and precision. But even here, more isn't always better. A common pitfall is "over-spanning": creating an excessive number of highly granular spans for minor, sequential internal operations within a single logical unit of work. This can obscure the true flow of a request, add unnecessary overhead, and make traces harder to interpret—akin to "wandering aimlessly in the woods" instead of following a clear path. For example, a single logical onTraces operation might be fragmented into several child spans for processResourceSpans, cluttering the trace view and inflating span counts unnecessarily.
Here’s the original Go code we wrote and landed in production:
ctx, span := telemetry.Tracer().Start(ctx, "tendril.processResourceSpans")
defer span.End()
// Extract service information from resource
svcName := getResourceString(rs.Resource(), attrServiceName)
span.SetAttributes(attribute.String("service.name", svcName))
svcVersion := getResourceString(rs.Resource(), attrServiceVersion)
span.SetAttributes(attribute.String("service.version", svcVersion))
svcEnv := getResourceString(rs.Resource(), attrEnvironmentName)
span.SetAttributes(attribute.String("deployment.environment.name", svcEnv))
The Purposeful Solution: Leveraging Span Events
Instead of creating distinct child spans for every micro-step, it's often far more effective to consolidate these internal milestones as span events within a single, overarching span that represents the larger logical operation. This aligns with the core principle of choosing the "most effective signal type for your defined purpose." Logs provide detailed context for discrete occurrences, metrics track aggregatable trends, and traces show flow; span events offer a way to add rich, contextual markers to a span without creating new ones.
And here’s the code after the fine-tuning:
span := trace.SpanFromContext(ctx)
// Extract service information from resource
svcName := getResourceString(rs.Resource(), attrServiceName)
svcVersion := getResourceString(rs.Resource(), attrServiceVersion)
svcEnv := getResourceString(rs.Resource(), attrEnvironmentName)
span.AddEvent("processing resource spans for service", trace.WithAttributes(
attribute.String("service.name", svcName),
attribute.String("service.version", svcVersion),
attribute.String("deployment.environment.name", svcEnv),
))
Benefits of Using Span Events Over Excessive Child Spans:
Clearer Trace Representation: A single span with well-defined events provides a cleaner, more focused view of a component's internal workings within the context of the larger trace. This gives a "well-lit path" to understanding that component's behavior.
Reduced Overhead and Cost: Span events are generally lighter-weight than full spans. This translates to reduced data volume and consequently lower processing and storage costs in your observability backend.
Enhanced Context: Events, with their associated attributes, allow you to capture crucial details (e.g., input size, processing duration for a specific sub-task, success/failure flags) at precise points within the operation, without fragmenting the trace into many tiny pieces.
Conclusion: Towards Insightful and Economical Observability
Moving from indiscriminate data collection to purposeful software telemetry is more than an engineering exercise; it's a strategic imperative. It ensures that our substantial investments in observability deliver tangible business value—faster incident resolution, optimized performance, and controlled costs—rather than just overwhelming data lakes.
This journey of continuous cultivation is not a one-off task. It requires ongoing review, governance, and a feedback loop where insights from incidents, performance anomalies, and cost reports are fed back into your instrumentation design and data pipeline policies. As your systems evolve, so too must your telemetry strategy.
The guiding questions we discussed in our previous post remain your most valuable tools:
"What question are we trying to answer with this data?"
"What data do we truly need, and at what precision?"
"Why this specific signal type (metric, log, trace, event)?"
"How will this data actually be used and by whom?"
"And critically, what is its ongoing cost versus its value?"
By consistently applying this critical lens, engineering teams can cultivate an observability practice that is not only powerful and insightful but also sustainable and economically sound. This deliberate, adaptive, and insight-driven approach is the future of effective software observability. OllyGarden is committed to being a neutral and valuable partner in this ecosystem, helping you analyze, optimize, and manage your OpenTelemetry pipelines to harvest the richest insights efficiently.