Observability 2.0: A Unified, High-Resolution Approach for Modern Software Development


Most of us have used tools like Grafana, ELK, and Jaeger at work to monitor and track the behavior of our applications. Traditional “Observability 1.0” refers to an approach to monitoring and understanding system performance based on three main data types: metrics, logs, and traces, commonly called the “three pillars”. This model relies on separate tools for each data type, leading to siloed information and challenges correlating data across systems. Moreover, separate dashboards and disconnected data sources can make it hard to know what’s happening in your environment. In contrast, the modern approach, called “Observability 2.0”, focuses on high-resolution data unification: all telemetry data, from all pillars, is ingested as structured data by a single technical solution.
Curious why so many teams are shifting in this direction? When your system can store structured, raw data with high-cardinality fields, you can analyze behavior at a very granular level. This evolution brings new capabilities, like asking ad-hoc questions of your systems in real time, providing deeper insights than traditional monitoring could, and revealing “unknown unknowns”. As a developer, wouldn’t it be nice to see everything in one spot?
What is the core concept behind it?
The idea behind Observability 2.0 is capturing detailed event data and keeping it accessible for immediate and deeper inspection. Instead of sprinkling metrics, logs, and traces across several tools, all that information goes into one place. By doing this, teams remove barriers that previously made correlation a challenge. The result is a flexible data foundation that supports both real-time alerting and retrospective investigation. None of this could be done without the key concepts that stand behind it:
Structured High-Cardinality and High-Dimensional Data
Unified Data Store for Raw Events
Real-Time and Historical Analysis Capabilities
In the next couple of paragraphs, I will explain the constraints of traditional observability, show how the concepts above bring benefits, and walk through a real-life application as an example.
Limitations of Traditional Observability
Conventional methods frequently rely on basic aggregations and partial samples. These may mask unusual behaviors that could be critical to spot. Having data in different systems also makes it difficult to piece things together, especially when production goes off-track. Teams end up juggling separate dashboards and might lose valuable time trying to reassemble the bigger picture.
What benefits do the key concepts provide?
Here I would like to come back to the key concepts I mentioned earlier. Each of them has a purpose and brings benefits, and together they can be the starting point of a journey toward a new observability implementation.
Structured High-Cardinality and High-Dimensional Data
Unlike older monitoring tools that struggled with data explosion, Observability 2.0 prioritizes capturing as much detail as needed. Event data is stored in a structured way at whatever granularity you choose, enabling fine-grained filtering and analysis. Telemetry with a large number of unique values, like user IDs or session IDs (high-cardinality), and events enriched with many attributes or tags (high-dimensional data) give a comprehensive view of the system state. This rich context means engineers can differentiate even very similar events and slice data along virtually any dimension, e.g. filter events by a specific customer, feature tag, or event type, all without relying on pre-aggregated metrics. That capability is the cornerstone of Observability 2.0’s superpower.
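To make this concrete, here is a minimal sketch, in Python, of what emitting a single “wide”, structured event per request could look like. The field names and the payment scenario are illustrative assumptions, not code from the sample project:

```python
import json
import logging
import time

logger = logging.getLogger("payment-service")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def handle_payment(request: dict) -> None:
    start = time.monotonic()
    # ... business logic would run here ...
    # Emit one wide, structured event per request. High-cardinality fields
    # (user.id, session.id) and extra dimensions can be filtered on later
    # without any pre-aggregated metrics.
    event = {
        "timestamp": time.time(),
        "service.name": "payment-service",
        "event.type": "payment.processed",
        "user.id": request["user_id"],        # high-cardinality
        "session.id": request["session_id"],  # high-cardinality
        "payment.method": request.get("method", "card"),
        "payment.amount": request.get("amount", 0),
        "feature.flag": request.get("feature_flag"),
        "duration.ms": round((time.monotonic() - start) * 1000, 2),
    }
    logger.info(json.dumps(event))

handle_payment({"user_id": "u-8842", "session_id": "s-19af", "amount": 42.50})
```

Because every field travels with the event, you can later group or filter by user.id, feature.flag, or any other dimension without having defined a metric for it up front.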
Unified Data Store for Raw Events
Using a common standard for traces, logs, and metrics simplifies the software architecture. Tools like OpenTelemetry make it possible to create “one source of truth” for your systems, which means there is no longer any need to break data down into silos. By storing events in their raw, unaggregated form, the system can derive metrics, traces, and logs on the fly from the same dataset without switching between or integrating many tools. For metrics and traces, nothing stands in the way of storing the data as structured logs or spans. This unification simplifies data management and correlation: metric graphs, distributed trace views, and log search results all reference the same underlying events. Developers benefit from the elimination of context-switching between tools and formats, and it becomes easier to follow a chain of events across services, improving the ability to identify and resolve issues.
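As a rough sketch of what that single pipeline can look like on the application side, the snippet below configures the OpenTelemetry Python SDK to ship spans over OTLP; the same exporter pattern applies to logs and metrics. The localhost:4317 collector address is an assumption, not something mandated by the stack:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# One pipeline: everything is exported as OTLP to a single collector
# (assumed to listen on localhost:4317) and lands in one backend.
resource = Resource.create({"service.name": "payment-service"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# A span is just another structured event; metrics and log views can be
# derived from the same underlying data downstream.
with tracer.start_as_current_span("process-payment") as span:
    span.set_attribute("user.id", "u-8842")
    span.set_attribute("payment.amount", 42.50)
```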
Real-Time and Historical Analysis Capabilities
All the benefits mentioned earlier make it easy to implement real-time analysis. In the era of AI, such a data store can be integrated with LLM-based models to detect anomalies in the stream of data, and models can be continuously trained to discover new types of anomalies. This is a significant change compared to the traditional stack and its tedious, time-consuming manual searches across many tools.
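To illustrate what an ad-hoc question against raw events might look like, here is a sketch using the Elasticsearch Python client. The index pattern and field names (labels.customer_id, transaction.duration.us) are my assumptions and would need to match your own mapping:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Ad-hoc question asked directly against raw events: p95 latency per
# customer over the last 15 minutes, with no pre-built dashboard required.
response = es.search(
    index="traces-apm*",  # assumed index pattern
    size=0,
    query={"range": {"@timestamp": {"gte": "now-15m"}}},
    aggs={
        "per_customer": {
            "terms": {"field": "labels.customer_id", "size": 20},
            "aggs": {
                "latency_p95": {
                    "percentiles": {
                        "field": "transaction.duration.us",
                        "percents": [95],
                    }
                }
            },
        }
    },
)

for bucket in response["aggregations"]["per_customer"]["buckets"]:
    print(bucket["key"], bucket["latency_p95"]["values"]["95.0"])
```

Because nothing was aggregated away at ingest time, the same raw events can answer tomorrow’s questions as well.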
Most popular tools
A quick survey of the most popular tools in this space:
Honeycomb.io: Designed around event-based data for real-time debugging. It natively supports high-cardinality data.
Datadog: Offers a suite of integrations, from application performance monitoring (APM) to logs.
New Relic: Broad product range, including distributed tracing and dashboards.
Dynatrace: Full-stack observability platform with AI-assisted root-cause analysis.
Prometheus: Often used for time-series metrics, and can integrate with various exporters.
Elasticsearch: Commonly known for logs, but also used for storing and searching large amounts of event data.
EFK as an implementation of Observability 2.0
The EFK stack (Elasticsearch, Fluentd/Fluent Bit, Kibana) has evolved from primarily a log management solution into a comprehensive observability platform capable of handling metrics, logs, and traces, which makes it a good fit for implementing Observability 2.0. You can measure every aspect of your technology stack, from infrastructure to application metrics. Moreover, Elastic supports anomaly detection and is ready to integrate with AI. Here I would like to focus on the basic capabilities.
The basic components of a complete Observability 2.0 solution using the EFK stack typically include:
Elasticsearch - Core storage and search engine
Fluent Bit - Log collection and processing
Kibana - Visualization and dashboarding
Elastic APM - Application performance monitoring
OpenTelemetry Collector - Centralized collection and routing of telemetry data
OpenTelemetry integrations - Standardized telemetry instrumentation for applications
You can find the source code on my personal GitHub:
https://github.com/konradjed/observability-2.0
After setting up the infrastructure and instrumenting your services, you can use Kibana to create dashboards that combine metrics, logs, and traces:
APM Traces View: Navigate to Observability → APM to see service maps, traces, and performance metrics.
Logs Correlation: Kibana allows you to click on a trace and see logs correlated by trace ID.
Custom Dashboards: Custom dashboards in Kibana can combine any custom metrics your applications emit, giving you a powerful tool that serves not only the technical team but also the business.
To ensure logs are properly linked to traces, make sure your logs include:
trace.id - The OpenTelemetry trace ID
span.id - The span ID within the trace
service.name - The name of the service generating the log
This enables Elastic’s APM UI to show logs related to specific traces, providing context when troubleshooting issues.
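Here is a minimal sketch of how a Python service could attach those fields to its log lines, assuming the OpenTelemetry SDK is already configured and the logs are shipped to Elasticsearch as JSON (for example by Fluent Bit):

```python
import json
import logging

from opentelemetry import trace

logger = logging.getLogger("payment-service")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_with_trace_context(message: str) -> None:
    # Pull the active span context (all zeros if no span is active) and
    # attach the correlation fields Elastic uses to link this log line
    # to the surrounding trace.
    ctx = trace.get_current_span().get_span_context()
    logger.info(json.dumps({
        "message": message,
        "service.name": "payment-service",
        "trace.id": format(ctx.trace_id, "032x"),
        "span.id": format(ctx.span_id, "016x"),
    }))
```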
Real-Life application examples
Maintaining resilient, cloud-native, modern systems couldn’t be done without modern observability. The example I developed illustrates how to effectively monitor a heterogeneous microservices architecture using OpenTelemetry and the Elastic stack. The sample application, a payments system, is composed of multiple services written in different programming languages:
Python - A core payment-service responsible for processing transaction requests.
Java - A fee-calculator service that implements computation of transaction fees.
Node.js - A user-service that provides user data stored in a PostgreSQL database.
Think of OpenTelemetry as a universal instrument panel: whether you’re running a service in Python, Java, Go, or Node.js, you just plug in the SDK and all your logs, metrics, and traces stream into Elastic. This breaks down silos between teams and tech stacks—no more guessing what happens when a request hops from one language to another. With SDKs and auto-instrumentation libraries for nearly every major language, exporting telemetry data is as simple as adding a handful of configuration lines.
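As a small illustration, here is roughly what auto-instrumenting a Flask service looks like in Python; the /payments endpoint is a hypothetical stand-in, and the exporter setup from the earlier snippet (or the opentelemetry-instrument launcher) is assumed:

```python
# Assumes: pip install flask opentelemetry-sdk opentelemetry-instrumentation-flask
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)

# One line of auto-instrumentation: every incoming HTTP request now
# produces a span that flows through the configured SDK pipeline.
FlaskInstrumentor().instrument_app(app)

@app.route("/payments", methods=["POST"])
def create_payment():
    return {"status": "accepted"}, 202

if __name__ == "__main__":
    app.run(port=8080)
```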
I made a Postman collection to mimic real-world traffic, and to test resilience I deliberately knocked the user-service and fee-calculator offline and watched how the system healed itself. In the sections that follow, we’ll dive into key observability features: automatic service detection, performance dashboards, dependency graphs, error tracking, and more.
Service Detection and Building a Service Map
The Elastic service map is like a live network diagram that updates itself: once you instrument your code with OpenTelemetry, Elastic spots each service and plots the connections in real time. You’ll instantly see who’s talking to whom, how often, and where the slowdowns live. It’s a godsend for both developers and architects, whether you’re hunting down a sneaky bottleneck or planning an architecture refactor. You always get an up-to-date picture of your entire microservices landscape.
Elastic Observability automatically detects services instrumented with OpenTelemetry and constructs a dynamic service map. This visualization provides a comprehensive overview of the system architecture, highlighting the interactions between services. Such a map is invaluable for developers and architects to understand the system’s structure and identify potential bottlenecks or points of failure.
Overview of Service Performance Metrics
Imagine having a dashboard that tells you, at a glance, how fast your services are responding, how many requests they’re handling per second, and where errors are creeping in. I have built a few of these by hand, and it was always the same tedious, repetitive work. That’s precisely what Elastic delivers automatically: fine-grained response-time histograms, throughput charts, and error-rate trends for each service. Armed with this data, your team can spot performance hiccups early, before they impact users, and drill down to the exact code paths or dependencies that need attention.
Dependencies Overview
Think of the dependencies view as your system’s circuit wiring diagram: it instantly shows which services feed data downstream and which pull from upstream. When something breaks, you don’t have to guess which component caused the cascade; you can trace the failure’s path in seconds. This is a game-changer when you need to analyze the root cause of a failure and understand the true size of an outage.
Error Metrics and Logs
Imagine having a single command center where every service’s error counts and log entries funnel in. Moreover, OpenTelemetry metrics and traces let you run an advanced analysis of application performance, which is often the cause of an error. That’s what Elastic gives you: a centralized dashboard showing not just how often errors occur, but the exact log snippets and stack traces behind them. When an issue pops up, you can jump into the trace dashboard and analyze the cause. This kind of visibility is a lifesaver when you need to diagnose failures fast.
Service Map with Direct Calls
While the high-level service map gives you the big picture, Elastic also lets you zoom in to see exactly which services are calling each other—and how often. It’s like flipping from a city map to a street view. This kind of detail is incredibly useful when you’re trying to untangle inefficient request chains, debug chatty services, or just understand the real traffic patterns flowing through your system.
Log Filtering Per Service
In a busy system with dozens of services chatting away, digging through logs can feel like trying to find a signal in static. Elastic makes this a non-issue by letting you filter logs down to just the service you care about. Whether you’re chasing a bug in the payment API or tracking weird behavior in a background worker, you can zero in instantly—no more wading through noise from unrelated parts of the stack.
Tracing
If you’ve used tools like Jaeger or Zipkin before, you’ll feel right at home—but with more firepower. Elastic takes tracing a step further by tying together spans, logs, and metrics into a single, unified view. You can follow a request as it jumps across services, see exactly where latency creeps in, and pull the related logs without switching tabs. It’s like having X-ray vision for your distributed system—perfect for tracking down those elusive bottlenecks or unexpected slowdowns.
Infrastructure Metrics
It’s easy to forget that sometimes, the issue isn’t in your code—it’s in the box it’s running on. Elastic helps surface those problems by tracking infrastructure-level metrics like CPU load, memory usage, disk I/O, and network traffic. You can spot when a service is slow because the node it’s on is maxed out, or when a noisy neighbor is hogging resources. These insights give you the full picture, making it easier to fine-tune performance and plan capacity before things go sideways.
Custom Metrics and Dashboards
Elastic allows the creation of custom metrics and dashboards tailored to specific operational needs and key performance indicators (KPIs). This flexibility empowers teams to monitor aspects most relevant to their objectives and to visualize data in a manner that best supports decision-making.
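As a sketch of how a service could publish such a custom business metric with the OpenTelemetry metrics API (the metric names and the collector endpoint are my assumptions, not part of the sample project):

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Periodically export metrics over OTLP to an assumed local collector.
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True)
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("payment-service")
payments_counter = meter.create_counter(
    "payments.processed", unit="1", description="Number of processed payments"
)
payment_amount = meter.create_histogram(
    "payments.amount", unit="USD", description="Distribution of payment amounts"
)

# Record a business KPI with attributes you can later slice on in Kibana.
payments_counter.add(1, {"payment.method": "card"})
payment_amount.record(42.50, {"payment.method": "card"})
```

Once these values land in Elasticsearch, they can be charted in Kibana next to the built-in APM metrics.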
Log Explorer
Elastic’s Log Explorer offers a centralized interface for searching, filtering, and analyzing log data across your entire system. It enables users to quickly access logs from various sources without the need to log into individual servers or navigate through directories. This tool is especially beneficial in heterogeneous environments, providing a unified view of logs from services written in different languages and running on diverse platforms.
Conclusion
This implementation provides a comprehensive Observability 2.0 solution using the EFK stack, incorporating metrics, logs, and traces. The integration with OpenTelemetry ensures compatibility with the broader observability ecosystem while leveraging Elastic’s powerful search and visualization capabilities. By following this approach, you’ll have a complete view of your system’s health and performance, enabling faster troubleshooting and a better understanding of your applications’ behavior.
Future Trends in Observability
As the technology landscape grows more complex, observability is undergoing a transformative shift. No longer confined to traditional monitoring, today’s observability practices are expanding to encompass new dimensions that empower teams to operate faster, smarter, and more securely. Several key trends are shaping the future of observability, redefining how organizations build, maintain, and scale their systems.
Multi-Dimensional Observability
In the past, observability primarily focused on system performance metrics like latency, throughput, and error rates. However, organizations are increasingly recognizing the need for a more holistic view. Multi-dimensional observability integrates cost analysis, compliance tracking, and security monitoring alongside traditional performance data. By doing so, it fosters a collaborative environment where DevOps, SecOps, and FinOps teams can align their efforts. This cross-functional approach not only enhances operational efficiency but also ensures that systems are resilient, secure, and cost-effective from the ground up.
AI-Powered Autonomic Operations
The next major leap in observability is the shift towards AI-powered autonomic operations. In this paradigm, AI doesn’t just identify anomalies or predict potential failures—it actively remediates issues without human intervention. Machine learning algorithms analyze patterns, make decisions, and execute corrective actions in real time. This evolution reduces the burden on IT teams, minimizes downtime, and accelerates incident response, paving the way for self-healing systems that can operate at scale with minimal oversight.
Cost-Optimized Observability
As data volumes skyrocket, the cost of storing, processing, and analyzing telemetry data has become a significant concern. Organizations are now prioritizing cost-optimized observability strategies to manage expenses without compromising insights. Techniques like smarter data sampling, tiered storage solutions, and intelligent data retention policies are being widely adopted. These methods ensure that critical data remains readily accessible while less critical information is archived or discarded appropriately, striking a balance between cost-efficiency and operational visibility.
Closing Thoughts
The transition toward what many call “Observability 2.0” offers a compelling value proposition: faster incident resolution, deeper system insights, and a more unified approach to managing software lifecycles. By adopting a unified data model, teams can achieve consistency across monitoring, alerting, and analysis, resulting in smoother production releases and more effective troubleshooting. Organizations that embrace these emerging trends position themselves for greater agility and resilience in an increasingly complex digital world.
