Three Pillars of Observability

What is Observability?

Observability is the ability to understand a system’s internal state by analyzing the outputs it produces, such as logs, metrics, and traces. A system is considered observable when you can determine what's happening inside purely from this external data.

Modern service infrastructures are growing increasingly complex. In this environment, proactive monitoring alone is no longer enough to quickly detect and resolve application issues. Monitoring helps prevent known issues from recurring, but with intricate architectures, unknown and unpredictable failure modes can emerge. To handle such cases, systems must be designed for observability. An observable system offers highly granular insights and provides rich context about its internal operations — making it possible to uncover hidden issues and understand their root causes.

Monitoring enables failure detection, while observability provides a deeper understanding of system behavior. A common misconception among engineers is that monitoring and observability are entirely separate - in reality, observability is a broader concept, and monitoring is one of its key components. The goal of observability is not just to detect problems, but to pinpoint their origin and underlying causes. It is built on three core pillars: metrics, logs, and traces.

While these alone don’t make a system fully observable, they are essential tools that offer critical insights. Each pillar brings unique strengths - and limitations - to diagnosing and understanding complex system issues.

These three outputs (metrics, logs, and traces) form the three pillars of observability. Together, they enable proactive monitoring, rapid troubleshooting, and deep system understanding.

Metrics

Quantitative measurements of system performance, like CPU usage, request latency, or error rates. They’re aggregated over time, offering a high-level view of trends and patterns. Think of them as the dashboard gauges for your system - great for detecting anomalies or monitoring SLAs.

Example:

CPU usage: 75%
Requests per second: 120 rps
Error rate: 2%

Popular open-source tool: Prometheus
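
To make the idea of aggregation concrete, here is a minimal sketch of how raw request records could be rolled up into window-level metrics. It uses only the Python standard library and is not how Prometheus works internally; the Request type, the summarize function, and the sample data are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    timestamp_s: float  # seconds since the start of the window
    latency_ms: float
    ok: bool


def summarize(requests, window_seconds):
    """Aggregate raw request records into window-level metrics."""
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "requests_per_second": total / window_seconds,
        "error_rate": errors / total if total else 0.0,
        "avg_latency_ms": sum(r.latency_ms for r in requests) / total if total else 0.0,
    }


# Made-up sample: 100 requests over a 10-second window, 2 of them failing.
requests = [Request(i * 0.1, 20.0, ok=(i % 50 != 0)) for i in range(100)]
summary = summarize(requests, window_seconds=10)
print(summary)
```

The individual requests disappear in the summary, which is the trade-off metrics make: they are cheap to store and easy to alert on, but they cannot tell you which request failed.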

Logs

Detailed, timestamped records of events or actions in a system, like user logins, errors, or database queries. They’re granular and contextual, perfect for debugging or auditing specific incidents. Logs are like a system’s diary, capturing what happened and when.

Example:
"User 123 failed login at 14:03 due to invalid password"

Popular open-source tool: ELK Stack (Elasticsearch, Logstash, Kibana)
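
Logs are easier to search and correlate when they are emitted as structured records rather than free text. The sketch below shows one possible way to write the failed-login event as JSON using Python's standard logging module. The JsonFormatter class, the "auth" logger name, and the "context" field are assumptions made for this example, not part of any particular logging stack.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (illustrative only)."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # "context" is a hypothetical field passed through logging's `extra` argument.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


# Write to an in-memory stream so the example has no side effects.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("auth")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.warning(
    "login failed",
    extra={"context": {"user_id": 123, "reason": "invalid password"}},
)

entry = json.loads(stream.getvalue())
print(entry)
```

Because user_id and reason are separate fields, a log backend can filter on them directly instead of parsing them out of a sentence.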

Traces

Records of a request’s journey through a distributed system, showing each service it touches and the time spent. Traces are critical for identifying bottlenecks or failures in complex, microservice-based architectures - like a GPS tracking a package’s delivery route.

Example:
A request travels from service A → B → C, and you can see that service B took 3s while others took milliseconds.

Popular open-source tool: Jaeger
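
The A → B → C example can be modeled as a list of spans, one per service, each with a start and end time. The following is a simplified sketch under the assumption that each span records only the time spent in that service; real tracing systems such as Jaeger use nested parent and child spans. The Span type, the slowest_span helper, and the timings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Time spent in one service while handling one request (simplified)."""
    trace_id: str
    service: str
    start_s: float
    end_s: float

    @property
    def duration_s(self):
        return self.end_s - self.start_s


def slowest_span(spans):
    """Return the span with the longest duration in a trace."""
    return max(spans, key=lambda s: s.duration_s)


# Made-up timings: B takes 3 seconds, A and C take milliseconds.
trace = [
    Span("t1", "A", 0.000, 0.010),
    Span("t1", "B", 0.010, 3.010),
    Span("t1", "C", 3.010, 3.025),
]

bottleneck = slowest_span(trace)
print(bottleneck.service, round(bottleneck.duration_s, 3))
```

The shared trace_id is what ties the spans together; without it, the three timings would just be unrelated entries from three different services.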

Pyramid of “Happiness” for Observability

The "Pyramid of Happiness" in observability is a concept introduced by Alex King in a 2019 article on FAUN.pub, offering a structured, intuitive alternative to the traditional three pillars of observability (metrics, logs, and traces). It reframes observability as an upside-down funnel, guiding teams from raw data to actionable outcomes, with the goal of improving system reliability and, ultimately, team well-being. The term "happiness" reflects the reduced stress and proactive focus that come from fewer outages and better system insights.


The Pyramid of Happiness organizes observability data hierarchically:

Base (Events): The bottom layer captures a massive volume of raw events from systems - logs, metrics, and other signals. This is the "haystack" where finding the "needle" (the root cause of an issue) is challenging due to the sheer scale of data.
Middle (Correlation and Analysis): As you move up, tools aggregate and correlate events to provide context. This layer reduces noise, identifying patterns or anomalies (e.g., a spike in error logs tied to a specific service). It’s where raw data starts becoming actionable insights.
Top (Outcomes): The apex represents the desired outcome - clear, actionable feedback that pinpoints issues and guides fixes. This leads to reliable systems, fewer incidents, and a shift from reactive firefighting to proactive optimization.
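
As a rough illustration of the middle layer, the sketch below reduces a pile of raw events to a short list of services whose error count crosses a threshold. It is a toy correlation step, not a method taken from King's article; the event shape, the noisy_services function, and the threshold are assumptions chosen for the example.

```python
from collections import Counter

# Base layer: raw events (made-up sample data).
events = [
    {"service": "checkout", "level": "error"},
    {"service": "checkout", "level": "error"},
    {"service": "search", "level": "info"},
    {"service": "checkout", "level": "error"},
    {"service": "search", "level": "error"},
    {"service": "cart", "level": "info"},
]


def noisy_services(events, threshold):
    """Middle layer: count errors per service and keep those at or above the threshold."""
    counts = Counter(e["service"] for e in events if e["level"] == "error")
    return {service: n for service, n in counts.items() if n >= threshold}


# Top layer: a short, actionable answer instead of the full event stream.
flagged = noisy_services(events, threshold=3)
print(flagged)
```

Six events go in and one service comes out, which is the narrowing the pyramid describes: each layer discards noise so that the top contains only what someone needs to act on.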

Why "Happiness"?

In his article, King argues that effective observability directly improves team happiness. When systems are reliable:

Fewer outages mean fewer 3 a.m. panic calls.
Proactive analysis replaces reactive chaos, allowing teams to focus on optimization and innovation.
Reliability as a first-class concern fosters confidence in software design and deployment.

The concept of “happiness” resonates with DevOps principles, emphasizing feedback loops and continuous improvement.

Incident Management Flow

Whichever tooling you use, the principles are the same: consolidate events into an incident, flag the incident, and track progress through to service restoration.
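
The three steps (consolidate, flag, track to restoration) can be sketched as a small state machine. This is a minimal illustration, not the workflow of any specific incident management product; the Status values, the Incident type, and the rule of grouping events by service are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    """Hypothetical incident lifecycle, in order."""
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


@dataclass
class Incident:
    service: str
    events: list = field(default_factory=list)
    status: Status = Status.OPEN


def consolidate(events):
    """Group related events into one incident per service."""
    incidents = {}
    for event in events:
        incident = incidents.setdefault(event["service"], Incident(event["service"]))
        incident.events.append(event["message"])
    return incidents


def advance(incident):
    """Move an incident to the next lifecycle stage; RESOLVED is final."""
    order = list(Status)
    position = order.index(incident.status)
    if position < len(order) - 1:
        incident.status = order[position + 1]
    return incident.status


# Made-up events: two belong to the same service and become one incident.
events = [
    {"service": "checkout", "message": "HTTP 500 on /pay"},
    {"service": "checkout", "message": "payment timeout"},
    {"service": "search", "message": "index unavailable"},
]

incidents = consolidate(events)
checkout = incidents["checkout"]
advance(checkout)  # flagged: someone has acknowledged the incident
advance(checkout)  # service restored
print(checkout.status, incidents["search"].status)
```

Grouping by service is only one possible consolidation rule; a real system might also group by time window, error signature, or affected customer.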