Observability is a critical part of software engineering. I like to think about it in a medical context: the engineer is the doctor and the server, cluster or app is the patient. The patient doesn’t have to be sick, but she still needs check-ups to make sure she’s healthy and that any sign of ailment is tackled early on. The doctor (engineer) can also make recommendations for healthier living (functioning) based on the current state of the patient (application).

Before we proceed, I have a teaser question: who do you think should be in charge of observability in a system’s architecture? The DevOps Engineer? The Site Reliability Engineer? Or the Developer? This question has had different answers over time, but hold that thought; you’ll be deciding at the end of this piece.

Observability is composed of three pillars (hence the triangle reference): metrics, logs and traces. Together they answer the What, How and Why of the application’s telemetry data. Each one is explained further below:

Metrics & Monitoring

Metrics and monitoring deal with the analytical side of observability: collecting, analysing and presenting numeric measurements of the system. Metric/monitoring systems such as Prometheus, Graphite and InfluxDB track and query system details such as CPU usage, most-used endpoints, application response time and many more. This data is then typically visualised with a common visualisation tool, Grafana. Prometheus is likely the most widely used metrics/monitoring tool; it stores its data in a time series database (TSDB) and uses PromQL, its native query language, to query it. Prometheus can also be configured with alerting rules that fire when certain conditions are met, for example when a metric crosses a threshold. In the medical context, you can think of it as running heart rate or blood pressure check-ups on your patient (server/system) to read its vital signs.
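
To make the idea of a metric more concrete, here’s a minimal sketch in plain Python of a counter that renders itself in the Prometheus text exposition format, which is what Prometheus reads when it scrapes an application. The `Counter` class, the `http_requests_total` metric and the endpoints are my own illustrative assumptions; in a real service you would most likely use an existing client library rather than hand-rolling this.

```python
from collections import defaultdict


class Counter:
    """A monotonically increasing metric, keyed by label values."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = label_names
        self.values = defaultdict(float)

    def inc(self, *label_values, amount=1.0):
        """Increase the series identified by the given label values."""
        if len(label_values) != len(self.label_names):
            raise ValueError("expected one value per label")
        self.values[label_values] += amount

    def render(self):
        """Render all series in the Prometheus text exposition format.

        Simplification: label values are not escaped here.
        """
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for label_values, value in sorted(self.values.items()):
            labels = ",".join(
                f'{k}="{v}"' for k, v in zip(self.label_names, label_values)
            )
            lines.append(f"{self.name}{{{labels}}} {value}")
        return "\n".join(lines)


# Hypothetical metric and endpoints, for illustration only.
requests_total = Counter(
    "http_requests_total", "Total HTTP requests handled.", ["endpoint", "status"]
)

for endpoint, status in [("/login", "200"), ("/login", "200"), ("/cart", "500")]:
    requests_total.inc(endpoint, status)

print(requests_total.render())
```

Once a metric like this is scraped, a PromQL expression such as `rate(http_requests_total[5m])` would give the per-second request rate over the last five minutes, which is the kind of series you would then graph in Grafana or alert on.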

Logs

Logs are more event oriented: they are timestamped records of what happened in the system, used for debugging, compliance, auditing and more. Why is logging important? Well, if you’re a coder then there’s no need explaining :). Logging gives observability its most verbose, detailed view of what the application is doing.

A few popular stacks used for logging are ELK (Elasticsearch, Logstash and Kibana) and EFK (Elasticsearch, Fluentd or Fluent Bit, and Kibana). Logstash is a heavier, more feature-complete option than Fluent Bit, and which one to pick depends on your use case: Logstash is mostly used for Java-heavy infrastructure or advanced log filtering, while Fluent Bit is used in lightweight or containerized applications that can’t afford much operational overhead.
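
Log shippers like the ones above generally have an easier time with structured logs than with free-form text. Here’s a minimal sketch, using only Python’s standard `logging` module, that writes one JSON object per log line. The `checkout` logger name and the `request_id`/`user_id` fields are assumptions I made up for the example, and it writes to an in-memory stream where a real service would typically write to stdout or a file.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("request_id", "user_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(stream):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()  # stand-in for stdout or a log file
log = build_logger(stream)
log.info("payment accepted", extra={"request_id": "req-123", "user_id": 42})
log.error("payment gateway timeout", extra={"request_id": "req-124"})
print(stream.getvalue(), end="")
```

Because every line is valid JSON, fields such as `level` or `request_id` can be indexed and filtered on in Kibana without writing custom parsing rules.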

Traces

As the name implies, traces are the part of the trio that handles finding the origin of an error or fault; they’re basically the Sherlock Holmes of the observability triangle. A trace records the end-to-end journey of a single request as it flows through your application, showing how different services and components interact, from APIs to databases to queues in a microservice architecture.
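
To show what that “journey of a single request” looks like as data, here’s a minimal sketch of spans in plain Python. Every span shares the trace ID of the request and points at its parent span, which is what lets a tracing backend rebuild the call tree. The `span` helper, the record fields and the request flow are hypothetical names of mine, not the API of any tracing tool.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager

# The span that is currently active in this execution context, if any.
_current_span = contextvars.ContextVar("current_span", default=None)
finished_spans = []


@contextmanager
def span(name):
    """Open a span; nested spans inherit the trace ID of their parent."""
    parent = _current_span.get()
    record = {
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
        "start": time.perf_counter(),
    }
    token = _current_span.set(record)
    try:
        yield record
    finally:
        record["duration_s"] = time.perf_counter() - record["start"]
        _current_span.reset(token)
        finished_spans.append(record)


# Hypothetical request flow: API handler -> database -> queue.
with span("POST /orders"):
    with span("db.insert_order"):
        time.sleep(0.01)
    with span("queue.publish_order_created"):
        time.sleep(0.005)

for s in finished_spans:
    print(s["trace_id"][:8], s["name"], "parent:", s["parent_id"],
          f"{s['duration_s'] * 1000:.1f}ms")
```

Children finish before their parent, so the root span is recorded last; a tracing UI would sort these records by start time and parent ID to draw the familiar waterfall view.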

Why are traces important? Tracing has many benefits, from debugging to performance optimisation to root cause analysis.

There are a few tools that can be used for tracing, such as Jaeger, Zipkin and Grafana Tempo, and they vary in implementation: some ship with their own trace storage, while others rely on an external storage service such as Elasticsearch as the backend they query. Which one to use depends mostly on the use case, how large the application is and how much scaling is expected in the future.

And those are the three current components of the observability architecture. But before we can run any of them, we need to set up instrumentation.

Instrumentation

Instrumentation is basically the process of getting telemetry data out of your running application; this data then feeds every aspect of observability. It is the connection between your observability stack and your application; think of it as a plug.
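
Here’s a rough sketch of that “plug” in plain Python: a decorator that wraps a function so that every call emits a count (metric), a timing record (trace-like) and a log line. The `instrumented` decorator, the in-memory stores and the `divide` function are all hypothetical and only meant to show the shape of the idea, not how any particular tool does it.

```python
import functools
import logging
import time
from collections import Counter

call_counts = Counter()  # metric: calls per (function, outcome)
durations = []           # trace-like records: (function, seconds)
logger = logging.getLogger("instrumentation")


def instrumented(func):
    """Wrap func so each call emits a count, a timing record and a log line."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        outcome = "ok"
        try:
            return func(*args, **kwargs)
        except Exception:
            outcome = "error"
            logger.exception("%s failed", func.__name__)
            raise
        finally:
            # Runs on both success and failure, so nothing goes unrecorded.
            elapsed = time.perf_counter() - start
            call_counts[(func.__name__, outcome)] += 1
            durations.append((func.__name__, elapsed))
            logger.info("%s finished outcome=%s", func.__name__, outcome)

    return wrapper


@instrumented
def divide(a, b):  # hypothetical business function
    return a / b


divide(10, 2)
try:
    divide(1, 0)
except ZeroDivisionError:
    pass

print(dict(call_counts))
```

The business function itself stays untouched, which is the main appeal: the telemetry is plugged in around the code rather than scattered through it.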

Individually, you can collect data for any of the trio you’re looking to implement using their specific tools or frameworks, but a tool like OpenTelemetry provides instrumentation for all components of observability. Here’s a diagram that depicts the relationship, using several observability tools:

You can set up instrumentation with OpenTelemetry (AKA OTel) by using its native SDKs, which are available for languages such as Python, Java, Go, JavaScript/Node.js, Ruby, PHP, Rust, C#/.NET, etc. The SDK is used in-code to generate and export telemetry data.

So, back to my earlier question: who do you think should be responsible for observability? In my opinion, it should be a collective effort: the developer sets up instrumentation and provides support when needed, whereas the DevOps Engineer/SRE sets up the observability architecture. That’s just my opinion anyway; you can let me know yours in the comments. To get hands-on experience with observability tools, you can take on this short YouTube course. Ciao!

Isreal Hogan
