OpenTelemetry & Jaeger: Boosting Observability in Distributed Systems

André Hoffmann
12 min read

In my last article, I wrote about monitoring using Grafana and Prometheus. Today I would like to add tracing to the mix.

There are three types of data when it comes to monitoring:

  • Logging

  • Metrics

  • Tracing

We already have the first two types of data covered with Loki and Prometheus. For example, you can use Loki to view the logs of your container and use Prometheus to monitor the performance of a pod.

With the third type of data, trace data, it is possible to measure and compare timings across individual systems. While logs and metrics provide point-in-time or aggregated information, tracing offers an end-to-end view of requests flowing through distributed systems.

A trace consists of several so-called spans. Each span represents a single operation within a system – for example, an HTTP request, a database query or an API call. These spans are linked together so that you can understand how a request travels through different services. This makes it possible to identify bottlenecks, misbehavior, or latency problems in a very focused way.

A modern approach to tracing —and to observability in general— is OpenTelemetry. This is a CNCF project that aims to provide unified APIs and SDKs for collecting logs, metrics, and traces. OpenTelemetry supports many programming languages and integrates well with existing tools like Grafana Tempo, Jaeger, or Zipkin.
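
To make the span concept more tangible, here is a minimal sketch using the OpenTelemetry Python SDK (assuming the opentelemetry-sdk package is installed; the span names are made up for illustration). It creates one trace with a parent span for an incoming request and a nested child span for a database query, and prints the finished spans to the console:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to the console instead of sending them to a backend
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("shop-example")

# Parent span: the incoming HTTP request
with tracer.start_as_current_span("HTTP GET /checkout"):
    # Child span: a database query executed while handling the request
    with tracer.start_as_current_span("SELECT orders"):
        pass  # the actual work would happen here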

When combined with Grafana, trace data can be visualized clearly. It becomes especially useful when you connect traces to logs and metrics. For instance, you can go directly from a log entry to the related trace or examine the affected traces if there's a metric outlier. This linking of data sources is a powerful tool for error analysis and improving performance.
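
In Grafana, this jump from a log entry to the related trace is typically configured as a derived field on the Loki data source: a regular expression extracts the trace ID from the log line and turns it into a link to the tracing data source. A rough provisioning sketch could look like this (the regex, URL, and data source UID are assumptions and have to match your own setup):

# Derived field on the Loki data source linking log lines to Jaeger traces
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'
          url: '$${__value.raw}'
          datasourceUid: jaeger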

Practical example: Tracing in a microservices system

Let's imagine an e-commerce system consisting of several microservices:

  • Frontend service (takes orders)

  • Order service (processes the order)

  • Payment service (makes the payment)

  • Shipping service (arranges the shipment)

A customer clicks on "Buy now" in the frontend. This action triggers a series of HTTP requests between the services. With tracing, for example with OpenTelemetry and Grafana Tempo, you can see exactly how long each service takes to complete its task.

A typical trace might look like this:

Service            Duration (ms)   Description
Frontend-Service   20              Request is accepted
Order-Service      130             Validating order
Payment-Service    300             Credit card payment is processed
Shipping-Service   80              Creating a shipping order

In Grafana's trace view, these spans appear as horizontal bars on a timeline. This makes it easy to see at a glance that the payment service is the slowest part of the chain – a possible candidate for optimization.

Even better: If an error occurs, e.g. a timeout at the payment service, you can jump directly to the associated logs via the trace ID and analyze the exact error message there.
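
For this to work, the trace ID has to show up in the log lines in the first place. Many instrumentation libraries inject it automatically; as a minimal, hedged sketch in Python (the logger name and message are made up), it can also be added by hand:

import logging

from opentelemetry import trace

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payment-service")

# Take the trace ID of the currently active span and write it into the log
# line so that Loki/Grafana can link the log entry to the Jaeger trace
ctx = trace.get_current_span().get_span_context()
logger.error("payment timeout trace_id=%s", format(ctx.trace_id, "032x"))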

A visualization of the trace timeline for the described e-commerce microservices system would look like this:

The graph clearly shows that the payment service, at 300 ms, accounts for the largest share of the total time. Such visualizations help to quickly identify bottlenecks and optimize them in a targeted manner.

Tracing Tool Selection

As mentioned earlier, tools like Jaeger Tracing and Grafana Tempo are specifically designed to collect, store, and visualize distributed traces. Both tools work well with OpenTelemetry, which provides a standard way to collect trace data.

Jaeger offers an advanced user interface that allows you to analyze individual traces, filter for specific services or errors, and find performance bottlenecks.

Grafana Tempo is well integrated into the Grafana environment and allows you to link traces with logs and metrics, which is a big advantage for complete observability.

Both systems store trace data efficiently and can scale well. Tempo uses object storage (e.g., S3, GCS), while Jaeger supports various backends like Elasticsearch or Cassandra.

Jaeger Tracing Architecture

For my system I chose to use Jaeger Tracing because it has a large community and a user-friendly interface. Before I explain how I deployed Jaeger, I'll describe the architecture.

In Jaeger, you can choose between the all-in-one and the collector/query options. The all-in-one option provides in-memory storage, which is suitable for most development and testing purposes. For a production environment, the collector/query option is recommended because it uses external storage for persistence.

The architecture of Jaeger Tracing is designed to be modular and consists of several key components: the Jaeger Agent, the Jaeger Collector, backend storage solutions, the Jaeger Query, and the Jaeger user interface. Each of these components plays a crucial role in the tracing process, ensuring that trace data is collected, processed, stored, and presented effectively.

Within the application, client libraries such as OpenTelemetry are integrated. These libraries are responsible for generating spans and traces, which are essential for tracking the flow of requests through the application. Depending on the deployment setup of Jaeger Tracing, these spans and traces are sent either to the Jaeger Agent or directly to the Collector.

The Jaeger Agent acts as a lightweight network service, specifically designed to receive UDP packets containing trace data from various applications. Once the data is received, the Agent forwards it to the Jaeger Collector for further processing.

The Jaeger Collector is responsible for batching, validating, and transforming the incoming trace data. After processing, the Collector exports the data to a chosen storage backend. For storage, it is recommended to use robust solutions like Elasticsearch, Cassandra, or Kafka, as they provide reliable and scalable options for managing large volumes of trace data.
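
As a rough sketch of how that choice is typically made (not needed for the all-in-one setup I use below, and the Elasticsearch URL is a made-up example), the storage backend is selected via environment variables on the Collector and Query deployments:

# Environment variables on the Collector/Query containers selecting Elasticsearch
env:
  - name: SPAN_STORAGE_TYPE
    value: elasticsearch
  - name: ES_SERVER_URLS
    value: http://elasticsearch.observability.svc.cluster.local:9200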

The Jaeger Query component is tasked with retrieving traces from the storage backend. It uses an HTTP API to access the stored data, making it available for further analysis and visualization.

The Jaeger user interface (UI) is the final component in the architecture. It provides a visual representation of the traces retrieved via the HTTP API, allowing users to explore and analyze the trace data in detail.

To ensure data integrity and prevent any potential data loss between the Collector and the storage backend, Kafka can be employed as an intermediary. By using Kafka, data can be efficiently loaded into an external database through the Jaeger Ingester, providing an additional layer of reliability in the data processing pipeline.

Deploy Jaeger in K8s

To deploy Jaeger (all-in-one) in my Kubernetes cluster, I first added the Helm chart:

helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update

Then I created a separate namespace for Jaeger.

kubectl create ns jaeger

To use the Jaeger all-in-one variant, I needed to set allInOne.enabled to true. Additionally, I had to disable the Agent, Collector, Ingester, and Query components. Finally, I set provisionDataStore.cassandra to false and storage.type to memory.

To view all the default configurations provided by the chart, you can run the following command:

helm show values jaegertracing/jaeger > jaeger-values.yaml

It was also important to change the base path when enabling the Ingress so that I could use the pathType Prefix. Following the documentation, I adjusted the path with the argument --query.base-path=/jaeger. As a result, my Jaeger instance runs at http://hometown/jaeger, while my Grafana instance runs at http://hometown/grafana.

I entered these values in my values file:

provisionDataStore:
  cassandra: false
  elasticsearch: false
  kafka: false

# Overrides the image tag where default is the chart appVersion.
tag: ""

nameOverride: ""
fullnameOverride: "jaeger"

allInOne:
  enabled: true
  replicas: 1
  image:
    registry: ""
    repository: jaegertracing/all-in-one
    tag: ""
    digest: ""
    pullPolicy: IfNotPresent
    pullSecrets: []
  extraEnv: []
  extraSecretMounts:
    []
    # - name: jaeger-tls
    #   mountPath: /tls
    #   subPath: ""
    #   secretName: jaeger-tls
    #   readOnly: true
  # command line arguments / CLI flags
  # See https://www.jaegertracing.io/docs/cli/
  args:
  - "--query.base-path=/jaeger"
  ingress:
    enabled: true
    # For Kubernetes >= 1.18 you should specify the ingress-controller via the field ingressClassName
    # See https://kubernetes.io/blog/2020/04/02/improvements-to-the-ingress-api-in-kubernetes-1.18/#specifying-the-class-of-an-ingress
    ingressClassName: traefik
    annotations:
      kubernetes.io/ingress.class: traefik
    #   kubernetes.io/tls-acme: "true"
    labels:
      app: jaeger
    # Used to create an Ingress record.
    hosts:
      - hometown
    # tls:
    #   # Secrets must be manually created in the namespace.
    #   - secretName: chart-example-tls
    #     hosts:
    #       - chart-example.local
    pathType: Prefix

storage:
  # allowed values (cassandra, elasticsearch, grpc-plugin, badger, memory)
  type: memory

ingester:
  enabled: false

agent:
  enabled: false

collector:
  enabled: false

query:
  enabled: false

Then I deployed Jaeger in my cluster:

helm upgrade -i jaeger jaegertracing/jaeger -n jaeger -f jaeger-values.yaml

After Jaeger was deployed, I needed to adjust the Ingress to use the path /jaeger so that it does not overlap with other Ingress paths.

spec:
  ingressClassName: traefik
  rules:
  - host: hometown
    http:
      paths:
      - backend:
          service:
            name: jaeger-query
            port:
              number: 16686
        path: /jaeger
        pathType: Prefix

Now I can open the Jaeger UI at http://hometown/jaeger.
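
Alternatively, for a quick check without going through the Ingress, the UI can be reached with a port-forward to the query service (since the base path is set, the UI is then served under /jaeger as well):

kubectl port-forward -n jaeger svc/jaeger-query 16686:16686
# the UI is then available at http://localhost:16686/jaeger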

Example with OpenTelemetry

To thoroughly investigate an application using Jaeger, the first step is to instrument the application with OpenTelemetry. This process involves integrating OpenTelemetry into your application code to enable the collection of telemetry data. OpenTelemetry offers a demo application that is already instrumented, allowing developers to see how it functions in a practical setting.
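
Before turning to the demo, here is a hedged sketch of what manual instrumentation can look like in Python, exporting spans via OTLP to the Jaeger instance deployed above (the collector endpoint and the service name are assumptions; whether OTLP ingestion is enabled depends on the Jaeger version and chart settings):

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Export spans via OTLP/gRPC to the Jaeger deployment from above
resource = Resource.create({"service.name": "frontend-service"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="jaeger-collector.jaeger.svc.cluster.local:4317",
            insecure=True,
        )
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("buy-now"):
    ...  # business logic to be traced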

The OpenTelemetry Demo is a microservice-based application specifically designed to demonstrate the capabilities of OpenTelemetry in a real-world, distributed system environment. This demo application simulates an astronomy-themed online shop, which includes a variety of features typical of an e-commerce platform. Users can browse through a selection of products such as telescopes, star charts, and other space-related merchandise. Additionally, the application supports standard e-commerce operations like adding items to a shopping cart and proceeding through the checkout process.

What makes this demo particularly valuable is the instrumentation of each service within the application using OpenTelemetry. This instrumentation allows the application to generate detailed traces, metrics, and logs. These telemetry data provide insights into the application's performance and behavior, enabling developers to monitor and analyze how different services interact and perform under various conditions. By examining these traces, metrics, and logs, developers can gain a deeper understanding of the application's architecture and identify potential areas for optimization or troubleshooting.

Deploy OpenTelemetry demo in K8s

First, I added the demo to Helm:

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

Since the demo app includes its own Jaeger UI and Grafana, I needed to adjust my setup because I already have these tools running. I disabled the demo's built-in Jaeger, Grafana, and Prometheus to avoid conflicts. Then, I configured the demo to send telemetry data to my existing setup, ensuring smooth data collection and visualization. This way, I can use the demo's insights without disrupting my current infrastructure.

opentelemetry-collector:
  enabled: true

jaeger:
  enabled: false

prometheus:
  enabled: false

grafana:
  enabled: false

opensearch:
  enabled: true
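
The part that actually points the demo at my existing Jaeger is the exporter configuration of the bundled OpenTelemetry Collector. As a hedged sketch (the keys and the endpoint are assumptions that depend on the chart version, so compare them with the chart's default values), an explicit override can look roughly like this:

opentelemetry-collector:
  config:
    exporters:
      # send traces to the Jaeger collector service in the jaeger namespace
      otlp:
        endpoint: jaeger-collector.jaeger.svc.cluster.local:4317
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          exporters: [otlp]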

Next, I deployed the Helm chart with my values:

helm install otel-demo open-telemetry/opentelemetry-demo -f otel-demo.yaml -n jaeger

Once the demo is deployed, data from each microservice is sent to Jaeger. This data flow lets you monitor each microservice's performance and interactions. In Jaeger, microservices appear under 'services,' where you can view detailed traces.

When I click on "Find Traces," a list of traces for the selected service is displayed. Each trace represents a complete request flow through the microservices and shows how the components involved interact and perform. Exploring individual traces reveals the sequence of operations, the time taken for each step, and any errors, which makes it much easier to spot slow operations and troubleshoot issues.

To get detailed information about a trace, click on it at the top of the diagram or find it in the table below. The table lets you filter and sort traces to find the one you need. Selecting a trace shows its journey through the system, including each operation, time taken, and any errors.

In the given trace, the frontend service initiates a request to generate a list of recommendations. To achieve this, it sends a GET request to the ProductCatalogService to retrieve the necessary data. Each of these queries is executed efficiently, taking approximately 5 milliseconds, and they are processed in parallel to optimize performance and reduce latency.

You have multiple options to view and analyze this trace. You can explore it through various visual representations, such as a trace timeline, which provides a sequential view of events. Alternatively, you can examine it as a graph to understand the relationships and interactions between services. A flamegraph offers a visual representation of the trace's execution, highlighting the time spent in each operation. Additionally, you can delve into detailed statistics to gain insights into performance metrics, or review the spans table for a structured overview of each operation within the trace. For those who prefer raw data, the trace is also available in JSON format. Here is the trace represented as a flamegraph:
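
If you want the raw JSON outside the UI, the Jaeger Query service also exposes it over HTTP. As a hedged sketch (this endpoint is primarily used internally by the UI, and the trace ID below is only a placeholder), with the base path configured above the request looks roughly like this:

curl http://hometown/jaeger/api/traces/<trace-id>
# returns the selected trace, including all of its spans, as JSON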

Jaeger also allows you to compare individual traces with each other. To do this, you need the TraceIDs of the traces you want to compare. You can also select two traces in the trace table and then click on "Compare Traces."

By analyzing traces, Jaeger constructs a detailed system architecture of the application. This view helps developers understand component interactions, data flow, and operation sequences. It highlights each component's role and connections, aiding in identifying bottlenecks, understanding dependencies, and optimizing performance. The architecture diagram is a valuable tool for troubleshooting and improving system efficiency.

Hovering over the diagram shows the names of the services, and clicking on a service displays all the services that communicate directly with it.

Under the Directed Acyclic Graph (DAG) section, an architecture diagram shows the system's structure and service dependencies. It displays request flows and the number of requests between services, helping developers understand communication patterns and data flow.

In Grafana, we can also see that OpenTelemetry processes a substantial amount of data, and that the load generator, which simulates user activity to stress-test the system, uses a significant share of CPU resources. Monitoring these metrics alongside the traces shows how the system behaves under load and where there is room for optimization.

Conclusion

Enhanced observability through tracing offers significant benefits in understanding and optimizing microservices architectures. By adopting OpenTelemetry and Jaeger, organizations can gain a comprehensive view of their systems, allowing for precise identification of bottlenecks and performance issues. This leads to more efficient troubleshooting and improved system performance. The integration of tracing with logs and metrics provides a holistic approach to monitoring, making it easier to correlate data and gain insights into system behavior. Embracing these tools can greatly enhance the reliability and efficiency of distributed systems, ultimately leading to better user experiences and operational excellence.
