Observability Like a Pro: Using Prometheus, Grafana, Jaeger & EFK to Monitor Kubernetes

Pooja Manellore
14 min read

Observability: Understanding and Fixing System Issues

Observability is a critical aspect of modern infrastructure and application monitoring. It helps track the health of your internal systems, including application status, infrastructure resource usage, and network performance. But observability goes beyond monitoring: it doesn’t just tell you what happened, it also tells you why it happened and how to fix it.

For example, with observability, you can track:
Disk utilization over 24 hours
CPU and memory usage
Number of successful vs. failed requests

The Three Pillars of Observability

1️⃣ Metrics – Quantifies the state of the system over time.

  • Example: "In the last 30 minutes, 10 HTTP requests failed."

  • Helps track trends like CPU spikes, memory leaks, or traffic surges.

2️⃣ Logs – Provides detailed insights into what happened and why.

  • Example: By checking logs at 10:00 AM, you can identify which user made the request, which part of the application was accessed, and why it failed.

3️⃣ Traces – Maps the exact request path to find performance bottlenecks.

  • Example: A trace follows the request journey from the client → load balancer → front-end → back-end → database, identifying delays and failures along the way.

Why Observability Matters

Observability is not just about detecting failures; it’s about proactively preventing them. By leveraging metrics, logs, and traces, teams can quickly diagnose issues and implement fixes before they impact users.

🚀 With observability, you don’t just see the problem—you understand and solve it!

Monitoring vs. Observability: What’s the Difference?

There’s often confusion between monitoring and observability, but they are not the same. Understanding their differences is crucial for maintaining reliable and high-performing applications.

Monitoring: Watching the System 👀

Monitoring focuses on collecting predefined metrics, setting up alerts, and visualizing system health using dashboards. It helps answer "What is happening?"

✅ Example:

  • Tracking CPU utilization over the last 5 hours

  • Setting alerts when CPU usage exceeds a threshold

  • Creating dashboards for better visibility (e.g., using Prometheus + Grafana)
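For instance, the CPU-utilization example above can be expressed as a PromQL query against Prometheus’s HTTP API. The sketch below is illustrative only: it assumes node-exporter metrics, a Prometheus endpoint reachable on localhost:9090 (for example via the port-forward set up later in this guide), and GNU date; adjust to your setup:

# Average CPU utilization per node over the last 5 hours, sampled every 5 minutes
curl -s 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)' \
  --data-urlencode "start=$(date -d '5 hours ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=300'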

Observability: Understanding the System 🔍

Observability goes beyond monitoring by integrating three key pillars:
1️⃣ Metrics – Measure system health (e.g., request failure rate)
2️⃣ Logs – Explain why an issue occurred (e.g., error messages)
3️⃣ Traces – Show how requests travel across services (e.g., latency issues)

💡 Observability = Monitoring + Troubleshooting + Root Cause Analysis

Why Observability Matters for Your Business 🚀

Imagine you’ve launched an e-commerce website on an EKS cluster. A potential customer is comparing multiple platforms. To convince them, you highlight your Service Level Agreement (SLA):

99.9% availability → at most 0.1% downtime
10,000 requests at a time → 99.95% served within 30ms
At most 5 requests may fail → proactively detected and resolved

🔴 What happens when a request fails?
If you have observability, you can quickly diagnose the issue:
🔹 Check metrics to see system performance
🔹 Analyze logs to find the root cause
🔹 Use traces to track the exact request path

Without observability, your customers move to competitors, leading to revenue loss. That’s why companies must invest in observability—issues get detected early and resolved faster.

Developers & DevOps: A Collaborative Effort 🤝

  • Developers instrument the system with metrics, logs, and traces

  • DevOps engineers set up monitoring tools like Prometheus, Grafana, and Jaeger to track system health

🚀 Observability is the key to a reliable and scalable platform!

Understanding the Difference Between Metrics and Monitoring with Real-Life and IT Examples

Imagine a hospital scenario where a patient is undergoing treatment. The doctor instructs the nurse to record the patient's heartbeat and blood pressure at regular intervals to assess their health status. The nurse notes down:

  • 10:00 AM – Heartbeat: 87 bpm

  • 10:10 AM – Heartbeat: 90 bpm

Similarly, the nurse records blood pressure and other vital signs over time. When the doctor arrives, they review these notes to understand the patient’s health trends.

💡 Key Insight:

  • The recorded heartbeat and BP values are metrics – raw historical data that provides insights.

  • If the nurse doesn’t record these metrics, the doctor won’t have visibility into the patient’s health.

  • A monitoring system could automate this by collecting and displaying these metrics on a dashboard for better visualization.

  • Additionally, an alert system can be set up to notify the doctor and nurse immediately if the patient’s heartbeat drops dangerously low, enabling timely intervention before a critical situation arises.

IT Example: Metrics & Monitoring in AWS & Kubernetes

Now, let's apply the same concept to an IT environment. Suppose you deploy an application on AWS using EKS (Elastic Kubernetes Service) within an isolated VPC (Virtual Private Cloud). Misconfigurations can lead to performance issues, and to understand system health, you need metrics such as:

🔹 Infrastructure Metrics:

  • CPU and memory utilization of nodes (VMs)

  • Network latency and traffic

🔹 Kubernetes Cluster Metrics:

  • Pod status (e.g., how often a pod goes into CrashLoopBackOff)

  • Deployment status and ReplicaSet count

🔹 Application Metrics:

  • Number of HTTP requests received

  • User actions like sign-ups and logins

The monitoring system (e.g., Prometheus, Grafana, CloudWatch) collects these metrics and visualizes them on a dashboard for easy readability.

💡 Key Benefits:

  • The monitoring system automatically scrapes metrics and presents them in a graphical format.

  • If CPU utilization spikes or HTTP requests surge unexpectedly, an alert manager (e.g., Slack, PagerDuty) can send notifications to engineers, allowing quick action before major issues occur.

Key Takeaway: Metrics Drive Effective Monitoring

Metrics are a subset of monitoring, and monitoring depends on metrics. Without metrics, monitoring has no data to analyze or display. Just like a doctor relies on a nurse's records, IT teams rely on monitoring tools to ensure system reliability and performance. 🚀

Why Use Prometheus?

Prometheus is one of the top open-source monitoring tools, widely used for collecting and visualizing metrics in a readable graphical format. It enables real-time monitoring and alerting, making it a powerful solution for infrastructure and application observability.

Key Features of Prometheus:

Metric Collection & Storage: Prometheus scrapes metrics from exporters and Pushgateway, storing them in a time-series database with timestamps.

Powerful Querying with PromQL: Using PromQL (Prometheus Query Language), you can easily retrieve and analyze metrics—for example, fetching data from the last 30 minutes.

Alerting with Alertmanager: Set up Alertmanager to send notifications when specific thresholds are breached, ensuring proactive monitoring.

Service Discovery for Dynamic Environments: In large environments like Kubernetes, where hundreds of applications run, Prometheus can automatically discover and scrape metrics from specific targets using Service Discovery.

Built-in HTTP Server & UI: Prometheus provides an intuitive web interface to visualize queries and interact with collected metrics effortlessly.

With its efficient architecture and seamless integration capabilities, Prometheus is an essential tool for monitoring cloud-native applications, microservices, and containerized environments. 🚀
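To make the PromQL and built-in HTTP server features concrete, here is a hedged sketch of the “last 30 minutes” example. It assumes a hypothetical application counter named http_requests_total and a Prometheus server reachable on localhost:9090 (set up later in this guide):

# Total HTTP requests observed in the last 30 minutes
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(increase(http_requests_total[30m]))'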

Prerequisites: ensure you have the AWS CLI, eksctl, and kubectl installed.

Amazon EKS (Elastic Kubernetes Service) simplifies Kubernetes deployment on AWS, allowing scalable, managed clusters. In this guide, we will:

  1. Deploy an EKS cluster using AWS CLI

  2. Create a node group

  3. Update kubeconfig

  4. Install Prometheus for monitoring

  5. Set up Alertmanager

  6. Access Prometheus, Grafana, and Alertmanager

Step 1: Create an Amazon EKS Cluster

Use the following eksctl command to create an EKS cluster without a node group:

eksctl create cluster --name=observability-cluster \
                      --region=us-east-1 \
                      --zones=us-east-1a,us-east-1b \
                      --without-nodegroup

By default, AWS IAM does not trust identities created inside the Kubernetes cluster. Associating the IAM OIDC provider lets Kubernetes service accounts assume IAM roles, so AWS services and the cluster can interact securely:

eksctl utils associate-iam-oidc-provider \
    --region us-east-1 \
    --cluster observability-cluster \
    --approve

Step 2: Create a Node Group

Now, create a managed node group for your cluster:

eksctl create nodegroup --cluster=observability-cluster \
                        --region=us-east-1 \
                        --name=observability-ng-private \
                        --node-type=t3.medium \
                        --nodes-min=2 \
                        --nodes-max=3 \
                        --node-volume-size=20 \
                        --managed \
                        --asg-access \
                        --external-dns-access \
                        --full-ecr-access \
                        --appmesh-access \
                        --alb-ingress-access \
                        --node-private-networking

After the node group is created, update the EKS cluster configuration locally:

aws eks update-kubeconfig --region us-east-1 --name observability-cluster
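To confirm that kubectl now points at the new cluster and the worker nodes joined successfully, a quick sanity check:

kubectl config current-context
kubectl get nodes   # both t3.medium nodes should report Ready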

Step 3: Install Prometheus for Monitoring

Add Prometheus Helm Repository:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Create a Namespace for Monitoring:

kubectl create ns monitoring

Install Prometheus and Alertmanager using Helm:

helm install monitoring prometheus-community/kube-prometheus-stack \
-n monitoring \
-f ./custom_kube_prometheus_stack.yml

Verify the installation:

kubectl get all -n monitoring
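As a rough guide to what a healthy install looks like (exact names depend on the Helm release name, here monitoring), you should see pods for the Prometheus Operator, Prometheus itself, Alertmanager, Grafana, a node-exporter DaemonSet (one pod per node), and kube-state-metrics:

kubectl get pods -n monitoring -o wide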

Step 4: Access Prometheus, Grafana, and Alertmanager

Use port forwarding to access these monitoring tools:

Access Prometheus:

kubectl port-forward service/prometheus-operated -n monitoring 9090:9090

Access Prometheus at http://localhost:9090

Access Grafana:

kubectl port-forward service/monitoring-grafana -n monitoring 8080:80

Access Grafana at http://localhost:8080

Login using:

  • Username: admin

  • Password: prom-operator

Access Alertmanager:

kubectl port-forward service/alertmanager-operated -n monitoring 9093:9093

Access Alertmanager at http://localhost:9093

Understanding How Prometheus Collects Metrics

Prometheus scrapes metrics from various sources such as Node Exporter, Kube-State Metrics, and custom application metrics. These metrics are stored as key-value pairs with timestamps, forming a time-series database.

However, how does an end user utilize these metrics? Prometheus runs an HTTP server that receives user requests and responds with metrics. To query this data, we use PromQL (Prometheus Query Language).
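A quick way to see every target Prometheus is scraping, and whether each one is healthy, is to query the built-in up metric through that HTTP API (this sketch assumes the port-forward on localhost:9090 from the previous section):

curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up'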

Checking Node Exporter Metrics

If you are using EKS, connect to a worker node using Session Manager, then use the following curl command to fetch metrics from Node Exporter (the IP below is an example; substitute your own node-exporter address):

curl 10.100.162.208:9100/metrics

This returns key metrics like CPU utilization, memory usage, and disk utilization.
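If you need an address to curl, the node-exporter pods run as a DaemonSet (one per node); listing them shows the IPs you can query. The namespace below assumes the kube-prometheus-stack install from earlier:

kubectl get pods -n monitoring -o wide | grep node-exporter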

Checking Kube-State Metrics

To check Kubernetes state metrics (e.g., pod restarts, pod status, container status), use:

curl <kube-state-metrics-cluster-ip>:8080/metrics
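To fill in the placeholder, look up the kube-state-metrics service’s ClusterIP; the label selector below is the chart’s standard label, but verify it against your install:

kubectl get svc -n monitoring -l app.kubernetes.io/name=kube-state-metrics -o jsonpath='{.items[0].spec.clusterIP}'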

For example, you can create a pod in a crash-loop state to observe restart metrics:

kubectl run busybox --image=busybox

Since busybox exits immediately when run without a command, the pod goes into CrashLoopBackOff, and you can monitor its restarts using:

kube_pod_container_status_restarts_total{namespace="default", container="busybox"}
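You can cross-check the same event from the Kubernetes side while the counter increases (the grep is just a convenience filter):

kubectl get pod busybox -w   # STATUS should cycle into CrashLoopBackOff
curl -s <kube-state-metrics-cluster-ip>:8080/metrics | grep kube_pod_container_status_restarts_total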

Visualizing Metrics with Grafana

While Prometheus requires manual queries to extract data, Grafana provides default dashboards with enriched visualization. It easily integrates with Prometheus, Nagios, and other monitoring tools to create custom dashboards.

By leveraging PromQL queries and Grafana dashboards, you can gain deep insights into cluster performance, detect anomalies, and set up alerts for proactive monitoring.

Congratulations! 🎉 You have successfully deployed an Amazon EKS cluster, created a node group, set up Prometheus for monitoring, and integrated Grafana for visualization. Now, you can efficiently monitor your Kubernetes cluster and respond to system anomalies in real-time.

What Are Custom Instrumented Metrics?

Observability is a shared responsibility between DevOps engineers and developers. While DevOps engineers set up monitoring tools like Prometheus, Grafana, EFK stack, and Jaeger, they rely on developers to provide metrics, logs, and traces from the application.

By default, Prometheus can only collect system-level metrics using exporters like:

  • Node Exporter → CPU, memory, disk usage, etc.

  • Kube-State-Metrics → Kubernetes cluster health

However, for application-specific metrics (e.g., number of HTTP requests, API response time), developers must instrument their applications using Prometheus client libraries such as prom-client for Node.js.
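As a hedged illustration of what instrumented metrics look like: once a developer wires up a client library, the application’s /metrics endpoint serves plain-text Prometheus exposition data. The host, port, metric, and label names below are hypothetical, not necessarily those used in the demo app:

curl -s http://<your-app-host>:<port>/metrics
# Returns lines in the exposition format, for example:
#   # HELP http_requests_total Total number of HTTP requests received
#   # TYPE http_requests_total counter
#   http_requests_total{method="GET",route="/healthy",status_code="200"} 42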

Common Prometheus Metric Types

Not all metrics can be represented in the same way. Prometheus supports four key metric types:

| Metric Type | Description | Example |
| --- | --- | --- |
| Counter | Increases over time, never decreases | http_requests_total (total HTTP requests received) |
| Gauge | Can increase or decrease | cpu_utilization, memory_usage |
| Histogram | Buckets data into ranges | http_request_duration_seconds (request duration in 5ms, 10ms, etc.) |
| Summary | Similar to Histogram, but provides quantiles | 90th percentile of request duration |

Example:

  • Counter: Total number of logins

  • Gauge: Current CPU utilization

  • Histogram: Response time of HTTP requests in different time ranges

How to Implement Custom Instrumented Metrics

In the GitHub repo (Day-4 folder), we have a Node.js application that uses prom-client to expose metrics:

1. Clone the repository:

git clone https://github.com/poojadevops1/observability-zero-to-hero.git

2. Deploy the application in Kubernetes:

kubectl create ns dev
kubectl apply -k kubernetes-manifest
kubectl apply -k alerts-alertmanager-servicemonitor-manifest

  3. Once the application is running, it exposes a /metrics endpoint.

  4. By default, Prometheus doesn’t know which applications to scrape, so we use a ServiceMonitor:

    • serviceMonitor.yaml tells Prometheus to scrape metrics from the /metrics endpoint (a sketch follows this list).
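A minimal sketch of such a ServiceMonitor is shown below; the repo’s actual serviceMonitor.yaml may differ, and the names, labels, namespace, and port here are illustrative only:

cat <<'EOF' | kubectl apply -n monitoring -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: demo-app                  # hypothetical name
  labels:
    release: monitoring           # must match the kube-prometheus-stack release label
spec:
  namespaceSelector:
    matchNames:
      - dev                       # namespace where the app's Service lives
  selector:
    matchLabels:
      app: demo-app               # must match the labels on the app's Service
  endpoints:
    - port: http                  # named port on the Service
      path: /metrics
      interval: 30s
EOF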

Setting Up Alerts

To receive alerts when CPU usage is high or a pod restarts multiple times:

  1. Update alertmanagerconfig.yml with your email ID.

  2. Generate an App Password from Gmail (for authentication).

  3. Encode the password using base64:

echo "your_generated_password" | base64
  1. Update email-secrets.yml with the base64-encoded password.

  2. Apply the changes

kubectl apply -k .
# or
kubectl apply -k alerts-alertmanager-servicemonitor-manifest

To test the alerts, crash the application:

curl http://your-loadbalancer-dns/crash

This will restart the pod multiple times, triggering an alert.
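For illustration, an alert rule of the kind shipped in the alerts-alertmanager-servicemonitor-manifest folder might look roughly like the sketch below; the repo’s actual rule names, thresholds, and labels may differ:

cat <<'EOF' | kubectl apply -n monitoring -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-restart-alert         # hypothetical name
  labels:
    release: monitoring           # must match the kube-prometheus-stack release label
spec:
  groups:
    - name: app.rules
      rules:
        - alert: PodRestartingTooOften
          # fire if a container in the dev namespace restarted more than twice in 5 minutes
          expr: increase(kube_pod_container_status_restarts_total{namespace="dev"}[5m]) > 2
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting repeatedly"
EOF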

Logging with EFK Stack

The EFK stack (Elasticsearch, FluentBit, Kibana) helps in log collection and visualization:

  • Elasticsearch: Stores logs.

  • FluentBit: Forwards logs to Elasticsearch.

  • Kibana: Provides a UI for log analysis.

Steps to Deploy EFK in Kubernetes:

Create an IAM role for the service account:

eksctl create iamserviceaccount \
    --name ebs-csi-controller-sa \
    --namespace kube-system \
    --cluster observability-cluster \
    --role-name AmazonEKS_EBS_CSI_DriverRole \
    --role-only \
    --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
    --approve

Retrieve the ARN of the IAM role for the service account:

ARN=$(aws iam get-role --role-name AmazonEKS_EBS_CSI_DriverRole --query 'Role.Arn' --output text)

Deploy the EBS CSI driver:

eksctl create addon --cluster observability-cluster  --name aws-ebs-csi-driver --version latest \
    --service-account-role-arn $ARN --force

The EBS CSI driver allows Kubernetes to dynamically provision and manage Amazon EBS volumes, ensuring persistent storage for Elasticsearch.

Create a namespace for the EFK stack:

kubectl create namespace logging

Deploy Elasticsearch:

helm repo add elastic https://helm.elastic.co
helm install elasticsearch \
 --set replicas=1 \
 --set volumeClaimTemplate.storageClassName=gp2 \
 --set persistence.labels.enabled=true elastic/elasticsearch -n logging

This adds the Elastic Helm repository and installs Elasticsearch with a single replica, sets the volume claim’s storage class to gp2, and enables persistence labels so the data is stored on a persistent volume.
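Before moving on, wait for the single Elasticsearch pod to become Ready and confirm that its gp2-backed PersistentVolumeClaim was provisioned by the EBS CSI driver installed earlier:

kubectl get pods -n logging -w
kubectl get pvc -n logging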

Secure Fluent Bit Authentication for Elasticsearch

Fluent Bit forwards logs to Elasticsearch, requiring authentication to ensure authorized access. We retrieve and decode the username and password from Kubernetes secrets securely.

# for username
kubectl get secrets --namespace=logging elasticsearch-master-credentials -ojsonpath='{.data.username}' | base64 -d
# for password
kubectl get secrets --namespace=logging elasticsearch-master-credentials -ojsonpath='{.data.password}' | base64 -d
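Optionally, capture the credentials into shell variables so they are easy to paste into the Fluent Bit and Jaeger values files used below:

ES_USER=$(kubectl get secrets --namespace=logging elasticsearch-master-credentials -o jsonpath='{.data.username}' | base64 -d)
ES_PASS=$(kubectl get secrets --namespace=logging elasticsearch-master-credentials -o jsonpath='{.data.password}' | base64 -d)
echo "$ES_USER / $ES_PASS"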

Deploy Kibana (UI for logs):

helm install kibana --set service.type=LoadBalancer elastic/kibana -n logging

Install Fluent Bit with Custom Values/Configurations

  • 👉 Note: update the HTTP_Passwd field in the fluentbit-values.yaml file with the password retrieved earlier in the Secure Fluent Bit Authentication for Elasticsearch step (e.g., NJyO47UqeYBsoaEU).

helm repo add fluent https://fluent.github.io/helm-charts
helm install fluent-bit fluent/fluent-bit -f fluentbit-values.yaml -n logging

Important Configuration for Fluent Bit

  • Ensure TLS is enabled in fluentbit-values.yaml to allow secure communication with Elasticsearch.

  • If logs are not visible in Kibana, ensure that an application is running in the cluster. Logs will only appear when applications generate them.
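If you need something generating logs for a quick test, a throwaway pod like this (hypothetical name, deployed into the dev namespace created earlier) writes a line every few seconds, which should then appear in Kibana:

kubectl run log-demo -n dev --image=busybox --command -- sh -c 'while true; do echo "hello from log-demo"; sleep 5; done'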

Access Kibana

  1. Use the LoadBalancer DNS to open Kibana.

  2. Log in with username elastic and the base64-decoded password from Elasticsearch.

  3. Navigate to Data Integration → Discovery → Data View to see logs with timestamps.

Fluent Bit Configuration Breakdown

Fluent Bit has four key sections:

  • Service: Defines how Fluent Bit is exposed. It can be NodePort, ClusterIP, or LoadBalancer based on requirements. In this setup, we use ClusterIP.

  • Input: Collects logs from all container logs.

  • Filters: Processes logs, e.g., using a Lua script to ignore logs from the logging namespace.

  • Output: Forwards logs to Elasticsearch with authentication.

Reading the Fluent Bit configuration carefully is crucial for correct log forwarding and filtering.

Jaeger for Distributed Tracing

Understanding Tracing with an Example

Imagine you're traveling from Hyderabad to Saket, Delhi, for a meeting. You prepare an itinerary:

  • Cab to Hyderabad airport → 30 min

  • Flight to Delhi → 20 min

  • Cab to Saket, Delhi → 1 hr 30 min

Total expected travel time: 2 hr 20 min. However, you arrive 20 minutes late. Upon discussing with a friend, you realize the cab driver took a wrong route, adding 20 minutes: the final cab leg that should have taken 1 hr 30 min took 1 hr 50 min.

Similar to this scenario, Jaeger traces service requests across multiple hops, identifying delays and optimizing performance.

Jaeger Architecture

Jaeger consists of four key components:

  1. Jaeger Client: Instruments applications to capture trace data.

  2. Collector: Receives and processes trace data.

  3. Storage: Stores traces (e.g., Elasticsearch, Cassandra, etc.).

  4. UI: Visualizes spans and latency across services.

Setting Up Jaeger in Kubernetes

Retrieve CA Certificate (for secure communication)

kubectl get secret elasticsearch-master-certs -n logging -o jsonpath='{.data.ca\.crt}' | base64 --decode > ca-cert.pem

Create a new Kubernetes namespace called tracing, where the Jaeger components will be installed:

kubectl create ns tracing

Creates a ConfigMap in the tracing namespace, containing the CA certificate to be used by Jaeger for TLS.

kubectl create configmap jaeger-tls --from-file=ca-cert.pem -n tracing

Creates a Kubernetes Secret in the tracing namespace, containing the CA certificate for Elasticsearch TLS communication.

kubectl create secret generic es-tls-secret --from-file=ca-cert.pem -n tracing

Adds the official Jaeger Helm chart repository to your Helm setup, making it available for installation:

helm repo add jaegertracing https://jaegertracing.github.io/helm-charts

helm repo update

Update the password and other related fields in the jaeger-values.yaml file with the Elasticsearch password retrieved earlier, then install Jaeger:

helm install jaeger jaegertracing/jaeger -n tracing --values jaeger-values.yaml

This command forwards port 8080 on your local machine to the Jaeger Query service, allowing you to access the Jaeger UI locally:

kubectl port-forward svc/jaeger-query 8080:80 -n tracing

Conclusion

  • Metrics (Prometheus & Grafana) → Monitor performance

  • Logs (EFK stack) → Debug errors

  • Traces (Jaeger) → Analyze service latency and dependencies

Jaeger enables developers, DevOps, and SRE teams to diagnose performance bottlenecks by tracking request flows across microservices. 🚀


Written by

Pooja Manellore

I have completed my B.Sc. in Computer Science in 2024 and have gained skills in Data Analytics, HTML, and CSS. I am currently advancing my expertise by learning DevOps, aiming to secure a role as a DevOps Engineer. I am eager to join a company immediately where I can apply my skills and continue growing in this field.