Observability Like a Pro: Using Prometheus, Grafana, Jaeger & EFK to Monitor Kubernetes


Observability: Understanding and Fixing System Issues
Observability is a critical aspect of modern infrastructure and application monitoring. It helps you track the health of your internal systems, including application status, infrastructure resource usage, and network performance. But observability goes beyond monitoring: it tells you not only what happened, but also why it happened and how to fix it.
For example, with observability, you can track:
✅ Disk utilization over 24 hours
✅ CPU and memory usage
✅ Number of successful vs. failed requests
The Three Pillars of Observability
1️⃣ Metrics – Quantifies the state of the system over time.
Example: "In the last 30 minutes, 10 HTTP requests failed."
Helps track trends like CPU spikes, memory leaks, or traffic surges.
2️⃣ Logs – Provides detailed insights into what happened and why.
- Example: By checking logs at 10:00 AM, you can identify which user made the request, which part of the application was accessed, and why it failed.
3️⃣ Traces – Maps the exact request path to find performance bottlenecks.
- Example: A trace follows the request journey from the client → load balancer → front-end → back-end → database, identifying delays and failures along the way.
Why Observability Matters
Observability is not just about detecting failures; it’s about proactively preventing them. By leveraging metrics, logs, and traces, teams can quickly diagnose issues and implement fixes before they impact users.
🚀 With observability, you don’t just see the problem—you understand and solve it!
Monitoring vs. Observability: What’s the Difference?
There’s often confusion between monitoring and observability, but they are not the same. Understanding their differences is crucial for maintaining reliable and high-performing applications.
Monitoring: Watching the System 👀
Monitoring focuses on collecting predefined metrics, setting up alerts, and visualizing system health using dashboards. It helps answer "What is happening?"
✅ Example:
Tracking CPU utilization over the last 5 hours
Setting alerts when CPU usage exceeds a threshold
Creating dashboards for better visibility (e.g., using Prometheus + Grafana)
Observability: Understanding the System 🔍
Observability goes beyond monitoring by integrating three key pillars:
1️⃣ Metrics – Measure system health (e.g., request failure rate)
2️⃣ Logs – Explain why an issue occurred (e.g., error messages)
3️⃣ Traces – Show how requests travel across services (e.g., latency issues)
💡 Observability = Monitoring + Troubleshooting + Root Cause Analysis
Why Observability Matters for Your Business 🚀
Imagine you’ve launched an e-commerce website on an EKS cluster. A potential customer is comparing multiple platforms. To convince them, you highlight your Service Level Agreement (SLA):
✅ 99.9% availability → only 0.1% allowed downtime
✅ 10,000 requests at a time → 99.95% served within 30ms
✅ Only 5 out of 10,000 requests may fail → proactively detected and resolved
🔴 What happens when a request fails?
If you have observability, you can quickly diagnose the issue:
🔹 Check metrics to see system performance
🔹 Analyze logs to find the root cause
🔹 Use traces to track the exact request path
Without observability, your customers move to competitors, leading to revenue loss. That’s why companies must invest in observability—issues get detected early and resolved faster.
Developers & DevOps: A Collaborative Effort 🤝
Developers instrument the system with metrics, logs, and traces
DevOps engineers set up monitoring tools like Prometheus, Grafana, and Jaeger to track system health
🚀 Observability is the key to a reliable and scalable platform!
Understanding the Difference Between Metrics and Monitoring with Real-Life and IT Examples
Imagine a hospital scenario where a patient is undergoing treatment. The doctor instructs the nurse to record the patient's heartbeat and blood pressure at regular intervals to assess their health status. The nurse notes down:
10:00 AM – Heartbeat: 87 bpm
10:10 AM – Heartbeat: 90 bpm
Similarly, the nurse records blood pressure and other vital signs over time. When the doctor arrives, they review these notes to understand the patient’s health trends.
💡 Key Insight:
The recorded heartbeat and BP values are metrics – raw historical data that provides insights.
If the nurse doesn’t record these metrics, the doctor won’t have visibility into the patient’s health.
A monitoring system could automate this by collecting and displaying these metrics on a dashboard for better visualization.
Additionally, an alert system can be set up to notify the doctor and nurse immediately if the patient’s heartbeat drops dangerously low, enabling timely intervention before a critical situation arises.
IT Example: Metrics & Monitoring in AWS & Kubernetes
Now, let's apply the same concept to an IT environment. Suppose you deploy an application on AWS using EKS (Elastic Kubernetes Service) within an isolated VPC (Virtual Private Cloud). Misconfigurations can lead to performance issues, and to understand system health, you need metrics such as:
🔹 Infrastructure Metrics:
CPU and memory utilization of nodes (VMs)
Network latency and traffic
🔹 Kubernetes Cluster Metrics:
Pod status (e.g., how often a pod goes into CrashLoopBackOff)
Deployment status and ReplicaSet count
🔹 Application Metrics:
Number of HTTP requests received
User actions like sign-ups and logins
The monitoring system (e.g., Prometheus, Grafana, CloudWatch) collects these metrics and visualizes them on a dashboard for easy readability.
💡 Key Benefits:
The monitoring system automatically scrapes metrics and presents them in a graphical format.
If CPU utilization spikes or HTTP requests surge unexpectedly, an alert manager (e.g., Slack, PagerDuty) can send notifications to engineers, allowing quick action before major issues occur.
Key Takeaway: Metrics Drive Effective Monitoring
Metrics are a subset of monitoring, and monitoring depends on metrics. Without metrics, monitoring has no data to analyze or display. Just like a doctor relies on a nurse's records, IT teams rely on monitoring tools to ensure system reliability and performance. 🚀
Why Use Prometheus?
Prometheus is one of the top open-source monitoring tools, widely used for collecting and visualizing metrics in a readable graphical format. It enables real-time monitoring and alerting, making it a powerful solution for infrastructure and application observability.
Key Features of Prometheus:
✅ Metric Collection & Storage: Prometheus scrapes metrics from exporters and Pushgateway, storing them in a time-series database with timestamps.
✅ Powerful Querying with PromQL: Using PromQL (Prometheus Query Language), you can easily retrieve and analyze metrics—for example, fetching data from the last 30 minutes.
✅ Alerting with Alertmanager: Set up Alertmanager to send notifications when specific thresholds are breached, ensuring proactive monitoring.
✅ Service Discovery for Dynamic Environments: In large environments like Kubernetes, where hundreds of applications run, Prometheus can automatically discover and scrape metrics from specific targets using Service Discovery.
✅ Built-in HTTP Server & UI: Prometheus provides an intuitive web interface to visualize queries and interact with collected metrics effortlessly.
With its efficient architecture and seamless integration capabilities, Prometheus is an essential tool for monitoring cloud-native applications, microservices, and containerized environments. 🚀
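To make the scraping model concrete, here is a minimal, hypothetical prometheus.yml snippet (the job names and targets are illustrative, not part of this guide's setup) showing how Prometheus is told which endpoints to scrape:
global:
  scrape_interval: 30s        # how often Prometheus scrapes its targets
scrape_configs:
  - job_name: node-exporter   # hypothetical job scraping Node Exporter
    static_configs:
      - targets: ["10.100.162.208:9100"]   # example target, same IP used later in this guide
  - job_name: kubernetes-pods # dynamic discovery in a Kubernetes cluster
    kubernetes_sd_configs:
      - role: pod
In the kube-prometheus-stack deployment used later, this configuration is generated for you by the operator from ServiceMonitor objects, so you rarely edit prometheus.yml by hand.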
Ensure you have AWS CLI, eksctl, and kubectl installed
Amazon EKS (Elastic Kubernetes Service) simplifies Kubernetes deployment on AWS, allowing scalable, managed clusters. In this guide, we will:
Deploy an EKS cluster using AWS CLI
Create a node group
Update kubeconfig
Install Prometheus for monitoring
Set up Alertmanager
Access Prometheus, Grafana, and Alertmanager
Step 1: Create an Amazon EKS Cluster
Use the following eksctl command to create an EKS cluster without a node group:
eksctl create cluster --name=observability-cluster \
--region=us-east-1 \
--zones=us-east-1a,us-east-1b \
--without-nodegroup
By default, AWS does not trust Kubernetes-created resources. To allow AWS services to interact with the cluster, we must associate the IAM OIDC provider:
eksctl utils associate-iam-oidc-provider \
--region us-east-1 \
--cluster observability-cluster \
--approve
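As an optional sanity check (not part of the original steps), you can confirm the association by printing the cluster's OIDC issuer URL:
aws eks describe-cluster --name observability-cluster --region us-east-1 \
--query "cluster.identity.oidc.issuer" --output text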
Step 2: Create a Node Group
Now, create a managed node group for your cluster:
eksctl create nodegroup --cluster=observability-cluster \
--region=us-east-1 \
--name=observability-ng-private \
--node-type=t3.medium \
--nodes-min=2 \
--nodes-max=3 \
--node-volume-size=20 \
--managed \
--asg-access \
--external-dns-access \
--full-ecr-access \
--appmesh-access \
--alb-ingress-access \
--node-private-networking
After the node group is created, update the EKS cluster configuration locally:
aws eks update-kubeconfig --region us-east-1 --name observability-cluster
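As a quick check (not in the original steps), confirm that kubectl can reach the new cluster and that the worker nodes are Ready:
kubectl get nodes -o wide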
Step 3: Install Prometheus for Monitoring
Add Prometheus Helm Repository:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Create a Namespace for Monitoring:
kubectl create ns monitoring
Install Prometheus and Alertmanager using Helm:
helm install monitoring prometheus-community/kube-prometheus-stack \
-n monitoring \
-f ./custom_kube_prometheus_stack.yml
Verify the installation:
kubectl get all -n monitoring
Step 4: Access Prometheus, Grafana, and Alertmanager
Use port forwarding to access these monitoring tools:
Access Prometheus:
kubectl port-forward service/prometheus-operated -n monitoring 9090:9090
Access Prometheus at http://localhost:9090
Access Grafana:
kubectl port-forward service/monitoring-grafana -n monitoring 8080:80
Access Grafana at http://localhost:8080
Login using:
Username: admin
Password: prom-operator
Access Alertmanager:
kubectl port-forward service/alertmanager-operated -n monitoring 9093:9093
Access Alertmanager at http://localhost:9093
Understanding How Prometheus Collects Metrics
Prometheus scrapes metrics from various sources such as Node Exporter, Kube-State Metrics, and custom application metrics. These metrics are stored as key-value pairs with timestamps, forming a time-series database.
However, how does an end user utilize these metrics? Prometheus runs an HTTP server that receives user requests and responds with metrics. To query this data, we use PromQL (Prometheus Query Language).
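A few illustrative PromQL queries you could run in the Prometheus UI (the metric names assume node-exporter and an HTTP-instrumented application; adjust them to whatever your targets actually expose):
# Total HTTP requests received in the last 30 minutes (Counter + increase)
increase(http_requests_total[30m])
# Per-second request rate averaged over 5-minute windows
rate(http_requests_total[5m])
# Average CPU usage per node, derived from node-exporter counters
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)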
Checking Node Exporter Metrics
If you are using EKS, connect to the instance using Session Manager, then use the following curl command to fetch metrics from Node Exporter (the IP below is an example; substitute your node-exporter endpoint's address):
curl 10.100.162.208:9100/metrics
This returns key metrics like CPU utilization, memory usage, and disk utilization.
Checking Kube-State Metrics
To check Kubernetes state metrics (e.g., pod restarts, pod status, container status), use:
curl <kube-state-metrics-cluster-ip>:8080/metrics
For example, you can create a pod in a crash-loop state to observe restart metrics:
kubectl run busybox --image=busybox
Since the container will enter CrashLoopBackOff, you can monitor restarts using:
kube_pod_container_status_restarts_total{namespace="default", container="busybox"}
Visualizing Metrics with Grafana
While Prometheus requires manual queries to extract data, Grafana provides default dashboards with enriched visualization. It easily integrates with Prometheus, Nagios, and other monitoring tools to create custom dashboards.
By leveraging PromQL queries and Grafana dashboards, you can gain deep insights into cluster performance, detect anomalies, and set up alerts for proactive monitoring.
Congratulations! 🎉 You have successfully deployed an Amazon EKS cluster, created a node group, set up Prometheus for monitoring, and integrated Grafana for visualization. Now, you can efficiently monitor your Kubernetes cluster and respond to system anomalies in real-time.
What Are Custom Instrumented Metrics?
Observability is a shared responsibility between DevOps engineers and developers. While DevOps engineers set up monitoring tools like Prometheus, Grafana, EFK stack, and Jaeger, they rely on developers to provide metrics, logs, and traces from the application.
By default, Prometheus can only collect system-level metrics using exporters like:
Node Exporter → CPU, memory, disk usage, etc.
Kube-State-Metrics → Kubernetes cluster health
However, for application-specific metrics (e.g., number of HTTP requests, API response time), developers must instrument their applications using Prometheus client libraries such as prom-client for Node.js.
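As a minimal sketch of what such instrumentation looks like (an illustrative Express app using prom-client, not the exact code from this guide's repository):
const express = require('express');
const client = require('prom-client');

const app = express();
client.collectDefaultMetrics(); // built-in process metrics (CPU, memory, event loop)

// Custom counter: total HTTP requests, labelled by route and status code
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests received',
  labelNames: ['route', 'status'],
});

app.get('/', (req, res) => {
  httpRequestsTotal.inc({ route: '/', status: 200 }); // count each request
  res.send('hello');
});

// The endpoint Prometheus scrapes
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(8080);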
Common Prometheus Metric Types
Not all metrics can be represented in the same way. Prometheus supports four key metric types:
| Metric Type | Description | Example |
| --- | --- | --- |
| Counter | Increases over time, never decreases | http_requests_total (total HTTP requests received) |
| Gauge | Can increase or decrease | cpu_utilization, memory_usage |
| Histogram | Buckets observations into ranges | http_request_duration_seconds (request durations in 5ms, 10ms, etc. buckets) |
| Summary | Similar to Histogram, but provides precomputed quantiles | 90th percentile of request duration |
Example:
Counter: Total number of logins
Gauge: Current CPU utilization
Histogram: Response time of HTTP requests in different time ranges
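To see how Histogram buckets are actually used, a common PromQL pattern (assuming the application exposes an http_request_duration_seconds histogram, as in the table above) estimates a latency percentile from the bucket counters:
# Approximate 90th-percentile request duration over the last 5 minutes
histogram_quantile(0.90, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))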
How to Implement Custom Instrumented Metrics
In the GitHub repo (Day-4 folder), we have a Node.js application that uses prom-client to expose metrics:
1. Clone the repository:
git clone https://github.com/poojadevops1/observability-zero-to-hero.git
2. Deploy the application in Kubernetes:
kubectl create ns dev
kubectl apply -k kubernetes-manifest
kubectl apply -k alerts-alertmanager-servicemonitor-manifest
Once the application is running, it exposes its metrics on a /metrics endpoint. By default, Prometheus doesn't know which applications to scrape, so we use a ServiceMonitor: serviceMonitor.yaml tells Prometheus to scrape metrics from the application's /metrics endpoint.
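A minimal ServiceMonitor sketch (the names, labels, and port below are illustrative; match them to the Service defined in the repo's kubernetes-manifest folder):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: demo-app                # hypothetical name
  namespace: monitoring
  labels:
    release: monitoring         # must match the kube-prometheus-stack release label
spec:
  namespaceSelector:
    matchNames:
      - dev                     # namespace where the app runs
  selector:
    matchLabels:
      app: demo-app             # must match the labels on the app's Service
  endpoints:
    - port: http                # named port on the Service
      path: /metrics
      interval: 30s
Note that, by default, kube-prometheus-stack only selects ServiceMonitors that carry its release label (here release: monitoring), unless that behaviour is overridden in the Helm values.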
Setting Up Alerts
To receive alerts when CPU usage is high or a pod restarts multiple times:
1. Update alertmanagerconfig.yml with your email ID.
2. Generate an App Password from Gmail (for authentication).
3. Encode the password using base64:
echo "your_generated_password" | base64
4. Update email-secrets.yml with the base64-encoded password.
5. Apply the changes:
kubectl apply -k .
# or
kubectl apply -k alerts-alertmanager-servicemonitor-manifest
To test the alerts, crash the application:
curl http://your-loadbalancer-dns/crash
This will restart the pod multiple times, triggering an alert.
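For reference, a hedged sketch of a PrometheusRule that fires on repeated restarts (the alert name and threshold are illustrative; the repo's alerts manifest defines the actual rules):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-restart-alert        # hypothetical name
  namespace: monitoring
  labels:
    release: monitoring          # picked up by the kube-prometheus-stack operator by default
spec:
  groups:
    - name: pod-restarts
      rules:
        - alert: PodRestartingTooOften
          expr: increase(kube_pod_container_status_restarts_total{namespace="dev"}[5m]) > 2
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} restarted more than 2 times in 5 minutes"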
Logging with EFK Stack
The EFK stack (Elasticsearch, FluentBit, Kibana) helps in log collection and visualization:
Elasticsearch: Stores logs.
FluentBit: Forwards logs to Elasticsearch.
Kibana: Provides a UI for log analysis.
Steps to Deploy EFK in Kubernetes:
Create an IAM role for the EBS CSI driver's service account:
eksctl create iamserviceaccount \
--name ebs-csi-controller-sa \
--namespace kube-system \
--cluster observability-cluster \
--role-name AmazonEKS_EBS_CSI_DriverRole \
--role-only \
--attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
--approve
Retrieve the ARN of the IAM role:
ARN=$(aws iam get-role --role-name AmazonEKS_EBS_CSI_DriverRole --query 'Role.Arn' --output text)
Deploy the EBS CSI driver:
eksctl create addon --cluster observability-cluster --name aws-ebs-csi-driver --version latest \
--service-account-role-arn $ARN --force
The EBS CSI driver allows Kubernetes to dynamically provision and manage Amazon EBS volumes, ensuring persistent storage for Elasticsearch.
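To confirm the add-on is active (an optional check, not in the original steps), list the add-on and the EBS CSI pods:
eksctl get addon --cluster observability-cluster --region us-east-1
kubectl get pods -n kube-system | grep ebs-csi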
Create a namespace for the EFK stack:
kubectl create namespace logging
Deploy Elasticsearch:
helm repo add elastic https://helm.elastic.co
helm install elasticsearch \
--set replicas=1 \
--set volumeClaimTemplate.storageClassName=gp2 \
--set persistence.labels.enabled=true elastic/elasticsearch -n logging
Here we add the Elastic Helm repository and install Elasticsearch with a single replica, setting the volume's storage class to gp2 and enabling persistence labels so that data is stored on persistent storage.
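Before moving on, it is worth checking (an optional step) that the Elasticsearch pod is running and its PersistentVolumeClaim is Bound:
kubectl get pods -n logging
kubectl get pvc -n logging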
Secure Fluent Bit Authentication for Elasticsearch
Fluent Bit forwards logs to Elasticsearch, requiring authentication to ensure authorized access. We retrieve and decode the username and password from Kubernetes secrets securely.
# for username
kubectl get secrets --namespace=logging elasticsearch-master-credentials -ojsonpath='{.data.username}' | base64 -d
# for password
kubectl get secrets --namespace=logging elasticsearch-master-credentials -ojsonpath='{.data.password}' | base64 -d
Deploy Kibana (UI for logs):
helm install kibana --set service.type=LoadBalancer elastic/kibana -n logging
Install Fluentbit with Custom Values/Configurations
- 👉 Note: Please update the HTTP_Passwd field in the fluentbit-values.yaml file with the password retrieved earlier in the Secure Fluent Bit Authentication for Elasticsearch step (e.g., NJyO47UqeYBsoaEU).
helm repo add fluent https://fluent.github.io/helm-charts
helm install fluent-bit fluent/fluent-bit -f fluentbit-values.yaml -n logging
Important Configuration for Fluent Bit
Ensure TLS is enabled in fluentbit-values.yaml to allow secure communication with Elasticsearch.
If logs are not visible in Kibana, ensure that an application is running in the cluster. Logs will only appear when applications generate them.
Access Kibana
Use the LoadBalancer DNS to open Kibana.
Log in with the username elastic and the base64-decoded password retrieved from the Elasticsearch secret.
Navigate to Data Integration → Discovery → Data View to see logs with timestamps.
Fluent Bit Configuration Breakdown
Fluent Bit has four key sections:
Service: Defines how Fluent Bit is exposed. It can be NodePort, ClusterIP, or LoadBalancer based on requirements. In this setup, we use ClusterIP.
Input: Collects logs from all container log files.
Filters: Processes logs, e.g., using a Lua script to ignore logs from the logging namespace.
Output: Forwards logs to Elasticsearch with authentication.
Reading the Fluent Bit configuration carefully is crucial for correct log forwarding and filtering.
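As an illustration of the Output section (a sketch only; the exact keys live in the chart's fluentbit-values.yaml and may differ from this), a classic Fluent Bit [OUTPUT] block forwarding to Elasticsearch with TLS and authentication looks roughly like this:
[OUTPUT]
    # Forward container logs to Elasticsearch over TLS with basic auth
    Name            es
    Match           kube.*
    Host            elasticsearch-master
    Port            9200
    HTTP_User       elastic
    HTTP_Passwd     <password-from-secret>
    tls             On
    tls.verify      Off
    Suppress_Type_Name On
Here elasticsearch-master is the Service name created by the Elasticsearch chart, and tls.verify can be switched on once the CA certificate is mounted into the Fluent Bit pods.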
Jaeger for Distributed Tracing
Understanding Tracing with an Example
Imagine you're traveling from Hyderabad to Saket, Delhi, for a meeting. You prepare an itinerary:
Cab to Hyderabad airport → 30 min
Flight to Delhi → 20 min
Cab to Saket, Delhi → 1 hr 30 min
Total expected travel time: 2 hr 20 min. However, you arrive 20 minutes late. Upon discussing with a friend, you realize the cab driver took a wrong route, adding 20 minutes; the correct route would have kept the trip at the expected 2 hr 20 min.
Similar to this scenario, Jaeger traces service requests across multiple hops, identifying delays and optimizing performance.
Jaeger Architecture
Jaeger consists of four key components:
Jaeger Client: Instruments applications to capture trace data.
Collector: Receives and processes trace data.
Storage: Stores traces (e.g., Elasticsearch, Cassandra, etc.).
UI: Visualizes spans and latency across services.
Setting Up Jaeger in Kubernetes
Retrieve CA Certificate (for secure communication)
kubectl get secret elasticsearch-master-certs -n logging -o jsonpath='{.data.ca\.crt}' | base64 --decode > ca-cert.pem
Creates a new Kubernetes namespace called tracing, where the Jaeger components will be installed.
kubectl create ns tracing
Creates a ConfigMap in the tracing namespace, containing the CA certificate to be used by Jaeger for TLS.
kubectl create configmap jaeger-tls --from-file=ca-cert.pem -n tracing
Creates a Kubernetes Secret in the tracing namespace, containing the CA certificate for Elasticsearch TLS communication.
kubectl create secret generic es-tls-secret --from-file=ca-cert.pem -n tracing
Adds the official Jaeger Helm chart repository to your Helm setup, making it available for installations.
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update
Please update the password field (and any other related fields) in the jaeger-values.yaml file with the Elasticsearch password retrieved earlier, then install Jaeger:
helm install jaeger jaegertracing/jaeger -n tracing --values jaeger-values.yaml
This command forwards port 8080 on your local machine to the Jaeger Query service, allowing you to access the Jaeger UI locally at http://localhost:8080.
kubectl port-forward svc/jaeger-query 8080:80 -n tracing
Conclusion
Metrics (Prometheus & Grafana) → Monitor performance
Logs (EFK stack) → Debug errors
Traces (Jaeger) → Analyze service latency and dependencies
Jaeger enables developers, DevOps, and SRE teams to diagnose performance bottlenecks by tracking request flows across microservices. 🚀