Production-Ready Observability with Prometheus, Loki & Grafana

In this project, you’ll learn how to build a complete observability stack on Kubernetes using industry-standard tools like Prometheus, Loki, and Grafana, integrated with a lightweight but powerful demo app: Podinfo.

This setup mirrors real-world production environments and gives you hands-on experience with:

  • 📊 Real-time metrics and performance dashboards

  • 📜 Centralized structured logging

  • 🔁 Cross-layer insights with metrics and logs combined

Whether you’re an SRE, DevOps engineer, or platform developer — this project will help you understand how to monitor, debug, and operate modern cloud-native workloads effectively.

GitHub Repo → https://github.com/neamulkabiremon/k8s-monitoring-project.git


🧰 Tools Used

  • Helm – For deploying Kubernetes apps

  • Prometheus – For collecting metrics

  • Loki + Promtail – For centralised logging

  • Grafana – For dashboards and visualisation

  • Podinfo – Sample app with built-in observability


Step 1: 🏁 Getting Started

You’ll provision a Kubernetes cluster (GKE), deploy the observability stack using Helm, install a sample app, and finally create production-grade dashboards in Grafana for both metrics and logs. This setup is perfect for learning, demos, or as a foundation for real-world monitoring solutions. Let’s get started!

1️⃣ Clone the Repository

git clone https://github.com/neamulkabiremon/k8s-monitoring-project.git
cd k8s-monitoring-project

This repository contains all the infrastructure code, Kubernetes manifests, Argo CD apps, and observability tools needed for a production-grade setup.

2️⃣ Provision the GKE Cluster with Terraform

cd terraform/gke-cluster

Initialise the Terraform working directory:

terraform init

Apply the configuration to provision the Kubernetes cluster:

terraform apply -auto-approve

Note: This process takes approximately 10–20 minutes, depending on your cloud provider, region and network conditions.

3️⃣ Authenticate with the GKE Cluster

Once Terraform completes successfully, authenticate your local kubectl context with the newly created GKE Kubernetes cluster:

gcloud container clusters get-credentials gcp-devops-project --zone us-central1-a --project serious-physics-452107-d1

✅ Replace the cluster name, zone, and project ID in the command above with the values from your own Terraform configuration.
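
In generic form:

gcloud container clusters get-credentials <your-cluster-name> --zone <your-zone> --project <your-project-id>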

4️⃣ Verify Cluster Access

Check if your Kubernetes context is properly configured and the cluster is accessible:

kubectl get nodes

You should see a list of nodes with a STATUS of Ready. This confirms that your kubectl is connected and the cluster is up and running.

Step 2: Install the Observability Stack

✅ Step 1: One-Click Observability Installation

To make the installation process seamless, I’ve created a bash script that automates everything:

📁 grafana-prom-loki.sh

#!/bin/bash

# Create namespace
kubectl create namespace monitoring || true

# Add helm repos
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Loki stack (loki + promtail)
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set loki.enabled=true,promtail.enabled=true

# Install kube-prometheus-stack with Grafana, Prometheus, Alertmanager
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f ./custom_kube_prometheus_stack.yml

🛠️ Step 2: Custom Configuration with custom_kube_prometheus_stack.yml

This YAML file ensures that Grafana, Prometheus, and Alertmanager are set up correctly with high availability and the required data sources.

Key Sections in the YAML:

custom_kube_prometheus_stack.yml

alertmanager:
  alertmanagerSpec:
    alertmanagerConfigSelector:
      matchLabels:
        release: monitoring
    replicas: 2
    alertmanagerConfigMatcherStrategy:
      type: None

  • replicas: 2 – ensures high availability of the alerting service

  • matchLabels – links Alertmanager with its CRDs

  • None strategy – avoids complex matching pitfalls

📡 Grafana: Auto-Add Prometheus & Loki as Data Sources

grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      url: http://loki:3100
      access: proxy
      jsonData:
        maxLines: 1000

  • 🔗 Loki – added explicitly here so logs can be explored in real time from Grafana

  • 📈 Prometheus – provisioned automatically by the kube-prometheus-stack chart as Grafana's default data source (isDefault: true), so application metrics are scraped and visualized without any extra entry here
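
If you ever want to point Grafana at a different Prometheus, or prefer to declare the data source explicitly instead of relying on the chart's automatic provisioning, the entry would look roughly like this (a sketch, not taken from the repo's values file):

grafana:
  additionalDataSources:
    - name: Prometheus
      type: prometheus
      url: http://monitoring-kube-prometheus-prometheus:9090
      access: proxy
      isDefault: true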

✅ Run this script with:

chmod +x grafana-prom-loki.sh
./grafana-prom-loki.sh

✅ Step 3: Verify Installation

Once the script completes, run:

kubectl get svc -n monitoring

You should see services like:

  • monitoring-grafana

  • monitoring-kube-prometheus-prometheus

  • loki
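
You can also confirm that the individual pods came up:

kubectl get pods -n monitoring

The Prometheus, Alertmanager, Grafana, Loki, and Promtail pods should all reach the Running state within a couple of minutes.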

🌐 Expose Grafana

Patch the Grafana service to make it accessible externally:

kubectl patch svc monitoring-grafana -n monitoring -p '{"spec": {"type": "LoadBalancer"}}'
# switches the service type from ClusterIP to LoadBalancer

Check the new external IP:

kubectl get svc monitoring-grafana -n monitoring

Once the load balancer is provisioned, the service will show an EXTERNAL-IP.

Copy the EXTERNAL-IP, open it in your browser, and log in to Grafana with:

  • Username: admin

  • Password: prom-operator
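
If you'd rather not expose Grafana through a LoadBalancer (for example on a local or restricted cluster), port-forwarding works just as well:

kubectl port-forward svc/monitoring-grafana 3000:80 -n monitoring
# Grafana is then reachable at http://localhost:3000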

✅ Step 4: Add the Prometheus Data Source in Grafana

In Grafana:

  • Navigate to Settings → Data Sources → Prometheus

  • Add this internal Prometheus endpoint:

http://monitoring-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090

Save and test the connection. ✅ You’re now collecting metrics!

🎯 Note: Loki (for logs) is already auto-configured via the install script.
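
If you want to double-check which data sources Grafana has registered, you can also ask its HTTP API directly (replace the host with your EXTERNAL-IP, or localhost:3000 if you're port-forwarding):

curl -s -u admin:prom-operator http://<EXTERNAL-IP>/api/datasources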


Step 3: Deploy the Demo Application (Podinfo)

In this section, we’ll deploy the Podinfo application in our Kubernetes cluster as a demo microservice. It’s lightweight, production-ready, and comes pre-instrumented with Prometheus, OpenTelemetry, and structured logging—perfect for observability testing with Grafana, Loki, and Prometheus.

Step 1: Add the Helm Repository

First, add the official Helm chart repository for Podinfo:

helm repo add podinfo https://stefanprodan.github.io/podinfo
helm repo update

Step 2: Create Namespace (Optional but Recommended)

To keep resources isolated and organized, we’ll deploy Podinfo in its own namespace:

kubectl create namespace podinfo

Step 3: Deploy Podinfo with Helm

Use the official Helm chart to deploy Podinfo into the podinfo namespace:

helm upgrade --install podinfo podinfo/podinfo \
  --namespace podinfo \
  --create-namespace \
  --set replicaCount=2 \
  --set service.port=9898

This setup will:

  • Deploy 2 replicas for basic high availability

  • Expose the HTTP service on port 9898

  • Rely on Prometheus ServiceMonitor for metrics discovery (no need for annotations)
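
Before wiring up Prometheus in the next step, it's worth checking the labels and port name the chart assigned to the service, since the ServiceMonitor will select on exactly those values:

kubectl get svc podinfo -n podinfo --show-labels
kubectl get svc podinfo -n podinfo -o jsonpath='{.spec.ports[*].name}'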

✅ Step 4: Verify Podinfo Deployment

Check if the Podinfo pods are running successfully:

kubectl get pods -n podinfo

You should see both replicas in a Running state.

🌐 Step 5: Access the Podinfo App (via Port Forwarding)

You can access the app locally by forwarding the service port:

kubectl port-forward svc/podinfo 8080:9898 -n podinfo

This maps: localhost:8080 → podinfo service port 9898

Now open your browser and navigate to:

http://127.0.0.1:8080

You should see the Podinfo web UI running. ✅
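
Besides the web UI, you can exercise a few of Podinfo's standard HTTP endpoints directly while the port-forward is running:

curl http://localhost:8080/healthz              # liveness endpoint
curl http://localhost:8080/readyz               # readiness endpoint
curl -s http://localhost:8080/metrics | head    # Prometheus metrics exposed by the app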


Step 4: Configure Prometheus to Scrape Podinfo

To enable Prometheus to scrape metrics from the app, define a ServiceMonitor resource:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: podinfo-servicemonitor
  namespace: monitoring
  labels:
    release: monitoring  # must match your Prometheus release label
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: podinfo  # must match the service labels
  namespaceSelector:
    matchNames:
      - podinfo  # the namespace where podinfo is deployed
  endpoints:
    - port: http        # matches the port name in the service
      path: /metrics
      interval: 15s

Apply it:

kubectl apply -f podinfo-servicemonitor.yaml
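
To confirm that Prometheus actually discovered the new target, port-forward the Prometheus service and look for a podinfo entry on the Targets page:

kubectl port-forward svc/monitoring-kube-prometheus-prometheus 9090:9090 -n monitoring
# then open http://localhost:9090/targets in your browser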

Success!
You now have a full observability pipeline set up:

  • 🔍 Metrics via Prometheus

  • 📊 Dashboards in Grafana

  • 📜 Centralized logs via Loki

  • 🧪 Sample app to test real-world scenarios

Step 5: Simulate Traffic and Visualize Metrics

Before we dive into Grafana, let’s simulate real-world traffic so we can actually see data flowing into our dashboards.

Generate Load on Podinfo
Run the provided traffic generator script to create realistic traffic patterns:

chmod +x generate-traffic.sh
./generate-traffic.sh

This script continuously sends requests to various endpoints of the Podinfo app (a simplified sketch of such a loop follows this list), including:

  • Health checks and environment details

  • Simulated 200, 404, and 500 responses

  • Delays and chunked responses

  • POST requests for logging and caching

  • Token issuance and validation
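
The full script ships with the repo; as a rough idea of what such a generator does, a stripped-down version might look like this (endpoints taken from Podinfo's standard API, not copied from the actual script):

#!/bin/bash
# Minimal traffic loop against a port-forwarded Podinfo instance
BASE_URL=${BASE_URL:-http://localhost:8080}

while true; do
  curl -s "$BASE_URL/" > /dev/null                  # normal 200 responses
  curl -s "$BASE_URL/status/404" > /dev/null        # simulated client errors
  curl -s "$BASE_URL/status/500" > /dev/null        # simulated server errors
  curl -s "$BASE_URL/delay/1" > /dev/null           # slow responses for latency panels
  curl -s -X POST -d '{"test":"true"}' "$BASE_URL/api/echo" > /dev/null   # POST traffic
  sleep 1
done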

Step 6: Import & Customize Grafana Dashboards

Then open Grafana in your browser using your LoadBalancer EXTERNAL-IP (in my setup, http://34.121.206.30/).

  • Username: admin

  • Password: prom-operator (or the one you set)

6.1: Import a Production-Ready Dashboard

  1. In Grafana, click the ”+” → Import.

  2. Use the dashboard ID 6671 (or search for “Go Processes”).

  3. Select Prometheus as your data source.

  4. Hit Import.

You’ll now see live runtime metrics from your Podinfo app, including memory usage, goroutines, garbage collection stats, and more — everything you’d expect in a real-world Go service!

We’re tracking key runtime metrics from each Podinfo instance:

  • Memory usage: both resident and virtual memory over time

  • Heap and stack usage: monitored via Go’s internal memory stats

  • Open file descriptors: useful for identifying resource exhaustion

  • Goroutines: active Go routines per pod, helping us spot leaks or spikes

  • Garbage collection duration: GC pauses that can impact performance

6.2: Create a Custom Grafana Dashboard for HTTP Metrics

Let’s now build a custom dashboard tailored for HTTP behavior — the real-world signals you care about most.

We’ll visualise:

  • ✅ Total HTTP requests

  • ❌ Failed requests (4xx/5xx)

  • ⏱ Request duration (latency)

  • 📊 Success rate percentage

🛠 Dashboard Setup

  1. Open Grafana (your LoadBalancer EXTERNAL-IP, or http://localhost:3000 if you're using the port-forward shown earlier)

  2. In the left sidebar, click the “+” → Dashboard

  3. Click “Add a new Visualisation”

📈 Panel 1: Total HTTP Requests

  • Query:
sum(rate(http_requests_total[1m]))

  • Legend: Total Requests

  • Visualization: Time series

⚠️ Panel 2: Failed Requests

  • Query:
sum(rate(http_requests_total{status=~"4..|5.."}[1m]))

  • Legend: Failed Requests

  • Use a red color to highlight errors visually

⏱ Panel 3: Average Request Duration

  • Query:
rate(http_request_duration_seconds_sum[1m]) 
/ 
rate(http_request_duration_seconds_count[1m])

  • Legend: Avg Duration (s)

  • Set Y-axis unit to seconds

📊 Panel 4: Success Rate (%)

  • Query:
(
  sum(rate(http_requests_total{status!~"5.."}[1m]))
  /
  sum(rate(http_requests_total[1m]))
) * 100

  • Legend: Success Rate (%)

  • Y-axis unit: Percentage

  • Add a threshold line at 99% if you want to monitor SLOs

Once completed, you’ll have a live dashboard showing production-grade HTTP observability in real time. This gives you critical insight into traffic, latency, and app health — all from Prometheus metrics.

6.3: Create a Custom Dashboard for Logs with Loki

Now that metrics are flowing, let’s take it one step further and centralize application logs using Grafana Loki. This helps you investigate real-time issues, correlate logs with metrics, and debug with precision — just like in production setups.

Step-by-Step: Build Your Logging Dashboard

  1. Go to Grafana Dashboards

    Click ➕ Create → Dashboard

  2. Add a Panel

    Click “Add a new panel”

  3. Select Data Source: Loki

🪵 Panel 1: All Logs from Podinfo

  • Query:
{app_kubernetes_io_name="podinfo"}

  • Title: All Logs (Live)

  • Use the Logs visualization.

  • Enable Live Mode if desired for real-time streaming.

🔴 Panel 2: Error Logs

Query:

{namespace="podinfo"} |= "error"

Title: Error Logs

🟡 Panel 3: Warning Logs

{namespace="podinfo"} |= "warn"

Title: Warning Logs

🧾 Panel 4: HTTP Request Logs

Query:

Title: HTTP Requests

⚠️ Note on log types: Since Podinfo is a lightweight demo app, it doesn't emit detailed business-level logs such as login/login-failed events, access-denied messages, user activity, or database queries.

However, in a real-world app, you can extend your log pipeline by:

  • Structuring logs with log levels and context fields

  • Adding app-specific messages for each major operation

  • Parsing them in Promtail or Fluent Bit with json stages to extract log levels and labels (see the sketch below)
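
As an illustration, a Promtail pipeline_stages block that parses JSON log lines and promotes the level field to a Loki label could look like this (a sketch assuming your app logs JSON with a level field; how you inject it depends on your Promtail chart version, and it is not part of this project's config):

pipeline_stages:
  - json:
      expressions:
        level: level
  - labels:
      level: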

6.4: Monitor the Kubernetes Cluster with Prebuilt Grafana Dashboards

When you deploy kube-prometheus-stack, it automatically ships with a set of prebuilt dashboards — designed by the community and trusted by production teams. These dashboards give you instant observability across your Kubernetes cluster, nodes, and critical components.

🛠️ How to Access These Dashboards

  1. Open Grafana's Dashboards page at http://<your-external-ip>/dashboards (http://34.121.206.30/dashboards in my setup)

  2. Go to the left sidebar → “Dashboards” → “Browse”

  3. You’ll see a list of dashboards categorized by folders (e.g. Compute Resources/Cluster, kubernetes-mixin, node-exporter-mixin, etc.)

  4. Click any dashboard to explore metrics, set filters, or customize panels

💡 These dashboards are automatically provisioned under the Dashboards folder when you deploy kube-prometheus-stack.


🎯 What You Can Monitor at Cluster Level

  • Node pressure (CPU throttling, memory saturation)

  • Disk IO & PVC health

  • Network traffic by pod/workload

  • Scheduler latency & pending pod queues

  • Control plane metrics (API server, scheduler, kube-proxy)

  • System-level metrics (load, CPU idle, context switches)
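
Behind these dashboards sit standard node-exporter and kube-state-metrics series, so you can also query them directly. For example, these are common node-exporter expressions for per-node CPU and memory utilisation (standard queries, not specific to this project):

# CPU utilisation per node (% busy)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory utilisation per node (% used)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100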

✅ Conclusion

Congratulations! 🎉 You’ve just built a production-grade observability stack on Kubernetes — the kind used by real DevOps and SRE teams around the world.

With just a few Helm charts, Prometheus, Loki, and Grafana, you now have:

  • 📈 Live Metrics Dashboards to monitor app health, performance, and traffic

  • 📜 Structured Logs for debugging, auditing, and tracing behavior

  • 🔍 Prebuilt Kubernetes Dashboards for deep cluster introspection

  • 🧪 Real-World Traffic Simulation to validate your monitoring setup in action

This setup gives you everything you need to:

  • Detect performance regressions ⚠️

  • Spot failures early 💥

  • Investigate issues with precision 🕵️‍♂️

  • Correlate logs and metrics effortlessly 📊+📜

  • And lay the groundwork for SLIs, SLOs, and alerts


💡 What’s Next?

In the next phase of this journey, I’ll be implementing the following to take observability to the next level:

Alertmanager + Custom Alert Rules

For real-time, actionable alerts that detect anomalies before users do.

OpenTelemetry Integration

To unlock full distributed tracing across services, ideal for debugging complex interactions.

GitOps with ArgoCD

To manage infrastructure and application delivery using declarative, version-controlled workflows.

Real Microservices Expansion

I’ll replace the demo app with production-style Node.js microservices, tracking full request lifecycles and user journeys across services.


These upgrades will be featured in a new GitHub project, offering a complete, production-like environment for DevOps, SRE, and platform engineering use cases.

Stay tuned — this is just the beginning. 🚀
