End-to-End Kubernetes Observability with ArgoCD, Prometheus, and Grafana on KinD using Terraform

Balraj Singh

This guide will help you set up Prometheus and Grafana on Kubernetes with ArgoCD, ideal for both beginners and experienced DevOps professionals. The project focuses on monitoring Kubernetes clusters using Prometheus (for metrics collection) and Grafana (for visualizing those metrics). We'll also briefly cover ArgoCD for continuous delivery and how to integrate these tools to create a robust observability platform.

What is KinD?

KinD (Kubernetes in Docker)

Overview:

  • Purpose: KinD is a tool for running local Kubernetes clusters using Docker container “nodes”.

  • Use Case: Ideal for testing and developing Kubernetes features in a test environment.

Advantages:

  • Simplicity: Creating and managing a local Kubernetes cluster is straightforward.

  • Ephemeral Nature: Clusters can be easily deleted after testing.

Limitations:

  • Not for Production: KinD is not designed for production use.

  • Lacks Patching and Upgrades: No supported way to maintain patching and upgrades.

  • Resource Restrictions: Unable to provide functioning OOM metrics or other resource restriction features.

  • Minimal Security: No additional security configurations are implemented.

  • Single Machine: Not intended to span multiple physical machines or be long-lived.

Suitable Use-Cases:

  • Kubernetes Development: Developing Kubernetes itself.

  • Testing Changes: Testing application or Kubernetes changes in a continuous integration environment.

  • Local Development: Local Kubernetes cluster application development.

  • Cluster API: Bootstrapping Cluster API.

If you would like to learn more about KinD and its use cases, I recommend reading the KinD project scope document which outlines the project’s design and development priorities.
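For reference, the three-node cluster used later in this guide (one control plane and two workers) could be created with a KinD config like the one below. This is a minimal sketch; the file name kind-config.yaml and the node layout are assumptions based on the docker ps output shown later, not the exact configuration used by the Terraform code.

cat <<EOF > kind-config.yaml
# Minimal KinD config: one control-plane node and two workers
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
EOF

# Create the cluster (the default name is "kind", which gives the context "kind-kind")
kind create cluster --config kind-config.yaml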

Prerequisites for This Project

Before starting, make sure you have the following ready:

  • Terraform Code

  • App Code Repo

  • Kubernetes Cluster: A working Kubernetes cluster (KIND cluster or any managed Kubernetes service like EKS or GKE).

  • Docker Installed: To run KIND clusters inside Docker containers.

  • Basic Understanding of Kubernetes: Knowledge of namespaces, pods, and services.

  • Kubectl Installed: For interacting with your Kubernetes cluster.

  • ArgoCD Installed: If you want to use GitOps practices for deploying applications.

Key Steps and Highlights

Here’s what you’ll be doing in this project:

  • Port Forwarding with Prometheus: Bind Prometheus to port 9090 and forward it to 0.0.0.0:9090. This allows you to access the Prometheus dashboard from your machine.

  • Expose Ports on Security Groups: Open necessary ports (like 9090 for Prometheus and 3000 for Grafana) in your instance's security group to access them from outside.

  • Monitoring Services: Verify that Kubernetes is sending data to Prometheus. You can check this in the Targets section of Prometheus.

  • PromQL Queries: Use PromQL queries to monitor metrics like container CPU usage, network traffic, etc. You can execute these queries to analyze performance metrics.

  • Visualizing in Grafana: After adding Prometheus as a data source in Grafana, create dashboards to visualize CPU, memory, and network usage over time.

  • Real-Time Monitoring: Generate traffic or CPU load in your applications (like the voting app) to observe how Prometheus collects metrics and Grafana displays them.

  • Custom Dashboards: Use pre-built Grafana dashboards by importing them through dashboard IDs, making it easier to visualize complex data without starting from scratch.

Setting Up the Environment

I have created a Terraform file that sets up the entire environment automatically, including installing the required applications and tools and creating the KinD cluster with ArgoCD.

Setting Up the Virtual Machines (EC2)

First, we'll create the necessary virtual machine (EC2) using Terraform. The Terraform configuration lives in the repo linked under Prerequisites.

Once you clone the repo, go to the folder "12.Real-Time-DevOps-Project/Terraform_Code" and run the Terraform commands.

cd Terraform_Code/

$ ls -l
Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
da---l          25/09/24   7:32 PM                Terraform_Code

Note ⇒ Make sure to run main.tf from inside the folder.

cd 12.Real-Time-DevOps-Project/Terraform_Code

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a---l          25/09/24   2:56 PM            500 .gitignore
-a---l          25/09/24   7:29 PM           4287 main.tf

You need to run the main.tf file using the following Terraform commands:

terraform init
terraform fmt
terraform validate
terraform plan
terraform apply 
# Optional <terraform apply --auto-approve>

Once you run the Terraform commands, we will verify the following things to make sure everything was set up correctly by Terraform.

Inspect the Cloud-Init logs:

Once connected to the EC2 instance, you can check the status of the user_data script by inspecting the log files.

# Primary log file for cloud-init
sudo tail -f /var/log/cloud-init-output.log

  • If the user_data script runs successfully, you will see output logs and any errors encountered during execution.

  • If there’s an error, this log will provide clues about what failed.
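You can also ask cloud-init directly whether the user_data run has finished. This is standard cloud-init tooling on Ubuntu AMIs, not something specific to this project:

# Blocks until cloud-init has finished, then prints its status
cloud-init status --wait

# Full cloud-init log, useful when the output log above is not enough
sudo less /var/log/cloud-init.log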

Verify the Docker version:

ubuntu@ip-172-31-95-197:~$ docker --version
Docker version 24.0.7, build 24.0.7-0ubuntu4.1

List the running containers:

ubuntu@ip-172-31-94-25:~$ docker ps
CONTAINER ID   IMAGE                  COMMAND                  CREATED         STATUS         PORTS                       NAMES
8436c820f163   kindest/node:v1.31.0   "/usr/local/bin/entr…"   4 minutes ago   Up 3 minutes   127.0.0.1:35147->6443/tcp   kind-control-plane
0d6ed793da0e   kindest/node:v1.31.0   "/usr/local/bin/entr…"   4 minutes ago   Up 3 minutes                               kind-worker
5146cf4dfd20   kindest/node:v1.31.0   "/usr/local/bin/entr…"   4 minutes ago   Up 3 minutes                               kind-worker2

Verify the KinD cluster:

After Terraform deploys the instance and the cluster is set up, you can SSH into the instance and run:

kubectl get nodes
kubectl cluster-info
kubectl config get-contexts
kubectl cluster-info --context kind-kind
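If the kind CLI is available on the instance, you can also list the clusters it manages, which is handy later when deleting the cluster by name:

# Lists all KinD clusters on this machine (the default cluster name is "kind")
kind get clusters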

Managing Docker and Kubernetes Pods

Check the Docker containers that are running:

ubuntu@ip-172-31-81-94:~$ docker ps
CONTAINER ID   IMAGE                  COMMAND                  CREATED         STATUS         PORTS                       NAMES
fb53c1a4ae3d   kindest/node:v1.31.0   "/usr/local/bin/entr…"   5 minutes ago   Up 4 minutes                               kind-worker2
0c9416ad738c   kindest/node:v1.31.0   "/usr/local/bin/entr…"   5 minutes ago   Up 4 minutes                               kind-worker
e2ff2e214404   kindest/node:v1.31.0   "/usr/local/bin/entr…"   5 minutes ago   Up 4 minutes   127.0.0.1:35879->6443/tcp   kind-control-plane

List all Kubernetes pods in all namespaces:

kubectl get pods -A

List the existing namespaces:

ubuntu@ip-172-31-81-94:~$ kubectl get namespace
NAME                   STATUS   AGE
argocd                 Active   7m11s
default                Active   7m24s
kube-node-lease        Active   7m24s
kube-public            Active   7m24s
kube-system            Active   7m24s
kubernetes-dashboard   Active   7m1s
local-path-storage     Active   7m20s
monitoring             Active   6m54s

Setup Argo CD

Verify all pods in the argocd namespace:
kubectl get pods -n argocd

Troubleshooting the CreateContainerConfigError/ImagePullBackOff errors

I was getting CreateContainerConfigError/ImagePullBackOff; if you are seeing the same, follow the procedure below to troubleshoot.
kubectl get pods -n argocd
NAME                                                READY   STATUS                       RESTARTS   AGE
argocd-application-controller-0                     0/1     CreateContainerConfigError   0          6m17s
argocd-applicationset-controller-744b76d7fd-b4p4f   1/1     Running                      0          6m17s
argocd-dex-server-5bf5dbc64d-8lp29                  0/1     Init:ImagePullBackOff        0          6m17s
argocd-notifications-controller-84f5bf6896-ztq6q    1/1     Running                      0          6m17s
argocd-redis-74b8999f94-x5cp5                       0/1     Init:ImagePullBackOff        0          6m17s
argocd-repo-server-57f4899557-b4t87                 0/1     CreateContainerConfigError   0          6m17s
argocd-server-7bc7b97977-vh4kg                      0/1     ImagePullBackOff             0          6m17s

Below are the steps for further troubleshooting.

Check image details: run the following command to inspect the pod details:

kubectl describe pod <pod-name> -n argocd

Specifically, look for the Events section for more details about why the image cannot be pulled.

Review Logs for More Clues: You can also check the logs of the pods to see if there are more specific error messages that can help pinpoint the issue:

kubectl logs <pod-name> -n argocd

Network or DNS Issues: If your EC2 instance does not have proper internet access, it will not be able to pull images from DockerHub or other container registries. Ensure that your EC2 instance has internet access and can reach external container registries.

You can test connectivity by running:

curl -I https://registry.hub.docker.com

If there are connectivity issues, check the VPC configuration, security groups, and routing tables.

After researching the issue, I discovered that there was insufficient disk space on the instance. The root volume was previously 8 GB; I edited the Terraform file to increase it to 30 GB.
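If you want to confirm that disk space is the problem before resizing, a couple of quick checks on the instance (standard Linux and Docker commands, nothing specific to this setup):

# Show free space on the root filesystem
df -h /

# Show how much space Docker images, containers, and volumes are using
docker system df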

Check services in Argo CD namespace:
kubectl get svc -n argocd

  • Currently, the argocd-server service type is ClusterIP; we will change it to NodePort.

  • Expose Argo CD server using NodePort:

kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "NodePort"}}'

Forward ports to access Argo CD server:
kubectl port-forward -n argocd service/argocd-server 8443:443 --address=0.0.0.0 &
# kubectl port-forward -n argocd service/argocd-server 8443:443 &

Now, open Argo CD in the browser: https://<public IP Address of EC2>:8443

https://18.212.179.217:8443

  • Argo CD Initial Admin Password

Retrieve the Argo CD admin password to log in to the dashboard:

kubectl get secret -n argocd argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d && echo
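Optionally, if the argocd CLI is installed, you can log in from the shell instead of the browser. This is a sketch assuming the port-forward on 8443 from the previous step is still running:

# Log in to the port-forwarded Argo CD API server with the initial admin password
argocd login localhost:8443 \
  --username admin \
  --password "$(kubectl get secret -n argocd argocd-initial-admin-secret -o jsonpath='{.data.password}' | base64 -d)" \
  --insecure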

Configure the application in Argo CD

Click on Applications > New App and fill in the following (a declarative alternative is sketched below):

  • Application Name: voting-app
  • Project Name: default
  • Sync Policy: Automatic
  • Source > Repository URL: https://github.com/mrbalraj007/k8s-kind-voting-app.git
  • Revision: main
  • Path: k8s-specifications   # path to your manifest files
  • Destination > Cluster URL: https://kubernetes.default.svc
  • Namespace: default
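If you prefer to define the same application declaratively rather than through the UI, a minimal sketch of an equivalent Argo CD Application manifest using the values above (this manifest is my own illustration, not part of the original repo) could be applied like this:

kubectl apply -n argocd -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: voting-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/mrbalraj007/k8s-kind-voting-app.git
    targetRevision: main
    path: k8s-specifications
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated: {}   # mirrors the "Automatic" sync policy selected in the UI
EOF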

Validate the pods
kubectl get pods

Verify the deployment
kubectl get deployments

Verify the services:

kubectl get svc
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
db           ClusterIP   10.96.201.0     <none>        5432/TCP         2m12s
kubernetes   ClusterIP   10.96.0.1       <none>        443/TCP          13m
redis        ClusterIP   10.96.65.124    <none>        6379/TCP         2m12s
result       NodePort    10.96.177.39    <none>        5001:31001/TCP   2m12s
vote         NodePort    10.96.239.136   <none>        5000:31000/TCP   2m12s

Now, we will port-forward the vote and result services:

kubectl port-forward svc/vote 5000:5000 --address=0.0.0.0 &
kubectl port-forward svc/result 5001:5001 --address=0.0.0.0 &

Verify the vote and result apps in the browser.

Now, we will open them in the browser to access the services:

http://<public IP Address of EC2>:5000 (vote)
http://<public IP Address of EC2>:5001 (result)

For Vote:

For result:

Click on any of the items on the voting page to see the results on the result page. I clicked on CATS, and the results are presented below.

We have successfully set up Argo CD on a Kubernetes cluster & deployed the applications automatically.

Setup Kubernetes dashboard:

List all Kubernetes pods in all namespaces:

kubectl get pods -A

List the existing namespaces:

ubuntu@ip-172-31-81-94:~$ kubectl get namespace
NAME                   STATUS   AGE
argocd                 Active   18m
default                Active   18m
kube-node-lease        Active   18m
kube-public            Active   18m
kube-system            Active   18m
kubernetes-dashboard   Active   18m
local-path-storage     Active   18m
monitoring             Active   18m

Create an admin user for the Kubernetes dashboard (dashboard-adminuser.yml):
sudo tee dashboard-adminuser.yml > /dev/null <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: admin-user
  namespace: kubernetes-dashboard
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: admin-user
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: admin-user
  namespace: kubernetes-dashboard
EOF
kubectl apply -f dashboard-adminuser.yml
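A quick way to confirm that the ServiceAccount and ClusterRoleBinding were created (plain kubectl checks, nothing project-specific):

# The ServiceAccount should exist in the kubernetes-dashboard namespace
kubectl get serviceaccount admin-user -n kubernetes-dashboard

# The ClusterRoleBinding binds admin-user to the cluster-admin role
kubectl get clusterrolebinding admin-user -o wide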

Create a token for dashboard access:
kubectl -n kubernetes-dashboard create token admin-user

token:
eyJhbGciOiJSUzI1NiIsImtpZCI6ImVUMnZfV3dTVFRZVmw4Z2ZUOXRtY0ZiUVVDY1ZYb2R1VTdLaXh3LWxVa2cifQ.eyJhdWQiOlsiaHR0cHM6Ly9rdWJlcm5ldGVzLmRlZmF1bHQuc3ZjLmNsdXN0ZXIubG9jYWwiXSwiZXhwIjoxNzI3MzIzOTU5LCJpYXQiOjE3MjczMjAzNTksImlzcyI6Imh0dHBzOi8va3ViZXJuZXRlcy5kZWZhdWx0LnN2Yy5jbHVzdGVyLmxvY2FsIiwianRpIjoiY2M3NzBhMWUtN2VlNi00Y2UwLTg4YzEtODk2NWUwYjExODc3Iiwia3ViZXJuZXRlcy5pbyI6eyJuYW1lc3BhY2UiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsInNlcnZpY2VhY2NvdW50Ijp7Im5hbWUiOiJhZG1pbi11c2VyIiwidWlkIjoiNjRkZGY1YWYtZWU5Ny00MmE2LWE0OTctMTZhYTEzOTBjZThlIn19LCJuYmYiOjE3MjczMjAzNTksInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlcm5ldGVzLWRhc2hib2FyZDphZG1pbi11c2VyIn0.UqIZ2GMrrpHF6a820JnXW_WSBXSmmg-B70RvTmUhi2iPXyIF9lvbcZpZ0_FmwcsrRamAtE3ZEzJw_K7_8u4eYTMRftNbiUoIyuFk2Dll-EQC47xYXzlcycdJ6gTbYLDfZecP9-IniroIm3qvo9oAiOBO4ftHr1KYcuKYp21kMMmsC6uxxIQ2iwloqLBuqjZT7oUB3T4ooXQo1lQP_4PwzFaDGPe5XYKmsHCefckpkNVpeSyMYkfvGGWPw8AmvqopRCbEK-GxosSZ1R18zGHzdDYQxEsuCLjPC2A4dJYpqT0q4JaKs6IWjd6yo0WS0sKraYN3Xea8sG3dFE1-djbsHQ

Now, we have to port-forward to the Kubernetes dashboard as well.

kubectl get svc -n kubernetes-dashboard

kubectl port-forward svc/kubernetes-dashboard -n kubernetes-dashboard 8080:443 --address=0.0.0.0 &

Verify the dashboard in the browser.

Now, we will open it in the browser to access the dashboard:

https://<public IP Address of EC2>:8080

You have to paste the token you generated above.

Here is the final dashboard view:

Setting Up the Monitoring

Verify Helm Version

helm version

Verify the Prometheus Helm charts, namespace, and services:

kubectl get svc -n monitoring
kubectl get namespace

To view all pods in the monitoring namespace:

kubectl get pods -n monitoring
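The Terraform code installs the monitoring stack for you. For reference, if you ever need to install it manually, a sketch using the kube-prometheus-stack chart with the release name kind-prometheus (which matches the service names shown below) would look roughly like this:

# Add the community chart repo and install kube-prometheus-stack into the monitoring namespace
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kind-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace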

Overall pod details:

Now, we have to port-forward the kind-prometheus-kube-prome-prometheus and kind-prometheus-grafana services.

kubectl get svc -n monitoring
NAME                                       TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                         AGE
alertmanager-operated                      ClusterIP   None           <none>        9093/TCP,9094/TCP,9094/UDP      6m17s
kind-prometheus-grafana                    NodePort    10.96.184.88   <none>        80:31000/TCP                    6m23s
kind-prometheus-kube-prome-alertmanager    NodePort    10.96.44.175   <none>        9093:32000/TCP,8080:31594/TCP   6m23s
kind-prometheus-kube-prome-operator        ClusterIP   10.96.88.60    <none>        443/TCP                         6m23s
kind-prometheus-kube-prome-prometheus      NodePort    10.96.47.99    <none>        9090:30000/TCP,8080:32323/TCP   6m23s
kind-prometheus-kube-state-metrics         ClusterIP   10.96.120.53   <none>        8080/TCP                        6m23s
kind-prometheus-prometheus-node-exporter   NodePort    10.96.34.125   <none>        9100:32001/TCP                  6m23s
prometheus-operated                        ClusterIP   None           <none>        9090/TCP                        6m15s

For Prometheus:

kubectl port-forward svc/kind-prometheus-kube-prome-prometheus -n monitoring 9090:9090 --address=0.0.0.0 &

Verify Prometheus in the browser. Now, open the following URLs:

http://<public IP Address of EC2>:9090
http://<public IP Address of EC2>:9090/metrics   # Prometheus's own metrics endpoint, e.g. http://18.206.155.5:9090/metrics

Prometheus Queries

Paste the query below into the expression field in Prometheus and click Execute.

sum (rate (container_cpu_usage_seconds_total{namespace="default"}[1m])) / sum (machine_cpu_cores) * 100

Now, click on the Graph tab after executing the query.

For the network queries:

sum(rate(container_network_receive_bytes_total{namespace="default"}[5m])) by (pod)
sum(rate(container_network_transmit_bytes_total{namespace="default"}[5m])) by (pod)

Now, we will use our vote service to generate traffic and watch it in Prometheus.

kubectl get svc
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
db           ClusterIP   10.96.201.0     <none>        5432/TCP         68m
kubernetes   ClusterIP   10.96.0.1       <none>        443/TCP          79m
redis        ClusterIP   10.96.65.124    <none>        6379/TCP         68m
result       NodePort    10.96.177.39    <none>        5001:31001/TCP   68m
vote         NodePort    10.96.239.136   <none>        5000:31002/TCP   68m

Exposing the vote app

kubectl port-forward svc/vote 5000:5000 --address=0.0.0.0 &

If it is already exposed from earlier, you don't need to do this again.
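To see the network metrics move, you can generate some traffic against the vote app. A simple sketch, assuming the port-forward above is running and the app answers on port 5000:

# Hit the vote page repeatedly for about a minute to produce visible traffic in Prometheus
for i in $(seq 1 600); do
  curl -s -o /dev/null http://localhost:5000/
  sleep 0.1
done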

Configure Grafana

kubectl get svc -n monitoring
NAME                                       TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                         AGE
alertmanager-operated                      ClusterIP   None           <none>        9093/TCP,9094/TCP,9094/UDP      33m
kind-prometheus-grafana                    NodePort    10.96.184.88   <none>        80:31000/TCP                    33m
kind-prometheus-kube-prome-alertmanager    NodePort    10.96.44.175   <none>        9093:32000/TCP,8080:31594/TCP   33m
kind-prometheus-kube-prome-operator        ClusterIP   10.96.88.60    <none>        443/TCP                         33m
kind-prometheus-kube-prome-prometheus      NodePort    10.96.47.99    <none>        9090:30000/TCP,8080:32323/TCP   33m
kind-prometheus-kube-state-metrics         ClusterIP   10.96.120.53   <none>        8080/TCP                        33m
kind-prometheus-prometheus-node-exporter   NodePort    10.96.34.125   <none>        9100:32001/TCP                  33m
prometheus-operated                        ClusterIP   None           <none>        9090/TCP                        33m

For Grafana (expose the port):

kubectl port-forward svc/kind-prometheus-grafana -n monitoring 3000:80 --address=0.0.0.0 &

Verify Grafana in the browser. Now, open:

http://<public IP Address of EC2>:3000 (e.g. http://18.206.155.5:3000)

Username: admin
Password: prom-operator
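prom-operator is the chart's default admin password. If it has been changed in the Helm values, you can read the actual password from the Grafana secret instead; this sketch assumes the release name kind-prometheus, consistent with the service names above:

# kube-prometheus-stack stores the Grafana admin credentials in a secret named <release>-grafana
kubectl get secret -n monitoring kind-prometheus-grafana \
  -o jsonpath="{.data.admin-password}" | base64 -d && echo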

Note ⇒ In Grafana, we have to add data sources and create a dashboard.

Creating a user in Grafana

  • Now, we will create a test user and grant permissions accordingly:

    Home > Administration > Users and access > Users > Create user

  • You can change the user's role as well, as per the screenshot below.

Creating a dashboard in Grafana

  • Home > Connections > Data sources > Build a dashboard > Add visualization > select Prometheus.

  • Select the metric and label filter, click Run query, then click Save.

  • If you want to use a custom dashboard, search Google for "grafana dashboard", pick one (I chose this dashboard), and click Copy ID to clipboard.

  • Now, go back to Grafana: click New > Import, type the dashboard ID "15661", click Load, select Prometheus as the data source, and click Import.

  • Final Dashboard

  • AWS Resources:

  • Environment Cleanup:

    Delete ArgoCD

    To remove ArgoCD resources, it is recommended to delete the corresponding argocd namespace within the Kubernetes environment. This can be achieved by executing the following command:

      kubectl delete namespace argocd
    
    Delete KinD Cluster

    Once you have completed your testing, you may delete the KinD cluster by executing the following command, which will remove the associated Docker containers and volumes.

      kind delete cluster --name kind
    

    Note ⇒ It is critical to supply the name of the cluster to be destroyed. In this demonstration the cluster is named kind (hence the kind-kind context shown earlier), so that name must be supplied during deletion. Be careful: deleting a cluster that does not exist produces no error, and the prompt will still report that the cluster was successfully deleted. To minimize any potential interruption to test pipelines, always supply the actual name of the cluster when deleting it.

    Time to delete the Virtual machine.
  • Note ⇒ As we are using Terraform, we will use the following commands to delete the environment.

    Go to folder "12.Real-Time-DevOps-Project/Terraform_Code" and run the Terraform command.

      cd Terraform_Code/
    
      $ ls -l
      Mode                 LastWriteTime         Length Name
      ----                 -------------         ------ ----
      da---l          25/09/24   9:48 PM                Terraform_Code
    
      terraform destroy --auto-approve
    

Key Takeaways

  • Prometheus for Metrics Collection: It's a time-series database that scrapes metrics from Kubernetes services.

  • Grafana for Visualization: Allows you to create beautiful dashboards from the metrics data stored in Prometheus.

  • Real-Time Monitoring: You can observe real-time performance metrics for CPU, memory, and network using the dashboards.

  • GitOps with ArgoCD: ArgoCD makes deploying and managing applications on Kubernetes easier by automating continuous deployment.

What to Avoid

  • Incorrect Port Forwarding: Make sure you expose the correct ports for both Prometheus (9090) and Grafana (3000). Missing this can lead to connection issues.

  • Skipping Security Groups Configuration: Always open the necessary ports in your cloud provider's security groups. Otherwise, the services won't be accessible externally.

  • Not Using PromQL Effectively: PromQL can seem complex, but it's essential to retrieve valuable metrics like CPU or memory usage. Avoid using generic queries without understanding the context of your services.

Benefits of Using This Setup

  • Improved Observability: Combining Prometheus and Grafana gives you full observability over your Kubernetes clusters.

  • Real-Time Metrics: You can monitor your services in real-time, making troubleshooting faster and more efficient.

  • Scalability: This setup grows with your infrastructure, allowing you to monitor large-scale deployments with ease.

  • Cost-Efficient: Both Prometheus and Grafana are open-source, providing a cost-effective monitoring solution.

  • Ease of Use: With Grafana's pre-built dashboards and PromQL's flexibility, it’s easy to get started and scale monitoring solutions.
