Introduction to Observability

In today's fast-paced DevOps-driven environments, ensuring systems are running smoothly and diagnosing issues quickly is more important than ever. This is where Observability comes in.
Observability is the ability to understand what's happening inside a system based on the data it produces, such as logs, metrics, and traces. It goes beyond traditional monitoring by not only telling you what is wrong, but also helping you figure out why it's wrong.
Why Monitoring?
- Detecting issues early (e.g. high CPU usage, failed requests)
- Alerting you in real-time so you can act before users are affected
- Providing historical data to spot trends and bottlenecks
Does Observability Cover Monitoring?
Yes, monitoring is a subset of observability.
Observability is a broader concept that includes monitoring as one of its components: monitoring focuses on tracking specific metrics and alerting on predefined conditions.
Monitoring: “My house alarm is going off.”
Observability: “Why is the alarm going off? Did someone break in? Was it a sensor error? Did my dog trigger it?”
Observing on Bare-Metal Servers vs. Observing Kubernetes
Bare-Metal Servers:
- Simpler Observability: Easier to collect and correlate logs, metrics, and traces due to fewer components and layers.
Kubernetes:
- Complex Observability: Requires sophisticated tools to handle the dynamic and distributed nature of containers and microservices.
- Integration: Necessitates integrating multiple observability tools to get a complete picture of the system.
What are the Tools Available?
Monitoring Tools: Prometheus, Grafana, Nagios.
Observability Tools: ELK Stack (Elasticsearch, Logstash, Kibana), EFK Stack (Elasticsearch, FluentBit, Kibana), Splunk, Jaeger, Zipkin, New Relic, Dynatrace, Datadog.
Monitoring
Metrics vs Monitoring
Metrics are the data points you collect (like CPU usage, memory, request count), while Monitoring is the process of using those metrics to understand system health and performance.
Think of it like this:
📊 Metrics = Raw numbers
👀 Monitoring = Watching those numbers to catch issues
Prometheus
Prometheus is an open-source monitoring and alerting tool designed for recording real-time metrics and generating alerts based on them.
🕵️♂️ Prometheus = Metric collector + Alert system for your apps and infrastructure.
Prometheus Architecture
Prometheus Server
Pulls metrics from targets and stores them in a time-series database.
Retrieval: This module handles the scraping of metrics from endpoints, which are discovered either through static configurations or dynamic service discovery methods.
TSDB (Time Series Database): The data scraped from targets is stored in the TSDB, which is designed to handle high volumes of time-series data efficiently.
HTTP Server: This provides an API for querying data using PromQL, retrieving metadata, and interacting with other components of the Prometheus ecosystem.
Service Discovery
It helps Prometheus automatically find things to monitor, like apps or servers, without you telling it every time.
For example:
If a new app starts, Prometheus can find it by itself and start collecting data.
🧠 Think of it like:
Prometheus has a map that updates itself whenever something new appears.
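A minimal prometheus.yml sketch showing both approaches (the job names and target address here are hypothetical):

scrape_configs:
  # Static configuration: you list the targets yourself.
  - job_name: "static-app"
    static_configs:
      - targets: ["app.example.com:8080"]
  # Dynamic service discovery: Prometheus queries the Kubernetes API
  # for pods and starts scraping new ones automatically.
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod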
Pushgateway
Lets short-lived jobs push their metrics to an intermediate gateway, which Prometheus then scrapes.
It's particularly useful for batch jobs or tasks that have a limited lifespan and would otherwise not have their metrics collected.
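For example, a batch job can push a metric to the Pushgateway with curl just before it exits (the gateway address and metric name below are hypothetical); Prometheus then scrapes the Pushgateway like any other target:

echo "backup_duration_seconds 42" | curl --data-binary @- http://pushgateway.example.com:9091/metrics/job/nightly_backup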
Alertmanager
Handles alerts sent by Prometheus and routes them to receivers such as email, Slack, etc.
Exporters
Exporters are small applications that collect metrics from various third-party systems and expose them in a format Prometheus can scrape. They are essential for monitoring systems that do not natively support Prometheus.
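For example, Node Exporter exposes host-level metrics (CPU, memory, disk, network) on port 9100. Assuming it is running locally, you can inspect the raw text format that Prometheus scrapes:

curl -s http://localhost:9100/metrics | head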
Prometheus Web UI
Lets you query and visualize data using PromQL.
Grafana
Grafana is a powerful dashboard and visualization tool that integrates with Prometheus to provide rich, customizable visualizations of the metrics data.
API Clients
API clients interact with Prometheus through its HTTP API to fetch data, query metrics, and integrate Prometheus with other systems or custom applications.
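A quick sketch, assuming Prometheus is reachable on localhost:9090: the instant-query endpoint evaluates a PromQL expression and returns the result as JSON.

curl 'http://localhost:9090/api/v1/query?query=up'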
🛠️ Installation & Configurations
Step 1: Create EKS Cluster
You can also run this on a minikube cluster, but here we will see how to create an EKS cluster and configure Prometheus in it.
Prerequisites
- Download and install the AWS CLI.
- Set up and configure the AWS CLI using the aws configure command.
- Install and configure eksctl using the steps mentioned here.
- Install and configure kubectl as mentioned here.
# Create the cluster using the command below
eksctl create cluster --name=observability \
--region=us-east-1 \
--zones=us-east-1a,us-east-1b \
--without-nodegroup
# Sets up trust between AWS IAM and your EKS cluster
eksctl utils associate-iam-oidc-provider \
--region us-east-1 \
--cluster observability \
--approve
# Next, create a node group
eksctl create nodegroup --cluster=observability \
--region=us-east-1 \
--name=observability-ng-private \
--node-type=t3.medium \
--nodes-min=2 \
--nodes-max=3 \
--node-volume-size=20 \
--managed \
--asg-access \
--external-dns-access \
--full-ecr-access \
--appmesh-access \
--alb-ingress-access \
--node-private-networking
# Update the ~/.kube/config file
aws eks update-kubeconfig --name observability --region us-east-1
Step 2: Install kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Step 3: Create a new namespace "monitoring" for the chart
kubectl create ns monitoring
Step 4: Alertmanager Configuration for kube-prometheus-stack
Create a file named custom_kube_prometheus_stack.yml (using vim or any editor) and paste the code below:
alertmanager:
  alertmanagerSpec:
    # Selects Alertmanager configuration based on these labels. Ensure that the Alertmanager configuration has matching labels.
    # ✅ Solves error: Misconfigured Alertmanager selectors can lead to missing alert configurations.
    # ✅ Solves error: Alertmanager wasn't able to find the applied CRD (kind: AlertmanagerConfig)
    alertmanagerConfigSelector:
      matchLabels:
        release: monitoring
    # Sets the number of Alertmanager replicas to 2 for high availability.
    # ✅ Solves error: A single replica can cause alerting issues during pod failures.
    # ✅ Solves error: Alertmanager Cluster Status is Disabled (GitHub issue)
    replicas: 2
    # Sets the strategy for matching Alertmanager configurations. 'None' means no specific matching strategy.
    # ✅ Solves error: An incorrect matcher strategy can lead to unhandled alert configurations.
    # ✅ Solves error: Get rid of namespace matchers when creating AlertmanagerConfig (GitHub issue)
    alertmanagerConfigMatcherStrategy:
      type: None
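With the values file in place, install the chart into the monitoring namespace. The release name monitoring matches both the release: monitoring label above and the helm uninstall command in the cleanup step:

helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
-f custom_kube_prometheus_stack.yml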
Step 5: Verify the Installation
kubectl get all -n monitoring
Prometheus UI:
kubectl port-forward service/prometheus-operated -n monitoring 9090:9090
NOTE: If you are using an EC2 instance or cloud VM, you need to pass --address 0.0.0.0 to the above command. Then you can access the UI on instance-ip:port.
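For example:

kubectl port-forward service/prometheus-operated -n monitoring 9090:9090 --address 0.0.0.0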
Grafana UI: the default username is admin and the password is prom-operator
kubectl port-forward service/monitoring-grafana -n monitoring 8080:80
Alertmanager UI:
kubectl port-forward service/alertmanager-operated -n monitoring 9093:9093
Step 6: Clean Up
- Uninstall helm chart:
helm uninstall monitoring --namespace monitoring
- Delete namespace:
kubectl delete ns monitoring
- Delete Cluster & everything else:
eksctl delete cluster --name observability
📊 Metrics in Prometheus
Metrics in Prometheus are time-stamped data points that describe the performance and behavior of a system over time. They help you monitor, analyze, and alert based on how your applications and infrastructure are performing.
Example:
container_cpu_usage_seconds_total{namespace="kube-system", endpoint="https-metrics"}
Here, container_cpu_usage_seconds_total is the metric name, and {namespace="kube-system", endpoint="https-metrics"} are the labels.
What is PromQL?
PromQL (Prometheus Query Language) is a powerful and flexible query language used to query data from Prometheus.
It allows you to retrieve and manipulate time series data, perform mathematical operations, aggregate data, and much more.
Basic Examples of PromQL
container_cpu_usage_seconds_total
- Returns all time series with the metric container_cpu_usage_seconds_total.
container_cpu_usage_seconds_total{namespace="kube-system",pod=~"kube-proxy.*"}
- Returns all time series with the metric container_cpu_usage_seconds_total and the given namespace and pod labels.
Aggregation & Functions in PromQL
Total HTTP requests across all jobs:
sum(http_requests_total)
Average CPU usage per instance:
avg(rate(node_cpu_seconds_total[5m])) by (instance)
**rate() Function:** The rate() function calculates the per-second average rate of increase of the time series over a specified range, e.g.
rate(container_cpu_usage_seconds_total[5m])
**increase() Function:** The increase() function returns the increase in a counter over a specified time range, e.g.
increase(kube_pod_container_status_restarts_total[1h])
Instrumentation
Instrumentation refers to the process of adding monitoring capabilities to your applications, systems, or services.
This involves embedding/writing code or using tools to collect metrics, logs, or traces that provide insights into how the system is performing.
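As a minimal sketch using the official Prometheus Python client (prometheus_client; the metric names and port below are made up for illustration), an application can expose a /metrics endpoint for Prometheus to scrape:

from prometheus_client import Counter, Gauge, start_http_server
import time

# Counter: only ever goes up (e.g., total requests handled).
REQUESTS = Counter("myapp_requests_total", "Total requests handled")
# Gauge: can go up and down (e.g., requests currently in flight).
IN_FLIGHT = Gauge("myapp_inflight_requests", "Requests currently being handled")

if __name__ == "__main__":
    start_http_server(8000)  # serves metrics at http://localhost:8000/metrics
    while True:
        with IN_FLIGHT.track_inprogress():  # gauge goes up, then back down
            REQUESTS.inc()                  # counter only increases
            time.sleep(1)                   # simulate doing some work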
Types of Metrics in Prometheus
Counter: A Counter is a cumulative metric that represents a single numerical value that only ever goes up. It is used for counting events like the number of HTTP requests, errors, or tasks completed.
Example: Counting the number of times a container restarts in your Kubernetes cluster
Gauge: A Gauge is a metric that represents a single numerical value that can go up and down. It is typically used for things like memory usage, CPU usage, or the current number of active users.
Example: Monitoring the memory usage of a container in your Kubernetes cluster.
Histogram: A Histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values and a count of observations.
Example: Measuring the response time of Kubernetes API requests in various time buckets.
Summary: Similar to a Histogram, a Summary samples observations and provides a total count of observations, their sum, and configurable quantiles (percentiles).
Example: Monitoring the 95th percentile of request durations to understand high latency in your Kubernetes API.
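For instance, a common PromQL pattern computes the 95th percentile from a Histogram's buckets (shown here with the Kubernetes API server's request-duration metric). Note that this works only for Histograms; Summaries precompute their quantiles on the client side:

histogram_quantile(0.95, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le))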