Introduction to Observability

In today's fast-paced DevOps-driven environments, ensuring systems are running smoothly and diagnosing issues quickly is more important than ever. This is where Observability comes in.
Observability is the ability to understand what's happening inside a system based on the data it produces, such as logs, metrics, and traces. It goes beyond traditional monitoring by not only telling you what is wrong, but also helping you figure out why it's wrong.
Why Monitoring?
- Detecting issues early (e.g. high CPU usage, failed requests)
- Alerting you in real-time so you can act before users are affected
- Providing historical data to spot trends and bottlenecks
Does Observability Cover Monitoring?
Yes, monitoring is a subset of observability.
Observability is a broader concept that includes monitoring as one of its components: monitoring focuses on tracking specific metrics and alerting on predefined conditions.
Monitoring: “My house alarm is going off.”
Observability: “Why is the alarm going off? Did someone break in? Was it a sensor error? Did my dog trigger it?”
Observing on Bare-Metal Servers vs. Observing Kubernetes
Bare-Metal Servers:
- Simpler Observability: Easier to collect and correlate logs, metrics, and traces due to fewer components and layers.
Kubernetes:
- Complex Observability: Requires sophisticated tools to handle the dynamic and distributed nature of containers and microservices.
- Integration: Necessitates integrating multiple observability tools to get a complete picture of the system.
What are the Tools Available?
Monitoring Tools: Prometheus, Grafana, Nagios.
Observability Tools: ELK Stack (Elasticsearch, Logstash, Kibana), EFK Stack (Elasticsearch, FluentBit, Kibana), Splunk, Jaeger, Zipkin, New Relic, Dynatrace, Datadog.
Monitoring
Metrics vs Monitoring
Metrics are the data points you collect (like CPU usage, memory, request count), while Monitoring is the process of using those metrics to understand system health and performance.
Think of it like this:
📊 Metrics = Raw numbers
👀 Monitoring = Watching those numbers to catch issues
Prometheus
Prometheus is an open-source monitoring and alerting tool designed for recording real-time metrics and generating alerts based on them.
🕵️♂️ Prometheus = Metric collector + Alert system for your apps and infrastructure.
Prometheus Architecture
Prometheus Server
Pulls metrics from targets and stores them in a time-series database.
Retrieval: This module handles the scraping of metrics from endpoints, which are discovered either through static configurations or dynamic service discovery methods.
TSDB (Time Series Database): The data scraped from targets is stored in the TSDB, which is designed to handle high volumes of time-series data efficiently.
HTTP Server: This provides an API for querying data using PromQL, retrieving metadata, and interacting with other components of the Prometheus ecosystem.
Service Discovery
It helps Prometheus automatically find things to monitor, like apps or servers, without you telling it every time.
For example:
If a new app starts, Prometheus can find it by itself and start collecting data.
🧠 Think of it like:
Prometheus has a map that updates itself whenever something new appears.
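A minimal prometheus.yml sketch showing both approaches (the job names and target address here are hypothetical):

scrape_configs:
  # Static configuration: you list the targets yourself.
  - job_name: "static-app"
    static_configs:
      - targets: ["app.example.com:8080"]
  # Dynamic service discovery: Prometheus queries the Kubernetes API
  # for pods and starts scraping new ones automatically.
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod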
Pushgateway
Lets short-lived jobs push their metrics to an intermediate gateway, which Prometheus then scrapes.
It's particularly useful for batch jobs or tasks that have a limited lifespan and would otherwise not have their metrics collected.
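For example, a batch job can push a metric to the Pushgateway with curl just before it exits (the gateway address and metric name below are hypothetical); Prometheus then scrapes the Pushgateway like any other target:

echo "backup_duration_seconds 42" | curl --data-binary @- http://pushgateway.example.com:9091/metrics/job/nightly_backup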
Alertmanager
Handles alerts sent by Prometheus and routes them to receivers such as email, Slack, etc.
Exporters
Exporters are small applications that collect metrics from various third-party systems and expose them in a format Prometheus can scrape. They are essential for monitoring systems that do not natively support Prometheus.
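For example, Node Exporter exposes host-level metrics (CPU, memory, disk, network) on port 9100. Assuming it is running locally, you can inspect the raw text format that Prometheus scrapes:

curl -s http://localhost:9100/metrics | head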
Prometheus Web UI
Lets you query and visualize data using PromQL.
Grafana
Grafana is a powerful dashboard and visualization tool that integrates with Prometheus to provide rich, customizable visualizations of the metrics data.
API Clients
API clients interact with Prometheus through its HTTP API to fetch data, query metrics, and integrate Prometheus with other systems or custom applications.
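A quick sketch, assuming Prometheus is reachable on localhost:9090: the instant-query endpoint evaluates a PromQL expression and returns the result as JSON.

curl 'http://localhost:9090/api/v1/query?query=up'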
🛠️ Installation & Configurations
Step 1: Create EKS Cluster
You can also run this on a minikube cluster, but here we will see how to create an EKS cluster and configure Prometheus in it.
Prerequisites
- Download and install the AWS CLI.
- Set up and configure the AWS CLI using the aws configure command.
- Install and configure eksctl using the steps mentioned here.
- Install and configure kubectl as mentioned here.
# Create the cluster using the command below
eksctl create cluster --name=observability \
--region=us-east-1 \
--zones=us-east-1a,us-east-1b \
--without-nodegroup
# Sets up trust between AWS IAM and your EKS cluster
eksctl utils associate-iam-oidc-provider \
--region us-east-1 \
--cluster observability \
--approve
# Next, create a node group
eksctl create nodegroup --cluster=observability \
--region=us-east-1 \
--name=observability-ng-private \
--node-type=t3.medium \
--nodes-min=2 \
--nodes-max=3 \
--node-volume-size=20 \
--managed \
--asg-access \
--external-dns-access \
--full-ecr-access \
--appmesh-access \
--alb-ingress-access \
--node-private-networking
# Update the ~/.kube/config file
aws eks update-kubeconfig --name observability --region us-east-1
Step 2: Install kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Step 3: Create a new namespace "monitoring" for the chart
kubectl create ns monitoring
Step 4: Alertmanager Configuration for kube-prometheus-stack
Create a file named custom_kube_prometheus_stack.yml (using vim or any editor) and paste the code below:
alertmanager:
  alertmanagerSpec:
    # Selects Alertmanager configuration based on these labels. Ensure that the Alertmanager configuration has matching labels.
    # ✅ Solves error: Misconfigured Alertmanager selectors can lead to missing alert configurations.
    # ✅ Solves error: Alertmanager wasn't able to find the applied CRD (kind: AlertmanagerConfig)
    alertmanagerConfigSelector:
      matchLabels:
        release: monitoring
    # Sets the number of Alertmanager replicas to 2 for high availability.
    # ✅ Solves error: A single replica can cause alerting issues during pod failures.
    # ✅ Solves error: Alertmanager Cluster Status is Disabled (GitHub issue)
    replicas: 2
    # Sets the strategy for matching Alertmanager configurations. 'None' means no specific matching strategy.
    # ✅ Solves error: An incorrect matcher strategy can lead to unhandled alert configurations.
    # ✅ Solves error: Get rid of namespace matchers when creating AlertmanagerConfig (GitHub issue)
    alertmanagerConfigMatcherStrategy:
      type: None
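With the values file in place, install the chart into the monitoring namespace. The release name monitoring matches both the release: monitoring label above and the helm uninstall command in the cleanup step:

helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
-f custom_kube_prometheus_stack.yml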
Step 5: Verify the Installation
kubectl get all -n monitoring
Prometheus UI:
kubectl port-forward service/prometheus-operated -n monitoring 9090:9090
NOTE: If you are using an EC2 instance or cloud VM, you need to pass --address 0.0.0.0 to the above command. Then you can access the UI on instance-ip:port.
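For example:

kubectl port-forward service/prometheus-operated -n monitoring 9090:9090 --address 0.0.0.0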
Grafana UI: the default username is admin and the password is prom-operator
kubectl port-forward service/monitoring-grafana -n monitoring 8080:80
Alertmanager UI:
kubectl port-forward service/alertmanager-operated -n monitoring 9093:9093
Step 6: Clean Up
- Uninstall helm chart:
helm uninstall monitoring --namespace monitoring
- Delete namespace:
kubectl delete ns monitoring
- Delete Cluster & everything else:
eksctl delete cluster --name observability
📊 Metrics in Prometheus
Metrics in Prometheus are time-stamped data points that describe the performance and behavior of a system over time. They help you monitor, analyze, and alert based on how your applications and infrastructure are performing.
Example:
container_cpu_usage_seconds_total{namespace="kube-system", endpoint="https-metrics"}
Here, container_cpu_usage_seconds_total is the metric name, and {namespace="kube-system", endpoint="https-metrics"} are the labels.
What is PromQL?
PromQL (Prometheus Query Language) is a powerful and flexible query language used to query data from Prometheus.
It allows you to retrieve and manipulate time series data, perform mathematical operations, aggregate data, and much more.
Basic Examples of PromQL
container_cpu_usage_seconds_total
- Returns all time series with the metric container_cpu_usage_seconds_total.
container_cpu_usage_seconds_total{namespace="kube-system",pod=~"kube-proxy.*"}
- Returns all time series with the metric container_cpu_usage_seconds_total and the given namespace and pod labels.
Aggregation & Functions in PromQL
Total HTTP requests across all jobs:
sum(http_requests_total)
Average CPU usage per instance:
avg(rate(node_cpu_seconds_total[5m])) by (instance)
**rate() Function:** The rate() function calculates the per-second average rate of increase of the time series over a specified range, e.g.
rate(container_cpu_usage_seconds_total[5m])
**increase() Function:** The increase() function returns the increase in a counter over a specified time range, e.g.
increase(kube_pod_container_status_restarts_total[1h])
Instrumentation
Instrumentation refers to the process of adding monitoring capabilities to your applications, systems, or services.
This involves embedding/writing code or using tools to collect metrics, logs, or traces that provide insights into how the system is performing.
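As a minimal sketch using the official Prometheus Python client (prometheus_client; the metric names and port below are made up for illustration), an application can expose a /metrics endpoint for Prometheus to scrape:

from prometheus_client import Counter, Gauge, start_http_server
import time

# Counter: only ever goes up (e.g., total requests handled).
REQUESTS = Counter("myapp_requests_total", "Total requests handled")
# Gauge: can go up and down (e.g., requests currently in flight).
IN_FLIGHT = Gauge("myapp_inflight_requests", "Requests currently being handled")

if __name__ == "__main__":
    start_http_server(8000)  # serves metrics at http://localhost:8000/metrics
    while True:
        with IN_FLIGHT.track_inprogress():  # gauge goes up, then back down
            REQUESTS.inc()                  # counter only increases
            time.sleep(1)                   # simulate doing some work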
Types of Metrics in Prometheus
Counter: A Counter is a cumulative metric that represents a single numerical value that only ever goes up. It is used for counting events like the number of HTTP requests, errors, or tasks completed.
Example: Counting the number of times a container restarts in your Kubernetes cluster
Gauge: A Gauge is a metric that represents a single numerical value that can go up and down. It is typically used for things like memory usage, CPU usage, or the current number of active users.
Example: Monitoring the memory usage of a container in your Kubernetes cluster.
Histogram: A Histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values and a count of observations.
Example: Measuring the response time of Kubernetes API requests in various time buckets.
Summary: Similar to a Histogram, a Summary samples observations and provides a total count of observations, their sum, and configurable quantiles (percentiles).
Example: Monitoring the 95th percentile of request durations to understand high latency in your Kubernetes API.
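For instance, a common PromQL pattern computes the 95th percentile from a Histogram's buckets (shown here with the Kubernetes API server's request-duration metric). Note that this works only for Histograms; Summaries precompute their quantiles on the client side:

histogram_quantile(0.95, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le))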