I hope you read my previous blog on Prometheus, which covers the basics. Let me know what you found most helpful in it. In this blog, we will cover the fundamentals of Prometheus, PromQL, dashboarding and visualization, service discovery, Pushgateway, and monitoring Kubernetes. So, let's get started!

Prometheus Architecture:

Who Does What?

Prometheus works like a data collection and alerting system that continuously pulls metrics from various sources and stores them for monitoring and querying.

🔹 Key Components:
1️⃣ Prometheus Server – The core brain 🧠
2️⃣ Targets/Exporters – The data providers 📡
3️⃣ Time-Series Database (TSDB) – The storage 🗄️
4️⃣ PromQL (Prometheus Query Language) – The data analyzer 📊
5️⃣ Service Discovery – The target finder 🔎
6️⃣ Alertmanager – The notifier 🚨
7️⃣ Grafana – The dashboard viewer 📺

🟢 1. Prometheus Server (Core Component)

Function: Pulls data from targets, stores it, and processes queries.
Contains:
- Scraper: Fetches data from predefined targets (like your apps, servers, or containers).
- Storage (TSDB): Stores data as time-series (timestamped records).
- Query Engine (PromQL): Allows data analysis using queries.

👉 Example:

Every 15 seconds, Prometheus asks an app: "Hey, how’s your CPU usage?"
The app replies: "Right now, it's 30%."

Prometheus saves this info as:

  cpu_usage{instance="server1"} 30  # Metric value 30 at timestamp

📡 2. Targets & Exporters (Where Data Comes From)

Function: Expose metrics in a format Prometheus understands.
Types:
1️⃣ Direct Targets → Apps that expose /metrics (e.g., a Go app with Prometheus client library).
2️⃣ Exporters → Convert third-party data into Prometheus format.
- Node Exporter → Monitors Linux system metrics.
- cAdvisor → Monitors Docker container metrics.
- Kube-State-Metrics → Monitors Kubernetes workloads.

👉 Example:

A web app runs on http://app:8000/metrics, and Prometheus scrapes it every 10s.
If the app doesn’t expose /metrics, we use an Exporter to bridge the gap.
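To make "exposing /metrics" concrete, here is a minimal sketch using only the Python standard library. The counter values and the `REQUEST_COUNTS`, `render_metrics`, and `serve` names are made up for illustration; a real application would normally use an official Prometheus client library instead of formatting the text by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counters; a real app would update these as it serves traffic.
REQUEST_COUNTS = {("GET", "200"): 1250, ("POST", "500"): 3}


def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), count in sorted(REQUEST_COUNTS.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {count}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the /metrics path is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, serving /metrics on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` would make the metrics available on port 8000, which matches the http://app:8000/metrics target in the example above.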

🗄️ 3. Time-Series Database (TSDB)

Function: Stores collected metrics efficiently.
Structure:
- Timestamp (When the data was recorded).
- Metric Name (What is being measured).
- Labels (Extra details like instance="server1").
- Value (The actual measurement).

👉 Example (TSDB Entry):

http_requests_total{method="GET", status="200"} 1250  # 1250 GET requests recorded
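One way to picture the storage model is a map from "metric name plus label set" to a list of timestamped values. The sketch below is a toy illustration of that idea only (the `TinyTSDB` class is hypothetical); the real TSDB adds compression, on-disk blocks, and indexing that are not modeled here.

```python
from collections import defaultdict


def series_key(name, labels):
    """A series is identified by its metric name plus its full label set."""
    return (name, tuple(sorted(labels.items())))


class TinyTSDB:
    """Toy in-memory store: one list of (timestamp, value) samples per series."""

    def __init__(self):
        self._series = defaultdict(list)

    def append(self, name, labels, timestamp, value):
        # Label order does not matter; the sorted key makes it canonical.
        self._series[series_key(name, labels)].append((timestamp, float(value)))

    def latest(self, name, labels):
        samples = self._series.get(series_key(name, labels))
        return samples[-1] if samples else None

    def series_count(self):
        return len(self._series)
```

Note that changing any label value (for example method="POST" instead of method="GET") creates a separate series, which is why high-cardinality labels make storage grow quickly.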

📊 4. PromQL (Query Engine)

Function: Allows analysis of collected metrics.
Queries:
- up → Shows which targets are working.
- rate(http_requests_total[5m]) → Average number of requests per second over the last 5 minutes.

👉 Example Query & Output:
Query:

sum(rate(cpu_usage_seconds_total[1m])) by (instance)

Output:

instance="server1" → 0.5  (0.5 CPU-seconds used per second, i.e. about 50% of one core over the last 1 min)

📡 5. Service Discovery Mechanism (Finder of Targets)

Function: Keeps track of which instances (pods, containers, or VMs) are running.
Where does it look?
- Kubernetes API ☸️
- AWS EC2 API ☁️
- Docker Swarm API 🐳
- Consul, ZooKeeper (Serversets, Nerve) 🔗
- Custom HTTP SD API 🌐

👉 Example:

If a new Pod starts in Kubernetes, Prometheus automatically detects and starts scraping it.
If a server shuts down on AWS, it stops scraping it.
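For the custom HTTP SD option, Prometheus periodically fetches a JSON list of target groups, each with a "targets" list and an optional "labels" map. The sketch below builds such a payload from a list of running instances; the `running_instances` input shape and the function name are assumptions for illustration, not part of Prometheus.

```python
import json


def http_sd_response(running_instances):
    """Build an HTTP SD payload: a JSON list of target groups."""
    groups = {}
    for inst in running_instances:
        # Targets that share the same labels go into one target group.
        key = tuple(sorted(inst["labels"].items()))
        groups.setdefault(key, []).append(inst["address"])
    payload = [
        {"targets": sorted(addresses), "labels": dict(key)}
        for key, addresses in sorted(groups.items())
    ]
    return json.dumps(payload)
```

When an instance disappears from the input, it simply stops appearing in the next response, and Prometheus stops scraping it after its next refresh.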

🚨 6. Alertmanager (Sends Notifications)

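Function: Receives alerts fired by the Prometheus server, then deduplicates, groups, and routes them as notifications. The alert rules themselves are evaluated by Prometheus, not by Alertmanager.

Example Alert Rule (CPU Usage High), defined in a Prometheus rules file: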

  groups:
    - name: high_cpu_alerts
      rules:
        - alert: HighCPU
          expr: sum by (instance) (rate(cpu_usage_seconds_total[5m])) > 0.9
          for: 2m
          labels:
            severity: critical
          annotations:
            description: "High CPU usage detected!"

Alerts can be sent to: Slack, Email, PagerDuty, etc.
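The `for: 2m` clause is easy to misread, so here is a small sketch of how an alert moves from inactive to pending to firing. It is a simplified model under stated assumptions: evaluations arrive as (timestamp, value) pairs, and the function name and sample data are hypothetical.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state after each (timestamp, value) evaluation.

    inactive -> pending when the condition first holds,
    pending  -> firing once it has held for at least `for_seconds`,
    anything -> inactive as soon as the condition stops holding.
    """
    states = []
    active_since = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held = timestamp - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            # The condition cleared, so the timer resets.
            active_since = None
            states.append("inactive")
    return states
```

With one evaluation per minute and `for_seconds=120`, a short spike that clears within two minutes never reaches the firing state, so no notification is sent.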

📺 7. Grafana (Visualization)

Function: Displays Prometheus data on dashboards.
Example Dashboard:
- CPU, Memory, Network usage graphs.
- Alerts when server load is too high.

📌 How Everything Works Together

Targets/Exporters expose data →
Prometheus Server scrapes data every X seconds →
TSDB stores the metrics →
PromQL is used to query & analyze data →
Prometheus evaluates alert rules and Alertmanager routes the notifications →
Grafana visualizes the data 📊

Prometheus pulls data from targets (it doesn’t wait for data to be pushed).
It stores metrics as time-series and allows powerful queries.
It can trigger alerts when something is wrong.
Grafana can be used to display dashboards beautifully.

Prometheus Fundamentals:

Node Exporter 🖥️:

Node Exporter is a Prometheus exporter that collects system-level metrics (CPU, memory, disk, network, etc.) from a machine (server, VM, or local system) and exposes them to Prometheus for monitoring.

By default, Prometheus cannot directly monitor system metrics like CPU usage or memory consumption. Node Exporter solves this by exposing those metrics in a Prometheus-compatible format.

🔹 System Monitoring → Tracks CPU, memory, disk, network, and more
🔹 Lightweight & Efficient → Runs as a small background process
🔹 Prometheus-Compatible → Exposes metrics via HTTP (localhost:9100/metrics)

Authentication and Encryption:

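By default, the endpoints that Prometheus scrapes (such as Node Exporter's /metrics) are served over plain HTTP with no authentication. This means that anyone who can reach the target on the network can retrieve the exposed metrics, and the traffic can be read in transit. To prevent this, it’s crucial to implement authentication and encryption: enable TLS and basic auth on the exporter (Node Exporter accepts a web configuration file via --web.config.file), then set the matching scheme, tls_config, and basic_auth options in the Prometheus scrape job.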

Metrics in Prometheus:

Metric Name
A descriptive name for the thing being measured (e.g., http_requests_total, cpu_usage_seconds_total).
Labels (Optional)
Key-value pairs used to differentiate different dimensions of the same metric (e.g., method="GET", status="200").
Timestamp
When the data point was recorded (often automatically managed by Prometheus).
Value
The actual numeric value of the metric at that time.

Example:

http_requests_total{method="GET", handler="/api", status="200"} 1287

Metric name: http_requests_total
Labels: method="GET", handler="/api", status="200"
Value: 1287 (number of successful GET requests to /api)
Timestamp: (implicitly recorded by Prometheus when scraped)
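To see how the parts of that line fit together, here is a small parsing sketch. It is deliberately simplified: it handles only `name{labels} value` lines, does not unescape label values, and rejects lines that carry an explicit trailing timestamp. The function name is hypothetical.

```python
import re

# name, optional {labels}, then a single numeric value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
# one key="value" pair inside the braces
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split a sample line into (metric_name, labels_dict, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))
```

Applied to the example line above, this yields the metric name http_requests_total, three labels, and the value 1287.0.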

What is PromQL?

PromQL stands for Prometheus Query Language — it's the powerful and flexible language you use to query, filter, and analyze metrics stored in Prometheus.

🔹 What You Can Do with PromQL:

Select metrics (e.g., http_requests_total)
Filter by labels (e.g., method="GET")
Perform calculations (e.g., rate of change, averages, percentages)
Aggregate data (e.g., by instance, job, or other labels)
Generate graphs and alerts

🔸 Basic Syntax Examples

🔍 1. Select a Metric

http_requests_total

Shows all time series with that metric name.

🎯 2. Filter by Label

http_requests_total{method="GET", status="200"}

Filters only the GET requests with status 200.

⏱️ 3. Calculate Rate of Increase

rate(http_requests_total[1m])

Shows the average per-second rate of requests over the last 1 minute.
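The arithmetic behind rate() can be sketched in a few lines. This is a simplified model, not Prometheus's actual implementation: the real function also extrapolates towards the edges of the window, which is skipped here. The `simple_rate` name and the sample data are made up for illustration.

```python
def simple_rate(samples):
    """Per-second average increase of a counter over the samples' time span.

    `samples` is a list of (timestamp_seconds, value) pairs, oldest first.
    A drop in value is treated as a counter reset (for example a restart).
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a rate
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so count `curr` itself.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed
```

For a counter scraped every 15 seconds that grows by 30 each time, this gives 2 requests per second, which is why rate() is the usual way to turn an ever-growing counter into a readable graph.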

📊 4. Aggregate by Label

sum(rate(http_requests_total[5m])) by (job)

Shows total request rate per job (like frontend, backend, etc.)

🧠 5. Calculate CPU Usage

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Shows CPU usage (as a %) per instance by subtracting idle time.

PromQL:

🔍 1. Selectors

Selectors are used to specify which time series (metrics) you want to query. There are two main types:

👉 Instant Vector Selector

Selects the latest sample for each matching time series.

http_requests_total{method="GET", status="200"}

Metric: http_requests_total
Labels: method="GET", status="200"

This returns the most recent sample of every time series that matches the labels (by default, Prometheus looks back at most 5 minutes to find that sample).

👉 Range Vector Selector

Selects time series data over a time range.

rate(http_requests_total[5m])

http_requests_total[5m] = all data points from the past 5 minutes.
Used inside functions like rate() or avg_over_time().

🏷️ 2. Matchers

Matchers are used inside label selectors (the {} block) to filter which time series you want.

Common Matchers:

Matcher	Meaning	Example
`=`	Equals	`job="api-server"`
`!=`	Not equal	`method!="POST"`
`=~`	Regex match	`instance=~"server.*"`
`!~`	Regex does not match	`status!~"4.."` (not 4xx errors)

Example:

http_requests_total{job="frontend", status=~"5.."}

Selects all 5xx status codes from the frontend job.
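The four matchers can be modeled as a small filter function. Two details are worth showing: regex matchers are fully anchored (so "5.." does not match "1503"), and a label that is missing behaves like an empty string. This sketch uses Python's `re` module, whose syntax differs slightly from the RE2 engine Prometheus uses; the function name is hypothetical.

```python
import re


def matches(labels, matchers):
    """Check a series' labels against a list of (label, operator, pattern)."""
    for name, op, pattern in matchers:
        value = labels.get(name, "")  # a missing label behaves like an empty value
        if op == "=":
            ok = value == pattern
        elif op == "!=":
            ok = value != pattern
        elif op == "=~":
            ok = re.fullmatch(pattern, value) is not None
        elif op == "!~":
            ok = re.fullmatch(pattern, value) is None
        else:
            raise ValueError(f"unknown matcher operator: {op}")
        if not ok:
            return False
    return True
```

All matchers in a selector must hold at once, so the list behaves like a logical AND.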

🛠️ 3. Modifiers

Modifiers change how a function behaves or how results are grouped.

a. `by` / `without` (Aggregation Modifiers)

Used with aggregations like sum, avg, etc.

by(...): keep these labels
without(...): drop these labels

sum(rate(http_requests_total[5m])) by (job)

Sum the rate, but group by job.

sum(rate(http_requests_total[5m])) without (instance)

Sum the rate and ignore instance label.
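The difference between by and without comes down to how the grouping key is built. Here is a rough sketch with an instant vector represented as a list of (labels, value) pairs; the helper names are hypothetical, and the metric name label is left out for simplicity.

```python
from collections import defaultdict


def _sum_grouped(vector, keep_label):
    """Sum values of series that share the same reduced label set."""
    totals = defaultdict(float)
    for labels, value in vector:
        key = tuple(sorted((k, v) for k, v in labels.items() if keep_label(k)))
        totals[key] += value
    return dict(totals)


def sum_by(vector, keep):
    """sum(...) by (keep): group on the listed labels only."""
    return _sum_grouped(vector, lambda k: k in keep)


def sum_without(vector, drop):
    """sum(...) without (drop): group on every label except the listed ones."""
    return _sum_grouped(vector, lambda k: k not in drop)
```

When series only carry job and instance labels, `sum_by(v, {"job"})` and `sum_without(v, {"instance"})` give the same result; they diverge as soon as extra labels such as method or status are present, because without keeps those.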

b. `on` / `ignoring` (Binary Operator Modifiers)

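Used when combining two vectors with a binary operator. By default, series are paired only when all of their labels match exactly; on and ignoring change which labels are compared. Each side must have at most one series per match group, otherwise the query needs group_left or group_right.

sum by (instance) (rate(http_requests_total[5m])) * on(instance) up

Pairs series only on the instance label, so the request rate is kept for instances that are up (up = 1) and zeroed for those that are down.

rate(http_requests_total{status="500"}[5m]) / ignoring(status) sum without (status) (rate(http_requests_total[5m]))

Matches on every label except status, giving the fraction of requests that returned a 500.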
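One-to-one vector matching can be pictured as a join on a "label signature". The sketch below is a simplified model with vectors as lists of (labels, value) pairs; the function names are hypothetical, and it does not cover group_left / group_right or PromQL's handling of division by zero.

```python
def signature(labels, on=None, ignoring=None):
    """Labels used to pair series: only `on`, or everything except `ignoring`."""
    if on is not None:
        return tuple(sorted((k, v) for k, v in labels.items() if k in on))
    ignoring = ignoring or set()
    return tuple(sorted((k, v) for k, v in labels.items() if k not in ignoring))


def divide(left, right, on=None, ignoring=None):
    """One-to-one division of two vectors given as lists of (labels, value)."""
    right_by_sig = {}
    for labels, value in right:
        sig = signature(labels, on, ignoring)
        if sig in right_by_sig:
            raise ValueError("right side has several series for one signature")
        right_by_sig[sig] = value
    result = {}
    for labels, value in left:
        sig = signature(labels, on, ignoring)
        if sig not in right_by_sig:
            continue  # series without a partner are dropped from the result
        if sig in result:
            raise ValueError("left side has several series for one signature")
        result[sig] = value / right_by_sig[sig]
    return result
```

The ValueError cases mirror the situation where a query has more than one series per match group on one side, which in PromQL has to be made explicit with group_left or group_right.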

c. `offset`

offset shifts the evaluation time backward by a given duration. “Show me the value of this metric X time ago.”

<metric_name>[<range>] offset <duration>

OR (for instant vectors):

<metric_name> offset <duration>

⏱ Prometheus Time Units Table

Unit	Suffix	Meaning	Example Usage
Seconds	`s`	1 second	`offset 30s`
Minutes	`m`	60 seconds	`offset 5m`
Hours	`h`	60 minutes	`offset 1h`
Days	`d`	24 hours	`offset 2d`
Weeks	`w`	7 days	`offset 1w`
Years	`y`	365 days (not leap)	`offset 1y` (rare)

✅ Examples in Context

rate(http_requests_total[5m] offset 1h)

→ Rate of requests 1 hour ago over a 5-minute window

up offset 2d

→ Status of targets exactly 2 days ago

avg_over_time(cpu_usage[10m] offset 7d)

→ CPU usage last week at this time, averaged over 10 minutes
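To show what offset does to the evaluation time, here is a small sketch that converts durations from the table above into seconds and then looks up an instant value "offset" seconds in the past. It is an illustration under assumptions: samples are (timestamp, value) pairs sorted oldest first, the lookback window is taken to be 5 minutes, and both function names are hypothetical.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}
DURATION_PART = re.compile(r"(\d+)([smhdwy])")


def parse_duration(text):
    """Convert a duration such as '5m', '2d', or '1h30m' into seconds."""
    if not text:
        raise ValueError("empty duration")
    pos, total = 0, 0
    while pos < len(text):
        part = DURATION_PART.match(text, pos)
        if part is None:
            raise ValueError(f"bad duration: {text!r}")
        total += int(part.group(1)) * UNIT_SECONDS[part.group(2)]
        pos = part.end()
    return total


def instant_value(samples, eval_time, offset="0s", lookback=300):
    """Latest sample at or before (eval_time - offset), within the lookback window."""
    target = eval_time - parse_duration(offset)
    candidates = [v for t, v in samples if target - lookback < t <= target]
    return candidates[-1] if candidates else None
```

The query itself stays the same; only the point in time it is evaluated at moves backward, which is what makes "now versus last week" comparisons straightforward.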

Getting Started with Prometheus: A Guide

Sahil Naik
