Understanding Prometheus: A Comprehensive Guide

Arijit Das
46 min read

Introduction to the Prometheus Monitoring System

Prometheus is an open-source monitoring and alerting toolkit widely used in cloud-native and microservices environments. Originally developed at SoundCloud in 2012 and now part of the Cloud Native Computing Foundation (CNCF), Prometheus excels at collecting time-series data, enabling real-time alerting and powerful metric analysis.

1. What is Prometheus?

Prometheus is a time-series database and monitoring system. It works by scraping metrics from instrumented targets at specified intervals and storing them in a highly efficient time-series database. Prometheus is widely adopted for its multi-dimensional data model, simple yet powerful query language, and standalone nature: it doesn't rely on external storage systems or message queues.

Key points:

  • Pull-based data collection via HTTP.

  • Stores data as time-stamped metrics.

  • Comes with a built-in expression browser and alerting.

  • Scales well for most monitoring needs.
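To make the pull model concrete, here is a toy target written with only the standard library. The metric name app_requests_total and port 8000 are invented for illustration; a real service would use an official Prometheus client library instead.

```python
# Minimal sketch of a pull-based scrape target (illustrative, stdlib only).
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # hypothetical counter incremented by application code


def render_metrics() -> str:
    # Render one counter in the Prometheus text exposition format.
    return (
        "# HELP app_requests_total Total requests handled\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {REQUEST_COUNT}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()


if __name__ == "__main__":
    HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Prometheus would then be pointed at localhost:8000 in scrape_configs and would issue GET /metrics on every scrape interval.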

2. System Architecture

2.1 Metric Sources (Targets)

These are the systems Prometheus collects metrics from:

  • Applications

  • Databases

  • Linux Hosts

  • Containers

These systems expose metrics endpoints (usually /metrics) in a format Prometheus can scrape.

2.2 Service Discovery

To automatically discover the metric sources, Prometheus uses:

  • Kubernetes

  • Consul

This allows Prometheus to dynamically find services/instances to scrape, rather than manually configuring static targets.

2.3 Prometheus Server (Core)

This is the central brain of the architecture and does most of the heavy lifting:

  • Data Retriever
    Pulls (scrapes) metrics from the discovered targets.

  • TSDB (Time Series Database)
    Stores all the scraped metrics as time-series data.

  • HTTP Server
    Allows users, tools, and dashboards to query Prometheus data using PromQL (Prometheus Query Language).

Prometheus pulls metrics from targets (not push by default, although the Pushgateway provides a push-based workaround).

2.4 Querying and Visualization Tools

These tools interact with Prometheus to visualize and analyze the data:

  • Web UI: the built-in Prometheus UI.

  • SigNoz: an open-source observability platform.

  • PromLens: an advanced query-building tool.

  • Grafana: a popular visualization tool (dashboards and graphs).

These tools query data from Prometheus via PromQL.

2.5 Alerting

Prometheus evaluates alerting rules and:

  • Sends alerts to Alertmanager.

  • Alertmanager handles deduplication, grouping, and routing of alerts.

  • Alertmanager forwards alerts to notification channels like:

    • Email

    • Slack

    • PagerDuty

2.6 Remote/Local Storage Integration

Prometheus can forward samples to external storage systems for long-term retention or more scalable solutions, referred to as remote/local storage. This is optional and useful for scaling or compliance needs.

Summary of the Flow

  1. Prometheus discovers targets via Kubernetes/Consul.

  2. Prometheus scrapes (pulls) metrics from targets.

  3. Data is stored in Prometheusโ€™s TSDB.

  4. Users/tools query metrics (via Web UI, Grafana, etc.).

  5. Prometheus sends alerts to Alertmanager based on rules.

  6. Alertmanager forwards alerts to email, Slack, or PagerDuty.

  7. Optionally, Prometheus forwards samples to remote storage.

In short:

Prometheus scrapes, stores, analyzes, and alerts on time-series metrics from various sources, while integrating with external tools for visualization and notifications.

3. Core Features Overview

Prometheus offers a wide range of features for modern monitoring:

  • Multi-dimensional data model using time series identified by metric names and key/value pairs (labels).

  • Powerful PromQL (Prometheus Query Language) for slicing and dicing data.

  • No reliance on external storage: all data is stored locally in a custom TSDB.

  • Pull-based scraping over HTTP.

  • Integrated Alerting with Alertmanager.

  • Flexible Service Discovery supporting Kubernetes, Consul, EC2, and static targets.

  • Built-in Web UI for ad-hoc queries and visualization.

4. Prometheus Data Model

Prometheus's data model revolves around time series.

  • A metric is a set of time series that share the same name and differ by their label sets.

  • Each time series is identified by a metric name and a set of key/value labels.

  • For example:

      http_requests_total{method="POST", handler="/api/order"}
    
  • Each time series consists of:

    • Timestamps

    • Floating-point samples

  • Labels allow for high-cardinality querying and filtering.

This data model supports flexible and high-dimensional querying without predefined schemas.
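The identity rule above (metric name plus label set, independent of label order) can be sketched with a toy in-memory store. This is an illustration of the data model only, not Prometheus's actual TSDB layout.

```python
# Toy in-memory series store: each series is keyed by (metric name,
# sorted label pairs) and holds (timestamp, float sample) tuples.
from collections import defaultdict

storage = defaultdict(list)


def series_key(name: str, labels: dict) -> tuple:
    # Label order must not matter, so sort the label pairs.
    return (name, tuple(sorted(labels.items())))


def append_sample(name, labels, ts, value):
    storage[series_key(name, labels)].append((ts, value))


# Same name + same labels (in any order) -> same series; a different
# label value -> a distinct series.
append_sample("http_requests_total", {"method": "POST", "handler": "/api/order"}, 1000, 3.0)
append_sample("http_requests_total", {"handler": "/api/order", "method": "POST"}, 1015, 4.0)
append_sample("http_requests_total", {"method": "GET", "handler": "/api/order"}, 1000, 7.0)
```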

5. Metrics Transfer Format

Prometheus uses a text-based exposition format over HTTP.

  • The standard format is a simple plaintext format, exposed by the /metrics endpoint on each target.

  • Example:

      # HELP http_requests_total Total number of HTTP requests
      # TYPE http_requests_total counter
      http_requests_total{method="GET", code="200"} 1027
    
  • Metrics types include:

    • Counter: Monotonic increasing values (e.g., requests served).

    • Gauge: Values that go up and down (e.g., temperature).

    • Histogram: Measures distributions (e.g., request durations).

    • Summary: Similar to histograms but focused on quantiles.

Prometheus scrapes this endpoint regularly and parses the data for storage and query.
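As a sketch of what that parsing involves, a sample line like the one above can be broken into name, labels, and value with a short routine (simplified: it ignores escaping inside label values, optional timestamps, and exemplars):

```python
import re

# Matches one sample line of the text exposition format, e.g.
#   http_requests_total{method="GET", code="200"} 1027
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)


def parse_line(line):
    """Return (name, labels, value) for a sample line, or None for
    blank lines and # HELP / # TYPE comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    m = LINE_RE.match(line)
    if not m:
        raise ValueError(f"unparseable line: {line!r}")
    labels = {}
    if m.group("labels"):
        for pair in m.group("labels").split(","):
            k, v = pair.strip().split("=", 1)
            labels[k] = v.strip('"')
    return m.group("name"), labels, float(m.group("value"))
```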

6. Query Language (PromQL)

PromQL (Prometheus Query Language) is a powerful and expressive language for querying time series data.

  • Used in the web UI, Grafana, alert rules, and API calls.

  • Supports:

    • Instant vector: snapshot at a time point.

    • Range vector: data over a time range.

    • Arithmetic operations between metrics.

    • Aggregation: sum, avg, max, min, count, etc.

    • Filtering based on labels.

Examples:

  • http_requests_total: fetch all time series of this metric.

  • sum(rate(http_requests_total[1m])) by (method): total requests per method in the last minute.

PromQL makes Prometheus incredibly flexible for real-time analytics.

7. Integrated Alerting

Prometheus comes with built-in alerting capabilities:

  • Alert Rules are defined using PromQL.

  • Prometheus evaluates these rules at regular intervals and fires alerts.

  • Alerts are sent to Alertmanager, which:

    • Deduplicates alerts

    • Groups related alerts

    • Sends notifications via email, Slack, PagerDuty, etc.

    • Supports silencing and routing policies

Example rule:

groups:
- name: example
  rules:
  - alert: HighRequestRate
    expr: rate(http_requests_total[1m]) > 100
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "High request rate detected"

8. Service Discovery Support

Prometheus can dynamically discover scrape targets using service discovery integrations, avoiding the need for static configs.

Supported methods include:

  • Kubernetes (pods, services, endpoints)

  • Consul

  • EC2

  • Azure

  • GCE

  • Docker Swarm

  • File-based SD (watching JSON/YAML files)

This enables automatic discovery of new services and ensures Prometheus always monitors the correct set of targets, even in dynamic cloud-native environments.
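As a sketch of the file-based SD method, a scrape_config can point at JSON files that an external process rewrites as instances come and go (the job name and file paths here are illustrative):

```yaml
scrape_configs:
  - job_name: 'file-sd-example'
    file_sd_configs:
      - files: ['targets/*.json']   # Prometheus watches these files for changes
```

A matching targets file might look like:

```json
[
  {
    "targets": ["10.0.0.5:9100", "10.0.0.6:9100"],
    "labels": {"env": "prod"}
  }
]
```

Prometheus picks up edits to the watched files without a restart, which makes file-based SD a simple bridge from any inventory system.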

Getting Started with Prometheus

This section will walk you through the first steps of installing, configuring, and running Prometheus. You'll also learn how to explore its web UI, monitor targets, and query data using PromQL.

1. Downloading Prometheus

Prometheus can be downloaded directly from its official website:

  • Visit: https://prometheus.io/download

  • Select the appropriate binary for your operating system (e.g., Linux, macOS, Windows).

  • Example (Linux, x86_64):

wget https://github.com/prometheus/prometheus/releases/download/v2.51.1/prometheus-2.51.1.linux-amd64.tar.gz

Make sure to always download the latest stable version.

2. Unpacking and Inspecting the Tarball

Once downloaded, unpack the tarball using the following command:

tar -xvf prometheus-2.51.1.linux-amd64.tar.gz
cd prometheus-2.51.1.linux-amd64

Inside the extracted folder, you'll see:

  • prometheus: the main binary

  • promtool: a tool to check config and rule files

  • prometheus.yml: the default config file

  • console_libraries/: libraries for console templates

  • consoles/: example console templates

This directory structure can be moved or customized based on your deployment setup.

3. Configuring Prometheus

Prometheus is configured via a YAML file (prometheus.yml), which defines global settings, scrape targets, alerting, and more.

Example prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

Key config sections:

  • global: Sets the default scrape interval, evaluation interval, etc.

  • scrape_configs: Defines monitoring targets, job names, relabeling, etc.

  • alerting: Configures Alertmanager integration.

  • rule_files: Specifies rule files for alerts or recording.

Use promtool check config prometheus.yml to validate your configuration.

4. Command-Line Flags and Defaults

Prometheus supports many command-line flags to customize runtime behavior. Commonly used ones:

./prometheus \
  --config.file=prometheus.yml \
  --storage.tsdb.path=data/ \
  --web.listen-address=":9090"

Useful flags:

  • --config.file: Path to the config file (default: prometheus.yml)

  • --storage.tsdb.path: Directory for storing metrics data (default: data/)

  • --web.listen-address: Port on which Prometheus serves the UI and API

  • --log.level: Set log level (e.g., info, debug, error)

You can view all flags by running:

./prometheus --help

5. Running Prometheus

To start Prometheus:

./prometheus --config.file=prometheus.yml

You should see logs indicating that Prometheus is starting and loading targets. By default, the web UI will be accessible at:

http://localhost:9090

Make sure port 9090 is open and not blocked by firewalls or other services.

6. Web Interface

Prometheus includes a built-in web UI accessible via the browser.

Features:

  • Home dashboard with system status

  • Expression browser for querying metrics

  • Visualization of raw time-series data

  • Target health and label inspection

  • Alerts and rules display

To access it:

http://localhost:9090

Useful tabs:

  • Status > Targets: See active targets and scrape status

  • Graph: Run PromQL queries and visualize data

  • Alerts: View firing and pending alerts

7. Targets Page

The Targets page shows all the configured jobs and their respective scrape endpoints.

Navigate to:

http://localhost:9090/targets

You'll see:

  • Job name

  • Endpoint

  • Last scrape time

  • Scrape duration

  • Scrape status (UP/DOWN)

If a target is down, check:

  • If the service is running

  • If the endpoint is reachable

  • If the config is correct

This page is essential for debugging connectivity and monitoring issues.

8. Querying Metrics with PromQL

Prometheus supports PromQL (Prometheus Query Language) for querying and analyzing time-series data.

To try it out:

  • Go to http://localhost:9090/graph

  • Enter a query, e.g.:

      up
    

    This checks if targets are up (1 = UP, 0 = DOWN).

Examples:

  • node_cpu_seconds_total: View total CPU time

  • rate(http_requests_total[1m]): View request rate over the past minute

  • sum by (instance)(rate(http_requests_total[5m])): Total requests per instance

You can visualize results as graphs, tables, or export as JSON using the HTTP API.
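The HTTP API returns JSON; as a sketch, the query URL can be built with the standard library and the result decoded as below. The response here is a hand-written sample of the API's shape, not live output.

```python
import json
from urllib.parse import urlencode


def query_url(base, promql):
    # Build an instant-query URL for the v1 HTTP API.
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


url = query_url("http://localhost:9090", "up")

# Hand-written sample of the JSON shape returned by /api/v1/query.
sample_response = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "up", "job": "prometheus", "instance": "localhost:9090"},
       "value": [1714606800, "1"]}
    ]
  }
}
""")

# Each result carries a label set and a [timestamp, value-as-string] pair.
for r in sample_response["data"]["result"]:
    instance = r["metric"]["instance"]
    value = float(r["value"][1])
```

Note that sample values arrive as strings and must be converted to floats by the caller.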

Understanding Prometheus Metric Types

Prometheus supports four core metric types that represent different patterns of data collection. These types (Gauges, Counters, Summaries, and Histograms) allow developers and SREs to instrument and monitor applications with precision and clarity. Each type has specific characteristics and use cases.

1. Gauges

A Gauge is a metric that represents a single numerical value that can arbitrarily go up and down. Use gauges for things like current memory usage, number of active goroutines, or temperature readings.

Example use cases:

  • Current CPU temperature

  • Active sessions

  • Queue length

  • Free memory

2. Gauge Instrumentation Methods

Prometheus client libraries (e.g., Python, Go, Java) provide methods to work with gauges:

In Go:

var temperature = prometheus.NewGauge(
    prometheus.GaugeOpts{
        Name: "room_temperature_celsius",
        Help: "Current room temperature in Celsius.",
    },
)
temperature.Set(22.5)
temperature.Inc()
temperature.Dec()

In Python:

from prometheus_client import Gauge
temperature = Gauge('room_temperature_celsius', 'Current room temperature')
temperature.set(22.5)

Common methods:

  • set(value)

  • inc(), dec()

  • set_to_current_time()

3. Gauges in the Exposition Format

The exposition format is the plain text output served on the /metrics endpoint.

Example:

# HELP room_temperature_celsius Current room temperature
# TYPE room_temperature_celsius gauge
room_temperature_celsius 22.5

The format is human-readable and easily parseable by Prometheus scrapers.

4. Querying Gauges

Use PromQL to directly view the current value of a gauge:

room_temperature_celsius

You can apply mathematical operations:

room_temperature_celsius * 1.8 + 32  # convert to Fahrenheit

5. Gauges Containing Timestamps

Gauges can also include explicit timestamps in exposition format, although it's not typical.

Example:

room_temperature_celsius 22.5 1683023900000

However, this is discouraged unless absolutely necessary, as it can complicate time-series storage.

6. Counters

Counters are cumulative metrics that can only increase (or be reset to zero on restart). Use counters to track things like:

  • Total HTTP requests

  • Errors

  • Bytes transferred

They are monotonically non-decreasing: a counter's value never goes down except when it resets.

7. Counter Resets

Counters can reset to zero, typically after a service restart. Prometheus handles this gracefully: functions like rate() and increase() treat any decrease in value as a reset and adjust their calculations.

For example:

rate(http_requests_total[5m])

This accounts for resets by calculating per-second increase over time.

8. Counter Instrumentation Methods

In Go:

var requests = prometheus.NewCounter(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    },
)
requests.Inc()
requests.Add(3)

In Python:

from prometheus_client import Counter
requests = Counter('http_requests_total', 'Total HTTP requests')
requests.inc()

9. Counters in the Exposition Format

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total 1543

This value never decreases unless there's a reset.

10. Querying Counters (Absolute Values vs. Rates)

  • Absolute value: http_requests_total

  • Rate of increase: rate(http_requests_total[1m])

rate() returns per-second average increase:

sum(rate(http_requests_total[5m])) by (method)

Useful for dashboards and alerting thresholds.

11. Summaries

Summaries are used to track observations (e.g., request durations, response sizes) and produce:

  • Quantiles (e.g., 0.5, 0.9, 0.99)

  • Sum of all observations

  • Count of observations

12. Constructing Summaries

In Go:

summary := prometheus.NewSummary(prometheus.SummaryOpts{
    Name:       "request_duration_seconds",
    Help:       "Request duration in seconds",
    Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
})
summary.Observe(1.2)

13. Summary Instrumentation Methods

  • observe(value): Add a new observation

  • Predefined objectives (quantiles)

Summaries provide real-time quantile approximations, but with trade-offs: higher client-side memory usage, and their quantiles cannot be aggregated across labels or instances.

14. Querying Summaries

Summary metrics are split into:

  • _count: number of observations

  • _sum: total value

  • quantile estimates, exposed on the base metric name with a quantile label (e.g., request_duration_seconds{quantile="0.9"})

Example:

rate(request_duration_seconds_sum[5m]) / rate(request_duration_seconds_count[5m])

This calculates average request duration over 5 minutes.

15. Histograms

Histograms group observations into configurable buckets and count how many fall into each.

They provide:

  • Bucketed counts

  • _count and _sum

  • Percentile approximations (computed at query time by Prometheus, not by the client)

16. Cumulative Histograms

Histograms use cumulative buckets:

request_duration_seconds_bucket{le="0.1"} 240
request_duration_seconds_bucket{le="0.5"} 756
request_duration_seconds_bucket{le="1"}   999
request_duration_seconds_bucket{le="+Inf"} 1024

Each bucket includes the count for that threshold and below.
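Percentile estimation from such cumulative buckets can be sketched as follows: locate the bucket containing the target rank, then interpolate linearly inside it. This is a simplified stand-in for PromQL's histogram_quantile, using the bucket counts from the example above.

```python
import math


def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count) sorted by bound,
    ending with (math.inf, total). Linear interpolation inside the target
    bucket, mirroring (in simplified form) Prometheus's query-time estimate."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if rank <= count:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into +Inf
            # Fraction of the way through this bucket's observations.
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + frac * (bound - prev_bound)
        prev_bound, prev_count = bound, count


buckets = [(0.1, 240), (0.5, 756), (1.0, 999), (math.inf, 1024)]
p95 = histogram_quantile(0.95, buckets)  # ~0.946 seconds
```

Because the answer is interpolated within a bucket, its accuracy depends entirely on how well the bucket boundaries match the real distribution.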

17. Constructing Histograms

In Go:

hist := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "request_duration_seconds",
    Help:    "Histogram of request durations",
    Buckets: prometheus.LinearBuckets(0.1, 0.1, 10),
})
hist.Observe(0.3)

Choose buckets wisely to match the distribution of your data.

18. Histogram Instrumentation Methods

  • observe(value): Records a value

  • Buckets must be set at creation and are immutable

19. Histograms in the Exposition Format

# HELP request_duration_seconds Histogram of request durations
# TYPE request_duration_seconds histogram
request_duration_seconds_bucket{le="0.1"} 123
request_duration_seconds_bucket{le="0.5"} 456
request_duration_seconds_bucket{le="+Inf"} 789
request_duration_seconds_count 789
request_duration_seconds_sum 105.6

Prometheus computes quantiles during query time (not client side like summaries).

20. Querying Histograms

Example: 95th percentile from histogram:

histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m]))

Average duration:

rate(request_duration_seconds_sum[5m]) / rate(request_duration_seconds_count[5m])

21. Average Request Latencies

Both Summaries and Histograms can calculate average latency:

rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

Histograms are preferred for aggregations across labels, which summaries cannot do.

22. Native Histogram

Introduced in Prometheus v2.40+, Native Histograms are experimental and designed to:

  • Reduce memory usage

  • Support better quantile approximation

  • Be more efficient at high cardinality

Native histograms are exposed as a new sample type and are scraped via the protobuf exposition format.

They are enabled in Prometheus using a feature flag:

--enable-feature=native-histograms

Unlike regular histograms, native histograms don't require pre-defined buckets and dynamically adapt based on the distribution of data.

PromQL Data Selection Explained

Prometheus Query Language (PromQL) is a powerful tool used for querying time series data. At the heart of PromQL are selectors: constructs that define what data to fetch, from which series, and over what time range.

This section focuses on both instant vector and range vector selectors, label matchers, and all modifiers that affect how and when data is retrieved and evaluated.

1. Instant Vector Selectors

An instant vector selector retrieves the latest sample for each time series at a single point in time (usually "now").

Syntax:

http_requests_total

This fetches all time series with the metric name http_requests_total at the current moment.

2. Label Matchers

Label matchers refine vector selectors by filtering based on metric labels.

Types of matchers:

  • =  : equals, e.g. {job="api"}

  • != : not equals, e.g. {status!="500"}

  • =~ : regex match, e.g. {method=~"GET|POST"}

  • !~ : negative regex match, e.g. {job!~"dev-.*"}

Example:

http_requests_total{job="api", status=~"2.."}

3. Visualizing Instant Vector Selector Behavior (Lookback Delta)

Prometheus doesn't scrape metrics exactly at the evaluation moment. It looks backwards in time using the lookback delta (default: 5m).

If no sample exists within 5 minutes, Prometheus drops the series from the result.

So:

metric_name

...returns the most recent value within the last 5 minutes.
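A sketch of that lookback rule, assuming timestamps in seconds and the default 5-minute (300 s) delta:

```python
def evaluate_instant(samples, eval_ts, lookback=300):
    """samples: list of (timestamp, value), ascending by timestamp.
    Return the most recent value at or before eval_ts within the
    lookback window, or None if the series is dropped from the result."""
    for ts, value in reversed(samples):
        if ts <= eval_ts:
            return value if eval_ts - ts <= lookback else None
    return None


samples = [(1000, 5.0), (1060, 6.0), (1120, 7.0)]
evaluate_instant(samples, 1130)  # -> 7.0 (latest sample is 10s old)
evaluate_instant(samples, 1500)  # -> None (latest sample is 380s old)
```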

4. Staleness Markers and Staleness Handling

Prometheus uses staleness markers to detect when a series stops being reported (e.g., app crashed). If a time series disappears, Prometheus marks it as stale.

Staleness is used to:

  • Stop evaluating old data

  • Avoid misleading results

These markers are invisible in PromQL but affect evaluation.

5. Range Vector Selectors

A range vector selector retrieves all samples for each time series over a specified time interval.

Syntax:

http_requests_total[5m]

This selects all values in the last 5 minutes for each series. The output is a range vector, which can be passed into functions like rate().

6. Visualizing Range Vector Selector Behavior

Range vectors return a set of time-stamped samples.

For example:

rate(http_requests_total[1m])

Behind the scenes, Prometheus:

  • Selects all samples in the last 1 minute per series

  • Calculates per-second rate of increase

Each data point in a graph represents a separate evaluation of the range.

7. Relative Offsets (offset Modifier)

The offset modifier shifts the evaluation time back in time.

Example:

http_requests_total offset 1h

Returns the value of http_requests_total from 1 hour ago (either as instant or range vector depending on selector type).

Can be combined with range vectors:

rate(http_requests_total[5m] offset 1h)

This gives the 5-minute rate calculated 1 hour ago.

8. Visualizing Offsets for Instant Vector Selectors

If now is 16:00, then:

metric_name offset 1h

...evaluates the value of metric_name at 15:00.

It works like a time machine for metrics.

9. Offset Use Cases

Use offsets to:

  • Compare current data to past performance

  • Detect regressions

  • Create "previous week" or "same time yesterday" graphs

Example:

(rate(http_requests_total[5m]) - rate(http_requests_total[5m] offset 1d)) / rate(http_requests_total[5m] offset 1d)

This shows the percentage change from yesterday.

10. Visualizing Offsets for Range Vector Selectors

Example:

rate(metric[1h] offset 2h)

Assume current time is 18:00:

  • Evaluation time: 18:00

  • Range: 1h

  • Offset: 2h

โžก๏ธ Evaluates over 15:00 to 16:00

11. Absolute Evaluation Timestamps (@ Modifier)

The @ modifier lets you run a query as if it were evaluated at an exact timestamp.

Syntax:

http_requests_total @ 1714606800

  • Uses a Unix timestamp in seconds

  • Enabled by default since Prometheus v2.33 (introduced earlier behind the promql-at-modifier feature flag)

Use cases:

  • Forensics

  • Debugging exact past states

  • Deterministic exports

12. Visualizing Absolute Evaluation Timestamps

Imagine this query:

rate(http_requests_total[5m]) @ 1714606800

Prometheus computes the 5-minute rate at the exact time 1714606800.

This enables reproducibility of data snapshots and avoids skew from real-time evaluations.

13. Syntactic Order for Modifiers

When combining modifiers (offset, @), both must immediately follow the selector, but they may appear in either order:

metric[5m] offset 1h @ 1714606800

  • offset is applied relative to the evaluation time set by @, so both orderings give the same result.

  • Read it like: "take the 5-minute range ending 1 hour before the given timestamp".

Placing offset or @ anywhere other than directly after the selector is a parse error.

Understanding Counter Rates and Increases in PromQL

Prometheus counters represent monotonically increasing values, such as the number of requests processed or bytes transferred. Understanding how to interpret, calculate, and query these counters accurately is essential for time-series analytics.

1. Absolute Counter Values and Why We Want Rates

Absolute Values:

Counters like http_requests_total grow over time. They show the total amount of something that has occurred.

Example:

http_requests_total

This shows the current cumulative count of HTTP requests, but doesn't tell how fast they're coming in.

Why We Want Rates:

Absolute values don't show trends or activity levels. We usually want:

  • How many requests per second?

  • How fast is the traffic increasing?

Thus, we compute rates (changes per time unit).

2. The Three Counter Increase Functions

Prometheus provides three main functions to evaluate counter growth over time:

  • rate(): calculates the per-second average rate over a range

  • increase(): calculates the absolute increase over a period

  • irate(): calculates the per-second rate using the last two points (instant rate)

3. Behavior of rate() and increase()

rate():

Used with range vectors, gives the average rate per second over the range.

Syntax:

rate(http_requests_total[5m])

This calculates how many requests per second happened on average over the last 5 minutes.

increase():

Calculates total increase over a time range.

Syntax:

increase(http_requests_total[5m])

If 100 requests were made during the 5-minute window, this returns 100.

4. Handling Counter Resets

Prometheus counters may reset (e.g., due to app restart). Prometheus automatically detects this by identifying a lower value than before.

PromQL functions like rate() and increase():

  • Detect these resets

  • Skip invalid segments

  • Continue calculating using valid portions

If a reset is detected:

... -> 950 -> 980 -> 10 (reset) -> 50

increase() computes: (980 - 950) + 10 + (50 - 10) = 30 + 10 + 40 = 80, since the value after a reset counts as growth from zero.

5. Calculating the rate() and increase() Slope

Prometheus takes the first and last samples inside the range window, computes the reset-corrected growth between them, and extrapolates that slope out to the window boundaries. (It does not fit a regression line; that is what deriv() does, for gauges.)

Example:

For increase(http_requests_total[5m]), Prometheus:

  1. Gathers all samples in the 5-minute window

  2. Computes the reset-corrected difference between the first and last samples

  3. Extrapolates that difference to cover the full window

Mathematically, to a first approximation:

increase = value_end - value_start
rate = increase / duration_in_seconds
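A simplified sketch of the reset-corrected increase, assuming the samples already sit exactly at the window boundaries (so the extrapolation step can be ignored):

```python
def counter_increase(samples):
    """Reset-corrected increase across a list of counter values.
    A drop between consecutive samples is treated as a counter reset,
    so the post-reset value counts as new growth from zero. Simplified:
    the real rate()/increase() also extrapolate to window boundaries."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def counter_rate(samples, duration_seconds):
    return counter_increase(samples) / duration_seconds


counter_increase([950, 980, 10, 50])  # -> 80.0 (30 before the reset, then 10 + 40)
```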

6. Extrapolating the Return Value for the increase() Function

Prometheus doesn't just blindly subtract endpoints. It extrapolates when samples don't exist exactly at the boundaries.

If samples do not fall exactly on the window boundaries, Prometheus extrapolates the observed growth outward so the estimate covers the entire window.

This avoids underestimating counters when scrapes are missed or irregular.

7. Confusing Extrapolation for Slow-Moving Counters

Slow-moving counters (e.g., errors that happen once per hour) can confuse users.

Example:

increase(errors_total[5m])

  • If one error occurred 4m ago, Prometheus extrapolates to assume a partial contribution across the 5-minute window.

  • It may look like a fractional increase (e.g., 0.2), which surprises users expecting whole numbers.

Prometheus is mathematically correct, but interpretation requires caution for low-frequency events.

8. Limiting Extrapolation to Zero Sample Values

Extrapolation is also limited so that a counter is never extrapolated below zero: if a series started recently, the estimated value at the start of the window is clamped at zero rather than extended into negative territory.

Note that if a range contains fewer than two samples, rate() and increase() return nothing for that series, so dashboards show gaps.

But be careful: a zero or missing increase does not necessarily mean zero traffic. It might mean:

  • No data scraped

  • Metric not emitted

  • Actual zero traffic (a flat counter yields a genuine rate of 0)

Use alerting rules or metadata checks to detect missing data.
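One common way to implement that check is an alert rule using absent(); the metric and job names below are illustrative:

```yaml
groups:
- name: missing-data
  rules:
  - alert: MetricAbsent
    expr: absent(http_requests_total{job="api"})
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "http_requests_total has not been reported for 5 minutes"
```

absent() returns a value only when no matching series exists, which distinguishes "no data" from a genuine zero rate.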

9. The irate() Function

irate() (instant rate) computes the rate between the two most recent samples in a range.

Syntax:

irate(http_requests_total[5m])

  • Uses just the last two data points

  • No interpolation, no smoothing

  • Ideal for spiky, fast-changing signals

โš ๏ธ Use with caution on slow countersโ€”it can be misleading if data is sparse.

10. Which Function Should You Use?

  • Trends, averages, smoothing: rate()

  • Absolute counts over time: increase()

  • Current/instantaneous values: irate()

  • Alerting (on spikes, errors): rate() or irate()

  • SLO calculation: increase() (e.g., over a day/week)

Rule of thumb:

  • Dashboards: use rate() for visual stability.

  • SLO math: use increase() to count events.

  • High-frequency alerting: use irate() if latency is critical.

Understanding "up" and Friends in Prometheus

1. Prometheus Server Configuration

Before exploring up and other auto-generated metrics, it's crucial to understand how Prometheus is configured to monitor targets:

  • Configuration File: Prometheus uses a prometheus.yml configuration file to define scraping jobs.

  • scrape_configs: Within this file, the scrape_configs block defines how Prometheus should discover and collect metrics from targets.

Example:

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

  • Each job defines:

    • job_name: A label for the scrape group.

    • targets: IP addresses or hostnames of endpoints exposing metrics.

    • metrics_path (default: /metrics)

    • Optional: relabeling, authentication, TLS, and timeouts.

When Prometheus starts, it uses this configuration to initialize target discovery and begin scraping metrics.

2. Inspecting Targets in Prometheus

To verify that Prometheus is correctly scraping your services:

  • Navigate to the Targets Page:

    • URL: http://<your-prometheus-host>:9090/targets
  • This page displays:

    • Job names and their associated targets.

    • Scrape status (up/down).

    • Last scrape duration and timestamp.

    • Labels associated with each target.

  • Importance: This helps you quickly see which targets are reachable and why some may be down.

  • Health Status: The field last scrape error or the color-coded status lets you identify failures in real time.

3. Showing All Auto-Generated Metrics

Prometheus automatically generates metrics about its own operation and about each target it scrapes. Its own internal metrics are exposed at:

  • http://localhost:9090/metrics

Per-target metrics such as up are synthesized during each scrape and written straight to the TSDB, so they are queried like any other metric.

To view a list of all available metrics in the UI:

  • Go to http://localhost:9090/graph

  • Click on the "insert metric at cursor" dropdown or start typing in the expression field.

  • Metrics like up, scrape_duration_seconds, and scrape_samples_post_metric_relabeling appear.

4. The "up" Metric

This is the most important health metric in Prometheus.

  • Definition: up is a gauge metric automatically generated by Prometheus to indicate whether a target is reachable.

  • Values:

    • 1: The scrape was successful (target is UP).

    • 0: The scrape failed (target is DOWN or unreachable).

  • Labels:

      up{job="node_exporter", instance="localhost:9100"} 1
    
  • Use Case:
    You can use this in alerting rules:

      alert: TargetDown
      expr: up == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Target {{ $labels.instance }} is down"
    
  • Internals: up is computed based on whether Prometheus received a valid HTTP 200 response and successfully parsed the metrics from the target.

5. Other Auto-Generated Metrics

Prometheus exposes several internal metrics for diagnostics and performance monitoring:

  • scrape_duration_seconds: time taken to scrape a target

  • scrape_samples_scraped: number of samples scraped in the last scrape

  • scrape_samples_post_metric_relabeling: number of samples retained after relabeling

  • scrape_series_added: number of new series added in the scrape

  • scrape_timeout_seconds: the configured timeout per scrape (requires the extra-scrape-metrics feature flag)

  • prometheus_sd_*: service discovery subsystem metrics

  • prometheus_target_*: metrics on target health and discovery

  • prometheus_engine_*: query engine performance

  • prometheus_tsdb_*: storage subsystem metrics (compaction, WAL, memory usage)

Example:

scrape_duration_seconds{job="node_exporter", instance="localhost:9100"} 0.023

These can be used to:

  • Detect scrape performance issues

  • Analyze ingestion rate

  • Tune Prometheus server configuration

6. Auto-Generated Metrics in the Prometheus Documentation

Prometheus maintains complete documentation of its internal metrics:

  • Official Reference:

    • Prometheus Internal Metrics Documentation
  • The documentation includes:

    • Metric name

    • Type (gauge, counter)

    • Description

    • Associated labels

    • Subsystem/component

  • Use Case: These are especially useful for:

    • Monitoring Prometheus server health

    • Creating dashboards (e.g., Grafana Prometheus dashboards)

    • Debugging ingestion issues

    • Auditing scrape errors

๐Ÿ” Summary Table: Key Auto-Generated Metrics

Metric NameTypePurpose
upGaugeIndicates if the target was successfully scraped
scrape_duration_secondsGaugeScrape latency
scrape_samples_scrapedGaugeNumber of metrics collected per scrape
prometheus_target_interval_length_secondsGaugeActual vs expected interval duration
prometheus_engine_query_duration_secondsHistogramDuration of PromQL queries
prometheus_tsdb_head_seriesGaugeTotal active series in TSDB

Understanding Prometheus Histograms

1. Motivation and Histogram Basics

Histograms in Prometheus are used to observe and record the distribution of events over a set of predefined buckets. They are particularly useful for understanding the behavior of applications, such as response times, request sizes, or any measurable quantity that can be categorized.

2. Need to Measure Request Durations/Latency

Monitoring request durations or latency is crucial for:

  • Performance Analysis: Understanding how fast your application responds.

  • SLA/SLO Compliance: Ensuring response times meet agreed standards.

  • Bottleneck Identification: Detecting slow components in your system.

Histograms allow you to see not just averages but the distribution of response times, which is vital for comprehensive performance monitoring.

3. Downsides of Using Event Logging

While event logging provides detailed insights, it has limitations:

  • High Overhead: Logging every event can consume significant resources.

  • Complex Analysis: Aggregating and analyzing logs for metrics is cumbersome.

  • Latency: Real-time analysis is challenging due to the volume of data.

Histograms offer a more efficient way to monitor metrics like latency without the overhead of detailed logging.

4. Why a Single Gauge Doesn't Help Us

A gauge represents a single numerical value that can go up or down. Using a gauge for metrics like request duration is inadequate because:

  • Lack of Distribution: Gauges show only the current value, not the spread.

  • No Historical Context: They don't provide insights into past performance.

  • Inability to Calculate Percentiles: Gauges can't be used to compute percentiles like the 95th percentile.

5. Downsides of Using Prometheus Summary Metrics

Summaries in Prometheus can calculate quantiles but have drawbacks:

  • Client-Side Calculation: Quantiles are computed in the client, limiting flexibility.

  • No Aggregation Across Instances: Pre-computed quantiles from summaries cannot be meaningfully aggregated across multiple instances.

  • Static Configuration: Quantile objectives must be predefined.

Histograms, on the other hand, allow server-side aggregation and dynamic quantile calculation.

6. Prometheus Histogram Example for Tracking Request Durations

To track request durations:

httpDuration := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "A histogram of the HTTP request durations.",
    Buckets: prometheus.DefBuckets,
})

This setup records the duration of HTTP requests into predefined buckets, enabling detailed analysis of response times.

7. How Can We Expose Histograms as Time Series to Prometheus?

Prometheus histograms are exposed as multiple time series:

  • <metric>_bucket{le="..."}: Cumulative count of observations less than or equal to the bucket's upper bound.

  • <metric>_sum: Sum of all observed values.

  • <metric>_count: Total number of observations.

These time series allow Prometheus to store and query histogram data effectively.

8. Cumulative Histogram Representation

This chart plots:

  • X-axis: Duration in milliseconds (e.g., 25ms, 50ms, 100ms, etc.).

  • Y-axis: Count of observations that fall within a specific bucket.

  • Each bar height represents the number of observations between two bounds.

Bucket Counts (as shown):

| Bucket Range (ms) | Count |
| --- | --- |
| ≤ 25 | 31 |
| 25–50 | 32 |
| 50–100 | 105 |
| 100–250 | 617 |
| > 250 | 215 |

This means, for example, 617 requests took between 100ms and 250ms.

Prometheus stores histograms in a cumulative format rather than the regular format shown in the image.

A cumulative histogram gives the running total of observations up to each bucket's upper bound:

| Bucket (le = "less than or equal to") | Cumulative Count |
| --- | --- |
| le="25" | 31 |
| le="50" | 63 (31+32) |
| le="100" | 168 (63+105) |
| le="250" | 785 (168+617) |
| le="+Inf" | 1000 (785+215) |

So instead of individual bars, each bucket value contains the total number of observations less than or equal to the upper bound.
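The conversion from the regular per-range counts to the cumulative form Prometheus stores is just a running sum. A small sketch of that arithmetic, using the example numbers above:

```go
package main

import "fmt"

// cumulative converts per-range bucket counts (as in a regular histogram)
// into the running totals Prometheus stores under successive "le" bounds.
func cumulative(counts []int) []int {
	out := make([]int, len(counts))
	total := 0
	for i, c := range counts {
		total += c
		out[i] = total
	}
	return out
}

func main() {
	// Bucket counts from the example: <=25, 25-50, 50-100, 100-250, >250.
	fmt.Println(cumulative([]int{31, 32, 105, 617, 215}))
	// The last element (1000) is what the le="+Inf" bucket reports.
}
```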

Summary of the Difference

| Feature | Regular Histogram (Image) | Cumulative Histogram (Prometheus) |
| --- | --- | --- |
| Bucket Value | Observations within a range | Observations up to a bound |
| Data Representation | Independent bar heights | Accumulated total at each threshold |
| Example | 105 requests took 50–100ms | 168 requests took ≤ 100ms |

9. The Special "le" (Less-Than-Or-Equal) Bucket Upper Bound Label

In Prometheus, histograms use bucketed counts to record how many observations fall below certain thresholds.

Each bucket is labeled with:

le = X

Which means:

"Count of observations less than or equal to X."

For example:

  • le="25" → number of observations ≤ 25 ms

  • le="50" → number of observations ≤ 50 ms

  • ...

  • le="+Inf" → total count of all observations (since everything is ≤ ∞)

From the image:

| Bucket (le) | Cumulative Count |
| --- | --- |
| ≤ 25 ms | 31 |
| ≤ 50 ms | 63 |
| ≤ 100 ms | 168 |
| ≤ 250 ms | 785 |
| ≤ +Inf | 1000 |

Interpretation:

  • From 0–25 ms: 31 requests completed

  • 25–50 ms: 63 - 31 = 32 requests

  • 50–100 ms: 168 - 63 = 105 requests

  • 100–250 ms: 785 - 168 = 617 requests

  • 250 ms–∞: 1000 - 785 = 215 requests

Summary

  • The le label tells you the upper bound of the bucket.

  • These buckets are cumulative: each includes all lower durations.

  • Subtracting adjacent bucket values gives the number of samples in that range.

  • The bucket with le="+Inf" always contains the total number of samples.
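The subtraction described above, recovering the per-range counts from the cumulative "le" values, can be sketched as:

```go
package main

import "fmt"

// perBucket recovers the count inside each range by subtracting adjacent
// cumulative bucket values, the inverse of the running sum Prometheus stores.
func perBucket(cumulative []int) []int {
	out := make([]int, len(cumulative))
	prev := 0
	for i, c := range cumulative {
		out[i] = c - prev
		prev = c
	}
	return out
}

func main() {
	// Cumulative counts for le=25, 50, 100, 250, +Inf from the example.
	fmt.Println(perBucket([]int{31, 63, 168, 785, 1000}))
}
```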

10. Time Series Exposed from a Histogram Metric

This cumulative histogram displays the duration (in seconds) of observed events, using bucket boundaries (e.g. le="0.025") along the X-axis, and the cumulative count of observations along the Y-axis.

The time series exposed by a Prometheus histogram metric named http_request_duration_seconds_bucket would look like:

http_request_duration_seconds_bucket{le="0.025"}
http_request_duration_seconds_bucket{le="0.05"}
http_request_duration_seconds_bucket{le="0.1"}
http_request_duration_seconds_bucket{le="0.25"}
http_request_duration_seconds_bucket{le="+Inf"}

Each of these is a separate time series, and their values increase cumulatively as more events fall into that bucket or smaller.


🧠 How to Interpret le

Each le value is an upper boundary, meaning:

  • le="0.025" → all durations ≤ 25 ms

  • le="0.05" → all durations ≤ 50 ms

  • le="0.1" → all durations ≤ 100 ms

  • ...

  • le="+Inf" → all observations (total count)


๐Ÿ“ Behind the Scenes: Prometheus Histogram Export

A histogram metric in Prometheus (like http_request_duration_seconds) exposes 3 types of time series automatically:

Series TypePurpose
*_bucket{le="..."}Buckets by le, cumulative counts
*_countTotal count of observations
*_sumTotal sum of all observed values

So for http_request_duration_seconds, you'll see:

http_request_duration_seconds_bucket{le="0.025"}
http_request_duration_seconds_bucket{le="0.05"}
...
http_request_duration_seconds_bucket{le="+Inf"}

http_request_duration_seconds_sum
http_request_duration_seconds_count

✅ Why It Matters

  • You can compute percentiles using these buckets (e.g. 95th percentile from histogram approximation).

  • Subtracting two adjacent buckets gives the count in that interval.

  • It enables time-based slicing (e.g. rate of slow responses over the last 5 minutes).

11. Instrumentation - Adding Histograms to Your Code

To instrument your code with histograms:

  1. Define the Histogram:

var requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "Histogram of response time for handler.",
    Buckets: prometheus.LinearBuckets(0.05, 0.05, 20),
})

  2. Register the Histogram:

prometheus.MustRegister(requestDuration)

  3. Observe Values:

start := time.Now()
// handle request
duration := time.Since(start).Seconds()
requestDuration.Observe(duration)

12. Adding Histograms Without Additional Labels

When adding histograms without additional labels:

  • Simplifies Aggregation: Easier to aggregate across instances.

  • Reduces Cardinality: Fewer unique time series, conserving resources.

  • Use Case: Suitable for global metrics where differentiation isn't necessary.

13. Adding Histograms With Additional Labels

Adding labels to histograms allows for more granular analysis:

var requestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Histogram of response time for handler.",
        Buckets: prometheus.LinearBuckets(0.05, 0.05, 20),
    },
    []string{"method", "endpoint"},
)

This setup enables you to analyze request durations by HTTP method and endpoint.

14. Querying Histograms with PromQL

PromQL provides functions to query histograms:

  • rate(): Calculates the per-second average rate of increase.

  • increase(): Calculates the total increase over a time range.

  • histogram_quantile(): Estimates quantiles from histogram buckets.

Example:

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

This query estimates the 95th percentile of request durations over the last 5 minutes.

15. Querying All Bucket Series of a Histogram

To retrieve all bucket series:

http_request_duration_seconds_bucket

This returns all time series with the bucket suffix, allowing you to analyze the distribution across all buckets.

16. Querying Percentiles/Quantiles Using histogram_quantile()

The histogram_quantile() function estimates quantiles:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

This computes the 95th percentile by summing the rate of increase across all buckets and applying the quantile function.
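The interpolation histogram_quantile() performs can be sketched in Go. This is a simplified model assuming linear distribution within each bucket and a lower bound of 0 for the first bucket; Prometheus's exact edge-case handling differs slightly. The example numbers are the cumulative buckets from earlier (total count 1000):

```go
package main

import "fmt"

type bucket struct {
	le    float64 // upper bound of the bucket
	count float64 // cumulative count up to that bound
}

// quantile estimates the q-th quantile (0..1) by locating the bucket that
// contains the target rank and interpolating linearly inside it.
func quantile(q, total float64, buckets []bucket) float64 {
	rank := q * total
	lowerBound, lowerCount := 0.0, 0.0
	for _, b := range buckets {
		if rank <= b.count {
			return lowerBound + (b.le-lowerBound)*(rank-lowerCount)/(b.count-lowerCount)
		}
		lowerBound, lowerCount = b.le, b.count
	}
	// Rank falls in the +Inf bucket: report the last finite bound.
	return lowerBound
}

func main() {
	buckets := []bucket{{25, 31}, {50, 63}, {100, 168}, {250, 785}}
	// 75th percentile of 1000 observations: rank 750 lands in the 100-250 bucket.
	fmt.Printf("%.1f\n", quantile(0.75, 1000, buckets))
}
```

Because the result is interpolated, its accuracy depends entirely on how well the bucket boundaries match the real distribution, which is the error source discussed in the section on bucketing schemas below.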

17. Using rate() or increase() to Limit a Histogram to Recent Increases

To focus on recent data:

  • rate(): Provides the per-second average rate over a time range.

  • increase(): Gives the total increase over a time range.

Example:

rate(http_request_duration_seconds_bucket[5m])

This calculates the rate of increase for each bucket over the last 5 minutes.

18. Controlling the Smoothing Time Window

The time range specified in rate() or increase() functions controls the smoothing window:

  • Shorter Window: More responsive to recent changes but noisier.

  • Longer Window: Smoother results but less responsive to recent changes.

Choose the window size based on the desired balance between responsiveness and smoothness.
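The effect of the window size can be sketched numerically. This toy model computes the increase between the window's edge samples divided by the window length; real rate() also handles counter resets and extrapolation, which this omits. The sample series is hypothetical:

```go
package main

import "fmt"

// rateOver approximates rate(): the counter increase between the first and
// last samples inside the window, divided by the window in seconds.
// samples[i] is the counter value at second i (one sample per second).
func rateOver(samples []float64, end, windowSec int) float64 {
	return (samples[end] - samples[end-windowSec]) / float64(windowSec)
}

func main() {
	// A counter that mostly grows by 1/s but has a brief burst around t=9.
	samples := []float64{0, 1, 2, 3, 4, 5, 6, 7, 8, 20, 21, 22}
	fmt.Printf("short window: %.2f/s\n", rateOver(samples, 10, 2))  // dominated by the burst
	fmt.Printf("long window:  %.2f/s\n", rateOver(samples, 10, 10)) // smoothed
}
```

The short window reports a rate several times higher than the long one for the same series, which is exactly the noise/responsiveness trade-off described above.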

19. Aggregating Histograms and Percentiles Over Label Dimensions

To aggregate histograms across dimensions:

sum(rate(http_request_duration_seconds_bucket[5m])) by (le)

This sums the rate of increase for each bucket across all instances. You can then apply histogram_quantile() to compute percentiles:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

20. Errors of Quantile Calculation and Bucketing Schemas

Quantile estimation errors can arise due to:

  • Bucket Granularity: Coarse buckets lead to less accurate quantiles.

  • Data Distribution: Uneven distributions can skew results.

  • Interpolation Assumptions: histogram_quantile() assumes a uniform distribution within buckets.

To minimize errors:

  • Use Appropriate Buckets: Choose bucket boundaries that reflect your data distribution.

  • Monitor Bucket Usage: Ensure that most data falls within the defined buckets.

21. Showing Histograms as a Heatmap

Heatmaps provide a visual representation of histogram data over time:

  • X-Axis: Time.

  • Y-Axis: Bucket boundaries.

  • Color Intensity: Frequency of observations.

In Grafana:

  1. Select Heatmap Panel.

  2. Configure Data Source: Use Prometheus as the data source.

  3. Enter Query: For example:

rate(http_request_duration_seconds_bucket[5m])

  4. Adjust Visualization Settings: Set appropriate axes and color schemes.


22. Querying Request Rates Using _count

To calculate the rate of requests:

rate(http_request_duration_seconds_count[5m])

This provides the per-second rate of HTTP requests over the last 5 minutes.

23. Querying Average Request Durations Using _sum and _count

To compute the average request duration:

rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

This divides the total duration by the number of requests, yielding the average duration per request.
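Since the per-second terms of the two rate() expressions cancel, this query reduces to "increase of _sum divided by increase of _count over the window". A sketch of that arithmetic, with hypothetical counter values at the window's edges:

```go
package main

import "fmt"

// avgDuration reproduces rate(_sum[w]) / rate(_count[w]): the average
// duration per request over a window, from counter values at its edges.
func avgDuration(sumStart, sumEnd, countStart, countEnd float64) float64 {
	return (sumEnd - sumStart) / (countEnd - countStart)
}

func main() {
	// Hypothetical window: 12.5s of total handling time across 50 requests.
	fmt.Println(avgDuration(100.0, 112.5, 400, 450)) // 0.25s per request
}
```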

Creating Grafana Dashboards for Prometheus

1. Option A: Running Grafana Using Docker

Step-by-Step Instructions

✅ Prerequisites:

  • Docker installed on your system.

  • Prometheus already running (can also be in Docker).

🔧 Start Grafana using Docker:

docker run -d \
  -p 3000:3000 \
  --name=grafana \
  grafana/grafana

This command:

  • Runs Grafana in the background (-d)

  • Maps Grafana's port 3000 to your local machine

  • Names the container grafana

🧪 Check if it's running:

Visit http://localhost:3000

2. Option B: Running Grafana Using Pre-Built Binaries

✅ Prerequisites:

  • Installed Prometheus

  • Installed Grafana binary for your OS from:
    🔗 https://grafana.com/grafana/download

🧰 Installation Steps:

👉 Windows:

  1. Unzip the downloaded Grafana .zip file.

  2. Open a terminal (cmd) and navigate to the bin folder inside the extracted directory.

  3. Run:

grafana-server.exe

👉 Linux:

tar -zxvf grafana-<version>.linux-amd64.tar.gz
cd grafana-<version>
./bin/grafana-server

Grafana will run on http://localhost:3000.

3. Logging into Grafana

๐Ÿ•น๏ธ First Login

๐Ÿ” Youโ€™ll be asked to change the password on first login.

4. Creating a Prometheus Data Source

📡 Add Prometheus as a Data Source:

  1. In the left sidebar, click the gear icon (⚙️) → Data Sources

  2. Click "Add data source"

  3. Choose "Prometheus"

  4. Under HTTP > URL, enter:

http://localhost:9090

(Replace localhost:9090 with your actual Prometheus URL if it differs)

  5. Click "Save & Test"

    • You should see a green message: ✅ Data source is working

5. Creating a New Dashboard

๐Ÿ› ๏ธ Steps to Create a Dashboard:

  1. Click the โ€œ+โ€ (plus) icon in the left sidebar โ†’ Dashboard

  2. Click โ€œAdd new panelโ€

  3. Youโ€™ll now see a new panel editor with default settings

  4. At the top, name your dashboard (click on the title "New dashboard")

  5. Click Save (floppy disk icon) in the top right โ†’ Give it a name โ†’ Save

6. Creating a Time Series Chart

📈 Steps to Add a Time Series Panel:

  1. In your new dashboard, click "Add new panel"

  2. Choose Visualization type: Time series (left-hand side)

  3. In the Query section:

    • Set Data Source: Prometheus

    • Enter query:

rate(http_requests_total[5m])

  4. Click Run to see the graph populate.

  5. Customize:

    • Panel title, units (like seconds, ms, etc.)

    • Axes (logarithmic or linear)

    • Legend display

  6. Click Apply to save the panel to your dashboard

7. Creating a Gauge Panel

🎯 Steps to Add a Gauge:

  1. Click "Add panel" → In the Visualization options, select Gauge

  2. In the Query box, enter something like:

http_requests_total

  • or a value-producing metric like:

sum(rate(cpu_usage_seconds_total[1m]))

  3. Configure:

    • Min & Max range (example: 0–100 for percentages)

    • Thresholds (to color the gauge: green/yellow/red)

    • Unit: e.g., percent, seconds, req/sec

  4. Click Apply

8. Creating a Table Panel

🧮 Steps to Add a Table Panel:

  1. Click "Add panel"

  2. Select Visualization → Table

  3. In the query section, use a metric that returns multiple labels/values:

topk(5, rate(http_requests_total[1m]))

  4. Under Format:

    • Set to "Table"

    • Adjust time range, value format

  5. Style:

    • Add column aliases

    • Apply unit types (seconds, bytes, %, etc.)

  6. Click Apply

9. Adding Rows to the Dashboard

📋 Organize Panels Using Rows:

  1. In the dashboard view, click the dropdown menu (three-dot icon) in the upper right

  2. Select "Add row"

  3. Enter a name for the row (e.g., "Performance Metrics")

  4. Drag and drop existing panels into this row

  5. Use rows to group related panels:

    • CPU Stats

    • Memory Usage

    • Latency Tracking

📌 Rows can be collapsed/expanded, improving usability in large dashboards.

Final Touches

  • Use "Dashboard Settings" (gear icon at the top) to:

    • Set auto-refresh (e.g., every 10s, 30s, etc.)

    • Set default time range

    • Add dashboard-level variables

Monitoring Linux Host Metrics with Prometheus

1. Downloading and Unpacking the Node Exporter

The Node Exporter is an official Prometheus exporter for exposing hardware and OS metrics from *nix systems.

✅ Steps:

🔗 Download:

Go to: https://prometheus.io/download/#node_exporter

Or directly use:

wget https://github.com/prometheus/node_exporter/releases/download/v1.8.0/node_exporter-1.8.0.linux-amd64.tar.gz

📦 Unpack:

tar -xvf node_exporter-1.8.0.linux-amd64.tar.gz
cd node_exporter-1.8.0.linux-amd64

2. Node Exporter Command-Line Flags

The Node Exporter has many flags to control which metrics it exposes.

🔧 Common Flags:

./node_exporter \
  --web.listen-address=":9100" \
  --web.telemetry-path="/metrics" \
  --collector.cpu \
  --collector.meminfo \
  --collector.diskstats

📚 Flag Details:

| Flag | Description |
| --- | --- |
| --web.listen-address | Address/port to serve metrics on (default :9100) |
| --web.telemetry-path | Path where metrics are exposed (default /metrics) |
| --collector.<name> | Enable an individual collector (use --no-collector.<name> to disable one) |

You can list all collectors with:

./node_exporter --help

3. Running the Node Exporter

🟢 Start Node Exporter (basic way):

./node_exporter

It will start serving metrics on:
👉 http://localhost:9100/metrics

🚀 Run in Background (production):

nohup ./node_exporter > node_exporter.log 2>&1 &

Or, create a systemd service (recommended for servers):

sudo nano /etc/systemd/system/node_exporter.service

Paste:

[Unit]
Description=Node Exporter
After=network.target

[Service]
User=nobody
ExecStart=/path/to/node_exporter

[Install]
WantedBy=default.target

Enable & start:

sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

4. Inspecting the Node Exporter's /metrics Endpoint

Open in browser or curl:

curl http://localhost:9100/metrics

📋 You'll see raw Prometheus metrics like:

# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
node_cpu_seconds_total{cpu="0",mode="user"} 3452.92
node_memory_MemAvailable_bytes 123456789
node_filesystem_size_bytes{...} 1099511627776

These are the real-time system stats exposed to Prometheus.
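A sample line of this text format can be pulled apart programmatically. The sketch below uses a simplified regular expression (the real exposition format also allows timestamps, label-value escaping, and exemplars, which this ignores):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// lineRe is a simplified pattern for one sample line of the Prometheus text
// exposition format: metric name, optional {label} block, numeric value.
var lineRe = regexp.MustCompile(`^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$`)

// parseSample splits a sample line into its name, raw label block, and value.
func parseSample(line string) (name, labels string, value float64, err error) {
	m := lineRe.FindStringSubmatch(line)
	if m == nil {
		return "", "", 0, fmt.Errorf("not a sample line: %q", line)
	}
	value, err = strconv.ParseFloat(m[3], 64)
	return m[1], m[2], value, err
}

func main() {
	name, labels, value, _ := parseSample(`node_cpu_seconds_total{cpu="0",mode="user"} 3452.92`)
	fmt.Println(name, labels, value)
}
```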

5. Scraping the Node Exporter with Prometheus

🔧 Modify prometheus.yml config:

Add the Node Exporter as a static target:

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

If Node Exporter runs on another host, replace localhost with that IP or hostname.

๐Ÿ” Restart Prometheus:

bashCopyEdit./prometheus --config.file=prometheus.yml

Or if using systemd:

sudo systemctl restart prometheus

6. Verifying Successful Target Scrapes

✅ Go to Prometheus UI:

Visit: http://localhost:9090/targets

You should see:

job: node_exporter
target: localhost:9100
last scrape: <time>
status: UP

This confirms Prometheus is successfully scraping metrics.

7. Querying Node Exporter Metrics (CPU and Network Usage)

🧠 Example PromQL Queries:

🧮 CPU Usage (total per mode):

rate(node_cpu_seconds_total{mode!="idle"}[5m])

🧠 Memory Available:

node_memory_MemAvailable_bytes

📡 Network Received:

rate(node_network_receive_bytes_total[1m])

📤 Network Transmitted:

rate(node_network_transmit_bytes_total[1m])

💽 Disk Space Used:

(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes

8. Showing Host Metrics in Grafana

📺 Visualizing in Grafana:

📌 Prerequisites:

  • Prometheus added as a data source in Grafana.

Steps to Create a System Monitoring Dashboard:

  1. Create a new dashboard → Add Panel

  2. Use these queries:

🧠 CPU Load (Time series):

rate(node_cpu_seconds_total{mode="user"}[5m])

📡 Network Usage (Table or Graph):

rate(node_network_receive_bytes_total[1m])
rate(node_network_transmit_bytes_total[1m])

💽 Disk Usage (Gauge):

100 * (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes

🧠 Memory Usage (Gauge):

100 * (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes

Optional: Import Official Grafana Dashboard

  1. Go to Grafana → Dashboards → Import

  2. Use Dashboard ID: 1860 (Node Exporter Full)

  3. Choose your Prometheus data source → Import

This provides a rich pre-built monitoring dashboard.

Don't Make These 6 Prometheus Monitoring Mistakes

Mistake 1: Cardinality Bombs

🔥 Problem:

Creating a high number of unique time series by using too many or high-variance labels (e.g., user IDs, IP addresses, request paths) causes cardinality explosions, which:

  • Consume excessive memory and CPU

  • Slow down queries and alert evaluations

  • Can crash Prometheus

🧨 Example:

http_requests_total{user_id="1234", session_id="a9b8c7"}

If every user and session has unique IDs, this results in millions of time series.

✅ Best Practices:

  • Avoid using high-cardinality labels like user_id, session_id, request_path, etc.

  • Use static or bounded labels like status, method, instance.

  • Aggregate in queries, or drop unneeded labels with metric relabeling, instead of exploding the series count.
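The danger scales multiplicatively: the potential series count is the product of each label's distinct values. A back-of-envelope sketch (the label names and cardinalities are hypothetical):

```go
package main

import "fmt"

// seriesEstimate returns the number of potential time series for one metric:
// the product of the distinct-value counts of each label.
func seriesEstimate(cardinalities map[string]int) int {
	total := 1
	for _, n := range cardinalities {
		total *= n
	}
	return total
}

func main() {
	// Bounded labels stay cheap...
	fmt.Println(seriesEstimate(map[string]int{"method": 5, "status": 6})) // 30
	// ...but one unbounded label (e.g. a per-user ID) explodes the count.
	fmt.Println(seriesEstimate(map[string]int{"method": 5, "status": 6, "user_id": 100000}))
}
```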

Mistake 2: Aggregating Away Too Many Labels

โš ๏ธ Problem:

When using sum() or avg() without carefully specifying by() labels, you lose context and might aggregate metrics incorrectly.

๐Ÿ˜ตโ€๐Ÿ’ซ Example:

promqlCopyEditsum(rate(http_requests_total[5m]))

This sums all requests from all endpoints, all instances, all statuses โ€” losing all distinguishing information.

๐Ÿง  Solution:

promqlCopyEditsum by (job, instance, status) (rate(http_requests_total[5m]))
  • Keep important dimensional context

  • Aggregate only intentionally based on your alerting or visualization needs

Mistake 3: Unscoped Metric Selectors

💣 Problem:

Writing PromQL like this:

up

…without any scoping labels means querying every single up series from all jobs, across all targets, including exporters and services you might not care about.

🔍 Consequences:

  • Wastes query time and resources

  • Can return noisy or misleading results

  • Hard to debug or tune alerts

✅ Solution:

Scope it!

up{job="my_service"}

Or:

up{job=~"api|frontend"}

Use scoped selectors to reduce noise and make queries faster and more accurate.

Mistake 4: Missing for Durations in Alerting Rules

🚨 Problem:

Creating alerts without a for: clause in the rule leads to instantaneous alerts that fire as soon as a condition is true, even briefly, causing flapping or false positives.

🧨 Example:

- alert: HighCPU
  expr: rate(node_cpu_seconds_total{mode="user"}[1m]) > 0.9

This could fire if CPU spikes just for a second.

✅ Solution:

Add for: to wait before alerting:

- alert: HighCPU
  expr: rate(node_cpu_seconds_total{mode="user"}[1m]) > 0.9
  for: 2m

This ensures the alert only triggers if the condition holds continuously for 2 minutes.

Mistake 5: Too Short Rate Windows

📉 Problem:

Using short windows for rate() or increase() (like [30s]) leads to noisy or erratic results, especially for low-frequency metrics.

🧠 Why?

  • rate() needs multiple samples to give meaningful results

  • Short windows don't smooth over variations or delays

❌ Bad:

rate(http_requests_total[30s])

✅ Good:

rate(http_requests_total[5m])
  • Longer windows provide more stable, statistically accurate results

  • For alerts, use windows like [2m] to [5m]

  • For dashboards, use dynamic durations like $__rate_interval in Grafana

Mistake 6: Using Functions With Incorrect Metric Types

😱 Problem:

Applying PromQL functions meant for one metric type (e.g., counters) to another type (e.g., gauges) leads to invalid or misleading results.

❌ Example:

rate(node_memory_MemAvailable_bytes[5m])

This is incorrect. node_memory_MemAvailable_bytes is a gauge, not a counter, so rate() doesn't make sense.

✅ Solution:

  • Use rate() or increase() only with monotonically increasing counters

  • Use raw gauge values for metrics like memory, disk, temperature

Function Compatibility:

| Function | Works With | Description |
| --- | --- | --- |
| rate() | Counters | Per-second rate of increase over a time window |
| increase() | Counters | Total increase over a time window |
| avg_over_time() | All | Average value over time |
| max_over_time() | All | Maximum value over time |

Summary Table

| Mistake | Root Cause | Consequences | Fix |
| --- | --- | --- | --- |
| 1. Cardinality Bombs | High-cardinality labels | Memory bloat, instability | Remove unbounded labels |
| 2. Over-Aggregation | Aggregating away all labels | Loss of detail, inaccurate alerts | Use by(...) carefully |
| 3. Unscoped Selectors | No filtering in queries | Noisy, inefficient results | Use proper label filters |
| 4. Missing for | No delay in alerts | False positives | Add for: to alert rules |
| 5. Short Rate Windows | Tiny time ranges | Noisy or empty data | Use [2m] to [5m] |
| 6. Wrong Function Use | Using rate() on gauges | Misleading results | Match function to metric type |

Exposing Custom Host Metrics Using the Prometheus Node Exporter

1. 🔍 "textfile" Collector Module Basics

✅ What is the textfile collector?

  • A built-in module of the Node Exporter

  • Reads files containing Prometheus metric data in text exposition format

  • These files must be placed in a designated directory

  • Useful for ad-hoc, one-off, or custom metrics from scripts or non-Go code

📂 How It Works:

  • You create files like my_custom_metric.prom

  • Put them in the directory specified with:

      --collector.textfile.directory=/var/lib/node_exporter/textfile_collector/

  • The Node Exporter reads those files at scrape time and exposes their contents under the /metrics endpoint

📌 Key Notes:

  • Each file should use the .prom extension and be valid Prometheus text format

  • You are responsible for removing or rotating files; the Node Exporter does no cleanup

  • Avoid frequently rewriting large files (keep them small)

2. 🕒 Exposing a Custom Cron Job Metric

Suppose you want to measure the success/failure of a backup script run by cron.

๐Ÿ‘จโ€๐Ÿ’ป Bash Script Example:

bashCopyEdit#!/bin/bash

BACKUP_STATUS=1  # assume failure

if /usr/local/bin/backup.sh; then
  BACKUP_STATUS=0
fi

echo "# HELP backup_success Whether the backup succeeded (1) or failed (0)" > /var/lib/node_exporter/textfile_collector/backup.prom
echo "# TYPE backup_success gauge" >> /var/lib/node_exporter/textfile_collector/backup.prom
echo "backup_success $BACKUP_STATUS" >> /var/lib/node_exporter/textfile_collector/backup.prom

โฑ Cron Job Entry:

cronCopyEdit0 2 * * * /usr/local/bin/backup_metric.sh

This creates or updates /var/lib/node_exporter/textfile_collector/backup.prom every night at 2 AM. Node Exporter will serve that file as part of its /metrics.

🧪 You can query this in Prometheus:

backup_success

3. ๐Ÿง‘โ€๐Ÿ’ป Generating Metric Text Files From Go

You can also generate .prom files from Go programs that gather and export custom metrics.

✅ Step-by-step Example:

Import Required Package

import (
    "fmt"
    "os"
)

Write Metrics to File

func writeCustomMetric(filename string, metricName string, value float64) {
    file, err := os.Create(filename)
    if err != nil {
        panic(err)
    }
    defer file.Close()

    fmt.Fprintf(file, "# HELP %s Custom metric\n", metricName)
    fmt.Fprintf(file, "# TYPE %s gauge\n", metricName)
    fmt.Fprintf(file, "%s %f\n", metricName, value)
}

Usage

func main() {
    writeCustomMetric("/var/lib/node_exporter/textfile_collector/my_metric.prom", "my_custom_gauge", 42.0)
}

Run this Go program periodically (via cron or systemd timer) to update the metric.

4. 📁 "textfile" Collector Example Scripts Repository

There is an official community-maintained repo with example scripts:
🔗 https://github.com/prometheus/node-exporter-textfile-collector-scripts

✅ What's in the repo?

  • Prebuilt scripts to collect metrics like:

    • SMART disk health

    • RAID status

    • Sensors temperature

    • Filesystem usage

    • Battery levels

  • Scripts written in bash, Python, or other languages

  • Designed to drop .prom files in the textfile_collector directory

📂 Directory Structure:

/var/lib/node_exporter/textfile_collector/
├── smartctl.prom
├── sensors.prom
├── custom_ping_check.prom

Each .prom file contains one or more metrics with the appropriate format.

Example content of smartctl.prom:

# HELP smart_disk_ok Whether the disk passed SMART test
# TYPE smart_disk_ok gauge
smart_disk_ok{device="/dev/sda"} 1
smart_disk_ok{device="/dev/sdb"} 0

This allows you to alert on disk failure using PromQL.

Best Practices

| Practice | Recommendation |
| --- | --- |
| File format | Use the .prom extension and valid Prometheus text format |
| File ownership | Ensure the Node Exporter has read access to the files |
| Script errors | Avoid writing invalid or partial .prom files (write to a temp file, then rename) |
| Performance | Don't create too many metrics or files; keep it lean |
| Rotation | Overwrite or remove files regularly to avoid stale metrics |

Sample PromQL Queries

backup_success == 0

Alert if your backup fails.

smart_disk_ok == 0

Detect failing disks.

avg(node_custom_ping_latency_ms) by (target)

Get average ping latency from a script.

Relabeling in Prometheus

1. 🎯 Motivation for Relabeling

Prometheus scrapes targets and attaches labels to their metrics. Sometimes:

  • You want to modify these labels.

  • You want to drop or keep certain targets.

  • You want to rewrite target addresses.

  • You want to extract or clean up metadata from service discovery.

Relabeling provides a powerful and flexible way to transform labels or control scrape behavior.

2. โš™๏ธ Relabeling in the Prometheus Configuration File

Relabeling is configured in your prometheus.yml file under different contexts:

Section | Purpose
relabel_configs | Target relabeling: modifies targets before scraping
metric_relabel_configs | Metric relabeling: modifies individual metrics after scraping
write_relabel_configs (under remote_write) | Modifies labels before sending metrics to remote storage

๐Ÿ”ง Example Layout:

scrape_configs:
  - job_name: 'example'
    static_configs:
      - targets: ['localhost:9100']
    relabel_configs:
      - source_labels: [__address__]
        regex: 'localhost:9100'
        target_label: instance
        replacement: 'my-node'

3. ๐Ÿงญ Relabeling Steps and Flow

Target Relabeling Flow:

  1. Service discovery (SD) returns a list of target groups.

  2. Each target gets label metadata (like __address__, __meta_kubernetes_pod_name, etc.).

  3. These labels go through relabeling steps (relabel_configs).

  4. The resulting targets are scraped if they're not dropped.

Metric Relabeling Flow:

  1. After scraping, each metric passes through metric_relabel_configs.

  2. Metrics can be dropped, relabeled, or kept based on the rules.

4. ๐Ÿงฑ Relabeling Rule Structure and Fields

Each relabeling rule is a YAML dictionary with:

Field | Description
source_labels | List of labels used as input
separator | String used to join multiple source label values (default: ;)
regex | Regular expression matched against the joined string
target_label | Label to write the result to
replacement | Replacement value (may reference capture groups like $1)
action | What to do: replace, keep, drop, hashmod, labelmap, etc.

๐Ÿ›  Example:

- source_labels: [__meta_kubernetes_pod_name]
  regex: '(.*)'
  target_label: pod
  replacement: '$1'
  action: replace
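To make the rule mechanics concrete, here is a minimal Python model of the replace/keep/drop actions. This is an illustrative simplification, not Prometheus's actual implementation; it ignores hashmod, labelmap, and label-name validation:

```python
import re

def relabel(labels, rule):
    """Apply one simplified relabel rule to a dict of labels.

    Returns the (possibly modified) label dict, or None when the
    target/metric is dropped. Models replace/keep/drop only.
    """
    sep = rule.get("separator", ";")
    joined = sep.join(labels.get(name, "") for name in rule.get("source_labels", []))
    # Prometheus fully anchors the regex, hence fullmatch().
    match = re.compile(rule.get("regex", "(.*)")).fullmatch(joined)
    action = rule.get("action", "replace")

    if action == "keep":
        return labels if match else None
    if action == "drop":
        return None if match else labels
    if action == "replace" and match:
        # Translate Prometheus-style $1 / ${1} references to Python's \1.
        repl = re.sub(r"\$\{(\d+)\}|\$(\d+)",
                      lambda m: "\\" + (m.group(1) or m.group(2)),
                      rule.get("replacement", "$1"))
        out = dict(labels)
        out[rule["target_label"]] = match.expand(repl)
        return out
    return labels  # replace with no match: the rule is a no-op
```

For instance, a port-override rule (regex '(.*):.*', replacement '${1}:9100') rewrites {'__address__': '10.0.0.5:8080'} into '10.0.0.5:9100', and a drop rule whose regex matches returns None, removing the target.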

5. ๐Ÿท๏ธ Target Metadata Labels

When using service discovery (e.g., Kubernetes), targets come with metadata labels, prefixed with __meta_.

Examples:

  • __meta_kubernetes_namespace

  • __meta_kubernetes_pod_name

  • __meta_kubernetes_pod_label_app

These are temporary labels used during relabeling and are discarded afterward unless explicitly copied.

6. ๐Ÿงช The Relabeling Visualizer Tool

๐Ÿ”— Prometheus Relabel Debugger: https://relabeler.promlabs.com

This web-based tool lets you:

  • Paste raw label sets

  • Test relabeling rules interactively

  • See how each step transforms your labels

It is especially useful for debugging Kubernetes service discovery labels.

7. ๐Ÿงท Example 1: Setting a Fixed Label Value

Add a new label env="prod" to all targets:

- target_label: env
  replacement: prod
  action: replace

8. ๐Ÿ” Example 2: Overriding the Scrape Port

Force scraping on port 9100 regardless of what SD gives:

- source_labels: [__address__]
  regex: '(.*):.*'
  target_label: __address__
  replacement: '${1}:9100'
  action: replace

9. ๐Ÿ”„ Example 3: Mapping Over Label Patterns

Map all labels with prefix __meta_kubernetes_pod_label_ into real labels:

- action: labelmap
  regex: __meta_kubernetes_pod_label_(.+)

This will turn:

__meta_kubernetes_pod_label_app="nginx"

Into:

app="nginx"

10. โŒ Example 4: Dropping Scraped Samples

Use metric_relabel_configs to drop unwanted metrics:

metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'node_cpu_seconds_total'
    action: drop

Or drop entire targets from scrape:

relabel_configs:
  - source_labels: [__meta_kubernetes_namespace]
    regex: 'test-namespace'
    action: drop
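The inverse action, keep, drops everything that does not match. A common pattern (sketched here using the standard Kubernetes SD annotation metadata label, assuming your pods set the conventional prometheus.io/scrape annotation) scrapes only opted-in pods:

```yaml
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    regex: 'true'
    action: keep
```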

11. ๐Ÿงฉ Debugging Relabeling Rules

๐Ÿ” How to Debug:

  1. Use /targets in Prometheus web UI

    • Shows original labels and post-relabel labels

    • Shows if a target was dropped

  2. Use /api/v1/targets to fetch live target info

  3. Use the PromLabs relabel debugger to simulate complex flows

  4. Run Prometheus with --log.level=debug to see detailed relabeling logs

Summary

Feature | Description
relabel_configs | Change scrape targets and their metadata
metric_relabel_configs | Filter or relabel individual metrics
labelmap | Rename multiple labels based on a regex
drop, keep | Selectively drop or keep targets/metrics
Visual debugging | Use relabeler.promlabs.com

Grafana Heatmaps for Prometheus Histograms

1. Adding and Configuring a Heatmap Panel for Prometheus Histograms

๐Ÿ”ธ What Is a Heatmap Panel?

A heatmap is a two-dimensional chart where:

  • The X-axis usually represents time.

  • The Y-axis represents value buckets (e.g., request durations, response sizes).

  • Color intensity represents the frequency or count of occurrences.

In Prometheus, heatmaps are built from histogram metrics, specifically the _bucket time series from histogram instruments.
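As a reminder of the data shape: each _bucket series carries a cumulative count of observations less than or equal to its le bound. A small sketch, with hypothetical bucket bounds and observations:

```python
# Prometheus histogram buckets are cumulative: each bucket counts every
# observation less than or equal to its `le` (less-than-or-equal) bound.
bounds = [0.1, 0.3, 0.5, 1.0, float("inf")]     # hypothetical le bounds
observations = [0.05, 0.2, 0.2, 0.4, 0.9, 2.5]  # request durations (s)

buckets = {le: sum(1 for o in observations if o <= le) for le in bounds}
# buckets[0.5] includes everything already counted in buckets[0.3];
# the +Inf bucket always equals the histogram's total _count.
```

Grafana de-accumulates these counts when rendering the heatmap, so each row shows only the observations that landed in that specific range.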

Prerequisites

  • You have Prometheus set up and scraping histogram metrics.

  • Example Prometheus metric: http_request_duration_seconds_bucket

  • Grafana is connected to Prometheus as a data source.

Step-by-Step: Adding a Heatmap Panel

๐Ÿ›  Step 1: Open Grafana and Create/Edit a Dashboard

  • Go to Grafana (typically http://localhost:3000).

  • Click "+" → Dashboard.

  • Click "Add New Panel".

  • From the panel type selector, choose "Heatmap".

Step 2: Write the PromQL Query

Use the histogram _bucket metric with a rate() or increase() function:

rate(http_request_duration_seconds_bucket[5m])

Replace the metric with your own histogram bucket metric.

  • rate() shows frequency per second.

  • increase() is used for absolute count over a time window.

Step 3: Group by Bucket (le) and Label

You must group by the le label (less-than-or-equal) to segment by bucket:

sum by (le) (
  rate(http_request_duration_seconds_bucket[5m])
)

If you have other labels (e.g., job, instance), include them as needed.

Step 4: Panel Settings

A. Data Format

  • Format as: Time series buckets (NOT regular time series).

B. Visualization Settings

  • Set the Y-axis to "logarithmic" if your buckets vary widely.

  • Set the Y-axis unit: seconds (s), milliseconds (ms), or your metric's unit.

  • Choose Color scheme: usually gradient or spectrum.

  • Adjust Bucket sort: ascending (for duration buckets).

C. Binning Options

In Display > Binning:

  • X-Axis (time): automatically binned

  • Y-Axis (bucket boundaries):

    • Choose "Series" mode for Prometheus

    • Binning mode: "Auto", or specify your own bucket steps (optional)

Step 5: Save and Observe

  • Click Apply to save the panel.

  • Observe how your metric is distributed across buckets over time.

2. Using and Interpreting the Heatmap Panel

๐Ÿ”Ž Understanding What You See

The heatmap shows how frequently values fall into different buckets over time.

Each horizontal slice (row) = one bucket (e.g., request ≤ 0.3s, ≤ 0.5s, etc.)

Each vertical column = a time slice (e.g., every minute)

Each cell color = frequency (how many requests fell in that range)

Common Use Cases

Use Case | How the Heatmap Helps
Request latency analysis | See whether most requests fall under 0.5s or spike into higher buckets
Memory usage | See how memory allocations vary and cluster around thresholds
Response size | Spot spikes in payload size over time
Application load | View load distribution across histogram buckets

Typical Interpretation Patterns

  • Darker cells: More frequent values in that bucket/time.

  • Sudden color changes: Traffic spike or regression.

  • Wider spread of color across buckets: Latency variability or inconsistent performance.

Example Histogram Metric

If you're using the default Prometheus Go client:

sum by (le) (
  rate(http_request_duration_seconds_bucket{job="my-api"}[5m])
)

This query feeds into a heatmap that shows request duration patterns.

Compare with Other Metrics

Combine heatmap with:

  • _sum: Total request time (for avg calculation).

  • _count: Total request count.

  • Use these with PromQL like:

rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])

This gives the average request duration, which you can cross-check against the heatmap.
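Another useful cross-check, using standard PromQL and assuming the same metric and job labels as above, computes a latency quantile from the very buckets the heatmap displays:

```promql
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{job="my-api"}[5m]))
)
```

If the 95th-percentile line rises while the heatmap darkens in the higher buckets, you are looking at a genuine latency regression rather than a visualization artifact.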
