Understanding Prometheus: A Comprehensive Guide

Table of contents
- Introduction to the Prometheus Monitoring System
- Getting Started with Prometheus
- Understanding Prometheus Metric Types
- 1. Gauges
- 2. Gauge Instrumentation Methods
- 3. Gauges in the Exposition Format
- 4. Querying Gauges
- 5. Gauges Containing Timestamps
- 6. Counters
- 7. Counter Resets
- 8. Counter Instrumentation Methods
- 9. Counters in the Exposition Format
- 10. Querying Counters (Absolute Values vs. Rates)
- 11. Summaries
- 12. Constructing Summaries
- 13. Summary Instrumentation Methods
- 14. Querying Summaries
- 15. Histograms
- 16. Cumulative Histograms
- 17. Constructing Histograms
- 18. Histogram Instrumentation Methods
- 19. Histograms in the Exposition Format
- 20. Querying Histograms
- 21. Average Request Latencies
- 22. Native Histogram
- PromQL Data Selection Explained
- 1. Instant Vector Selectors
- 2. Label Matchers
- 3. Visualizing Instant Vector Selector Behavior (Lookback Delta)
- 4. Staleness Markers and Staleness Handling
- 5. Range Vector Selectors
- 6. Visualizing Range Vector Selector Behavior
- 7. Relative Offsets (offset Modifier)
- 8. Visualizing Offsets for Instant Vector Selectors
- 9. Offset Use Cases
- 10. Visualizing Offsets for Range Vector Selectors
- 11. Absolute Evaluation Timestamps (@ Modifier)
- 12. Visualizing Absolute Evaluation Timestamps
- 13. Syntactic Order for Modifiers
- Understanding Counter Rates and Increases in PromQL
- 1. Absolute Counter Values and Why We Want Rates
- 2. The Three Counter Increase Functions
- 3. Behavior of rate() and increase()
- 4. Handling Counter Resets
- 5. Calculating the rate() and increase() Slope
- 6. Extrapolating the Return Value for the increase() Function
- 7. Confusing Extrapolating for Slow-Moving Counters
- 8. Limiting Extrapolating to Zero Sample Values
- 9. The irate() Function
- 10. Which Function Should You Use?
- Understanding "up" and Friends in Prometheus
- Understanding Prometheus Histograms
- 1. Motivation and Histogram Basics
- 2. Need to Measure Request Durations/Latency
- 3. Downsides of Using Event Logging
- 4. Why a Single Gauge Doesn't Help Us
- 5. Downsides of Using Prometheus Summary Metrics
- 6. Prometheus Histogram Example for Tracking Request Durations
- 7. How Can We Expose Histograms as Time Series to Prometheus?
- 8. Cumulative Histogram Representation
- 9. The Special "le" (Less-Than-Or-Equal) Bucket Upper Bound Label
- 10. Time Series Exposed from a Histogram Metric
- 11. Instrumentation - Adding Histograms to Your Code
- 12. Adding Histograms Without Additional Labels
- 13. Adding Histograms With Additional Labels
- 14. Querying Histograms with PromQL
- 15. Querying All Bucket Series of a Histogram
- 16. Querying Percentiles/Quantiles Using histogram_quantile()
- 17. Using rate() or increase() to Limit a Histogram to Recent Increases
- 18. Controlling the Smoothing Time Window
- 19. Aggregating Histograms and Percentiles Over Label Dimensions
- 20. Errors of Quantile Calculation and Bucketing Schemas
- 21. Showing Histograms as a Heatmap
- 22. Querying Request Rates Using _count
- 23. Querying Average Request Durations Using _sum and _count
- Creating Grafana Dashboards for Prometheus
- 1. Option A: Running Grafana Using Docker
- 2. Option B: Running Grafana Using Pre-Built Binaries
- 3. Logging into Grafana
- 4. Creating a Prometheus Data Source
- 5. Creating a New Dashboard
- 6. Creating a Time Series Chart
- 7. Creating a Gauge Panel
- 8. Creating a Table Panel
- 9. Adding Rows to the Dashboard
- Final Touches
- Monitoring Linux Host Metrics with Prometheus
- 1. Downloading and Unpacking the Node Exporter
- 2. Node Exporter Command-Line Flags
- 3. Running the Node Exporter
- 4. Inspecting the Node Exporter's /metrics Endpoint
- 5. Scraping the Node Exporter with Prometheus
- 6. Verifying Successful Target Scrapes
- 7. Querying Node Exporter Metrics (CPU and Network Usage)
- 8. Showing Host Metrics in Grafana
- Don't Make These 6 Prometheus Monitoring Mistakes
- Exposing Custom Host Metrics Using the Prometheus Node Exporter
- Relabeling in Prometheus
- 1. Motivation for Relabeling
- 2. Relabeling in the Prometheus Configuration File
- 3. Relabeling Steps and Flow
- 4. Relabeling Rule Structure and Fields
- 5. Target Metadata Labels
- 6. The Relabeling Visualizer Tool
- 7. Example 1: Setting a Fixed Label Value
- 8. Example 2: Overriding the Scrape Port
- 9. Example 3: Mapping Over Label Patterns
- 10. Example 4: Dropping Scraped Samples
- 11. Debugging Relabeling Rules
- Summary
- Grafana Heatmaps for Prometheus Histograms
Introduction to the Prometheus Monitoring System
Prometheus is an open-source monitoring and alerting toolkit widely used for cloud-native and microservices environments. Originally developed at SoundCloud in 2012 and now part of the Cloud Native Computing Foundation (CNCF), Prometheus excels at collecting time-series data, enabling real-time alerting and powerful metric analysis.
1. What is Prometheus?
Prometheus is a time-series database and monitoring system. It works by scraping metrics from instrumented targets at specified intervals and storing them in a highly efficient time-series database. Prometheus is widely adopted for its multi-dimensional data model, simple yet powerful query language, and standalone nature: it doesn't rely on external storage systems or message queues.
Key points:
Pull-based data collection via HTTP.
Stores data as time-stamped metrics.
Comes with a built-in expression browser and alerting.
Scales well for most monitoring needs.
2. System Architecture
2.1 Metric Sources (Targets)
These are the systems Prometheus collects metrics from:
Applications
Databases
Linux Hosts
Containers
These systems expose metrics endpoints (usually /metrics) in a format Prometheus can scrape.
2.2 Service Discovery
To automatically discover the metric sources, Prometheus uses:
Kubernetes
Consul
This allows Prometheus to dynamically find services/instances to scrape, rather than manually configuring static targets.
2.3 Prometheus Server (Core)
This is the central brain of the architecture and does most of the heavy lifting:
Data Retriever: pulls (scrapes) metrics from the discovered targets.
TSDB (Time Series Database): stores all the scraped metrics as time-series data.
HTTP Server: allows users, tools, and dashboards to query Prometheus data using PromQL (Prometheus Query Language).
Prometheus pulls metrics from targets (it does not receive pushes by default, although there are workarounds for pushing).
2.4 Querying and Visualization Tools
These tools interact with Prometheus to visualize and analyze the data:
Web UI: the built-in Prometheus UI.
SigNoz: an open-source observability platform.
PromLens: an advanced query-building tool.
Grafana: a popular visualization tool (dashboards and graphs).
These tools query data from Prometheus via PromQL.
2.5 Alerting
Prometheus can evaluate alerting rules and:
Send alerts to Alertmanager.
Alertmanager handles deduplication, grouping, and routing of alerts.
Alertmanager forwards alerts to notification channels like:
Email
Slack
PagerDuty
2.6 Remote/Local Storage Integration
Prometheus can forward samples to external storage systems for long-term retention or more scalable solutions, referred to as Remote/Local Storage. This is optional and useful for scaling or compliance needs.
Summary of the Flow
Prometheus discovers targets via Kubernetes/Consul.
Prometheus scrapes (pulls) metrics from targets.
Data is stored in Prometheusโs TSDB.
Users/tools query metrics (via Web UI, Grafana, etc.).
Prometheus sends alerts to Alert Manager based on rules.
Alert Manager forwards alerts to email, Slack, PagerDuty.
Optionally, Prometheus forwards samples to remote storage.
In short:
Prometheus scrapes, stores, analyzes, and alerts on time-series metrics from various sources, while integrating with external tools for visualization and notifications.
3. Core Features Overview
Prometheus offers a wide range of features for modern monitoring:
Multi-dimensional data model using time series identified by metric names and key/value pairs (labels).
Powerful PromQL (Prometheus Query Language) for slicing and dicing data.
No reliance on external storage: all data is stored locally in a custom TSDB.
Pull-based scraping over HTTP.
Integrated Alerting with Alertmanager.
Flexible Service Discovery supporting Kubernetes, Consul, EC2, and static targets.
Built-in Web UI for ad-hoc queries and visualization.
4. Prometheus Data Model
Prometheus's data model revolves around time series.
A metric is a set of time series that share the same name and differ by their label sets.
Each time series is identified by a metric name and a set of key/value labels.
For example:
http_requests_total{method="POST", handler="/api/order"}
Each time series consists of:
Timestamps
Floating-point samples
Labels enable flexible, multi-dimensional querying and filtering.
This data model supports flexible and high-dimensional querying without predefined schemas.
5. Metrics Transfer Format
Prometheus uses a text-based exposition format over HTTP.
The standard format is a simple plaintext format, exposed by the /metrics endpoint on each target.
Example:
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET", code="200"} 1027
Metrics types include:
Counter: Monotonically increasing values (e.g., requests served).
Gauge: Values that go up and down (e.g., temperature).
Histogram: Measures distributions (e.g., request durations).
Summary: Similar to histograms but focused on quantiles.
Prometheus scrapes this endpoint regularly and parses the data for storage and query.
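To make the pull model concrete, here is a minimal Go sketch (assuming the client_golang library and an arbitrary listen port of :8080) of an application that registers one counter and exposes it on a /metrics endpoint for Prometheus to scrape:

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal is incremented by the handler below on every hit.
var requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "myapp_requests_total",
    Help: "Total number of handled requests.",
})

func main() {
    prometheus.MustRegister(requestsTotal)

    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        requestsTotal.Inc()
        w.Write([]byte("hello"))
    })

    // Serve the text-based exposition format shown above.
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}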
6. Query Language (PromQL)
PromQL (Prometheus Query Language) is a powerful and expressive language for querying time series data.
Used in the web UI, Grafana, alert rules, and API calls.
Supports:
Instant vector: snapshot at a time point.
Range vector: data over a time range.
Arithmetic operations between metrics.
Aggregation: sum, avg, max, min, count, etc.
Filtering based on labels.
Examples:
http_requests_total - fetches all time series of this metric.
sum(rate(http_requests_total[1m])) by (method) - per-second request rate per method, averaged over the last minute.
PromQL makes Prometheus incredibly flexible for real-time analytics.
7. Integrated Alerting
Prometheus comes with built-in alerting capabilities:
Alert Rules are defined using PromQL.
Prometheus evaluates these rules at regular intervals and fires alerts.
Alerts are sent to Alertmanager, which:
Deduplicates alerts
Groups related alerts
Sends notifications via email, Slack, PagerDuty, etc.
Supports silencing and routing policies
Example rule:
groups:
  - name: example
    rules:
      - alert: HighRequestRate
        expr: rate(http_requests_total[1m]) > 100
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High request rate detected"
8. Service Discovery Support
Prometheus can dynamically discover scrape targets using service discovery integrations, avoiding the need for static configs.
Supported methods include:
Kubernetes (pods, services, endpoints)
Consul
EC2
Azure
GCE
Docker Swarm
File-based SD (watching JSON/YAML files)
This enables automatic discovery of new services and ensures Prometheus always monitors the correct set of targets, even in dynamic cloud-native environments.
Getting Started with Prometheus
This section will walk you through the first steps of installing, configuring, and running Prometheus. You'll also learn how to explore its web UI, monitor targets, and query data using PromQL.
1. Downloading Prometheus
Prometheus can be downloaded directly from its official website:
Visit: https://prometheus.io/download
Select the appropriate binary for your operating system (e.g., Linux, macOS, Windows).
Example (Linux, x86_64):
wget https://github.com/prometheus/prometheus/releases/download/v2.51.1/prometheus-2.51.1.linux-amd64.tar.gz
Make sure to always download the latest stable version.
2. Unpacking and Inspecting the Tarball
Once downloaded, unpack the tarball using the following command:
tar -xvf prometheus-2.51.1.linux-amd64.tar.gz
cd prometheus-2.51.1.linux-amd64
Inside the extracted folder, you'll see:
prometheus - the main binary
promtool - a tool to check config files
prometheus.yml - the default config file
console_libraries/ - libraries for console templates
consoles/ - example console templates
This directory structure can be moved or customized based on your deployment setup.
3. Configuring Prometheus
Prometheus is configured via a YAML file (prometheus.yml), which defines global settings, scrape targets, alerting, and more.
Example prometheus.yml:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
Key config sections:
global: sets the default scrape interval, evaluation interval, etc.
scrape_configs: defines monitoring targets, job names, relabeling, etc.
alerting: configures Alertmanager integration.
rule_files: specifies rule files for alerting or recording rules.
Use promtool check config prometheus.yml to validate your configuration.
4. Command-Line Flags and Defaults
Prometheus supports many command-line flags to customize runtime behavior. Commonly used ones:
./prometheus \
--config.file=prometheus.yml \
--storage.tsdb.path=data/ \
--web.listen-address=":9090"
Useful flags:
--config.file: path to the config file (default: prometheus.yml)
--storage.tsdb.path: directory for storing metrics data (default: data/)
--web.listen-address: address and port on which Prometheus serves the UI and API
--log.level: set the log level (e.g., info, debug, error)
You can view all flags by running:
./prometheus --help
5. Running Prometheus
To start Prometheus:
./prometheus --config.file=prometheus.yml
You should see logs indicating that Prometheus is starting and loading targets. By default, the web UI will be accessible at:
http://localhost:9090
Make sure port 9090 is open and not blocked by firewalls or other services.
6. Web Interface
Prometheus includes a built-in web UI accessible via the browser.
Features:
Home dashboard with system status
Expression browser for querying metrics
Visualization of raw time-series data
Target health and label inspection
Alerts and rules display
To access it:
http://localhost:9090
Useful tabs:
Status > Targets: See active targets and scrape status
Graph: Run PromQL queries and visualize data
Alerts: View firing and pending alerts
7. Targets Page
The Targets page shows all the configured jobs and their respective scrape endpoints.
Navigate to:
http://localhost:9090/targets
You'll see:
Job name
Endpoint
Last scrape time
Scrape duration
Scrape status (UP/DOWN)
If a target is down, check:
If the service is running
If the endpoint is reachable
If the config is correct
This page is essential for debugging connectivity and monitoring issues.
8. Querying Metrics with PromQL
Prometheus supports PromQL (Prometheus Query Language) for querying and analyzing time-series data.
To try it out:
Go to http://localhost:9090/graph
Enter a query, e.g.:
up
This checks if targets are up (1 = UP, 0 = DOWN).
Examples:
node_cpu_seconds_total: view total CPU time
rate(http_requests_total[1m]): view the request rate over the past minute
sum by (instance)(rate(http_requests_total[5m])): total request rate per instance
You can visualize results as graphs, tables, or export as JSON using the HTTP API.
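If you want query results programmatically rather than through the UI, here is a minimal sketch (Go standard library only, assuming Prometheus listens on localhost:9090) that calls the /api/v1/query HTTP endpoint and prints the raw JSON response:

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
)

func main() {
    // Instant query: "up" returns one sample per configured target.
    params := url.Values{}
    params.Set("query", "up")

    resp, err := http.Get("http://localhost:9090/api/v1/query?" + params.Encode())
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body)) // JSON containing status, resultType, and the result vector
}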
Understanding Prometheus Metric Types
Prometheus supports four core metric types that represent different patterns of data collection. These types (Gauges, Counters, Summaries, and Histograms) allow developers and SREs to instrument and monitor applications with precision and clarity. Each type has specific characteristics and use cases.
1. Gauges
A Gauge is a metric that represents a single numerical value that can arbitrarily go up and down. Use gauges for things like current memory usage, number of active goroutines, or temperature readings.
Example use cases:
Current CPU temperature
Active sessions
Queue length
Free memory
2. Gauge Instrumentation Methods
Prometheus client libraries (e.g., Python, Go, Java) provide methods to work with gauges:
In Go:
var temperature = prometheus.NewGauge(
    prometheus.GaugeOpts{
        Name: "room_temperature_celsius",
        Help: "Current room temperature in Celsius.",
    },
)
temperature.Set(22.5)
temperature.Inc()
temperature.Dec()
In Python:
from prometheus_client import Gauge
temperature = Gauge('room_temperature_celsius', 'Current room temperature')
temperature.set(22.5)
Common methods:
set(value)
inc(), dec()
set_to_current_time()
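Client libraries can also compute a gauge value on demand at scrape time instead of setting it explicitly. A minimal Go sketch using client_golang's GaugeFunc (the goroutine count is just an illustrative value source):

package main

import (
    "runtime"

    "github.com/prometheus/client_golang/prometheus"
)

func main() {
    // The callback runs on every scrape, so the exposed value is always current.
    goroutines := prometheus.NewGaugeFunc(
        prometheus.GaugeOpts{
            Name: "myapp_goroutines",
            Help: "Current number of goroutines.",
        },
        func() float64 { return float64(runtime.NumGoroutine()) },
    )
    prometheus.MustRegister(goroutines)
}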
3. Gauges in the Exposition Format
The exposition format is the plain text output served on the /metrics endpoint.
Example:
# HELP room_temperature_celsius Current room temperature
# TYPE room_temperature_celsius gauge
room_temperature_celsius 22.5
The format is human-readable and easily parseable by Prometheus scrapers.
4. Querying Gauges
Use PromQL to directly view the current value of a gauge:
room_temperature_celsius
You can apply mathematical operations:
room_temperature_celsius * 1.8 + 32  # convert to Fahrenheit
5. Gauges Containing Timestamps
Gauges can also include explicit timestamps in exposition format, although it's not typical.
Example:
room_temperature_celsius 22.5 1683023900000
However, this is discouraged unless absolutely necessary, as it can complicate time-series storage.
6. Counters
Counters are cumulative metrics that can only increase (or be reset to zero on restart). Use counters to track things like:
Total HTTP requests
Errors
Bytes transferred
Apart from resets, counter values only ever increase; they never go down.
7. Counter Resets
Counters can reset to zero, typically after a service restart. Functions like rate() handle this gracefully: whenever a sample is lower than the previous one, Prometheus treats it as a counter reset and adjusts its calculations accordingly.
For example:
rate(http_requests_total[5m])
This accounts for resets by calculating per-second increase over time.
8. Counter Instrumentation Methods
In Go:
var requests = prometheus.NewCounter(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    },
)
requests.Inc()
requests.Add(3)
In Python:
from prometheus_client import Counter
requests = Counter('http_requests_total', 'Total HTTP requests')
requests.inc()
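Counters are frequently partitioned by labels. A short Go sketch using a CounterVec (the label names here are illustrative):

package main

import "github.com/prometheus/client_golang/prometheus"

// Each distinct method/code combination becomes its own time series.
var httpRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests by method and status code.",
    },
    []string{"method", "code"},
)

func main() {
    prometheus.MustRegister(httpRequests)

    httpRequests.WithLabelValues("GET", "200").Inc()
    httpRequests.WithLabelValues("POST", "500").Add(3)
}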
9. Counters in the Exposition Format
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total 1543
This value never decreases unless there's a reset.
10. Querying Counters (Absolute Values vs. Rates)
Absolute value:
http_requests_total
Rate of increase:
rate(http_requests_total[1m])
rate() returns the per-second average increase:
sum(rate(http_requests_total[5m])) by (method)
Useful for dashboards and alerting thresholds.
11. Summaries
Summaries are used to track observations (e.g., request durations, response sizes) and produce:
Quantiles (e.g., 0.5, 0.9, 0.99)
Sum of all observations
Count of observations
12. Constructing Summaries
In Go:
summary := prometheus.NewSummary(prometheus.SummaryOpts{
    Name:       "request_duration_seconds",
    Help:       "Request duration in seconds",
    Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
})
summary.Observe(1.2)
13. Summary Instrumentation Methods
observe(value): adds a new observation
Objectives (target quantiles) are predefined at construction time
Summaries provide real-time quantile approximations (but with trade-offs like memory usage and no aggregation across labels).
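A common instrumentation pattern is timing a piece of work and feeding the elapsed duration into the summary. A sketch using client_golang's Timer helper (the handler body is a placeholder):

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
)

var requestDuration = prometheus.NewSummary(prometheus.SummaryOpts{
    Name:       "request_duration_seconds",
    Help:       "Request duration in seconds",
    Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
})

func handler(w http.ResponseWriter, r *http.Request) {
    // ObserveDuration records the elapsed time as one observation when the handler returns.
    timer := prometheus.NewTimer(requestDuration)
    defer timer.ObserveDuration()

    w.Write([]byte("ok")) // placeholder for real work
}

func main() {
    prometheus.MustRegister(requestDuration)
    http.HandleFunc("/", handler)
    http.ListenAndServe(":8080", nil)
}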
14. Querying Summaries
Summary metrics are split into:
_count: number of observations
_sum: total of all observed values
the base metric name with a quantile label (e.g., request_duration_seconds{quantile="0.9"}): quantile estimates
Example:
rate(request_duration_seconds_sum[5m]) / rate(request_duration_seconds_count[5m])
This calculates average request duration over 5 minutes.
15. Histograms
Histograms group observations into configurable buckets and count how many fall into each.
They provide:
Bucketed counts
_count and _sum series
Percentile approximations (computed by Prometheus at query time, not by the client)
16. Cumulative Histograms
Histograms use cumulative buckets:
request_duration_seconds_bucket{le="0.1"} 240
request_duration_seconds_bucket{le="0.5"} 756
request_duration_seconds_bucket{le="1"} 999
request_duration_seconds_bucket{le="+Inf"} 1024
Each bucket includes the count for that threshold and below.
17. Constructing Histograms
In Go:
hist := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "request_duration_seconds",
    Help:    "Histogram of request durations",
    Buckets: prometheus.LinearBuckets(0.1, 0.1, 10),
})
hist.Observe(0.3)
Choose buckets wisely to match the distribution of your data.
18. Histogram Instrumentation Methods
observe(value): records a value
Buckets must be set at creation and are immutable
19. Histograms in the Exposition Format
# HELP request_duration_seconds Histogram of request durations
# TYPE request_duration_seconds histogram
request_duration_seconds_bucket{le="0.1"} 123
request_duration_seconds_bucket{le="0.5"} 456
request_duration_seconds_bucket{le="+Inf"} 789
request_duration_seconds_count 789
request_duration_seconds_sum 105.6
Prometheus computes quantiles during query time (not client side like summaries).
20. Querying Histograms
Example: 95th percentile from histogram:
histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m]))
Average duration:
rate(request_duration_seconds_sum[5m]) / rate(request_duration_seconds_count[5m])
21. Average Request Latencies
Both Summaries and Histograms can calculate average latency:
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
Histograms are preferred for aggregations across labels, which summaries cannot do.
22. Native Histogram
Introduced in Prometheus v2.40+, Native Histograms are experimental and designed to:
Reduce memory usage
Support better quantile approximation
Be more efficient at high cardinality
Native histograms are exposed as a new sample type and are scraped via the protobuf exposition format.
They are enabled in Prometheus using the feature flag:
--enable-feature=native-histograms
Unlike regular histograms, native histograms don't require pre-defined buckets and dynamically adapt based on the distribution of data.
PromQL Data Selection Explained
Prometheus Query Language (PromQL) is a powerful tool used for querying time series data. At the heart of PromQL are selectors: constructs that define what data to fetch, from which series, and over what time range.
This section focuses on both instant vector and range vector selectors, label matchers, and all modifiers that affect how and when data is retrieved and evaluated.
1. Instant Vector Selectors
An instant vector selector retrieves the latest sample for each time series at a single point in time (usually "now").
Syntax:
http_requests_total
This fetches all time series with the metric name http_requests_total at the current moment.
2. Label Matchers
Label matchers refine vector selectors by filtering based on metric labels.
Types of matchers:
Matcher | Description | Example |
= | Equals | {job="api"} |
!= | Not equals | {status!="500"} |
=~ | Regex match | {method=~"GET|POST"} |
!~ | Negative regex match | {job!~"dev-.*"} |
Example:
http_requests_total{job="api", status=~"2.."}
3. Visualizing Instant Vector Selector Behavior (Lookback Delta)
Prometheus doesn't scrape metrics exactly at the evaluation moment. It looks backwards in time using the lookback delta (default: 5m).
If no sample exists within 5 minutes, Prometheus drops the series from the result.
So:
metric_name
...returns the most recent value within the last 5 minutes.
4. Staleness Markers and Staleness Handling
Prometheus uses staleness markers to detect when a series stops being reported (e.g., app crashed). If a time series disappears, Prometheus marks it as stale.
Staleness is used to:
Stop evaluating old data
Avoid misleading results
These markers are invisible in PromQL but affect evaluation.
5. Range Vector Selectors
A range vector selector retrieves all samples for each time series over a specified time interval.
Syntax:
http_requests_total[5m]
This selects all values in the last 5 minutes for each series. The output is a range vector, which can be passed into functions like rate().
6. Visualizing Range Vector Selector Behavior
Range vectors return a set of time-stamped samples.
For example:
rate(http_requests_total[1m])
Behind the scenes, Prometheus:
Selects all samples in the last 1 minute per series
Calculates per-second rate of increase
Each data point in a graph represents a separate evaluation of the range.
7. Relative Offsets (offset Modifier)
The offset modifier shifts the evaluation time back in time.
Example:
http_requests_total offset 1h
Returns the value of http_requests_total from 1 hour ago (either as an instant or range vector, depending on the selector type).
Can be combined with range vectors:
rate(http_requests_total[5m] offset 1h)
This gives the 5-minute rate calculated 1 hour ago.
8. Visualizing Offsets for Instant Vector Selectors
If now is 16:00, then:
metric_name offset 1h
...evaluates the value of metric_name at 15:00.
It works like a time machine for metrics.
9. Offset Use Cases
Use offsets to:
Compare current data to past performance
Detect regressions
Create "previous week" or "same time yesterday" graphs
Example:
(rate(http_requests_total[5m]) - rate(http_requests_total[5m] offset 1d)) / rate(http_requests_total[5m] offset 1d)
This shows the percentage change from yesterday.
10. Visualizing Offsets for Range Vector Selectors
Example:
rate(metric[1h] offset 2h)
Assume current time is 18:00:
Evaluation time: 18:00
Range: 1h
Offset: 2h
This evaluates over the window from 15:00 to 16:00.
11. Absolute Evaluation Timestamps (@ Modifier)
The @ modifier lets you run a query as if it were evaluated at an exact timestamp.
Syntax:
http_requests_total @ 1714606800
Uses a Unix timestamp in seconds
Available since Prometheus v2.25 behind a feature flag and enabled by default since v2.33
Use cases:
Forensics
Debugging exact past states
Deterministic exports
12. Visualizing Absolute Evaluation Timestamps
Imagine this query:
rate(http_requests_total[5m]) @ 1714606800
Prometheus computes the 5-minute rate at the exact time 1714606800.
This enables reproducibility of data snapshots and avoids skew from real-time evaluations.
13. Syntactic Order for Modifiers
When combining modifiers (offset, @), both must directly follow the selector they modify.
Example:
metric[5m] offset 1h @ 1714606800
offset and @ may be written in either order; the offset is always applied relative to the @ timestamp
Read it like: "take the 5-minute range ending 1 hour before the given timestamp"
Placing a modifier anywhere other than directly after a selector results in a parse error.
Understanding Counter Rates and Increases in PromQL
Prometheus counters represent monotonically increasing values, such as the number of requests processed or bytes transferred. Understanding how to interpret, calculate, and query these counters accurately is essential for time-series analytics.
1. Absolute Counter Values and Why We Want Rates
Absolute Values:
Counters like http_requests_total grow over time. They show the total amount of something that has occurred.
Example:
http_requests_total
This shows the current cumulative count of HTTP requestsโbut doesn't tell how fast they're coming in.
Why We Want Rates:
Absolute values don't show trends or activity levels. We usually want:
How many requests per second?
How fast is the traffic increasing?
Thus, we compute rates (changes per time unit).
2. The Three Counter Increase Functions
Prometheus provides 3 main functions to evaluate counter growth over time:
Function | Description |
rate() | Calculates per-second average rate |
increase() | Calculates absolute increase over a period |
irate() | Calculates per-second rate using last 2 points (instant rate) |
3. Behavior of rate() and increase()
rate():
Used with range vectors, gives the average rate per second over the range.
Syntax:
rate(http_requests_total[5m])
This calculates how many requests per second happened on average over the last 5 minutes.
increase():
Calculates total increase over a time range.
Syntax:
increase(http_requests_total[5m])
If 100 requests were made during the 5-minute window, this returns 100.
4. Handling Counter Resets
Prometheus counters may reset (e.g., due to app restart). Prometheus automatically detects this by identifying a lower value than before.
PromQL functions like rate() and increase():
Detect these resets
Skip invalid segments
Continue calculating using valid portions
If a reset is detected:
... -> 950 -> 980 -> 10 (reset) -> 50
increase() computes: (980 - 950) + 10 + (50 - 10) = 30 + 10 + 40 = 80, because the value after the reset counts as growth from zero.
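To make the reset handling concrete, here is a small illustrative Go sketch (not Prometheus source code) that walks raw counter samples and accumulates the increase, treating any drop as a reset from zero:

package main

import "fmt"

// increaseWithResets sums the growth of a counter series, treating any decrease
// as a counter reset; the post-reset value counts as growth from zero.
func increaseWithResets(samples []float64) float64 {
    total := 0.0
    for i := 1; i < len(samples); i++ {
        if samples[i] >= samples[i-1] {
            total += samples[i] - samples[i-1]
        } else {
            total += samples[i] // reset detected: counter restarted at zero
        }
    }
    return total
}

func main() {
    // 950 -> 980 -> 10 (reset) -> 50 yields 30 + 10 + 40 = 80
    fmt.Println(increaseWithResets([]float64{950, 980, 10, 50}))
}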
5. Calculating the rate() and increase() Slope
Prometheus does not fit a regression line; it takes the first and last samples inside the range window, computes the increase between them (after reset correction), and extrapolates that slope toward the window boundaries.
Example:
For increase(http_requests_total[5m]), Prometheus:
Gathers all samples in the 5-minute window
Extrapolates a value at the window start
Extrapolates a value at the window end
Computes the difference between them
Mathematically:
increase = value_end - value_start
rate = increase / duration_in_seconds
6. Extrapolating the Return Value for the increase() Function
Prometheus doesn't just blindly subtract endpoints. It extrapolates when samples don't exist exactly at the boundaries.
If samples don't line up exactly with the window boundaries, Prometheus extrapolates the first and last samples toward the boundaries and scales the raw increase to cover the whole window.
This avoids underestimating counters when scrapes are missed or irregular.
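The following Go sketch approximates that boundary extrapolation (a simplification: it assumes the samples are already reset-corrected and omits the extra rule, covered in the later sections, that stops a counter from being extrapolated below zero):

package main

import (
    "fmt"
    "math"
)

// extrapolatedIncrease takes reset-corrected samples (timestamps in seconds plus
// values) inside a range window and extrapolates the raw increase toward the
// window boundaries, by at most half an average sample interval on each side.
func extrapolatedIncrease(times, values []float64, rangeStart, rangeEnd float64) float64 {
    n := len(values)
    if n < 2 {
        return math.NaN() // PromQL returns no result with fewer than two samples
    }
    rawIncrease := values[n-1] - values[0]
    sampledInterval := times[n-1] - times[0]
    avgInterval := sampledInterval / float64(n-1)

    startGap := math.Min(times[0]-rangeStart, avgInterval/2)
    endGap := math.Min(rangeEnd-times[n-1], avgInterval/2)

    return rawIncrease * (sampledInterval + startGap + endGap) / sampledInterval
}

func main() {
    // Samples at t=15s..285s inside a 0..300s (5m) window: raw increase 90, extrapolated ~100.
    times := []float64{15, 105, 195, 285}
    values := []float64{100, 130, 160, 190}
    fmt.Printf("%.1f\n", extrapolatedIncrease(times, values, 0, 300))
}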
7. Confusing Extrapolating for Slow-Moving Counters
Slow-moving counters (e.g., errors that happen once per hour) can confuse users.
Example:
increase(errors_total[5m])
If one error occurred 4m ago, Prometheus extrapolates to assume a partial contribution across the 5-minute window.
It may look like a fractional increase (e.g., 1.3 instead of exactly 1), which surprises users expecting whole numbers.
Prometheus is mathematically correct, but interpretation requires caution for low-frequency events.
8. Limiting Extrapolating to Zero Sample Values
When extrapolating toward the window boundaries, Prometheus also limits how far it goes: it extrapolates by at most half an average sample interval on each side, and for counters it stops at the point where the extrapolated value would drop below zero, since a counter can never be negative.
But be careful: a zero or near-zero increase does not necessarily mean zero traffic. It might mean:
No data scraped
Metric not emitted
Actual zero traffic
Use alerting rules or metadata checks (for example, absent()) to detect missing data.
9. The irate() Function
irate() (instant rate) computes the rate between the two most recent samples in a range.
Syntax:
irate(http_requests_total[5m])
Uses just the last two data points
No interpolation, no smoothing
Ideal for spiky, fast-changing signals
Use with caution on slow counters: it can be misleading if data is sparse.
10. Which Function Should You Use?
Use Case | Function |
Trends, averages, smoothing | rate() |
Absolute counts over time | increase() |
Current/instantaneous values | irate() |
Alerting (on spikes, errors) | rate() or irate() |
SLO calculation | increase() (e.g., over a day/week) |
Rule of thumb:
Dashboards: use rate() for visual stability.
SLO math: use increase() to count events.
High-frequency alerting: use irate() if detection latency is critical.
Understanding "up" and Friends in Prometheus
1. Prometheus Server Configuration
Before exploring up and other auto-generated metrics, it's crucial to understand how Prometheus is configured to monitor targets:
Configuration File: Prometheus uses a prometheus.yml configuration file to define scraping jobs.
scrape_configs: Within this file, the scrape_configs block defines how Prometheus should discover and collect metrics from targets.
Example:
scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
Each job defines:
job_name: a label for the scrape group.
targets: IP addresses or hostnames of endpoints exposing metrics.
metrics_path (default: /metrics)
Optional: relabeling, authentication, TLS, and timeouts.
When Prometheus starts, it uses this configuration to initialize target discovery and begin scraping metrics.
2. Inspecting Targets in Prometheus
To verify that Prometheus is correctly scraping your services:
Navigate to the Targets Page:
- URL: http://<your-prometheus-host>:9090/targets
This page displays:
Job names and their associated targets.
Scrape status (up/down).
Last scrape duration and timestamp.
Labels associated with each target.
Importance: This helps you quickly see which targets are reachable and why some may be down.
Health Status: The last scrape error field and the color-coded status let you identify failures in real time.
3. Showing All Auto-Generated Metrics
Prometheus automatically generates metrics about its own operation (exposed on its own endpoint at http://localhost:9090/metrics) and, for every target it scrapes, synthetic per-target series such as up and scrape_duration_seconds, which are stored in the TSDB and queryable like any other metric.
To view a list of all available metrics in the UI:
Go to http://localhost:9090/graph
Click on the "insert metric at cursor" dropdown or start typing in the expression field.
Metrics like up, scrape_duration_seconds, and scrape_samples_post_metric_relabeling appear.
4. The "up" Metric
This is the most important health metric in Prometheus.
Definition:
up is a gauge metric automatically generated by Prometheus to indicate whether a target is reachable.
Values:
1: the scrape was successful (target is UP).
0: the scrape failed (target is DOWN or unreachable).
Labels:
up{job="node_exporter", instance="localhost:9100"} 1
Use Case: You can use this in alerting rules:
- alert: TargetDown
  expr: up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Target {{ $labels.instance }} is down"
Internals: up is computed based on whether Prometheus received a valid HTTP 200 response and successfully parsed the metrics from the target.
5. Other Auto-Generated Metrics
Prometheus exposes several internal metrics for diagnostics and performance monitoring:
Metric | Description |
scrape_duration_seconds | Time taken to scrape a target |
scrape_samples_scraped | Number of samples scraped in the last scrape |
scrape_samples_post_metric_relabeling | Number of samples retained after relabeling |
scrape_series_added | Number of series added in the scrape |
scrape_timeout_seconds | Timeout setting per scrape |
prometheus_sd_* | Service discovery subsystem metrics |
prometheus_target_* | Metrics on target health and discovery |
prometheus_engine_* | Query engine performance |
prometheus_tsdb_* | Storage subsystem metrics (compaction, WAL, memory usage) |
Example:
scrape_duration_seconds{job="node_exporter", instance="localhost:9100"} 0.023
These can be used to:
Detect scrape performance issues
Analyze ingestion rate
Tune Prometheus server configuration
6. Auto-Generated Metrics in the Prometheus Documentation
Prometheus maintains complete documentation of its internal metrics:
Official Reference:
- Prometheus Internal Metrics Documentation
The documentation includes:
Metric name
Type (gauge, counter)
Description
Associated labels
Subsystem/component
Use Case: These are especially useful for:
Monitoring Prometheus server health
Creating dashboards (e.g., Grafana Prometheus dashboards)
Debugging ingestion issues
Auditing scrape errors
Summary Table: Key Auto-Generated Metrics
Metric Name | Type | Purpose |
up | Gauge | Indicates if the target was successfully scraped |
scrape_duration_seconds | Gauge | Scrape latency |
scrape_samples_scraped | Gauge | Number of metrics collected per scrape |
prometheus_target_interval_length_seconds | Gauge | Actual vs expected interval duration |
prometheus_engine_query_duration_seconds | Histogram | Duration of PromQL queries |
prometheus_tsdb_head_series | Gauge | Total active series in TSDB |
Understanding Prometheus Histograms
1. Motivation and Histogram Basics
Histograms in Prometheus are used to observe and record the distribution of events over a set of predefined buckets. They are particularly useful for understanding the behavior of applications, such as response times, request sizes, or any measurable quantity that can be categorized.
2. Need to Measure Request Durations/Latency
Monitoring request durations or latency is crucial for:
Performance Analysis: Understanding how fast your application responds.
SLA/SLO Compliance: Ensuring response times meet agreed standards.
Bottleneck Identification: Detecting slow components in your system.
Histograms allow you to see not just averages but the distribution of response times, which is vital for comprehensive performance monitoring.
3. Downsides of Using Event Logging
While event logging provides detailed insights, it has limitations:
High Overhead: Logging every event can consume significant resources.
Complex Analysis: Aggregating and analyzing logs for metrics is cumbersome.
Latency: Real-time analysis is challenging due to the volume of data.
Histograms offer a more efficient way to monitor metrics like latency without the overhead of detailed logging.
4. Why a Single Gauge Doesn't Help Us
A gauge represents a single numerical value that can go up or down. Using a gauge for metrics like request duration is inadequate because:
Lack of Distribution: Gauges show only the current value, not the spread.
No Historical Context: They don't provide insights into past performance.
Inability to Calculate Percentiles: Gauges can't be used to compute percentiles like the 95th percentile.
5. Downsides of Using Prometheus Summary Metrics
Summaries in Prometheus can calculate quantiles but have drawbacks:
Client-Side Calculation: Quantiles are calculated on the client, limiting flexibility.
No Aggregation Across Instances: Summaries can't be aggregated across multiple instances.
Static Configuration: Quantile objectives must be predefined.
Histograms, on the other hand, allow server-side aggregation and dynamic quantile calculation.
6. Prometheus Histogram Example for Tracking Request Durations
To track request durations:
httpDuration := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "A histogram of the HTTP request durations.",
    Buckets: prometheus.DefBuckets,
})
This setup records the duration of HTTP requests into predefined buckets, enabling detailed analysis of response times.
7. How Can We Expose Histograms as Time Series to Prometheus?
Prometheus histograms are exposed as multiple time series:
<metric>_bucket{le="..."}: cumulative count of observations less than or equal to the bucket's upper bound.
<metric>_sum: sum of all observed values.
<metric>_count: total number of observations.
These time series allow Prometheus to store and query histogram data effectively.
8. Cumulative Histogram Representation
Consider a regular (non-cumulative) histogram of request durations. Such a chart plots:
X-axis: Duration in milliseconds (e.g., 25ms, 50ms, 100ms, etc.).
Y-axis: Count of observations that fall within a specific bucket.
Each bar height represents the number of observations between two bounds.
Bucket Counts:
Bucket Range (ms) | Count |
≤ 25 | 31 |
25–50 | 32 |
50–100 | 105 |
100–250 | 617 |
> 250 | 215 |
This means, for example, 617 requests took between 100ms and 250ms.
Prometheus stores histograms in a cumulative format rather than the regular format shown above.
A cumulative histogram gives the running total of observations up to each bucket's upper bound:
Bucket (le = "less than or equal to") | Cumulative Count |
le="25" | 31 |
le="50" | 63 (31+32) |
le="100" | 168 (63+105) |
le="250" | 785 (168+617) |
le="+Inf" | 1000 (785+215) |
So instead of individual bars, each bucket value contains the total number of observations less than or equal to the upper bound.
Summary of the Difference
Feature | Regular Histogram (Image) | Cumulative Histogram (Prometheus) |
Bucket Value | Observations within a range | Observations up to a bound |
Data Representation | Independent bar heights | Accumulated total at each threshold |
Example | 105 requests took 50โ100ms | 168 requests took โค100ms |
9. The Special "le" (Less-Than-Or-Equal) Bucket Upper Bound Label
In Prometheus, histograms use bucketed counts to record how many observations fall below certain thresholds.
Each bucket is labeled with:
le="X"
Which means:
"Count of observations less than or equal to X."
For example:
le="25" - number of observations ≤ 25 ms
le="50" - number of observations ≤ 50 ms
...
le="+Inf" - total count of all observations (since everything is ≤ infinity)
From the example above:
Bucket (le) | Cumulative Count |
≤ 25 ms | 31 |
≤ 50 ms | 63 |
≤ 100 ms | 168 |
≤ 250 ms | 785 |
≤ +Inf | 1000 |
Interpretation:
From 0–25 ms: 31 requests completed
25–50 ms: 63 - 31 = 32 requests
50–100 ms: 168 - 63 = 105 requests
100–250 ms: 785 - 168 = 617 requests
250 ms and above: 1000 - 785 = 215 requests
Summary
The le label tells you the upper bound of the bucket.
These buckets are cumulative: each includes all lower durations.
Subtracting adjacent bucket values gives the number of samples in that range.
The bucket with le="+Inf" always contains the total number of samples.
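The subtraction described above is easy to express in code. An illustrative Go sketch that converts the cumulative counts (keyed by their le upper bounds) back into per-range counts:

package main

import (
    "fmt"
    "math"
)

func main() {
    // Cumulative counts by bucket upper bound (le), as Prometheus exposes them.
    bounds := []float64{25, 50, 100, 250, math.Inf(1)}
    cumulative := []float64{31, 63, 168, 785, 1000}

    prev := 0.0
    for i, c := range cumulative {
        // Per-bucket count = this cumulative count minus the previous one.
        fmt.Printf("le=%v ms: %v observations in this range\n", bounds[i], c-prev)
        prev = c
    }
}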
10. Time Series Exposed from a Histogram Metric
A cumulative histogram of event durations (in seconds) places the bucket boundaries (e.g. le="0.025") along the X-axis and the cumulative count of observations along the Y-axis.
The bucket time series exposed by a Prometheus histogram metric named http_request_duration_seconds would look like:
http_request_duration_seconds_bucket{le="0.025"}
http_request_duration_seconds_bucket{le="0.05"}
http_request_duration_seconds_bucket{le="0.1"}
http_request_duration_seconds_bucket{le="0.25"}
http_request_duration_seconds_bucket{le="+Inf"}
Each of these is a separate time series, and their values increase cumulatively as more events fall into that bucket or smaller.
How to Interpret le
Each le value is an upper boundary, meaning:
le="0.025" - all durations ≤ 25 ms
le="0.05" - all durations ≤ 50 ms
le="0.1" - all durations ≤ 100 ms
...
le="+Inf" - all observations (total count)
Behind the Scenes: Prometheus Histogram Export
A histogram metric in Prometheus (like http_request_duration_seconds) exposes 3 types of time series automatically:
Series Type | Purpose |
*_bucket{le="..."} | Buckets by le , cumulative counts |
*_count | Total count of observations |
*_sum | Total sum of all observed values |
So for http_request_duration_seconds, you'll see:
http_request_duration_seconds_bucket{le="0.025"}
http_request_duration_seconds_bucket{le="0.05"}
...
http_request_duration_seconds_bucket{le="+Inf"}
http_request_duration_seconds_sum
http_request_duration_seconds_count
Why It Matters
You can compute percentiles using these buckets (e.g. 95th percentile from histogram approximation).
Subtracting two adjacent buckets gives the count in that interval.
It enables time-based slicing (e.g. rate of slow responses over the last 5 minutes).
11. Instrumentation - Adding Histograms to Your Code
To instrument your code with histograms:
- Define the Histogram:
var requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "Histogram of response time for handler.",
    Buckets: prometheus.LinearBuckets(0.05, 0.05, 20),
})
- Register the Histogram:
prometheus.MustRegister(requestDuration)
- Observe Values:
start := time.Now()
// handle request
duration := time.Since(start).Seconds()
requestDuration.Observe(duration)
12. Adding Histograms Without Additional Labels
When adding histograms without additional labels:
Simplifies Aggregation: Easier to aggregate across instances.
Reduces Cardinality: Fewer unique time series, conserving resources.
Use Case: Suitable for global metrics where differentiation isn't necessary.
13. Adding Histograms With Additional Labels
Adding labels to histograms allows for more granular analysis:
var requestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Histogram of response time for handler.",
        Buckets: prometheus.LinearBuckets(0.05, 0.05, 20),
    },
    []string{"method", "endpoint"},
)
This setup enables you to analyze request durations by HTTP method and endpoint.
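Putting it together, each handler observes its own duration under the matching label values. A minimal Go sketch (the handler path and label values are illustrative):

package main

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

var requestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Histogram of response time for handler.",
        Buckets: prometheus.LinearBuckets(0.05, 0.05, 20),
    },
    []string{"method", "endpoint"},
)

func ordersHandler(w http.ResponseWriter, r *http.Request) {
    start := time.Now()

    // ... handle the request ...

    // Record the duration in the series for this method/endpoint combination.
    requestDuration.WithLabelValues(r.Method, "/api/orders").Observe(time.Since(start).Seconds())
}

func main() {
    prometheus.MustRegister(requestDuration)
    http.HandleFunc("/api/orders", ordersHandler)
    http.ListenAndServe(":8080", nil)
}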
14. Querying Histograms with PromQL
PromQL provides functions to query histograms:
rate(): calculates the per-second average rate of increase.
increase(): calculates the total increase over a time range.
histogram_quantile(): estimates quantiles from histogram buckets.
Example:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
This query estimates the 95th percentile of request durations over the last 5 minutes.
15. Querying All Bucket Series of a Histogram
To retrieve all bucket series:
http_request_duration_seconds_bucket
This returns all time series with the _bucket suffix, allowing you to analyze the distribution across all buckets.
16. Querying Percentiles/Quantiles Using histogram_quantile()
The histogram_quantile() function estimates quantiles:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
This computes the 95th percentile by summing the rate of increase across all buckets and applying the quantile function.
17. Using rate() or increase() to Limit a Histogram to Recent Increases
To focus on recent data:
rate(): provides the per-second average rate over a time range.
increase(): gives the total increase over a time range.
Example:
rate(http_request_duration_seconds_bucket[5m])
This calculates the rate of increase for each bucket over the last 5 minutes.
18. Controlling the Smoothing Time Window
The time range specified in rate() or increase() controls the smoothing window:
Shorter Window: More responsive to recent changes but noisier.
Longer Window: Smoother results but less responsive to recent changes.
Choose the window size based on the desired balance between responsiveness and smoothness.
19. Aggregating Histograms and Percentiles Over Label Dimensions
To aggregate histograms across dimensions:
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
This sums the rate of increase for each bucket across all instances. You can then apply histogram_quantile() to compute percentiles:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
20. Errors of Quantile Calculation and Bucketing Schemas
Quantile estimation errors can arise due to:
Bucket Granularity: Coarse buckets lead to less accurate quantiles.
Data Distribution: Uneven distributions can skew results.
Interpolation Assumptions: histogram_quantile() assumes a uniform distribution within buckets.
To minimize errors:
Use Appropriate Buckets: Choose bucket boundaries that reflect your data distribution.
Monitor Bucket Usage: Ensure that most data falls within the defined buckets.
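To see where the interpolation error comes from, here is an illustrative Go sketch of the linear interpolation that histogram_quantile() applies inside the bucket containing the requested rank (a simplification of the real function, using the example counts from earlier):

package main

import "fmt"

// approxQuantile finds the bucket containing the target rank and interpolates
// linearly between its lower and upper bound, assuming observations are spread
// evenly inside the bucket; this assumption is the source of the error.
func approxQuantile(q float64, bounds, cumulative []float64) float64 {
    total := cumulative[len(cumulative)-1]
    rank := q * total
    lowerBound, lowerCount := 0.0, 0.0
    for i, c := range cumulative {
        if rank <= c {
            bucketCount := c - lowerCount
            return lowerBound + (bounds[i]-lowerBound)*(rank-lowerCount)/bucketCount
        }
        lowerBound, lowerCount = bounds[i], c
    }
    return bounds[len(bounds)-1]
}

func main() {
    bounds := []float64{0.025, 0.05, 0.1, 0.25, 0.5}
    cumulative := []float64{31, 63, 168, 785, 1000}
    // The 95th percentile falls somewhere in the 0.25-0.5s bucket; the true value
    // could be anywhere inside it, but the estimate assumes a uniform spread.
    fmt.Printf("%.3f\n", approxQuantile(0.95, bounds, cumulative))
}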
21. Showing Histograms as a Heatmap
Heatmaps provide a visual representation of histogram data over time:
X-Axis: Time.
Y-Axis: Bucket boundaries.
Color Intensity: Frequency of observations.
In Grafana:
Select Heatmap Panel.
Configure Data Source: Use Prometheus as the data source.
Enter Query: For example:
rate(http_request_duration_seconds_bucket[5m])
- Adjust Visualization Settings: Set appropriate axes and color schemes.
22. Querying Request Rates Using _count
To calculate the rate of requests:
rate(http_request_duration_seconds_count[5m])
This provides the per-second rate of HTTP requests over the last 5 minutes.
23. Querying Average Request Durations Using _sum and _count
To compute the average request duration:
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
This divides the total duration by the number of requests, yielding the average duration per request.
Creating Grafana Dashboards for Prometheus
1. Option A: Running Grafana Using Docker
Step-by-Step Instructions
Prerequisites:
Docker installed on your system.
Prometheus already running (can also be in Docker).
Start Grafana using Docker:
docker run -d \
  -p 3000:3000 \
  --name=grafana \
  grafana/grafana
This command:
Runs Grafana in the background (-d)
Maps Grafana's port 3000 to your local machine
Names the container grafana
Check if it's running:
Visit http://localhost:3000
2. Option B: Running Grafana Using Pre-Built Binaries
Prerequisites:
Installed Prometheus
Installed Grafana binary for your OS from:
https://grafana.com/grafana/download
Installation Steps:
Windows:
Unzip the downloaded Grafana
.zip
file.Open a terminal (
cmd
) and navigate to thebin
folder inside the extracted directory.Run:
grafana-server.exe
Linux:
tar -zxvf grafana-<version>.linux-amd64.tar.gz
cd grafana-<version>
./bin/grafana-server
Grafana will run on http://localhost:3000.
3. Logging into Grafana
First Login
Open a browser and visit: http://localhost:3000
Default credentials:
Username: admin
Password: admin
You'll be asked to change the password on first login.
4. Creating a Prometheus Data Source
Add Prometheus as a Data Source:
In the left sidebar, click the gear icon (Configuration) > Data Sources
Click "Add data source"
Choose "Prometheus"
Under HTTP > URL, enter:
http://localhost:9090
(Replace localhost:9090 with your actual Prometheus URL if it differs)
Click "Save & Test"
- You should see a green message: Data source is working
5. Creating a New Dashboard
Steps to Create a Dashboard:
Click the "+" (plus) icon in the left sidebar > Dashboard
Click "Add new panel"
You'll now see a new panel editor with default settings
At the top, name your dashboard (click on the title "New dashboard")
Click Save (floppy disk icon) in the top right > give it a name > Save
6. Creating a Time Series Chart
Steps to Add a Time Series Panel:
In your new dashboard, click "Add new panel"
Choose Visualization type: Time series
In the Query section:
Set Data Source: Prometheus
Enter query:
rate(http_requests_total[5m])
Click Run to see the graph populate.
Customize:
Panel title, units (like seconds, ms, etc.)
Axes (logarithmic or linear)
Legend display
Click Apply to save the panel to your dashboard
7. Creating a Gauge Panel
Steps to Add a Gauge:
Click "Add panel", then in the Visualization options select Gauge
In the Query box, enter something like:
http_requests_total
- or a value-producing metric like:
sum(rate(cpu_usage_seconds_total[1m]))
Configure:
Min & Max range (example: 0 to 100 for percentages)
Thresholds (to color the gauge: green/yellow/red)
Unit: e.g., percent, seconds, req/sec
Click Apply
8. Creating a Table Panel
Steps to Add a Table Panel:
Click "Add panel"
Select Visualization > Table
In the query section, use a metric that returns multiple labels/values:
topk(5, rate(http_requests_total[1m]))
Under Format:
Set to "Table"
Adjust time range, value format
Style:
Add column aliases
Apply unit types (seconds, bytes, %, etc.)
Click Apply
9. Adding Rows to the Dashboard
Organize Panels Using Rows:
In the dashboard view, click the dropdown menu (three-dot icon) in the upper right
Select "Add row"
Enter a name for the row (e.g., "Performance Metrics")
Drag and drop existing panels into this row
Use rows to group related panels:
CPU Stats
Memory Usage
Latency Tracking
Rows can be collapsed/expanded, improving usability in large dashboards.
Final Touches
Use "Dashboard Settings" (gear icon at the top) to:
Set auto-refresh (e.g., every 10s, 30s, etc.)
Set default time range
Add dashboard-level variables
Monitoring Linux Host Metrics with Prometheus
1. Downloading and Unpacking the Node Exporter
The Node Exporter is an official Prometheus exporter for exposing hardware and OS metrics from *nix systems.
Steps:
Download:
Go to: https://prometheus.io/download/#node_exporter
Or directly use:
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.0/node_exporter-1.8.0.linux-amd64.tar.gz
Unpack:
tar -xvf node_exporter-1.8.0.linux-amd64.tar.gz
cd node_exporter-1.8.0.linux-amd64
2. Node Exporter Command-Line Flags
The Node Exporter has many flags to control which metrics it exposes.
Common Flags:
./node_exporter \
--web.listen-address=":9100" \
--web.telemetry-path="/metrics" \
--collector.cpu \
--collector.meminfo \
--collector.diskstats
Flag Details:
Flag | Description |
--web.listen-address | Address/port to serve metrics (default :9100 ) |
--web.telemetry-path | Path where metrics are exposed (default /metrics ) |
--collector.<name> | Enable or disable individual collectors |
You can list all collectors with:
./node_exporter --help
3. Running the Node Exporter
Start Node Exporter (basic way):
./node_exporter
It will start serving metrics on:
http://localhost:9100/metrics
Run in Background (production):
nohup ./node_exporter > node_exporter.log 2>&1 &
Or, create a systemd service (recommended for servers):
sudo nano /etc/systemd/system/node_exporter.service
Paste:
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=nobody
ExecStart=/path/to/node_exporter
[Install]
WantedBy=default.target
Enable & start:
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
4. Inspecting the Node Exporter's /metrics Endpoint
Open in browser or curl:
curl http://localhost:9100/metrics
You'll see raw Prometheus metrics like:
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
node_cpu_seconds_total{cpu="0",mode="user"} 3452.92
node_memory_MemAvailable_bytes 123456789
node_filesystem_size_bytes{...} 1099511627776
These are the real-time system stats exposed to Prometheus.
5. Scraping the Node Exporter with Prometheus
Modify the prometheus.yml config:
Add the Node Exporter as a static target:
scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
If the Node Exporter runs on another host, replace localhost with that IP or hostname.
Restart Prometheus:
./prometheus --config.file=prometheus.yml
Or if using systemd:
sudo systemctl restart prometheus
6. Verifying Successful Target Scrapes
Go to the Prometheus UI:
Visit: http://localhost:9090/targets
You should see:
job: node_exporter
target: localhost:9100
last scrape: <time>
status: UP
This confirms Prometheus is successfully scraping metrics.
7. Querying Node Exporter Metrics (CPU and Network Usage)
Example PromQL Queries:
CPU Usage (total per mode):
rate(node_cpu_seconds_total{mode!="idle"}[5m])
Memory Available:
node_memory_MemAvailable_bytes
Network Received:
rate(node_network_receive_bytes_total[1m])
Network Transmitted:
rate(node_network_transmit_bytes_total[1m])
Disk Space Used:
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes
8. Showing Host Metrics in Grafana
Visualizing in Grafana:
Prerequisites:
- Prometheus added as a data source in Grafana.
Steps to Create a System Monitoring Dashboard:
Create a new dashboard > Add Panel
Use these queries:
CPU Load (Time series):
rate(node_cpu_seconds_total{mode="user"}[5m])
Network Usage (Table or Graph):
rate(node_network_receive_bytes_total[1m])
rate(node_network_transmit_bytes_total[1m])
Disk Usage (Gauge):
100 * (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes
Memory Usage (Gauge):
100 * (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
Optional: Import Official Grafana Dashboard
Go to Grafana > Dashboards > Import
Use Dashboard ID: 1860 (Node Exporter Full)
Choose your Prometheus data source > Import
This provides a rich pre-built monitoring dashboard.
Don't Make These 6 Prometheus Monitoring Mistakes
Mistake 1: Cardinality Bombs
Problem:
Creating a high number of unique time series by using too many or high-variance labels (e.g., user IDs, IP addresses, request paths) causes cardinality explosions, which:
Consume excessive memory and CPU
Slow down queries and alert evaluations
Can crash Prometheus
Example:
http_requests_total{user_id="1234", session_id="a9b8c7"}
If every user and session has unique IDs, this results in millions of time series.
Best Practices:
- Avoid high-cardinality labels like user_id, session_id, request_path, etc.
- Use static or bounded labels like status, method, instance.
- Use aggregation or label_replace() to group data instead of exploding it (see the sketch below).
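As a sketch of that last point, assuming a hypothetical api_http_requests_total metric with an unbounded path label, label_replace() can collapse path variants into a bounded path_group label before aggregating:
sum by (path_group) (
  label_replace(
    rate(api_http_requests_total[5m]),
    "path_group", "/users/:id", "path", "/users/[0-9]+"
  )
)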
Mistake 2: Aggregating Away Too Many Labels
Problem:
When using sum() or avg() without carefully specifying by() labels, you lose context and might aggregate metrics incorrectly.
Example:
sum(rate(http_requests_total[5m]))
This sums all requests from all endpoints, all instances, and all statuses, losing all distinguishing information.
Solution:
sum by (job, instance, status) (rate(http_requests_total[5m]))
Keep important dimensional context
Aggregate only intentionally based on your alerting or visualization needs
Mistake 3: Unscoped Metric Selectors
Problem:
Writing PromQL like this:
up
…without any scoping labels, this queries every single up metric from all jobs, across all targets, including exporters and services you might not care about.
Consequences:
Wastes query time and resources
Can return noisy or misleading results
Hard to debug or tune alerts
Solution:
Scope it!
up{job="my_service"}
Or:
up{job=~"api|frontend"}
Use scoped selectors to reduce noise and make queries faster and more accurate.
Mistake 4: Missing for Durations in Alerting Rules
Problem:
Creating alerts without a for: clause in the rule leads to instantaneous alerts that fire as soon as a condition is true, even briefly, leading to flapping or false positives.
Example:
- alert: HighCPU
  expr: rate(node_cpu_seconds_total{mode="user"}[1m]) > 0.9
This could fire if CPU spikes just for a second.
Solution:
Add for: to wait before alerting:
- alert: HighCPU
  expr: rate(node_cpu_seconds_total{mode="user"}[1m]) > 0.9
  for: 2m
This ensures the alert only triggers if the condition holds continuously for 2 minutes.
Mistake 5: Too Short Rate Windows
Problem:
Using short windows for rate() or increase() (like [30s]) leads to noisy or erratic results, especially for low-frequency metrics.
Why?
rate() needs multiple samples to give meaningful results.
Short windows don't smooth over variations or delays.
Bad:
rate(http_requests_total[30s])
Good:
rate(http_requests_total[5m])
- Longer windows provide more stable, statistically accurate results.
- For alerts, use windows like [2m] to [5m].
- For dashboards, use dynamic durations like $__rate_interval in Grafana (see the example below).
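For example, in a Grafana panel query the window can be left to Grafana's built-in $__rate_interval variable instead of a hard-coded duration:
rate(http_requests_total[$__rate_interval])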
Mistake 6: Using Functions With Incorrect Metric Types
Problem:
Applying PromQL functions meant for one metric type (e.g., counters) to another type (e.g., gauges) leads to invalid or misleading results.
Example:
rate(node_memory_MemAvailable_bytes[5m])
This is incorrect: node_memory_MemAvailable_bytes is a gauge, not a counter, so rate() doesn't make sense.
Solution:
- Use rate() or increase() only with monotonically increasing counters.
- Use raw gauge values for metrics like memory, disk, temperature.
Function Compatibility:
Function | Works With | Description |
rate() | Counters | Rate of increase over time window |
increase() | Counters | Total increase over time window |
avg_over_time | All | Average value over time |
max_over_time | All | Maximum value over time |
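For instance, the *_over_time functions from the table can smooth a gauge or find its peak, using node_memory_MemAvailable_bytes from earlier:
avg_over_time(node_memory_MemAvailable_bytes[10m])
max_over_time(node_memory_MemAvailable_bytes[10m])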
Summary Table
Mistake | Root Cause | Consequences | Fix |
1. Cardinality Bombs | High-cardinality labels | Memory bloat, instability | Remove unbounded labels |
2. Over-Aggregation | Aggregating all labels | Loss of detail, inaccurate alerts | Use by(...) carefully |
3. Unscoped Selectors | No filtering in queries | Noisy, inefficient results | Use proper label filters |
4. Missing for | No delay in alerts | False positives | Add for: to alert rules |
5. Short Rate Windows | Tiny time ranges | Noisy or empty data | Use [2m] to [5m] |
6. Wrong Function Use | Using rate() on gauges | Misleading results | Match function to metric type |
Exposing Custom Host Metrics Using the Prometheus Node Exporter
1. ๐ "textfile" Collector Module Basics
โ
What is the textfile
collector?
A built-in module of the Node Exporter
Reads files containing Prometheus metric data in text exposition format
These files must be placed in a designated directory
Useful for ad-hoc, one-off, or custom metrics from scripts or non-Go code
How It Works:
- You create files like my_custom_metric.prom.
- Put them in the directory specified with:
--collector.textfile.directory=/var/lib/node_exporter/textfile_collector/
- Node Exporter reads those files at scrape time and exposes them as metrics under the /metrics endpoint.
Key Notes:
- Each file should use the .prom extension and contain valid Prometheus text format.
- You are responsible for removing or rotating the files; Node Exporter doesn't do cleanup.
- Avoid frequently rewriting large files (keep them small).
2. Exposing a Custom Cron Job Metric
Suppose you want to measure the success/failure of a backup script run by cron.
Bash Script Example:
#!/bin/bash
BACKUP_STATUS=0   # assume failure
if /usr/local/bin/backup.sh; then
  BACKUP_STATUS=1   # backup succeeded
fi
echo "# HELP backup_success Whether the backup succeeded (1) or failed (0)" > /var/lib/node_exporter/textfile_collector/backup.prom
echo "# TYPE backup_success gauge" >> /var/lib/node_exporter/textfile_collector/backup.prom
echo "backup_success $BACKUP_STATUS" >> /var/lib/node_exporter/textfile_collector/backup.prom
Cron Job Entry:
0 2 * * * /usr/local/bin/backup_metric.sh
This creates or updates /var/lib/node_exporter/textfile_collector/backup.prom every night at 2 AM. Node Exporter will serve that file as part of its /metrics output.
You can query this in Prometheus:
backup_success
3. Generating Metric Text Files From Go
You can also generate .prom files from Go programs that gather and export custom metrics.
Step-by-step Example:
Import Required Package
import (
"fmt"
"os"
)
Write Metrics to File
func writeCustomMetric(filename string, metricName string, value float64) {
	// Create (or truncate) the .prom file in the textfile collector directory.
	file, err := os.Create(filename)
	if err != nil {
		panic(err)
	}
	defer file.Close()

	// Write HELP, TYPE, and the sample in Prometheus text exposition format.
	fmt.Fprintf(file, "# HELP %s Custom metric\n", metricName)
	fmt.Fprintf(file, "# TYPE %s gauge\n", metricName)
	fmt.Fprintf(file, "%s %f\n", metricName, value)
}
Usage
func main() {
writeCustomMetric("/var/lib/node_exporter/textfile_collector/my_metric.prom", "my_custom_gauge", 42.0)
}
Run this Go program periodically (via cron or systemd timer) to update the metric.
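For example, a cron entry could run the compiled binary every five minutes (the binary path is only an assumption):
*/5 * * * * /usr/local/bin/my_metric_writer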
4. ๐ "textfile" Collector Example Scripts Repository
There is an official community-maintained repo with example scripts:
๐ https://github.com/prometheus/node-exporter-textfile-collector-scripts
What's in the repo?
Prebuilt scripts to collect metrics like:
SMART disk health
RAID status
Sensors temperature
Filesystem usage
Battery levels
Scripts written in bash, Python, or other languages
Designed to drop .prom files in the textfile_collector directory
Directory Structure:
/var/lib/node_exporter/textfile_collector/
├── smartctl.prom
├── sensors.prom
└── custom_ping_check.prom
Each .prom file contains one or more metrics in the appropriate format.
Example content of smartctl.prom:
# HELP smart_disk_ok Whether the disk passed SMART test
# TYPE smart_disk_ok gauge
smart_disk_ok{device="/dev/sda"} 1
smart_disk_ok{device="/dev/sdb"} 0
This allows you to alert on disk failure using PromQL.
Best Practices
Practice | Recommendation |
File format | Use only .prom extension and proper Prometheus text format |
File ownership | Ensure Node Exporter has read access to the files |
Script errors | Avoid creating invalid or partial .prom files (use temp file then rename) |
Performance | Don't create too many metrics or files; keep it lean. |
Rotation | Manually rotate or overwrite files regularly to avoid stale metrics |
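A minimal sketch of the "temp file then rename" practice from the table above, reusing the backup_success metric and textfile directory from earlier (assuming the temp file is created in the same directory, so the rename is atomic):
#!/bin/bash
TEXTFILE_DIR=/var/lib/node_exporter/textfile_collector
BACKUP_STATUS=1   # 1 = success in this example

# Write to a temp file in the same directory first...
TMP_FILE=$(mktemp "$TEXTFILE_DIR/backup.prom.XXXXXX")
{
  echo "# HELP backup_success Whether the backup succeeded (1) or failed (0)"
  echo "# TYPE backup_success gauge"
  echo "backup_success $BACKUP_STATUS"
} > "$TMP_FILE"
chmod 644 "$TMP_FILE"

# ...then rename, so Node Exporter never scrapes a half-written file.
mv "$TMP_FILE" "$TEXTFILE_DIR/backup.prom"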
Sample PromQL Queries
backup_success == 0
Alert if your backup fails.
smart_disk_ok == 0
Detect failing disks.
avg(node_custom_ping_latency_ms) by (target)
Get average ping latency from a script.
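As a sketch, the backup query above could back an alerting rule in a Prometheus rule file (the group name, for: duration, and severity are assumptions):
groups:
  - name: backups
    rules:
      - alert: BackupFailed
        expr: backup_success == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Nightly backup failed on {{ $labels.instance }}"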
Relabeling in Prometheus
1. Motivation for Relabeling
Prometheus scrapes targets and attaches labels to their metrics. Sometimes:
You want to modify these labels.
You want to drop or keep certain targets.
You want to rewrite target addresses.
You want to extract or clean up metadata from service discovery.
Relabeling provides a powerful and flexible way to transform labels or control scrape behavior.
2. Relabeling in the Prometheus Configuration File
Relabeling is configured in your prometheus.yml file under different contexts:
Section | Purpose |
relabel_configs | Target relabeling: modifies targets before scraping |
metric_relabel_configs | Metric relabeling: modifies individual metrics after scraping |
relabel_configs under remote_write | Modifies labels before sending metrics to a remote storage |
Example Layout:
scrape_configs:
  - job_name: 'example'
    static_configs:
      - targets: ['localhost:9100']
    relabel_configs:
      - source_labels: [__address__]
        regex: 'localhost:9100'
        target_label: instance
        replacement: 'my-node'
3. Relabeling Steps and Flow
Target Relabeling Flow:
Service discovery (SD) returns a list of target groups.
Each target gets label metadata (like __address__, __meta_kubernetes_pod_name, etc.).
These labels go through relabeling steps (relabel_configs).
The resulting targets are scraped if they're not dropped.
Metric Relabeling Flow:
After scraping, each metric passes through metric_relabel_configs.
Metrics can be dropped, relabeled, or kept based on the rules.
4. Relabeling Rule Structure and Fields
Each relabeling rule is a YAML dictionary with:
Field | Description |
source_labels | List of labels used as input |
separator | String used to join multiple source label values (default: ; ) |
regex | A regular expression to match against the joined string |
target_label | The label to write the result to |
replacement | String to use as replacement value |
action | What to do: replace, keep, drop, hashmod, labelmap, etc. |
Example:
- source_labels: [__meta_kubernetes_pod_name]
  regex: '(.*)'
  target_label: pod
  replacement: '$1'
  action: replace
5. Target Metadata Labels
When using service discovery (e.g., Kubernetes), targets come with metadata labels, prefixed with __meta_.
Examples:
__meta_kubernetes_namespace
__meta_kubernetes_pod_name
__meta_kubernetes_pod_label_app
These are temporary labels used during relabeling and are discarded afterward unless explicitly copied.
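A sketch of copying one of these metadata labels into a permanent label before it is discarded (the default regex and replacement copy the value as-is):
- source_labels: [__meta_kubernetes_namespace]
  target_label: namespace
  action: replace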
6. The Relabeling Visualizer Tool
Prometheus Relabel Debugger (relabeler.promlabs.com)
This web-based tool lets you:
Paste raw label sets
Test relabeling rules interactively
See how each step transforms your labels
Extremely useful for Kubernetes SD debugging
7. Example 1: Setting a Fixed Label Value
Add a new label env="prod" to all targets:
- target_label: env
  replacement: prod
  action: replace
8. Example 2: Overriding the Scrape Port
Force scraping on port 9100 regardless of what SD gives:
- source_labels: [__address__]
  regex: '(.*):.*'
  target_label: __address__
  replacement: '${1}:9100'
  action: replace
9. Example 3: Mapping Over Label Patterns
Map all labels with prefix __meta_kubernetes_pod_label_ into real labels:
- action: labelmap
  regex: __meta_kubernetes_pod_label_(.+)
This will turn:
__meta_kubernetes_pod_label_app="nginx"
Into:
app="nginx"
10. Example 4: Dropping Scraped Samples
Use metric_relabel_configs to drop unwanted metrics:
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'node_cpu_seconds_total'
    action: drop
Or drop entire targets from scrape:
relabel_configs:
  - source_labels: [__meta_kubernetes_namespace]
    regex: 'test-namespace'
    action: drop
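Conversely, a keep rule retains only matching targets and drops everything else. This sketch follows a common Kubernetes pattern (the annotation label name is an assumption here), keeping only pods annotated with prometheus.io/scrape=true:
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    regex: 'true'
    action: keep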
11. Debugging Relabeling Rules
How to Debug:
- Use /targets in the Prometheus web UI: it shows original labels, post-relabel labels, and whether a target was dropped.
- Use /api/v1/targets to fetch live target info (see the curl example after this list).
- Use the PromLabs relabel debugger to simulate complex flows.
- Set log level debug in Prometheus to see full relabeling logs.
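For example, the targets endpoint can be inspected from the command line (jq is assumed to be installed and is only used for readability):
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {labels, health}'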
Summary
Feature | Description |
relabel_configs | Change scrape targets and metadata |
metric_relabel_configs | Filter or relabel individual metrics |
labelmap | Rename multiple labels based on regex |
drop, keep | Selectively drop/keep targets or metrics |
Visual Debugging | Use relabeler.promlabs.com |
Grafana Heatmaps for Prometheus Histograms
1. Adding and Configuring a Heatmap Panel for Prometheus Histograms
What Is a Heatmap Panel?
A heatmap is a two-dimensional chart where:
The X-axis usually represents time.
The Y-axis represents value buckets (e.g., request durations, response sizes).
Color intensity represents the frequency or count of occurrences.
In Prometheus, heatmaps are built from histogram metrics, specifically the _bucket time series from histogram instruments.
Prerequisites
You have Prometheus set up and scraping histogram metrics.
Example Prometheus metric:
http_request_duration_seconds_bucket
Grafana is connected to Prometheus as a data source.
Step-by-Step: Adding a Heatmap Panel
Step 1: Open Grafana and Create/Edit a Dashboard
Go to Grafana (typically http://localhost:3000).
Click "+" → Dashboard.
Click "Add New Panel".
From the panel type selector, choose "Heatmap".
Step 2: Write the PromQL Query
Use the histogram _bucket metric with a rate() or increase() function:
rate(http_request_duration_seconds_bucket[5m])
Replace the metric with your own histogram bucket metric.
rate() shows frequency per second; increase() gives the absolute count over a time window.
Step 3: Group by Bucket (le) and Label
You must group by the le (less-than-or-equal) label to segment by bucket:
sum by (le) (
rate(http_request_duration_seconds_bucket[5m])
)
If you have other labels (e.g., job, instance), include them as needed.
Step 4: Panel Settings
A. Data Format
- Format as: Time series buckets (NOT regular time series).
B. Visualization Settings
- Set the Y-axis to logarithmic if your buckets vary widely.
- Set the Y-axis unit: seconds (s), milliseconds (ms), or your metric unit.
- Choose a color scheme: usually gradient or spectrum.
- Adjust bucket sort: ascending (for duration buckets).
C. Binning Options
In Display > Binning:
X-Axis (time): automatically binned
Y-Axis (bucket boundaries):
Choose "Series" mode for Prometheus
Binning mode: "Auto" or specify your own bucket steps (optional)
Step 5: Save and Observe
Click Apply to save the panel.
Observe how your metric is distributed across buckets over time.
2. Using and Interpreting the Heatmap Panel
Understanding What You See
The heatmap shows how frequently values fall into different buckets over time.
Each horizontal slice (row) = one bucket (e.g., request ≤ 0.3s, ≤ 0.5s, etc.)
Each vertical column = a time slice (e.g., every minute)
Each cell color = frequency (how many requests fell in that range)
Common Use Cases
Use Case | How Heatmap Helps |
Request Latency Analysis | View if most requests fall into <0.5s or spike into higher buckets |
Memory Usage | See how memory allocations vary and group over thresholds |
Response Size | Analyze spikes in payload size over time |
Application Load | View load distribution across histogram buckets |
Typical Interpretation Patterns
Darker cells: More frequent values in that bucket/time.
Sudden color changes: Traffic spike or regression.
Wider spread of color across buckets: Latency variability or inconsistent performance.
Example Histogram Metric
If you're using the default Prometheus Go client:
sum by (le) (
rate(http_request_duration_seconds_bucket{job="my-api"}[5m])
)
This query feeds into a heatmap that shows request duration patterns.
Compare with Other Metrics
Combine heatmap with:
_sum: Total request time (for avg calculation).
_count: Total request count.
Use these with PromQL like:
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])
This gives the average duration, which you can use to cross-check what the heatmap shows.
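You can also derive a percentile from the same bucket series to compare against the heatmap, for example the 95th percentile of request duration:
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))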