Understanding Prometheus: A Comprehensive Guide

Arijit Das
46 min read

Introduction to the Prometheus Monitoring System

Prometheus is an open-source monitoring and alerting toolkit widely used in cloud-native and microservices environments. Originally developed at SoundCloud in 2012 and now part of the Cloud Native Computing Foundation (CNCF), Prometheus excels at collecting time-series data, enabling real-time alerting and powerful metric analysis.

1. What is Prometheus?

Prometheus is a time-series database and monitoring system. It works by scraping metrics from instrumented targets at specified intervals and storing them in a highly efficient time-series database. Prometheus is widely adopted for its multi-dimensional data model, simple yet powerful query language, and standalone nature: it doesn't rely on external storage systems or message queues.

Key points:

  • Pull-based data collection via HTTP.

  • Stores data as time-stamped metrics.

  • Comes with a built-in expression browser and alerting.

  • Scales well for most monitoring needs.
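To make the pull model concrete, here is a toy target written with only the standard library. The metric name app_requests_total and port 8000 are invented for illustration; a real service would use an official Prometheus client library instead.

```python
# Minimal sketch of a pull-based scrape target (illustrative, stdlib only).
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # hypothetical counter incremented by application code


def render_metrics() -> str:
    # Render one counter in the Prometheus text exposition format.
    return (
        "# HELP app_requests_total Total requests handled\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {REQUEST_COUNT}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()


if __name__ == "__main__":
    HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Prometheus would then be pointed at localhost:8000 in scrape_configs and would issue GET /metrics on every scrape interval.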

2. System Architecture

2.1 Metric Sources (Targets)

These are the systems Prometheus collects metrics from:

  • Applications

  • Databases

  • Linux Hosts

  • Containers

These systems expose metrics endpoints (usually /metrics) in a format Prometheus can scrape.

2.2 Service Discovery

To automatically discover the metric sources, Prometheus uses:

  • Kubernetes

  • Consul

This allows Prometheus to dynamically find services/instances to scrape, rather than manually configuring static targets.

2.3 Prometheus Server (Core)

This is the central brain of the architecture and does most of the heavy lifting:

  • Data Retriever
    Pulls (scrapes) metrics from the discovered targets.

  • TSDB (Time Series Database)
    Stores all the scraped metrics as time-series data.

  • HTTP Server
    Allows users, tools, and dashboards to query Prometheus data using PromQL (Prometheus Query Language).

Prometheus pulls metrics from targets (not push by default, although the Pushgateway provides a push-based workaround).

2.4 Querying and Visualization Tools

These tools interact with Prometheus to visualize and analyze the data:

  • Web UI: the built-in Prometheus UI.

  • SigNoz: an open-source observability platform.

  • PromLens: an advanced query-building tool.

  • Grafana: a popular visualization tool (dashboards and graphs).

These tools query data from Prometheus via PromQL.

2.5 Alerting

Prometheus evaluates alerting rules and:

  • Sends alerts to Alertmanager.

  • Alertmanager handles deduplication, grouping, and routing of alerts.

  • Alertmanager forwards alerts to notification channels like:

    • Email

    • Slack

    • PagerDuty

2.6 Remote/Local Storage Integration

Prometheus can forward samples to external storage systems for long-term retention or more scalable solutions, referred to as remote/local storage. This is optional and useful for scaling or compliance needs.

Summary of the Flow

  1. Prometheus discovers targets via Kubernetes/Consul.

  2. Prometheus scrapes (pulls) metrics from targets.

  3. Data is stored in Prometheusโ€™s TSDB.

  4. Users/tools query metrics (via Web UI, Grafana, etc.).

  5. Prometheus sends alerts to Alertmanager based on rules.

  6. Alertmanager forwards alerts to email, Slack, or PagerDuty.

  7. Optionally, Prometheus forwards samples to remote storage.

In short:

Prometheus scrapes, stores, analyzes, and alerts on time-series metrics from various sources, while integrating with external tools for visualization and notifications.

3. Core Features Overview

Prometheus offers a wide range of features for modern monitoring:

  • Multi-dimensional data model using time series identified by metric names and key/value pairs (labels).

  • Powerful PromQL (Prometheus Query Language) for slicing and dicing data.

  • No reliance on external storage: all data is stored locally in a custom TSDB.

  • Pull-based scraping over HTTP.

  • Integrated Alerting with Alertmanager.

  • Flexible Service Discovery supporting Kubernetes, Consul, EC2, and static targets.

  • Built-in Web UI for ad-hoc queries and visualization.

4. Prometheus Data Model

Prometheus's data model revolves around time series.

  • A metric is a set of time series that share the same name and differ by their label sets.

  • Each time series is identified by a metric name and a set of key/value labels.

  • For example:

      http_requests_total{method="POST", handler="/api/order"}
    
  • Each time series consists of:

    • Timestamps

    • Floating-point samples

  • Labels allow for high-cardinality querying and filtering.

This data model supports flexible and high-dimensional querying without predefined schemas.
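The identity rule above (metric name plus label set, independent of label order) can be sketched with a toy in-memory store. This is an illustration of the data model only, not Prometheus's actual TSDB layout.

```python
# Toy in-memory series store: each series is keyed by (metric name,
# sorted label pairs) and holds (timestamp, float sample) tuples.
from collections import defaultdict

storage = defaultdict(list)


def series_key(name: str, labels: dict) -> tuple:
    # Label order must not matter, so sort the label pairs.
    return (name, tuple(sorted(labels.items())))


def append_sample(name, labels, ts, value):
    storage[series_key(name, labels)].append((ts, value))


# Same name + same labels (in any order) -> same series; a different
# label value -> a distinct series.
append_sample("http_requests_total", {"method": "POST", "handler": "/api/order"}, 1000, 3.0)
append_sample("http_requests_total", {"handler": "/api/order", "method": "POST"}, 1015, 4.0)
append_sample("http_requests_total", {"method": "GET", "handler": "/api/order"}, 1000, 7.0)
```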

5. Metrics Transfer Format

Prometheus uses a text-based exposition format over HTTP.

  • The standard format is a simple plaintext format, exposed by the /metrics endpoint on each target.

  • Example:

      # HELP http_requests_total Total number of HTTP requests
      # TYPE http_requests_total counter
      http_requests_total{method="GET", code="200"} 1027
    
  • Metrics types include:

    • Counter: Monotonic increasing values (e.g., requests served).

    • Gauge: Values that go up and down (e.g., temperature).

    • Histogram: Measures distributions (e.g., request durations).

    • Summary: Similar to histograms but focused on quantiles.

Prometheus scrapes this endpoint regularly and parses the data for storage and query.
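As a sketch of what that parsing involves, a sample line like the one above can be broken into name, labels, and value with a short routine (simplified: it ignores escaping inside label values, optional timestamps, and exemplars):

```python
import re

# Matches one sample line of the text exposition format, e.g.
#   http_requests_total{method="GET", code="200"} 1027
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)


def parse_line(line):
    """Return (name, labels, value) for a sample line, or None for
    blank lines and # HELP / # TYPE comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    m = LINE_RE.match(line)
    if not m:
        raise ValueError(f"unparseable line: {line!r}")
    labels = {}
    if m.group("labels"):
        for pair in m.group("labels").split(","):
            k, v = pair.strip().split("=", 1)
            labels[k] = v.strip('"')
    return m.group("name"), labels, float(m.group("value"))
```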

6. Query Language (PromQL)

PromQL (Prometheus Query Language) is a powerful and expressive language for querying time series data.

  • Used in the web UI, Grafana, alert rules, and API calls.

  • Supports:

    • Instant vector: snapshot at a time point.

    • Range vector: data over a time range.

    • Arithmetic operations between metrics.

    • Aggregation: sum, avg, max, min, count, etc.

    • Filtering based on labels.

Examples:

  • http_requests_total: fetch all time series of this metric.

  • sum(rate(http_requests_total[1m])) by (method): total requests per method in the last minute.

PromQL makes Prometheus incredibly flexible for real-time analytics.

7. Integrated Alerting

Prometheus comes with built-in alerting capabilities:

  • Alert Rules are defined using PromQL.

  • Prometheus evaluates these rules at regular intervals and fires alerts.

  • Alerts are sent to Alertmanager, which:

    • Deduplicates alerts

    • Groups related alerts

    • Sends notifications via email, Slack, PagerDuty, etc.

    • Supports silencing and routing policies

Example rule:

groups:
- name: example
  rules:
  - alert: HighRequestRate
    expr: rate(http_requests_total[1m]) > 100
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "High request rate detected"

8. Service Discovery Support

Prometheus can dynamically discover scrape targets using service discovery integrations, avoiding the need for static configs.

Supported methods include:

  • Kubernetes (pods, services, endpoints)

  • Consul

  • EC2

  • Azure

  • GCE

  • Docker Swarm

  • File-based SD (watching JSON/YAML files)

This enables automatic discovery of new services and ensures Prometheus always monitors the correct set of targets, even in dynamic cloud-native environments.
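As a sketch of the file-based SD method, a scrape_config can point at JSON files that an external process rewrites as instances come and go (the job name and file paths here are illustrative):

```yaml
scrape_configs:
  - job_name: 'file-sd-example'
    file_sd_configs:
      - files: ['targets/*.json']   # Prometheus watches these files for changes
```

A matching targets file might look like:

```json
[
  {
    "targets": ["10.0.0.5:9100", "10.0.0.6:9100"],
    "labels": {"env": "prod"}
  }
]
```

Prometheus picks up edits to the watched files without a restart, which makes file-based SD a simple bridge from any inventory system.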

Getting Started with Prometheus

This section will walk you through the first steps of installing, configuring, and running Prometheus. You'll also learn how to explore its web UI, monitor targets, and query data using PromQL.

1. Downloading Prometheus

Prometheus can be downloaded directly from its official website:

  • Visit: https://prometheus.io/download

  • Select the appropriate binary for your operating system (e.g., Linux, macOS, Windows).

  • Example (Linux, x86_64):

wget https://github.com/prometheus/prometheus/releases/download/v2.51.1/prometheus-2.51.1.linux-amd64.tar.gz

Make sure to always download the latest stable version.

2. Unpacking and Inspecting the Tarball

Once downloaded, unpack the tarball using the following command:

tar -xvf prometheus-2.51.1.linux-amd64.tar.gz
cd prometheus-2.51.1.linux-amd64

Inside the extracted folder, you'll see:

  • prometheus: the main binary

  • promtool: a tool to check config and rule files

  • prometheus.yml: the default config file

  • console_libraries/: libraries for console templates

  • consoles/: example console templates

This directory structure can be moved or customized based on your deployment setup.

3. Configuring Prometheus

Prometheus is configured via a YAML file (prometheus.yml), which defines global settings, scrape targets, alerting, and more.

Example prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

Key config sections:

  • global: Sets the default scrape interval, evaluation interval, etc.

  • scrape_configs: Defines monitoring targets, job names, relabeling, etc.

  • alerting: Configures Alertmanager integration.

  • rule_files: Specifies rule files for alerts or recording.

Use promtool check config prometheus.yml to validate your configuration.

4. Command-Line Flags and Defaults

Prometheus supports many command-line flags to customize runtime behavior. Commonly used ones:

./prometheus \
  --config.file=prometheus.yml \
  --storage.tsdb.path=data/ \
  --web.listen-address=":9090"

Useful flags:

  • --config.file: Path to the config file (default: prometheus.yml)

  • --storage.tsdb.path: Directory for storing metrics data (default: data/)

  • --web.listen-address: Port on which Prometheus serves the UI and API

  • --log.level: Set log level (e.g., info, debug, error)

You can view all flags by running:

./prometheus --help

5. Running Prometheus

To start Prometheus:

./prometheus --config.file=prometheus.yml

You should see logs indicating that Prometheus is starting and loading targets. By default, the web UI will be accessible at:

http://localhost:9090

Make sure port 9090 is open and not blocked by firewalls or other services.

6. Web Interface

Prometheus includes a built-in web UI accessible via the browser.

Features:

  • Home dashboard with system status

  • Expression browser for querying metrics

  • Visualization of raw time-series data

  • Target health and label inspection

  • Alerts and rules display

To access it:

http://localhost:9090

Useful tabs:

  • Status > Targets: See active targets and scrape status

  • Graph: Run PromQL queries and visualize data

  • Alerts: View firing and pending alerts

7. Targets Page

The Targets page shows all the configured jobs and their respective scrape endpoints.

Navigate to:

http://localhost:9090/targets

You'll see:

  • Job name

  • Endpoint

  • Last scrape time

  • Scrape duration

  • Scrape status (UP/DOWN)

If a target is down, check:

  • If the service is running

  • If the endpoint is reachable

  • If the config is correct

This page is essential for debugging connectivity and monitoring issues.

8. Querying Metrics with PromQL

Prometheus supports PromQL (Prometheus Query Language) for querying and analyzing time-series data.

To try it out:

  • Go to http://localhost:9090/graph

  • Enter a query, e.g.:

      up
    

    This checks if targets are up (1 = UP, 0 = DOWN).

Examples:

  • node_cpu_seconds_total: View total CPU time

  • rate(http_requests_total[1m]): View request rate over the past minute

  • sum by (instance)(rate(http_requests_total[5m])): Total requests per instance

You can visualize results as graphs, tables, or export as JSON using the HTTP API.
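The HTTP API returns JSON; as a sketch, the query URL can be built with the standard library and the result decoded as below. The response here is a hand-written sample of the API's shape, not live output.

```python
import json
from urllib.parse import urlencode


def query_url(base, promql):
    # Build an instant-query URL for the v1 HTTP API.
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


url = query_url("http://localhost:9090", "up")

# Hand-written sample of the JSON shape returned by /api/v1/query.
sample_response = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "up", "job": "prometheus", "instance": "localhost:9090"},
       "value": [1714606800, "1"]}
    ]
  }
}
""")

# Each result carries a label set and a [timestamp, value-as-string] pair.
for r in sample_response["data"]["result"]:
    instance = r["metric"]["instance"]
    value = float(r["value"][1])
```

Note that sample values arrive as strings and must be converted to floats by the caller.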

Understanding Prometheus Metric Types

Prometheus supports four core metric types that represent different patterns of data collection. These types (Gauges, Counters, Summaries, and Histograms) allow developers and SREs to instrument and monitor applications with precision and clarity. Each type has specific characteristics and use cases.

1. Gauges

A Gauge is a metric that represents a single numerical value that can arbitrarily go up and down. Use gauges for things like current memory usage, number of active goroutines, or temperature readings.

Example use cases:

  • Current CPU temperature

  • Active sessions

  • Queue length

  • Free memory

2. Gauge Instrumentation Methods

Prometheus client libraries (e.g., Python, Go, Java) provide methods to work with gauges:

In Go:

var temperature = prometheus.NewGauge(
    prometheus.GaugeOpts{
        Name: "room_temperature_celsius",
        Help: "Current room temperature in Celsius.",
    },
)
temperature.Set(22.5)
temperature.Inc()
temperature.Dec()

In Python:

from prometheus_client import Gauge
temperature = Gauge('room_temperature_celsius', 'Current room temperature')
temperature.set(22.5)

Common methods:

  • set(value)

  • inc(), dec()

  • set_to_current_time()

3. Gauges in the Exposition Format

The exposition format is the plain text output served on the /metrics endpoint.

Example:

# HELP room_temperature_celsius Current room temperature
# TYPE room_temperature_celsius gauge
room_temperature_celsius 22.5

The format is human-readable and easily parseable by Prometheus scrapers.

4. Querying Gauges

Use PromQL to directly view the current value of a gauge:

room_temperature_celsius

You can apply mathematical operations:

room_temperature_celsius * 1.8 + 32  # convert to Fahrenheit

5. Gauges Containing Timestamps

Gauges can also include explicit timestamps in exposition format, although it's not typical.

Example:

room_temperature_celsius 22.5 1683023900000

However, this is discouraged unless absolutely necessary, as it can complicate time-series storage.

6. Counters

Counters are cumulative metrics that can only increase (or be reset to zero on restart). Use counters to track things like:

  • Total HTTP requests

  • Errors

  • Bytes transferred

They are monotonically non-decreasing: a counter's value never goes down except when it resets.

7. Counter Resets

Counters can reset to zero, typically after a service restart. Prometheus handles this gracefully: functions like rate() and increase() treat any decrease in value as a reset and adjust their calculations.

For example:

rate(http_requests_total[5m])

This accounts for resets by calculating per-second increase over time.

8. Counter Instrumentation Methods

In Go:

var requests = prometheus.NewCounter(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    },
)
requests.Inc()
requests.Add(3)

In Python:

from prometheus_client import Counter
requests = Counter('http_requests_total', 'Total HTTP requests')
requests.inc()

9. Counters in the Exposition Format

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total 1543

This value never decreases unless there's a reset.

10. Querying Counters (Absolute Values vs. Rates)

  • Absolute value: http_requests_total

  • Rate of increase: rate(http_requests_total[1m])

rate() returns per-second average increase:

sum(rate(http_requests_total[5m])) by (method)

Useful for dashboards and alerting thresholds.

11. Summaries

Summaries are used to track observations (e.g., request durations, response sizes) and produce:

  • Quantiles (e.g., 0.5, 0.9, 0.99)

  • Sum of all observations

  • Count of observations

12. Constructing Summaries

In Go:

summary := prometheus.NewSummary(prometheus.SummaryOpts{
    Name:       "request_duration_seconds",
    Help:       "Request duration in seconds",
    Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
})
summary.Observe(1.2)

13. Summary Instrumentation Methods

  • observe(value): Add a new observation

  • Predefined objectives (quantiles)

Summaries provide real-time quantile approximations, but with trade-offs: higher client-side memory usage, and their quantiles cannot be aggregated across labels or instances.

14. Querying Summaries

Summary metrics are split into:

  • _count: number of observations

  • _sum: total value

  • quantile estimates, exposed on the base metric name with a quantile label (e.g., request_duration_seconds{quantile="0.9"})

Example:

rate(request_duration_seconds_sum[5m]) / rate(request_duration_seconds_count[5m])

This calculates average request duration over 5 minutes.

15. Histograms

Histograms group observations into configurable buckets and count how many fall into each.

They provide:

  • Bucketed counts

  • _count and _sum

  • Percentile approximations (computed at query time by Prometheus, not by the client)

16. Cumulative Histograms

Histograms use cumulative buckets:

request_duration_seconds_bucket{le="0.1"} 240
request_duration_seconds_bucket{le="0.5"} 756
request_duration_seconds_bucket{le="1"}   999
request_duration_seconds_bucket{le="+Inf"} 1024

Each bucket includes the count for that threshold and below.
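Percentile estimation from such cumulative buckets can be sketched as follows: locate the bucket containing the target rank, then interpolate linearly inside it. This is a simplified stand-in for PromQL's histogram_quantile, using the bucket counts from the example above.

```python
import math


def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count) sorted by bound,
    ending with (math.inf, total). Linear interpolation inside the target
    bucket, mirroring (in simplified form) Prometheus's query-time estimate."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if rank <= count:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into +Inf
            # Fraction of the way through this bucket's observations.
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + frac * (bound - prev_bound)
        prev_bound, prev_count = bound, count


buckets = [(0.1, 240), (0.5, 756), (1.0, 999), (math.inf, 1024)]
p95 = histogram_quantile(0.95, buckets)  # ~0.946 seconds
```

Because the answer is interpolated within a bucket, its accuracy depends entirely on how well the bucket boundaries match the real distribution.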

17. Constructing Histograms

In Go:

hist := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "request_duration_seconds",
    Help:    "Histogram of request durations",
    Buckets: prometheus.LinearBuckets(0.1, 0.1, 10),
})
hist.Observe(0.3)

Choose buckets wisely to match the distribution of your data.

18. Histogram Instrumentation Methods

  • observe(value): Records a value

  • Buckets must be set at creation and are immutable

19. Histograms in the Exposition Format

# HELP request_duration_seconds Histogram of request durations
# TYPE request_duration_seconds histogram
request_duration_seconds_bucket{le="0.1"} 123
request_duration_seconds_bucket{le="0.5"} 456
request_duration_seconds_bucket{le="+Inf"} 789
request_duration_seconds_count 789
request_duration_seconds_sum 105.6

Prometheus computes quantiles during query time (not client side like summaries).

20. Querying Histograms

Example: 95th percentile from histogram:

histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m]))

Average duration:

rate(request_duration_seconds_sum[5m]) / rate(request_duration_seconds_count[5m])

21. Average Request Latencies

Both Summaries and Histograms can calculate average latency:

rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

Histograms are preferred for aggregations across labels, which summaries cannot do.

22. Native Histogram

Introduced in Prometheus v2.40+, Native Histograms are experimental and designed to:

  • Reduce memory usage

  • Support better quantile approximation

  • Be more efficient at high cardinality

Native histograms are exposed as a new sample type and are scraped via the protobuf exposition format.

They are enabled in Prometheus using a feature flag:

--enable-feature=native-histograms

Unlike regular histograms, native histograms don't require pre-defined buckets and dynamically adapt based on the distribution of data.

PromQL Data Selection Explained

Prometheus Query Language (PromQL) is a powerful tool used for querying time series data. At the heart of PromQL are selectors: constructs that define what data to fetch, from which series, and over what time range.

This section focuses on both instant vector and range vector selectors, label matchers, and all modifiers that affect how and when data is retrieved and evaluated.

1. Instant Vector Selectors

An instant vector selector retrieves the latest sample for each time series at a single point in time (usually "now").

Syntax:

http_requests_total

This fetches all time series with the metric name http_requests_total at the current moment.

2. Label Matchers

Label matchers refine vector selectors by filtering based on metric labels.

Types of matchers:

  • =  : equals, e.g. {job="api"}

  • != : not equals, e.g. {status!="500"}

  • =~ : regex match, e.g. {method=~"GET|POST"}

  • !~ : negative regex match, e.g. {job!~"dev-.*"}

Example:

http_requests_total{job="api", status=~"2.."}

3. Visualizing Instant Vector Selector Behavior (Lookback Delta)

Prometheus doesn't scrape metrics exactly at the evaluation moment. It looks backwards in time using the lookback delta (default: 5m).

If no sample exists within 5 minutes, Prometheus drops the series from the result.

So:

metric_name

...returns the most recent value within the last 5 minutes.
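A sketch of that lookback rule, assuming timestamps in seconds and the default 5-minute (300 s) delta:

```python
def evaluate_instant(samples, eval_ts, lookback=300):
    """samples: list of (timestamp, value), ascending by timestamp.
    Return the most recent value at or before eval_ts within the
    lookback window, or None if the series is dropped from the result."""
    for ts, value in reversed(samples):
        if ts <= eval_ts:
            return value if eval_ts - ts <= lookback else None
    return None


samples = [(1000, 5.0), (1060, 6.0), (1120, 7.0)]
evaluate_instant(samples, 1130)  # -> 7.0 (latest sample is 10s old)
evaluate_instant(samples, 1500)  # -> None (latest sample is 380s old)
```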

4. Staleness Markers and Staleness Handling

Prometheus uses staleness markers to detect when a series stops being reported (e.g., app crashed). If a time series disappears, Prometheus marks it as stale.

Staleness is used to:

  • Stop evaluating old data

  • Avoid misleading results

These markers are invisible in PromQL but affect evaluation.

5. Range Vector Selectors

A range vector selector retrieves all samples for each time series over a specified time interval.

Syntax:

http_requests_total[5m]

This selects all values in the last 5 minutes for each series. The output is a range vector, which can be passed into functions like rate().

6. Visualizing Range Vector Selector Behavior

Range vectors return a set of time-stamped samples.

For example:

rate(http_requests_total[1m])

Behind the scenes, Prometheus:

  • Selects all samples in the last 1 minute per series

  • Calculates per-second rate of increase

Each data point in a graph represents a separate evaluation of the range.

7. Relative Offsets (offset Modifier)

The offset modifier shifts the evaluation time back in time.

Example:

http_requests_total offset 1h

Returns the value of http_requests_total from 1 hour ago (either as instant or range vector depending on selector type).

Can be combined with range vectors:

rate(http_requests_total[5m] offset 1h)

This gives the 5-minute rate calculated 1 hour ago.

8. Visualizing Offsets for Instant Vector Selectors

If now is 16:00, then:

metric_name offset 1h

...evaluates the value of metric_name at 15:00.

It works like a time machine for metrics.

9. Offset Use Cases

Use offsets to:

  • Compare current data to past performance

  • Detect regressions

  • Create "previous week" or "same time yesterday" graphs

Example:

(rate(http_requests_total[5m]) - rate(http_requests_total[5m] offset 1d)) / rate(http_requests_total[5m] offset 1d)

This shows the percentage change from yesterday.

10. Visualizing Offsets for Range Vector Selectors

Example:

rate(metric[1h] offset 2h)

Assume current time is 18:00:

  • Evaluation time: 18:00

  • Range: 1h

  • Offset: 2h

โžก๏ธ Evaluates over 15:00 to 16:00

11. Absolute Evaluation Timestamps (@ Modifier)

The @ modifier lets you run a query as if it were evaluated at an exact timestamp.

Syntax:

http_requests_total @ 1714606800

  • Uses a Unix timestamp in seconds

  • Enabled by default since Prometheus v2.33 (introduced earlier behind the promql-at-modifier feature flag)

Use cases:

  • Forensics

  • Debugging exact past states

  • Deterministic exports

12. Visualizing Absolute Evaluation Timestamps

Imagine this query:

rate(http_requests_total[5m]) @ 1714606800

Prometheus computes the 5-minute rate at the exact time 1714606800.

This enables reproducibility of data snapshots and avoids skew from real-time evaluations.

13. Syntactic Order for Modifiers

When combining modifiers (offset, @), both must immediately follow the selector, but they may appear in either order:

metric[5m] offset 1h @ 1714606800

  • offset is applied relative to the evaluation time set by @, so both orderings give the same result.

  • Read it like: "take the 5-minute range ending 1 hour before the given timestamp".

Placing offset or @ anywhere other than directly after the selector is a parse error.

Understanding Counter Rates and Increases in PromQL

Prometheus counters represent monotonically increasing values, such as the number of requests processed or bytes transferred. Understanding how to interpret, calculate, and query these counters accurately is essential for time-series analytics.

1. Absolute Counter Values and Why We Want Rates

Absolute Values:

Counters like http_requests_total grow over time. They show the total amount of something that has occurred.

Example:

http_requests_total

This shows the current cumulative count of HTTP requests, but doesn't tell how fast they're coming in.

Why We Want Rates:

Absolute values don't show trends or activity levels. We usually want:

  • How many requests per second?

  • How fast is the traffic increasing?

Thus, we compute rates (changes per time unit).

2. The Three Counter Increase Functions

Prometheus provides three main functions to evaluate counter growth over time:

  • rate(): calculates the per-second average rate over a range

  • increase(): calculates the absolute increase over a period

  • irate(): calculates the per-second rate using the last two points (instant rate)

3. Behavior of rate() and increase()

rate():

Used with range vectors, gives the average rate per second over the range.

Syntax:

rate(http_requests_total[5m])

This calculates how many requests per second happened on average over the last 5 minutes.

increase():

Calculates total increase over a time range.

Syntax:

increase(http_requests_total[5m])

If 100 requests were made during the 5-minute window, this returns 100.

4. Handling Counter Resets

Prometheus counters may reset (e.g., due to app restart). Prometheus automatically detects this by identifying a lower value than before.

PromQL functions like rate() and increase():

  • Detect these resets

  • Skip invalid segments

  • Continue calculating using valid portions

If a reset is detected:

... -> 950 -> 980 -> 10 (reset) -> 50

increase() computes: (980 - 950) + 10 + (50 - 10) = 30 + 10 + 40 = 80, since the value after a reset counts as growth from zero.

5. Calculating the rate() and increase() Slope

Prometheus takes the first and last samples inside the range window, computes the reset-corrected growth between them, and extrapolates that slope out to the window boundaries. (It does not fit a regression line; that is what deriv() does, for gauges.)

Example:

For increase(http_requests_total[5m]), Prometheus:

  1. Gathers all samples in the 5-minute window

  2. Computes the reset-corrected difference between the first and last samples

  3. Extrapolates that difference to cover the full window

Mathematically, to a first approximation:

increase = value_end - value_start
rate = increase / duration_in_seconds
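A simplified sketch of the reset-corrected increase, assuming the samples already sit exactly at the window boundaries (so the extrapolation step can be ignored):

```python
def counter_increase(samples):
    """Reset-corrected increase across a list of counter values.
    A drop between consecutive samples is treated as a counter reset,
    so the post-reset value counts as new growth from zero. Simplified:
    the real rate()/increase() also extrapolate to window boundaries."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def counter_rate(samples, duration_seconds):
    return counter_increase(samples) / duration_seconds


counter_increase([950, 980, 10, 50])  # -> 80.0 (30 before the reset, then 10 + 40)
```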

6. Extrapolating the Return Value for the increase() Function

Prometheus doesn't just blindly subtract endpoints. It extrapolates when samples don't exist exactly at the boundaries.

If samples do not fall exactly on the window boundaries, Prometheus extrapolates the observed growth outward so the estimate covers the entire window.

This avoids underestimating counters when scrapes are missed or irregular.

7. Confusing Extrapolation for Slow-Moving Counters

Slow-moving counters (e.g., errors that happen once per hour) can confuse users.

Example:

increase(errors_total[5m])

  • If one error occurred 4m ago, Prometheus extrapolates to assume a partial contribution across the 5-minute window.

  • It may look like a fractional increase (e.g., 0.2), which surprises users expecting whole numbers.

Prometheus is mathematically correct, but interpretation requires caution for low-frequency events.

8. Limiting Extrapolation to Zero Sample Values

Extrapolation is also limited so that a counter is never extrapolated below zero: if a series started recently, the estimated value at the start of the window is clamped at zero rather than extended into negative territory.

Note that if a range contains fewer than two samples, rate() and increase() return nothing for that series, so dashboards show gaps.

But be careful: a zero or missing increase does not necessarily mean zero traffic. It might mean:

  • No data scraped

  • Metric not emitted

  • Actual zero traffic (a flat counter yields a genuine rate of 0)

Use alerting rules or metadata checks to detect missing data.
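One common way to implement that check is an alert rule using absent(); the metric and job names below are illustrative:

```yaml
groups:
- name: missing-data
  rules:
  - alert: MetricAbsent
    expr: absent(http_requests_total{job="api"})
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "http_requests_total has not been reported for 5 minutes"
```

absent() returns a value only when no matching series exists, which distinguishes "no data" from a genuine zero rate.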

9. The irate() Function

irate() (instant rate) computes the rate between the two most recent samples in a range.

Syntax:

irate(http_requests_total[5m])

  • Uses just the last two data points

  • No interpolation, no smoothing

  • Ideal for spiky, fast-changing signals

โš ๏ธ Use with caution on slow countersโ€”it can be misleading if data is sparse.

10. Which Function Should You Use?

  • Trends, averages, smoothing: rate()

  • Absolute counts over time: increase()

  • Current/instantaneous values: irate()

  • Alerting (on spikes, errors): rate() or irate()

  • SLO calculation: increase() (e.g., over a day/week)

Rule of thumb:

  • Dashboards: use rate() for visual stability.

  • SLO math: use increase() to count events.

  • High-frequency alerting: use irate() if latency is critical.

Understanding "up" and Friends in Prometheus

1. Prometheus Server Configuration

Before exploring up and other auto-generated metrics, it's crucial to understand how Prometheus is configured to monitor targets:

  • Configuration File: Prometheus uses a prometheus.yml configuration file to define scraping jobs.

  • scrape_configs: Within this file, the scrape_configs block defines how Prometheus should discover and collect metrics from targets.

Example:

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

  • Each job defines:

    • job_name: A label for the scrape group.

    • targets: IP addresses or hostnames of endpoints exposing metrics.

    • metrics_path (default: /metrics)

    • Optional: relabeling, authentication, TLS, and timeouts.

When Prometheus starts, it uses this configuration to initialize target discovery and begin scraping metrics.

2. Inspecting Targets in Prometheus

To verify that Prometheus is correctly scraping your services:

  • Navigate to the Targets Page:

    • URL: http://<your-prometheus-host>:9090/targets
  • This page displays:

    • Job names and their associated targets.

    • Scrape status (up/down).

    • Last scrape duration and timestamp.

    • Labels associated with each target.

  • Importance: This helps you quickly see which targets are reachable and why some may be down.

  • Health Status: The field last scrape error or the color-coded status lets you identify failures in real time.

3. Showing All Auto-Generated Metrics

Prometheus automatically generates metrics about its own operation and about each target it scrapes. Its own internal metrics are exposed at:

  • http://localhost:9090/metrics

Per-target metrics such as up are synthesized during each scrape and written straight to the TSDB, so they are queried like any other metric.

To view a list of all available metrics in the UI:

  • Go to http://localhost:9090/graph

  • Click on the "insert metric at cursor" dropdown or start typing in the expression field.

  • Metrics like up, scrape_duration_seconds, and scrape_samples_post_metric_relabeling appear.

4. The "up" Metric

This is the most important health metric in Prometheus.

  • Definition: up is a gauge metric automatically generated by Prometheus to indicate whether a target is reachable.

  • Values:

    • 1: The scrape was successful (target is UP).

    • 0: The scrape failed (target is DOWN or unreachable).

  • Labels:

      up{job="node_exporter", instance="localhost:9100"} 1
    
  • Use Case:
    You can use this in alerting rules:

      alert: TargetDown
      expr: up == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Target {{ $labels.instance }} is down"
    
  • Internals: up is computed based on whether Prometheus received a valid HTTP 200 response and successfully parsed the metrics from the target.

5. Other Auto-Generated Metrics

Prometheus exposes several internal metrics for diagnostics and performance monitoring:

  • scrape_duration_seconds: time taken to scrape a target

  • scrape_samples_scraped: number of samples scraped in the last scrape

  • scrape_samples_post_metric_relabeling: number of samples retained after relabeling

  • scrape_series_added: number of new series added in the scrape

  • scrape_timeout_seconds: the configured timeout per scrape (requires the extra-scrape-metrics feature flag)

  • prometheus_sd_*: service discovery subsystem metrics

  • prometheus_target_*: metrics on target health and discovery

  • prometheus_engine_*: query engine performance

  • prometheus_tsdb_*: storage subsystem metrics (compaction, WAL, memory usage)

Example:

scrape_duration_seconds{job="node_exporter", instance="localhost:9100"} 0.023

These can be used to:

  • Detect scrape performance issues

  • Analyze ingestion rate

  • Tune Prometheus server configuration

6. Auto-Generated Metrics in the Prometheus Documentation

Prometheus maintains complete documentation of its internal metrics:

  • Official Reference:

    • Prometheus Internal Metrics Documentation
  • The documentation includes:

    • Metric name

    • Type (gauge, counter)

    • Description

    • Associated labels

    • Subsystem/component

  • Use Case: These are especially useful for:

    • Monitoring Prometheus server health

    • Creating dashboards (e.g., Grafana Prometheus dashboards)

    • Debugging ingestion issues

    • Auditing scrape errors

๐Ÿ” Summary Table: Key Auto-Generated Metrics

Metric NameTypePurpose
upGaugeIndicates if the target was successfully scraped
scrape_duration_secondsGaugeScrape latency
scrape_samples_scrapedGaugeNumber of metrics collected per scrape
prometheus_target_interval_length_secondsGaugeActual vs expected interval duration
prometheus_engine_query_duration_secondsHistogramDuration of PromQL queries
prometheus_tsdb_head_seriesGaugeTotal active series in TSDB

Understanding Prometheus Histograms

1. Motivation and Histogram Basics

Histograms in Prometheus are used to observe and record the distribution of events over a set of predefined buckets. They are particularly useful for understanding the behavior of applications, such as response times, request sizes, or any measurable quantity that can be categorized.

2. Need to Measure Request Durations/Latency

Monitoring request durations or latency is crucial for:

  • Performance Analysis: Understanding how fast your application responds.

  • SLA/SLO Compliance: Ensuring response times meet agreed standards.

  • Bottleneck Identification: Detecting slow components in your system.

Histograms allow you to see not just averages but the distribution of response times, which is vital for comprehensive performance monitoring.

3. Downsides of Using Event Logging

While event logging provides detailed insights, it has limitations:

  • High Overhead: Logging every event can consume significant resources.

  • Complex Analysis: Aggregating and analyzing logs for metrics is cumbersome.

  • Latency: Real-time analysis is challenging due to the volume of data.

Histograms offer a more efficient way to monitor metrics like latency without the overhead of detailed logging.

4. Why a Single Gauge Doesn't Help Us

A gauge represents a single numerical value that can go up or down. Using a gauge for metrics like request duration is inadequate because:

  • Lack of Distribution: Gauges show only the current value, not the spread.

  • No Historical Context: They don't provide insights into past performance.

  • Inability to Calculate Percentiles: Gauges can't be used to compute percentiles like the 95th percentile.

5. Downsides of Using Prometheus Summary Metrics

Summaries in Prometheus can calculate quantiles but have drawbacks:

  • Client-Side Calculation: Quantiles are computed in the client, limiting flexibility.

  • No Aggregation Across Instances: Pre-computed quantiles from summaries cannot be meaningfully aggregated across multiple instances.

  • Static Configuration: Quantile objectives must be predefined.

Histograms, on the other hand, allow server-side aggregation and dynamic quantile calculation.

6. Prometheus Histogram Example for Tracking Request Durations

To track request durations:

httpDuration := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "A histogram of the HTTP request durations.",
    Buckets: prometheus.DefBuckets,
})

This setup records the duration of HTTP requests into predefined buckets, enabling detailed analysis of response times.

7. How Can We Expose Histograms as Time Series to Prometheus?

Prometheus histograms are exposed as multiple time series:

  • <metric>_bucket{le="..."}: Cumulative count of observations less than or equal to the bucket's upper bound.

  • <metric>_sum: Sum of all observed values.

  • <metric>_count: Total number of observations.

These time series allow Prometheus to store and query histogram data effectively.

8. Cumulative Histogram Representation

This chart plots:

  • X-axis: Duration in milliseconds (e.g., 25ms, 50ms, 100ms, etc.).

  • Y-axis: Count of observations that fall within a specific bucket.

  • Each bar height represents the number of observations between two bounds.

Bucket Counts (as shown):

| Bucket Range (ms) | Count |
| --- | --- |
| ≤ 25 | 31 |
| 25–50 | 32 |
| 50–100 | 105 |
| 100–250 | 617 |
| > 250 | 215 |

This means, for example, 617 requests took between 100ms and 250ms.

Prometheus stores histograms in a cumulative format rather than the regular format shown in the image.

A cumulative histogram gives the running total of observations up to each bucket's upper bound:

| Bucket (le = "less than or equal to") | Cumulative Count |
| --- | --- |
| le="25" | 31 |
| le="50" | 63 (31+32) |
| le="100" | 168 (63+105) |
| le="250" | 785 (168+617) |
| le="+Inf" | 1000 (785+215) |

So instead of individual bars, each bucket value contains the total number of observations less than or equal to the upper bound.
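The conversion from the regular per-range counts to the cumulative form Prometheus stores is just a running sum. A small sketch of that arithmetic, using the example numbers above:

```go
package main

import "fmt"

// cumulative converts per-range bucket counts (as in a regular histogram)
// into the running totals Prometheus stores under successive "le" bounds.
func cumulative(counts []int) []int {
	out := make([]int, len(counts))
	total := 0
	for i, c := range counts {
		total += c
		out[i] = total
	}
	return out
}

func main() {
	// Bucket counts from the example: <=25, 25-50, 50-100, 100-250, >250.
	fmt.Println(cumulative([]int{31, 32, 105, 617, 215}))
	// The last element (1000) is what the le="+Inf" bucket reports.
}
```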

Summary of the Difference

| Feature | Regular Histogram (Image) | Cumulative Histogram (Prometheus) |
| --- | --- | --- |
| Bucket Value | Observations within a range | Observations up to a bound |
| Data Representation | Independent bar heights | Accumulated total at each threshold |
| Example | 105 requests took 50–100ms | 168 requests took ≤ 100ms |

9. The Special "le" (Less-Than-Or-Equal) Bucket Upper Bound Label

In Prometheus, histograms use bucketed counts to record how many observations fall below certain thresholds.

Each bucket is labeled with:

le = X

Which means:

"Count of observations less than or equal to X."

For example:

  • le="25" → number of observations ≤ 25 ms

  • le="50" → number of observations ≤ 50 ms

  • ...

  • le="+Inf" → total count of all observations (since everything is ≤ ∞)

From the image:

| Bucket (le) | Cumulative Count |
| --- | --- |
| ≤ 25 ms | 31 |
| ≤ 50 ms | 63 |
| ≤ 100 ms | 168 |
| ≤ 250 ms | 785 |
| ≤ +Inf | 1000 |

Interpretation:

  • From 0–25 ms: 31 requests completed

  • 25–50 ms: 63 - 31 = 32 requests

  • 50–100 ms: 168 - 63 = 105 requests

  • 100–250 ms: 785 - 168 = 617 requests

  • 250 ms–∞: 1000 - 785 = 215 requests

Summary

  • The le label tells you the upper bound of the bucket.

  • These buckets are cumulative: each includes all lower durations.

  • Subtracting adjacent bucket values gives the number of samples in that range.

  • The bucket with le="+Inf" always contains the total number of samples.
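The subtraction described above, recovering the per-range counts from the cumulative "le" values, can be sketched as:

```go
package main

import "fmt"

// perBucket recovers the count inside each range by subtracting adjacent
// cumulative bucket values, the inverse of the running sum Prometheus stores.
func perBucket(cumulative []int) []int {
	out := make([]int, len(cumulative))
	prev := 0
	for i, c := range cumulative {
		out[i] = c - prev
		prev = c
	}
	return out
}

func main() {
	// Cumulative counts for le=25, 50, 100, 250, +Inf from the example.
	fmt.Println(perBucket([]int{31, 63, 168, 785, 1000}))
}
```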

10. Time Series Exposed from a Histogram Metric

This cumulative histogram displays the duration (in seconds) of observed events, using bucket boundaries (e.g. le="0.025") along the X-axis, and the cumulative count of observations along the Y-axis.

The time series exposed by a Prometheus histogram metric named http_request_duration_seconds_bucket would look like:

http_request_duration_seconds_bucket{le="0.025"}
http_request_duration_seconds_bucket{le="0.05"}
http_request_duration_seconds_bucket{le="0.1"}
http_request_duration_seconds_bucket{le="0.25"}
http_request_duration_seconds_bucket{le="+Inf"}

Each of these is a separate time series, and their values increase cumulatively as more events fall into that bucket or smaller.


🧠 How to Interpret le

Each le value is an upper boundary, meaning:

  • le="0.025" → all durations ≤ 25 ms

  • le="0.05" → all durations ≤ 50 ms

  • le="0.1" → all durations ≤ 100 ms

  • ...

  • le="+Inf" → all observations (total count)


๐Ÿ“ Behind the Scenes: Prometheus Histogram Export

A histogram metric in Prometheus (like http_request_duration_seconds) exposes 3 types of time series automatically:

Series TypePurpose
*_bucket{le="..."}Buckets by le, cumulative counts
*_countTotal count of observations
*_sumTotal sum of all observed values

So for http_request_duration_seconds, you'll see:

http_request_duration_seconds_bucket{le="0.025"}
http_request_duration_seconds_bucket{le="0.05"}
...
http_request_duration_seconds_bucket{le="+Inf"}

http_request_duration_seconds_sum
http_request_duration_seconds_count

✅ Why It Matters

  • You can compute percentiles using these buckets (e.g. 95th percentile from histogram approximation).

  • Subtracting two adjacent buckets gives the count in that interval.

  • It enables time-based slicing (e.g. rate of slow responses over the last 5 minutes).

11. Instrumentation - Adding Histograms to Your Code

To instrument your code with histograms:

  1. Define the Histogram:

var requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "Histogram of response time for handler.",
    Buckets: prometheus.LinearBuckets(0.05, 0.05, 20),
})

  2. Register the Histogram:

prometheus.MustRegister(requestDuration)

  3. Observe Values:

start := time.Now()
// handle request
duration := time.Since(start).Seconds()
requestDuration.Observe(duration)

12. Adding Histograms Without Additional Labels

When adding histograms without additional labels:

  • Simplifies Aggregation: Easier to aggregate across instances.

  • Reduces Cardinality: Fewer unique time series, conserving resources.

  • Use Case: Suitable for global metrics where differentiation isn't necessary.

13. Adding Histograms With Additional Labels

Adding labels to histograms allows for more granular analysis:

var requestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Histogram of response time for handler.",
        Buckets: prometheus.LinearBuckets(0.05, 0.05, 20),
    },
    []string{"method", "endpoint"},
)

This setup enables you to analyze request durations by HTTP method and endpoint.

14. Querying Histograms with PromQL

PromQL provides functions to query histograms:

  • rate(): Calculates the per-second average rate of increase.

  • increase(): Calculates the total increase over a time range.

  • histogram_quantile(): Estimates quantiles from histogram buckets.

Example:

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

This query estimates the 95th percentile of request durations over the last 5 minutes.

15. Querying All Bucket Series of a Histogram

To retrieve all bucket series:

http_request_duration_seconds_bucket

This returns all time series with the bucket suffix, allowing you to analyze the distribution across all buckets.

16. Querying Percentiles/Quantiles Using histogram_quantile()

The histogram_quantile() function estimates quantiles:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

This computes the 95th percentile by summing the rate of increase across all buckets and applying the quantile function.
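The interpolation histogram_quantile() performs can be sketched in Go. This is a simplified model assuming linear distribution within each bucket and a lower bound of 0 for the first bucket; Prometheus's exact edge-case handling differs slightly. The example numbers are the cumulative buckets from earlier (total count 1000):

```go
package main

import "fmt"

type bucket struct {
	le    float64 // upper bound of the bucket
	count float64 // cumulative count up to that bound
}

// quantile estimates the q-th quantile (0..1) by locating the bucket that
// contains the target rank and interpolating linearly inside it.
func quantile(q, total float64, buckets []bucket) float64 {
	rank := q * total
	lowerBound, lowerCount := 0.0, 0.0
	for _, b := range buckets {
		if rank <= b.count {
			return lowerBound + (b.le-lowerBound)*(rank-lowerCount)/(b.count-lowerCount)
		}
		lowerBound, lowerCount = b.le, b.count
	}
	// Rank falls in the +Inf bucket: report the last finite bound.
	return lowerBound
}

func main() {
	buckets := []bucket{{25, 31}, {50, 63}, {100, 168}, {250, 785}}
	// 75th percentile of 1000 observations: rank 750 lands in the 100-250 bucket.
	fmt.Printf("%.1f\n", quantile(0.75, 1000, buckets))
}
```

Because the result is interpolated, its accuracy depends entirely on how well the bucket boundaries match the real distribution, which is the error source discussed in the section on bucketing schemas below.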

17. Using rate() or increase() to Limit a Histogram to Recent Increases

To focus on recent data:

  • rate(): Provides the per-second average rate over a time range.

  • increase(): Gives the total increase over a time range.

Example:

rate(http_request_duration_seconds_bucket[5m])

This calculates the rate of increase for each bucket over the last 5 minutes.

18. Controlling the Smoothing Time Window

The time range specified in rate() or increase() functions controls the smoothing window:

  • Shorter Window: More responsive to recent changes but noisier.

  • Longer Window: Smoother results but less responsive to recent changes.

Choose the window size based on the desired balance between responsiveness and smoothness.
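The effect of the window size can be sketched numerically. This toy model computes the increase between the window's edge samples divided by the window length; real rate() also handles counter resets and extrapolation, which this omits. The sample series is hypothetical:

```go
package main

import "fmt"

// rateOver approximates rate(): the counter increase between the first and
// last samples inside the window, divided by the window in seconds.
// samples[i] is the counter value at second i (one sample per second).
func rateOver(samples []float64, end, windowSec int) float64 {
	return (samples[end] - samples[end-windowSec]) / float64(windowSec)
}

func main() {
	// A counter that mostly grows by 1/s but has a brief burst around t=9.
	samples := []float64{0, 1, 2, 3, 4, 5, 6, 7, 8, 20, 21, 22}
	fmt.Printf("short window: %.2f/s\n", rateOver(samples, 10, 2))  // dominated by the burst
	fmt.Printf("long window:  %.2f/s\n", rateOver(samples, 10, 10)) // smoothed
}
```

The short window reports a rate several times higher than the long one for the same series, which is exactly the noise/responsiveness trade-off described above.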

19. Aggregating Histograms and Percentiles Over Label Dimensions

To aggregate histograms across dimensions:

sum(rate(http_request_duration_seconds_bucket[5m])) by (le)

This sums the rate of increase for each bucket across all instances. You can then apply histogram_quantile() to compute percentiles:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

20. Errors of Quantile Calculation and Bucketing Schemas

Quantile estimation errors can arise due to:

  • Bucket Granularity: Coarse buckets lead to less accurate quantiles.

  • Data Distribution: Uneven distributions can skew results.

  • Interpolation Assumptions: histogram_quantile() assumes a uniform distribution within buckets.

To minimize errors:

  • Use Appropriate Buckets: Choose bucket boundaries that reflect your data distribution.

  • Monitor Bucket Usage: Ensure that most data falls within the defined buckets.

21. Showing Histograms as a Heatmap

Heatmaps provide a visual representation of histogram data over time:

  • X-Axis: Time.

  • Y-Axis: Bucket boundaries.

  • Color Intensity: Frequency of observations.

In Grafana:

  1. Select Heatmap Panel.

  2. Configure Data Source: Use Prometheus as the data source.

  3. Enter Query: For example:

rate(http_request_duration_seconds_bucket[5m])

  4. Adjust Visualization Settings: Set appropriate axes and color schemes.


22. Querying Request Rates Using _count

To calculate the rate of requests:

rate(http_request_duration_seconds_count[5m])

This provides the per-second rate of HTTP requests over the last 5 minutes.

23. Querying Average Request Durations Using _sum and _count

To compute the average request duration:

rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

This divides the total duration by the number of requests, yielding the average duration per request.
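Since the per-second terms of the two rate() expressions cancel, this query reduces to "increase of _sum divided by increase of _count over the window". A sketch of that arithmetic, with hypothetical counter values at the window's edges:

```go
package main

import "fmt"

// avgDuration reproduces rate(_sum[w]) / rate(_count[w]): the average
// duration per request over a window, from counter values at its edges.
func avgDuration(sumStart, sumEnd, countStart, countEnd float64) float64 {
	return (sumEnd - sumStart) / (countEnd - countStart)
}

func main() {
	// Hypothetical window: 12.5s of total handling time across 50 requests.
	fmt.Println(avgDuration(100.0, 112.5, 400, 450)) // 0.25s per request
}
```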

Creating Grafana Dashboards for Prometheus

1. Option A: Running Grafana Using Docker

Step-by-Step Instructions

✅ Prerequisites:

  • Docker installed on your system.

  • Prometheus already running (can also be in Docker).

🔧 Start Grafana using Docker:

docker run -d \
  -p 3000:3000 \
  --name=grafana \
  grafana/grafana

This command:

  • Runs Grafana in the background (-d)

  • Maps Grafana's port 3000 to your local machine

  • Names the container grafana

🧪 Check if it's running:

Visit http://localhost:3000

2. Option B: Running Grafana Using Pre-Built Binaries

✅ Prerequisites:

  • Installed Prometheus

  • Installed Grafana binary for your OS from:
    🔗 https://grafana.com/grafana/download

🧰 Installation Steps:

👉 Windows:

  1. Unzip the downloaded Grafana .zip file.

  2. Open a terminal (cmd) and navigate to the bin folder inside the extracted directory.

  3. Run:

grafana-server.exe

👉 Linux:

tar -zxvf grafana-<version>.linux-amd64.tar.gz
cd grafana-<version>
./bin/grafana-server

Grafana will run on http://localhost:3000.

3. Logging into Grafana

๐Ÿ•น๏ธ First Login

๐Ÿ” Youโ€™ll be asked to change the password on first login.

4. Creating a Prometheus Data Source

📡 Add Prometheus as a Data Source:

  1. In the left sidebar, click the gear icon (⚙️) → Data Sources

  2. Click "Add data source"

  3. Choose "Prometheus"

  4. Under HTTP > URL, enter:

http://localhost:9090

(Replace localhost:9090 with your actual Prometheus URL if it differs)

  5. Click "Save & Test"

    • You should see a green message: ✅ Data source is working

5. Creating a New Dashboard

๐Ÿ› ๏ธ Steps to Create a Dashboard:

  1. Click the โ€œ+โ€ (plus) icon in the left sidebar โ†’ Dashboard

  2. Click โ€œAdd new panelโ€

  3. Youโ€™ll now see a new panel editor with default settings

  4. At the top, name your dashboard (click on the title "New dashboard")

  5. Click Save (floppy disk icon) in the top right โ†’ Give it a name โ†’ Save

6. Creating a Time Series Chart

📈 Steps to Add a Time Series Panel:

  1. In your new dashboard, click "Add new panel"

  2. Choose Visualization type: Time series (left-hand side)

  3. In the Query section:

    • Set Data Source: Prometheus

    • Enter query:

rate(http_requests_total[5m])

  4. Click Run to see the graph populate.

  5. Customize:

    • Panel title, units (like seconds, ms, etc.)

    • Axes (logarithmic or linear)

    • Legend display

  6. Click Apply to save the panel to your dashboard

7. Creating a Gauge Panel

🎯 Steps to Add a Gauge:

  1. Click "Add panel" → In the Visualization options, select Gauge

  2. In the Query box, enter something like:

http_requests_total

  • or a value-producing metric like:

sum(rate(cpu_usage_seconds_total[1m]))

  3. Configure:

    • Min & Max range (example: 0–100 for percentages)

    • Thresholds (to color the gauge: green/yellow/red)

    • Unit: e.g., percent, seconds, req/sec

  4. Click Apply

8. Creating a Table Panel

🧮 Steps to Add a Table Panel:

  1. Click "Add panel"

  2. Select Visualization → Table

  3. In the query section, use a metric that returns multiple labels/values:

topk(5, rate(http_requests_total[1m]))

  4. Under Format:

    • Set to "Table"

    • Adjust time range, value format

  5. Style:

    • Add column aliases

    • Apply unit types (seconds, bytes, %, etc.)

  6. Click Apply

9. Adding Rows to the Dashboard

📋 Organize Panels Using Rows:

  1. In the dashboard view, click the dropdown menu (three-dot icon) in the upper right

  2. Select "Add row"

  3. Enter a name for the row (e.g., "Performance Metrics")

  4. Drag and drop existing panels into this row

  5. Use rows to group related panels:

    • CPU Stats

    • Memory Usage

    • Latency Tracking

📌 Rows can be collapsed/expanded, improving usability in large dashboards.

Final Touches

  • Use "Dashboard Settings" (gear icon at the top) to:

    • Set auto-refresh (e.g., every 10s, 30s, etc.)

    • Set default time range

    • Add dashboard-level variables

Monitoring Linux Host Metrics with Prometheus

1. Downloading and Unpacking the Node Exporter

The Node Exporter is an official Prometheus exporter for exposing hardware and OS metrics from *nix systems.

✅ Steps:

🔗 Download:

Go to: https://prometheus.io/download/#node_exporter

Or directly use:

wget https://github.com/prometheus/node_exporter/releases/download/v1.8.0/node_exporter-1.8.0.linux-amd64.tar.gz

📦 Unpack:

tar -xvf node_exporter-1.8.0.linux-amd64.tar.gz
cd node_exporter-1.8.0.linux-amd64

2. Node Exporter Command-Line Flags

The Node Exporter has many flags to control which metrics it exposes.

🔧 Common Flags:

./node_exporter \
  --web.listen-address=":9100" \
  --web.telemetry-path="/metrics" \
  --collector.cpu \
  --collector.meminfo \
  --collector.diskstats

📚 Flag Details:

| Flag | Description |
| --- | --- |
| --web.listen-address | Address/port to serve metrics on (default :9100) |
| --web.telemetry-path | Path where metrics are exposed (default /metrics) |
| --collector.<name> | Enable an individual collector (use --no-collector.<name> to disable one) |

You can list all collectors with:

./node_exporter --help

3. Running the Node Exporter

🟢 Start Node Exporter (basic way):

./node_exporter

It will start serving metrics on:
👉 http://localhost:9100/metrics

🚀 Run in Background (production):

nohup ./node_exporter > node_exporter.log 2>&1 &

Or, create a systemd service (recommended for servers):

sudo nano /etc/systemd/system/node_exporter.service

Paste:

[Unit]
Description=Node Exporter
After=network.target

[Service]
User=nobody
ExecStart=/path/to/node_exporter

[Install]
WantedBy=default.target

Enable & start:

sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

4. Inspecting the Node Exporter's /metrics Endpoint

Open in browser or curl:

curl http://localhost:9100/metrics

📋 You'll see raw Prometheus metrics like:

# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
node_cpu_seconds_total{cpu="0",mode="user"} 3452.92
node_memory_MemAvailable_bytes 123456789
node_filesystem_size_bytes{...} 1099511627776

These are the real-time system stats exposed to Prometheus.
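A sample line of this text format can be pulled apart programmatically. The sketch below uses a simplified regular expression (the real exposition format also allows timestamps, label-value escaping, and exemplars, which this ignores):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// lineRe is a simplified pattern for one sample line of the Prometheus text
// exposition format: metric name, optional {label} block, numeric value.
var lineRe = regexp.MustCompile(`^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$`)

// parseSample splits a sample line into its name, raw label block, and value.
func parseSample(line string) (name, labels string, value float64, err error) {
	m := lineRe.FindStringSubmatch(line)
	if m == nil {
		return "", "", 0, fmt.Errorf("not a sample line: %q", line)
	}
	value, err = strconv.ParseFloat(m[3], 64)
	return m[1], m[2], value, err
}

func main() {
	name, labels, value, _ := parseSample(`node_cpu_seconds_total{cpu="0",mode="user"} 3452.92`)
	fmt.Println(name, labels, value)
}
```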

5. Scraping the Node Exporter with Prometheus

🔧 Modify prometheus.yml config:

Add the Node Exporter as a static target:

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

If Node Exporter runs on another host, replace localhost with that IP or hostname.

๐Ÿ” Restart Prometheus:

bashCopyEdit./prometheus --config.file=prometheus.yml

Or if using systemd:

sudo systemctl restart prometheus

6. Verifying Successful Target Scrapes

✅ Go to Prometheus UI:

Visit: http://localhost:9090/targets

You should see:

job: node_exporter
target: localhost:9100
last scrape: <time>
status: UP

This confirms Prometheus is successfully scraping metrics.

7. Querying Node Exporter Metrics (CPU and Network Usage)

🧠 Example PromQL Queries:

🧮 CPU Usage (total per mode):

rate(node_cpu_seconds_total{mode!="idle"}[5m])

🧠 Memory Available:

node_memory_MemAvailable_bytes

📡 Network Received:

rate(node_network_receive_bytes_total[1m])

📤 Network Transmitted:

rate(node_network_transmit_bytes_total[1m])

💽 Disk Space Used:

(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes

8. Showing Host Metrics in Grafana

📺 Visualizing in Grafana:

📌 Prerequisites:

  • Prometheus added as a data source in Grafana.

Steps to Create a System Monitoring Dashboard:

  1. Create a new dashboard → Add Panel

  2. Use these queries:

🧠 CPU Load (Time series):

rate(node_cpu_seconds_total{mode="user"}[5m])

📡 Network Usage (Table or Graph):

rate(node_network_receive_bytes_total[1m])
rate(node_network_transmit_bytes_total[1m])

💽 Disk Usage (Gauge):

100 * (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes

🧠 Memory Usage (Gauge):

100 * (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes

Optional: Import Official Grafana Dashboard

  1. Go to Grafana → Dashboards → Import

  2. Use Dashboard ID: 1860 (Node Exporter Full)

  3. Choose your Prometheus data source → Import

This provides a rich pre-built monitoring dashboard.

Don't Make These 6 Prometheus Monitoring Mistakes

Mistake 1: Cardinality Bombs

🔥 Problem:

Creating a high number of unique time series by using too many or high-variance labels (e.g., user IDs, IP addresses, request paths) causes cardinality explosions, which:

  • Consume excessive memory and CPU

  • Slow down queries and alert evaluations

  • Can crash Prometheus

🧨 Example:

http_requests_total{user_id="1234", session_id="a9b8c7"}

If every user and session has unique IDs, this results in millions of time series.

✅ Best Practices:

  • Avoid using high-cardinality labels like user_id, session_id, request_path, etc.

  • Use static or bounded labels like status, method, instance.

  • Aggregate in queries, or drop unneeded labels with metric relabeling, instead of exploding the series count.
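The danger scales multiplicatively: the potential series count is the product of each label's distinct values. A back-of-envelope sketch (the label names and cardinalities are hypothetical):

```go
package main

import "fmt"

// seriesEstimate returns the number of potential time series for one metric:
// the product of the distinct-value counts of each label.
func seriesEstimate(cardinalities map[string]int) int {
	total := 1
	for _, n := range cardinalities {
		total *= n
	}
	return total
}

func main() {
	// Bounded labels stay cheap...
	fmt.Println(seriesEstimate(map[string]int{"method": 5, "status": 6})) // 30
	// ...but one unbounded label (e.g. a per-user ID) explodes the count.
	fmt.Println(seriesEstimate(map[string]int{"method": 5, "status": 6, "user_id": 100000}))
}
```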

Mistake 2: Aggregating Away Too Many Labels

โš ๏ธ Problem:

When using sum() or avg() without carefully specifying by() labels, you lose context and might aggregate metrics incorrectly.

๐Ÿ˜ตโ€๐Ÿ’ซ Example:

promqlCopyEditsum(rate(http_requests_total[5m]))

This sums all requests from all endpoints, all instances, all statuses โ€” losing all distinguishing information.

๐Ÿง  Solution:

promqlCopyEditsum by (job, instance, status) (rate(http_requests_total[5m]))
  • Keep important dimensional context

  • Aggregate only intentionally based on your alerting or visualization needs

Mistake 3: Unscoped Metric Selectors

💣 Problem:

Writing PromQL like this:

up

…without any scoping labels means querying every single up series from all jobs, across all targets, including exporters and services you might not care about.

🔍 Consequences:

  • Wastes query time and resources

  • Can return noisy or misleading results

  • Hard to debug or tune alerts

✅ Solution:

Scope it!

up{job="my_service"}

Or:

up{job=~"api|frontend"}

Use scoped selectors to reduce noise and make queries faster and more accurate.

Mistake 4: Missing for Durations in Alerting Rules

🚨 Problem:

Creating alerts without a for: clause in the rule leads to instantaneous alerts that fire as soon as a condition is true, even briefly, causing flapping or false positives.

🧨 Example:

- alert: HighCPU
  expr: rate(node_cpu_seconds_total{mode="user"}[1m]) > 0.9

This could fire if CPU spikes just for a second.

✅ Solution:

Add for: to wait before alerting:

- alert: HighCPU
  expr: rate(node_cpu_seconds_total{mode="user"}[1m]) > 0.9
  for: 2m

This ensures the alert only triggers if the condition holds continuously for 2 minutes.

Mistake 5: Too Short Rate Windows

📉 Problem:

Using short windows for rate() or increase() (like [30s]) leads to noisy or erratic results, especially for low-frequency metrics.

🧠 Why?

  • rate() needs multiple samples to give meaningful results

  • Short windows don't smooth over variations or delays

❌ Bad:

rate(http_requests_total[30s])

✅ Good:

rate(http_requests_total[5m])
  • Longer windows provide more stable, statistically accurate results

  • For alerts, use windows like [2m] to [5m]

  • For dashboards, use dynamic durations like $__rate_interval in Grafana

Mistake 6: Using Functions With Incorrect Metric Types

😱 Problem:

Applying PromQL functions meant for one metric type (e.g., counters) to another type (e.g., gauges) leads to invalid or misleading results.

❌ Example:

rate(node_memory_MemAvailable_bytes[5m])

This is incorrect. node_memory_MemAvailable_bytes is a gauge, not a counter, so rate() doesn't make sense.

✅ Solution:

  • Use rate() or increase() only with monotonically increasing counters

  • Use raw gauge values for metrics like memory, disk, temperature

Function Compatibility:

| Function | Works With | Description |
| --- | --- | --- |
| rate() | Counters | Per-second rate of increase over a time window |
| increase() | Counters | Total increase over a time window |
| avg_over_time() | All | Average value over time |
| max_over_time() | All | Maximum value over time |

Summary Table

| Mistake | Root Cause | Consequences | Fix |
| --- | --- | --- | --- |
| 1. Cardinality Bombs | High-cardinality labels | Memory bloat, instability | Remove unbounded labels |
| 2. Over-Aggregation | Aggregating away all labels | Loss of detail, inaccurate alerts | Use by(...) carefully |
| 3. Unscoped Selectors | No filtering in queries | Noisy, inefficient results | Use proper label filters |
| 4. Missing for | No delay in alerts | False positives | Add for: to alert rules |
| 5. Short Rate Windows | Tiny time ranges | Noisy or empty data | Use [2m] to [5m] |
| 6. Wrong Function Use | Using rate() on gauges | Misleading results | Match function to metric type |

Exposing Custom Host Metrics Using the Prometheus Node Exporter

1. 🔍 "textfile" Collector Module Basics

✅ What is the textfile collector?

  • A built-in module of the Node Exporter

  • Reads files containing Prometheus metric data in text exposition format

  • These files must be placed in a designated directory

  • Useful for ad-hoc, one-off, or custom metrics from scripts or non-Go code

📂 How It Works:

  • You create files like my_custom_metric.prom

  • Put them in the directory specified with:

      --collector.textfile.directory=/var/lib/node_exporter/textfile_collector/

  • The Node Exporter reads those files at scrape time and exposes their contents under the /metrics endpoint

📌 Key Notes:

  • Each file should use the .prom extension and be valid Prometheus text format

  • You are responsible for removing or rotating files; the Node Exporter does no cleanup

  • Avoid frequently rewriting large files (keep them small)

2. 🕒 Exposing a Custom Cron Job Metric

Suppose you want to measure the success/failure of a backup script run by cron.

๐Ÿ‘จโ€๐Ÿ’ป Bash Script Example:

bashCopyEdit#!/bin/bash

BACKUP_STATUS=1  # assume failure

if /usr/local/bin/backup.sh; then
  BACKUP_STATUS=0
fi

echo "# HELP backup_success Whether the backup succeeded (1) or failed (0)" > /var/lib/node_exporter/textfile_collector/backup.prom
echo "# TYPE backup_success gauge" >> /var/lib/node_exporter/textfile_collector/backup.prom
echo "backup_success $BACKUP_STATUS" >> /var/lib/node_exporter/textfile_collector/backup.prom

โฑ Cron Job Entry:

cronCopyEdit0 2 * * * /usr/local/bin/backup_metric.sh

This creates or updates /var/lib/node_exporter/textfile_collector/backup.prom every night at 2 AM. Node Exporter will serve that file as part of its /metrics.

🧪 You can query this in Prometheus:

backup_success

3. ๐Ÿง‘โ€๐Ÿ’ป Generating Metric Text Files From Go

You can also generate .prom files from Go programs that gather and export custom metrics.

✅ Step-by-step Example:

Import Required Package

import (
    "fmt"
    "os"
)

Write Metrics to File

func writeCustomMetric(filename string, metricName string, value float64) {
    file, err := os.Create(filename)
    if err != nil {
        panic(err)
    }
    defer file.Close()

    fmt.Fprintf(file, "# HELP %s Custom metric\n", metricName)
    fmt.Fprintf(file, "# TYPE %s gauge\n", metricName)
    fmt.Fprintf(file, "%s %f\n", metricName, value)
}

Usage

func main() {
    writeCustomMetric("/var/lib/node_exporter/textfile_collector/my_metric.prom", "my_custom_gauge", 42.0)
}

Run this Go program periodically (via cron or systemd timer) to update the metric.

4. 📁 "textfile" Collector Example Scripts Repository

There is an official community-maintained repo with example scripts:
🔗 https://github.com/prometheus/node-exporter-textfile-collector-scripts

✅ What's in the repo?

  • Prebuilt scripts to collect metrics like:

    • SMART disk health

    • RAID status

    • Sensors temperature

    • Filesystem usage

    • Battery levels

  • Scripts written in bash, Python, or other languages

  • Designed to drop .prom files in the textfile_collector directory

📂 Directory Structure:

/var/lib/node_exporter/textfile_collector/
├── smartctl.prom
├── sensors.prom
├── custom_ping_check.prom

Each .prom file contains one or more metrics with the appropriate format.

Example content of smartctl.prom:

# HELP smart_disk_ok Whether the disk passed SMART test
# TYPE smart_disk_ok gauge
smart_disk_ok{device="/dev/sda"} 1
smart_disk_ok{device="/dev/sdb"} 0

This allows you to alert on disk failure using PromQL.

Best Practices

| Practice | Recommendation |
| --- | --- |
| File format | Use the .prom extension and valid Prometheus text format |
| File ownership | Ensure the Node Exporter has read access to the files |
| Script errors | Avoid writing invalid or partial .prom files (write to a temp file, then rename) |
| Performance | Don't create too many metrics or files; keep it lean |
| Rotation | Overwrite or remove files regularly to avoid stale metrics |

Sample PromQL Queries

backup_success == 0

Alert if your backup fails.

smart_disk_ok == 0

Detect failing disks.

avg(node_custom_ping_latency_ms) by (target)

Get average ping latency from a script.

Relabeling in Prometheus

1. 🎯 Motivation for Relabeling

Prometheus scrapes targets and attaches labels to their metrics. Sometimes:

  • You want to modify these labels.

  • You want to drop or keep certain targets.

  • You want to rewrite target addresses.

  • You want to extract or clean up metadata from service discovery.

Relabeling provides a powerful and flexible way to transform labels or control scrape behavior.

2. โš™๏ธ Relabeling in the Prometheus Configuration File

Relabeling is configured in your prometheus.yml file under different contexts:

Section | Purpose
relabel_configs | Target relabeling: modifies targets before scraping
metric_relabel_configs | Metric relabeling: modifies individual metrics after scraping
write_relabel_configs (under remote_write) | Modifies labels before sending metrics to remote storage

๐Ÿ”ง Example Layout:

scrape_configs:
  - job_name: 'example'
    static_configs:
      - targets: ['localhost:9100']
    relabel_configs:
      - source_labels: [__address__]
        regex: 'localhost:9100'
        target_label: instance
        replacement: 'my-node'

3. ๐Ÿงญ Relabeling Steps and Flow

Target Relabeling Flow:

  1. Service discovery (SD) returns a list of target groups.

  2. Each target gets label metadata (like __address__, __meta_kubernetes_pod_name, etc.).

  3. These labels go through relabeling steps (relabel_configs).

  4. The resulting targets are scraped if they're not dropped.

Metric Relabeling Flow:

  1. After scraping, each metric passes through metric_relabel_configs.

  2. Metrics can be dropped, relabeled, or kept based on the rules.

4. ๐Ÿงฑ Relabeling Rule Structure and Fields

Each relabeling rule is a YAML dictionary with:

Field | Description
source_labels | List of labels used as input
separator | String used to join multiple source label values (default: ;)
regex | Regular expression matched against the joined string
target_label | Label to write the result to
replacement | Replacement value (may reference capture groups like $1)
action | What to do: replace, keep, drop, hashmod, labelmap, etc.

๐Ÿ›  Example:

- source_labels: [__meta_kubernetes_pod_name]
  regex: '(.*)'
  target_label: pod
  replacement: '$1'
  action: replace
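To make the rule mechanics concrete, here is a minimal Python model of the replace/keep/drop actions. This is an illustrative simplification, not Prometheus's actual implementation; it ignores hashmod, labelmap, and label-name validation:

```python
import re

def relabel(labels, rule):
    """Apply one simplified relabel rule to a dict of labels.

    Returns the (possibly modified) label dict, or None when the
    target/metric is dropped. Models replace/keep/drop only.
    """
    sep = rule.get("separator", ";")
    joined = sep.join(labels.get(name, "") for name in rule.get("source_labels", []))
    # Prometheus fully anchors the regex, hence fullmatch().
    match = re.compile(rule.get("regex", "(.*)")).fullmatch(joined)
    action = rule.get("action", "replace")

    if action == "keep":
        return labels if match else None
    if action == "drop":
        return None if match else labels
    if action == "replace" and match:
        # Translate Prometheus-style $1 / ${1} references to Python's \1.
        repl = re.sub(r"\$\{(\d+)\}|\$(\d+)",
                      lambda m: "\\" + (m.group(1) or m.group(2)),
                      rule.get("replacement", "$1"))
        out = dict(labels)
        out[rule["target_label"]] = match.expand(repl)
        return out
    return labels  # replace with no match: the rule is a no-op
```

For instance, a port-override rule (regex '(.*):.*', replacement '${1}:9100') rewrites {'__address__': '10.0.0.5:8080'} into '10.0.0.5:9100', and a drop rule whose regex matches returns None, removing the target.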

5. ๐Ÿท๏ธ Target Metadata Labels

When using service discovery (e.g., Kubernetes), targets come with metadata labels, prefixed with __meta_.

Examples:

  • __meta_kubernetes_namespace

  • __meta_kubernetes_pod_name

  • __meta_kubernetes_pod_label_app

These are temporary labels used during relabeling and are discarded afterward unless explicitly copied.

6. ๐Ÿงช The Relabeling Visualizer Tool

๐Ÿ”— Prometheus Relabel Debugger: https://relabeler.promlabs.com

This web-based tool lets you:

  • Paste raw label sets

  • Test relabeling rules interactively

  • See how each step transforms your labels

It is especially useful for debugging Kubernetes service discovery labels.

7. ๐Ÿงท Example 1: Setting a Fixed Label Value

Add a new label env="prod" to all targets:

- target_label: env
  replacement: prod
  action: replace

8. ๐Ÿ” Example 2: Overriding the Scrape Port

Force scraping on port 9100 regardless of what SD gives:

- source_labels: [__address__]
  regex: '(.*):.*'
  target_label: __address__
  replacement: '${1}:9100'
  action: replace

9. ๐Ÿ”„ Example 3: Mapping Over Label Patterns

Map all labels with prefix __meta_kubernetes_pod_label_ into real labels:

- action: labelmap
  regex: __meta_kubernetes_pod_label_(.+)

This will turn:

__meta_kubernetes_pod_label_app="nginx"

Into:

app="nginx"

10. โŒ Example 4: Dropping Scraped Samples

Use metric_relabel_configs to drop unwanted metrics:

metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'node_cpu_seconds_total'
    action: drop

Or drop entire targets from scrape:

relabel_configs:
  - source_labels: [__meta_kubernetes_namespace]
    regex: 'test-namespace'
    action: drop
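The inverse action, keep, drops everything that does not match. A common pattern (sketched here using the standard Kubernetes SD annotation metadata label, assuming your pods set the conventional prometheus.io/scrape annotation) scrapes only opted-in pods:

```yaml
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    regex: 'true'
    action: keep
```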

11. ๐Ÿงฉ Debugging Relabeling Rules

๐Ÿ” How to Debug:

  1. Use /targets in Prometheus web UI

    • Shows original labels and post-relabel labels

    • Shows if a target was dropped

  2. Use /api/v1/targets to fetch live target info

  3. Use the PromLabs relabel debugger to simulate complex flows

  4. Run Prometheus with --log.level=debug to see detailed relabeling logs

Summary

Feature | Description
relabel_configs | Change scrape targets and their metadata
metric_relabel_configs | Filter or relabel individual metrics
labelmap | Rename multiple labels based on a regex
drop, keep | Selectively drop or keep targets/metrics
Visual debugging | Use relabeler.promlabs.com

Grafana Heatmaps for Prometheus Histograms

1. Adding and Configuring a Heatmap Panel for Prometheus Histograms

๐Ÿ”ธ What Is a Heatmap Panel?

A heatmap is a two-dimensional chart where:

  • The X-axis usually represents time.

  • The Y-axis represents value buckets (e.g., request durations, response sizes).

  • Color intensity represents the frequency or count of occurrences.

In Prometheus, heatmaps are built from histogram metrics, specifically the _bucket time series from histogram instruments.
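As a reminder of the data shape: each _bucket series carries a cumulative count of observations less than or equal to its le bound. A small sketch, with hypothetical bucket bounds and observations:

```python
# Prometheus histogram buckets are cumulative: each bucket counts every
# observation less than or equal to its `le` (less-than-or-equal) bound.
bounds = [0.1, 0.3, 0.5, 1.0, float("inf")]     # hypothetical le bounds
observations = [0.05, 0.2, 0.2, 0.4, 0.9, 2.5]  # request durations (s)

buckets = {le: sum(1 for o in observations if o <= le) for le in bounds}
# buckets[0.5] includes everything already counted in buckets[0.3];
# the +Inf bucket always equals the histogram's total _count.
```

Grafana de-accumulates these counts when rendering the heatmap, so each row shows only the observations that landed in that specific range.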

Prerequisites

  • You have Prometheus set up and scraping histogram metrics.

  • Example Prometheus metric: http_request_duration_seconds_bucket

  • Grafana is connected to Prometheus as a data source.

Step-by-Step: Adding a Heatmap Panel

๐Ÿ›  Step 1: Open Grafana and Create/Edit a Dashboard

  • Go to Grafana (typically http://localhost:3000).

  • Click "+" → Dashboard.

  • Click "Add New Panel".

  • From the panel type selector, choose "Heatmap".

Step 2: Write the PromQL Query

Use the histogram _bucket metric with a rate() or increase() function:

rate(http_request_duration_seconds_bucket[5m])

Replace the metric with your own histogram bucket metric.

  • rate() shows frequency per second.

  • increase() is used for absolute count over a time window.

Step 3: Group by Bucket (le) and Label

You must group by the le label (less-than-or-equal) to segment by bucket:

sum by (le) (
  rate(http_request_duration_seconds_bucket[5m])
)

If you have other labels (e.g., job, instance), include them as needed.

Step 4: Panel Settings

A. Data Format

  • Format as: Time series buckets (NOT regular time series).

B. Visualization Settings

  • Set the Y-axis to "logarithmic" if your buckets vary widely.

  • Set the Y-axis unit: seconds (s), milliseconds (ms), or your metric's unit.

  • Choose Color scheme: usually gradient or spectrum.

  • Adjust Bucket sort: ascending (for duration buckets).

C. Binning Options

In Display > Binning:

  • X-Axis (time): automatically binned

  • Y-Axis (bucket boundaries):

    • Choose "Series" mode for Prometheus

    • Binning mode: "Auto", or specify your own bucket steps (optional)

Step 5: Save and Observe

  • Click Apply to save the panel.

  • Observe how your metric is distributed across buckets over time.

2. Using and Interpreting the Heatmap Panel

๐Ÿ”Ž Understanding What You See

The heatmap shows how frequently values fall into different buckets over time.

Each horizontal slice (row) = one bucket (e.g., request ≤ 0.3s, ≤ 0.5s, etc.)

Each vertical column = a time slice (e.g., every minute)

Each cell color = frequency (how many requests fell in that range)

Common Use Cases

Use Case | How the Heatmap Helps
Request latency analysis | See whether most requests fall under 0.5s or spike into higher buckets
Memory usage | See how memory allocations vary and cluster around thresholds
Response size | Spot spikes in payload size over time
Application load | View load distribution across histogram buckets

Typical Interpretation Patterns

  • Darker cells: More frequent values in that bucket/time.

  • Sudden color changes: Traffic spike or regression.

  • Wider spread of color across buckets: Latency variability or inconsistent performance.

Example Histogram Metric

If you're using the default Prometheus Go client:

sum by (le) (
  rate(http_request_duration_seconds_bucket{job="my-api"}[5m])
)

This query feeds into a heatmap that shows request duration patterns.

Compare with Other Metrics

Combine heatmap with:

  • _sum: Total request time (for avg calculation).

  • _count: Total request count.

  • Use these with PromQL like:

rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])

This gives the average request duration, which you can cross-check against the heatmap.
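Another useful cross-check, using standard PromQL and assuming the same metric and job labels as above, computes a latency quantile from the very buckets the heatmap displays:

```promql
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{job="my-api"}[5m]))
)
```

If the 95th-percentile line rises while the heatmap darkens in the higher buckets, you are looking at a genuine latency regression rather than a visualization artifact.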
