Getting Started with Prometheus: A Guide

Sahil Naik
8 min read

I hope you read my previous blog on Prometheus, which covers the basics. Let me know what you found most helpful in it. In this blog, we will cover the fundamentals of Prometheus, PromQL, dashboarding and visualization, service discovery, Push Gateway, and monitoring Kubernetes. So, let's get started without wasting any time!


Prometheus Architecture:

Who Does What?

Prometheus works like a data collection and alerting system that continuously pulls metrics from various sources and stores them for monitoring and querying.

🔹 Key Components:
1️⃣ Prometheus Server – The core brain 🧠
2️⃣ Targets/Exporters – The data providers 📡
3️⃣ Time-Series Database (TSDB) – The storage 🗄️
4️⃣ PromQL (Prometheus Query Language) – The data analyzer 📊
5️⃣ Service Discovery – The target finder 📡
6️⃣ Alertmanager – The notifier 🚨
7️⃣ Grafana – The dashboard viewer 📺


🟢 1. Prometheus Server (Core Component)

  • Function: Pulls data from targets, stores it, and processes queries.

  • Contains:

    • Scraper: Fetches data from predefined targets (like your apps, servers, or containers).

    • Storage (TSDB): Stores data as time-series (timestamped records).

    • Query Engine (PromQL): Allows data analysis using queries.

👉 Example:

  • Every 15 seconds, Prometheus asks an app: "Hey, how’s your CPU usage?"

  • The app replies: "Right now, it's 30%."

  • Prometheus saves this info as:

      cpu_usage{instance="server1"} 30  # Metric value 30 at timestamp
    

📡 2. Targets & Exporters (Where Data Comes From)

  • Function: Expose metrics in a format Prometheus understands.

  • Types:
    1️⃣ Direct Targets → Apps that expose /metrics (e.g., a Go app with the Prometheus client library).
    2️⃣ Exporters → Convert third-party data into Prometheus format.

    • Node Exporter → Monitors Linux system metrics.

    • cAdvisor → Monitors Docker container metrics.

    • Kube-State-Metrics → Monitors Kubernetes workloads.

👉 Example:

  • A web app runs on http://app:8000/metrics, and Prometheus scrapes it every 10s.

  • If the app doesn’t expose /metrics, we use an Exporter to bridge the gap.
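To make the direct-target case concrete, here is a minimal sketch of an app exposing /metrics using only Python's standard library. The metric name app_requests_total, the port 8000, and the helpers render_metrics and serve are assumptions for illustration; a real app would normally use an official Prometheus client library instead of formatting the text by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter, keyed by HTTP method.
REQUEST_COUNT = {"GET": 0}

def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = ["# TYPE app_requests_total counter"]
    for method, value in sorted(counts.items()):
        lines.append(f'app_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # Prometheus scrapes this path on every scrape interval.
            body = render_metrics(REQUEST_COUNT).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            # Any other path counts as application traffic.
            REQUEST_COUNT["GET"] += 1
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"hello\n")

def serve(port=8000):
    """Start the demo server; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() starts the app, after which a Prometheus scrape job pointed at port 8000 would read the counter from /metrics.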


🗄️ 3. Time-Series Database (TSDB)

  • Function: Stores collected metrics efficiently.

  • Structure:

    • Timestamp (When the data was recorded).

    • Metric Name (What is being measured).

    • Labels (Extra details like instance="server1").

    • Value (The actual measurement).

👉 Example (TSDB Entry):

http_requests_total{method="GET", status="200"} 1250  # 1250 GET requests recorded
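The structure above can be pictured as a map from a series identity (metric name plus labels) to a growing list of timestamped values. The sketch below is only a logical model with made-up timestamps and helper names (series_key, append, storage); the real TSDB stores samples in compressed on-disk blocks, not Python lists.

```python
from collections import defaultdict

def series_key(name, labels):
    """A series is identified by its metric name plus its full label set."""
    return (name, tuple(sorted(labels.items())))

# series key -> list of (timestamp_ms, value), appended in scrape order
storage = defaultdict(list)

def append(name, labels, timestamp_ms, value):
    storage[series_key(name, labels)].append((timestamp_ms, value))

# Hypothetical scrapes 15 seconds apart (timestamps are invented).
append("http_requests_total", {"method": "GET", "status": "200"}, 1_700_000_000_000, 1250)
append("http_requests_total", {"method": "GET", "status": "200"}, 1_700_000_015_000, 1262)
append("http_requests_total", {"method": "POST", "status": "200"}, 1_700_000_015_000, 40)
```

The two GET samples land in the same series because their name and labels match, while the POST sample starts a second series.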

📊 4. PromQL (Query Engine)

  • Function: Allows analysis of collected metrics.

  • Queries:

    • up → 1 if the target's last scrape succeeded, 0 if it failed.

    • rate(http_requests_total[5m]) → Average per-second request rate over the last 5 minutes.

👉 Example Query & Output:
Query:

sum(rate(cpu_usage_seconds_total[1m])) by (instance)

Output:

instance="server1" → 0.5  (0.5 CPU-seconds used per second, i.e. 50% of one core over the last 1 min)

📡 5. Service Discovery Mechanism (Finder of Targets)

  • Function: Keeps track of which instances (pods, containers, or VMs) are running.

  • Where does it look?

    • Kubernetes API ☸️

    • AWS EC2 API ☁️

    • Docker Swarm API 🐳

    • Consul, ZooKeeper 🔗

    • Custom HTTP SD API 🌐

👉 Example:

  • If a new Pod starts in Kubernetes, Prometheus automatically detects and starts scraping it.

  • If a server shuts down on AWS, Prometheus stops scraping it.
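Conceptually, each discovery refresh compares the targets currently being scraped with the targets the API now reports. The sketch below shows that comparison with plain sets; the function name reconcile and the addresses are hypothetical, and this is a simplification rather than Prometheus's actual implementation.

```python
def reconcile(current, discovered):
    """Return (to_add, to_remove) for two collections of target addresses."""
    current, discovered = set(current), set(discovered)
    to_add = discovered - current      # new pods/VMs: start scraping
    to_remove = current - discovered   # gone pods/VMs: stop scraping
    return to_add, to_remove

# Hypothetical refresh: one instance disappeared, one appeared.
scraping = {"10.0.0.1:9100", "10.0.0.2:9100"}
reported = {"10.0.0.2:9100", "10.0.0.3:9100"}
to_add, to_remove = reconcile(scraping, reported)
```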

🚨 6. Alertmanager (Sends Notifications)

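  • Function: Receives alerts fired by the Prometheus server, then groups, deduplicates, and routes them as notifications.

  • Example Alert Rule (CPU Usage High), evaluated by the Prometheus server: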

      groups:
        - name: high_cpu_alerts
          rules:
            - alert: HighCPU
              expr: sum(rate(cpu_usage_seconds_total[5m])) > 0.9
              for: 2m
              labels:
                severity: critical
              annotations:
                description: "High CPU usage detected!"
    
  • Alerts can be sent to: Slack, Email, PagerDuty, etc.
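The for: 2m clause in the rule above means the expression must stay true for two minutes before the alert fires; until then it is only pending. Here is a simplified sketch of that state logic. The function name alert_state and the sample data are assumptions, and real evaluation happens on Prometheus's rule evaluation interval rather than on an arbitrary list.

```python
def alert_state(samples, threshold, for_seconds):
    """Return the alert state after the last sample.

    samples: list of (timestamp_seconds, value), oldest first.
    """
    active_since = None
    state = "inactive"
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition just became true
            held = ts - active_since
            state = "firing" if held >= for_seconds else "pending"
        else:
            # Condition cleared: the pending timer starts over.
            active_since = None
            state = "inactive"
    return state

# Hypothetical evaluations one minute apart, threshold 0.9, for: 2m.
steady = [(0, 0.95), (60, 0.96), (120, 0.97)]
flapping = [(0, 0.95), (60, 0.50), (120, 0.97)]
```

With these inputs, the steady series has been above the threshold for the full two minutes, while the flapping series restarted its timer at the last sample.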


📺 7. Grafana (Visualization)

  • Function: Displays Prometheus data on dashboards.

  • Example Dashboard:

    • CPU, Memory, Network usage graphs.

    • Alerts when server load is too high.


📌 How Everything Works Together

  1. Targets/Exporters expose data →

  2. Prometheus Server scrapes data every X seconds →

  3. TSDB stores the metrics →

  4. PromQL is used to query & analyze data →

  5. Alertmanager triggers alerts if needed →

  6. Grafana visualizes the data 📊

  • Prometheus pulls data from targets (it doesn’t wait for data to be pushed).

  • It stores metrics as time-series and allows powerful queries.

  • It can trigger alerts when something is wrong.

  • Grafana can be used to display dashboards beautifully.


Prometheus Fundamentals:

  1. Node Exporter 🖥️:

    Node Exporter is a Prometheus exporter that collects system-level metrics (CPU, memory, disk, network, etc.) from a machine (server, VM, or local system) and exposes them to Prometheus for monitoring.

    By default, Prometheus cannot directly monitor system metrics like CPU usage or memory consumption. Node Exporter solves this by exposing those metrics in a Prometheus-compatible format.

    🔹 System Monitoring → Tracks CPU, memory, disk, network, and more
    🔹 Lightweight & Efficient → Runs as a small background process
    🔹 Prometheus-Compatible → Exposes metrics via HTTP (localhost:9100/metrics)

  2. Authentication and Encryption:

    By default, when Prometheus is set up to scrape data from a node, it does not enforce authentication. This means that anyone with access to the target can retrieve the exposed metrics, which could lead to unauthorized data access. To prevent this, it’s crucial to implement authentication and encryption.


Metrics in Prometheus:

  1. Metric Name
    A descriptive name for the thing being measured (e.g., http_requests_total, cpu_usage_seconds_total).

  2. Labels (Optional)
    Key-value pairs used to differentiate different dimensions of the same metric (e.g., method="GET", status="200").

  3. Timestamp
    When the data point was recorded (often automatically managed by Prometheus).

  4. Value
    The actual numeric value of the metric at that time.

Example:

http_requests_total{method="GET", handler="/api", status="200"} 1287
  • Metric name: http_requests_total

  • Labels: method="GET", handler="/api", status="200"

  • Value: 1287 (number of successful GET requests to /api)

  • Timestamp: (implicitly recorded by Prometheus when scraped)
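To see how the parts of a sample line fit together, here is a small parsing sketch. It is deliberately simplified: it assumes label values contain no escaped quotes, and it ignores the optional trailing timestamp that the exposition format allows. The name parse_sample is hypothetical.

```python
import re

# name, optional {labels}, whitespace, value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

def parse_sample(line):
    """Split one sample line into (metric name, labels dict, float value)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))

name, labels, value = parse_sample(
    'http_requests_total{method="GET", handler="/api", status="200"} 1287'
)
```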


What is PromQL?

PromQL stands for Prometheus Query Language. It's the powerful and flexible language you use to query, filter, and analyze metrics stored in Prometheus.


🔹 What You Can Do with PromQL:

  • Select metrics (e.g., http_requests_total)

  • Filter by labels (e.g., method="GET")

  • Perform calculations (e.g., rate of change, averages, percentages)

  • Aggregate data (e.g., by instance, job, or other labels)

  • Generate graphs and alerts


🔸 Basic Syntax Examples

🔍 1. Select a Metric

http_requests_total

Shows all time series with that metric name.


🎯 2. Filter by Label

http_requests_total{method="GET", status="200"}

Filters only the GET requests with status 200.


⏱️ 3. Calculate Rate of Increase

rate(http_requests_total[1m])

Shows the per-second rate of requests over the last 1 minute.
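The idea behind rate() can be sketched in a few lines: add up the counter's increases inside the window, treat any drop as a counter reset, and divide by the elapsed time. The function simple_rate and the sample values are assumptions for illustration; the real rate() also extrapolates to the window boundaries, which this sketch does not do.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when there are fewer than two samples.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # A lower value means the counter restarted from zero.
        increase += value if value < prev else value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# Hypothetical 1-minute window scraped every 15 seconds.
window = [(0, 100), (15, 130), (30, 160), (45, 190), (60, 220)]
per_second = simple_rate(window)
```

Here the counter grew by 120 over 60 seconds, so the result is 2 requests per second.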


πŸ“Š 4. Aggregate by Label

sum(rate(http_requests_total[5m])) by (job)

Shows total request rate per job (like frontend, backend, etc.)


🧠 5. Calculate CPU Usage

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Shows CPU usage (as a %) per instance by subtracting idle time.


PromQL:

🔍 1. Selectors

Selectors are used to specify which time series (metrics) you want to query. There are two main types:

👉 Instant Vector Selector

Selects the latest sample for each matching time series.

http_requests_total{method="GET", status="200"}
  • Metric: http_requests_total

  • Labels: method="GET", status="200"

This returns the current value of all time series that match the labels.


👉 Range Vector Selector

Selects time series data over a time range.

rate(http_requests_total[5m])
  • http_requests_total[5m] = all data points from the past 5 minutes.

  • Used inside functions like rate() or avg_over_time().


🏷️ 2. Matchers

Matchers are used inside label selectors (the {} block) to filter which time series you want.

Common Matchers:

Matcher   Meaning                 Example
=         Equals                  job="api-server"
!=        Not equal               method!="POST"
=~        Regex match             instance=~"server.*"
!~        Regex does not match    status!~"4.." (not 4xx errors)

Example:

http_requests_total{job="frontend", status=~"5.."}

Selects all 5xx status codes from the frontend job.
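The four matchers can be modelled as simple checks against a series' labels. One detail worth showing is that regex matchers are fully anchored, so "5.." matches 503 but not 1503. This is a sketch with a hypothetical matches function; it uses Python's re module, whose syntax differs in places from the RE2 engine Prometheus uses.

```python
import re

def matches(labels, matchers):
    """Check a labels dict against a list of (label, op, pattern) matchers."""
    for label, op, pattern in matchers:
        value = labels.get(label, "")  # a missing label behaves like ""
        if op == "=":
            ok = value == pattern
        elif op == "!=":
            ok = value != pattern
        elif op == "=~":
            ok = re.fullmatch(pattern, value) is not None  # anchored match
        elif op == "!~":
            ok = re.fullmatch(pattern, value) is None
        else:
            raise ValueError(f"unknown matcher operator: {op}")
        if not ok:
            return False
    return True

# Equivalent of {job="frontend", status=~"5.."}
selector = [("job", "=", "frontend"), ("status", "=~", "5..")]
```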


🛠️ 3. Modifiers

Modifiers change how a function behaves or how results are grouped.

a. by / without (Aggregation Modifiers)

Used with aggregations like sum, avg, etc.

  • by(...): keep these labels

  • without(...): drop these labels

sum(rate(http_requests_total[5m])) by (job)

Sum the rate, but group by job.

sum(rate(http_requests_total[5m])) without (instance)

Sum the rate and ignore instance label.
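The difference between by and without comes down to how the grouping key is built: by keeps only the listed labels, while without keeps everything except them. A rough sketch, with hypothetical function names and made-up rates:

```python
from collections import defaultdict

def sum_by(series, keep):
    """Sum values, grouping on only the labels named in `keep`."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels.get(name, "")) for name in sorted(keep))
        totals[key] += value
    return dict(totals)

def sum_without(series, drop):
    """Sum values, grouping on every label except those in `drop`."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        totals[key] += value
    return dict(totals)

# Hypothetical per-instance request rates.
rates = [
    ({"job": "frontend", "instance": "a"}, 1.0),
    ({"job": "frontend", "instance": "b"}, 2.0),
    ({"job": "backend", "instance": "c"}, 4.0),
]
```

Because these series carry only job and instance labels, sum_by(rates, ["job"]) and sum_without(rates, ["instance"]) give the same groups; with more labels present, the two would differ.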


b. on / ignoring (Binary Operator Modifiers)

Used when combining two metrics.

http_requests_total / on(instance) up

Joins metrics only where instance matches.

http_requests_total / ignoring(job) up

Joins metrics but ignores job label during match.
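Binary operators pair up series from the left and right side by comparing label sets, and on / ignoring decide which labels take part in that comparison. The sketch below models one-to-one division only; the function divide and the data are hypothetical, labels exclude the metric name, and real PromQL additionally rejects ambiguous matches unless group_left or group_right is used.

```python
def divide(left, right, on=None, ignoring=()):
    """One-to-one vector division over lists of (labels_dict, value)."""
    def signature(labels):
        if on is not None:
            # on(...): compare only the listed labels
            return tuple((k, labels.get(k, "")) for k in sorted(on))
        # ignoring(...): compare every label except the listed ones
        return tuple(sorted((k, v) for k, v in labels.items() if k not in ignoring))

    right_index = {signature(labels): value for labels, value in right}
    result = []
    for labels, value in left:
        sig = signature(labels)
        if sig in right_index:
            # Note: Python raises on a zero divisor, whereas PromQL
            # would return Inf or NaN instead.
            result.append((dict(sig), value / right_index[sig]))
    return result

# Hypothetical series whose job labels differ between the two sides.
left = [({"instance": "a", "job": "api"}, 10.0)]
right = [({"instance": "a", "job": "node"}, 2.0)]
```

Without a modifier the job labels differ and nothing matches, which is exactly the situation on(instance) or ignoring(job) resolves.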


c. offset

offset shifts the evaluation time backward by a given duration: "Show me the value of this metric X time ago."

<metric_name>[<range>] offset <duration>

OR (for instant vectors):

<metric_name> offset <duration>

⏱ Prometheus Time Units Table

Unit      Suffix   Meaning               Example Usage
Seconds   s        1 second              offset 30s
Minutes   m        60 seconds            offset 5m
Hours     h        60 minutes            offset 1h
Days      d        24 hours              offset 2d
Weeks     w        7 days                offset 1w
Years     y        365 days (not leap)   offset 1y (rare)
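The table translates directly into a lookup from suffix to seconds. The sketch below converts single-unit durations from the table; parse_duration is a hypothetical helper, and it does not handle the millisecond suffix or combined durations such as 1h30m that Prometheus also accepts.

```python
import re

# Seconds per unit, following the table above.
UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 60 * 60,
    "d": 24 * 60 * 60,
    "w": 7 * 24 * 60 * 60,
    "y": 365 * 24 * 60 * 60,
}

def parse_duration(text):
    """Convert a single-unit duration such as '5m' or '2d' to seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]
```

For example, parse_duration("5m") gives 300 and parse_duration("1w") gives 604800.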

✅ Examples in Context

rate(http_requests_total[5m] offset 1h)

→ Rate of requests 1 hour ago over a 5-minute window

up offset 2d

→ Status of targets exactly 2 days ago

avg_over_time(cpu_usage[10m] offset 7d)

→ CPU usage last week at this time, averaged over 10 minutes

