Day 41 of 90 Days of DevOps: Mastering PromQL and Alert Rules in Prometheus

Vaishnavi D
4 min read

Yesterday, I explored the world of Prometheus exporters: lightweight agents that expose metrics from various services and systems for Prometheus to scrape. I worked with node_exporter for system-level metrics and kube-state-metrics for Kubernetes-level insights. This extended Prometheus' visibility into the infrastructure and orchestration layers, making it capable of monitoring everything from node CPU to pod restarts.

Today, I took a deep dive into one of the most powerful tools in Prometheus: PromQL (Prometheus Query Language), the foundation for querying, visualizing, and alerting. This blog post will walk through why PromQL matters, how to write effective queries, and how to convert those queries into real-time alerts.

Why PromQL Matters in DevOps

Prometheus stores every metric as a time series, a stream of timestamped values. While this is extremely powerful, it's also raw and noisy. That’s where PromQL comes in. It's a domain-specific language that lets you:

  • Filter and aggregate time-series data

  • Compare metrics over time

  • Create performance dashboards in Grafana

  • Trigger alerts on specific thresholds

  • Make intelligent decisions about scaling, troubleshooting, or alerting

In short, PromQL is the brain of Prometheus. Without it, metrics remain meaningless.
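To make this concrete, here is the simplest possible query, a bare metric selector with a label filter (assuming node_exporter metrics are already being scraped, as set up yesterday):

node_cpu_seconds_total{mode="idle"}

This returns the raw idle-CPU counter as one time series per CPU core per node. Every query below builds on selectors like this, combining them with functions such as rate() and simple arithmetic.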

Common PromQL Queries for Kubernetes Monitoring

Let’s break down some highly useful PromQL queries you’ll need as a DevOps engineer:

CPU Usage (User Mode)

rate(node_cpu_seconds_total{mode="user"}[5m])

This measures the per-second rate of CPU time spent in user mode, averaged over the last 5 minutes. A sudden spike here can indicate CPU-bound processes.
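Because node_cpu_seconds_total is reported per CPU core, a common follow-up (a sketch of one way to aggregate, not the only one) is to average the rate across cores to get a single user-mode utilization value per node:

avg by (instance) (rate(node_cpu_seconds_total{mode="user"}[5m]))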

Memory Usage Ratio

node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

This returns the fraction of memory that is still available (a value between 0 and 1). If it drops consistently, it could indicate memory leaks or memory-intensive applications.
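Since the result is a ratio between 0 and 1, you can multiply by 100 if a percentage is easier to read, for example:

(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

A value that keeps falling toward single digits is usually a good candidate for a warning alert.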

Pod Restart Rate

rate(kube_pod_container_status_restarts_total[5m])

This tells you how often your containers are restarting. A high restart rate is a strong signal of unstable or crashing pods.
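The raw query returns one series per container, which can get noisy. One common refinement, assuming the standard namespace and pod labels from kube-state-metrics, is to sum the restart rate per pod:

sum by (namespace, pod) (rate(kube_pod_container_status_restarts_total[5m]))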

HTTP Request Error Rate

rate(http_requests_total{status=~"5.."}[5m])

This gives the rate of 5xx server errors in your HTTP requests, useful for application-level monitoring.
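An absolute error rate is often less meaningful than an error ratio. As a rough sketch, assuming your application exposes http_requests_total with a status label, you can divide 5xx errors by total requests:

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

An error ratio that stays above a few percent is usually worth alerting on.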

Disk Space Utilization

node_filesystem_size_bytes - node_filesystem_free_bytes

This calculates used disk space in bytes and helps prevent issues from full disks, which can lead to application crashes.
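Absolute byte counts are hard to compare across differently sized disks, so a percentage-used form is often more practical. A minimal sketch, assuming node_exporter's standard filesystem labels (the fstype filter simply excludes in-memory filesystems):

(1 - node_filesystem_free_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100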

Creating Alert Rules in Prometheus

Monitoring alone is not enough; alerting is what keeps your teams informed when things go wrong.

Prometheus uses a rule-based system to define alerts that are evaluated periodically. When an alerting expression keeps returning results, meaning its condition holds, for the configured duration, the alert fires.

Sample Alert Rule for High CPU Usage

groups:
- name: example-alert
  rules:
  - alert: HighCPUUsage
    expr: rate(container_cpu_usage_seconds_total[5m]) > 0.9
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage detected"
      description: "Container CPU usage is over 90% for the last 2 minutes."

Explanation:

  • expr: PromQL expression that evaluates metric data.

  • for: Alert fires only if the condition holds true for this duration (2 minutes).

  • labels: Metadata such as severity (used by Alertmanager for routing).

  • annotations: Human-readable messages sent in notifications like Slack or email.

These alert rules are typically stored in a file named alerts.yaml or defined as PrometheusRule custom resources if you're using the Prometheus Operator in Kubernetes.
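As a minimal sketch (the file path here is just an assumed example), the rule file is loaded through the rule_files section of prometheus.yml, and promtool can validate it before you reload Prometheus:

# In prometheus.yml (path is an assumed example)
rule_files:
  - "alerts.yaml"

# Validate the rule file syntax before reloading
promtool check rules alerts.yaml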

How Alerting Fits into the Bigger Picture

A well-functioning alert system should:

  • Catch issues before customers do

  • Avoid false positives

  • Prioritize severity (e.g., warning vs. critical)

  • Route alerts to the right teams

To achieve this, Prometheus alert rules are integrated with Alertmanager, which handles deduplication, grouping, silencing, and notification via Slack, email, PagerDuty, etc. (We’ll cover this in detail on Day 43.)

Final Thoughts

Today, I unlocked one of the most essential skills in cloud-native monitoring: PromQL and alerting. Learning PromQL has given me the ability to extract precisely what I need from a sea of metrics, whether I’m debugging a node issue, tracking pod restarts, or setting up production-grade alerts.

Combining metrics and rules makes Prometheus an intelligent system, not just a passive observer. And that's the kind of observability every DevOps team needs.

On Day 42, I’ll take these queries and transform them into beautiful Grafana dashboards, a step that brings data to life visually and makes decision-making effortless.
