Day 41 of 90 Days of DevOps: Mastering PromQL and Alert Rules in Prometheus


Yesterday, I explored the world of Prometheus exporters: lightweight agents that expose metrics from various services and systems in a format Prometheus can scrape. I worked with node_exporter for system-level metrics and kube-state-metrics for Kubernetes-level insights. This extended Prometheus' visibility into the infrastructure and orchestration layers, making it capable of monitoring everything from node CPU to pod restarts.
Today, I took a deep dive into one of the most powerful tools in Prometheus: PromQL (Prometheus Query Language), the foundation for querying, visualizing, and alerting. This blog post will walk through why PromQL matters, how to write effective queries, and how to convert those queries into real-time alerts.
Why PromQL Matters in DevOps
Prometheus stores every metric as a time series, a stream of timestamped values. While this is extremely powerful, it's also raw and noisy. That’s where PromQL comes in. It's a domain-specific language that lets you:
Filter and aggregate time-series data
Compare metrics over time
Create performance dashboards in Grafana
Trigger alerts on specific thresholds
Make intelligent decisions about scaling, troubleshooting, or alerting
In short, PromQL is the brain of Prometheus. Without it, metrics remain meaningless.
Common PromQL Queries for Kubernetes Monitoring
Let’s break down some highly useful PromQL queries you’ll need as a DevOps engineer:
CPU Usage (User Mode)
rate(node_cpu_seconds_total{mode="user"}[5m])
This measures the per-second rate of CPU time spent in user mode, averaged over the last 5 minutes. A sudden spike here can indicate CPU-bound processes.
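Per-core counters can be noisy on multi-core nodes, so it often helps to aggregate. As a small sketch, assuming the default node_exporter labels (cpu, mode, instance), this averages user-mode CPU across all cores of each node; multiply by 100 for a percentage:
avg by (instance) (rate(node_cpu_seconds_total{mode="user"}[5m]))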
Memory Usage Ratio
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
This returns available memory as a fraction of total memory. If it drops consistently, it could indicate memory leaks or memory-intensive applications.
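For dashboards, this is often clearer as a percentage. A minimal variation on the same query:
100 * node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
Flipping the comparison (for example, < 10) gives a natural threshold for a low-memory alert.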
Pod Restart Rate
rate(kube_pod_container_status_restarts_total[5m])
This tells you how often your containers are restarting. A high restart rate is a strong signal of unstable or crashing pods.
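Since restarts are discrete events, increase() over a longer window is often easier to read than a per-second rate. A sketch assuming the standard kube-state-metrics labels namespace and pod, counting restarts per pod over the last hour:
sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[1h]))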
HTTP Request Error Rate
rate(http_requests_total{status=~"5.."}[5m])
This gives the rate of 5xx server errors in your HTTP requests, useful for application-level monitoring.
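An absolute error rate is hard to judge without knowing the traffic volume, so the error ratio is usually more meaningful. A sketch, assuming your application exposes http_requests_total with a status label as above:
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
A ratio above 0.01 (1% of requests failing) is a common starting point for an alert threshold.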
Disk Space Utilization
node_filesystem_size_bytes - node_filesystem_free_bytes
This calculates the used disk space and helps prevent issues from full disks, which can lead to application crashes.
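Expressed as a percentage, and with pseudo-filesystems filtered out, this becomes dashboard- and alert-ready. A sketch using standard node_exporter labels (the fstype filter is illustrative; adjust it for your systems):
100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})
Note that node_filesystem_avail_bytes reflects space available to non-root users, which is usually what applications actually see.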
Creating Alert Rules in Prometheus
Monitoring alone is not enough; alerting is what keeps your teams informed when things go wrong.
Prometheus uses a rule-based system to define alerts that are evaluated periodically. When an alert expression keeps returning results, i.e., the condition holds true continuously for the configured duration, the alert fires.
Sample Alert Rule for High CPU Usage
groups:
  - name: example-alert
    rules:
      - alert: HighCPUUsage
        expr: rate(container_cpu_usage_seconds_total[5m]) > 0.9
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "Container CPU usage is over 90% for the last 2 minutes."
Explanation:
expr: the PromQL expression evaluated against metric data.
for: the alert fires only if the condition holds true for this duration (2 minutes here).
labels: metadata such as severity, used by Alertmanager for routing.
annotations: human-readable messages sent in notifications such as Slack or email.
These alert rules are typically stored in a file such as alerts.yaml and referenced under rule_files in prometheus.yml, or defined as PrometheusRule custom resources if you're using the Prometheus Operator on Kubernetes.
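For the Operator route, here is a minimal sketch of the same alert wrapped in a PrometheusRule resource (the name, namespace, and release label are illustrative; the labels must match the ruleSelector on your Prometheus resource):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alert      # illustrative name
  namespace: monitoring    # illustrative namespace
  labels:
    release: prometheus    # must match your Prometheus ruleSelector
spec:
  groups:
    - name: example-alert
      rules:
        - alert: HighCPUUsage
          expr: rate(container_cpu_usage_seconds_total[5m]) > 0.9
          for: 2m
          labels:
            severity: warning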
How Alerting Fits into the Bigger Picture
A well-functioning alert system should:
Catch issues before customers do
Avoid false positives
Prioritize severity (e.g., warning vs. critical)
Route alerts to the right teams
To achieve this, Prometheus alert rules are integrated with Alertmanager, which handles deduplication, grouping, silencing, and notification via Slack, email, PagerDuty, etc. (We’ll cover this in detail on Day 43.)
Final Thoughts
Today, I unlocked one of the most essential skills in cloud-native monitoring: PromQL and alerting. Learning PromQL has given me the ability to extract precisely what I need from a sea of metrics, whether I’m debugging a node issue, tracking pod restarts, or setting up production-grade alerts.
Combining metrics and rules makes Prometheus an intelligent system, not just a passive observer. And that's the kind of observability every DevOps team needs.
On Day 42, I’ll take these queries and transform them into beautiful Grafana dashboards, a step that brings data to life visually and makes decision-making effortless.