Taking Airflow to Production: Adding Health Metrics on AKS

Adding health checks to an Airflow deployment seems trivial, since the Airflow Helm chart provides the required nuts and bolts. In this post I'll share my journey implementing health monitoring for Airflow on Azure Kubernetes Service (AKS), from basic health checks to comprehensive dashboards and alerting. Here's what I wanted to accomplish:
Implement proper health checks so Kubernetes could detect and recover from failures
Set up alerts to notify our team when something goes wrong
Create visibility through dashboards that show the real-time status of our Airflow platform
Integrate with our existing AKS monitoring infrastructure
Our environment includes:
Airflow 2.10.5
AKS with Prometheus monitoring enabled
Airflow managed via Argo CD
First Step: Adding Health Checks to Airflow
The Airflow Helm chart is a great place to start for these configs. I began by adding health checks to the webserver and scheduler. The provided defaults are a good starting point, but we had to adjust a few of the delay-related settings.
webserver:
  livenessProbe:
    initialDelaySeconds: 60   # Give it a minute to start up
    timeoutSeconds: 5
    failureThreshold: 5       # Don't restart too eagerly
    periodSeconds: 10
    scheme: HTTP
  readinessProbe:
    initialDelaySeconds: 15
    timeoutSeconds: 5
    failureThreshold: 5
    periodSeconds: 10
    scheme: HTTP

scheduler:
  ...
  # If the scheduler stops heartbeating for 5 minutes (5 * 60s), kill the
  # container and let Kubernetes restart it.
  livenessProbe:
    initialDelaySeconds: 10
    timeoutSeconds: 20
    failureThreshold: 5
    periodSeconds: 60
    command: ~
  # Wait at most 1 minute (6 * 10s) for the scheduler container to start up.
  # The livenessProbe kicks in after the first successful startupProbe.
  startupProbe:
    failureThreshold: 6
    periodSeconds: 10
    timeoutSeconds: 20
    command: ~
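Once the chart is applied, it's worth confirming the probes actually landed on the running pods. A quick check is below; it assumes the chart's default component=scheduler pod label, so adjust the selector if you've customised labels.

# Inspect the liveness probe rendered onto the scheduler pod
kubectl -n airflow get pod -l component=scheduler \
  -o jsonpath='{.items[0].spec.containers[0].livenessProbe}'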
Setting Up Monitoring with Prometheus
Refer to this post for details on how to discover and set up Prometheus on AKS.
Creating Airflow Specific Alerts
Identify the alerts required and add them as a PrometheusRule resource.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: airflow-health-alerts
  namespace: airflow
  labels:
    release: prometheus-stack   # This must match your Prometheus ruleSelector
spec:
  groups:
    - name: airflow-health.rules
      rules:
        - alert: AirflowWebserverUnhealthy
          expr: kube_pod_container_status_ready{namespace="airflow", container="webserver"} == 0
          for: 5m
          labels:
            severity: critical
            component: airflow
          annotations:
            summary: "Airflow webserver is not ready"
            description: "Airflow webserver pod has failed readiness probe for more than 5 minutes."
Gotchas
The key part was getting the label selector right. Initially, my rules weren't being picked up because the label didn't match what our Prometheus instance was looking for. Identify the rule selector by inspecting the Prometheus resource created by the operator:
kubectl get prometheus -n monitoring -o yaml
You should see something like the following, which indicates the match criteria your custom rules need to carry:
ruleSelector:
  matchLabels:
    release: prometheus-stack   # Custom PrometheusRule resources must carry this label
Once added successfully, you should be able to browse the rules at http://localhost:9090/rules after port-forwarding the Prometheus service:
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
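If you prefer the command line, the same information is exposed via the Prometheus HTTP API; this quick check assumes jq is installed locally.

# List the rule groups Prometheus has loaded
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'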
Setting Up Alert Notifications
Set up alert notifications using an AlertmanagerConfig resource. Again, ensure the labels match what your Prometheus Operator deployment expects.
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: airflow-simple-alerts
  namespace: airflow
  labels:
    release: prometheus-stack
spec:
  route:
    receiver: 'slack-notifications'
    groupBy: ['alertname', 'component']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
  receivers:
    - name: 'slack-notifications'
      slackConfigs:
        - apiURL:
            name: alertmanager-slack-webhook
            key: url
          channel: '#airflow-alerts'
          sendResolved: true
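Note that apiURL points at a Kubernetes Secret in the same namespace rather than holding the webhook inline. A minimal sketch for creating it (the webhook URL below is a placeholder):

kubectl -n airflow create secret generic alertmanager-slack-webhook \
  --from-literal=url='https://hooks.slack.com/services/<your-webhook-path>'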
Once added, the configuration can be verified on the Alertmanager status page at http://localhost:9093/#/status after port-forwarding:
kubectl port-forward -n monitoring svc/prometheus-stack-kube-prom-alertmanager 9093:9093
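To confirm routing end to end, you can also push a synthetic alert straight into Alertmanager over the same port-forward. This is a sketch using the v2 API; the namespace label is included because the operator typically scopes AlertmanagerConfig routes to their namespace.

curl -s -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestAirflowAlert","component":"airflow","severity":"critical","namespace":"airflow"}}]'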
GitOps: Managing Monitoring as Code
I restructured our Airflow application repository to include the monitoring configurations:
airflow/
├── helm-values.yaml
├── monitoring/
│   ├── prometheusrule.yaml
│   └── alertmanagerconfig.yaml
This way, whenever we updated our Airflow deployment, the monitoring configuration would be updated alongside it.
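As a sketch of how this hangs together in Argo CD: the repo URL, chart version, and project below are placeholders, and multi-source Applications require Argo CD 2.6 or later.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: airflow
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: airflow
  sources:
    # Official Airflow chart, driven by the values file in our repo
    - repoURL: https://airflow.apache.org
      chart: airflow
      targetRevision: 1.15.0            # placeholder chart version
      helm:
        valueFiles:
          - $values/airflow/helm-values.yaml
    # Our config repo: referenced for the values file above and for the raw monitoring manifests
    - repoURL: https://github.com/example/airflow-config.git   # placeholder repo
      targetRevision: main
      ref: values
    - repoURL: https://github.com/example/airflow-config.git   # placeholder repo
      targetRevision: main
      path: airflow/monitoring
  syncPolicy:
    automated:
      prune: true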
Summary
Our final setup includes:
Health checks for all Airflow components
Prometheus rules for alerting on failures
AlertManager configurations for routing to Slack/email
Everything managed through GitOps with Argo CD
Taking Airflow to production requires a comprehensive monitoring approach, and it's a continuous process. The components are all there; you just need to connect them in a way that works for your setup.