Taking Airflow to Production: Adding Health Metrics on AKS

Adding health checks to an Airflow deployment seems trivial, since the Airflow Helm chart provides the required nuts and bolts. In this post I'll share my journey implementing health monitoring for Airflow on Azure Kubernetes Service (AKS), from basic health checks to comprehensive dashboards and alerting. Here's what I wanted to accomplish:
Implement proper health checks so Kubernetes could detect and recover from failures
Set up alerts to notify our team when something goes wrong
Create visibility through dashboards that show the real-time status of our Airflow platform
Integrate with our existing AKS monitoring infrastructure
Our environment includes:
Airflow 2.10.5
AKS with Prometheus monitoring enabled
Airflow managed via Argo CD
First Step: Adding Health Checks to Airflow
The Airflow Helm chart is a great place to start for these configs. I began by adding health checks to the webserver and scheduler. The provided defaults are a good starting point, but we had to adjust a few of the delay-related settings.
webserver:
  livenessProbe:
    initialDelaySeconds: 60   # Give it a minute to start up
    timeoutSeconds: 5
    failureThreshold: 5       # Don't restart too eagerly
    periodSeconds: 10
    scheme: HTTP
  readinessProbe:
    initialDelaySeconds: 15
    timeoutSeconds: 5
    failureThreshold: 5
    periodSeconds: 10
    scheme: HTTP

scheduler:
  ...
  # If the scheduler stops heartbeating for 5 minutes (5 * 60s), kill the
  # container and let Kubernetes restart it.
  livenessProbe:
    initialDelaySeconds: 10
    timeoutSeconds: 20
    failureThreshold: 5
    periodSeconds: 60
    command: ~
  # Wait at most 1 minute (6 * 10s) for the scheduler container to start up.
  # The livenessProbe kicks in after the first successful startupProbe.
  startupProbe:
    failureThreshold: 6
    periodSeconds: 10
    timeoutSeconds: 20
    command: ~
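Once the chart is applied, it's worth confirming the probes actually landed on the running pods. A quick check is below; it assumes the chart's default component=scheduler pod label, so adjust the selector if you've customised labels.

# Inspect the liveness probe rendered onto the scheduler pod
kubectl -n airflow get pod -l component=scheduler \
  -o jsonpath='{.items[0].spec.containers[0].livenessProbe}'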
Setting Up Monitoring with Prometheus
Refer to this post for details on how to discover and set up Prometheus on AKS.
Creating Airflow Specific Alerts
Identify the alerts required and add them as a PrometheusRule resource.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: airflow-health-alerts
  namespace: airflow
  labels:
    release: prometheus-stack   # This must match your Prometheus ruleSelector
spec:
  groups:
    - name: airflow-health.rules
      rules:
        - alert: AirflowWebserverUnhealthy
          expr: kube_pod_container_status_ready{namespace="airflow", container="webserver"} == 0
          for: 5m
          labels:
            severity: critical
            component: airflow
          annotations:
            summary: "Airflow webserver is not ready"
            description: "Airflow webserver pod has failed readiness probe for more than 5 minutes."
Gotchas
The key part was getting the label selector right. Initially, my rules weren't being picked up because the label didn't match what our Prometheus instance was looking for. Identify the rule selector by inspecting the Prometheus resource created by the operator:
kubectl get prometheus -n monitoring -o yaml
You should see something like the following, which indicates the match criteria your custom rules need to carry:
ruleSelector:
  matchLabels:
    release: prometheus-stack   # Custom PrometheusRule resources must carry this label
Once added successfully, you should be able to browse the rules at http://localhost:9090/rules after port-forwarding the Prometheus service:
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
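If you prefer the command line, the same information is exposed via the Prometheus HTTP API; this quick check assumes jq is installed locally.

# List the rule groups Prometheus has loaded
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'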
Setting Up Alert Notifications
Set up alert notifications using an AlertmanagerConfig resource. Again, ensure the labels match what your Prometheus Operator deployment expects.
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: airflow-simple-alerts
  namespace: airflow
  labels:
    release: prometheus-stack
spec:
  route:
    receiver: 'slack-notifications'
    groupBy: ['alertname', 'component']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
  receivers:
    - name: 'slack-notifications'
      slackConfigs:
        - apiURL:
            name: alertmanager-slack-webhook
            key: url
          channel: '#airflow-alerts'
          sendResolved: true
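Note that apiURL points at a Kubernetes Secret in the same namespace rather than holding the webhook inline. A minimal sketch for creating it (the webhook URL below is a placeholder):

kubectl -n airflow create secret generic alertmanager-slack-webhook \
  --from-literal=url='https://hooks.slack.com/services/<your-webhook-path>'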
Once added, the configuration can be verified on the Alertmanager status page at http://localhost:9093/#/status after port-forwarding:
kubectl port-forward -n monitoring svc/prometheus-stack-kube-prom-alertmanager 9093:9093
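To confirm routing end to end, you can also push a synthetic alert straight into Alertmanager over the same port-forward. This is a sketch using the v2 API; the namespace label is included because the operator typically scopes AlertmanagerConfig routes to their namespace.

curl -s -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestAirflowAlert","component":"airflow","severity":"critical","namespace":"airflow"}}]'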
GitOps: Managing Monitoring as Code
I restructured our Airflow application repository to include the monitoring configurations:
airflow/
├── helm-values.yaml
├── monitoring/
│   ├── prometheusrule.yaml
│   └── alertmanagerconfig.yaml
This way, whenever we updated our Airflow deployment, the monitoring configuration would be updated alongside it.
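As a sketch of how this hangs together in Argo CD: the repo URL, chart version, and project below are placeholders, and multi-source Applications require Argo CD 2.6 or later.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: airflow
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: airflow
  sources:
    # Official Airflow chart, driven by the values file in our repo
    - repoURL: https://airflow.apache.org
      chart: airflow
      targetRevision: 1.15.0            # placeholder chart version
      helm:
        valueFiles:
          - $values/airflow/helm-values.yaml
    # Our config repo: referenced for the values file above and for the raw monitoring manifests
    - repoURL: https://github.com/example/airflow-config.git   # placeholder repo
      targetRevision: main
      ref: values
    - repoURL: https://github.com/example/airflow-config.git   # placeholder repo
      targetRevision: main
      path: airflow/monitoring
  syncPolicy:
    automated:
      prune: true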
Summary
Our final setup includes:
Health checks for all Airflow components
Prometheus rules for alerting on failures
AlertManager configurations for routing to Slack/email
Everything managed through GitOps with Argo CD
Taking Airflow to production requires a comprehensive monitoring approach, and it's a continuous process. The components are all there; you just need to connect them in a way that works for your setup.