Dockerizing Apache Airflow with Docker-compose - Part 1
In this series, we are going to look at how to dockerize and deploy an Apache Airflow pipeline. In this first article, we will learn how to dockerize Apache Airflow. You have your pipeline set up and running locally; now it is time to dockerize it and, later, deploy it to production. Let's get to it.
For convenience, we will use the database backup pipeline from the previous article to illustrate how to dockerize and deploy Apache Airflow using docker-compose and Ansible. If you haven't read that article, now is a good time to check it out first.
Why dockerize Airflow?
Suppose you have other developers on your team who will need to contribute (at least at some point) to the pipeline you are building. It is not convenient to have each of them go through the setup required to get Airflow up and running. This is where dockerizing Airflow comes in handy: it saves you and the team the pain of setting up the environment and lets everyone concentrate on fine-tuning and adding more pipelines.
Since Airflow requires multiple components to run, it is reasonable to run these components as independent services in containers. This is especially useful if your pipeline runs heavy workloads and needs more workers to run efficiently. To run the Airflow services in containers, we will use docker-compose.
Airflow services
There are many different Airflow services you can configure with docker-compose depending on your needs. In this article we will configure four main services:
Postgres database: the metadata database where Airflow stores the state of all DAG runs and task instances.
airflow-webserver: the web UI used to trigger DAGs and check the status of DAG runs.
airflow-scheduler: monitors all tasks and DAGs, and triggers task instances once their dependencies are met.
airflow-init: initializes the metadata database and creates the Airflow webserver user on the initial setup.
Configure services in docker-compose
Now let's configure these services in a docker-compose.yml file. Create the file in the root directory of the project and add the code below to it.
---
version: '3'
x-airflow-common:
  &airflow-common
  # In order to add custom dependencies or upgrade provider packages you can use your extended image.
  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
  # and uncomment the "build" line below, Then run `docker-compose build` to build the images.
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.5.1}
  # build: .
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: LocalExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    # For backward compatibility, with Airflow <2.3
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__API__AUTH_BACKENDS: 'airflow.api.auth.backend.basic_auth,airflow.api.auth.backend.session'
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-apache-airflow-providers-sftp apache-airflow-providers-ssh}
  volumes:
    - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
    - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
    - ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins
  user: "${AIRFLOW_UID:-50000}:0"
  depends_on:
    &airflow-common-depends-on
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    restart: always

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - 8080:8080
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type SchedulerJob --hostname "$${HOSTNAME}"']
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    # yamllint disable rule:line-length
    command:
      - -c
      - |
        function ver() {
          printf "%04d%04d%04d%04d" $${1//./ }
        }
        airflow_version=$$(AIRFLOW__LOGGING__LOGGING_LEVEL=INFO && gosu airflow airflow version)
        airflow_version_comparable=$$(ver $${airflow_version})
        min_airflow_version=2.2.0
        min_airflow_version_comparable=$$(ver $${min_airflow_version})
        if (( airflow_version_comparable < min_airflow_version_comparable )); then
          echo
          echo -e "\033[1;31mERROR!!!: Too old Airflow version $${airflow_version}!\e[0m"
          echo "The minimum Airflow version supported: $${min_airflow_version}. Only use this or higher!"
          echo
          exit 1
        fi
        if [[ -z "${AIRFLOW_UID}" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m"
          echo "If you are on Linux, you SHOULD follow the instructions below to set "
          echo "AIRFLOW_UID environment variable, otherwise files will be owned by root."
          echo "For other operating systems you can get rid of the warning with manually created .env file:"
          echo " See: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#setting-the-right-airflow-user"
          echo
        fi
        one_meg=1048576
        mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg))
        cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat)
        disk_available=$$(df / | tail -1 | awk '{print $$4}')
        warning_resources="false"
        if (( mem_available < 4000 )) ; then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m"
          echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))"
          echo
          warning_resources="true"
        fi
        if (( cpus_available < 2 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m"
          echo "At least 2 CPUs recommended. You have $${cpus_available}"
          echo
          warning_resources="true"
        fi
        if (( disk_available < one_meg * 10 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m"
          echo "At least 10 GBs recommended. You have $$(numfmt --to iec $$((disk_available * 1024 )))"
          echo
          warning_resources="true"
        fi
        if [[ $${warning_resources} == "true" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m"
          echo "Please follow the instructions to increase amount of resources available:"
          echo " https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#before-you-begin"
          echo
        fi
        mkdir -p /sources/logs /sources/dags /sources/plugins
        chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins}
        exec /entrypoint airflow version
    # yamllint enable rule:line-length
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_UPGRADE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
      _PIP_ADDITIONAL_REQUIREMENTS: ''
    user: "0:0"
    volumes:
      - ${AIRFLOW_PROJ_DIR:-.}:/sources

  airflow-cli:
    <<: *airflow-common
    profiles:
      - debug
    environment:
      <<: *airflow-common-env
      CONNECTION_CHECK_MAX_COUNT: "0"
    # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252
    command:
      - bash
      - -c
      - airflow

volumes:
  postgres-db-volume:
The code above configures the four services we need to run a containerized Airflow pipeline, plus an optional airflow-cli service (enabled only with the debug profile) that is handy for running ad-hoc Airflow CLI commands.
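The volume mounts in the airflow-common block assume that the dags, logs, and plugins folders live next to the compose file (or under AIRFLOW_PROJ_DIR if you set it). As a rough sketch, and assuming your DAG file from the backup pipeline is called backup_pipeline.py (that file name is just an illustration), the project root would look something like this:
.
├── docker-compose.yml
├── .env                    # optional overrides, see the next section
├── dags/
│   └── backup_pipeline.py  # your DAG(s)
├── logs/
└── plugins/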
Initialize Airflow
To start the services, you first need to initialize Airflow and create a username and password for the web server. Run the command below in the terminal to initialize Airflow.
docker-compose up airflow-init
This will initialize Airflow and create the webserver user. The default username and password are both airflow. You can change the defaults by setting the _AIRFLOW_WWW_USER_USERNAME and _AIRFLOW_WWW_USER_PASSWORD environment variables that the airflow-init service reads, for example in a .env file next to the docker-compose.yml, as shown below.
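As a minimal sketch, a .env file placed next to docker-compose.yml could look like the example below. The variable names come from the compose file above; the values are only placeholders you should replace with your own (on Linux, AIRFLOW_UID should be the output of id -u so that files created in the mounted folders are owned by you and not by root):
# .env -- picked up automatically by docker-compose from the project root
AIRFLOW_UID=50000                     # on Linux, set to the output of `id -u`
_AIRFLOW_WWW_USER_USERNAME=admin      # replaces the default "airflow" username
_AIRFLOW_WWW_USER_PASSWORD=changeme   # replaces the default "airflow" password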
Start Airflow services
After the initialization completes, you can start all the other services. Simply run the command below:
docker-compose up
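If you prefer to get your terminal back, you can also start everything in the background by passing the -d (detached) flag and then follow the logs of a specific service when you need them:
docker-compose up -d
docker-compose logs -f airflow-scheduler   # for example, follow the scheduler logs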
After a few minutes, all the services should be up. You can check their status by running docker ps; you should see something similar to this in the terminal:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a325b33721c8 apache/airflow:2.5.1 "/usr/bin/dumb-init …" About a minute ago Up About a minute (healthy) 0.0.0.0:8080->8080/tcp airflow_airflow-webserver_1
459b0ac8ed4f apache/airflow:2.5.1 "/usr/bin/dumb-init …" About a minute ago Up About a minute (healthy) 8080/tcp airflow_airflow-scheduler_1
b2695abc6fc3 postgres:13 "docker-entrypoint.s…" 2 minutes ago Up 2 minutes (healthy) 5432/tcp airflow_postgres_1
When the services are up and running, you can access the web server by opening http://localhost:8080 in your browser and logging in with the username and password.
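If you want a quick sanity check from the terminal before opening the browser, you can hit the same /health endpoint that the webserver healthcheck in the compose file uses:
curl http://localhost:8080/health
# the JSON response should report both the metadatabase and the scheduler as healthy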
Congrats, you have dockerized your Airflow pipeline!
Stay tuned for the next article, where we will look at how to deploy the dockerized Airflow pipeline using Ansible. Until next time, happy coding!