Airflow Installation Using Docker Compose on Mac Mini

Linetor

1. What is Airflow?

Apache Airflow is an open-source tool for creating and running workflows, providing powerful automation and scheduling capabilities for data pipelines. It allows you to define and execute tasks using DAGs (Directed Acyclic Graphs) and monitor their status easily through a web UI.

  • My favorite feature is the ability to share logs and code.

    • The scheduling function is similar to Cron, but I particularly like that I can check execution results and logs directly from the web.

    • The web UI makes it convenient to view and share code.

    • Airflow is known for its powerful backfill functionality, but I have never actually used it.

      • I didn't use backfill at work; instead, I used task clear to re-run past tasks. I looped over past dates, so setting the execution_date for each run was necessary (a sketch of that loop follows below).
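
For reference, here is a minimal sketch of that loop (not the exact script I used). The DAG id and date range are placeholders, and it assumes the airflow CLI is reachable, for example locally or inside the airflow-cli container. It calls airflow tasks clear once per past day so the scheduler re-runs those task instances:

import subprocess
from datetime import date, timedelta

DAG_ID = "my_dag_id"        # placeholder DAG id
START = date(2024, 2, 1)    # placeholder date range
END = date(2024, 2, 10)

day = START
while day <= END:
    # Clear the task instances for one execution date; --yes skips the confirmation prompt.
    subprocess.run(
        [
            "airflow", "tasks", "clear", DAG_ID,
            "--start-date", day.isoformat(),
            "--end-date", day.isoformat(),
            "--yes",
        ],
        check=True,
    )
    day += timedelta(days=1)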

2. Setting Up the Mac Mini Environment

To run Airflow with Docker Compose on a Mac Mini, I first need to set up the required environment.

2.1 Required Software Installation

To run Airflow on a Mac, you need the following:

  • Docker and Docker Compose (installable via Homebrew)

  • Python 3 (optional, but convenient for writing DAGs)

    • I use pyenv to manage Python versions.

    • Since Airflow runs in Docker, installing Python separately is not necessary.

    • However, having the Airflow package installed locally makes writing DAGs more convenient.

If these are not installed, you can install them using Homebrew:

brew install --cask docker
brew install python

After installing Docker, start Docker Desktop and verify that it runs correctly.

3. Installing and Running Airflow with Docker Compose

3.1 Downloading the Official Airflow Docker Compose File

Retrieve the Docker Compose file from the official Airflow GitHub repository:

mkdir airflow && cd airflow
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.10.3/docker-compose.yaml'

For the latest version:

curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'
  • Note: The commands above may not be exactly what I ran. I had downloaded the Docker Compose file earlier and backed it up on Google Cloud, so I can't confirm the exact method; this is what ChatGPT suggested.

  • Additionally, I run the following services as standalone installations rather than using Docker:

    • PostgreSQL

    • Redis

Thus, I configure the following environment variables in my docker-compose.yaml file:

environment: &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    # NOTE: inside a container, "localhost" refers to the container itself; with
    # Docker Desktop on macOS the host machine is usually reachable as host.docker.internal.
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://id:password!@localhost/airflow_db?sslmode=disable
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://id:password!@localhost/airflow_db
    AIRFLOW__CELERY__BROKER_URL: redis://:@localhost:6379/0
    AIRFLOW__CORE__FERNET_KEY: ""
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: "true"
    AIRFLOW__CORE__LOAD_EXAMPLES: "false"
    AIRFLOW__API__AUTH_BACKENDS: "airflow.api.auth.backend.basic_auth,airflow.api.auth.backend.session"
    # yamllint disable rule:line-length
    # Use simple http server on scheduler for health checks
    # See https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/check-health.html#scheduler-health-check-server
    # yamllint enable rule:line-length
    AIRFLOW__SCHEDULER__ENABLE_HEALTH_CHECK: "true"
    # WARNING: Use _PIP_ADDITIONAL_REQUIREMENTS option ONLY for a quick checks
    # for other purpose (development, test and especially production usage) build/extend Airflow image.
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}

Since I use standalone PostgreSQL and Redis, I comment out the corresponding sections in docker-compose.yaml:

services:
    # postgres:
    #     # image: postgres:13
    #     image: busybox
    #     # environment:
    #     #     POSTGRES_USER: airflow
    #     #     POSTGRES_PASSWORD: airflow
    #     #     POSTGRES_DB: airflow
    #     # volumes:
    #     #     - postgres-db-volume:/var/lib/postgresql/data
    #     healthcheck:
    #         test: ["CMD", "pg_isready", "-h", "192.168.0.1", "-U", "admin"]
    #         interval: 10s
    #         retries: 5
    #         start_period: 5s
    #     # restart: always
    #     command: ["sleep", "infinity"]

    # redis:
    #     # Redis is limited to 7.2-bookworm due to licencing change
    #     # https://redis.io/blog/redis-adopts-dual-source-available-licensing/
    #     # image: redis:7.2-bookworm
    #     image: busybox
    #     # expose:
    #     #     - 6379
    #     healthcheck:
    #         test: ["CMD", "redis-cli", "-h", "192.168.0.2", "ping"]
    #         interval: 10s
    #         timeout: 30s
    #         retries: 50
    #         start_period: 30s
    #     # restart: always
    #     command: ["sleep", "infinity"]

    airflow-webserver:

3.2 Setting Up Environment Variables

Create a .env file for Airflow with the required configurations:

echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env
  • On Mac, setting AIRFLOW_GID=0 helps avoid permission issues (as suggested by ChatGPT).

  • However, I did not include AIRFLOW_GID in my .env file. Instead, I set the project directory there:

AIRFLOW_PROJ_DIR=/path/to/airflow/project/airflow

  • File structure:

├── .env
└── docker-compose.yaml

3.3 Creating the Required Directory Structure

Create the necessary folders for Airflow:

mkdir -p dags logs plugins config
  • Folder structure:

    • The config folder stays empty (there are no files in it).
.
├── config
├── dags
├── logs
└── plugins

3.4 Running Docker Containers

Start the Airflow containers using:

docker-compose up -d

Once all services (Webserver, Scheduler, etc.) are running, you can access the Airflow UI.

3.5 Verifying Web UI Access

Open http://localhost:8080 in your browser to check if the Airflow UI is running.

  • Username: airflow

  • Password: airflow

4. Testing a Simple DAG

4.1 Enabling or Disabling the Example DAGs

Whether the bundled example DAGs are loaded is controlled by the load_examples option in airflow.cfg or, in the Docker Compose setup, by the AIRFLOW__CORE__LOAD_EXAMPLES environment variable (the official docker-compose.yaml ships with it set to "true"). Note that the Airflow CLI has no airflow config set command, so this is changed in the file or the environment, not at runtime.

  • I kept the example DAGs disabled by setting the following in docker-compose.yaml:

      AIRFLOW__CORE__LOAD_EXAMPLES: "false"
    
  • Initially, I enabled example DAGs for testing, but they generated excessive logs, so I disabled them.

  • Log management in Airflow was particularly challenging for me due to the large volume of logs generated.

4.2 Running an Example DAG

Navigate to the DAGs page in the Airflow UI and run the example_bash_operator DAG to verify that everything is working correctly.
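
If you keep the examples disabled, a minimal DAG of your own works just as well for this check. Below is a sketch, with a hypothetical file name dags/hello_world.py and placeholder DAG/task ids; drop it into the mounted dags/ folder and trigger it from the web UI:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_world",              # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,                     # don't create runs for past dates
) as dag:
    BashOperator(
        task_id="say_hello",
        bash_command="echo 'Airflow on the Mac mini is working'",
    )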

5. Optimizing and Troubleshooting Airflow on Mac

5.1 Limiting Container Resources

To prevent Docker from consuming excessive resources on Mac, adjust the CPU and RAM limits in Preferences > Resources in Docker Desktop.

  • Also, be mindful of Disk Usage. On macOS, Docker storage is classified under "System Data," and its size increases proportionally with usage.

    • The main issue isn't losing disk space as such, but the unpredictable growth, which makes the system hard to keep under control.

    • I ran into system crashes twice when disk usage approached 100%; in both cases a restart resolved it.

    • Instead of adjusting Docker's settings, I created a DAG that periodically removes old logs (a sketch appears at the end of this subsection).

  • ChatGPT's recommendations for log management:

    • Configure log settings in airflow.cfg:

      base_log_folder = /path/to/logs
      logging_level = INFO
      log_retention_days = 7

      (Note: log_retention_days is not actually a built-in airflow.cfg option, which is another reason I clean up logs with a DAG instead.)

    • Alternatively, adjust the logging level via docker-compose.yaml by modifying AIRFLOW__LOGGING__LOGGING_LEVEL.
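
For reference, here is a sketch of the kind of cleanup DAG I mean (not my exact DAG). The log path /opt/airflow/logs is the log directory used inside the official image, and the 7-day retention period is an assumption:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="cleanup_airflow_logs",     # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="delete_old_logs",
        # Remove log files older than 7 days from the mounted logs directory.
        bash_command="find /opt/airflow/logs -type f -mtime +7 -delete",
    )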

5.2 Resolving Port Conflicts

By default, Airflow uses port 8080. If this port is already in use by another process, you’ll need to change it. Modify docker-compose.yaml as follows:

  airflow-webserver:
    ports:
      - "9090:8080"

After this, you can access the UI at http://localhost:9090.

5.3 Fixing Volume Permission Issues

On macOS, volume mounting may cause permission issues. Check the volumes section in docker-compose.yaml and, if necessary, adjust permissions using the chmod command.

5.4 Running Multiple DAGs Concurrently and Optimization

  • To execute multiple DAGs simultaneously, increase the max_active_runs_per_dag value in airflow.cfg.

  • If certain DAGs depend on each other, use TriggerDagRunOperator to enforce sequential execution (see the sketch at the end of this section).

  • Prevent system overload by appropriately setting parallelism and dag_concurrency (renamed max_active_tasks_per_dag in Airflow 2.2+) in airflow.cfg:

parallelism = 8
dag_concurrency = 4

With these values, each DAG can run at most 4 task instances at a time, and at most 8 task instances run across the whole installation.
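
As an illustration of the TriggerDagRunOperator approach mentioned above, here is a sketch with placeholder DAG ids (upstream_dag and downstream_dag): the upstream DAG's last task triggers the downstream DAG, so the two always run in order.

from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="upstream_dag",              # placeholder
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Kick off the downstream DAG once the upstream work is done.
    TriggerDagRunOperator(
        task_id="trigger_downstream",
        trigger_dag_id="downstream_dag",  # placeholder: id of the DAG to run next
        wait_for_completion=False,        # set True to block until the downstream run finishes
    )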

6. Conclusion

This guide covered setting up and running Airflow using Docker Compose on a Mac Mini. I tested basic DAG execution and addressed common issues that may arise in a macOS environment.

Airflow enables the creation of complex data pipelines. Future topics to explore include integrating external data sources, adding custom operators, and using the Kubernetes Executor.


Additional Notes: Backfill

Airflow’s Backfill feature is used to retroactively execute DAG runs for missed periods, often necessary when adding or modifying DAGs.

🔹 Understanding Backfill

  • Airflow executes DAGs based on execution_date.

  • If DAG runs were missed or a new DAG needs to process historical data, backfill can be used.

  • Backfill runs DAGs for past dates according to their schedule, ensuring that missing task executions are completed.

🔹 Running Backfill

To manually trigger backfill for a specific DAG over a past period, use the following Airflow CLI command:

airflow dags backfill -s 2024-02-01 -e 2024-02-10 my_dag_id
  • -s 2024-02-01: Start date

  • -e 2024-02-10: End date

  • my_dag_id: DAG ID to run

This command runs my_dag_id from February 1 to February 10, 2024.
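
Each backfilled run receives its own logical date (execution_date), which tasks can use to process that day's data. A minimal sketch with a hypothetical backfill_demo DAG, using the built-in {{ ds }} template (the logical date as YYYY-MM-DD):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="backfill_demo",            # placeholder DAG id
    start_date=datetime(2024, 2, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # {{ ds }} renders to the run's logical date, e.g. 2024-02-01, 2024-02-02, ...
    BashOperator(
        task_id="process_one_day",
        bash_command="echo 'processing data for {{ ds }}'",
    )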

🔹 Things to Consider When Using Backfill

  1. Check catchup Setting

    • If catchup=False in the DAG definition, the scheduler will not automatically create runs for past dates.

    • To have missed past intervals run automatically, set catchup=True:

    dag = DAG(
        'my_dag',
        default_args=default_args,
        schedule_interval='@daily',
        catchup=True  # allow runs for past dates
    )
  2. Optimizing Parallel Execution

    • If processing a large backfill job, optimize execution by adjusting parallelism and max_active_runs_per_dag in airflow.cfg:

      parallelism = 10
      max_active_runs_per_dag = 5
  3. Consider Resource Usage

    • Backfill runs multiple historical DAG executions simultaneously, which increases CPU/memory usage.

    • Adjust scheduler and worker settings accordingly.

🔹 When to Use Backfill

✅ Running a new DAG on historical data
✅ Applying DAG modifications retroactively to past data
✅ Re-executing DAG runs for periods when they failed or weren’t triggered


Additional Notes: Executors

Airflow’s Executor determines how tasks are executed. The main types of Executors are:

  • Since this setup is for a home server, LocalExecutor (or simply running airflow standalone) might be sufficient. However, CeleryExecutor was used here for testing.
  1. SequentialExecutor

    • Executes one task at a time

    • Default executor with SQLite

    • Recommended only for small test environments

  2. LocalExecutor

    • Allows parallel task execution

    • Runs on a single machine using multiprocessing

    • Suitable for development or small production setups

  3. CeleryExecutor

    • Distributes tasks across multiple worker nodes

    • Uses a message broker (Redis, RabbitMQ, etc.)

    • Ideal for large-scale distributed environments

  4. KubernetesExecutor

    • Runs each task in an isolated Kubernetes Pod

    • Provides strong resource isolation and scalability

    • Best for cloud environments

  5. DaskExecutor

    • Uses Dask for distributed execution

    • Supports dynamic scaling and parallel processing

  6. Standalone mode (airflow standalone)

    • Not a separate executor class: the airflow standalone command initializes the database and runs the webserver and scheduler together in one process

    • Useful for quick local test setups; start it with a single airflow standalone command


Additional Notes: Flower

What is Flower?

Flower is a web-based monitoring tool for Celery tasks. If using CeleryExecutor in Airflow, Flower allows tracking worker and task statuses.

Flower’s Key Features

  • Monitor currently running Celery tasks

  • Check the status of individual workers

  • Retry or terminate tasks

  • View execution logs and queue status

Running Flower in Airflow

If using CeleryExecutor, start Flower UI with:

airflow celery flower

In the official docker-compose.yaml, however, Flower is disabled by default behind a Compose profile; the file's comments explain how to enable it:

# You can enable flower by adding "--profile flower" option e.g. docker-compose --profile flower up
# or by explicitly targeted on the command line e.g. docker-compose up flower.

Once running, access Flower UI at http://localhost:5555.

  • I haven't tried running it.

Additional Notes: Using Airflow CLI with Docker Compose

  • In the current docker-compose.yaml, the airflow-cli service is assigned the profile debug.

    • To enable it, use: docker-compose --profile debug up

    • Simply running docker-compose up will not start airflow-cli unless the debug profile is explicitly included.

  • To execute Airflow commands (airflow dags list, etc.), run:

      docker-compose run --rm airflow-cli airflow dags list
    
    • This starts the airflow-cli container, executes the command, and then shuts it down.
  • For an interactive shell inside the container:

docker-compose run --rm airflow-cli bash

Additional Notes: Resolving Disk Usage Issues

  • To free up disk space, periodically clean up unused Docker images, containers, and volumes:

docker system prune -a --volumes

  • To automatically clear old Airflow logs, schedule a cron job such as:

find /path/to/airflow/logs -type f -mtime +7 -delete

Automating Container Restarts

  • To ensure the Airflow containers come back up automatically after a system reboot on macOS, use launchctl or cron to run docker-compose up -d.