DuckDB×Metabase×Object Storage

helianthus

This blog post provides a comprehensive guide on deploying Metabase with the DuckDB plugin using Docker, enabling users to efficiently manage and analyze data in both cloud and local environments. Key topics include:

  1. Background Knowledge 🌐: An introduction to Docker, DuckDB, Metabase, and Object Storage concepts.

  2. Deployment Guide 🛠️:

    • Custom Dockerfile to integrate the DuckDB plugin into Metabase.

    • Docker Compose configurations for testing and production environments, supporting data and plugin persistence.

  3. DuckDB Configuration 🦆📊:

    • Adding DuckDB databases to Metabase in two modes: Memory Mode and Path Mode (including S3 support).

    • A detailed comparison of these modes to help users choose the best setup.

  4. S3 Data Querying ☁️:

    • Configuring DuckDB to access S3 with credentials.

    • Examples of querying Parquet files directly from S3.


Background Knowledge

Docker 🐳

An open-source platform designed to simplify the development, deployment, and operation of applications. By packaging applications and their dependencies into standardized units called containers, Docker provides a lightweight virtualization solution, ensuring applications run consistently across any environment.
Prepare a Docker environment as per the official guide and verify Docker availability:

sudo docker compose version

DuckDB 🦆

An embedded online analytical processing (OLAP) database management system designed for fast analytical queries. It supports various data formats like Parquet and CSV and efficiently queries large-scale datasets locally or in cloud storage.
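To give a taste of this, the sketch below (using a hypothetical local file `events.parquet`) aggregates directly from a Parquet file with no import or schema-setup step:

```sql
-- Query a Parquet file in place; no loading step required
SELECT user_id, COUNT(*) AS event_count
FROM read_parquet('events.parquet')   -- hypothetical file
GROUP BY user_id
ORDER BY event_count DESC
LIMIT 10;
```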

Metabase 📊

An open-source business intelligence (BI) tool that provides an intuitive interface to query, analyze, and visualize data. Users can create charts and dashboards and share insights with their teams without writing SQL.

Object Storage ☁️

A data storage architecture for unstructured data such as documents, images, audio, and video files. Object storage treats data as objects, each containing the data itself, metadata, and a unique identifier. It is commonly used in cloud storage services like Amazon S3.


Getting Started 🚀

Before proceeding, read the following guidelines:

  1. Running Metabase on Docker: Deploy Metabase using Docker.

  2. Metabase Third-party Drivers: Community drivers such as DuckDB are supported only on self-hosted Metabase, not on Metabase Cloud.

  3. Metabase DuckDB Driver: DuckDB evolves quickly, so always use the latest version of the plugin to take advantage of new features. Check the GitHub repository regularly for updates, and follow its installation and configuration instructions for best results.


1. Deploying Metabase with the DuckDB Plugin 🛠️

1.1 Dockerfile Configuration (Adding the DuckDB Plugin) 🐳📥

Customize a Dockerfile to build a Metabase container and add the DuckDB plugin to the plugin directory. Below is a sample Dockerfile:

# Based on OpenJDK 19 on Debian
FROM openjdk:19-buster

# Set up the plugin directory
ENV MB_PLUGINS_DIR=/home/plugins/

# Download Metabase and the DuckDB plugin
ADD https://downloads.metabase.com/v0.52.4/metabase.jar /home
ADD https://github.com/MotherDuck-Open-Source/metabase_duckdb_driver/releases/download/0.2.12/duckdb.metabase-driver.jar /home/plugins/

# Set plugin permissions
RUN chmod 744 /home/plugins/duckdb.metabase-driver.jar

# Start Metabase
CMD ["java", "-jar", "/home/metabase.jar"]
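
With this Dockerfile saved in an otherwise empty directory, the image can be built and tagged to match the testing compose file below (the `metaduck:latest` tag is an assumption carried through this guide):

```shell
# Build the custom Metabase + DuckDB image from the Dockerfile above
docker build -t metaduck:latest .
```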

1.2 Docker Compose Configuration (Testing Environment) 🧪

Below is a minimal Docker Compose configuration for quickly testing Metabase with the DuckDB plugin:

version: "3.7"

services:
  metaduck:
    image: metaduck:latest
    container_name: metaduck
    environment:
      - MB_PLUGINS_DIR=/home/plugins  # Specify the plugin directory
    ports:
      - "3000:3000"  # Map Metabase Web UI port
    restart: unless-stopped  # Restart container if it stops
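
One possible sequence for bringing the stack up (assuming the compose file sits in the current directory and the `metaduck:latest` image has already been built):

```shell
# Start the stack in the background
docker compose up -d

# Follow startup logs; Metabase logs "Metabase Initialization COMPLETE" when ready
docker compose logs -f metaduck

# The Web UI should then respond on port 3000
curl -I http://localhost:3000
```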

1.3 Docker Compose Configuration (Production Environment) 🏭

After validation, use the following production-level Docker Compose configuration to ensure persistence and reliability:

version: "3.9"
services:
  metabase:
    build:
      context: .  # Build context directory
      dockerfile: Dockerfile  # Explicitly specify the Dockerfile
    image: metabase_with_duckdb:latest  # Custom image name
    container_name: metaduck  # Container name
    hostname: metaduck
    volumes:
      - /dev/urandom:/dev/random:ro  # Prevent entropy depletion issues
      - /srvdata/metaduck/data:/metabase-data  # Persist Metabase data
      - /srvdata/metaduck/plugins:/home/plugins  # Persist plugins
      - /srvdata/metaduck/duckdb:/home/plugins/duckdb  # Persist DuckDB plugin data
    ports:
      - 3000:3000  # Map Metabase Web UI port
    environment:
      MB_DB_TYPE: postgres
      MB_DB_DBNAME: metabaseappdb  # PostgreSQL database name
      MB_DB_PORT: 5432  # PostgreSQL port
      MB_DB_USER: metabase  # Database username
      MB_DB_PASS: mysecretpassword  # Database password
      MB_DB_HOST: postgres  # PostgreSQL container name
      MB_PLUGINS_DIR: /home/plugins  # Specify plugin directory
    networks:
      - metanet1  # Network isolation
    healthcheck:
      test: curl --fail -I http://localhost:3000/api/health || exit 1  # Health check
      interval: 15s
      timeout: 5s
      retries: 5
    restart: unless-stopped  # Restart container if it stops

  postgres:
    image: postgres:latest
    container_name: postgres
    hostname: postgres
    volumes:
      - /srvdata/postgres/data:/var/lib/postgresql/data  # Persist PostgreSQL data
    environment:
      POSTGRES_USER: metabase
      POSTGRES_DB: metabaseappdb
      POSTGRES_PASSWORD: mysecretpassword
    networks:
      - metanet1
    restart: unless-stopped

networks:
  metanet1:
    driver: bridge

In production environments, higher availability and persistent configurations are often required. In a Docker Compose file, you can specify persistent volumes to ensure that Metabase and DuckDB data are not lost when the container is restarted. For example:

  • /srvdata/metaduck/data: Persists Metabase data.

  • /srvdata/metaduck/duckdb: Persists DuckDB data.

  • /srvdata/metaduck/plugins: Persists Metabase plugins.

This configuration ensures that even if the container stops and restarts, the Metabase data, DuckDB plugin, and configuration will be retained, preventing data loss and maintaining consistency across restarts.


2. Configuring DuckDB in Metabase 🦆📊

When adding a DuckDB database in Metabase, you can choose between Memory Mode and Path Mode. Each mode has its own characteristics, suited to different use cases.

DuckDB Mode Comparison: Memory vs. Path

| Feature | Memory Mode | Path Mode (Local) |
| --- | --- | --- |
| Storage Location | In-memory, temporary | Local disk |
| Performance | High (no disk I/O overhead) | Moderate (limited by local disk I/O) |
| Persistence | None (data lost on disconnect or restart) | Persistent (data stored on disk) |
| Scalability | Limited by available memory | Limited by local disk capacity |
| Suitable Use Case | Quick tests, temporary data loading | Single-user environments, small-scale data |
| Cost | Low (no additional storage cost) | Low (local storage only) |

Selection Recommendations

Memory Mode

  • Suitable Use Case: Quick testing, temporary loading, and analysis of small-scale data.

  • Limitations: Data is not persistently stored, making it suitable for short-term tasks only and not ideal for production environments.

Path Mode (Local Storage)

  • Suitable Use Case: Single-user environments and small-scale projects that need long-term data storage.

  • Limitations: Limited by local disk capacity, making it difficult to collaborate in multi-user or distributed environments.
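
The practical difference is whether tables survive a restart. A sketch of path mode from the SQL editor, assuming the persisted plugin-data directory from the production compose file above and a hypothetical source file:

```sql
-- Path mode: ATTACH a file-backed database so created tables persist
ATTACH '/home/plugins/duckdb/analytics.duckdb' AS analytics;  -- path is an assumption

CREATE TABLE analytics.events AS
SELECT * FROM read_parquet('events.parquet');  -- hypothetical file

-- In memory mode, the same CREATE TABLE would vanish on restart
```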


2.1 Setting S3 Access Keys in DuckDB

After configuring DuckDB, you can create a new SQL query to initialize the S3 access key configuration. Below is the SQL to set the S3 access key:

CREATE PERSISTENT SECRET secret1 (
    TYPE S3,
    KEY_ID 'YOURID',  -- Replace with your AWS Access Key ID
    SECRET 'YOURKEY',  -- Replace with your AWS Secret Access Key
    ENDPOINT 'YOURENDPOINT',  -- Replace with your S3 endpoint, e.g., oss-cn-hangzhou.aliyuncs.com
    SCOPE 's3://ex-dataset',  -- Specify your S3 bucket path
    REGION 'YOURREGION'  -- Specify the region, e.g., cn-hangzhou
);

This SQL statement will persist an S3 access configuration named secret1 in DuckDB, which will be used for future S3 data access.


2.2 Temporary vs. Persistent Secrets

DuckDB provides two types of secrets to manage sensitive configurations securely: Temporary Secrets and Persistent Secrets. Here are the key differences:

| Feature | Temporary Secrets | Persistent Secrets |
| --- | --- | --- |
| Storage | Stored in memory, not persisted | Stored persistently in the database |
| Lifetime | Exists only for the current session | Exists across sessions until deleted |
| Use Case | Short-term, ephemeral tasks or testing | Long-term, production configurations |
| Command | CREATE TEMPORARY SECRET | CREATE PERSISTENT SECRET |
| Data Loss on Restart | Yes | No |
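
For comparison, a temporary secret can be created with a plain CREATE SECRET (temporary is the default when PERSISTENT is omitted); placeholder credentials shown:

```sql
-- Temporary secret: held in memory, discarded when the session ends
CREATE SECRET temp_s3 (
    TYPE S3,
    KEY_ID 'YOURID',
    SECRET 'YOURKEY',
    REGION 'YOURREGION'
);
```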

In this documentation, we focus on Persistent Secrets, which are better suited for production use cases where configurations need to be retained across multiple sessions.


2.3 Key Features of DuckDB Secrets Manager

Key points from the DuckDB Secrets Manager documentation:

  1. Centralized Configuration: Secrets Manager allows sensitive data such as API keys and database credentials to be securely stored and reused in queries.

  2. Scope Restriction: Secrets can be scoped to specific datasets or paths (e.g., s3://bucket-name/), adding an extra layer of security.

  3. Ease of Use: Configured secrets can be easily referenced by SQL functions like which_secret.

  4. Security: Secrets are encrypted and stored securely, ensuring they are not exposed in plain text.

  5. Management Commands:

    • Create a Secret: Use CREATE PERSISTENT SECRET or CREATE TEMPORARY SECRET.
    • List Secrets: Secrets can be queried for their existence and scope.
    • Delete Secrets: Remove secrets with DROP SECRET (temporary) or DROP PERSISTENT SECRET.
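
For instance, existing secrets can be inspected through the duckdb_secrets() table function (sensitive fields are redacted in its output):

```sql
-- List configured secrets and whether each one is persistent
SELECT name, type, persistent, scope FROM duckdb_secrets();
```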

Why Persistent Secrets?

We choose Persistent Secrets in this setup for the following reasons:

  • They are suitable for production environments where configurations need to persist across sessions.
  • They reduce the need to repeatedly configure access credentials, saving time and avoiding errors.
  • They provide a secure, centralized way to manage secrets for long-term use.

For more details, refer to the DuckDB Secrets Manager documentation.


2.4 Checking and Managing Secrets

Checking the Secret

To verify if the S3 configuration was successfully applied, use the which_secret query:

FROM which_secret('s3://ex-dataset/ext-file.parquet', 's3');

This will return the S3 access key configuration that is currently in use.

Note: The S3 secret configuration might take a few seconds to become effective. If the secret is not immediately recognized, wait a few moments before running the query again.

Deleting the Secret

If you need to remove the created S3 configuration, use the DROP PERSISTENT SECRET command:

DROP PERSISTENT SECRET secret1;

This section ensures that you can configure DuckDB in Metabase, set up S3 access, and query data stored in S3 directly.

3. Querying S3 Tables in Metabase ☁️🔍

3.1 Querying Parquet Files
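
Both query styles below rely on DuckDB's httpfs extension for http(s):// and s3:// URLs. Recent DuckDB builds typically autoload it on first use, but it can also be loaded explicitly:

```sql
INSTALL httpfs;  -- one-time download of the extension
LOAD httpfs;     -- enable http(s):// and s3:// access in this session
```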

  1. Test S3 and file accessibility:

     SELECT * FROM read_parquet('http://hostname/ex-dataset/ext-file.parquet');
    
  2. Query using the S3 protocol and verify credentials:

     SELECT * FROM read_parquet('s3://ex-dataset/ext-file.parquet');
    

3.2 Saving Models 📂

  1. Save the query as a model:

    • In the query interface, click Save.

    • Choose Save as Model.

  2. Edit metadata:

    • Add field descriptions, tags, and usage notes.
  3. Reuse the model:

    • Use saved models for dashboards or as data sources for new queries.

Collaboratively edited with ChatGPT for educational and discussion purposes.
© Helianthus. Sharing is encouraged; please credit the source.

