DuckDB×Metabase×Object Storage


This blog post provides a comprehensive guide on deploying Metabase with the DuckDB plugin using Docker, enabling users to efficiently manage and analyze data in both cloud and local environments. Key topics include:
- Background Knowledge 🌐: An introduction to Docker, DuckDB, Metabase, and Object Storage.
- Deployment Guide 🛠️:
  - A custom Dockerfile that integrates the DuckDB plugin into Metabase.
  - Docker Compose configurations for testing and production environments, with data and plugin persistence.
- DuckDB Configuration 🦆📊:
  - Adding DuckDB databases to Metabase in two modes: Memory Mode and Path Mode (including S3 support).
  - A detailed comparison of these modes to help you choose the best setup.
- S3 Data Querying ☁️:
  - Configuring DuckDB to access S3 with credentials.
  - Examples of querying Parquet files directly from S3.
Background Knowledge
Docker 🐳
An open-source platform designed to simplify the development, deployment, and operation of applications. By packaging applications and their dependencies into standardized units called containers, Docker provides a lightweight virtualization solution, ensuring applications run consistently across any environment.
Prepare a Docker environment as per the official guide and verify that Docker Compose is available:

```bash
sudo docker compose version
```
DuckDB 🦆
An embedded online analytical processing (OLAP) database management system designed for fast analytical queries. It supports various data formats like Parquet and CSV and efficiently queries large-scale datasets locally or in cloud storage.
Metabase 📊
An open-source business intelligence (BI) tool that provides an intuitive interface to query, analyze, and visualize data. Without writing SQL, users can create charts, dashboards, and share insights with teams.
Object Storage ☁️
A data storage architecture for unstructured data such as documents, images, audio, and video files. Object storage treats data as objects, each containing the data itself, metadata, and a unique identifier. It is commonly used in cloud storage services like Amazon S3.
Getting Started 🚀
Before proceeding, read the following guidelines:
- Running Metabase on Docker: deploy Metabase using Docker.
- Metabase 3rd-party Drivers: currently, only self-hosted Metabase supports DuckDB.
- Metabase DuckDB Driver: DuckDB is a rapidly evolving project, so always use the latest version of the plugin. Check the GitHub repository regularly for updates, and follow the official documentation for installation and configuration.
1. Deploying Metabase with the DuckDB Plugin 🛠️
1.1 Dockerfile Configuration (Adding the DuckDB Plugin) 🐳📥
Customize a Dockerfile to build a Metabase container and add the DuckDB plugin to the plugin directory. Below is a sample Dockerfile:
```dockerfile
# Based on OpenJDK 19 on Debian
FROM openjdk:19-buster

# Set the plugin directory
ENV MB_PLUGINS_DIR=/home/plugins/

# Download Metabase and the DuckDB plugin
ADD https://downloads.metabase.com/v0.52.4/metabase.jar /home
ADD https://github.com/MotherDuck-Open-Source/metabase_duckdb_driver/releases/download/0.2.12/duckdb.metabase-driver.jar /home/plugins/

# Make the plugin readable
RUN chmod 744 /home/plugins/duckdb.metabase-driver.jar

# Start Metabase
CMD ["java", "-jar", "/home/metabase.jar"]
```
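With the Dockerfile in place, the image can be built locally. The tag `metaduck:latest` is an arbitrary choice that matches the image name used in the testing Compose file below; adjust it if you prefer a different name.

```shell
# Build the custom Metabase image from the Dockerfile in the
# current directory; the tag must match the Compose file's "image:"
sudo docker build -t metaduck:latest .
```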
1.2 Docker Compose Configuration (Testing Environment) 🧪
Below is a minimal Docker Compose configuration for quickly testing Metabase with the DuckDB plugin:
```yaml
version: "3.7"
services:
  metaduck:
    image: metaduck:latest
    container_name: metaduck
    environment:
      - MB_PLUGINS_DIR=/home/plugins  # Specify the plugin directory
    ports:
      - "3000:3000"  # Map the Metabase Web UI port
    restart: unless-stopped  # Restart the container if it stops
```
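Assuming the image has been built and tagged `metaduck:latest`, the testing stack can be started and checked as follows (a `docker-compose.yml` in the current directory is assumed):

```shell
# Start the testing stack in the background
sudo docker compose up -d

# Follow the logs until Metabase reports it is ready
sudo docker compose logs -f metaduck

# Once Metabase is up, the health endpoint should return HTTP 200
curl -I http://localhost:3000/api/health
```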
1.3 Docker Compose Configuration (Production Environment) 🏭
After validation, use the following production-level Docker Compose configuration to ensure persistence and reliability:
```yaml
version: "3.9"
services:
  metabase:
    build:
      context: .               # Build context directory
      dockerfile: Dockerfile   # Explicitly specify the Dockerfile
    image: metabase_with_duckdb:latest  # Custom image name
    container_name: metaduck   # Container name
    hostname: metaduck
    volumes:
      - /dev/urandom:/dev/random:ro                    # Prevent entropy-depletion issues
      - /srvdata/metaduck/data:/metabase-data          # Persist Metabase data
      - /srvdata/metaduck/plugins:/home/plugins        # Persist plugins
      - /srvdata/metaduck/duckdb:/home/plugins/duckdb  # Persist DuckDB database files
    ports:
      - 3000:3000  # Map the Metabase Web UI port
    environment:
      MB_DB_TYPE: postgres
      MB_DB_DBNAME: metabaseappdb    # PostgreSQL database name
      MB_DB_PORT: 5432               # PostgreSQL port
      MB_DB_USER: metabase           # Database username
      MB_DB_PASS: mysecretpassword   # Database password
      MB_DB_HOST: postgres           # PostgreSQL container name
      MB_PLUGINS_DIR: /home/plugins  # Specify the plugin directory
    depends_on:
      - postgres  # Start PostgreSQL before Metabase
    networks:
      - metanet1  # Network isolation
    healthcheck:
      test: curl --fail -I http://localhost:3000/api/health || exit 1
      interval: 15s
      timeout: 5s
      retries: 5
    restart: unless-stopped  # Restart the container if it stops
  postgres:
    image: postgres:latest
    container_name: postgres
    hostname: postgres
    volumes:
      - /srvdata/postgres/data:/var/lib/postgresql/data  # Persist PostgreSQL data
    environment:
      POSTGRES_USER: metabase
      POSTGRES_DB: metabaseappdb
      POSTGRES_PASSWORD: mysecretpassword
    networks:
      - metanet1
    restart: unless-stopped
networks:
  metanet1:
    driver: bridge
```
Production environments usually require higher availability and persistent configuration. In the Docker Compose file, persistent volumes ensure that Metabase and DuckDB data are not lost when the container restarts:
- /srvdata/metaduck/data: persists Metabase application data.
- /srvdata/metaduck/duckdb: persists DuckDB database files.
- /srvdata/metaduck/plugins: persists Metabase plugins.
With this configuration, even if the container stops and restarts, the Metabase data, the DuckDB plugin, and your configuration are retained, preventing data loss and keeping the deployment consistent across restarts.
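Before the first start, the host directories referenced by the volume mounts need to exist. The paths below follow the Compose file above; adjust them to your environment.

```shell
# Create the host directories used by the volume mounts
sudo mkdir -p /srvdata/metaduck/{data,plugins,duckdb} /srvdata/postgres/data

# Build the image and start the production stack in the background
sudo docker compose up -d --build
```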
2. Configuring DuckDB in Metabase 🦆📊
When adding a DuckDB database in Metabase, you can choose between Memory Mode and Path Mode. Each mode has its own characteristics and suits different use cases.
DuckDB Mode Comparison: Memory vs. Path

| Feature | Memory Mode | Path Mode (Local) |
| --- | --- | --- |
| Storage location | In-memory, temporary | Local disk |
| Performance | High (no disk I/O overhead) | Moderate (limited by local disk I/O) |
| Persistence | None (data lost on disconnect or restart) | Persistent (data stored on disk) |
| Scalability | Limited by available memory | Limited by local disk capacity |
| Suitable use case | Quick tests, temporary data loading | Single-user environments, small-scale data |
| Cost | Low (no additional storage cost) | Low (local storage only) |
Selection Recommendations
Memory Mode
Suitable Use Case: Quick testing, temporary loading, and analysis of small-scale data.
Limitations: Data is not persistently stored, making it suitable for short-term tasks only and not ideal for production environments.
Path Mode (Local Storage)
Suitable Use Case: Single-user environments, projects that require long-term data storage but on a small scale.
Limitations: Limited by local disk capacity, making it difficult to collaborate in multi-user or distributed environments.
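In practice, the choice comes down to the value entered in the DuckDB database file field of Metabase's "Add database" form. The values below are illustrative: `:memory:` is DuckDB's conventional in-memory database name, and the file path is a hypothetical example under the persisted volume from the Compose file above.

```
# Memory Mode: use DuckDB's special in-memory database name
:memory:

# Path Mode: point to a database file inside the container,
# e.g. under the persisted volume (file name is an example)
/home/plugins/duckdb/analytics.duckdb
```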
2.1 Setting S3 Access Keys in DuckDB
After configuring DuckDB, create a new SQL query in Metabase to initialize the S3 access configuration:

```sql
CREATE PERSISTENT SECRET secret1 (
    TYPE S3,
    KEY_ID 'YOURID',          -- Replace with your AWS Access Key ID
    SECRET 'YOURKEY',         -- Replace with your AWS Secret Access Key
    ENDPOINT 'YOURENDPOINT',  -- Replace with your S3 endpoint, e.g. oss-cn-hangzhou.aliyuncs.com
    SCOPE 's3://ex-dataset',  -- Restrict the secret to your S3 bucket path
    REGION 'YOURREGION'       -- Specify the region, e.g. cn-hangzhou
);
```

This statement persists an S3 access configuration named secret1 in DuckDB, which will be used for subsequent S3 data access.
2.2 Temporary vs. Persistent Secrets
DuckDB provides two types of secrets to manage sensitive configurations securely: Temporary Secrets and Persistent Secrets. Here are the key differences:
| Feature | Temporary Secrets | Persistent Secrets |
| --- | --- | --- |
| Storage | Stored in memory, not persisted | Stored persistently in the database |
| Lifetime | Current session only | Across sessions, until deleted |
| Use case | Short-term, ephemeral tasks or testing | Long-term, production configurations |
| Command | CREATE TEMPORARY SECRET | CREATE PERSISTENT SECRET |
| Data lost on restart | Yes | No |
In this documentation, we focus on Persistent Secrets, which are better suited for production use cases where configurations need to be retained across multiple sessions.
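For comparison, the same configuration could be created as a session-scoped secret instead. This is a sketch using the same placeholders as the persistent example above; the secret disappears when the session ends.

```sql
-- A temporary secret lives only for the current session and is
-- never written to disk (placeholders as in the persistent example)
CREATE TEMPORARY SECRET temp_secret (
    TYPE S3,
    KEY_ID 'YOURID',
    SECRET 'YOURKEY',
    REGION 'YOURREGION'
);
```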
2.3 Key Features of DuckDB Secrets Manager
Key points from the DuckDB Secrets Manager documentation:
- Centralized configuration: the Secrets Manager lets sensitive data such as API keys and database credentials be stored securely and reused across queries.
- Scope restriction: secrets can be scoped to specific datasets or paths (e.g., s3://bucket-name/), adding an extra layer of security.
- Ease of use: configured secrets can be inspected with SQL functions such as which_secret.
- Security: secret values are redacted when listed, so credentials are not exposed in plain text in query results.
- Management commands:
  - Create a secret: CREATE PERSISTENT SECRET or CREATE TEMPORARY SECRET.
  - List secrets: query duckdb_secrets() to see existing secrets and their scopes.
  - Delete a secret: DROP PERSISTENT SECRET (or DROP SECRET for temporary secrets).
Why Persistent Secrets?
We choose Persistent Secrets in this setup for the following reasons:
- They are suitable for production environments where configurations need to persist across sessions.
- They reduce the need to repeatedly configure access credentials, saving time and avoiding errors.
- They provide a secure, centralized way to manage secrets for long-term use.
For more details, refer to the DuckDB Secrets Manager documentation.
2.4 Checking and Managing Secrets
Checking the Secret
To verify that the S3 configuration was successfully applied, use the which_secret function:

```sql
FROM which_secret('s3://ex-dataset/ext-file.parquet', 's3');
```
This will return the S3 access key configuration that is currently in use.
Note: The S3 secret configuration might take a few seconds to become effective. If the secret is not immediately recognized, wait a few moments before running the query again.
Deleting the Secret
If you need to remove the created S3 configuration, use the DROP PERSISTENT SECRET command:

```sql
DROP PERSISTENT SECRET secret1;
```
This section ensures that you can configure DuckDB in Metabase, set up S3 access, and query data stored in S3 directly.
3. Querying S3 Tables in Metabase ☁️🔍
3.1 Querying Parquet Files
First test plain HTTP access to confirm the file itself is reachable:

```sql
SELECT * FROM read_parquet('http://hostname/ex-dataset/ext-file.parquet');
```

Then query through the S3 protocol to verify the stored credentials:

```sql
SELECT * FROM read_parquet('s3://ex-dataset/ext-file.parquet');
```
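Once a single file reads correctly, the same mechanism extends to whole prefixes via glob patterns. The bucket and pattern below are illustrative examples:

```sql
-- Scan every Parquet file under the bucket with a glob pattern
-- and aggregate across them (bucket name is an example)
SELECT count(*) AS rows_total
FROM read_parquet('s3://ex-dataset/*.parquet');
```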
3.2 Saving Models 📂
To save a query as a reusable model:
1. In the query interface, click Save.
2. Choose Save as Model.
3. Edit the metadata: add field descriptions, tags, and usage notes.
4. Reuse the model in dashboards or as a data source for new queries.
Collaboratively edited with ChatGPT for educational and discussion purposes.
© Helianthus. Sharing is encouraged; please credit the source.