Extract data from Databases into DuckLake
In the ever-evolving landscape of data engineering, the tools we use are constantly getting better, faster, and more efficient. DuckLake is one such innovation, building on the phenomenal success of DuckDB to offer a robust, ACID-compliant data lake format. It's designed for scalability and flexibility, supporting various backends for both its catalog and data storage.
But how do you get your data into DuckLake? Whether your data lives in a production PostgreSQL database, a MySQL instance, or any other database, you need a simple and powerful way to extract and load it.
This is where Sling comes in. Sling is a modern data movement tool designed to make transferring data between different sources and destinations as easy as possible. In this article, we'll walk you through how to use Sling to extract data from most databases and load it directly into your DuckLake instance. The list of connections that Sling supports continues to grow. You can see the full list here, but it covers all the major platforms, including ClickHouse, DuckDB, Google BigQuery, Google Bigtable, MariaDB, MongoDB, MotherDuck, MySQL, Oracle, PostgreSQL, Prometheus, Redshift, Snowflake, SQL Server, SQLite, StarRocks, Databricks and more.
What is DuckLake?
Before we dive in, let's quickly recap what makes DuckLake special. DuckLake is a data lake format specification that combines the power of DuckDB with flexible catalog backends and scalable data storage. It provides versioned, ACID-compliant tables, bringing database-like reliability to your data lake. See DuckLake's website for more details: https://ducklake.select/.
Key features include:
- Flexible Catalog: Use DuckDB, SQLite, PostgreSQL, or MySQL as your catalog backend.
- Scalable Storage: Store your data files locally, or in cloud storage like AWS S3, Azure Blob Storage, or Google Cloud Storage.
Step 1: Configure Your Connections
First things first, we need to tell Sling how to connect to our source database and our target DuckLake instance. We'll use the sling conns command, which makes managing connections a breeze.
Source Database: PostgreSQL
Let's assume our source data is in a PostgreSQL database. See the complete list of databases sling can connect to here.
We can set up a connection named PG_CONN like this:
sling conns set PG_CONN type=postgres host=mypg.host user=myuser password=mypass port=5432 database=analytics
# or use environment variable
export PG_CONN='postgresql://myuser:mypass@mypg.host:5432/analytics?sslmode=require'
You can then run sling conns test pg_conn to ensure it can successfully connect.
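Once the connection tests pass, you can also ask Sling to list the schemas and tables it sees in the source, which is handy before writing a replication config. Here's a minimal example using the sling conns discover command (the exact output format depends on your Sling version):

# list the streams (tables) available in the source connection
sling conns discover PG_CONN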
Target: DuckLake
Configuring DuckLake involves specifying the catalog and the data storage path. See the documentation for details here.
For this example, we'll use a local SQLite file for the catalog and a local directory for the data. This setup is great for multi-client access on the same machine.
Here's how to set up a connection named DUCKLAKE_CONN:
sling conns set DUCKLAKE_CONN type=ducklake catalog_type=sqlite catalog_conn_string=ducklake_catalog.db data_path=./ducklake_data
# or use environment variable
export DUCKLAKE_CONN='{
  type: ducklake,
  catalog_type: sqlite,
  catalog_conn_string: "ducklake_catalog.db",
  data_path: "./ducklake_data"
}'
After setting these, you can verify they're configured correctly by running sling conns list.
If you used the sling conns set command, your ~/.sling/env.yaml file should now contain these configurations:
connections:
  PG_CONN:
    type: postgres
    host: mypg.host
    user: myuser
    password: mypass
    port: 5432
    database: analytics

  DUCKLAKE_CONN:
    type: ducklake
    catalog_type: sqlite
    catalog_conn_string: ducklake_catalog.db
    data_path: ./ducklake_data
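Before moving on to a full replication file, you can optionally smoke-test the pair of connections with a one-off, ad-hoc run straight from the CLI. In this sketch, public.users is just a placeholder table name; substitute any small table from your source:

# ad-hoc copy of a single table, no YAML needed
# public.users is a placeholder; use any table from your source database
sling run \
  --src-conn PG_CONN \
  --src-stream public.users \
  --tgt-conn DUCKLAKE_CONN \
  --tgt-object public.users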
Step 2: Create the Replication YAML
Now for the fun part. We'll define our data movement task in a simple YAML file. Sling's replication configs are powerful because you can define defaults and replicate many streams (e.g., tables) at once. See here for full documentation.
Let's create a file named db_to_ducklake.yaml:
source: pg_conn
target: ducklake_conn

defaults:
  mode: full-refresh
  object: '{stream_schema}.{stream_table}' # Dynamically name tables in DuckLake

streams:
  # Replicate all tables from the 'public' schema
  public.*:

  # You can also add specific tables or disable some
  public.forbidden:
    disabled: true

  analytics.users:
A few things to note here:
- source and target refer to the connection names we just set up.
- defaults applies the full-refresh mode to all our streams. This means the target tables in DuckLake will be dropped and recreated on each run.
- The object name is the target table and uses the runtime variables {stream_schema} and {stream_table}. Sling will dynamically replace these with the actual schema and table names from the source.
- The real power move is public.*. This single line tells Sling to find all tables in the public schema of our PostgreSQL database and replicate every single one of them.
Step 3: Run the Replication
With our connections and replication file ready, all that's left is to run it:
sling run -r db_to_ducklake.yaml
That's it! Sling will connect to your PostgreSQL database, read the tables from the public and analytics schemas, and write the data into your DuckLake instance, creating the tables and structuring the data files in the ./ducklake_data directory.
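If you'd like to double-check the result outside of Sling, you can open the lake directly with the DuckDB CLI. The sketch below assumes the DuckLake extension's documented ATTACH syntax for a SQLite catalog; adjust it to match your DuckDB and DuckLake versions:

# inspect the loaded tables with the DuckDB CLI (sketch; verify the ATTACH
# connection-string syntax against the DuckLake docs for your versions)
duckdb <<'SQL'
INSTALL ducklake; LOAD ducklake;
INSTALL sqlite; LOAD sqlite;   -- needed when the catalog is SQLite
ATTACH 'ducklake:sqlite:ducklake_catalog.db' AS lake (DATA_PATH './ducklake_data');
USE lake;
SHOW ALL TABLES;               -- list what Sling created
SQL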
Going Further: Incremental Loads and Cloud Storage
This example is just the beginning. You can easily adapt this for more advanced use cases:
- Incremental Loads: Change the mode to incremental and specify a primary_key and/or update_key to only process new or updated records, making your pipelines much more efficient.
- Cloud Storage: To use cloud storage for your data, simply update your DuckLake connection's data_path to an S3, GCS, or Azure URI and provide the necessary credentials.
For example, to use an S3 bucket, your DuckLake connection in env.yaml might look like this:
connections:
  DUCKLAKE_CONN_S3:
    type: ducklake
    catalog_type: postgres
    catalog_conn_string: "host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog"
    data_path: "s3://my-data-lake-bucket/data/"
    s3_access_key_id: "AKIA..."
    s3_secret_access_key: "xxxx"
And here's a replication with incremental mode and custom SQL:
source: pg_conn
target: ducklake_conn

streams:
  analytics.users:
    object: '{stream_schema}.{stream_table}'
    mode: incremental
    primary_key: id
    update_key: updated_at

  custom_stream:
    object: another_schema.report
    sql: |
      select ...
      from ...
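While iterating on a config like this, it's often convenient to run just one stream at a time. Assuming you saved the example above as db_to_ducklake_incremental.yaml (a hypothetical name), Sling's --streams flag lets you limit the run to a subset of the defined streams (double-check the flag against your installed version):

# run only the analytics.users stream from the replication file
sling run -r db_to_ducklake_incremental.yaml --streams analytics.users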
Conclusion
DuckLake brings exciting new capabilities to the world of data lakes, and with Sling, populating it from your existing databases is incredibly straightforward. With just a few lines of configuration, you can build scalable, repeatable, and robust data pipelines to feed your modern data stack.
Ready to give it a try? Install the Sling CLI and check out the official documentation to get started.
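If you don't have the CLI yet, a couple of common install routes look like this (see the Sling docs for the full, up-to-date list of install options for your platform):

# install the Sling CLI (verify commands against the official docs)
pip install sling                        # via Python/pip
brew install slingdata-io/sling/sling    # via Homebrew on macOS/Linux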