Extract data from Databases into DuckLake

Fritz Larco
5 min read

In the ever-evolving landscape of data engineering, the tools we use are constantly getting better, faster, and more efficient. DuckLake is one such innovation, building on the phenomenal success of DuckDB to offer a robust, ACID-compliant data lake format. It's designed for scalability and flexibility, supporting various backends for both its catalog and data storage.

But how do you get your data into DuckLake? Whether your data lives in a production PostgreSQL database, a MySQL instance, or any other database, you need a simple and powerful way to extract and load it.

This is where Sling comes in. Sling is a modern data movement tool designed to make transferring data between different sources and destinations as easy as possible. In this article, we'll walk through how to use Sling to extract data from most databases and load it directly into your DuckLake instance. The list of connections Sling supports keeps growing; you can see the full list here, and it covers all the major platforms, including ClickHouse, DuckDB, Google BigQuery, Google Bigtable, MariaDB, MongoDB, MotherDuck, MySQL, Oracle, PostgreSQL, Prometheus, Redshift, Snowflake, SQL Server, SQLite, StarRocks, Databricks, and more.

What is DuckLake?

Before we dive in, let's quickly recap what makes DuckLake special. DuckLake is a data lake format specification that combines the power of DuckDB with flexible catalog backends and scalable data storage. It provides versioned, ACID-compliant tables, bringing database-like reliability to your data lake. See DuckLake's website for more details: https://ducklake.select/.

Key features include:

  • Flexible Catalog: Use DuckDB, SQLite, PostgreSQL, or MySQL as your catalog backend.
  • Scalable Storage: Store your data files locally, or in cloud storage like AWS S3, Azure Blob Storage, or Google Cloud Storage.
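To make those two pieces concrete, here's a minimal sketch of how a catalog and a data path come together when a DuckLake is attached straight from the DuckDB CLI (this assumes the ducklake extension; the file and directory names are placeholders, and Sling will handle all of this for us anyway):

duckdb <<'SQL'
INSTALL ducklake;
-- a DuckDB file as the catalog, a local directory as the data store
ATTACH 'ducklake:metadata.ducklake' AS my_lake (DATA_PATH 'lake_data/');
-- the same pattern works with other catalog backends, e.g.:
--   ATTACH 'ducklake:sqlite:metadata.sqlite' AS my_lake (DATA_PATH 'lake_data/');
-- and DATA_PATH can just as well point at object storage like s3://bucket/prefix/
USE my_lake;
SQL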

Step 1: Configure Your Connections

First things first, we need to tell Sling how to connect to our source database and our target DuckLake instance. We'll use the sling conns command, which makes managing connections a breeze.

Source Database: PostgreSQL

Let's assume our source data is in a PostgreSQL database. See the complete list of databases Sling can connect to here.

We can set up a connection named PG_CONN like this:

sling conns set PG_CONN type=postgres host=mypg.host user=myuser password=mypass port=5432 database=analytics

# or use environment variable
export PG_CONN='postgresql://myuser:mypass@mypg.host:5432/analytics?sslmode=require'

You can then run sling conns test PG_CONN to make sure Sling can connect successfully.
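Before wiring up the target, it can also help to see what's available on the source. Sling's conns discover command lists the schemas and tables visible through a connection (a quick sketch; the exact output shape may vary by version):

# optional: list the tables Sling can see through this connection
sling conns discover PG_CONN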

Target: DuckLake

Configuring DuckLake involves specifying the catalog and the data storage path. See the documentation for details here.

For this example, we'll use a local SQLite file for the catalog and a local directory for the data. This setup is great for multi-client access on the same machine.

Here's how to set up a connection named DUCKLAKE_CONN:

sling conns set DUCKLAKE_CONN type=ducklake catalog_type=sqlite catalog_conn_string=ducklake_catalog.db data_path=./ducklake_data

# or use environment variable
export DUCKLAKE_CONN='{ 
  type: ducklake, 
  catalog_type: sqlite,
  catalog_conn_string: "ducklake_catalog.db",
  data_path: "./ducklake_data"
}'

After setting these, you can verify they're configured correctly by running sling conns list.
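Concretely, both connections can be checked in one go (these are the same test and list commands mentioned above):

# both connections should report a successful test
sling conns test PG_CONN
sling conns test DUCKLAKE_CONN

# and both should appear in the list
sling conns list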

If you used the sling conns set command, your ~/.sling/env.yaml file should now contain these configurations:

connections:
  PG_CONN:
    type: postgres
    host: mypg.host
    user: myuser
    password: mypass
    port: 5432
    database: analytics

  DUCKLAKE_CONN:
    type: ducklake
    catalog_type: sqlite
    catalog_conn_string: ducklake_catalog.db
    data_path: ./ducklake_data

Step 2: Create the Replication YAML

Now for the fun part. We'll define our data movement task in a simple YAML file. Sling's replication configs are powerful because you can define defaults and replicate many streams (e.g., tables) at once. See here for full documentation.

Let's create a file named db_to_ducklake.yaml:

source: pg_conn
target: ducklake_conn

defaults:
  mode: full-refresh
  object: '{stream_schema}.{stream_table}' # Dynamically name tables in DuckLake

streams:
  # Replicate all tables from the 'public' schema
  public.*:

  # You can also add specific tables or disable some
  public.forbidden:
    disabled: true

  analytics.users:

A few things to note here:

  • source and target refer to the connection names we just set up.
  • defaults applies the full-refresh mode to all our streams. This means the target tables in DuckLake will be dropped and recreated on each run.
  • The object name is the target table; it uses the runtime variables {stream_schema} and {stream_table}, which Sling dynamically replaces with the actual schema and table names from the source.
  • The real power move is public.*. This single line tells Sling to find all tables in the public schema of our PostgreSQL database and replicate every single one of them.
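For instance, if the public schema happened to contain users and orders tables (hypothetical names, just for illustration), the wildcard would behave as if we had written the streams out by hand:

streams:
  public.users:
  public.orders:
  public.forbidden:
    disabled: true

  analytics.users:

With the {stream_schema}.{stream_table} template from defaults, these land in DuckLake as public.users, public.orders, and analytics.users, while public.forbidden is skipped.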

Step 3: Run the Replication

With our connections and replication file ready, all that's left is to run it:

sling run -r db_to_ducklake.yaml

That's it! Sling will connect to your PostgreSQL database, read the tables from the public and analytics schemas, and write the data into your DuckLake instance, creating the tables and structuring the data files in the ./ducklake_data directory.
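If you'd like to peek at the result outside of Sling, one option is to open the lake with the DuckDB CLI and its ducklake extension, pointing it at the same SQLite catalog and data path we configured (a hedged sketch; the table name is hypothetical and depends on what your source actually contains):

duckdb <<'SQL'
INSTALL ducklake;
INSTALL sqlite;
-- attach using the catalog file and data path from our DUCKLAKE_CONN settings
ATTACH 'ducklake:sqlite:ducklake_catalog.db' AS lake (DATA_PATH './ducklake_data');
USE lake;
SHOW ALL TABLES;
SELECT count(*) FROM public.users;  -- hypothetical table from the source
SQL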

Going Further: Incremental Loads and Cloud Storage

This example is just the beginning. You can easily adapt this for more advanced use cases:

  • Incremental Loads: Change the mode to incremental and specify a primary_key and/or update_key to only process new or updated records, making your pipelines much more efficient.
  • Cloud Storage: To use cloud storage for your data, simply update your DuckLake connection's data_path to an S3, GCS, or Azure URI and provide the necessary credentials.

For example, to use an S3 bucket, your DuckLake connection in env.yaml might look like this:

connections:
  DUCKLAKE_CONN_S3:
    type: ducklake
    catalog_type: postgres
    catalog_conn_string: "host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog"
    data_path: "s3://my-data-lake-bucket/data/"
    s3_access_key_id: "AKIA..."
    s3_secret_access_key: "xxxx"

And here's a replication with incremental mode and custom SQL:

source: pg_conn
target: ducklake_conn

streams:

  analytics.users:
    object: '{stream_schema}.{stream_table}'
    mode: incremental
    primary_key: id
    update_key: updated_at

  custom_stream:
    object: another_schema.report
    sql: |
      select ...
      from ...
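Save that to its own file (say, db_to_ducklake_incremental.yaml, a name chosen just for this example) and run it the same way. Depending on your Sling version, you can also limit a run to specific streams with the --streams flag (check sling run --help to confirm):

sling run -r db_to_ducklake_incremental.yaml

# optionally target just one stream
sling run -r db_to_ducklake_incremental.yaml --streams analytics.users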

Conclusion

DuckLake brings exciting new capabilities to the world of data lakes, and with Sling, populating it from your existing databases is incredibly straightforward. With just a few lines of configuration, you can build scalable, repeatable, and robust data pipelines to feed your modern data stack.

Ready to give it a try? Install the Sling CLI and check out the official documentation to get started.
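If you don't have the CLI installed yet, two common routes are shown below (these mirror my understanding of the install docs, so check there for the latest options for your platform):

# via Python's package manager
pip install sling

# or via Homebrew
brew install slingdata-io/sling/sling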
