Reactions to Ducklake


At my day job, I’ve been working with Apache Iceberg over the last 9 months or so, and upon hearing of yet another open table format - Ducklake - I was interested to dig in. Here is my reaction.
DuckDB - First of all, I'm a fan of DuckDB; it's really ubiquitous and just so darn easy to get started with. Early on I used DuckDB for my local testing with Apache Iceberg, and I've also used it to read a bunch of CSV and JSON dumps in object storage. It's pretty great, as the vast majority of things I do are MacBook scale - I don't need a full datacenter worth of compute.
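For a sense of what that MacBook-scale workflow looks like, here's a minimal sketch using DuckDB's httpfs and iceberg extensions; the bucket names, paths, and table location are made up for illustration, and credential setup is omitted.

```sql
-- A minimal sketch, not a tuned setup: the bucket, paths, and table
-- location below are placeholders.
INSTALL httpfs;  LOAD httpfs;
INSTALL iceberg; LOAD iceberg;

-- Poke at raw CSV/JSON dumps sitting in object storage.
SELECT count(*) FROM read_csv_auto('s3://my-bucket/dumps/*.csv');
SELECT * FROM read_json_auto('s3://my-bucket/dumps/events-*.json') LIMIT 10;

-- Read an Iceberg table straight from its warehouse location.
SELECT * FROM iceberg_scan('s3://my-bucket/warehouse/db/events') LIMIT 10;
```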
Apache Iceberg - a lot of the fundamentals of Apache Iceberg sound fantastic. Immutable snapshots, bring any query engine you want, hidden partitioning, time travel queries and schema management. What's not to love? But as the Redpanda team implemented an entire Iceberg library in C++ for our Iceberg Topics feature, I couldn't help but notice how complicated it is (sloc reports ~13k lines of code for our write-only use case). There are a bunch of different metadata formats, in both Avro and JSON, an unclear REST catalog spec, and (as with any proper open standard) a variety of implementations with different assumptions about subtle things like URI strings and file locations. I can't imagine how complex the query engine side of things is, either, with three different data file formats and the juggling of metadata caching and IO on object storage just to read manifests.
Ducklake - this is where the initial pitch really landed with me: only Parquet for data files, and good ol' ANSI SQL for metadata. It seems much simpler, and chances are you already have a database like this and know how to operate it. No need for a catalog server! (Yes, REST catalogs did win the Apache Iceberg catalog wars.) Only supporting Parquet also meant they were able to add the footer position - the footer contains a bunch of rich metadata about each Parquet file - to the table metadata; otherwise you'd have to guess at it or read the file itself to get that information. However, it's not all sunshine and lakeside lounging. At least for Redpanda, we would be hard pressed to implement support for these SQL drivers in Seastar, as any driver has to respect our custom buddy memory allocator (hello memory fragmentation!) and CPU scheduler (only half a millisecond of CPU time before you have to yield for IO). We already have an HTTP client available, so for our unique environment a SQL-based catalog would actually be harder to integrate. For the rest of the world, though, there is plenty of good SQL driver support that is trivial to integrate - database/sql, JDBC, etc.
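As a rough illustration of that pitch, here's roughly what the DuckDB ducklake extension with a Postgres-backed catalog looks like; the connection string, DATA_PATH, and table are invented for the example, so check the actual docs for exact syntax.

```sql
-- A rough sketch, assuming the ducklake extension and a Postgres catalog
-- database named "lake_catalog"; exact syntax may differ from the docs.
INSTALL ducklake; LOAD ducklake;
INSTALL postgres; LOAD postgres;

-- Metadata lives in plain SQL tables; data lives as Parquet under DATA_PATH.
ATTACH 'ducklake:postgres:dbname=lake_catalog host=localhost' AS lake
    (DATA_PATH 's3://my-bucket/lake/');

CREATE TABLE lake.events (id BIGINT, payload VARCHAR);
INSERT INTO lake.events VALUES (1, 'hello lakehouse');
SELECT * FROM lake.events;
```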
REST Catalogs - Iceberg REST catalogs keep adding more and more features, from compaction in S3 Tables to Role-Based Access Control (RBAC) in Apache Polaris, plus features in the catalog spec itself like query planning. While these are complicated features, the importance of things like access control should not be taken lightly. Catalogs have the ability to mint scoped credentials for accessing only the authorized data files, and they give you a centralized point to manage access. With Ducklake, I'm not even sure how you'd limit access to what tables exist (maybe with Postgres row level security? I sketch a guess below), let alone mint scoped credentials.
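If you did try to lean on Postgres row level security, a sketch might look like this; the ducklake_table and table_name identifiers are assumptions about what the metadata schema exposes, not something I've verified.

```sql
-- Hypothetical: assume the Ducklake catalog tables live in Postgres and one
-- of them is ducklake_table with a table_name column (illustrative names).
ALTER TABLE ducklake_table ENABLE ROW LEVEL SECURITY;

-- Only let analyst_role see catalog entries for tables it should know about.
CREATE POLICY analyst_visible_tables ON ducklake_table
    FOR SELECT
    TO analyst_role
    USING (table_name LIKE 'analytics_%');
```

Even then, that only hides rows in the metadata database - anyone holding bucket credentials can still read the Parquet files directly, which is exactly the gap that catalog-minted scoped credentials close.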
To summarize, I love the simplicity of Ducklake; it's clearly designed by someone who has felt the pain of Apache Iceberg. If I had to roll my own data warehouse, I'd probably use the Ducklake standard and figure out how to build authorization into the query engine instead of tying it to the table format itself. However, given the industry adoption of Iceberg, I don't think Iceberg's momentum is going to slow down. Time will tell!