Welcome to the age of $10/month Lakehouses

Tobias Müller
18 min read

Recap: Data Warehouses, Data Lakes, Lakehouses?

As a short recap, what do these mean, and how are they differentiated?

Modern Data Warehouses, like Amazon Redshift, Google BigQuery, and Snowflake, offer fast, SQL-optimized performance for structured data and BI workloads. Their columnar storage, advanced indexing, and automatic optimization make them ideal for analytics. However, they come at a premium: storage and compute are tightly coupled, and costs can rise sharply with large-scale, always-on workloads or frequent queries over massive datasets.

Data Lakes, typically built on Amazon S3, Azure Data Lake Storage (ADLS), Google Cloud Storage (GCS), or Cloudflare R2, provide a low-cost solution for storing raw structured and unstructured data. They scale effortlessly and are the foundation of many batch and streaming pipelines. While storage is cheap, performance suffers without significant investment in tooling, such as Apache Spark, Presto, or Hive, to support transformation and query execution. Data governance and consistency also become challenges as systems scale.

The Lakehouse architecture, popularized by platforms like Databricks or Snowflake with table formats like Delta Lake and Apache Iceberg, offers a compelling middle ground. Lakehouses combine the low-cost, schema-flexible storage of data lakes with the performance and ACID transaction support of warehouses. Technologies like Databricks Lakehouse Platform, Snowflake’s Unistore, or Dremio provide unified data management, allowing teams to run SQL analytics, data science, and streaming workloads directly on cloud object storage, without needing separate ETL into a warehouse.

From a cost-performance perspective, Lakehouses reduce redundancy and simplify infrastructure by eliminating the need to maintain both a lake and a warehouse. They support open formats (e.g., Parquet, ORC) and bring advanced features like time travel (see Apache Iceberg), schema evolution, and fine-grained access control, which were previously reserved for traditional Data Warehouses, into the Data Lake domain.

Ultimately, the choice depends on your workload characteristics, query latency requirements, data volume, and ecosystem alignment. But as the major cloud providers and vendors converge around the Lakehouse paradigm, it's becoming an increasingly attractive option for teams seeking agility, scalability, and cost control in modern data platforms.

Existing open table formats

There are different “table formats” for the Lakehouse architecture, such as Apache Iceberg and Delta Lake.

Apache Iceberg is an open table format designed for huge analytic datasets in distributed data processing systems like Apache Spark, Trino, Presto, Flink, and Hive. It was developed at Netflix and later donated to the Apache Software Foundation.

Delta Lake is an open-source storage layer that brings ACID transactions, scalable metadata handling, and unified streaming and batch processing to Apache Spark and other big data engines. Originally developed by Databricks, it is now part of the Linux Foundation.

💡
Because I’m a bit lazy sometimes, I asked ChatGPT the following: “You are an expert data engineer. Can you write a comparison summary of open table formats of both Apache Iceberg and Delta Lake? Use the same criteria while doing so”. The shortened output is found below. I tried to verify the claims as far as I could. If you find an error, please leave a comment.

Comparison

ACID Transactions

  • Iceberg:
    Fully supports ACID transactions using snapshot isolation. Transactions are handled via metadata manifests and atomic commits to the metadata file. Suitable for both streaming and batch operations.

  • Delta Lake:
    Implements ACID transactions via a write-ahead log (_delta_log). Also supports snapshot isolation, enabling reliable concurrent read/write operations in batch and streaming.

Schema Evolution

  • Iceberg:
    Offers flexible schema evolution and supports adding, renaming, reordering, and deleting columns. Field IDs are used to track schema changes safely.

  • Delta Lake:
    Supports additive schema evolution (e.g., adding columns). Reordering and renaming are partially supported and can be more restrictive than Iceberg. Dropping columns may require rewriting the underlying Parquet files.
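
To make the difference more tangible, here is a hedged sketch in Spark SQL (catalog, schema, and table names are made up; the Delta table properties shown are what enables rename/drop without a full rewrite):

-- Iceberg: columns are tracked by field ID, so these are metadata-only changes
ALTER TABLE iceberg_catalog.db.events ADD COLUMNS (country STRING);
ALTER TABLE iceberg_catalog.db.events RENAME COLUMN country TO country_code;
ALTER TABLE iceberg_catalog.db.events DROP COLUMN country_code;

-- Delta Lake: adding columns works out of the box; renaming or dropping
-- columns requires enabling column mapping first
ALTER TABLE delta_db.events SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5'
);
ALTER TABLE delta_db.events ADD COLUMNS (country STRING);
ALTER TABLE delta_db.events RENAME COLUMN country TO country_code;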

Partitioning

  • Iceberg:
    Uses hidden partitioning, abstracting physical layout from logical queries. Automatically rewrites queries for efficient partition pruning.

  • Delta Lake:
    Relies on explicit partition columns, which are visible to users. Partition pruning depends on query awareness of partition structure.
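
As an illustration (Spark SQL, table and catalog names made up): Iceberg partitions by a transform of a regular column, while Delta partitions by an explicit column that must exist in the schema and be used in filters:

-- Iceberg: hidden partitioning via a transform; queries simply filter on order_ts
CREATE TABLE iceberg_catalog.db.orders (
  order_id BIGINT,
  order_ts TIMESTAMP,
  amount   DOUBLE
) USING iceberg
PARTITIONED BY (days(order_ts));

-- Delta Lake: explicit partition column; pruning relies on filtering by order_date
CREATE TABLE delta_db.orders (
  order_id   BIGINT,
  order_ts   TIMESTAMP,
  order_date DATE,
  amount     DOUBLE
) USING delta
PARTITIONED BY (order_date);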

Time Travel

  • Iceberg:
    Native time travel support via snapshots. You can query historical versions based on snapshot ID or timestamp.

  • Delta Lake:
    Also supports time travel using version numbers or timestamps. Retains transaction logs that allow rollback and historical queries.
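
For example, the corresponding queries look roughly like this (Spark SQL; syntax varies slightly by engine, and the snapshot ID, version number, and timestamps are placeholders):

-- Iceberg: query by snapshot ID or timestamp
SELECT * FROM iceberg_catalog.db.orders FOR VERSION AS OF 4937968646814541176;
SELECT * FROM iceberg_catalog.db.orders FOR TIMESTAMP AS OF '2025-05-01 00:00:00';

-- Delta Lake: query by version number or timestamp
SELECT * FROM delta_db.orders VERSION AS OF 12;
SELECT * FROM delta_db.orders TIMESTAMP AS OF '2025-05-01';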

Performance and Metadata Scalability

  • Iceberg:
    Uses a metadata tree structure (manifests and manifest lists) to avoid file listing and support petabyte-scale datasets. More scalable with large numbers of files and partitions.

  • Delta Lake:
    Stores metadata in a log of JSON files (_delta_log), which can become large over time and may require maintenance (OPTIMIZE and VACUUM) for good performance.
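
For reference, the maintenance commands mentioned above look like this in Delta Lake's Spark SQL (table name made up; they operate on the data files, while the JSON log itself is periodically checkpointed):

-- Compact many small data files into fewer, larger ones
OPTIMIZE delta_db.orders;

-- Remove data files no longer referenced by the transaction log
-- (the default retention period is 7 days)
VACUUM delta_db.orders RETAIN 168 HOURS;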

Streaming Support

  • Iceberg:
    Has native support for streaming writes and reads, especially with Apache Flink and Spark Structured Streaming. Streaming write support is still evolving in some engines.

  • Delta Lake:
    Designed with stream-batch unification in mind. Seamless support for Structured Streaming in Spark.

File Format Support

  • Iceberg:
    Supports Parquet, Avro, and ORC. File format agnostic, with clean decoupling of metadata and data.

  • Delta Lake:
    Primarily based on Parquet. No support for other formats as of now.

Ecosystem and Engine Support

  • Iceberg:
    Integrates with Apache Spark, Flink, Trino, Presto, Hive, Dremio, Snowflake, and more. Supported by AWS Glue, Snowflake, and others.

  • Delta Lake:
    Strong integration with Databricks and Apache Spark; also supported in Presto, Trino, and Flink via connectors. Emerging ecosystem via Delta Kernel and delta-rs.

Operations and Tooling

  • Iceberg:
    Growing support for table maintenance (compaction, expiration, snapshot management). Still evolving CLI tools.

  • Delta Lake:
    Mature operational tooling via Databricks (e.g., OPTIMIZE, VACUUM, Delta Live Tables). Better UX for managed environments.

Summary

| Criteria | Apache Iceberg | Delta Lake |
| --- | --- | --- |
| ACID Transactions | ✅ Yes (snapshot-based) | ✅ Yes (log-based) |
| Schema Evolution | ✅ Flexible (field IDs, reorder, rename) | ⚠️ Limited (mostly additive) |
| Partitioning | ✅ Hidden, automatic pruning | ⚠️ Explicit, user-managed |
| Time Travel | ✅ Snapshots, timestamps | ✅ Versioned log, timestamps |
| Metadata Performance | ✅ Scalable (manifest-based) | ⚠️ Requires periodic compaction |
| Streaming Support | ⚠️ Evolving (good Flink/Spark support) | ✅ Mature (best with Spark) |
| File Format Support | ✅ Parquet, Avro, ORC | ⚠️ Parquet only |
| Engine/Ecosystem Support | ✅ Broad, vendor-neutral | ⚠️ Strong Spark/Databricks focus |
| Operational Tooling | ⚠️ Growing | ✅ Advanced in Databricks |

The new kid on the block: DuckLake

DuckLake was introduced on 2025-05-27 by the founders of DuckDB. There’s an interesting podcast where Mark & Hannes explain more about the motivation and goals behind it; watching it is strongly recommended! They also wrote a manifesto: SQL as a Lakehouse format.

Its key features are:

  • Supports snapshots, time travel queries, schema evolution and partitioning

  • Can have as many lightweight snapshots as you want without frequent compacting steps

  • Allows concurrent access with ACID transactional guarantees over multi-table operations

  • Uses statistics for filter pushdown, enabling fast queries even on large datasets

There’s also a nice introduction and overview in Jordan Tigani’s post on the MotherDuck blog: “A Duck walks into a lake”

The big difference compared to Apache Iceberg and Delta Lake is that DuckLake uses a database as the backing store for its metadata, instead of keeping it in a complex set of JSON and/or Avro files on object storage, as the other two do, only to eventually put a data catalog on top of that.

DuckLake stores data in Parquet and metadata in relational tables, plain and simple. This has many benefits, e.g. faster lookups (a single SQL query vs. a potential cascade of S3 HEAD/GET requests), and the catalog is already built in.

According to the DuckLake website, the catalog database should be chosen based on the following criteria (a sketch of the corresponding ATTACH statements follows the list):

  • If you would like to perform local data warehousing with a single client, use DuckDB as the catalog database

  • If you would like to perform local data warehousing using multiple local clients, use SQLite as the catalog database

  • If you would like to operate a multi-user Lakehouse with potentially remote clients, choose a transactional client-server database system as the catalog database: MySQL or PostgreSQL
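
For illustration, a minimal sketch of what attaching DuckLake looks like in DuckDB SQL (bucket, database names, and credentials are placeholders; see the DuckLake docs for the exact connection options):

INSTALL ducklake;
INSTALL postgres;  -- only needed for the PostgreSQL-backed variant

-- Single local client: a DuckDB file acts as the catalog database
ATTACH 'ducklake:my_catalog.ducklake' AS my_lake (DATA_PATH 's3://my-bucket/lake/');

-- Multi-user setup: PostgreSQL (e.g. a serverless instance) as the catalog database
ATTACH 'ducklake:postgres:dbname=my_db host=my-host user=my-user password=my-password' AS my_lake
  (DATA_PATH 's3://my-bucket/lake/');

-- Tables then behave like regular DuckDB tables, including time travel
CREATE TABLE my_lake.events AS SELECT 42 AS id, now() AS created_at;
SELECT * FROM my_lake.events AT (VERSION => 1);  -- time travel to snapshot version 1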

Even though it’s a very early version that is still missing some functionality, it already looks very useful from a (metadata) complexity and performance perspective.

The relational database required for multi-user scenarios adds some complexity, but there are SaaS solutions on the market that reduce or even eliminate the operational burden of managing relational databases. The guiding principle is still the separation of storage and compute. More on that in the next chapter.

Is Big Data dead?

Coming back to the article’s title, where’s the promised $10/month Lakehouse at? And why $10/month?

In his article “Big Data is dead”, Jordan Tigani shared some interesting insights from his days at Google’s BigQuery. One was that the 90th percentile of all queries scanned only about 100MB of data, and the 99th percentile no more than 10GB.

Another one was that most people don’t actually have that much data, while only a few have real Big Data: “Among customers who were using the service heavily, the median data storage size was much less than 100 GB”.

Amazon released a dataset about Redshift usage and query sizes, which Jordan analyzed in another blog post that supports his original article.

So, what’s the relation to this blog post?

Basically, it validates that most people/projects/companies neither have “real” Big Data, nor run queries that require actual “real” Big Data tools that are distributed, horizontally scaling, and so on.

Many of them, though, have established systems that are far too complex and/or far too expensive for their actual needs.

The $10/month Lakehouse

Requirements

If we apply the “separation of storage and compute” paradigm, we need to find a supplier for the (object) storage part, and one for the compute. For the storage, we want to use an S3-compatible API for flexibility reasons.

We want to leverage the new DuckLake format, and we eventually want multiple reader and writer processes. This means we not only have to find a compute provider, but also a provider for a relational database for the DuckLake metadata, which enables the desired multi-user capabilities.

Furthermore, we’d like to spend as little time on managing actual infrastructure as possible. This leads us to using serverless services for both compute/querying, and metadata storage.

Going forward, we assume the following:

  • We have 250GB of data we want to analyze (as monthly average)

  • We would like to query the remote data via HTTPS endpoints, as well as from our local machines

  • We want the whole Lakehouse to run in “auto-pilot mode”, meaning we don’t want to manage the underlying infrastructure

  • We want SQL to be our primary/sole interface to the data, so we want to use the new DuckLake, based on DuckDB

A suggested solution

Storage

For storage, there are many providers on the market, be it the big public cloud providers like AWS, Azure, Cloudflare or Google Cloud, or others, like Hetzner, or, more specialized, Backblaze.

If we want to optimize for costs, Cloudflare R2 comes into focus, mainly because of its free egress, a generous free tier, and generally cheap standard storage costs of $0.015 / GB-month.

R2 Free Tier

Class A operations are basically write or list operations, while Class B operations are mainly read operations, which will occur much more often than write operations in a Lakehouse setting.

R2 Pricing after Free Tier

The calculated cost per month, based on our requirements for the storage part would be the following:

  • Storage: (250 GB-month - 10 GB-month free tier) × $0.015 / GB-month = $3.60

  • Write requests: We assume that the 1 million free Class A requests / month are sufficient: $0

  • Read requests: We assume that the 10 million free Class B requests / month are sufficient: $0

  • Egress is free, $0

Overall costs for storage for a month: $3.60

Compute

When wanting to run DuckDB in a serverless fashion, a few options come to mind. E.g. we could use AWS services such as Fargate, running DuckDB in a container, or AWS Lambda, running it in a serverless function. Google Cloud has Cloud Run, Azure has Container Instances.

Cloudflare will launch its Containers service in June 2025, with a pricing model of $0.072 per vCPU-hour and $0.009 per GB-hour. This is not necessarily cheaper than the other providers, but it comes with 1TB / month of free egress from the containers. Because we’ll already be using R2, we’ll use it anyway, despite maybe being able to save $1 with other providers, based on our assumed usage (see below).

The monthly costs, based on some assumptions (containers use 4GB of memory and 2 vCPUs when running, meaning $0.18 / hour / container), would be:

  • 25,000 queries with an average runtime of 3 s: $0.18 / hour × (25,000 × 3 / 3,600) hours = $3.75

  • The container will spin down after 1 minute of inactivity, so we add another $2 to accommodate this behavior

Overall costs for compute for a month: $5.75

Metadata storage

Because we’d like to use DuckLake in multi-user/writer mode, we’ll need either a MySQL or Postgres compatible database to store the metadata. There are many options on the market, like Supabase or Neon, which both provide serverless Postgres services.

We’ll choose Neon because it has the more generous free tier: up to 0.5 GB of database storage, auto-scaling of up to 2 vCPUs and 8 GB of memory, and 190 compute-hours included. This should also fit our assumed compute usage as outlined above.

Overall costs for metadata storage for a month: $0

Cumulative costs

From the costs we derived for each part of the stack, we can calculate the cumulative costs of the stack per month: storage $3.60, compute $5.75, and metadata storage $0. This means overall monthly costs of $9.35!
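
Since SQL is our interface anyway, the back-of-the-envelope calculation can be reproduced as a plain DuckDB query (the numbers are the assumptions from above):

SELECT
  (250 - 10) * 0.015                                    AS storage_usd,   -- 240 GB-month at $0.015 after the free tier
  0.18 * (25000 * 3 / 3600.0) + 2                       AS compute_usd,   -- ~20.8 container-hours plus the idle buffer
  0.0                                                   AS metadata_usd,  -- Neon free tier
  (250 - 10) * 0.015 + 0.18 * (25000 * 3 / 3600.0) + 2  AS total_usd;
-- storage_usd = 3.60, compute_usd = 5.75, total_usd = 9.35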

Deploy a Serverless DuckLake

💡
The Cloudflare Containers service is currently in closed beta, but I got access a few weeks ago. It will GA in June 2025, as stated in the introduction article.

GitHub repo

The repository with the code can be found at tobilg/cloudflare-ducklake:

To get started, clone the code to your local machine:

git clone git@github.com:tobilg/cloudflare-ducklake.git && cd cloudflare-ducklake

Preconditions

To deploy the project, the following preconditions need to be present on the machine you want to use:

Installing dependencies

To install the dependencies, please run

npm i

Build image locally

Before being able to build the Docker image locally, you have to download the DuckDB extensions we'd like to package into the image, so that they don't need to be downloaded on each container start:

scripts/download_extensions.sh

Once this is done, you can run the following to build the image locally:

npm run build:docker

Run the image locally

To run the newly built image locally, run

npm run dev:docker

To query the DuckDB API within the locally running container, use

curl --location 'http://localhost:8080/query' \
--header 'Content-Type: application/json' \
--data '{
  "query": "SELECT * FROM '\''https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet'\'' LIMIT 1000"
}'
💡
Currently it's not possible to use wrangler dev during local development. I guess this will eventually change once Containers become GA.

DuckDB API

DuckDB is exposed as a Hono.js-based API that offers a few endpoints:

  • GET /: Will show a JSON welcome message

  • GET /_health: Enables potential container health checking (currently not used)

  • POST /query: Takes an application/json body with a query property that contains the (encoded) SQL query. Returns the query result as application/json as well (see example above)

Securing the API

You can generate a unique API token, e.g. with a tool like Strong Password Generator, and deploy it as a Workers secret named API_TOKEN with the script below. Once the secret is present, it will automatically be used to secure the /query endpoint:

scripts/setup_api_secrets.sh

If you want to test this locally, too, you need to create a .dev.vars file in the root directory of this project. This file will be appended to later for R2 Data Catalog & DuckLake usage, if you want to try these before deployment as well.

Creating an R2 bucket

If you want to deploy either the DuckLake or the Apache Iceberg integration via the R2 Data Catalog, you have to set up an R2 bucket first. To create a new R2 bucket, you can run the following:

scripts/setup_r2.sh YOUR-BUCKET-NAME

where YOUR-BUCKET-NAME is your desired name for the bucket. It will automatically add the respective environment variables to .dev.vars. You'll also need the bucket name later for setting up the DuckLake secrets.

If you'd want to set a location hint or a jurisdiction, please edit the script accordingly before running it.

Setting up a catalog database

As described above, we chose Neon’s free tier to act as the metadata storage provider. Once you’ve signed up, you can choose the project name, cloud provider, and region:

Neon configuration

After you click on "Create project", you are taken directly to your Dashboard. The next step is to get the connection details. To do so, click on the "Connect" button in the upper-right corner:

Neon parameters

Note the following connection parameters, and their equivalent environment variable/secret names (and add them to the .dev.vars file):

  • The user (POSTGRES_USER)

  • The password (POSTGRES_PASSWORD)

  • The hostname (POSTGRES_HOST)

  • The database name (POSTGRES_DB)

You can also create another user and use that one if you don't want to use the main instance user.

Getting an R2 Access Token

Please follow the instructions in the R2 docs on how to create an API token.

You'll need to store this token in a secure location (and eventually in .dev.vars if you want to use the R2 Data Catalog when running the Docker image locally), as you'll need it later for the R2 Data Catalog deployment.

Also, please note the S3 API's Access Key (R2_ACCESS_KEY_ID) and Secret Key (R2_SECRET_ACCESS_KEY), as well as your Cloudflare account ID (R2_ACCOUNT_ID); you'll need them in the next step. Put them in .dev.vars as well if you want to run it locally.

Create secrets for DuckLake

Running the following script will create eight new Workers secrets needed for deployment:

scripts/setup_ducklake_secrets.sh

Deployment

If you followed the above steps, you can now run the actual deployment. This will create a Worker and a Durable Object, build the Docker image locally, and then upload it to the Cloudflare container registry.

💡
You need access to the Cloudflare Containers beta to be able to deploy it to Cloudflare!

npm run deploy

Running queries

Replace the WORKERS_URL with the real URL, and the API_TOKEN with your real API token:

curl --location 'https://WORKERS_URL/query' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer API_TOKEN' \
--data '{
  "query": "CREATE TABLE ducklake.orders AS SELECT * FROM '\''https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet'\''"
}'

The response should look like this:

[
  {"Count":"15000"}
]

Get the first row of the new table:

curl --location 'https://WORKERS_URL/query' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer API_TOKEN' \
--data '{
  "query": "SELECT * FROM ducklake.orders LIMIT 1"
}'

The response should look like this:

[
  {
    "o_orderkey": 1,
    "o_custkey": 370,
    "o_orderstatus": "O",
    "o_totalprice": 172799.49,
    "o_orderdate": {
      "days": 9497
    },
    "o_orderpriority": "5-LOW",
    "o_clerk": "Clerk#000000951",
    "o_shippriority": 0,
    "o_comment": "nstructions sleep furiously among "
  }
]

Deployment with the additional R2 Data Catalog / Iceberg integration

With the v1.3.0 release of DuckDB, it became possible to connect to the R2 Data Catalog. This means that you can now also read Iceberg data from R2's Object Storage directly from a SQL statement issued by DuckDB.
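
Conceptually, the connection the container sets up looks roughly like the following DuckDB SQL (token, warehouse name, and catalog URI are placeholders for the values gathered via the scripts below; the exact statements are documented in the R2 Data Catalog docs):

INSTALL iceberg;
LOAD iceberg;

-- R2 API token with access to the R2 Data Catalog
CREATE SECRET r2_catalog_secret (
  TYPE iceberg,
  TOKEN 'YOUR_R2_TOKEN'
);

-- Attach the R2 Data Catalog (an Iceberg REST catalog) under a local name
ATTACH 'YOUR_WAREHOUSE_NAME' AS r2lake (
  TYPE iceberg,
  ENDPOINT 'YOUR_CATALOG_URI'
);

SELECT * FROM r2lake.default.people LIMIT 10;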

Enable R2 Data Catalog for existing R2 bucket

You can enable Apache Iceberg support via the R2 Data Catalog by running the following command:

scripts/setup_r2_data_catalog.sh YOUR-BUCKET-NAME

where YOUR-BUCKET-NAME is the name for the bucket you used in the R2 bucket creation step before.

Getting the R2 Data Catalog information

The information about the R2 Data Catalog URI (R2_ENDPOINT secret) and warehouse name (R2_CATALOG secret) can be gathered by running:

scripts/get_catalog_info.sh YOUR-BUCKET-NAME

where YOUR-BUCKET-NAME is the name for the bucket you used in the R2 bucket creation step before.

Please also store this information, because you'll need it in the next step. Also, please add the three variables to .dev.vars if you want to use the R2 Data Catalog when running the Docker image locally.

Creating secrets for R2 Data Catalog

To create the necessary Workers secrets, run:

scripts/create_r2_data_catalog_secrets.sh

and copy & paste the respective values you noted in the last two steps.

Writing Iceberg example data

As we need some example data to test the new Iceberg capabilities, we'll create this data manually with a Python script.

For convenience, we'll use Marimo for this. It requires a working Python installation on your machine.

Setup Marimo

To set up Marimo, run the following npm task:

npm run iceberg:setup

This will create a new directory and install Marimo and some dependencies in a virtual environment.

Create the Iceberg example data

To create the Iceberg example data, run the following npm task:

npm run iceberg:create

This will start Marimo and load the respective Python script. You'll need to edit the variables WAREHOUSE, CATALOG_URI and TOKEN with the values gathered in the last steps. This is also described in the R2 Data Catalog docs.

After you've done that, you can run the Python cells in the Marimo notebook and should end up with some data created in the R2 bucket. You can check in the Cloudflare Dashboard, or via wrangler.

Running DuckDB with Iceberg support

Once you've created all of the secrets as outlined above, the application will automatically create a DuckDB secret for accessing the R2 Data Catalog and attach the catalog under the name r2lake when it starts. The example table created in the last step is called people and lives in the default schema.

Deployment

If you followed the above steps, you can now run the deployment. This will update the existing deployment if you already deployed the DuckLake configuration.

Hint: You need access to the Cloudflare Containers beta to be able to deploy it to Cloudflare!

npm run deploy

The wrangler deployment output shows the workers URL where the service can be reached. Note it down if you want to run test queries (WORKERS_URL in the step below).

Running queries

Replace the WORKERS_URL with the real URL, and the API_TOKEN with your real API token:

curl --location 'https://WORKERS_URL/query' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer API_TOKEN' \
--data '{
  "query": "SELECT * FROM r2lake.default.people"
}'

This should return the response below:

[
  {
    "id": "1",
    "name": "Alice",
    "score": 80
  },
  {
    "id": "2",
    "name": "Bob",
    "score": 92.5
  },
  {
    "id": "3",
    "name": "Charlie",
    "score": 88
  }
]

Summary

We’ve demonstrated that it’s possible to build (and deploy) a Lakehouse infrastructure based on DuckLake and DuckDB that can stay below $10 / month while running completely serverless. This can be a suitable way to run a flexible Lakehouse for many smaller projects or teams.

There’s no need anymore to build up complex and costly infrastructure projects on big public cloud providers.

Let me know if you have any feedback in the comments!
