Welcome to the age of $10/month Lakehouses

Table of contents
- Recap: Data Warehouses, Data Lakes, Lakehouses?
- Existing open table formats
- The new kid on the block: DuckLake
- Is Big Data dead?
- The $10/month Lakehouse
- Deploy a Serverless DuckLake
- GitHub repo
- Preconditions
- Installing dependencies
- Build image locally
- Run the image locally
- DuckDB API
- Securing the API
- Creating a R2 bucket
- Setting up a catalog database
- Getting a R2 Access Token
- Create secrets for DuckLake
- Deployment
- Running queries
- Deployment with the additional R2 Data Catalog / Iceberg integration
- Summary

Recap: Data Warehouses, Data Lakes, Lakehouses?
As a short recap, what do these mean, and how are they differentiated?
Modern Data Warehouses, like Amazon Redshift, Google BigQuery, and Snowflake, offer fast, SQL-optimized performance for structured data and BI workloads. Their columnar storage, advanced indexing, and automatic optimization make them ideal for analytics. However, they come at a premium: storage and compute are tightly coupled, and costs can rise sharply with large-scale, always-on workloads or frequent queries over massive datasets.
Data Lakes, typically built on Amazon S3, Azure Data Lake Storage (ADLS), Google Cloud Storage (GCS), or Cloudflare R2, provide a low-cost solution for storing raw structured and unstructured data. They scale effortlessly and are the foundation of many batch and streaming pipelines. While storage is cheap, performance suffers without significant investment in tooling, such as Apache Spark, Presto, or Hive, to support transformation and query execution. Data governance and consistency also become challenges as systems scale.
The Lakehouse architecture, popularized by platforms like Databricks or Snowflake with table formats like Delta Lake and Apache Iceberg, offers a compelling middle ground. Lakehouses combine the low-cost, schema-flexible storage of data lakes with the performance and ACID transaction support of warehouses. Technologies like Databricks Lakehouse Platform, Snowflake’s Unistore, or Dremio provide unified data management, allowing teams to run SQL analytics, data science, and streaming workloads directly on cloud object storage, without needing separate ETL into a warehouse.
From a cost-performance perspective, Lakehouses reduce redundancy and simplify infrastructure by eliminating the need to maintain both a lake and a warehouse. They support open formats (e.g., Parquet, ORC) and bring advanced features like time travel (see Apache Iceberg), schema evolution, and fine-grained access control, which previously were reserved for traditional Data Warehouses, into the Data Lake domain.
Ultimately, the choice depends on your workload characteristics, query latency requirements, data volume, and ecosystem alignment. But as the major cloud providers and vendors converge around the Lakehouse paradigm, it's becoming an increasingly attractive option for teams seeking agility, scalability, and cost control in modern data platforms.
Existing open table formats
There are different “table formats” for the Lakehouse architecture, such as Apache Iceberg and Delta Lake.
Apache Iceberg is an open table format designed for huge analytic datasets in distributed data processing systems like Apache Spark, Trino, Presto, Flink, and Hive. It was developed at Netflix and later donated to the Apache Software Foundation.
Delta Lake is an open-source storage layer that brings ACID transactions, scalable metadata handling, and unified streaming and batch processing to Apache Spark and other big data engines. Originally developed by Databricks, it is now part of the Linux Foundation.
Comparison
ACID Transactions
Iceberg: Fully supports ACID transactions using snapshot isolation. Transactions are handled via metadata manifests and atomic commits to the metadata file. Suitable for both streaming and batch operations.
Delta Lake: Implements ACID transactions via a write-ahead log (`_delta_log`). Also supports snapshot isolation, enabling reliable concurrent read/write operations in batch and streaming.
Schema Evolution
Iceberg: Offers flexible schema evolution and supports adding, renaming, reordering, and deleting columns. Field IDs are used to track schema changes safely.
Delta Lake: Supports additive schema evolution (e.g., adding columns). Reordering and renaming are partially supported and can be more restrictive than Iceberg. Dropping columns may require rewriting the underlying Parquet files.
Partitioning
Iceberg: Uses hidden partitioning, abstracting the physical layout from logical queries. Automatically rewrites queries for efficient partition pruning.
Delta Lake: Relies on explicit partition columns, which are visible to users. Partition pruning depends on query awareness of the partition structure.
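To make the difference concrete, here is a small, hedged sketch in Spark SQL (table and column names are made up, and the exact DDL depends on the engine and catalog setup):

```sql
-- Iceberg: hidden partitioning via a transform on an existing column;
-- queries simply filter on event_ts, and partition pruning is applied automatically.
CREATE TABLE db.events_iceberg (id BIGINT, event_ts TIMESTAMP, payload STRING)
USING iceberg
PARTITIONED BY (days(event_ts));

-- Delta Lake: explicit partition column that is part of the table schema;
-- pruning relies on queries filtering on event_date directly.
CREATE TABLE db.events_delta (id BIGINT, event_ts TIMESTAMP, event_date DATE, payload STRING)
USING delta
PARTITIONED BY (event_date);
```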
Time Travel
Iceberg: Native time travel support via snapshots. You can query historical versions based on snapshot ID or timestamp.
Delta Lake: Also supports time travel using version numbers or timestamps. Retains transaction logs that allow rollback and historical queries.
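As an illustration, both formats can be queried as of a version/snapshot or a timestamp. Here is a hedged Spark SQL sketch (the exact syntax varies slightly per engine and version):

```sql
-- Query an earlier table state by version number (Delta) or snapshot (Iceberg)
SELECT * FROM db.orders VERSION AS OF 42;

-- Query the table state as of a point in time
SELECT * FROM db.orders TIMESTAMP AS OF '2025-06-01 00:00:00';
```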
Performance and Metadata Scalability
Iceberg: Uses a metadata tree structure (manifests and manifest lists) to avoid file listing and support petabyte-scale datasets. More scalable with large numbers of files and partitions.
Delta Lake: Stores metadata in a log of JSON files, which can become large over time and may require compaction (`OPTIMIZE` and `VACUUM`) for performance.
Streaming Support
Iceberg: Has native support for streaming writes and reads, especially with Apache Flink and Spark Structured Streaming. Streaming write support is still evolving in some engines.
Delta Lake: Designed with stream-batch unification in mind. Seamless support for Structured Streaming in Spark.
File Format Support
Iceberg: Supports Parquet, Avro, and ORC. File-format agnostic, with clean decoupling of metadata and data.
Delta Lake: Primarily based on Parquet. No support for other formats as of now.
Ecosystem and Engine Support
Iceberg: Integrates with Apache Spark, Flink, Trino, Presto, Hive, Dremio, Snowflake, and more. Supported by AWS Glue, Snowflake, and others.
Delta Lake: Strong integration with Databricks and Apache Spark, and supported in Presto, Trino, and Flink (via Delta-RS). Emerging ecosystem via Delta Kernel and Delta-RS.
Operations and Tooling
Iceberg: Growing support for table maintenance (compaction, expiration, snapshot management). Still evolving CLI tools.
Delta Lake: Mature operational tooling via Databricks (e.g., `OPTIMIZE`, `VACUUM`, Delta Live Tables). Better UX for managed environments.
Summary
| Criteria | Apache Iceberg | Delta Lake |
| --- | --- | --- |
| ACID Transactions | ✅ Yes (snapshot-based) | ✅ Yes (log-based) |
| Schema Evolution | ✅ Flexible (field IDs, reorder, rename) | ⚠️ Limited (mostly additive) |
| Partitioning | ✅ Hidden, automatic pruning | ⚠️ Explicit, user-managed |
| Time Travel | ✅ Snapshots, timestamps | ✅ Versioned log, timestamps |
| Metadata Performance | ✅ Scalable (manifest-based) | ⚠️ Requires periodic compaction |
| Streaming Support | ⚠️ Evolving (good Flink/Spark support) | ✅ Mature (best with Spark) |
| File Format Support | ✅ Parquet, Avro, ORC | ⚠️ Parquet only |
| Engine/Ecosystem Support | ✅ Broad, vendor-neutral | ⚠️ Strong Spark/Databricks focus |
| Operational Tooling | ⚠️ Growing | ✅ Advanced in Databricks |
The new kid on the block: DuckLake
DuckLake was introduced on 2025-05-27 by the founders of DuckDB. There's an interesting podcast where Mark and Hannes explain more about the motivation and goals behind it; watching it is strongly recommended! They also wrote a manifesto: SQL as a Lakehouse format.
Its key features are:
Supports snapshots, time travel queries, schema evolution and partitioning
Can have as many lightweight snapshots as you want without frequent compacting steps
Allows concurrent access with ACID transactional guarantees over multi-table operations
Uses statistics for filter pushdown, enabling fast queries even on large datasets
There’s also a nice introduction and overview in Jordan Tigani’s blog post in the MotherDuck blog: “A Duck walks into a lake”
The big difference compared to Apache Iceberg and Delta Lake is that DuckLake uses a database as the backing store for the metadata, instead of keeping it in complex JSON and/or Avro files on object storage, as both of the others do, only to eventually put a data catalog on top of that.
DuckLake stores data in Parquet and metadata in relational tables, plain and simple. This has many benefits, e.g. faster lookups (a single SQL query vs. an eventual cascade of S3 HEAD/GET requests), and the catalog is already built in.
The catalog database should be chosen based on the following criteria (according to the DuckLake website); a short usage sketch follows the list:
If you would like to perform local data warehousing with a single client, use DuckDB as the catalog database
If you would like to perform local data warehousing using multiple local clients, use SQLite as the catalog database
If you would like to operate a multi-user Lakehouse with potentially remote clients, choose a transactional client-server database system as the catalog database: MySQL or PostgreSQL
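To give an impression of what this looks like in practice, here is a minimal sketch of the single-client variant, based on the documented DuckLake extension syntax (the file, path, and table names are just examples):

```sql
-- Load the DuckLake extension in DuckDB
INSTALL ducklake;
LOAD ducklake;

-- Single client, local data warehousing: a DuckDB file acts as the catalog database,
-- and data files are written as Parquet below DATA_PATH
ATTACH 'ducklake:my_lake.ducklake' AS my_lake (DATA_PATH 'data_files/');
USE my_lake;

-- Tables behave like regular SQL tables; the metadata lands in the catalog database
CREATE TABLE orders AS SELECT * FROM 'orders.parquet';

-- Time travel against an earlier snapshot
SELECT count(*) FROM orders AT (VERSION => 1);
```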
Even though it's a very early version that still misses some functionality, it already looks very useful from a (metadata) complexity and performance perspective.
The relational database required for multi-user scenarios adds some complexity, but there are SaaS solutions on the market that reduce, or practically remove, the operational burden of managing relational databases. The leading principle is still the separation of storage and compute. More on that in the next chapter.
Is Big Data dead?
Coming back to the article's title: where's the promised $10/month Lakehouse, and why $10/month?
In his article "Big Data is dead", Jordan Tigani shared some interesting insights from his days at Google's BigQuery. One was that the 90th percentile of all queries processed only around 100 MB of data, and the 99th percentile no more than 10 GB.
Another one was that most people don't actually have that much data, while some do have real Big Data: "Among customers who were using the service heavily, the median data storage size was much less than 100 GB".
Amazon released a dataset about Redshift usage and query sizes, which Jordan analyzed in another blog post that supports his original article.
So, what's the relation to this blog post?
Basically, it validates that most people, projects, and companies neither have "real" Big Data, nor run queries that require actual "real" Big Data tools that are distributed, horizontally scaling, and so on.
Many of them have nonetheless established systems that are much too complex and/or much too expensive for their actual needs.
The $10/month Lakehouse
Requirements
If we apply the "separation of storage and compute" paradigm, we need to find a supplier for the (object) storage part, and one for the compute. For the storage, we want an S3-compatible API for flexibility reasons.
As we want to leverage the new DuckLake format and eventually have multiple reader and writer processes, we do not "only" have to search for a compute provider, but also for a provider of a relational database for the DuckLake metadata, which enables the desired multi-user capabilities.
Furthermore, we’d like to spend as little time on managing actual infrastructure as possible. This leads us to using serverless services for both compute/querying, and metadata storage.
Going forward, we assume the following:
We have 250GB of data we want to analyze (as monthly average)
We would like to query the remote data via HTTPS endpoints, as well as from our local machines
We want the whole Lakehouse to run in “auto-pilot mode”, meaning we don’t want to manage the underlying infrastructure
We want SQL to be our primary/sole interface to the data, so we want to use the new DuckLake, based on DuckDB
A suggested solution
Storage
For storage, there are many providers on the market, be it the big public cloud providers like AWS, Azure, Cloudflare or Google Cloud, or others, like Hetzner, or, more specialized, Backblaze.
If we want to optimize on costs, Cloudflare R2 comes into focus, mainly because of its free egress, a generous free tier, and generally cheap standard storage costs of $0.015/GB-month.
R2 Free Tier
Class A operations are basically write or list operations; Class B operations are mainly read operations, which will occur much more often than write operations in a Lakehouse setting.
R2 Pricing after Free Tier
The calculated cost per month, based on our requirements for the storage part would be the following:
Storage: (250 GB-month - 10 GB-month free tier) × $0.015/GB-month = $3.60
Write requests: we assume that the 1 million free Class A requests per month are sufficient, $0
Read requests: we assume that the 10 million free Class B requests per month are sufficient, $0
Egress is free, $0
Overall costs for storage for a month: $3.60
Compute
When wanting to run DuckDB in a serverless fashion, a few options come to mind. For example, we could use AWS services such as Fargate, running DuckDB in a container, or AWS Lambda, running it in a serverless function. Google Cloud has Cloud Run, and Azure has Container Instances.
Cloudflare will launch its Containers service in June 2025, which has a pricing model of $0.072 per vCPU-hour and $0.009 per GB-hour. This is not necessarily cheaper than the other providers, but it comes with 1 TB/month of free egress from the containers. Because we'll already be using R2, we'll use it anyway, even though we might save maybe $1 with other providers, based on our assumed usage (see below).
The monthly costs, based on some assumptions (containers use 4 GB and 2 vCPUs when running, meaning $0.18/hour/container), would be:
25,000 queries with an average runtime of 3s: $0.18 × (25,000 × 3 / 3,600) = $3.75
The container will spin down after 1 minute of inactivity, so we add another $2 to accommodate this behavior
Overall costs for compute for a month: $5.75
Metadata storage
Because we'd like to use DuckLake in multi-user/writer mode, we'll need either a MySQL- or Postgres-compatible database to store the metadata. There are many options on the market, like Supabase or Neon, which both provide serverless Postgres services.
We'll choose Neon, because it has a more generous free tier, offering up to 0.5 GB of database storage and auto-scaling of up to 2 vCPUs and 8 GB of memory, with 190 compute-hours included. This should also fit our assumed compute usage as outlined above.
Overall costs for metadata storage for a month: $0
Cumulative costs
From the costs we derived for each part of the stack, we can calculate the cumulative costs of the stack per month: storage $3.60, compute $5.75, and metadata storage $0. This means overall monthly costs of $9.35!
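Staying in the spirit of SQL as the interface, the same arithmetic can be written as a small SQL query (DuckDB syntax), using only the prices and usage assumptions from above:

```sql
-- Monthly cost estimate, derived from the assumptions above
SELECT
  round((250 - 10) * 0.015, 2)                    AS storage_usd,   -- R2: (250 GB - 10 GB free) * $0.015/GB-month = 3.60
  round(0.18 * (25000 * 3 / 3600.0), 2) + 2.0     AS compute_usd,   -- Containers: $0.18/h * query hours + ~$2 idle spin-down = 5.75
  0.0                                             AS metadata_usd,  -- Neon free tier
  round((250 - 10) * 0.015
        + 0.18 * (25000 * 3 / 3600.0) + 2.0, 2)   AS total_usd;     -- = 9.35
```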
Deploy a Serverless DuckLake
GitHub repo
The repository with the code can be found at tobilg/cloudflare-ducklake:
Once you've cloned the code to your local machine, you can start with the deployment:
git clone git@github.com:tobilg/cloudflare-ducklake.git && cd cloudflare-ducklake
Preconditions
To deploy the project, you need the following preconditions to be present on the machine you want to use:
Installing dependencies
To install the dependencies, please run
npm i
Build image locally
Before being able to build the Docker image locally, you have to download the DuckDB extensions we'd like to package into the image, so that they don't need to be downloaded on each container start:
scripts/download_extensions.sh
Once this is done, you can run the following to build the image locally:
npm run build:docker
Run the image locally
To run the newly built image locally, run
npm run dev:docker
To query the DuckDB API within the locally running container, use
curl --location 'http://localhost:8080/query' \
--header 'Content-Type: application/json' \
--data '{
"query": "SELECT * FROM '\''https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet'\'' LIMIT 1000"
}'
Hint: This Docker-based workflow is used instead of `wrangler dev` during local development. I guess this will eventually change once Containers become GA.
DuckDB API
DuckDB is exposed as a Hono.js-based API, which offers a few endpoints:
- `GET /`: Shows a JSON welcome message
- `GET /_health`: Enables potential container health checking (currently not used)
- `POST /query`: Takes an `application/json` object body with a `query` property that contains the (encoded) SQL query. Returns the query result as `application/json` as well (see example above)
Securing the API
You can generate a unique API token, e.g. with a tool like Strong Password Generator, and deploy a Workers secret named `API_TOKEN` with the script below; once it's present, it will automatically be used to secure the `/query` endpoint:
scripts/setup_api_secrets.sh
If you want to test this locally, too, you need to create a `.dev.vars` file in the root directory of this project. It will be appended to for R2 Data Catalog & DuckLake usage later, if you want to try these before deployment as well.
Creating a R2 bucket
If you want to deploy either the DuckLake or the Apache Iceberg integration via the R2 Data Catalog, you have to set up an R2 bucket first. To create a new R2 bucket, you can run the following:
scripts/setup_r2.sh YOUR-BUCKET-NAME
where `YOUR-BUCKET-NAME` is your desired name for the bucket. The script will automatically add the respective environment variables to `.dev.vars`. You'll also need the bucket name later for setting up the DuckLake secrets.
If you want to set a location hint or a jurisdiction, please edit the script accordingly before running it.
Setting up a catalog database
As described above, we chose Neon's free tier to act as the metadata storage provider. Once you've signed up, you can choose the project name, cloud provider, and region:
After you click on "Create project", you're taken directly to your Dashboard. The next step is to get the connection details. To do so, click the "Connect" button in the upper-right corner:
Note the following connection parameters and their equivalent environment variable/secret names (and add them to the `.dev.vars` file):
- The user (`POSTGRES_USER`)
- The password (`POSTGRES_PASSWORD`)
- The hostname (`POSTGRES_HOST`)
- The database name (`POSTGRES_DB`)
You can also create another user and use that one if you don't want to use the main instance user. A sketch of how these connection parameters are then used by DuckLake follows below.
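For reference, these parameters end up in a DuckLake `ATTACH` statement, roughly like the sketch below. This is an assumption about what the application does internally, based on the documented DuckLake syntax; the uppercase placeholders correspond to the secrets above and the bucket created earlier.

```sql
-- Assumed shape of the attach the container performs on startup (placeholders, not real values)
INSTALL ducklake; LOAD ducklake;
INSTALL postgres; LOAD postgres;

ATTACH 'ducklake:postgres:dbname=POSTGRES_DB host=POSTGRES_HOST user=POSTGRES_USER password=POSTGRES_PASSWORD'
  AS ducklake (DATA_PATH 'r2://YOUR-BUCKET-NAME/');
-- r2:// data paths require an R2-type DuckDB secret; see the sketch in the R2 Access Token step
```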
Getting a R2 Access Token
Please follow the instructions in the R2 docs on how to create an API token.
You'll need to store this token in a secure location (and also in `.dev.vars` if you want to use the R2 Data Catalog when running the Docker image locally), as you'll need it later for the R2 Data Catalog deployment.
Also, please note the S3 API's Access Key (`R2_ACCESS_KEY_ID`) and Secret Key (`R2_SECRET_ACCESS_KEY`), as well as your Cloudflare account ID (`R2_ACCOUNT_ID`); you'll need them in the next step. Put them in `.dev.vars` as well if you want to run it locally.
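Under the hood, DuckDB needs exactly these three values to access R2 via its S3-compatible API. Presumably the application creates an R2-type DuckDB secret roughly like the following sketch (this mirrors DuckDB's documented secrets manager syntax, not necessarily the repo's actual code):

```sql
-- DuckDB secret for S3-compatible access to the R2 bucket (placeholders, not real values)
CREATE SECRET r2_storage (
    TYPE R2,
    KEY_ID 'R2_ACCESS_KEY_ID',
    SECRET 'R2_SECRET_ACCESS_KEY',
    ACCOUNT_ID 'R2_ACCOUNT_ID'
);
```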
Create secrets for DuckLake
Running the following script will create eight new Workers secrets needed for deployment:
scripts/setup_ducklake_secrets.sh
Deployment
If you followed the above steps, you can now run the actual deployment. This will create a Worker and a Durable Object, build the Docker image locally, and then upload it to the Cloudflare container registry.
npm run deploy
Running queries
Replace the `WORKERS_URL` with the real URL, and the `API_TOKEN` with your real API token:
curl --location 'https://WORKERS_URL/query' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer API_TOKEN' \
--data '{
"query": "CREATE TABLE ducklake.orders AS SELECT * FROM '\''https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet'\''"
}'
The response should look like this:
[
{"Count":"15000"}
]
Get the first row of the new table:
curl --location 'https://WORKERS_URL/query' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer API_TOKEN' \
--data '{
"query": "SELECT * FROM ducklake.orders LIMIT 1"
}'
The response should look like this:
[
{
"o_orderkey": 1,
"o_custkey": 370,
"o_orderstatus": "O",
"o_totalprice": 172799.49,
"o_orderdate": {
"days": 9497
},
"o_orderpriority": "5-LOW",
"o_clerk": "Clerk#000000951",
"o_shippriority": 0,
"o_comment": "nstructions sleep furiously among "
}
]
Deployment with the additional R2 Data Catalog / Iceberg integration
With the `v1.3.0` release of DuckDB, it became possible to connect to the R2 Data Catalog. This means that you can now also read Iceberg data from R2's Object Storage directly from a SQL statement issued by DuckDB.
Enable R2 Data Catalog for existing R2 bucket
You can enable the Apache Iceberg support via the R2 Data Catalog by running the following command:
scripts/setup_r2_data_catalog.sh YOUR-BUCKET-NAME
where `YOUR-BUCKET-NAME` is the name of the bucket you used in the R2 bucket creation step before.
Getting the R2 Data Catalog information
The information about the R2 Data Catalog URI (`R2_ENDPOINT` secret) and the warehouse name (`R2_CATALOG` secret) can be gathered by running:
scripts/get_catalog_info.sh YOUR-BUCKET-NAME
where `YOUR-BUCKET-NAME` is the name of the bucket you used in the R2 bucket creation step before.
Please also store this information, because you'll need it in the next step. Also, please add the three variables to `.dev.vars` if you want to use the R2 Data Catalog when running the Docker image locally.
Creating secrets for R2 Data Catalog
To create the necessary Workers secrets, run:
scripts/create_r2_data_catalog_secrets.sh
and copy & paste the respective values you noted in the last two steps.
Writing Iceberg example data
As we need some example data to test the new Iceberg capabilities, we'll create it manually with a Python script.
For convenience, we'll use Marimo for this. It requires a working Python installation on your machine.
Setup Marimo
To set up Marimo, run the following npm task:
npm run iceberg:setup
This will create a new directory and install Marimo and some dependencies in a virtual environment.
Create the Iceberg example data
To create the Iceberg example data, run the following npm task:
npm run iceberg:create
This will start Marimo and load the respective Python script. You'll need to edit the variables `WAREHOUSE`, `CATALOG_URI`, and `TOKEN` with the values gathered in the last steps. This is also described in the R2 Data Catalog docs.
After you've done that, you can run the Python cells in the Marimo notebook, and you should end up with some data created in the R2 bucket. You can check this in the Cloudflare Dashboard, or via `wrangler`.
Running DuckDB with Iceberg support
Once you've created the secrets as outlined above (all of them!), the application will automatically create a DuckDB secret for accessing the R2 Data Catalog, and also attach the catalog under the name `r2lake` when it starts. The example table created in the last step is called `people` and was created in the `default` schema.
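Roughly, that attach should look like the sketch below, which follows the R2 Data Catalog documentation for DuckDB (the uppercase placeholders correspond to the secrets created in the previous steps, not real values):

```sql
-- Requires DuckDB >= 1.3 with the iceberg extension
INSTALL iceberg;
LOAD iceberg;

-- Token for the R2 Data Catalog REST interface (placeholder)
CREATE SECRET r2_catalog_secret (
    TYPE ICEBERG,
    TOKEN 'R2_DATA_CATALOG_TOKEN'
);

-- Attach the warehouse (R2_CATALOG) via the catalog URI (R2_ENDPOINT)
ATTACH 'R2_CATALOG' AS r2lake (
    TYPE ICEBERG,
    ENDPOINT 'R2_ENDPOINT'
);

SELECT * FROM r2lake.default.people;
```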
Deployment
If you followed the above steps, you can now run the deployment. This will update the existing deployment if you already deployed the DuckLake configuration.
Hint: You need access to the Cloudflare Containers beta to be able to deploy it to Cloudflare!
npm run deploy
The `wrangler` deployment output shows the Workers URL where the service can be reached. Note it down if you want to run test queries (`WORKERS_URL` in the step below).
Running queries
Replace the `WORKERS_URL` with the real URL, and the `API_TOKEN` with your real API token:
curl --location 'https://WORKERS_URL/query' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer API_TOKEN' \
--data '{
"query": "SELECT * FROM r2lake.default.people"
}'
This should return the response below:
[
{
"id": "1",
"name": "Alice",
"score": 80
},
{
"id": "2",
"name": "Bob",
"score": 92.5
},
{
"id": "3",
"name": "Charlie",
"score": 88
}
]
Summary
We've demonstrated that it's possible to build (and deploy) a Lakehouse infrastructure based on DuckLake and DuckDB that can stay below $10/month while running completely serverless. This can be a suitable way to run a flexible Lakehouse for many smaller projects or teams.
There's no longer a need to build complex and costly infrastructure projects on the big public cloud providers.
Let me know if you have any feedback in the comments!