My Attempt - And Failure - At Creating A Poor Man's Data Warehouse Using DuckDB


I know what some of you are probably thinking: "Oh great, another article about DuckDB." And honestly, I get it. A few months ago, I was in the same boat. It felt like DuckDB was following me around—blog posts, tweets, random mentions. It was everywhere. Kind of like a friendly ghost that wouldn't leave me alone.
For those who haven’t crossed paths with it yet, DuckDB is an “analytical in-process SQL database management system”: think of it like SQLite, but built for analytics instead of transactional workloads. In simpler terms, it's an analytics engine you can embed directly into your program. It lets you run analytical queries on large datasets efficiently, without needing to spin up a massive database server. It also supports multiple client APIs—so whether you're writing in Python, Java, Go, NodeJS, or something else, you’re probably covered.
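To make "embedded" concrete, here's a minimal sketch in Python; the file name is just a placeholder, and the same pattern works from the other client APIs:

import duckdb

# DuckDB runs inside this Python process; there is no server to install or connect to.
con = duckdb.connect()  # in-memory database

# Scan a Parquet file directly; 'events.parquet' is a hypothetical local file.
print(con.sql("SELECT count(*) FROM read_parquet('events.parquet')").fetchall())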
Now, since DuckDB is an in-process engine, you can’t really treat it like a traditional database. Sure, it supports ACID and can persist to disk, and yeah, multiple processes can technically access the same files. But it doesn’t have a network interface, nor does it handle users, authentication, or any of the other functionality you’d expect from a traditional database server. That’s just not what it was built for. It’s designed to live and operate only inside the process that launched it.
The Dream
So what was I trying to do with DuckDB? I wanted to use it as my main database engine. That’s it. I knew DuckDB could read and write Parquet files, and that was all I needed. My idea was to write data as Parquet files to S3 and have DuckDB read them from there. And since DuckDB is lightweight, I could even run it inside AWS Lambda to keep costs low—no need to spin up a full EC2 instance. This setup is sometimes called a "Poor Man's Warehouse." Ironically, it's more of a data lake, but whatever, the name stuck.
And honestly, it’s a cool idea. Following this pattern, my so-called data warehouse would be extremely cost-efficient: you’d only pay per query. DuckDB is open source, so there are no license fees to worry about, and AWS Lambda gives you up to 10 GB of memory and 10 GB of ephemeral storage, which, given DuckDB’s efficiency, is plenty for many queries.
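To sketch the idea (the bucket, region, and credentials below are all placeholders), DuckDB's httpfs extension can scan Parquet files sitting in S3 directly:

import duckdb

con = duckdb.connect()

# The httpfs extension adds s3:// support to DuckDB.
con.sql("INSTALL httpfs; LOAD httpfs;")
con.sql("SET s3_region='eu-west-1'")        # placeholder region
con.sql("SET s3_access_key_id='...'")       # placeholder credentials
con.sql("SET s3_secret_access_key='...'")

# Scan Parquet files straight out of S3; bucket and prefix are made up.
con.sql("""
    SELECT event_type, count(*) AS cnt
    FROM read_parquet('s3://my-poor-mans-warehouse/events/*.parquet')
    GROUP BY event_type
""").show()

In the Lambda version, the same few lines would live inside the handler, with the SQL coming in from the invocation payload.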
The BI Tools Problem
But of course, there was a catch: BI tools. I needed a way for BI tools to talk to DuckDB. While there is a JDBC driver for DuckDB, it doesn’t work over the network. That’s because, again, DuckDB was never designed to be accessed like a traditional server.
So I started digging. I wanted to understand how BI tools usually connect to databases. And that’s when I stumbled upon a gem.
Enter mysql_mimic
mysql_mimic is a Python framework that pretends to be a MySQL server. It speaks the MySQL wire protocol, including the handshake, so to any BI tool, it just looks like a regular MySQL database.
I gave it a quick spin on my local machine, and it actually worked. I pointed DBeaver at mysql_mimic, and boom—it connected just like it would to a MySQL server. I was able to query both CSV and Parquet files without any issues.
import logging
import asyncio
import duckdb
import os
import glob
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
from sqlglot import transpile
from mysql_mimic import MysqlServer, Session

# Define supported file formats and their DuckDB read functions
SUPPORTED_FORMATS = {
    '.parquet': lambda file: f"READ_PARQUET('{file}')",
    '.csv': lambda file: f"READ_CSV('{file}')",
    '.json': lambda file: f"READ_JSON('{file}')",
}

# Initialize DuckDB (in-memory, with a small memory cap)
db = duckdb.connect(":memory:")
db.sql("SET memory_limit='80MB'")


def create_view_for_file(file_path):
    # Expose a data file as a DuckDB view named after the file (without extension)
    file_ext = os.path.splitext(file_path)[1].lower()
    if file_ext in SUPPORTED_FORMATS:
        file_name = os.path.splitext(os.path.basename(file_path))[0]
        read_func = SUPPORTED_FORMATS[file_ext]
        try:
            db.sql(f"CREATE VIEW {file_name} AS SELECT * FROM {read_func(file_path)}")
            logging.info(f"Created view for {file_path}")
        except Exception as e:
            logging.error(f"Failed to create view for {file_path}: {e}")


class DataFolderHandler(FileSystemEventHandler):
    # Watch the data folder and register new files as views as they appear
    def on_created(self, event):
        if not event.is_directory:
            asyncio.run(self.handle_new_file(event.src_path))

    async def handle_new_file(self, file_path):
        # Small delay to ensure the file is fully written
        await asyncio.sleep(1)
        create_view_for_file(file_path)


# Initialize views for existing files
data_path = os.path.join(os.path.dirname(__file__), "data")
for ext in SUPPORTED_FORMATS.keys():
    for file_path in glob.glob(os.path.join(data_path, f"*{ext}")):
        create_view_for_file(file_path)


class MySession(Session):
    async def _execute_query(self, result):
        # Yield rows one at a time
        while True:
            row = result.fetchone()
            if row is None:
                break
            yield row

    async def query(self, expression, sql, attrs):
        try:
            # Transpile MySQL to DuckDB dialect
            duckdb_sql = transpile(sql, read='mysql', write='duckdb')[0]
            # Execute query in DuckDB
            result = db.sql(duckdb_sql)
            # Get column names
            columns = [col[0] for col in result.description]
            return self._execute_query(result), columns
        except Exception as e:
            print(f"Error executing query: {e}")
            return [], []

    async def schema(self):
        # Query the same connection that holds the views
        return db.sql("SELECT * FROM information_schema.tables").fetchall()


async def main():
    logging.basicConfig(level=logging.DEBUG)
    # Set up the file system observer so new files become views on the fly
    observer = Observer()
    observer.schedule(DataFolderHandler(), data_path, recursive=False)
    observer.start()
    try:
        server = MysqlServer(session_factory=MySession, port=3309)
        await server.serve_forever()
    finally:
        observer.stop()
        observer.join()


if __name__ == "__main__":
    asyncio.run(main())
The Rollout Plan
So I had a plan. Launch DuckDB inside mysql_mimic and host it on AWS Lambda. But that idea quickly fell apart. AWS Lambda can’t speak the MySQL wire protocol: a function isn’t a long-running server listening on a TCP port, it gets invoked over HTTP through the Lambda API or triggers like API Gateway. So that was a dead end.
Onto plan B: I’d spin up a tiny EC2 instance to act as a “shell server.” This instance would handle the MySQL protocol for the BI tools, then delegate the heavy lifting to Lambda functions. Those Lambdas would run the actual queries in DuckDB, return the results, and the EC2 instance would just forward them along. This would increase the cost a little bit, as well as the latency, but that was a price I was willing to pay.
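For what it's worth, here's roughly what the shell server's hand-off to Lambda might have looked like. Everything here is hypothetical: the function name, the payload shape, and the assumption that the Lambda returns rows and column names as JSON.

import json
import boto3

lambda_client = boto3.client("lambda")

def run_on_lambda(sql: str):
    # Synchronously invoke a hypothetical Lambda function that runs the query in DuckDB.
    response = lambda_client.invoke(
        FunctionName="duckdb-query-runner",        # hypothetical function name
        InvocationType="RequestResponse",
        Payload=json.dumps({"sql": sql}).encode(),
    )
    body = json.loads(response["Payload"].read())  # assumed shape: {"columns": [...], "rows": [...]}
    return body["rows"], body["columns"]

Inside MySession.query, something like this would replace the local db.sql call (ideally pushed off the event loop with run_in_executor, since boto3 is blocking).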
This setup sounded clever, but again, I ran into a wall—AWS Lambda has a 6 MB limit on response payloads. That just wasn’t going to cut it for most BI queries. Of course, I could have AWS Lambda write the data into S3 and have the shell server read it from there, but that would increase the latency too much. So I finally called it. I decided not to pursue the idea any further.
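For completeness, the S3 workaround I decided against would have looked roughly like this on the Lambda side: run the query, spill the result to S3 as Parquet, and hand back only the location for the shell server to fetch. The bucket, region, and event shape are placeholders.

import uuid
import duckdb

def handler(event, context):
    # Hypothetical Lambda handler: run the incoming SQL and write the result set to S3.
    con = duckdb.connect()
    con.sql("INSTALL httpfs; LOAD httpfs;")
    con.sql("SET s3_region='eu-west-1'")  # placeholder region; credentials as in the earlier sketch

    result_uri = f"s3://my-poor-mans-warehouse/results/{uuid.uuid4()}.parquet"  # placeholder bucket
    con.sql(f"COPY ({event['sql']}) TO '{result_uri}' (FORMAT PARQUET)")

    # The shell server then has to read this file back, which is the extra round trip
    # (and the extra latency) that made me drop the idea.
    return {"result_location": result_uri}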
But I wanted to share my journey here, just in case someone else wants to pick up where I left off. Maybe there’s a workaround I didn’t think of. Or maybe this’ll spark a better idea in someone else’s head.
Either way, good luck—and may your queries be fast and your costs low.