Episode 2: The Warehouse Awakens: Order from the Chaos


📍Recap of where we paused:
Jigyās had just learned why relational databases break at analytical scale. He asked about columnar databases (like ClickHouse), and Jñānesh explained how they optimize analytical queries by storing data by column rather than by row.
Now we’ll continue the conversation as Jñānesh transitions from columnar storage to full-blown data warehouses, setting the stage for star schemas, batch processing, and the pain points that led to data lakes.
Let’s jump in. 🎬
(Part 1: When Columns Took Over)
🧑💻 “So… columnar databases store one column at a time, right?”
🧙♂️ “Exactly. Imagine you’re scanning millions of sales records just to find the total revenue by country. A columnar store reads just the country and amount columns, not the entire row. It’s like reading only the words you care about in a paragraph.”
🧙♂️ “This makes analytical queries faster — much faster — especially on massive datasets.”
Jigyās squints.
🧑💻 “Yeah yeah… I’ve heard that before. ‘Faster at scale,’ they all say. Show me proof, Guru. Or am I just supposed to believe every data blog I read at 2 AM?”
Jñānesh chuckles — not offended, but amused.
🧙♂️ “Fair enough. Let’s imagine two tables. Both have 100 million rows. You want to know the average purchase amount per country.”
He sketches a simple schema in the air:
user_id | name     | country | purchase_amount | quantity
101     | Rahul    | India   | 100             | 2
102     | Narendra | India   | 157             | 3
🧙♂️ “Now, in a traditional row-based store, every record is stored one after the other, with all fields packed together. So even if you only need purchase_amount and country, the engine still reads entire rows.”
🧙♂️ “That’s like reading every word of every book in a library just to find how many times the word ‘India’ shows up.”
Jigyās raises an eyebrow.
🧑💻 “And in columnar?”
🧙♂️ “Each column is stored separately. So the query engine goes directly to the purchase_amount and country columns, skipping the rest.”
🧙♂️ “It’s like having a shelf where every book only has the chapter you care about.”
📊 Columnar Storage: Columns stored in separate sections (aka pages or blocks)
+------------------------+ +-----------------------+ +-------------------------+
| Column: user_id        | | Column: country       | | Column: purchase_amount |
| 101, 102, ...          | | 'India', 'India', ... | | 100, 157, ...           |
+------------------------+ +-----------------------+ +-------------------------+
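For the skeptics reading along with Jigyās, here is a minimal Python sketch of the two layouts. It is a toy model, not how any real engine is implemented: the third row (Asha, Nepal) and the `fields_touched` counter are assumptions added purely for illustration, and counting field reads only stands in for disk I/O.

```python
from collections import defaultdict

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"user_id": 101, "name": "Rahul", "country": "India", "purchase_amount": 100, "quantity": 2},
    {"user_id": 102, "name": "Narendra", "country": "India", "purchase_amount": 157, "quantity": 3},
    # Hypothetical extra row, not in the original schema sketch.
    {"user_id": 103, "name": "Asha", "country": "Nepal", "purchase_amount": 80, "quantity": 1},
]

# Column-oriented layout: one list per column, aligned by position.
columns = {field: [row[field] for row in rows] for field in rows[0]}


def avg_by_country_rows(rows):
    """Row store: every whole record is touched, even the unused fields."""
    totals, counts = defaultdict(float), defaultdict(int)
    fields_touched = 0
    for row in rows:
        fields_touched += len(row)  # the whole row is read
        totals[row["country"]] += row["purchase_amount"]
        counts[row["country"]] += 1
    return {c: totals[c] / counts[c] for c in totals}, fields_touched


def avg_by_country_columns(columns):
    """Column store: only the two columns the query needs are read."""
    totals, counts = defaultdict(float), defaultdict(int)
    needed = ("country", "purchase_amount")
    fields_touched = sum(len(columns[name]) for name in needed)
    for country, amount in zip(columns["country"], columns["purchase_amount"]):
        totals[country] += amount
        counts[country] += 1
    return {c: totals[c] / counts[c] for c in totals}, fields_touched


row_result, row_touched = avg_by_country_rows(rows)
col_result, col_touched = avg_by_country_columns(columns)
print(row_result == col_result)   # same answer either way
print(row_touched, col_touched)   # 15 field reads vs 6
```

Both functions return the same averages; the only difference is how much data each one had to touch to get there. With five columns the gap is small, but it grows with every extra column the query does not need.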
🧙♂️ “Now in columnar file formats like Parquet and ORC, each column is stored in its own block or page on disk. Arrow applies the same idea in memory.”
🧙♂️ “So when you run:
SELECT country, SUM(purchase_amount) FROM sales GROUP BY country;
the query engine skips over everything else, like name and quantity, and goes straight to the relevant blocks.”
🧙♂️ “Less disk I/O. Faster scans. Better cache utilization. And, crucially, better compression.”
🧙♂️ “If every value in a block is from the same column and mostly similar… compression shines.”
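One way to see why similar neighbouring values compress well is run-length encoding, sketched below. This is a simplified illustration with made-up column values; real formats such as Parquet combine several encodings (dictionary and run-length among them) and are considerably more involved than this.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical slice of a country column: long runs of the same value.
country_column = ["India"] * 5 + ["Nepal"] * 3 + ["India"] * 2

encoded = run_length_encode(country_column)
print(encoded)  # [('India', 5), ('Nepal', 3), ('India', 2)]
print(len(country_column), "values stored as", len(encoded), "runs")
```

In a row layout the same country values would be interleaved with names, amounts, and quantities, so those long runs would never form.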
🧑💻 “So... instead of scanning across all fields in every row, it just… laser-focuses on the column I want?”
🧙♂️ “Exactly. That’s what makes columnar formats perfect for analytical workloads.”
Jigyās leans back, a little triumphant.
🧑💻 “Okay, so columnar databases like ClickHouse store data column-by-column, which helps scan only what you need and makes queries fast. That sounds like it solves the problem—so why isn’t everyone just using them?”
Jñānesh pauses — a longer pause this time.
🧙♂️ “Ah, you’ve found the limitation.”
🧙♂️ “Columnar formats solved speed. But speed alone isn’t a system.”
🧠 Jñānesh Explains the Real Problems
🧙♂️ “Let’s say your company dumps all raw data into a columnar database. Great. Now answer me this…”
How do teams know what each table means?
Who owns the logic for calculating revenue?
What happens if someone renames a column or changes a data type?
How do you enforce data freshness and quality across teams?
Can your marketing team and finance team query data with confidence, and get the same answer?
Jigyās doesn’t answer — because, well, there is no answer.
🧙♂️ “Columnar databases are just a way of storing data efficiently.
They don’t provide structure. They don’t orchestrate transformation.
They don’t manage governance. They don’t model business logic.”
🏗️ What We Needed Was More Than Storage
🧙♂️ “As data volumes grew, and teams depended on data for decisions — we needed more than raw performance.”
🧙♂️ “We needed systems that could:”
✅ Clean and transform raw data into useful tables
✅ Define consistent metrics (like 'Monthly Active Users')
✅ Track lineage — where data came from and how it changed
✅ Control access and versioning
✅ Serve dashboards, reports, and business logic
✅ Run reliable, repeatable batch jobs (ETL)
🧙♂️ “That’s when we realized — we don’t just need fast tables.
We need a warehouse. A place where data can live, evolve, and be trusted.”
Jigyās leans in.
🧑💻 “So a data warehouse is not just a fast database — it’s an entire system with… structure, logic, and lifecycle?”
🧙♂️ “Precisely.”
🧙♂️ “It’s where we define how data flows — from ingestion, to transformation, to curated reporting.”
🧙♂️ “And at the center of it: ETL jobs, semantic models, and the governance to keep it consistent.”
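To make "ETL jobs and consistent metrics" a little more concrete, here is a small Python sketch of a batch transform followed by one shared metric definition. Everything in it is hypothetical: the `raw_orders` records, the field names, and the rule that revenue counts paid orders only are assumptions chosen for illustration, not a description of any particular warehouse.

```python
from datetime import date

# Hypothetical raw records, as they might land from an upstream system.
raw_orders = [
    {"order_id": "1", "country": " india ", "amount": "100", "status": "paid", "day": "2024-01-05"},
    {"order_id": "2", "country": "India", "amount": "157", "status": "refunded", "day": "2024-01-06"},
    {"order_id": "3", "country": "INDIA", "amount": "not-a-number", "status": "paid", "day": "2024-01-06"},
    {"order_id": "4", "country": "Nepal", "amount": "80", "status": "paid", "day": "2024-01-07"},
]


def transform(raw):
    """Clean step: fix types and casing, drop rows that fail validation."""
    clean = []
    for record in raw:
        try:
            amount = float(record["amount"])
        except ValueError:
            continue  # unparseable amount: keep it out of the curated table
        clean.append({
            "order_id": int(record["order_id"]),
            "country": record["country"].strip().title(),
            "amount": amount,
            "status": record["status"],
            "day": date.fromisoformat(record["day"]),
        })
    return clean


def revenue(orders):
    """The one shared definition (assumed here): paid orders only."""
    return sum(order["amount"] for order in orders if order["status"] == "paid")


curated_orders = transform(raw_orders)

# Marketing and finance call the same function on the same curated table,
# so they cannot drift apart on what "revenue" means.
marketing_view = revenue(curated_orders)
finance_view = revenue(curated_orders)
print(marketing_view, finance_view)  # 180.0 180.0
```

The point is less the code than where the logic lives: the cleaning rules and the metric sit in one governed place instead of being re-implemented in every dashboard.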
Jigyās takes a sip of now-cold coffee, still trying to process the idea of warehouses being more than just fast tables.
🧑💻 “Okay… I think I get it.”
“Warehouses aren’t just storage — they’re structured, governed, and come with ETL pipelines and modeled metrics.”
He pauses. Then his eyes narrow.
🧑💻 “But I have another question.”
🧑💻 “If a company has, like… two data warehouses, is that what people call a data lake?”
Jñānesh blinks. Then laughs — not mockingly, but with the warmth of someone who’s heard it all.
🧙♂️ “That, my friend, is not a data lake. That’s a data swamp waiting to happen.”
💡 Jñānesh Gently Corrects
🧙♂️ “A data lake isn’t a bunch of warehouses. It’s actually quite the opposite.”
🧙♂️ “A data warehouse is structured, governed, curated, like a well-run factory.”
🧙♂️ “A data lake is raw, flexible, and open-ended — like a storage yard where anything can be dropped: logs, images, audio, Parquet, CSVs, events, nested JSON…”
🧑💻 “Wait — so is it… like a big folder? Like Dropbox for data?”
🧙♂️ “Not a bad metaphor. Except this Dropbox lives on object storage — S3, GCS, Azure Blob, or even MinIO. And it holds everything.”
🧙♂️ “It was born because warehouses couldn’t handle:”
❌ Unstructured data (like images, logs, audio)
❌ Semi-structured formats (JSON, Avro, nested structures)
❌ Real-time data streams (IoT, Kafka)
❌ Cost-effective, elastic storage at petabyte scale
🧙♂️ “So engineers started dumping raw data into cheap object storage — and called it a data lake.”
🧑💻 “So no structure, no schema, just… vibes?”
🧙♂️ “Correct. Vibes and buckets. And chaos — unless you’re careful.”
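What "no schema until you read it" looks like in practice can be sketched in a few lines of Python. A plain dict stands in for an object storage bucket here; the keys, payloads, and the `read_events` helper are all hypothetical, and a real lake would sit behind S3-style APIs rather than an in-memory dict.

```python
import json

# A dict standing in for an object storage bucket: key -> raw bytes.
# The bucket accepts anything; nothing is validated on write.
bucket = {
    "events/2024-01-05/a.json": b'{"user_id": 101, "country": "India", "amount": 100}',
    "events/2024-01-06/b.json": b'{"user_id": "102", "country": "India", "amt": 157}',
    "logs/app.log": b"2024-01-06 ERROR payment timeout",
    "audio/call-17.wav": b"RIFF\x00\x00\x00\x00WAVE",
}


def read_events(bucket, prefix="events/"):
    """Schema-on-read: structure is imposed only when someone queries."""
    good, bad = [], []
    for key, blob in sorted(bucket.items()):
        if not key.startswith(prefix):
            continue  # logs and audio live in the lake too, but are not events
        record = json.loads(blob)
        try:
            good.append({
                "user_id": int(record["user_id"]),
                "country": str(record["country"]),
                "amount": float(record["amount"]),
            })
        except (KeyError, ValueError, TypeError):
            bad.append(key)  # e.g. a producer renamed "amount" to "amt"
    return good, bad


events, rejected = read_events(bucket)
print(len(events), "usable event(s); rejected:", rejected)
```

The second file was written without complaint and only fails when a reader expects an `amount` field. That deferred surprise is the chaos Jñānesh is warning about.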
Jigyās leans back, eyes wide.
🧑💻 “Okay… that explains a lot. So warehouses are about discipline, and lakes are like... creative anarchy?”
🧙♂️ “That’s one way to put it.”
🧙♂️ “Which is also why companies that build data warehouses — the ones that help teams bring order to that chaos — make a LOT of money.”
🧙♂️ “Because at some point, someone will look at all the JSON files floating around in S3 and say... ‘How do I make this into a dashboard?’”
🧑💻 “Let me guess: insert $100k/yr warehouse subscription here?”
🧙♂️ “Now you’re getting it.”
Jñānesh sips an imaginary cup of wisdom-tea and continues.
🧙♂️ “The beauty of data lakes is that you can store almost anything — not just tables.”
🧙♂️ “Audio recordings from support calls. Sensor readings from IoT devices. Web logs. Clickstreams. PDF scans. Even TikTok video metadata if you’re into that sort of thing.”
🧙♂️ “It’s cheap, elastic, and doesn’t care about structure.”
🧙♂️ “You see, Jigyās, a data warehouse gave you control and trust… but a data lake gave you freedom.”
🧙♂️ “And not just freedom of structure: freedom of storage, scale, and format.”
🧙♂️ “In a warehouse, you’re often tied to a single vendor, with your data sitting inside their walled garden.”
🧙♂️ “But in a data lake, your data lives in open formats like Parquet, ORC, and Avro, on object storage you control.”
🧙♂️ “And it scales like a dream. Object storage is effectively unlimited for most practical purposes, and cost-effective too.”
🧙♂️ “The lake doesn’t care what kind of data it is.
If it’s bytes — it belongs.”
Jigyās, now with just the right amount of skepticism, smirks:
🧑💻 “So basically… if it breathes, throw it in the lake?”
🧙♂️ “If it can be serialized, yes.”
🧑💻 “Okay… so then what is a data lakehouse now? Is it like… a data lake that is built in-house?”
Jñānesh chuckles — this time almost giggling like a backend engineer who finally fixed a bug by removing one line of code.
🧙♂️ “You joke… but you're closer than you think.”
🧙♂️ “A lakehouse is what happens when we try to bring the best parts of a warehouse — structure, governance, transactions — into the lake.”
🧙♂️ “It’s not just storage. It’s not just schema-on-read. It’s structure layered over freedom.”
“The lakehouse is what happens when you take the raw power of the lake… and give it discipline.”
Jigyās leans in again, brow furrowed — not in panic this time, but in focused curiosity.
🧑💻 “So... wait. You're saying we can store raw logs, audio files, user events, and still query them with SQL?”
🧑💻 “How do you even update or delete records?”
🧑💻 “Aren’t lakes just file dumps? How do you upsert into an object storage bucket?”
🧑💻 “Also, what keeps things from falling apart? There’s no strict schema, no table definitions. What if the format changes? Or someone writes partial files?”
🧑💻 “Where’s the control? The lineage? The contract that says ‘this column will always be a timestamp’?”
🧑💻 “And if things go wrong… how do I roll back? Do I just delete files manually from S3?”
🧑💻 “Isn’t object storage supposed to be immutable anyway?”
🧑💻 “And… what about performance? If everything’s dumped in folders, how do you even find what you need, let alone optimize it?”
Jñānesh doesn’t interrupt. He lets Jigyās go.
Finally, he smiles — not surprised, not smug — just… pleased.
🧙♂️ “Excellent questions. You're finally asking like an architect.”
🧙♂️ “But instead of giving you answers… I want you to pause.”
🧙♂️ “Think about this — what did you love about relational databases?”
“What did warehouses make possible at scale?”
“What did lakes unlock that neither could handle?”
Jigyās blinks. He doesn’t answer — not yet. But he’s thinking.
And just before Jñānesh fades into the mental mist again...
🧙♂️ “Good. Think about it. That’s where we’ll start next.”
🧙♂️ “Now if you’ll excuse me, Elon Musk is stuck debugging a merge conflict inside a time-traveling query on his Martian warehouse.”
🧙♂️ “He has a rocket launch next week, and apparently the dashboard shows negative fuel usage. Can’t let that slide.”
He vanishes — just a trace of wisdom left in the air.
Jigyās looks at the screen.
74 open tabs. 12 terminal windows. 4 open Notion pages.
But for once — his mind feels less cluttered than his browser.
🧑💻 “…a warehouse, a lake, and now a house on a lake. Who even names these things? I swear the next one will be a data submarine.”
He shakes his head, cracks his knuckles, and opens a new tab.
🏁 End of Episode 2: The Warehouse Awakens: Order from the Chaos
Next Up: Episode 3 — The Lake Learns to Wear a Hoodie
Written by achyuth reddy