Why CQRS Was Conceived: When Read-Optimized Databases Are Asked to Write

Table of contents
- What Exactly Is a Read-Optimized System?
- What Goes Wrong When You Try to Write
- Real Scenarios That Broke Read-Optimized Systems
- Known Failures in the Wild — Case Studies That Should've Been Warnings
- I Know What You’re Thinking: “Can’t We Just Hack Around This?”
- Final Word: The Misunderstanding That Made You Bleed

Most people explain CQRS by talking about what it is — a split between commands and queries, two models, two paths, and so on. That’s fine. But that’s not how CQRS was born.
It was born from pain.
This series isn’t about patterns. It’s about pressure. The kind of architectural pressure where perfectly well-behaved systems start misbehaving. This first post walks you through one of the most common — and invisible — sources of that pressure:
What happens when you ask a read-optimized system to write?
It sounds innocent. But in production, this one mistake has silently cost engineering teams months of debugging, millions in infra, and careers that spent more time writing backfill scripts than real features.
Let’s get into it.
What Exactly Is a Read-Optimized System?
Read-optimized systems — often referred to as OLAP (Online Analytical Processing) engines — are databases that are built for one thing:
Answering complex questions over large volumes of data, really fast.
They’re not designed to be your system-of-record. They’re not built to handle thousands of tiny updates per second. They’re designed to scan, aggregate, and summarize.
Here’s what they usually look like under the hood:
1. Columnar Storage
Instead of storing rows together (like OLTP systems do), they store columns together. So if you want to know the average age of 100 million users, the engine can read just the "age" column and skip the rest.
📌 Example: BigQuery, ClickHouse, and Snowflake all store data in this format. It makes `SELECT avg(age)` blazing fast — but try to update a single row, and things get messy.
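To make the contrast concrete, here's a minimal SQL sketch, assuming a hypothetical ClickHouse table `users(id, name, age, country)`; the names are purely illustrative.

```sql
-- Columnar-friendly: the engine reads only the `age` column and skips the rest.
SELECT avg(age) FROM users;

-- Column-hostile: in ClickHouse, a single-row change is an asynchronous mutation
-- that rewrites whole data parts in the background instead of touching one row in place.
ALTER TABLE users UPDATE age = 31 WHERE id = 42;
```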
2. Batched Ingestion Over Real-Time Writes
These systems want data in chunks. Bulk insertions are fast, predictable, and easy to compact. Row-by-row writes, on the other hand, are slow and often trigger background processes like merges or compactions.
📌 Example: Snowflake recommends using `COPY INTO` from staged files instead of individual row inserts. ClickHouse recommends batching inserts into at least 1,000 rows at a time for performance.
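Here's a minimal sketch of what that looks like in Snowflake, assuming a hypothetical stage `@events_stage` and target table `events` (both names are illustrative):

```sql
-- Preferred: bulk-load staged files in one statement.
COPY INTO events
  FROM @events_stage
  FILE_FORMAT = (TYPE = 'CSV');

-- Anti-pattern at scale: a stream of single-row inserts like this one.
INSERT INTO events (event_id, user_id, event_ts)
VALUES ('e-1', 'u-1', CURRENT_TIMESTAMP);
```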
3. Vectorized Execution
This means the query engine processes data in blocks — tens of thousands of rows at a time — using SIMD instructions. That’s what allows sub-second queries on billion-row tables.
📌 Example: ClickHouse’s execution model scans 65,536 rows at a time. Great for analytics, but not ideal when you want to upsert one tiny event.
4. Materialized Views and Pre-Aggregation
Since some queries are expensive, many systems precompute answers and cache them as materialized views. These views are updated on a schedule or after batch jobs.
📌 Example: In Redshift, it’s common to pre-aggregate revenue by day, country, and SKU into a materialized view that dashboards can read instantly.
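A minimal sketch of that pre-aggregation in Redshift-style SQL, assuming a hypothetical `orders(order_ts, country, sku, revenue)` fact table:

```sql
-- Precompute the expensive aggregation once...
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT
    DATE_TRUNC('day', order_ts) AS order_day,
    country,
    sku,
    SUM(revenue) AS total_revenue
FROM orders
GROUP BY 1, 2, 3;

-- ...and refresh it on a schedule or after batch loads, not on every write.
REFRESH MATERIALIZED VIEW daily_revenue;
```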
5. Light or No Indexing
To maintain write speed and enable distributed storage, most OLAP engines minimize or avoid secondary indexes. They rely on partitioning and large scans instead.
📌 Example: BigQuery and Athena don’t support indexes. If your query needs filtering, it scans entire partitions.
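For example, here's a hedged sketch assuming a hypothetical BigQuery table `analytics.events` partitioned by `DATE(event_ts)`; with no secondary indexes, the only way to cut the scan is to filter on the partition column:

```sql
-- Prunes to a single daily partition because the filter is on the partition column.
SELECT country, COUNT(*) AS event_count
FROM `analytics.events`
WHERE DATE(event_ts) = '2024-06-01'
GROUP BY country;

-- A filter on any other column (say, user_id) still scans every row in the
-- partitions it touches; there is no index to narrow it down.
```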
So What’s the Catch?
The catch is simple: these systems are read monsters and write cowards.
They’re optimized for:
- Append-only workloads
- Immutable facts
- Scheduled processing
Ask them to act like a real-time app database and they will either:
- Slow down
- Delay data visibility
- Corrupt queries
- Or worse — silently drop or misalign data without telling you
You can’t treat an observatory like a marketplace.
Just because it stores your data doesn’t mean it’s supposed to be the one changing it.
What Goes Wrong When You Try to Write
So what happens when you try to make it write? Not just once, but repeatedly — like a stream of user events, or real-time preferences, or row-by-row updates?
Here’s the honest answer: it fights back.
Not loudly. Not with errors. But with slowdowns, delays, and silent edge cases that are almost impossible to debug unless you already know what’s happening under the hood.
Let’s break it down, symptom by symptom.
1. Unpredictable Insert Latency
You start small. Maybe you’re streaming 100 rows a second — nothing fancy. Everything looks fine. Then the insert latency spikes. You’re not sure why. You add retries. The queue backs up. And suddenly your entire ingest pipeline looks like it’s under a DDoS.
Why it happens:
Read-optimized systems are often backed by append-only storage formats. To keep storage efficient, they run background processes like:
- Merge trees (in ClickHouse)
- Segment compactions (in Snowflake, Redshift)
- File rewrites (in Delta Lake, Iceberg)
These tasks compete with your inserts for disk, memory, and IO. So even if your write volume hasn’t changed, the system starts to self-throttle.
📌 Example: ClickHouse MergeTree tables trigger background merges that cause write latency to spike every few minutes — and there's no easy way to predict when.
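As a rough sketch of why, here's a hypothetical ClickHouse `events` table; every INSERT lands as a new part on disk, and the background merges that later combine those parts are exactly what competes with your writes:

```sql
CREATE TABLE events
(
    event_ts  DateTime,
    user_id   UInt64,
    action    String
)
ENGINE = MergeTree
ORDER BY (user_id, event_ts);

-- Prefer fewer, larger inserts (thousands of rows per statement)
-- over a steady drip of single-row writes that multiplies parts and merges.
INSERT INTO events (event_ts, user_id, action) VALUES
    (now(), 1, 'view'),
    (now(), 2, 'click');
```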
2. Delayed Visibility of Fresh Data
You write data. Run a query. Nothing shows up. You wait 5 seconds. Now it’s there. That inconsistency might be fine for dashboards. But what if you’re using the DB for fraud alerts or decision systems?
Why it happens:
To keep writes fast, most OLAP systems decouple ingestion from availability. They buffer incoming data in memory or write-ahead logs before flushing it into queryable storage.
📌 Example: BigQuery’s streaming inserts can take 2–10 seconds to appear in results. That’s intentional — it’s a performance tradeoff, not a bug.
3. Update and Delete Penalties
Say your system needs to correct records, handle failed events, or honor GDPR deletion requests. In a row-store database, that's a simple `UPDATE` or `DELETE`. But in a columnar system, that's usually a mark-and-merge operation.
Why it happens:
These systems don’t support in-place updates. Updates are stored as tombstones (deletion markers), and a background merge job physically removes or rewrites the affected rows later.
📌 Example: Snowflake handles updates by marking old records as deleted and writing new ones. This introduces bloat, breaks freshness guarantees, and requires vacuuming.
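A minimal sketch, assuming a hypothetical `events` table: in a row store this is a cheap, targeted operation, but in Snowflake it marks the affected micro-partitions as deleted and rewrites them, so a small erasure can churn far more data than the rows it removes.

```sql
-- Looks like a tiny GDPR-style erasure; the engine rewrites every
-- micro-partition that contains this user's rows.
DELETE FROM events WHERE user_id = 'u-42';
```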
4. Weak Transaction Guarantees
You try to write to two tables in a single logical operation. One succeeds. The other fails. There’s no rollback — and now your data is inconsistent.
Why it happens:
Read-optimized engines often skip transactional semantics to prioritize performance. There’s no support for:
- Atomic multi-row/multi-table writes
- Serializable isolation
- Conflict detection in concurrent writes
📌 Example: In Amazon Athena or Presto, there's no support for transactional DML. If your pipeline writes partial state, you’ll need to clean it up manually.
Let’s Recap with a Map
| Symptom | Root Cause | Example |
| --- | --- | --- |
| Insert latency spikes | Background merges compete with writes | ClickHouse MergeTree |
| Delayed visibility | Ingestion buffers not queryable immediately | BigQuery streaming inserts |
| Expensive deletes/updates | No in-place mutation support | Snowflake compaction model |
| Missing ACID guarantees | OLAP engines don't prioritize transactions | Athena/Presto: no rollback support |
The Invisible Failure Pattern
The worst part? These systems don’t crash. They degrade.
There are no loud, fatal errors. Just:
- Dashboards timing out
- Backpressure in pipelines
- Stale data that's hard to notice
- Engineers blaming Kafka, the network, or the query — not the design mismatch
The system doesn’t scream. It quietly tells you:
“I wasn’t built for this. Please stop.”
Real Scenarios That Broke Read-Optimized Systems
So far, we’ve looked at what read-optimized systems are good at, and what happens when they’re forced to do something they weren’t built for — write.
But here’s the thing: no one wakes up and says, “Let’s use an OLAP system for real-time writes.”
This mistake creeps in slowly.
It starts with a dashboard, then a small feature, then a marketing ask, and before you know it — you’ve duct-taped your way into a state where writes are fighting reads, consistency is falling apart, and your team is writing custom logic just to make the system behave.
Let’s walk through four real-world paths that lead to this trap.
Scenario 1: The Reporting System That Became a Real-Time Dashboard
You start with the classic setup — batch data pipelines that populate a reporting dashboard once a day. Everything’s clean and performant.
Then the product team says:
“Can we show real-time sales per region on the same dashboard?”
Sure. You add streaming ingestion to the same OLAP system. It works at first — until it doesn’t.
- Insert latency starts fluctuating
- Query freshness becomes unpredictable
- Dashboard charts show stale or partial data mid-meeting
📌 Example: A team used BigQuery for sales reports, then began streaming transactions via Pub/Sub for a live widget. Streaming insert quotas were exhausted within a week, and queries showed inconsistent totals due to ingestion lag.
Scenario 2: The Analytics DB Turned into a Feedback Store
You already log user behavior to an OLAP system — say ClickHouse or Snowflake. Then someone proposes:
“Let’s also store likes, shares, and comments in the same table. It’ll help us analyze user engagement.”
Suddenly, what used to be immutable event data now requires:
- Real-time writes
- Updates (when someone unlikes something)
- Multi-field inserts with user metadata
The read-optimized engine wasn’t built for this. Now you’re seeing:
- Merge conflicts
- Increased memory usage from constant small writes
- Materialized views breaking because schema expectations changed
📌 Example: A team used ClickHouse to analyze article views. They added a like/dislike toggle stored in the same table. Frequent updates forced compactions to run 3× more often, severely degrading read performance.
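For a sense of what those writes look like, here's a hedged sketch assuming a hypothetical `article_events` table that now carries a mutable `reaction` column (the schema is illustrative):

```sql
-- Each like/unlike toggle becomes an asynchronous mutation that
-- rewrites entire data parts, not a single row.
ALTER TABLE article_events
    UPDATE reaction = 'none'
    WHERE user_id = 42 AND article_id = 1001;

-- A growing backlog of unfinished mutations is an early warning sign
-- that the table is being used transactionally.
SELECT mutation_id, command, is_done
FROM system.mutations
WHERE is_done = 0;
```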
Scenario 3: The Metrics Store That Was Asked to Store Everything
At some point, centralization sounds like a good idea:
“Why not store logs, metrics, events, and exceptions all in one warehouse? Saves time and makes querying easy.”
What this actually means:
- Each data type has a different schema
- Each pipeline has a different ingestion pattern
- Cardinality explodes
- Storage and query costs rise exponentially
Queries that were once quick now scan bloated partitions with low selectivity.
📌 Example: A team dumped all monitoring data — logs, traces, metrics — into Snowflake. Logs had free-form text, metrics had tags, and exceptions had stack traces. Partition pruning failed often, and a simple `count(*)` over a day's data took 9 minutes.
Scenario 4: The Quiet Drift into CQRS (Without Naming It)
This is the most common — and dangerous — failure path.
You start with a replica of your OLTP DB for analytical reads. It works. Then:
- A sidecar process writes derived metrics into the same replica
- A background job corrects stale rows
- A feature toggle or config flag adds minor writes
- A dev writes to the replica “just for testing”
Now your replica is both read-heavy and write-active. You didn’t plan for it. But you’ve reached a point where:
- Queries are slower
- Replication lags
- Data looks different in prod vs. replica
- The team debates whether the replica is “read-only or not”
📌 Example: A team used a Postgres read replica to offload reporting. A feature that updated “last_active_time” started writing to it asynchronously. Replication started lagging, reports showed stale data, and caching layers had to be added to fix the fallout.
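If you suspect this drift, the lag is measurable before the reports go stale. A minimal sketch using standard PostgreSQL statistics views and functions:

```sql
-- On the replica: how far behind the primary has replay fallen?
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;

-- On the primary: per-replica gap between WAL sent and WAL replayed, in bytes.
SELECT client_addr,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```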
Pattern Summary
| Scenario | What Started It | What Broke |
| --- | --- | --- |
| Dashboard creep | Added live widgets to batch reports | Streaming limits, query freshness |
| Feedback in OLAP | Stored user-generated content in analytics DB | Compaction overload, schema drift |
| Centralized data lake | Logged everything into one warehouse | Exploding cardinality, scan cost |
| Read replica misuse | Quiet background writes to a read DB | Replication lag, eventual divergence |
Most production issues caused by misuse of OLAP systems look like “weird latency,” “random query failures,” or “inconsistent data.”
But those are surface symptoms.
Underneath is a deeper problem: you’ve confused a system of analysis with a system of record. And now, it's breaking both roles.
Known Failures in the Wild — Case Studies That Should've Been Warnings
By now, you’ve seen how read-optimized systems start to crack under write-heavy scenarios. But this isn't just theory or internal tooling horror stories. These issues are well-documented, and even the vendors admit them — if you read between the lines.
Let’s walk through some real cases where production systems tried to push these databases beyond their design—and paid for it.
❄️ Snowflake: Streaming Inserts Are Not a First-Class Citizen
Snowflake is amazing for batch analytics. But once you move to streaming ingestion:
- Data becomes available with noticeable delay.
- Write latencies can fluctuate heavily.
- There's no built-in mechanism to handle high-frequency updates or upserts.
Even Snowflake recommends that you stage your data in cloud storage (e.g., S3) and load it in batches via `COPY INTO`. Anything else is an anti-pattern at scale.
📌 Lesson: Real-time streaming into Snowflake is possible—but it’s not what it’s built for.
☁️ BigQuery: Consistency Is Optional, and Cost Is Punishing
BigQuery allows streaming inserts via its API. Sounds great—until you realize:
- Streamed data can take 2–10 seconds to appear in query results.
- There are daily and regional quotas for streaming inserts.
- Cost per MB for streaming is much higher than batch ingestion.
And perhaps worst of all? Queries might return partial results during ingestion windows, leading to subtle inconsistencies in dashboards or real-time alerts.
📌 Lesson: You can stream into BigQuery, but the pricing model and consistency behavior actively discourage it.
🧱 ClickHouse: Fast Until You Try to Mutate
ClickHouse is built for speed—billions of rows, sub-second queries. But if you start using it for row updates or deletes, it begins to suffer.
- Mutations aren’t instant. They’re async, background tasks.
- If you queue too many mutations, the merge scheduler gets overwhelmed.
- You’ll find yourself manually scheduling or rebalancing merges.
The official docs even tell you: avoid frequent updates and batch your inserts.
📌 Lesson: ClickHouse loves append-only data. Everything else is tolerated at best.
🪓 Athena & Presto: No Upserts, No ACID, No Hope
Athena and Presto are query engines over object storage (like S3). They don’t even pretend to support writes properly. There’s:
- No native support for `UPDATE`, `DELETE`, or `UPSERT`
- No transactions
- No enforcement of schema or table constraints
Any form of state correction becomes a multi-step, external process. For example: delete the old file, rewrite it with new data, and update the pointer manually or with a Glue job.
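As a rough illustration of that multi-step dance in Athena, assuming hypothetical `events` and `retracted_events` tables and an illustrative S3 path:

```sql
-- 1. Rewrite the data minus the bad records as a brand-new table (CTAS).
CREATE TABLE events_corrected
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/events_corrected/'
) AS
SELECT *
FROM events
WHERE event_id NOT IN (SELECT event_id FROM retracted_events);

-- 2. Repoint readers (swap views, update the Glue catalog entry, or drop and
--    re-create `events`). None of these steps is atomic or transactional.
```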
📌 Lesson: These systems are read-only by design. Trying to bend that leads to a data lake full of half-broken truths.
I Know What You’re Thinking: “Can’t We Just Hack Around This?”
That’s the classic engineering instinct.
You see a limitation, and your mind immediately looks for a workaround:
- “We’ll cache the writes and sync them later.”
- “Let’s slap a Redis layer on top — done.”
- “Maybe we can materialize views to hide the pain.”
Smart thoughts. Reasonable thoughts.
But at scale, they introduce more problems than they solve.
Let’s walk through why each of these go-to fixes falls apart in production.
🧠 “But Can’t I Just Cache the Writes in Redis and Sync Later?”
That’s a fair idea. But let’s break it down.
Problem 1: State Divergence
If Redis holds “latest” state, and the OLAP DB is updated asynchronously, you’ve created a fork in your truth. Any delay, failure, or retry inconsistency can lead to a cache-DB mismatch.
Example: A like-count stored in Redis is `1201`. Snowflake still shows `1194`. Which one do you trust?
Problem 2: Replay Complexity
What if the write fails? You need to retry. But now:
- Is the write idempotent?
- Will it trigger double counts?
- What if the schema changed during retry?
You’ve built an eventually-consistent pipeline without the support systems of one.
🧠 “Can Materialized Views Solve This?”
Only if your data is immutable. Materialized views are great for aggregating facts that never change (e.g., sales by day). But if the underlying rows are mutated or updated frequently:
- The view becomes stale.
- Recomputations become expensive.
- Backfills cascade across dependent views.
🧠 “What About Apache Pinot or Materialize?”
These newer systems offer:
- Hybrid OLAP+streaming ingestion
- Real-time query capabilities
- Support for materialized views that auto-refresh
But:
- Transactional support is weak
- Write concurrency at scale still introduces hiccups
- Tooling and debugging maturity are not at OLTP levels yet
They are promising — but still evolving.
Final Word: The Misunderstanding That Made You Bleed
Let’s wrap this one up.
The problem wasn’t the database.
The problem was asking it to become something it’s not. Something it never claimed to be.
Read-optimized systems were not betrayed — they were misunderstood.
You asked a librarian to host a rave party, and now you're wondering why the bookshelves are on fire.
This is where the idea of CQRS started to emerge. Not as a buzzword. Not as a pattern. But as a survival mechanism:
Split the reads from the writes. Let each system do what it’s best at.
In the next section, we’ll explore the opposite trap — the one teams fall into when they try to read heavily from write-optimized systems.
👉 Up next: When Write-Optimized Systems Are Asked to Read
Coming soon…