The CQRS Sync Architecture: The Bridge Between Two Worlds


By now, we’ve covered why CQRS exists.
We split the system because one DB couldn’t serve two masters — and that split gave reads and writes the space to do what they’re good at.
But that split came with a new responsibility:
👉 How do you keep those two worlds connected?
👉 How do you make sure your read model reflects what actually happened on the write side — without falling apart under lag, replays, or failures?
That’s where the CQRS sync architecture lives.
It’s not the glamorous part of CQRS. You won’t see it on pretty diagrams.
But in production?
It’s the part you’ll fight with the most.
This post is about that bridge:
How sync actually works
The techniques teams use
The failure modes that sneak in
And the principles that keep it sane at scale
Let’s break it down.
Why Sync Architecture Matters
When you decide to separate your reads and writes, you’re not just creating two models — you’re creating a contract between them.
That contract says:
The read model will always reflect the reality of the write model — eventually.
The problem is: this doesn’t just happen.
You need architecture that ensures:
Every meaningful change in the write model is communicated clearly
The read model updates in a way that’s reliable, idempotent, and correct
Failures, lag, and out-of-order delivery don’t silently corrupt your system
📌 Why sync isn’t “just an event bus”
In theory, CQRS diagrams look simple:
[Write Model] → [Event] → [Read Model]
In production, that arrow hides a lot:
What format are those events in?
How do you guarantee delivery?
What happens if the read model misses an event?
How do you handle duplicate or out-of-order events?
How much lag is acceptable before the system becomes unusable?
The sync layer isn’t just an arrow. It’s:
A transport mechanism (event bus, CDC, queue)
A processing system (consumer logic, idempotency checks, replay handlers)
An operational contract (monitoring, lag tracking, recovery)
Without robust sync architecture, you end up with:
Stale or incorrect reads: the read model no longer reflects business truth
Data drift: no one notices until customers or auditors do
Invisible lag: no alert fires, but your read model is minutes behind
Painful debugging: tracing the lifecycle of a fact across systems becomes slow and error-prone
The point is simple:
CQRS doesn’t end at the split. The system only works if the bridge between write and read is solid.
That’s why the sync architecture is the real heart of CQRS. It’s what stops your read model from becoming an unreliable cache pretending to be a source of truth.
What Needs to Be Synced
It sounds obvious:
“The read model just needs to know what happened.”
But in practice, what needs to be synced is more than just facts. It’s meaningful changes in system state, captured in a way that the read model can use safely, even under failure, lag, or replay conditions.
Let’s break it down.
1️⃣ Domain Events — Not Just Database State
The write model doesn’t sync raw table diffs or row updates.
It syncs events that represent intent:
OrderPlaced(orderId, userId, amount, timestamp)
UserProfileUpdated(userId, newCity, timestamp)
PaymentReceived(paymentId, orderId, amount, timestamp)
These are atomic, meaningful facts — not just DB deltas.
📌 Why? Because the read model is supposed to build projections based on what happened, not how your write DB happens to store it.
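As a minimal sketch, here's what those events might look like as immutable records in application code. The field names mirror the examples above; the dataclass shape and JSON serialization are assumptions, not a prescribed format.

```python
# A minimal sketch of domain events as immutable records.
# Field names mirror the examples above; the shape is illustrative only.
from dataclasses import dataclass, asdict
from datetime import datetime
import json

@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    user_id: str
    amount: float
    timestamp: datetime

def serialize(event: OrderPlaced) -> bytes:
    # Events cross process boundaries, so they need an explicit wire format.
    payload = asdict(event)
    payload["timestamp"] = event.timestamp.isoformat()
    return json.dumps(payload).encode("utf-8")
```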
2️⃣ All Projections and Views That Serve Queries
Every projection your system depends on needs to be fed by the sync layer:
Denormalized document views (e.g. Mongo, Redis, Elasticsearch)
Aggregates (e.g. daily revenue summaries, leaderboard scores)
Precomputed filters and indexes for UI
If that projection answers queries, it relies on the sync layer.
3️⃣ Multiple Read Models (If You Have Them)
In a mature CQRS system, you rarely have one read model:
The search system might be in Elasticsearch
The dashboard aggregates in ClickHouse
The user-facing app in Redis or a custom API cache
Each of these needs to be kept in sync, often from the same event stream — but with different projection logic, performance requirements, and tolerance for lag.
4️⃣ Replay and Recovery State
Your sync layer doesn’t just feed live projections.
It must support:
Event replays to rebuild projections after failure
Backfills when a new read model or view is added
Versioning of events if your domain model evolves
If you don’t design for this up front, adding or recovering a read model later becomes a nightmare.
The trap:
“We’ll just sync what we need right now.”
That’s how you end up bolting on workarounds later — ETL jobs, one-off scripts, manual fixes — because the sync layer wasn’t built to scale with the system.
Common Sync Mechanisms
There’s no single “right” way to keep your CQRS models in sync.
There are patterns — and each comes with its own trade-offs, failure modes, and operational realities.
Let’s break down the most common ones you’ll see in production.
1️⃣ Event Bus (Kafka, NATS, RabbitMQ, Pulsar)
👉 How it works:
Your write model emits domain events into an event bus.
One or more consumers subscribe, process these events, and update the read models.
👉 Why teams choose it:
Highly decoupled — write model doesn’t care how many read models there are
Durable and scalable — can handle high throughput
Natural support for multiple consumers (different projections, audit log, downstream systems)
👉 What can go wrong:
Ordering issues: events may arrive out of order unless you partition carefully
Duplication: consumers need idempotency — they will see retries and duplicates
Lag risk: if consumers fall behind, your read model drifts silently
Replay complexity: reprocessing old events can be tricky if schema evolved
📌 This is the most common approach in modern CQRS systems — but it demands solid consumer design.
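To make "solid consumer design" concrete, here's a rough sketch of an event-bus consumer using kafka-python. The topic, group id, and event fields are assumptions, and the in-memory dedupe set stands in for a durable store; the key moves are committing offsets only after the projection is updated, and skipping events that were already applied.

```python
# A rough consumer sketch (kafka-python; topic, group id, and event fields are
# assumptions). Offsets are committed only after the projection is updated, and
# already-applied events are skipped so retries and duplicates are harmless.
import json
from kafka import KafkaConsumer  # pip install kafka-python

order_summaries: dict[str, dict] = {}   # toy read model
applied_event_ids: set[str] = set()     # in production: a durable dedupe store

def update_order_projection(event: dict) -> None:
    order_summaries[event["orderId"]] = {
        "userId": event["userId"],
        "amount": event["amount"],
    }

consumer = KafkaConsumer(
    "order-events",
    bootstrap_servers="localhost:9092",
    group_id="order-projection",
    enable_auto_commit=False,           # commit manually, below
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    event = message.value
    if event["eventId"] not in applied_event_ids:
        update_order_projection(event)
        applied_event_ids.add(event["eventId"])
    consumer.commit()                   # advance the offset only after applying
```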
2️⃣ Change Data Capture (CDC)
👉 How it works:
Instead of emitting domain events, you capture changes at the DB level — usually via the database’s write-ahead log or binlog.
These changes get published to a bus or applied directly to the read model.
👉 Why teams choose it:
No need for your app code to emit events separately — fewer moving parts
Easier to bolt onto existing systems (no need for domain event plumbing)
👉 What can go wrong:
You’re syncing DB state, not domain intent — harder to reason about projections
Schema drift: changing write-side tables breaks your read model sync
No business-level semantics: CDC knows a row changed, but not why
📌 CDC works well for systems where business meaning maps cleanly to row changes. It’s fragile when domain logic is complex.
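To make the "state, not intent" point concrete, here's a rough sketch of handling a Debezium-style change record (the envelope is simplified, and the table and column names are assumptions). Notice the handler can see that a row changed, but it has to infer what that change means for the business.

```python
# A rough sketch of applying a CDC change record to a read model.
# The record loosely follows Debezium's before/after/op envelope (simplified);
# table and column names are assumptions.
user_city_index: dict[str, str] = {}  # toy read model: user id -> city

def apply_change(change: dict) -> None:
    op = change["op"]          # "c" = create, "u" = update, "d" = delete
    after = change.get("after")
    before = change.get("before")
    if op in ("c", "u") and after is not None:
        # We know the row's new state, but not the business reason it changed.
        user_city_index[after["user_id"]] = after["city"]
    elif op == "d" and before is not None:
        user_city_index.pop(before["user_id"], None)

# Example record, roughly as it might arrive from the write DB's log:
apply_change({
    "op": "u",
    "before": {"user_id": "u-42", "city": "Austin"},
    "after":  {"user_id": "u-42", "city": "Denver"},
})
```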
3️⃣ Dual Writes (anti-pattern warning)
👉 How it works:
Your app tries to write to the write model and the read model at the same time, typically in the same transaction or handler.
👉 Why teams try it:
Looks simple: no event bus, no consumer logic
Immediate sync between models (in theory)
👉 What can go wrong:
No atomicity across systems: one write may succeed, the other fail — now you’re out of sync
Harder to retry safely: no clear source of truth for what should exist
Tight coupling: every write now cares about both models’ storage shape
📌 Teams try this for “quick wins” — but it’s a footgun at scale.
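The failure mode is easy to see in a sketch. The two store clients below are placeholders; the gap between the two calls, which no single transaction covers, is the whole problem.

```python
# A sketch of the dual-write anti-pattern. The two "clients" are placeholders;
# the point is the gap between the two calls.
def place_order(write_db, read_cache, order: dict) -> None:
    write_db.insert("orders", order)          # write model updated...

    # ...if the process crashes here, or this call fails, the read model never
    # hears about the order, and nothing exists to repair the gap later.
    read_cache.set(f"order:{order['id']}", order)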
4️⃣ Materializer Jobs / ETL Pipelines
👉 How it works:
Batch jobs or stream processors scan the write DB and build projections offline — e.g. nightly jobs that recompute reports or pre-join tables.
👉 Why teams choose it:
Simple to build initially
Works when lag is acceptable (e.g. reports, exports)
👉 What can go wrong:
Stale data: read models are only as fresh as the last job run
Difficult to incrementally update: expensive to recompute full views repeatedly
No real-time guarantees
📌 Useful for batch reporting, but doesn’t solve live sync needs.
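A rough sketch of a materializer job, using SQLite syntax for brevity (table and column names are assumptions). It recomputes the whole projection, which is exactly why it's simple, and exactly why it's stale between runs.

```python
# A rough materializer sketch (SQLite for brevity; table and column names are
# assumptions). It recomputes the projection wholesale, so freshness is bounded
# by how often the job runs.
import sqlite3

def rebuild_daily_revenue(conn: sqlite3.Connection) -> None:
    with conn:  # one transaction: readers never see a half-built projection
        conn.execute("DELETE FROM daily_revenue")
        conn.execute(
            """
            INSERT INTO daily_revenue (day, total_amount)
            SELECT date(created_at), SUM(amount)
            FROM orders
            GROUP BY date(created_at)
            """
        )
```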
⚡ Summary
| Mechanism | Strength | Weakness |
| --- | --- | --- |
| Event Bus | Scalable, decoupled | Needs strong idempotency, ordering care |
| CDC | Easy to attach, no domain events needed | Syncs low-level state, not meaning |
| Dual Writes | Looks simple | No atomicity across systems, couples logic |
| ETL / Materializers | Easy for reports | Stale data, no live sync |
Eventual Consistency in Practice
Every CQRS diagram with an event bus or sync layer comes with a quiet disclaimer:
“The read model will eventually reflect the write model.”
But what does eventual consistency actually mean in production?
Let’s break it down — beyond the theory.
What Eventual Consistency Actually Looks Like
When you split your models:
The write model applies changes immediately.
The read model catches up — after the event is processed, the projection is updated, and any lag is absorbed.
That “eventual” window might be:
A few milliseconds (ideal case, fast consumers)
A few seconds (common under load)
Minutes (if consumers lag or fail)
📌 It’s not a bug — it’s baked into the design.
Where You Feel It in Production
A user places an order → Dashboard still shows 0 orders for that user (until sync catches up).
A profile is updated → Search filter shows the old city for a few seconds.
A payment is received → Account balance in the UI shows stale data briefly.
These are normal, expected behaviors in CQRS — unless your design or your users can't tolerate them.
The Risk: Hidden Lag
Because everything still “works,” lag in your sync layer can go unnoticed:
The app keeps running.
The read API keeps responding.
But the data it returns isn’t what’s true right now.
If you don’t monitor this, you won’t know you’re drifting until users complain — or worse, business decisions get made on stale data.
Designing for Eventual Consistency
Good CQRS systems don’t try to eliminate eventual consistency — they design around it:
UI hints (e.g. “Updating…” banners, optimistic UI)
Clear documentation on what’s real-time and what’s not
Lag monitoring: metrics on consumer lag, oldest unprocessed event
Backpressure handling: if lag crosses thresholds, alert, scale consumers, or pause non-critical projections
📌 Your users will tolerate eventual consistency — if you’re honest about it and handle it gracefully.
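One concrete way to make lag visible: track the event time of the last applied event per projection and expose "read model age" as a metric. A minimal sketch follows; the names are illustrative, and in practice this value would be exported to your metrics system and alerted on.

```python
# A minimal lag-tracking sketch: expose how far behind a projection is, based on
# the event time of the last applied event. Names are illustrative.
import time

class ProjectionLagTracker:
    def __init__(self, name: str) -> None:
        self.name = name
        self.last_event_time: float | None = None

    def record_applied(self, event_time: float) -> None:
        # Call this from the consumer loop after each event is applied.
        self.last_event_time = event_time

    def lag_seconds(self) -> float:
        if self.last_event_time is None:
            return 0.0  # nothing applied yet
        return max(0.0, time.time() - self.last_event_time)

# Usage: alert when lag_seconds() crosses the SLO threshold for that projection.
```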
Failure Modes and Recovery
In CQRS, your sync architecture is where failures get creative.
You’re not just worried about a DB query failing — you’re managing moving parts:
Event publishing
Transport reliability
Consumer logic
Read model updates
Here’s what can (and does) go wrong — and how resilient CQRS systems handle it.
1️⃣ Consumers Fall Behind
What happens:
Your event consumers can’t keep up with event volume. Maybe load spikes, maybe one consumer slows down.
The lag grows silently.
📌 Symptoms:
Read models are minutes or hours out of date
Dashboards show stale data
“Edge case” bugs suddenly show up because data is inconsistent
Recovery strategies:
Monitor consumer lag — always
Scale consumers horizontally or partition more granularly
Support event replay to catch up cleanly
Have SLOs on lag so teams can react before users notice
2️⃣ Out-of-Order or Duplicate Events
What happens:
Your event bus doesn’t guarantee strict ordering (e.g., Kafka without careful partitioning).
Or retries cause duplicates to hit consumers.
📌 Symptoms:
Aggregates computed incorrectly (e.g., double-counted revenue)
Read model shows invalid states
Recovery strategies:
All projection logic must be idempotent (see the sketch after this list)
Use event versioning or sequence numbers where possible
Design aggregates to tolerate replays without double-counting
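Here's a minimal sketch of those points working together, using per-aggregate sequence numbers. The event shape and the in-memory stores are assumptions; the point is that duplicates and replays can never double-count.

```python
# A minimal idempotent projection sketch using per-aggregate sequence numbers.
# Event shape and the in-memory stores are assumptions.
last_applied_seq: dict[str, int] = {}   # order id -> last applied sequence number
order_totals: dict[str, float] = {}     # the projection itself

def apply_payment_received(event: dict) -> None:
    order_id = event["orderId"]
    seq = event["sequence"]
    if seq <= last_applied_seq.get(order_id, 0):
        return  # duplicate or replayed event: applying again would double-count
    order_totals[order_id] = order_totals.get(order_id, 0.0) + event["amount"]
    last_applied_seq[order_id] = seq
```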
3️⃣ Events Get Dropped
What happens:
A bug, infra outage, or misconfig causes an event to never reach its consumer.
📌 Symptoms:
Read model drifts permanently unless manually repaired
Hard-to-debug gaps (e.g., missing transactions, partial dashboards)
Recovery strategies:
Build replay tools — consumers should be able to reprocess from a point in history
Ensure your bus (or CDC) is durable — don’t rely on in-memory only
Validate completeness periodically (e.g., read model counts vs. write model counts), as sketched below
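A rough sketch of that periodic completeness check; the two count functions are placeholders for real queries against the write DB and the read store. It won't tell you which event was dropped, but it tells you that something was.

```python
# A rough completeness-check sketch. The two count functions are placeholders;
# the comparison is the point, not the storage details.
import logging

logger = logging.getLogger("sync.completeness")

def check_order_counts(count_write_side, count_read_side, tolerance: int = 0) -> bool:
    write_count = count_write_side()   # e.g. SELECT COUNT(*) FROM orders
    read_count = count_read_side()     # e.g. document count in the projection
    drift = write_count - read_count
    if abs(drift) > tolerance:
        logger.error("read model drift: write=%s read=%s", write_count, read_count)
        return False
    return True
```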
4️⃣ Projection Corruption
What happens:
A consumer bug or invalid event payload writes bad data to the read model.
📌 Symptoms:
Dashboards with wrong totals
Search returning invalid results
Stuck or broken UIs
Recovery strategies:
Support full rebuilds of projections (replay from scratch)
Snapshot known-good states periodically (for faster recovery)
Alert on anomalies (e.g., negative balances, impossible aggregates)
5️⃣ Catch-up Pressure Causes New Failures
What happens:
Your consumer falls behind, then floods the read DB while trying to catch up — causing cascading failures.
📌 Symptoms:
Read DB chokes under replay load
Fresh events get delayed further
Recovery strategies:
Throttle replays to protect infra (see the sketch after this list)
Prioritize fresh events over old replays
Consider staging rebuilds separately from live consumers
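A rough sketch of a throttled replay loop; the event source and projection handler are placeholders. The rate cap is deliberately crude, but it keeps a rebuild from flooding the read DB and starving live consumers.

```python
# A rough throttled-replay sketch. read_events_from() and apply() are
# placeholders; the fixed rate cap keeps a rebuild from overwhelming the read DB.
import time

def replay(read_events_from, apply, start_position: int,
           max_events_per_second: int = 500) -> None:
    interval = 1.0 / max_events_per_second
    for event in read_events_from(start_position):
        started = time.monotonic()
        apply(event)
        elapsed = time.monotonic() - started
        if elapsed < interval:
            time.sleep(interval - elapsed)  # pace the replay instead of blasting it
```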
The point is:
Failure is normal in the sync layer. What matters is how predictable, observable, and recoverable it is.
Designing the Sync Layer Well
A good CQRS sync layer isn’t about making failures impossible — it’s about making them manageable.
Here’s what resilient, production-ready sync architectures have in common:
✅ Align Events to Domain Intent
Don’t sync raw DB state.
Emit domain-level events that express what happened in business terms:
✔️ OrderPlaced(orderId, userId, totalAmount)
✔️ ProfileUpdated(userId, newCity)
❌ RowChanged(table=orders, id=123, column=amount)
📌 This gives you clean, meaningful replays, reduces coupling to DB schema, and makes projections easier to reason about.
✅ Design for Failure from Day One
Assume:
Events will be duplicated
Events will arrive out of order
Consumers will crash
Lag will build up
📌 Build idempotency into your projection logic.
📌 Plan replay and recovery tooling early — not after the first failure.
📌 Alert on lag and drift — don’t wait for users to tell you.
✅ Make Rebuilds a First-Class Operation
Your projections will need rebuilding:
When schema evolves
When a bug corrupts data
When a new read model is added
📌 Make replays predictable, observable, and resource-managed (no infra blowups during rebuilds).
📌 Consider periodic snapshotting to speed up full replays.
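One way to make rebuilds first-class, sketched below: replay into a fresh projection and swap it in only once it has caught up, instead of mutating the live one in place. All names here (event source, store, swap mechanism) are assumptions.

```python
# A rough rebuild sketch: project events into a *new* target, then swap it in
# once it's caught up, so the live projection keeps serving reads meanwhile.
def rebuild_projection(read_all_events, build_empty_store, apply, swap_live_pointer) -> None:
    staging = build_empty_store()          # e.g. a new table, index, or keyspace
    for event in read_all_events():        # full replay from the start of history
        apply(staging, event)
    swap_live_pointer(staging)             # atomic cutover: readers move to the new store
```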
✅ Keep Business Logic Out of the Read Model
Never put critical decisions (e.g. fraud checks, quota validation) on the read model.
It’s stale by design.
📌 The write model owns business truth — the read model serves queries.
✅ Monitor, Monitor, Monitor
Lag, replay progress, consumer health, event backlog depth — these aren’t nice-to-haves.
📌 Without visibility, you’re blind to the drift that CQRS always brings.
⚡ The principle that keeps sync layers sane:
You’re not designing for happy paths. You’re designing for drift, replay, lag, and failure — because they’re inevitable.
Closing Thought: The Sync Layer Is the System
CQRS doesn’t end at splitting reads and writes.
That’s just the start.
The sync architecture — the part most diagrams hide behind a neat arrow — is the system.
It’s where:
Failures quietly build up
Data drift sneaks in
Operational debt piles up if you’re not ready
The sync layer is the bridge that keeps your two worlds connected.
Get it right, and CQRS gives you clean separation, scale, and clarity.
Get it wrong, and all you’ve done is create two systems that can’t trust each other.
The split gave your system space to breathe.
The sync layer keeps it alive.
Next up: we’ll dive deeper into how to choose and tune write path databases — the side that starts it all.