The CQRS Sync Architecture: The Bridge Between Two Worlds


By now, we’ve covered why CQRS exists.
We split the system because one DB couldn’t serve two masters — and that split gave reads and writes the space to do what they’re good at.

But that split came with a new responsibility:

👉 How do you keep those two worlds connected?

👉 How do you make sure your read model reflects what actually happened on the write side — without falling apart under lag, replays, or failures?

That’s where the CQRS sync architecture lives.
It’s not the glamorous part of CQRS. You won’t see it on pretty diagrams.
But in production?

It’s the part you’ll fight with the most.

This post is about that bridge:

  • How sync actually works

  • The techniques teams use

  • The failure modes that sneak in

  • And the principles that keep it sane at scale

Let’s break it down.


Why Sync Architecture Matters

When you decide to separate your reads and writes, you’re not just creating two models — you’re creating a contract between them.
That contract says:

The read model will always reflect the reality of the write model — eventually.

The problem is: this doesn’t just happen.
You need architecture that ensures:

  • Every meaningful change in the write model is communicated clearly

  • The read model updates in a way that’s reliable, idempotent, and correct

  • Failures, lag, and out-of-order delivery don’t silently corrupt your system


📌 Why sync isn’t “just an event bus”

In theory, CQRS diagrams look simple:

[Write Model] → [Event] → [Read Model]

In production, that arrow hides a lot:

  • What format are those events in?

  • How do you guarantee delivery?

  • What happens if the read model misses an event?

  • How do you handle duplicate or out-of-order events?

  • How much lag is acceptable before the system becomes unusable?

The sync layer isn’t just an arrow. It’s:

  • A transport mechanism (event bus, CDC, queue)

  • A processing system (consumer logic, idempotency checks, replay handlers)

  • An operational contract (monitoring, lag tracking, recovery)


Without robust sync architecture, you end up with:

  • Stale or incorrect reads: the read model no longer reflects business truth

  • Data drift: no one notices until customers or auditors do

  • Invisible lag: no alert fires, but your read model is minutes behind

  • Painful debugging: tracing the lifecycle of a fact across systems becomes slow and error-prone


The point is simple:

CQRS doesn’t end at the split. The system only works if the bridge between write and read is solid.

That’s why the sync architecture is the real heart of CQRS. It’s what stops your read model from becoming an unreliable cache pretending to be a source of truth.


What Needs to Be Synced

It sounds obvious:

“The read model just needs to know what happened.”

But in practice, what needs to be synced is more than just facts. It’s meaningful changes in system state, captured in a way that the read model can use safely, even under failure, lag, or replay conditions.

Let’s break it down.


1️⃣ Domain Events — Not Just Database State

The write model doesn’t sync raw table diffs or row updates.
It syncs events that represent intent:

OrderPlaced(orderId, userId, amount, timestamp)
UserProfileUpdated(userId, newCity, timestamp)
PaymentReceived(paymentId, orderId, amount, timestamp)

These are atomic, meaningful facts — not just DB deltas.

📌 Why? Because the read model is supposed to build projections based on what happened, not how your write DB happens to store it.
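
As a rough illustration (field names and structure are assumptions, not a prescribed schema), a domain event can be modeled as a small immutable record carrying the business fact plus the metadata the sync layer later needs for deduplication, ordering, and replay:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json
import uuid

@dataclass(frozen=True)
class OrderPlaced:
    # Business fact: what happened, in domain terms
    order_id: str
    user_id: str
    amount: float
    # Sync metadata: lets consumers dedupe, order, and replay safely
    event_id: str = ""
    occurred_at: str = ""
    version: int = 1

    @staticmethod
    def create(order_id: str, user_id: str, amount: float) -> "OrderPlaced":
        return OrderPlaced(
            order_id=order_id,
            user_id=user_id,
            amount=amount,
            event_id=str(uuid.uuid4()),
            occurred_at=datetime.now(timezone.utc).isoformat(),
        )

# Serialized for the transport (bus, CDC pipeline, queue)
event = OrderPlaced.create("ord-123", "user-42", 499.0)
print(json.dumps(asdict(event)))
```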


2️⃣ All Projections and Views That Serve Queries

Every projection your system depends on needs to be fed by the sync layer:

  • Denormalized document views (e.g. Mongo, Redis, Elasticsearch)

  • Aggregates (e.g. daily revenue summaries, leaderboard scores)

  • Precomputed filters and indexes for UI

If that projection answers queries, it relies on the sync layer.
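
To make "fed by the sync layer" concrete, here's a minimal projection sketch. An in-memory dict stands in for the real read store (Mongo, Redis, Elasticsearch), and the event shape is assumed:

```python
# Denormalized view: per-user order summary, maintained purely from events.
order_summary_by_user = {}   # user_id -> {"order_count": int, "total_spent": float}

def project(event: dict) -> None:
    if event["type"] == "OrderPlaced":
        summary = order_summary_by_user.setdefault(
            event["user_id"], {"order_count": 0, "total_spent": 0.0}
        )
        summary["order_count"] += 1
        summary["total_spent"] += event["amount"]

# The sync layer must deliver every event this view depends on
for e in [
    {"type": "OrderPlaced", "user_id": "user-42", "amount": 120.0},
    {"type": "OrderPlaced", "user_id": "user-42", "amount": 80.0},
]:
    project(e)

print(order_summary_by_user)  # {'user-42': {'order_count': 2, 'total_spent': 200.0}}
```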


3️⃣ Multiple Read Models (If You Have Them)

In a mature CQRS system, you rarely have one read model:

  • The search system might be in Elasticsearch

  • The dashboard aggregates in ClickHouse

  • The user-facing app in Redis or a custom API cache

Each of these needs to be kept in sync, often from the same event stream — but with different projection logic, performance requirements, and tolerance for lag.


4️⃣ Replay and Recovery State

Your sync layer doesn’t just feed live projections.
It must support:

  • Event replays to rebuild projections after failure

  • Backfills when a new read model or view is added

  • Versioning of events if your domain model evolves

If you don’t design for this up front, adding or recovering a read model later becomes a nightmare.
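
A minimal sketch of what "replay to rebuild" means in practice, assuming a durable event log you can read from the beginning (the store and function names here are hypothetical):

```python
def rebuild_projection(event_log, projection_store, apply_event):
    """Rebuild a read model from scratch by replaying the full history.

    event_log        -- iterable of events in append order (hypothetical)
    projection_store -- the read-side store to repopulate (hypothetical)
    apply_event      -- the same idempotent handler used for live sync
    """
    projection_store.clear()                  # start from a known-empty state
    for event in event_log:                   # replay from position 0
        apply_event(projection_store, event)  # identical logic to the live path

# The same path also serves backfills: point a brand-new read model at the
# log, replay history, then switch it over to live events.
```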


The trap:

“We’ll just sync what we need right now.”

That’s how you end up bolting on workarounds later — ETL jobs, one-off scripts, manual fixes — because the sync layer wasn’t built to scale with the system.


Common Sync Mechanisms

There’s no single “right” way to keep your CQRS models in sync.
There are patterns — and each comes with its own trade-offs, failure modes, and operational realities.

Let’s break down the most common ones you’ll see in production.


1️⃣ Event Bus (Kafka, NATS, RabbitMQ, Pulsar)

👉 How it works:
Your write model emits domain events into an event bus.
One or more consumers subscribe, process these events, and update the read models.

👉 Why teams choose it:

  • Highly decoupled — write model doesn’t care how many read models there are

  • Durable and scalable — can handle high throughput

  • Natural support for multiple consumers (different projections, audit log, downstream systems)

👉 What can go wrong:

  • Ordering issues: events may arrive out of order unless you partition carefully

  • Duplication: consumers need idempotency — they will see retries and duplicates

  • Lag risk: if consumers fall behind, your read model drifts silently

  • Replay complexity: reprocessing old events can be tricky if schema evolved

📌 This is the most common approach in modern CQRS systems — but it demands solid consumer design.
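
A minimal consumer-side sketch of what "solid consumer design" means here: dedupe on event ID before applying, and only record the event as processed once the projection update has succeeded. The bus client is abstracted away and the names are assumptions:

```python
processed_event_ids = set()   # in production: a table or keyset in the read store

def handle(event: dict, read_store: dict) -> None:
    # Idempotency guard: the bus will redeliver, so duplicates are normal
    if event["event_id"] in processed_event_ids:
        return
    if event["type"] == "OrderPlaced":
        read_store.setdefault(event["user_id"], []).append(event["order_id"])
    # Mark as processed only after the projection write succeeds; ideally both
    # writes happen in one transaction against the read store.
    processed_event_ids.add(event["event_id"])

# Simulated redelivery: the second delivery is a no-op
store = {}
evt = {"event_id": "e-1", "type": "OrderPlaced", "user_id": "u-1", "order_id": "o-1"}
handle(evt, store)
handle(evt, store)
print(store)  # {'u-1': ['o-1']}
```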


2️⃣ Change Data Capture (CDC)

👉 How it works:
Instead of emitting domain events, you capture changes at the DB level — usually via the database’s write-ahead log or binlog.
These changes get published to a bus or applied directly to the read model.

👉 Why teams choose it:

  • No need for your app code to emit events separately — fewer moving parts

  • Easier to bolt onto existing systems (no need for domain event plumbing)

👉 What can go wrong:

  • You’re syncing DB state, not domain intent — harder to reason about projections

  • Schema drift: changing write-side tables breaks your read model sync

  • No business-level semantics: CDC knows a row changed, but not why

📌 CDC works well for systems where business meaning maps cleanly to row changes. It’s fragile when domain logic is complex.
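
To see the "state vs. intent" gap, compare a CDC-style change record with the domain event it loosely corresponds to. The payload below is only Debezium-flavoured pseudocode (field names assumed), but the shape of the problem is real: a row change tells you what bytes moved, not why:

```python
# A CDC-style change record (structure assumed, loosely Debezium-flavoured):
cdc_record = {
    "op": "u",                             # 'u' = update, but update of what, and why?
    "source": {"table": "orders"},
    "before": {"id": 123, "status": "PENDING", "amount": 499.0},
    "after":  {"id": 123, "status": "PAID",    "amount": 499.0},
}

# To project meaningfully, you end up re-deriving intent from the diff:
def to_domain_event(record: dict):
    before, after = record["before"], record["after"]
    if before["status"] != "PAID" and after["status"] == "PAID":
        return {"type": "PaymentReceived", "order_id": after["id"], "amount": after["amount"]}
    return None   # many row changes have no clean business meaning

print(to_domain_event(cdc_record))
```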


3️⃣ Dual Writes (anti-pattern warning)

👉 How it works:
Your app tries to write to the write model and the read model at the same time, typically in the same transaction or handler.

👉 Why teams try it:

  • Looks simple: no event bus, no consumer logic

  • Immediate sync between models (in theory)

👉 What can go wrong:

  • No atomicity across systems: one write may succeed, the other fail — now you’re out of sync

  • Harder to retry safely: no clear source of truth for what should exist

  • Tight coupling: every write now cares about both models’ storage shape

📌 Teams try this for “quick wins” — but it’s a footgun at scale.
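
The failure mode is easy to see in a sketch (both stores are stand-ins; the point is that no transaction spans the two writes):

```python
class FlakyReadStore:
    """Stand-in for the read-side store; fails on the second write."""
    def __init__(self):
        self.data, self.calls = {}, 0
    def put(self, key, value):
        self.calls += 1
        if self.calls == 2:
            raise ConnectionError("read store unavailable")
        self.data[key] = value

write_db, read_store = {}, FlakyReadStore()

def place_order(order_id, amount):
    write_db[order_id] = amount          # write model: succeeds
    read_store.put(order_id, amount)     # read model: may fail independently

place_order("o-1", 100.0)
try:
    place_order("o-2", 250.0)            # second write fails -> models diverge
except ConnectionError:
    pass

print(write_db)          # {'o-1': 100.0, 'o-2': 250.0}
print(read_store.data)   # {'o-1': 100.0}  -- o-2 is missing, and nothing will repair it
```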


4️⃣ Materializer Jobs / ETL Pipelines

👉 How it works:
Batch jobs or stream processors scan the write DB and build projections offline — e.g. nightly jobs that recompute reports or pre-join tables.

👉 Why teams choose it:

  • Simple to build initially

  • Works when lag is acceptable (e.g. reports, exports)

👉 What can go wrong:

  • Stale data: read models are only as fresh as the last job run

  • Difficult to incrementally update: expensive to recompute full views repeatedly

  • No real-time guarantees

📌 Useful for batch reporting, but doesn’t solve live sync needs.
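
A minimal materializer sketch (table and column names are assumptions): it recomputes a report view in one full pass, which explains both why it's simple, why the data is only as fresh as the last run, and why running it frequently gets expensive:

```python
import sqlite3

# A real job would read a replica of the write DB and write to a separate
# read store; one in-memory SQLite file keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT, order_date TEXT, amount REAL);
    INSERT INTO orders VALUES ('o-1', '2024-05-01', 100.0),
                              ('o-2', '2024-05-01', 250.0),
                              ('o-3', '2024-05-02', 80.0);
    CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, total REAL);
""")

def run_materializer(conn):
    conn.execute("DELETE FROM daily_revenue")          # full recompute, not incremental
    conn.execute("""
        INSERT INTO daily_revenue (day, total)
        SELECT order_date, SUM(amount) FROM orders GROUP BY order_date
    """)
    conn.commit()

run_materializer(conn)
print(conn.execute("SELECT * FROM daily_revenue ORDER BY day").fetchall())
```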


⚡ Summary

| Mechanism | Strength | Weakness |
| --- | --- | --- |
| Event Bus | Scalable, decoupled | Needs strong idempotency, ordering care |
| CDC | Easy to attach, no domain events needed | Syncs low-level state, not meaning |
| Dual Writes | Looks simple | No atomicity across writes, tight coupling |
| ETL / Materializers | Easy for reports | Stale data, no live sync |

Eventual Consistency in Practice

Every CQRS diagram with an event bus or sync layer comes with a quiet disclaimer:

“The read model will eventually reflect the write model.”

But what does eventual consistency actually mean in production?
Let’s break it down — beyond the theory.


What Eventual Consistency Actually Looks Like

When you split your models:

  • The write model applies changes immediately.

  • The read model catches up — after the event is processed, the projection is updated, and any lag is absorbed.

That “eventual” window might be:

  • A few milliseconds (ideal case, fast consumers)

  • A few seconds (common under load)

  • Minutes (if consumers lag or fail)

📌 It’s not a bug — it’s baked into the design.


Where You Feel It in Production

  • A user places an order → Dashboard still shows 0 orders for that user (until sync catches up).

  • A profile is updated → Search filter shows the old city for a few seconds.

  • A payment is received → Account balance in the UI shows stale data briefly.

These are normal, expected behaviors in CQRS — unless your design or users can’t tolerate it.


The Risk: Hidden Lag

Because everything still “works,” lag in your sync layer can go unnoticed:

  • The app keeps running.

  • The read API keeps responding.

  • But the data it returns isn’t what’s true right now.

If you don’t monitor this, you won’t know you’re drifting until users complain — or worse, business decisions get made on stale data.


Designing for Eventual Consistency

Good CQRS systems don’t try to eliminate eventual consistency — they design around it:

  • UI hints (e.g. “Updating…” banners, optimistic UI)

  • Clear documentation on what’s real-time and what’s not

  • Lag monitoring: metrics on consumer lag, oldest unprocessed event

  • Backpressure handling: if lag crosses thresholds, alert, scale consumers, or pause non-critical projections

📌 Your users will tolerate eventual consistency — if you’re honest about it and handle it gracefully.
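
One simple, broadly applicable lag signal is "how old is the newest event this projection has applied". A sketch, with the threshold and names as assumptions to be tuned per projection:

```python
from datetime import datetime, timezone

LAG_ALERT_THRESHOLD_SECONDS = 30   # assumption: tune per projection and audience

def projection_lag_seconds(last_applied_event_time: datetime) -> float:
    """Lag = now minus the timestamp of the newest event the projection applied."""
    return (datetime.now(timezone.utc) - last_applied_event_time).total_seconds()

def check_lag(last_applied_event_time: datetime) -> None:
    lag = projection_lag_seconds(last_applied_event_time)
    if lag > LAG_ALERT_THRESHOLD_SECONDS:
        # In production: emit a metric and alert, don't just print
        print(f"ALERT: projection lag {lag:.0f}s exceeds {LAG_ALERT_THRESHOLD_SECONDS}s")
```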


Failure Modes and Recovery

In CQRS, your sync architecture is where failures get creative.
You’re not just worried about a DB query failing — you’re managing moving parts:

  • Event publishing

  • Transport reliability

  • Consumer logic

  • Read model updates

Here’s what can (and does) go wrong — and how resilient CQRS systems handle it.


1️⃣ Consumers Fall Behind

What happens:
Your event consumers can’t keep up with event volume. Maybe load spikes, maybe one consumer slows down.
The lag grows silently.

📌 Symptoms:

  • Read models are minutes or hours out of date

  • Dashboards show stale data

  • “Edge case” bugs suddenly show up because data is inconsistent

Recovery strategies:

  • Monitor consumer lag — always

  • Scale consumers horizontally or partition more granularly

  • Support event replay to catch up cleanly

  • Have SLOs on lag so teams can react before users notice


2️⃣ Out-of-Order or Duplicate Events

What happens:
Your event bus doesn’t guarantee strict ordering (e.g., Kafka without careful partitioning).
Or retries cause duplicates to hit consumers.

📌 Symptoms:

  • Aggregates computed incorrectly (e.g., double-counted revenue)

  • Read model shows invalid states

Recovery strategies:

  • All projection logic must be idempotent

  • Use event versioning or sequence numbers where possible

  • Design aggregates to tolerate replays without double-counting
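
One common guard that handles both duplicates and late arrivals is a per-aggregate sequence check: the projection stores the last sequence number it applied and ignores anything older or already seen. A sketch with assumed field names:

```python
# Per-aggregate sequence guard for a "current status" projection.
projection = {}   # order_id -> {"status": str, "last_seq": int}

def apply(event: dict) -> None:
    row = projection.setdefault(event["order_id"], {"status": None, "last_seq": 0})
    if event["seq"] <= row["last_seq"]:
        return                      # duplicate or older event: ignore
    row["status"] = event["status"]
    row["last_seq"] = event["seq"]

# Out-of-order plus duplicate delivery, same end state either way
for e in [
    {"order_id": "o-1", "seq": 2, "status": "SHIPPED"},
    {"order_id": "o-1", "seq": 1, "status": "PLACED"},    # arrives late: ignored
    {"order_id": "o-1", "seq": 2, "status": "SHIPPED"},   # duplicate: ignored
]:
    apply(e)

print(projection)  # {'o-1': {'status': 'SHIPPED', 'last_seq': 2}}
```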


3️⃣ Events Get Dropped

What happens:
A bug, infra outage, or misconfig causes an event to never reach its consumer.

📌 Symptoms:

  • Read model drifts permanently unless manually repaired

  • Hard-to-debug gaps (e.g., missing transactions, partial dashboards)

Recovery strategies:

  • Build replay tools — consumers should be able to reprocess from a point in history

  • Ensure your bus (or CDC) is durable — don’t rely on in-memory only

  • Validate completeness periodically (e.g., read model counts vs. write model counts)
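
Periodic completeness checks can be as simple as comparing counts (or checksums) between the two sides. A sketch, with the counts assumed to come from each store's own query path:

```python
def completeness_check(write_count: int, read_count: int, tolerance: int = 0) -> bool:
    """Compare write-side vs. read-side record counts for one projection.

    In practice each count comes from its own store (e.g. COUNT(*) on the
    write DB vs. a document count in the read store); a small tolerance
    absorbs normal replication lag.
    """
    drift = write_count - read_count
    if abs(drift) > tolerance:
        print(f"ALERT: read model drifted by {drift} records")
        return False
    return True

completeness_check(write_count=10_000, read_count=9_997, tolerance=5)   # OK (lag)
completeness_check(write_count=10_000, read_count=9_200)                # drift alert
```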


4️⃣ Projection Corruption

What happens:
A consumer bug or invalid event payload writes bad data to the read model.

📌 Symptoms:

  • Dashboards with wrong totals

  • Search returning invalid results

  • Stuck or broken UIs

Recovery strategies:

  • Support full rebuilds of projections (replay from scratch)

  • Snapshot known-good states periodically (for faster recovery)

  • Alert on anomalies (e.g., negative balances, impossible aggregates)


5️⃣ Catch-up Pressure Causes New Failures

What happens:
Your consumer falls behind, then floods the read DB while trying to catch up — causing cascading failures.

📌 Symptoms:

  • Read DB chokes under replay load

  • Fresh events get delayed further

Recovery strategies:

  • Throttle replays to protect infra

  • Prioritize fresh events over old replays

  • Consider staging rebuilds separately from live consumers
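
A minimal throttling sketch: cap how many replayed events hit the read store per second so the rebuild can't starve live traffic. The rate and names are assumptions:

```python
import time

REPLAY_EVENTS_PER_SECOND = 200   # assumption: size to what the read DB can absorb

def throttled_replay(events, apply_event):
    """Replay historical events at a bounded rate to protect the read store."""
    interval = 1.0 / REPLAY_EVENTS_PER_SECOND
    for event in events:
        apply_event(event)
        time.sleep(interval)      # crude backpressure; real systems use token buckets

# Live events keep flowing through the normal consumer; only the replay path
# is throttled, so catch-up never competes head-on with fresh traffic.
```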


The point is:

Failure is normal in the sync layer. What matters is how predictable, observable, and recoverable it is.


Designing the Sync Layer Well

A good CQRS sync layer isn’t about making failures impossible — it’s about making them manageable.
Here’s what resilient, production-ready sync architectures have in common:


✅ Align Events to Domain Intent

Don’t sync raw DB state.
Emit domain-level events that express what happened in business terms:

✔️ OrderPlaced(orderId, userId, totalAmount)  
✔️ ProfileUpdated(userId, newCity)  
❌ RowChanged(table=orders, id=123, column=amount)

📌 This gives you clean, meaningful replays, reduces coupling to DB schema, and makes projections easier to reason about.


✅ Design for Failure from Day One

Assume:

  • Events will be duplicated

  • Events will arrive out of order

  • Consumers will crash

  • Lag will build up

📌 Build idempotency into your projection logic.
📌 Plan replay and recovery tooling early — not after the first failure.
📌 Alert on lag and drift — don’t wait for users to tell you.


✅ Make Rebuilds a First-Class Operation

Your projections will need rebuilding:

  • When schema evolves

  • When a bug corrupts data

  • When a new read model is added

📌 Make replays predictable, observable, and resource-managed (no infra blowups during rebuilds).
📌 Consider periodic snapshotting to speed up full replays.
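
A sketch of how snapshotting shortens rebuilds (the snapshot format and names are assumptions): restore the last known-good snapshot, then replay only the events that came after it:

```python
def rebuild_from_snapshot(snapshot, events_after_snapshot, apply_event):
    """Restore projection state from a snapshot, then replay only the tail.

    snapshot              -- {"state": dict, "last_seq": int}, taken periodically
    events_after_snapshot -- events with seq > snapshot["last_seq"]
    apply_event           -- the same idempotent handler used for live sync
    """
    state = dict(snapshot["state"])              # start from known-good state
    for event in events_after_snapshot:          # replay only the delta
        apply_event(state, event)
    return state
```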


✅ Keep Business Logic Out of the Read Model

Never put critical decisions (e.g. fraud checks, quota validation) on the read model.
It’s stale by design.
📌 The write model owns business truth — the read model serves queries.


✅ Monitor, Monitor, Monitor

Lag, replay progress, consumer health, event backlog depth — these aren’t nice-to-haves.
📌 Without visibility, you’re blind to the drift that CQRS always brings.


⚡ The principle that keeps sync layers sane:

You’re not designing for happy paths. You’re designing for drift, replay, lag, and failure — because they’re inevitable.


Closing Thought: The Sync Layer Is the System

CQRS doesn’t end at splitting reads and writes.
That’s just the start.

The sync architecture — the part most diagrams hide behind a neat arrow — is the system.
It’s where:

  • Failures quietly build up

  • Data drift sneaks in

  • Operational debt piles up if you’re not ready

The sync layer is the bridge that keeps your two worlds connected.
Get it right, and CQRS gives you clean separation, scale, and clarity.
Get it wrong, and all you’ve done is create two systems that can’t trust each other.


The split gave your system space to breathe.
The sync layer keeps it alive.

Next up: we’ll dive deeper into how to choose and tune write path databases — the side that starts it all.
