Building a Resilient Site-Aware Synchronization Architecture


The Challenge of Multi-Site Data Autonomy
In distributed environments, especially in industrial or regulated contexts, applications are often deployed across multiple geographically separated sites. These sites must remain fully operational even in the event of a network outage lasting several days.
What the heck is the challenge?
Ensuring local application autonomy while maintaining data consistency and eventual synchronization across all sites.
A Hybrid, On-Prem, Multi-DB Ecosystem
In our case, we manage several on-premise sites, each with its own local database. Some use PostgreSQL, others Oracle. A cloud-native solution wasn't viable due to strict network constraints, and we needed every site to operate independently without relying on real-time connectivity.
Classic Replication vs Resilient Architecture
I explored traditional database replication methods, including cross-database replication tools like GoldenGate or SymmetricDS. These tools inevitably introduce operational complexity. Moreover, synchronization conflicts must be addressed regardless.
Instead, I chose to design a resilient, event-driven architecture based on asynchronous communication and eventual consistency, with near real-time propagation when neither conflicts nor network failures are present.
Event-Driven, Site-Aware Synchronization
I designed an architecture that combines the Outbox Pattern, a RabbitMQ cluster with 3 nodes, and an event fanout strategy to broadcast every event to all sites. Each site manages its own data and produces events locally.
Key Concepts:
Outbox Pattern: Events are stored in a dedicated outbox table in the local DB, within the same local transaction as the business write (a minimal sketch follows this list).
RabbitMQ Cluster: Three brokers deployed in a logical grouping with quorum queues to ensure reliability across nodes.
Fanout Mode: Events are broadcast to all consumers, including the originating site.
Deduplication: Each event has a globally unique identifier (id) to support idempotent processing.
Historization: A dedicated queue receives all events, which are stored in the reference site's database for auditing and replay purposes.
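To make the Outbox Pattern concrete, here is a minimal sketch of the local write step, assuming a PostgreSQL site accessed with psycopg2; the users and outbox tables and their columns are illustrative, not the actual schema.

```python
import json
import uuid
from datetime import datetime, timezone

import psycopg2

SITE_NAME = "London"  # the originating (propagating) site

def add_user_with_outbox_event(conn, username: str) -> None:
    """Write the business row and its outbox event in the same local transaction."""
    event = {
        "id": f"{SITE_NAME}_{uuid.uuid4()}",           # globally unique event ID
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "origin": SITE_NAME,
        "type": "User_Added",
        "data": {"username": username},
    }
    with conn:  # commits on success, rolls back on error: both writes or neither
        with conn.cursor() as cur:
            cur.execute("INSERT INTO users (username) VALUES (%s)", (username,))
            cur.execute(
                "INSERT INTO outbox (event_id, payload, created_at) VALUES (%s, %s, %s)",
                (event["id"], json.dumps(event), event["timestamp"]),
            )

# Usage (connection parameters are placeholders):
# conn = psycopg2.connect("dbname=london_app user=app password=secret host=localhost")
# add_user_with_outbox_event(conn, "alice")
```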
Architecture Overview
Event Propagation Flow – London as Propagator, Paris as Golden Source
Below is a simple Event Propagation Flow design involving three sites. In this setup, London acts as the Propagator, while Paris, which also processes events, serves as the Golden Source responsible for archiving all events. This architecture follows an eventual consistency model and does not include automatic conflict resolution.
In case of a conflict, events are routed to a Dead Letter Queue (DLQ) for manual inspection and processing.
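As a sketch of how this routing could be wired, the snippet below declares a site's quorum queue with a dead-letter exchange so that rejected events land in a DLQ; the exchange and queue names are assumptions, not the ones used in the actual setup.

```python
import pika

# Connection parameters are placeholders for a local broker.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Dead-letter exchange and queue for events a site rejects (e.g., on conflict).
channel.exchange_declare(exchange="events.dlx", exchange_type="fanout", durable=True)
channel.queue_declare(queue="events.dlq", durable=True)
channel.queue_bind(exchange="events.dlx", queue="events.dlq")

# Per-site quorum queue: rejected (nacked, non-requeued) messages go to the DLX.
channel.queue_declare(
    queue="SITE_LONDON_EVENTS",
    durable=True,
    arguments={
        "x-queue-type": "quorum",
        "x-dead-letter-exchange": "events.dlx",
    },
)
```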
Principal Actors
Local Write: The application (API) writes to the local database and inserts a new event with a unique ID into the outbox table within the same transaction.
Outbox Dispatcher: A background service reads pending events from the outbox table and publishes them to a RabbitMQ fanout exchange (see the sketch after this list).
Fanout Exchange: Broadcasts the event to all sites.
Event Handling Consumer: Each site checks whether it has already processed the event (using the unique ID). If not, it applies the corresponding change to its local database.
Self-Cleanup: The originating site deletes the event from its outbox table once it has been successfully consumed from its queue, confirming that the event was propagated correctly.
Historization: The event is also sent to the SITE_PARIS_HISTO queue and stored in the Paris database for auditing and replay.
Retention: A scheduled task deletes historical events older than 90 days (configurable via application settings).
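Below is a minimal sketch of how the Outbox Dispatcher step could look with pika and psycopg2; the exchange name and the outbox columns are assumptions for illustration.

```python
import pika
import psycopg2

FANOUT_EXCHANGE = "site.events"  # assumed exchange name

def dispatch_outbox_once(conn, channel) -> None:
    """Publish pending outbox rows (stored as JSON text) to the fanout exchange, oldest first."""
    with conn.cursor() as cur:
        cur.execute("SELECT event_id, payload FROM outbox ORDER BY created_at LIMIT 100")
        rows = cur.fetchall()
    for event_id, payload in rows:
        channel.basic_publish(
            exchange=FANOUT_EXCHANGE,
            routing_key="",  # ignored by a fanout exchange
            body=payload,
            properties=pika.BasicProperties(
                delivery_mode=2,       # persistent message
                message_id=event_id,   # carries the globally unique event ID
            ),
        )
    # Rows are NOT deleted here: the originating site removes an outbox row only
    # after consuming the event back from its own queue (self-cleanup).

# Wiring (placeholders):
# conn = psycopg2.connect("dbname=london_app user=app password=secret host=localhost")
# channel = pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()
# channel.exchange_declare(exchange=FANOUT_EXCHANGE, exchange_type="fanout", durable=True)
# dispatch_outbox_once(conn, channel)  # run periodically by a background scheduler
```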
Event Payload Structure
Each event includes a fixed first-level structure:
id: A unique identifier for the event, composed of the site name followed by a UUID (e.g., London_9f8a4c2e-3b1d-4a6b-91d9-df4b2e6e8a7f)
timestamp: The date and time when the event was created.
origin: The originating site name (i.e., the propagating site).
type: A unique event name, such as User_Added, Dinner_Reserved, etc.
data: The payload containing the event-specific information.
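For illustration only, a User_Added event following this structure might look like the sketch below (all values are made up):

```python
# Hypothetical User_Added event produced by the London site (illustrative values).
event = {
    "id": "London_9f8a4c2e-3b1d-4a6b-91d9-df4b2e6e8a7f",
    "timestamp": "2024-05-17T09:32:41Z",
    "origin": "London",
    "type": "User_Added",
    "data": {
        "username": "alice",
        "email": "alice@example.com",
    },
}
```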
Key Design Choices
Decentralized event production: Any site can produce events.
Resilience first: The system tolerates network outages thanks to the outbox pattern.
Replayability: Events stored in the Paris archive can be replayed to recover a site.
Scalability: New sites can be added with minimal configuration.
Idempotency: Guaranteed through the use of globally unique event IDs.
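To illustrate how idempotent consumption could work, here is a minimal sketch assuming a processed_events table keyed by the event ID; the table, the apply_change helper, and the queue name are hypothetical.

```python
import json

import pika
import psycopg2

def apply_change(cur, event: dict) -> None:
    """Site-specific business update derived from the event (placeholder logic)."""
    if event["type"] == "User_Added":
        cur.execute("INSERT INTO users (username) VALUES (%s)", (event["data"]["username"],))

def handle_event(conn, channel, method, properties, body) -> None:
    """Apply an event exactly once; duplicates are acknowledged and skipped."""
    event = json.loads(body)
    with conn:
        with conn.cursor() as cur:
            # Deduplication: has this event ID already been processed on this site?
            cur.execute("SELECT 1 FROM processed_events WHERE event_id = %s", (event["id"],))
            if cur.fetchone() is None:
                apply_change(cur, event)
                cur.execute("INSERT INTO processed_events (event_id) VALUES (%s)", (event["id"],))
    channel.basic_ack(delivery_tag=method.delivery_tag)
    # On a detected conflict, reject without requeueing so the message is
    # dead-lettered to the DLQ for manual inspection:
    # channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)

# Wiring (placeholders):
# channel.basic_consume(
#     queue="SITE_LONDON_EVENTS",
#     on_message_callback=lambda ch, m, p, b: handle_event(conn, ch, m, p, b),
# )
# channel.start_consuming()
```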
Advanced Option – Confirmed Event Processing
To further secure event handling, I introduced a confirmation event handler. Each site sends a confirmation after successfully processing its event. These confirmation events follow the same structure as the original business events:
id: A unique identifier for the confirmation event.
timestamp: The date and time when the event was created.
origin: The name of the consuming site (not the propagator).
type: A prefix like Confirmation_ followed by the original event type (e.g., Confirmation_User_Added).
data: Specific details of the confirmation event.
The confirmation event is propagated through a topic exchange to ensure it is delivered both to the original propagator and to the historization site (SITE_PARIS_HISTO).
This enables:
Reliable replay detection (without reprocessing).
Auditing of which site processed which event and when.
Easier debugging in multi-step processing chains.
This confirmation layer complements idempotency and ensures clarity in multi-site data flow.
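As a sketch of this confirmation flow, a consuming site (Paris in this example) could publish the confirmation through a topic exchange with a routing key derived from the original propagator; the exchange name and routing-key scheme below are assumptions, not the actual configuration.

```python
import json
import uuid
from datetime import datetime, timezone

import pika

CONFIRMATION_EXCHANGE = "site.confirmations"  # assumed topic exchange name
SITE_NAME = "Paris"  # the consuming site emitting the confirmation

def publish_confirmation(channel, original_event: dict) -> None:
    """Publish a Confirmation_* event routed to the propagator and to SITE_PARIS_HISTO."""
    confirmation = {
        "id": f"{SITE_NAME}_{uuid.uuid4()}",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "origin": SITE_NAME,  # the consuming site, not the propagator
        "type": f"Confirmation_{original_event['type']}",
        "data": {"confirmed_event_id": original_event["id"]},
    }
    # A key such as "confirmation.London" lets the propagator's queue bind to
    # "confirmation.<its-name>" while SITE_PARIS_HISTO binds to "confirmation.#".
    routing_key = f"confirmation.{original_event['origin']}"
    channel.basic_publish(
        exchange=CONFIRMATION_EXCHANGE,
        routing_key=routing_key,
        body=json.dumps(confirmation),
        properties=pika.BasicProperties(delivery_mode=2, message_id=confirmation["id"]),
    )
```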
Final Thoughts
In distributed multi-site environments, ensuring data autonomy and consistency — especially during network disruptions — is essential. A resilient, event-driven architecture based on the Outbox Pattern, RabbitMQ Cluster, and eventual consistency offers a robust foundation for synchronizing data without sacrificing local control.
With features like decentralized event production, idempotency, scalability, and replayability, this approach enables reliable and maintainable cross-site communication, even in the face of failure.
Try It Yourself
I have created a GitHub repository that includes:
A simple flow diagram that illustrates the architecture
A Docker Compose setup to launch a RabbitMQ cluster with 3 nodes behind an HAProxy
A README file with detailed instructions about the whole setup
Feel free to fork the repository, run the setup locally, and customize it to fit your needs.
Thanks for reading!
If you enjoyed this post, follow me on Twitter or LinkedIn for more. Got feedback or suggestions? Drop a comment below—I’d love to hear from you 👌