Event-Driven Architecture: Design Patterns

The scene is a familiar one. A late-night war room, lukewarm pizza, and the strained faces of a good engineering team staring at a monitoring dashboard. The OrderService, the glorious centerpiece of their e-commerce monolith, is buckling. Every deployment is a high-stakes gamble. Every new feature, from a simple promotional discount to a complex fraud check, requires a delicate, terrifying surgery on its core logic.
The team, sharp and motivated, lands on a modern solution: Event-Driven Architecture (EDA). The plan is simple, elegant even. "Let's just have the OrderService publish an OrderCreated event to a Kafka topic," the lead engineer suggests. "Then the Inventory, Shipping, and Notification services can just listen to that topic. We'll be decoupled! We can deploy services independently. No more deployment terror."
It sounds perfect. It’s the textbook answer you’d find in a thousand blog posts. They implement it in a sprint. The OrderService now emits a comprehensive OrderCreated event containing every conceivable piece of data: customer details, shipping address, line items, payment information, promotional codes. And for a while, it feels like magic. The monolith's tight grip has been loosened.
But a few months later, the late-night war rooms return. A minor change to add a new loyalty_points field to the event for the marketing team's new service caused the Shipping service's consumer to crash because its parser wasn't expecting it. The Fraud team complains that their service is wasting cycles deserializing product images it never uses. The OrderCreated event has become a monstrous, 200-field data blob. They haven't achieved decoupling; they've just shifted the coupling from a direct API call to an implicit, brittle event schema.
My thesis is this: Adopting event-driven architecture without a ruthless focus on the design of the events themselves is more dangerous than staying with a monolith. You are not building a loosely coupled system. You are building a distributed monolith, a system with all the complexity of a distributed architecture and all the tight coupling of a monolith. It is the worst of both worlds, and it's a trap many talented teams fall into.
Unpacking the Hidden Complexity: The Anatomy of a Distributed Monolith
The team's initial approach, the "fat event" pattern, is seductive because it feels like a one-to-one replacement for a function call. Instead of placeOrder(orderData), you just fire OrderCreated(orderData). The problem is that in a distributed system, this seemingly simple swap introduces insidious forms of coupling and operational drag.
When one service publishes a single, large event for many different consumers, you create a public contract that is impossibly rigid. Every consumer is now coupled to the entire schema, even if it only needs 5% of the data.
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#fff3e0", "primaryTextColor": "#424242", "primaryBorderColor": "#ef6c00", "lineColor": "#424242", "secondaryColor": "#fbe9e7", "tertiaryColor": "#fff3e0"}}}%%
flowchart TD
subgraph PD ["Producer Domain"]
OrderService[Order Service]
end
subgraph BC ["The Brittle Contract"]
CentralTopic[Kafka Topic OrderEvents]
end
subgraph CD ["Consumer Domains"]
Inventory[Inventory Service]
Shipping[Shipping Service]
Notifications[Notification Service]
Fraud[Fraud Service]
Analytics[Analytics Service]
end
OrderService -->|Publishes Fat OrderCreated Event| CentralTopic
CentralTopic --> Inventory
CentralTopic --> Shipping
CentralTopic --> Notifications
CentralTopic --> Fraud
CentralTopic --> Analytics
classDef brittleContract fill:#ffcdd2,stroke:#c62828,stroke-width:2px,stroke-dasharray:5 5
class CentralTopic brittleContract
This diagram illustrates the "fat event" anti-pattern. The central Kafka topic becomes a point of extreme coupling. A change required by the Analytics Service can inadvertently break the Shipping Service. The event schema becomes a battleground for competing interests, and the producing OrderService team becomes a bottleneck, hesitant to make any changes for fear of unknown downstream consequences. This is not decoupling; it's chaos masquerading as architecture.
Let's use an analogy. Imagine your system is a team of specialist chefs in a large kitchen. The naive event-driven approach is like having a single, massive chalkboard where every order detail is written for all to see. The fish chef, the pastry chef, and the saucier all have to scan the entire board, mentally filtering out the details they don't need, just to find the one or two lines relevant to them. If the head chef changes the format of the board to add a new section for allergies, it might confuse the pastry chef who was relying on the old format. It’s inefficient and error-prone.
A better kitchen runs on specific, targeted tickets. The head chef writes a small ticket for the fish station, a separate one for the pastry station. Each ticket contains only the information that specialist needs. This is the mental model we need for event design.
Here’s a breakdown of the second-order effects of the "fat event" approach versus a more disciplined, "lean event" approach.
Architectural Concern | Fat Event (Shared Generic Event) | Lean Event (Specific Domain Event)
--- | --- | ---
Coupling | High. All consumers are coupled to a single, large schema. Changes are high-risk. | Low. Consumers are coupled only to small, relevant event schemas. Producers and consumers can evolve independently.
Scalability | Poor. High network bandwidth and CPU for serialization/deserialization of unused data. | Excellent. Minimal data transfer and processing overhead. Services scale based on their actual needs.
Security | Weak. Services have access to sensitive data they don't need (e.g., Notifications seeing payment details). | Strong. Events contain minimal data, adhering to the principle of least privilege. Sensitive data is not broadcast.
Resilience | Brittle. A single malformed field can break multiple, unrelated consumer services. | Robust. The blast radius of a bad event is contained. A poison pill in one event stream doesn't affect others.
Cognitive Load | High. Developers must understand a massive, complex event structure to do simple tasks. | Low. Developers only need to understand the small, domain-specific events relevant to their service.
Evolution | Difficult. The schema becomes "write-only." Teams are afraid to change it. Leads to versioning nightmares. | Simple. New event types can be added without affecting existing consumers. Schemas can be versioned easily.
The core takeaway is that the goal is not to eliminate coupling, which is impossible, but to make it explicit, minimal, and purposeful.
The Pragmatic Solution: A Blueprint for Resilient Event-Driven Systems
Building a robust EDA isn't about picking the right message broker. It's about adhering to a set of principles that force discipline into your event and flow design. This is not a step-by-step tutorial but a blueprint for thinking.
Principle 1: Events Announce Facts, They Don't Issue Commands. An event is a record of something that has already happened. It should be named in the past tense: OrderPlaced, PaymentProcessed, ShipmentDispatched. It is an immutable fact. This is a crucial distinction from a command, which is a request to do something: PlaceOrder, ProcessPayment. By treating events as facts, you decouple the producer's intent from the consumer's reaction. The OrderService doesn't know or care that the Analytics service will later analyze the OrderPlaced event. It simply reports the fact.
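To make the fact/command distinction concrete, here is a minimal Python sketch. The class and field names are illustrative, not from any real codebase: a frozen dataclass enforces the immutability a fact demands, while a command remains a plain, mutable request that may still be rejected.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# A fact: named in the past tense, immutable once created.
@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    customer_id: str
    occurred_at: datetime

# A command: an imperative request; it can fail or be refused.
@dataclass
class PlaceOrder:
    customer_id: str
    line_items: list

event = OrderPlaced("ord-42", "cust-7", datetime.now(timezone.utc))
# Attempting to mutate a fact (event.order_id = "...") raises
# dataclasses.FrozenInstanceError, which is exactly the point.
```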
Principle 2: The Outbox Pattern for Unbreakable Reliability. How do you guarantee that you can both save a change to your database and publish an event atomically? What if the database commit succeeds but the call to Kafka fails? You end up with inconsistent state. The Outbox Pattern solves this elegantly.
You perform two actions within the same local database transaction:
1. Insert/update your business data (e.g., the orders table).
2. Insert a record representing the event into an outbox table.
Because this happens in a single transaction, it's atomic. It either all succeeds or all fails. A separate, asynchronous process then tails the outbox table, publishes the events to the message broker, and marks them as sent.
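Here is a minimal sketch of the two-writes-one-transaction idea, using an in-memory SQLite database as a stand-in for a real store. The table and column names are assumptions for illustration only.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, customer_id TEXT)")
conn.execute(
    "CREATE TABLE outbox (id TEXT PRIMARY KEY, event_type TEXT,"
    " payload TEXT, sent INTEGER DEFAULT 0)"
)

def place_order(order_id: str, customer_id: str) -> None:
    # Both writes share one local transaction: either the order row AND
    # its OrderPlaced event are persisted, or neither is.
    with conn:  # the connection context manager commits or rolls back
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, customer_id))
        conn.execute(
            "INSERT INTO outbox (id, event_type, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "OrderPlaced",
             json.dumps({"order_id": order_id, "customer_id": customer_id})),
        )

place_order("ord-42", "cust-7")
```

Note that no broker is involved here at all; publishing is deferred to the relay, which is what makes the business transaction immune to messaging-infrastructure failures.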
sequenceDiagram
actor Client
participant AppService as Application Service
participant DB as Database
participant Relay as Outbox Relay
participant Broker as Message Broker
Client->>AppService: POST /orders
activate AppService
AppService->>DB: BEGIN TRANSACTION
activate DB
AppService->>DB: INSERT INTO orders ...
AppService->>DB: INSERT INTO outbox (event_payload) ...
AppService->>DB: COMMIT
DB-->>AppService: OK
deactivate DB
AppService-->>Client: 202 Accepted
deactivate AppService
loop Poll for new events
Relay->>DB: SELECT * FROM outbox WHERE sent = false
activate Relay
DB-->>Relay: event_payload
Relay->>Broker: PUBLISH event_payload
activate Broker
Broker-->>Relay: ACK
deactivate Broker
Relay->>DB: UPDATE outbox SET sent = true WHERE ...
deactivate Relay
end
This sequence diagram shows the Outbox Pattern in action. The client's request returns quickly after the database transaction is committed, ensuring a responsive system. The critical step of publishing the event is handled by a separate, reliable relay process. This decouples the business transaction from the potential failures of the messaging infrastructure. Technologies like Debezium can act as a powerful, off-the-shelf relay by using Change Data Capture (CDC) on your outbox table.
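A toy version of the relay's polling loop might look like the following; a Python list stands in for the broker client, and the schema is illustrative. Marking a row as sent only after the broker acknowledges it gives you at-least-once delivery, which is why downstream consumers must be idempotent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT,"
    " sent INTEGER DEFAULT 0)"
)
conn.execute("INSERT INTO outbox (payload) VALUES (?)",
             ('{"event": "OrderPlaced"}',))
conn.commit()

published = []  # stand-in for a real message broker client

def relay_once(conn, publish) -> int:
    """One polling pass: publish each unsent event, then mark it sent.
    If publish() raises, the row stays unsent and is retried next pass,
    so delivery is at-least-once, never at-most-once."""
    rows = conn.execute("SELECT id, payload FROM outbox WHERE sent = 0").fetchall()
    for event_id, payload in rows:
        publish(payload)  # may raise; the UPDATE below never runs then
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (event_id,))
    return len(rows)

relay_once(conn, published.append)
```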
Principle 3: Design Lean Events and Use Callbacks for Details. Instead of a "fat event," publish a lean "notification event." The OrderPlaced event might only contain order_id, customer_id, and timestamp. That's it. It’s a notification that a fact occurred.
Any downstream service that needs more information, like the ShippingService needing the address, is then responsible for calling back to the owning service's API (e.g., GET /orders/{order_id}/shipping_details).
This has several profound benefits:
No Data Bloat: The event stream remains lightweight.
Data is Always Fresh: The ShippingService gets the absolute latest address from the OrderService, avoiding issues with stale data in a long-lived event.
Clear Ownership and Security: The OrderService retains control over its data, exposing it via a well-defined, secured API. The NotificationService never even has the chance to see the customer's address.
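Sketching the callback flow in Python makes the division of labor visible. The event fields, the in-memory ORDERS store, and the handler are all hypothetical stand-ins for real services and HTTP endpoints.

```python
from dataclasses import dataclass

# Lean notification event: just enough to identify the fact.
@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    customer_id: str
    timestamp: str

# Stand-in for the owning service's API; in production this would be an
# HTTP endpoint such as GET /orders/{order_id}/shipping_details.
ORDERS = {"ord-42": {"address": "221B Baker Street", "weight_kg": 1.2}}

def get_shipping_details(order_id: str) -> dict:
    return ORDERS[order_id]

def handle_order_placed(event: OrderPlaced) -> str:
    # The ShippingService received only IDs; it fetches the freshest
    # address at the moment it actually needs it.
    details = get_shipping_details(event.order_id)
    return f"ship to {details['address']}"

label = handle_order_placed(OrderPlaced("ord-42", "cust-7", "2024-01-01T00:00:00Z"))
```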
Principle 4: Manage Workflows with Sagas, Not Two-Phase Commits. In a distributed system, you cannot rely on traditional distributed transactions (like two-phase commit) for business processes that span multiple services. They are brittle and don't scale. The Saga pattern manages failure and compensation in a long-running process.
There are two main types of Sagas:
Choreography: Services communicate by publishing and listening to each other's events. There is no central controller. It's like a troupe of dancers who know the routine; each one reacts to the previous dancer's move. This is highly decoupled but can be hard to debug.
Orchestration: A central orchestrator service explicitly calls other services and manages the state of the workflow. It's like a conductor leading an orchestra. This is less decoupled but much easier to observe and manage.
For many use cases, choreography is a good starting point. An order process could be modeled as a state machine driven by events from different domains.
stateDiagram-v2
direction LR
[*] --> PendingCreation
PendingCreation --> AwaitingPayment : OrderPlaced event
AwaitingPayment --> PaymentFailed : PaymentDeclined event
AwaitingPayment --> AwaitingStockCheck : PaymentSucceeded event
PaymentFailed --> Cancelled
AwaitingStockCheck --> StockReserved : InventoryReserved event
AwaitingStockCheck --> OnBackorder : InventoryUnavailable event
OnBackorder --> StockReserved : StockReplenished event
StockReserved --> Shipped : ShipmentDispatched event
Shipped --> Completed : ShipmentDelivered event
Cancelled --> [*]
Completed --> [*]
This state diagram visualizes the lifecycle of an order in a choreographed saga. Each transition is triggered by a domain event from a different service. For example, the PaymentSucceeded event from the PaymentService moves the order state from AwaitingPayment to AwaitingStockCheck. This makes the entire business process visible and understandable, even though it's implemented across multiple, independent services.
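The same lifecycle can be captured in a few lines of code. This sketch mirrors the transition table of the diagram above; in a real choreographed saga each service would hold only its own slice of this logic, but writing out the whole table is useful for testing and documentation.

```python
# Transition table mirroring the state diagram: each (state, event)
# pair moves the order to exactly one next state.
TRANSITIONS = {
    ("PendingCreation", "OrderPlaced"): "AwaitingPayment",
    ("AwaitingPayment", "PaymentDeclined"): "PaymentFailed",
    ("AwaitingPayment", "PaymentSucceeded"): "AwaitingStockCheck",
    ("AwaitingStockCheck", "InventoryReserved"): "StockReserved",
    ("AwaitingStockCheck", "InventoryUnavailable"): "OnBackorder",
    ("OnBackorder", "StockReplenished"): "StockReserved",
    ("StockReserved", "ShipmentDispatched"): "Shipped",
    ("Shipped", "ShipmentDelivered"): "Completed",
}

def apply(state: str, event: str) -> str:
    """Advance the order saga; unknown (state, event) pairs are rejected
    loudly rather than silently ignored."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is invalid in state {state!r}")

state = "PendingCreation"
for event in ["OrderPlaced", "PaymentSucceeded", "InventoryReserved",
              "ShipmentDispatched", "ShipmentDelivered"]:
    state = apply(state, event)
```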
Traps the Hype Cycle Sets for You
As you navigate EDA, you'll be bombarded with hype. Here are a few traps to sidestep.
The "Kafka is the only answer" Trap: Kafka is a phenomenal piece of technology. It's a distributed, partitioned, replicated commit log. It's also complex to operate. Do you really need strict ordering and the ability to replay events from the beginning of time? Or do you just need a simple pub/sub mechanism to decouple a few services? Sometimes, a managed service like AWS SQS/SNS or Google Pub/Sub is a far simpler, more cost-effective solution. Don't use a sledgehammer to crack a nut. The tool should fit the problem, not the resume.
The "Everything must be async" Trap: The goal is not to make everything asynchronous. The goal is to build resilient, scalable systems. Sometimes, a simple, synchronous request-response call is the right answer. It's easier to reason about, provides immediate feedback, and has less operational overhead. Introducing a message broker adds a new point of failure and significant complexity. Use asynchronous communication where it provides a clear benefit, such as for background jobs, fanning out events, or absorbing load spikes, not as a default for all interactions.
The "We must use Event Sourcing" Trap: Event Sourcing is the logical extreme of the "events as facts" principle. In this pattern, the state of your application is not stored in a traditional database table. Instead, you only store the log of events. The current state is derived by replaying those events. It's an incredibly powerful pattern for systems that require a full audit history (like finance or banking). It is also mind-bendingly complex. It changes everything about how you model, query, and evolve your application. For 98% of applications, using stateful services that produce domain events via the Outbox pattern provides most of the benefits of EDA with a fraction of the complexity.
Architecting for the Future: Your First Move on Monday Morning
We started with a team that replaced a monolith's problems with a distributed monolith's worse problems. They focused on the plumbing (the message broker) instead of the information architecture (the events). The pragmatic solution is to invert this. Obsess over the events first.
Your core argument should be this: The contract is the event, not the service endpoint. In a well-designed EDA, services become almost secondary to the event streams they consume and produce. The real architecture is defined by the topology of these streams and the schemas of the facts that flow through them.
So, what is your first move? Don't plan a "Big Bang" rewrite. That's how these failed projects start.
Go on an Event Safari: On Monday morning, pick one significant event in your current system. If you don't have events, pick one significant cross-service API call.
Map the Consumers: Identify every service that consumes this event or receives this call.
Perform a Data Audit: For each consumer, create a list of the data fields it actually uses. Be ruthless. I guarantee you will find that most consumers ignore most of the data. You now have a concrete map of your current system's waste and implicit coupling.
Prototype One Lean Event: Choose one workflow. Can you introduce a new, lean "notification event" using the Outbox pattern? For example, instead of sending the whole user object when a profile is updated, can you publish a UserProfileChanged event with only the user_id? Let the consumers who care call back for the details they need.
By taking these small, forensic steps, you start building the muscle memory for good event design. You prove the value with minimal risk and begin steering your architecture toward resilience and clarity, one event at a time.
This leads to a final, forward-looking question: If the event schemas and stream topologies are the most important contracts in our architecture, are we spending enough time designing, governing, and discovering them? Or are we leaving the most critical part of our modern systems to chance?
TL;DR: Summary for the Busy Architect
The Problem: Simply replacing API calls with a message broker often creates a "distributed monolith." A single "fat event" shared by many services leads to high coupling, brittleness, and waste.
The Core Idea: The design of the events themselves is more important than the choice of messaging technology. Focus on information architecture.
Key Principles & Patterns:
Events are Facts: Name them in the past tense (e.g., OrderPlaced). They are immutable records of what happened.
Use the Outbox Pattern: Ensure atomic writes to your database and event publishing to the message broker for ultimate reliability.
Design Lean Events: Publish small notification events (e.g., just an ID) and have consumers call back via an API for more details. This prevents data bloat and tight schema coupling.
Use Sagas for Workflows: Manage long-running business processes across services using choreography (event-driven) or orchestration (central controller) instead of brittle distributed transactions.
Common Traps: Avoid assuming Kafka is always the answer, making everything asynchronous just because you can, or jumping to complex patterns like Event Sourcing before you need them.
First Actionable Step: Audit one of your existing events or cross-service calls. See how much data is unused by consumers. This will reveal your hidden coupling and waste. Start by introducing one new, lean event for a single workflow.
Written by Felipe Rodrigues