When Outbox Didn’t Fit — But Its Mindset Saved the Day

Harshavardhanan
7 min read

The Real Problem: When Side Effects Matter More Than Data

While working on the billing pipeline, I had a straightforward goal:
Listen to events from Kafka and send billing reports to customers.

Simple on paper.
Until I started thinking about what could go wrong.

Each message represented a billing event — and triggered an email to the customer with the invoice. This wasn’t just a notification. It was a financial artifact. The kind that customers archive, question, or forward to their accounting team.

Which meant:

  • I couldn’t send it twice

  • I couldn’t afford to skip it

  • And I couldn’t trust everything to just “work”

But Kafka, by design, offers at-least-once delivery.
If my consumer read the message, started the email send, and crashed midway — Kafka would retry.
And the same customer would get the same billing email twice.

If I acknowledged the message too early, I risked dropping it.
If I delayed the ack, I risked duplicates.

It wasn’t a data consistency problem. It was a side effect guarantee problem.

And none of the usual patterns — retries, idempotent APIs, even Outbox — seemed to fit cleanly.

I needed to make sure that once a billing event was consumed, the email was sent exactly once — no matter what happened between read and send.

That’s when I started thinking:
If this isn’t an Outbox problem, why does it feel like an Outbox mindset would still solve it?


Why Outbox Didn’t Apply Directly

The Outbox Pattern solves a specific problem:
the need to write to your local database and publish a message to a broker like Kafka as a single atomic step.

It ensures that if the DB commit succeeds, the intent to publish is persisted — and eventually, the event will go out.

But in my case, I wasn’t publishing anything.
I was consuming a Kafka message — and triggering a side effect (email) that existed entirely outside the system.

There was:

  • No upstream DB transaction I could hook into

  • No broker-level transactional publish to rely on

  • No rollback mechanism for an email that had already gone out

So a standard Outbox setup — where you save the event and publish it later — didn’t apply.
There was no event to publish. The event had already arrived.

But the mindset of Outbox still stuck with me:

“Persist the intent before doing anything risky.”

In the classic Outbox, the intent is to publish.
In my case, the intent was: this message has arrived and must trigger an email.

I didn’t need an Outbox.
But I needed something Outbox-like — just flipped.


The Solution: A Decoupled, Outbox-Inspired Mail Delivery Flow

After weighing the risks of tight Kafka-to-email coupling, I redesigned the flow around a core principle:

Separate intent capture from execution.

Here’s what the updated system looks like:

  1. Kafka consumer reads a billing message

  2. It immediately writes an entry to a local email_outbox table:

    • message_id (Kafka key or UUID)

    • payload

    • status = pending

    • created_at

  3. Only after the DB insert succeeds, the Kafka offset is ACKed — not before

  4. A separate Email Relay Service scans the email_outbox table:

    • Sends the email

    • Updates the status to sent or failed

    • Retries based on configurable policy (max attempts, backoff, DLQ)

  5. A background job archives sent records and monitors for delivery delays or failures


We made a conscious choice to delay the Kafka ACK until the DB write succeeded.
That way:

  • If the service crashed before writing → Kafka retries

  • If the service crashed after writing but before ACK → Kafka retries

  • But the DB already contains the message → relay deduplicates, no double-send

No distributed transaction is needed.
Just strict ordering: write first, ACK second.
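The write-first, ACK-second ordering above can be sketched as follows. This is a minimal illustration, not the production code: the table name `email_outbox`, the column names, and the `handle_record` helper mirror the article's description, while the Kafka consumer and its ACK are stubbed with a plain callback, and sqlite3 stands in for the real database.

```python
import sqlite3

# In-memory stand-in for the service's local database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS email_outbox (
        message_id TEXT PRIMARY KEY,   -- Kafka key or UUID; the dedup anchor
        payload    TEXT NOT NULL,
        status     TEXT NOT NULL DEFAULT 'pending',
        created_at TEXT NOT NULL DEFAULT (datetime('now'))
    )
""")

def handle_record(message_id: str, payload: str, ack) -> None:
    """Persist the intent first, ACK second.

    A redelivered message hits the PRIMARY KEY and is ignored,
    so the relay never sees a duplicate row."""
    conn.execute(
        "INSERT OR IGNORE INTO email_outbox (message_id, payload) VALUES (?, ?)",
        (message_id, payload),
    )
    conn.commit()   # only after this commit...
    ack()           # ...do we acknowledge the Kafka offset

# Simulate Kafka redelivering the same billing event after a crash:
acks = []
handle_record("inv-42", '{"amount": 100}', lambda: acks.append("inv-42"))
handle_record("inv-42", '{"amount": 100}', lambda: acks.append("inv-42"))

rows = conn.execute("SELECT COUNT(*) FROM email_outbox").fetchone()[0]
print(rows)  # 1 — two deliveries, one persisted intent
```

Note that the redelivered message is still ACKed, which is exactly what you want: the intent is already safe in the table, so there is nothing left for Kafka to retry.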


Throughput vs Guarantee: Where We Drew the Line

Building reliable systems isn't just about getting everything right — it's about knowing what to guarantee, and what to let go for the sake of scale.

Once we decoupled Kafka consumption from email delivery, the next question was timing:

When should we ACK the Kafka message?

ACK too early → you risk losing side effects.
ACK too late → you risk slowing the system down.

Here's how we navigated that balance.


Why Not ACK After Sending the Email?

A natural instinct might be to delay the Kafka ACK until the email is successfully sent.
That way, if the email delivery fails or the service crashes midway, Kafka retries the message — and you get another chance.

It’s technically safe. But operationally? It’s dangerous.

Email delivery is a slow, unpredictable external operation.
Depending on the provider, rate limits, or network conditions, a single send can take anywhere from a few hundred milliseconds to several seconds. In some cases, retries could block the consumer for minutes.

If you delay the ACK until after sending the email, you’re tying Kafka throughput to email latency — and that means:

  • Slower consumer lag recovery

  • Higher partition imbalance

  • Potential stalls in the event pipeline due to one flaky external system

Instead, we persisted the message locally and ACKed as soon as that write committed.
That gave Kafka the green light to move on, while our email relay handled side effect delivery in isolation — with its own retry, failure logging, and monitoring.

This gave us scale without giving up control.
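A sketch of what one relay sweep over that table might look like, under the same assumptions as before: `send_email` is a stub for the real provider call, `MAX_ATTEMPTS` is an illustrative policy knob, and backoff and DLQ wiring are elided to keep the shape visible.

```python
import sqlite3

MAX_ATTEMPTS = 3  # assumed retry policy; the article leaves this configurable

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE email_outbox (
        message_id TEXT PRIMARY KEY,
        payload    TEXT NOT NULL,
        status     TEXT NOT NULL DEFAULT 'pending',
        attempts   INTEGER NOT NULL DEFAULT 0
    )
""")
conn.execute("INSERT INTO email_outbox (message_id, payload) VALUES ('inv-42', '{}')")
conn.commit()

def send_email(payload: str) -> bool:
    """Stub for the real SMTP/provider call; returns delivery success."""
    return True

def relay_pass() -> None:
    """One sweep: try each pending row, mark it sent, or count the failure.

    Rows that exhaust MAX_ATTEMPTS are parked as 'failed' (DLQ candidates),
    so a poisoned message can never wedge the pipeline."""
    for message_id, payload in conn.execute(
        "SELECT message_id, payload FROM email_outbox WHERE status = 'pending'"
    ).fetchall():
        if send_email(payload):
            conn.execute(
                "UPDATE email_outbox SET status = 'sent' WHERE message_id = ?",
                (message_id,),
            )
        else:
            conn.execute(
                "UPDATE email_outbox SET attempts = attempts + 1, "
                "status = CASE WHEN attempts + 1 >= ? THEN 'failed' ELSE 'pending' END "
                "WHERE message_id = ?",
                (MAX_ATTEMPTS, message_id),
            )
    conn.commit()

relay_pass()
status = conn.execute(
    "SELECT status FROM email_outbox WHERE message_id = 'inv-42'"
).fetchone()[0]
print(status)  # sent
```

Because the relay owns the table, email latency, retries, and failures all stay on this side of the boundary; the Kafka consumer never waits on any of it.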


Isn’t DB Write Also I/O? Why Not Worry There?

Yes — even delaying the ACK until after writing to the database introduces a small I/O delay.
But this is a measured risk, not an unbounded one.

Unlike email, the database is:

  • Local to the service or tightly controlled

  • Consistently fast (typically 1–5 ms per write)

  • Transactionally reliable — either the write commits or fails clearly

Delaying ACK until the DB write finishes gives us a hard guarantee:

If the message was persisted, we ACK.
If the DB write fails, we don’t ACK — and Kafka retries.

It’s a minimal cost for the safety it provides.
By contrast, delaying for email would mean waiting on a system we don’t fully control — where errors are partial, retries are noisy, and delivery is non-deterministic.

So yes, we paid the small DB latency cost to preserve correctness.
But we drew the line before reaching unbounded I/O risk.

That’s where pragmatic system design lives.


Why Not SAGA?

At one point during the design process, I paused to ask:

“Is this a candidate for a SAGA pattern?”

Emails often show up as part of long-running workflows. And SAGA is the go-to approach when multiple services need to coordinate — especially if something fails mid-way.

But the deeper I looked, the more obvious it became: this wasn’t that kind of problem.


Here’s why SAGA didn’t fit:

  • There was no multistep business transaction to orchestrate

  • No cross-service rollback was required

  • No compensating action made sense — you can’t unsend an email

  • And no choreography was needed — this wasn’t a workflow, just a one-shot message-to-action mapping

SAGA is ideal when:

You have a series of actions across services, and failure in one requires undoing or compensating the others.

Here, we didn’t need to orchestrate.
We needed to guarantee delivery.

We weren’t trying to roll back a system state.
We were trying to make sure a billing report hit an inbox exactly once — and didn’t cause a mess if retried.

So yes, I considered it.
But this wasn’t a SAGA problem. And knowing that mattered.


Wrap-Up: What I Took from Outbox, and What I Left Behind

This wasn’t the textbook Outbox Pattern.
There was no event to publish. No producer to coordinate. No downstream system waiting for a Kafka topic.

But the mindset of Outbox stuck — and it worked:

Persist the intent. Deliver the side effect. Let systems recover safely.

That simple sequence kept us honest.

  • We didn’t block Kafka on slow SMTP calls.

  • We didn’t retry blindly.

  • We didn’t pretend side effects were just function calls.

Instead:

  • We decoupled Kafka from email

  • We tracked delivery explicitly in a DB table

  • We let an isolated relay handle retries, visibility, and ownership

That wasn’t a known pattern when I started.
But by the end of it, I realized:

Outbox doesn’t have to mean “publish later.” Sometimes it just means “don’t pretend the side effect didn’t happen.”

If your system ever depends on exactly-once side effects — and retries aren’t safe — this kind of thinking might help.

You don’t always need a pattern name.
Sometimes you just need a boundary, a log, and a plan.
