Advanced Kafka Techniques: Mastering DLQ, Async Callbacks, and Custom Partitioning

Moni MK

We all know Kafka is more than just a pub-sub system—it's the backbone of modern, scalable event-driven architecture. Whether you're handling data ingestion, system decoupling, or stream processing, understanding Kafka's key elements like partitions, offsets, async callbacks, and dead letter queues (DLQs) is crucial for building reliable services.

In this blog I will cover Kafka basics and explore advanced topics like custom partitioning using a round-robin strategy with keys, which I have used in my enterprise projects to balance load evenly across partitions, and why handling producer errors with async callbacks is so important.

  • High-level Kafka Architecture -

  • Partitions and Offsets-

    A partition is a unit of parallelism in Kafka. Each topic is divided into one or more partitions. Kafka appends incoming messages in order within a single partition.

    More partitions → more parallelism. Within a consumer group, each partition is read by exactly one consumer.

    An offset is nothing but a unique ID assigned to each record within a partition. Think of it as a line number in a log file.

    Offsets are immutable. Consumers use them to keep track of their reading position.

      Partition 0:
      [0] "LoginEvent"
      [1] "LogoutEvent"
      [2] "PurchaseEvent"
      ..
      .. and so on.
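To see partitions and offsets in action, here is a minimal consumer sketch using the plain Java client that prints the partition and offset of each record it reads. The broker address localhost:9092, the topic my-topic, and the group id offset-demo-group are assumptions for illustration.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class OffsetDemoConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-demo-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // start from offset 0 on first run

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                // Every record knows which partition it lives in and its offset within that partition
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}
```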
    
  • Producer: Sync vs Async Send() -

    send() is a very important built-in method of the Kafka producer, and it is responsible for a lot under the hood (serialization, partition selection, batching, delivery). Before using any built-in function provided by a library, always understand how it is implemented.
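To make the difference concrete, here is a minimal sketch of both styles using the plain Java client. The broker address localhost:9092, the topic my-topic, and the key/value strings are assumptions for illustration.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SyncVsAsyncSendDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("my-topic", "userId123", "value");

            // Synchronous: block until the broker acknowledges, or an exception is thrown
            RecordMetadata meta = producer.send(record).get();
            System.out.printf("Sync send -> partition=%d offset=%d%n", meta.partition(), meta.offset());

            // Asynchronous: returns immediately; failures surface ONLY inside the callback
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    System.err.println("Async send failed: " + exception.getMessage());
                } else {
                    System.out.printf("Async send -> partition=%d offset=%d%n",
                            metadata.partition(), metadata.offset());
                }
            });
        }
    }
}
```

Note that the try-with-resources close() flushes any pending async sends before the program exits, which is easy to forget in fire-and-forget code.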

  • Custom Partitioning with Round-Robin for Unique Keys-

    Default Kafka Partitioner Behavior is -

    If a key is provided, Kafka hashes the key and assigns a partition. If no key is provided, Kafka uses a sticky round-robin strategy.

      ProducerRecord<String, String> record = new ProducerRecord<>("my-topic", "userId123", "value");
    

    To customize this behavior, Kafka lets us implement its Partitioner interface, as shown below -

    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.PartitionInfo;

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    public class CustomRoundRobinPartitioner implements Partitioner {

        // Remembers which partition each key was assigned to (stickiness per key)
        private final Map<String, Integer> keyToPartitionMap = new ConcurrentHashMap<>();
        // Hands out partitions in round-robin order as new keys arrive
        private final AtomicInteger counter = new AtomicInteger(0);

        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
            int numPartitions = partitions.size();

            // Keyless records simply rotate through the partitions
            if (key == null) {
                return counter.getAndIncrement() % numPartitions;
            }

            // The first time a key is seen it takes the next partition in the rotation; after that it sticks
            String stringKey = key.toString();
            return keyToPartitionMap.computeIfAbsent(stringKey, k ->
                counter.getAndIncrement() % numPartitions
            );
        }

        @Override public void close() {}
        @Override public void configure(Map<String, ?> configs) {}
    }
    

    This helps distribute data evenly across partitions even when keys are skewed. It retains partition stickiness per unique key, which preserves ordering, and ordering is very important for critical notification delivery. It also avoids hot partitions when a few keys dominate traffic.
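To wire the partitioner in, register it through the producer configuration. A minimal sketch, assuming a local broker at localhost:9092 and String keys/values:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ProducerWithCustomPartitioner {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Every send() through this producer now routes through our partitioner
        props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, CustomRoundRobinPartitioner.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "userId123", "value"));
        }
    }
}
```

In a Spring Boot service the equivalent is a producer property, e.g. spring.kafka.producer.properties.partitioner.class=com.example.CustomRoundRobinPartitioner (the package name here is just a placeholder).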

  • DLQ Strategy: Safety Net for Failures-

    When a consumer fails to process a record (due to exceptions or bad data), instead of crashing or skipping it, send the failed message to a Dead Letter Queue (DLQ) topic. A DLQ behaves exactly like a normal topic with its own producer and consumer; think of it as a backup topic, separated from the normal execution flow, where all failed messages are sent (a minimal sketch follows the list below).

    • Data never lost

    • Enables offline debugging or repair

    • Keeps your main pipeline clean
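Here is a minimal sketch of that pattern with Spring Kafka. The class name MainTopicConsumer, the group id, and the process() helper are hypothetical; the DLQ topic my-topic-dlq matches the one consumed in the retry example later in this post.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class MainTopicConsumer {

    private static final Logger log = LoggerFactory.getLogger(MainTopicConsumer.class);

    private final KafkaTemplate<String, String> kafkaTemplate;

    public MainTopicConsumer(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    @KafkaListener(topics = "my-topic", groupId = "main-consumer-group")
    public void consume(String message) {
        try {
            process(message); // business logic that may throw on bad or unexpected data
        } catch (Exception e) {
            // Instead of crashing or skipping, park the failed record in the DLQ topic
            log.error("Processing failed, routing message to DLQ: {}", e.getMessage());
            kafkaTemplate.send("my-topic-dlq", message);
        }
    }

    private void process(String message) {
        // deserialize / validate / hand off to the downstream service
    }
}
```

Spring Kafka also ships a DeadLetterPublishingRecoverer (used with DefaultErrorHandler) that can automate this routing; the hand-rolled version above just makes the flow explicit.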

  • Strategies for Retrying DLQ Messages-

    Once messages land in the DLQ, it's not the end of the road. Often, these messages fail due to transient issues such as:

    • Temporary downstream service unavailability

    • Network glitches

    • Schema evolution lag

Instead of discarding these messages, we retry them intelligently.


| Strategy | Description | When to Use |
| --- | --- | --- |
| Exponential Backoff | Retry after exponentially increasing delays (e.g., 1s, 2s, 4s, 8s) | Transient network or external service failures |
| Fixed Delay | Retry every X seconds | Predictable retry for recoverable issues |
| Max Attempts + Drop | Cap retries to avoid infinite loops | Prevent resource exhaustion |

There are certainly more retry strategies out there, tweaked for individual use cases, but the ones above are the standard approaches used widely across the industry.

If you're using Spring Boot like me, you can leverage the @Retryable annotation from Spring Retry to implement automatic retries with exponential backoff, even for messages consumed from a Dead Letter Topic (DLT). It is very handy and simple.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.Recover;
import org.springframework.retry.annotation.Retryable;
import org.springframework.stereotype.Service;

// Requires spring-retry on the classpath and @EnableRetry on a configuration class.
@Service
public class DltConsumerService {

    // Retry logic lives in a separate bean: calling a @Retryable method on `this`
    // would bypass the Spring proxy and no retries would actually happen.
    private final DltMessageHandler handler;

    public DltConsumerService(DltMessageHandler handler) {
        this.handler = handler;
    }

    @KafkaListener(topics = "my-topic-dlq", groupId = "dlt-retry-consumer-group")
    public void listenFromDLT(String message) {
        handler.handleDltMessage(message);
    }
}

@Service
class DltMessageHandler { // typically in its own file

    private static final Logger log = LoggerFactory.getLogger(DltMessageHandler.class);

    @Retryable(
        value = {TransientDataException.class},
        maxAttempts = 4,
        backoff = @Backoff(delay = 2000, multiplier = 2)
    ) // the values like maxAttempts, multiplier, delay can be configured via app props/config maps
    public void handleDltMessage(String message) {
        // Try to reprocess the message
        if (Math.random() > 0.5) { // Simulate failure
            throw new TransientDataException("Random failure for demo");
        }

        log.info("Successfully processed: " + message);
    }

    @Recover // optional: called when all retries have failed
    public void recover(TransientDataException e, String message) {
        log.info("All retries exhausted for message: " + message);
        // Optionally forward to a final DLQ topic or alert
    }
}

Retry Flow with Exponential Backoff

  1. First failure → wait 2s → retry

  2. Second failure → wait 4s → retry

  3. Third failure → wait 8s → retry

After the 4th (and final) attempt fails, the @Recover method is called; you can also push failure metrics to a monitoring tool like Prometheus or Grafana.

Note: Sync vs Async Kafka Send – Tradeoff Between Performance and Reliability-

Real-World Production Lesson (Ghost Transactions)

In a high-throughput system, we originally used asynchronous Kafka send (kafkaTemplate.send(...)) to improve performance and latency. However, we encountered a critical issue in production:

Messages that failed silently in the background were assumed to be delivered, and the upstream system proceeded as if the operation succeeded.


❗ The Problem

  • Kafka producer was configured for async sending.

  • Errors during send(...) were not properly handled via a callback (roughly the pattern sketched after this list).

  • As a result, if Kafka broker was temporarily unavailable or a serialization error occurred, the message was never delivered.

  • The calling service (upstream) moved on, assuming the message was committed.

  • Downstream systems never saw the event, causing data inconsistency or "ghost" transactions.
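For illustration, this is roughly what the fire-and-forget publisher looked like; the class name OrderEventPublisher, the topic order-events, and the method signature are hypothetical stand-ins.

```java
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class OrderEventPublisher {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public OrderEventPublisher(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void publish(String orderId, String payload) {
        // Fire-and-forget: send() returns a future, but nothing ever inspects it.
        // If the broker is unreachable or serialization fails, the error is lost
        // and the caller carries on as if the event were delivered.
        kafkaTemplate.send("order-events", orderId, payload);
    }
}
```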


Why It Happened

| Behavior | Async Send() | Sync Send() |
| --- | --- | --- |
| Producer sends message | Immediately (non-blocking) | Waits for broker acknowledgment |
| Failure handling | Via callback (optional) | Exception thrown to caller |
| Assumption on success | Easy to miss errors if callback is ignored | Caller is forced to handle exceptions |

🧠 Key Takeaway: Performance vs Reliability Tradeoff

| Factor | Async Send() | Sync Send() |
| --- | --- | --- |
| Performance | ✅ Higher throughput (non-blocking) | ❌ Slower (blocking per send) |
| Reliability | ❌ Can fail silently if not handled | ✅ Guaranteed acknowledgment or error |
| Backpressure | ✅ Queue builds up silently | ❌ Fails early under load |
| Error Control | Needs proper callback handling | Built-in exception propagation |

🎯 Our Solution (in Production)

We made the following changes:

  • Switched critical workflows (e.g., order API events, billing, audit events) to synchronous Kafka send using .send().get(), ensuring the message is fully acknowledged by the Kafka broker before proceeding.

  • For non-critical telemetry, we continued using async send with robust callbacks that:

    • Logged errors

    • Sent failed messages to a retry queue or DLQ

  • Introduced retry mechanisms using Spring Retry to auto-retry transient send failures, and added a manual queue API for human intervention on critical messages (like OTPs), along with an ignore capability for non-critical messages.

  • Added metrics and alerts for failed Kafka publishes.

If your upstream system relies on Kafka for critical state changes (e.g., orders, payments, provisioning), always prefer sync sends—or ensure async callbacks are bulletproof.
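Here is a sketch of what that split looked like in code, assuming Spring Kafka 3.x (where KafkaTemplate.send() returns a CompletableFuture) and hypothetical topic and class names.

```java
import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class ReliableEventPublisher {

    private static final Logger log = LoggerFactory.getLogger(ReliableEventPublisher.class);

    private final KafkaTemplate<String, String> kafkaTemplate;

    public ReliableEventPublisher(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // Critical path (orders, billing, audit): block until the broker acknowledges, or fail loudly.
    public void publishCritical(String key, String payload) {
        try {
            kafkaTemplate.send("order-events", key, payload)
                         .get(5, TimeUnit.SECONDS); // bounded wait so we never block forever
        } catch (Exception e) {
            // Surface the failure to the caller instead of losing it silently
            throw new IllegalStateException("Kafka publish failed for key " + key, e);
        }
    }

    // Non-critical telemetry: stay async, but make the callback do real work.
    public void publishTelemetry(String key, String payload) {
        kafkaTemplate.send("telemetry-events", key, payload)
                     .whenComplete((result, ex) -> {
                         if (ex != null) {
                             log.error("Telemetry publish failed, routing to DLQ", ex);
                             kafkaTemplate.send("telemetry-events-dlq", key, payload);
                         }
                     });
    }
}
```

For the critical path you would typically also tune producer settings such as acks=all and delivery.timeout.ms, but the key point is that the caller only proceeds once the broker has acknowledged the record.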

Conclusion

Apache Kafka is a robust foundation for building scalable, event-driven systems—but using it effectively goes beyond just publishing and consuming messages. Real-world reliability requires deep attention to:

  • Producer patterns: Choosing between sync vs async send based on the criticality of data

  • Message partitioning: Ensuring even distribution and message ordering with custom partitioners

  • Error handling: Leveraging Dead Letter Queues (DLQ) and retry strategies like exponential backoff

  • Observability: Instrumenting callbacks, logs, and DLQ metrics to catch silent failures

  • Resilience: Combining Spring Boot’s retry, Kafka’s fault-tolerant guarantees, and system-level safeguards

These considerations become essential when your system transitions from development scale to production-grade reliability.

In our case, trading off some latency for reliability by moving from async to sync Kafka sends in critical paths prevented data mismatches, and allowed downstream systems to behave predictably.

“In backend systems, speed is optional—delivery is not.”

#Kafka #SpringBoot #BackendDevelopment #SystemDesign #SoftwareEngineering #DLQ #RetryMechanism #Microservices #EventDrivenArchitecture
