Advanced Kafka Techniques: Mastering DLQ, Async Callbacks, and Custom Partitioning


We all know Kafka is more than just a pub-sub system—it's the backbone of modern, scalable event-driven architecture. Whether you're handling data ingestion, system decoupling, or stream processing, understanding Kafka's key elements like partitions, offsets, async callbacks, and dead letter queues (DLQs) is crucial for building reliable services.
In this blog I will cover Kafka basics and explore advanced topics like custom partitioning using a round-robin strategy with keys, which I have used in my enterprise projects to balance load evenly across partitions, and why handling producer errors with async callbacks is so important.
High-Level Kafka Architecture -
Partitions and Offsets-
A partition is a unit of parallelism in Kafka. Each topic is divided into one or more partitions. Kafka appends incoming messages in order within a single partition.
More partitions → more parallelism. Each partition is independently read by a consumer in a group.
An offset is nothing but a unique ID assigned to each record within a partition. Think of it as a line number in a log file.
Offsets are immutable. Consumers use them to keep track of their reading position.
Partition 0: [0] "LoginEvent" [1] "LogoutEvent" [2] "PurchaseEvent" ... and so on.
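As a quick illustration (the topic name, group id and broker address below are placeholders), a bare-bones consumer can print the partition and offset of every record it reads, which makes the "line number in a log file" analogy concrete:
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OffsetPrinter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "offset-demo-group");       // placeholder group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic")); // placeholder topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                // Each record carries the partition it was appended to and its offset within that partition
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}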
Producer: Sync vs Async Send() -
send() is a very important built-in method of the Kafka producer and is responsible for a lot. Before using any built-in function provided by a library, always understand how it is implemented.
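To make the difference concrete, here is a minimal sketch of the two ways to invoke send() on the plain Kafka producer (broker address, topic and values are placeholders): the sync variant blocks on the returned Future, while the async variant only surfaces failures through a callback.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SendModesDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record = new ProducerRecord<>("my-topic", "userId123", "value");

            // Synchronous: block until the broker acknowledges (or an exception is thrown)
            RecordMetadata meta = producer.send(record).get();
            System.out.println("Sync send acked at offset " + meta.offset());

            // Asynchronous: returns immediately; failures only surface in the callback
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    System.err.println("Async send failed: " + exception.getMessage());
                } else {
                    System.out.println("Async send acked at offset " + metadata.offset());
                }
            });
            producer.flush(); // make sure the async send completes before the producer closes
        }
    }
}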
Custom Partitioning with Round-Robin for Unique Keys-
Default Kafka Partitioner Behavior -
If a key is provided, Kafka hashes the key and assigns a partition. If no key is provided, Kafka uses a sticky partitioning strategy (it sticks with one partition until the current batch is full, then moves to another).
ProducerRecord<String, String> record = new ProducerRecord<>("my-topic", "userId123", "value");
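Conceptually, the keyed path of the default partitioner boils down to something like the following sketch (not the exact Kafka source; the real built-in partitioner also handles the unkeyed, sticky case):
import org.apache.kafka.common.utils.Utils;

// Conceptual sketch: the key bytes are hashed with murmur2 and mapped onto the partition count
static int defaultPartitionFor(byte[] keyBytes, int numPartitions) {
    return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
}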
To implement custom partitioning behavior, Kafka allows us to implement its Partitioner interface, as shown below -
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;

public class CustomRoundRobinPartitioner implements Partitioner {

    // Remembers which partition each unique key was assigned to (sticky per key)
    private final Map<String, Integer> keyToPartitionMap = new ConcurrentHashMap<>();
    // Round-robin counter used the first time a key is seen
    private final AtomicInteger counter = new AtomicInteger(0);

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();
        String stringKey = (String) key;
        // First time a key is seen, hand it the next partition in round-robin order;
        // afterwards the same key always lands on the same partition.
        return keyToPartitionMap.computeIfAbsent(stringKey,
                k -> counter.getAndIncrement() % numPartitions);
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
This helps distribute data evenly across partitions even when keys are skewed. It retains partition stickiness per unique key, which preserves ordering, and ordering is very important for critical notification delivery. It also avoids hot partitions when some keys dominate traffic.
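The partitioner then has to be registered on the producer. A minimal sketch using the partitioner.class producer property (broker address and serializers are placeholders):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
// Point the producer at the custom partitioner shown above
props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, CustomRoundRobinPartitioner.class.getName());

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
In a Spring Boot setup, the same producer property can typically be supplied via spring.kafka.producer.properties.partitioner.class in the application properties.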
DLQ Strategy: Safety Net for Failures-
When a consumer fails to process a record (due to an exception or bad data), instead of crashing or skipping it, send the failed message to a Dead Letter Queue (DLQ) topic. It behaves like any other topic with its own producer/consumer. Think of it as a backup Kafka topic, separated from the normal execution flow, where all failed messages are sent; a minimal sketch of the pattern follows the list below.
Data never lost
Enables offline debugging or repair
Keeps your main pipeline clean
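Here is that sketch with Spring Kafka (class name, topic names and the processing logic are illustrative placeholders):
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class MainTopicConsumer {

    private static final Logger log = LoggerFactory.getLogger(MainTopicConsumer.class);

    private final KafkaTemplate<String, String> kafkaTemplate;

    public MainTopicConsumer(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    @KafkaListener(topics = "my-topic", groupId = "main-consumer-group")
    public void consume(String message) {
        try {
            process(message);
        } catch (Exception ex) {
            // Park the failing record in the DLQ instead of crashing or skipping it
            log.error("Processing failed, routing to DLQ: {}", ex.getMessage());
            kafkaTemplate.send("my-topic-dlq", message);
        }
    }

    private void process(String message) {
        // Normal business logic goes here
    }
}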
Strategies for Retrying DLQ Messages-
Once messages land in the DLQ, it's not the end of the road. Often, these messages fail due to transient issues such as:
Temporary downstream service unavailability
Network glitches
Schema evolution lag
Instead of discarding these messages, we retry them intelligently.
Strategy | Description | When to Use |
Exponential Backoff | Retry after exponentially increasing delays (e.g., 1s, 2s, 4s, 8s) | Transient network or external service failures |
Fixed Delay | Retry every X seconds | Predictable retry for recoverable issues |
Max Attempts + Drop | Cap retries to avoid infinite loops | Prevent resource exhaustion |
I am sure there are more retry strategies out there, tweaked to individual use cases, but the ones above are standard approaches used widely across the industry.
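Stripped of any framework, "exponential backoff + max attempts" boils down to a small loop. A rough sketch (the delays, attempt cap and exception handling are illustrative):
// Rough sketch of exponential backoff with a cap on attempts (values are illustrative)
static boolean retryWithBackoff(Runnable task, int maxAttempts, long initialDelayMs) throws InterruptedException {
    long delay = initialDelayMs;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            task.run();
            return true;              // success, stop retrying
        } catch (RuntimeException ex) {
            if (attempt == maxAttempts) {
                return false;         // max attempts reached: drop, or forward to a final DLQ
            }
            Thread.sleep(delay);      // wait before the next attempt
            delay *= 2;               // exponential growth: 1s, 2s, 4s, 8s, ...
        }
    }
    return false;
}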
If you're using Spring Boot like me, you can leverage the @Retryable annotation provided by Spring Retry to implement automatic retries with exponential backoff, even for messages consumed from a Dead Letter Topic (DLT). It is very handy and simple.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Lazy;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.Recover;
import org.springframework.retry.annotation.Retryable;
import org.springframework.stereotype.Service;

// Requires @EnableRetry on a configuration class; TransientDataException is a custom exception defined in the project.
@Service
public class DltConsumerService {

    private static final Logger log = LoggerFactory.getLogger(DltConsumerService.class);

    // Self-injected proxy: calling handleDltMessage() directly on "this" would bypass
    // Spring's retry proxy and @Retryable would never kick in.
    @Autowired
    @Lazy
    private DltConsumerService self;

    @KafkaListener(topics = "my-topic-dlq", groupId = "dlt-retry-consumer-group")
    public void listenFromDLT(String message) {
        self.handleDltMessage(message);
    }

    @Retryable(
        value = {TransientDataException.class},
        maxAttempts = 4,
        backoff = @Backoff(delay = 2000, multiplier = 2)
    ) // values like maxAttempts, multiplier and delay can be configured via app props/config maps
    public void handleDltMessage(String message) {
        // Try to reprocess the message
        if (Math.random() > 0.5) { // Simulate failure
            throw new TransientDataException("Random failure for demo");
        }
        log.info("Successfully processed: " + message);
    }

    @Recover // optional
    public void recover(TransientDataException e, String message) {
        // Called when all retries are exhausted
        log.info("All retries exhausted for message: " + message);
        // Optionally forward to a final DLQ topic or alert
    }
}
Retry Flow with Exponential Backoff
First failure → wait 2s → retry
Second failure → wait 4s → retry
Third failure → wait 8s → retry
After the 4th total attempt → call the @Recover method, or push failure metrics to a monitoring tool like Prometheus or Grafana.
Note: Sync vs Async Kafka Send – Tradeoff Between Performance and Reliability-
Real-World Production Lesson (Ghost Transactions)
In a high-throughput system, we originally used asynchronous Kafka send (kafkaTemplate.send(...)) to improve performance and latency. However, we encountered a critical issue in production:
Messages that failed silently in the background were assumed to be delivered, and the upstream system proceeded as if the operation succeeded.
❗ The Problem
Kafka producer was configured for async sending.
Errors during send(...) were not handled properly via a callback. As a result, if the Kafka broker was temporarily unavailable or a serialization error occurred, the message was never delivered (see the sketch after this list).
The calling service (upstream) moved on, assuming the message was committed.
Downstream systems never saw the event, causing data inconsistency or "ghost" transactions.
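In code, the problematic pattern looked roughly like this simplified sketch (the topic name and payload are illustrative, and kafkaTemplate stands for the injected Spring KafkaTemplate):
// Fire-and-forget: the returned future is ignored, so a broker outage or a
// serialization error never surfaces to the caller.
public void publishEvent(String payload) {
    kafkaTemplate.send("orders-events", payload); // "orders-events" is an illustrative topic name
    // Execution continues here as if the event were safely delivered.
}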
✅ Why It Happened
Behavior | Async Send() | Sync Send() |
Producer sends message | Immediately (non-blocking) | Waits for broker acknowledgment |
Failure handling | Via callback (optional) | Exception thrown to caller |
Assumption on success | Easy to miss errors if callback is ignored | Caller is forced to handle exceptions |
🧠 Key Takeaway: Performance vs Reliability Tradeoff
Factor | Async Send() | Sync Send() |
Performance | ✅ Higher throughput (non-blocking) | ❌ Slower (blocking per send) |
Reliability | ❌ Can fail silently if not handled | ✅ Guaranteed acknowledgment or error |
Backpressure | ❌ Send buffer can build up silently | ✅ Applies natural backpressure (fails early under load) |
Error Control | Needs proper callback handling | Built-in exception propagation |
🎯 Our Solution (in Production)
We made the following changes:
Switched critical workflows (e.g., order API events, billing, audit events) to synchronous Kafka send using .send().get(), ensuring the message is fully acknowledged by the Kafka broker before proceeding.
For non-critical telemetry, we continued using async send with robust callbacks that:
Logged errors
Sent failed messages to a retry queue or DLQ
Introduced retry mechanisms using Spring Retry to auto-retry transient send failures. We also added a manual queue API for manual intervention on critical messages such as OTPs, along with an ignore capability for non-critical messages.
Added metrics and alerts for failed Kafka publishes.
If your upstream system relies on Kafka for critical state changes (e.g., orders, payments, provisioning), always prefer sync sends—or ensure async callbacks are bulletproof.
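For reference, the two fixes look roughly like this with KafkaTemplate (a sketch assuming Spring Kafka 3.x, where send() returns a CompletableFuture; the class name, topic names and DLQ forwarding are illustrative):
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class EventPublisher {

    private static final Logger log = LoggerFactory.getLogger(EventPublisher.class);

    private final KafkaTemplate<String, String> kafkaTemplate;

    public EventPublisher(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // Critical path: block until the broker acknowledges, so failures surface immediately
    public void publishCritical(String topic, String payload) throws Exception {
        kafkaTemplate.send(topic, payload).get();
    }

    // Non-critical path: async send, but with a callback that never lets errors vanish
    public void publishTelemetry(String topic, String payload) {
        kafkaTemplate.send(topic, payload).whenComplete((result, ex) -> {
            if (ex != null) {
                log.error("Kafka publish failed for topic {}: {}", topic, ex.getMessage());
                kafkaTemplate.send(topic + "-dlq", payload); // route to a retry/DLQ topic
            }
        });
    }
}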
✅ Conclusion
Apache Kafka is a robust foundation for building scalable, event-driven systems—but using it effectively goes beyond just publishing and consuming messages. Real-world reliability requires deep attention to:
Producer patterns: Choosing between sync vs async send based on the criticality of data
Message partitioning: Ensuring even distribution and message ordering with custom partitioners
Error handling: Leveraging Dead Letter Queues (DLQ) and retry strategies like exponential backoff
Observability: Instrumenting callbacks, logs, and DLQ metrics to catch silent failures
Resilience: Combining Spring Boot’s retry, Kafka’s fault-tolerant guarantees, and system-level safeguards
These considerations become essential when your system transitions from development scale to production-grade reliability.
In our case, trading off some latency for reliability by moving from async to sync Kafka sends in critical paths prevented data mismatches, and allowed downstream systems to behave predictably.
“In backend systems, speed is optional—delivery is not.”
#Kafka #SpringBoot #BackendDevelopment #SystemDesign #SoftwareEngineering #DLQ #RetryMechanism #Microservices #EventDrivenArchitecture