How we implemented canary releases in Kafka without breaking production

The Problem: "Move Fast, But Don’t Break Things"
Our team needed a way to:
Test new code safely (without impacting all users).
Limit failures to a small subset of traffic.
Roll back instantly if something went wrong.
Why Not Just Deploy to Staging?
Staging doesn’t catch everything. Real traffic has:
Unexpected message formats.
Scale issues (e.g., partition skew).
Third-party API quirks.
We needed to test against real production traffic.
The Solution: Kafka Canary Releases via Smart Partitioning
Step 1: Isolate Canary Traffic with Partitioning
We have:
20 partitions (for parallelism).
2 canary pods (running new code).
5 stable pods (running old code).
Goal:
Route 10% of traffic (2 of the 20 partitions) to canaries, assuming messages are spread evenly across partitions.
Distribute the rest evenly across stable pods.
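The split described in this goal comes down to a little arithmetic, sketched below. This is only an illustration: `CanaryPlan` and its method names are hypothetical, and handing the leftover partitions to the last pods is an assumption chosen to line up with the sample assignment shown later in the post.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: works out how many partitions each kind of pod should own. */
final class CanaryPlan {

    /** Number of partitions reserved for canaries, e.g. 10% of 20 partitions. */
    static int canaryPartitionCount(int totalPartitions, int canaryPercent) {
        // Round to the nearest whole partition, but always reserve at least one.
        int count = Math.round(totalPartitions * canaryPercent / 100.0f);
        return Math.max(1, count);
    }

    /** Splits the non-canary partitions as evenly as possible across the stable pods. */
    static List<Integer> stableShares(int stablePartitions, int stablePods) {
        List<Integer> shares = new ArrayList<>();
        int base = stablePartitions / stablePods;   // every pod gets at least this many
        int extra = stablePartitions % stablePods;  // this many pods get one more
        for (int pod = 0; pod < stablePods; pod++) {
            // Assumption: the last `extra` pods absorb the remainder.
            shares.add(pod >= stablePods - extra ? base + 1 : base);
        }
        return shares;
    }

    public static void main(String[] args) {
        int canary = canaryPartitionCount(20, 10);
        System.out.println("canary partitions: " + canary);
        System.out.println("stable shares: " + stableShares(20 - canary, 5));
    }
}
```

With the numbers from this post (20 partitions, 10%, 5 stable pods), the canaries get 2 partitions and the stable pods get 3 or 4 each.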
Step 2: The Custom Partition Assignor
// Simplified pseudocode, for illustration only:
if (consumerId.contains("-canary-")) {
    assignFixedPartitions(0, 1); // P0 and P1 go to the canaries
} else {
    assignRemainingPartitionsEvenly(); // P2-P19 are shared by the stable pods
}
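To make the rule more concrete, here is a minimal sketch of the assignment logic in plain Java, with no Kafka classes involved. It is not our production assignor: `CanaryAssignmentSketch`, `assign`, and `spread` are hypothetical names, and the fallback (when no canary is running, the stable pods also take P0-P1) is an assumption added so that no partition is left without an owner. In a real consumer this rule would sit behind Kafka's `ConsumerPartitionAssignor` interface and be registered through the `partition.assignment.strategy` setting; that wiring is left out here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of the assignment rule only; the Kafka wiring is intentionally omitted. */
final class CanaryAssignmentSketch {

    static final int CANARY_PARTITIONS = 2; // P0 and P1, as in the post

    /** Returns consumerId -> partition numbers owned by that consumer. */
    static Map<String, List<Integer>> assign(List<String> consumerIds, int totalPartitions) {
        List<String> canaries = new ArrayList<>();
        List<String> stables = new ArrayList<>();
        for (String id : consumerIds) {
            if (id.contains("-canary-")) canaries.add(id); else stables.add(id);
        }
        // Sort so the same set of members always produces the same layout.
        Collections.sort(canaries);
        Collections.sort(stables);

        // Assumption: if one kind of pod is absent, the other kind covers every partition.
        int canaryCount;
        if (canaries.isEmpty()) canaryCount = 0;
        else if (stables.isEmpty()) canaryCount = totalPartitions;
        else canaryCount = Math.min(CANARY_PARTITIONS, totalPartitions);

        Map<String, List<Integer>> result = new TreeMap<>();
        spread(canaries, 0, canaryCount, result);
        spread(stables, canaryCount, totalPartitions, result);
        return result;
    }

    /** Gives partitions [from, to) to members in contiguous chunks; later members absorb the remainder. */
    private static void spread(List<String> members, int from, int to, Map<String, List<Integer>> out) {
        int n = members.size();
        if (n == 0) return;
        int count = to - from;
        int base = count / n;
        int extra = count % n;
        int next = from;
        for (int i = 0; i < n; i++) {
            int size = base + (i >= n - extra ? 1 : 0);
            List<Integer> owned = new ArrayList<>();
            for (int k = 0; k < size; k++) owned.add(next++);
            out.put(members.get(i), owned);
        }
    }

    public static void main(String[] args) {
        List<String> pods = List.of(
                "orders-canary-1", "orders-canary-2",
                "orders-stable-1", "orders-stable-2", "orders-stable-3",
                "orders-stable-4", "orders-stable-5");
        assign(pods, 20).forEach((id, parts) -> System.out.println(id + " -> " + parts));
    }
}
```

Sorting the member IDs before assigning is a deliberate choice: it keeps the layout stable across rebalances, which is what makes the canary partitions easy to find when debugging.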
Why This Works:
Predictable: Canaries always get P0-P1 (easy debugging).
Fair: Stable pods get 3 or 4 partitions each (18 partitions across 5 pods), so none is overloaded.
Zero Downtime: Just deploy the canary pods; no config changes are needed!
Example: Order Processing
Before (Risky!)
Push new code → All 20 partitions risk failures.
Rollback = Redeploy everything (slow!).
After (Safe Canary)
Deploy 2 canary pods (assigned P0-P1).
10% of orders flow to canaries.
Monitor for errors:
Latency spikes?
Processing failures?
If all good, roll out to stable pods.
If bad, kill canaries → only 10% impacted.
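The promote-or-kill decision in the steps above can be written down as a simple comparison between canary and stable metrics. The sketch below is hypothetical: the class, the metric fields, and both thresholds are illustrative values, not the ones we used, and collecting the metrics is out of scope.

```java
/** Hypothetical promote-or-rollback check; thresholds are illustrative only. */
final class CanaryHealthCheck {

    enum Decision { PROMOTE, ROLLBACK }

    /** Metrics for one group of pods over the same observation window. */
    static final class Metrics {
        final long processed;
        final long failed;
        final double p99LatencyMs;

        Metrics(long processed, long failed, double p99LatencyMs) {
            this.processed = processed;
            this.failed = failed;
            this.p99LatencyMs = p99LatencyMs;
        }

        double errorRate() {
            return processed == 0 ? 0.0 : (double) failed / processed;
        }
    }

    // Assumed tolerances: canary may fail 1 percentage point more than stable,
    // and its p99 latency may be at most 1.5x the stable p99.
    static final double MAX_ERROR_RATE_INCREASE = 0.01;
    static final double MAX_LATENCY_RATIO = 1.5;

    static Decision evaluate(Metrics canary, Metrics stable) {
        boolean tooManyFailures =
                canary.errorRate() > stable.errorRate() + MAX_ERROR_RATE_INCREASE;
        boolean latencySpike =
                canary.p99LatencyMs > stable.p99LatencyMs * MAX_LATENCY_RATIO;
        return (tooManyFailures || latencySpike) ? Decision.ROLLBACK : Decision.PROMOTE;
    }

    public static void main(String[] args) {
        Metrics stable = new Metrics(9000, 9, 120.0);
        Metrics canary = new Metrics(1000, 40, 130.0);
        System.out.println(evaluate(canary, stable)); // 4% canary errors vs 0.1% stable
    }
}
```

Comparing against the stable pods rather than a fixed number means a bad upstream day (everyone slow, everyone failing) does not get blamed on the canary.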
Lessons Learned
1. Dynamic > Static (Sometimes)
We later moved to header-based routing, which allows a more flexible traffic percentage, but we started with partitions for simplicity.
2. Debugging Tip
Use kafka-consumer-groups to verify assignments (replace localhost:9092 with your broker address):
kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group orders-group
Output (simplified):
orders-group orders-canary-1 P0
orders-group orders-canary-2 P1
orders-group orders-stable-1 P2, P3, P4
orders-group orders-stable-2 P5, P6, P7
...
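Eyeballing that listing works for a handful of pods, but the same check can be automated. The sketch below assumes you already have the layout as a map of consumer ID to partitions (for example, parsed from the CLI output or fetched through Kafka's `AdminClient`; that part is not shown). `AssignmentCheck` and `verify` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sanity check for a canary partition layout. */
final class AssignmentCheck {

    /** Returns a list of problems; an empty list means the layout matches the plan. */
    static List<String> verify(Map<String, List<Integer>> assignment,
                               int totalPartitions, int canaryPartitions) {
        List<String> problems = new ArrayList<>();
        Set<Integer> seen = new HashSet<>();
        for (Map.Entry<String, List<Integer>> entry : assignment.entrySet()) {
            boolean isCanaryPod = entry.getKey().contains("-canary-");
            for (int partition : entry.getValue()) {
                if (!seen.add(partition)) {
                    problems.add("P" + partition + " has more than one owner");
                }
                // Canary pods must own only the first `canaryPartitions` partitions,
                // and stable pods must own none of them.
                boolean isCanaryPartition = partition < canaryPartitions;
                if (isCanaryPod != isCanaryPartition) {
                    problems.add(entry.getKey() + " should not own P" + partition);
                }
            }
        }
        for (int partition = 0; partition < totalPartitions; partition++) {
            if (!seen.contains(partition)) {
                problems.add("P" + partition + " has no owner");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Only the four rows shown in the listing, so we check partitions 0-7 here.
        Map<String, List<Integer>> layout = new LinkedHashMap<>();
        layout.put("orders-canary-1", List.of(0));
        layout.put("orders-canary-2", List.of(1));
        layout.put("orders-stable-1", List.of(2, 3, 4));
        layout.put("orders-stable-2", List.of(5, 6, 7));
        System.out.println(verify(layout, 8, 2));
    }
}
```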
3. Not Just for Canaries!
We can use this for:
A/B testing (P0-P1 = variant A).
Geo-routing (P0-P4 = EU traffic).
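Both of these uses need the producer side to cooperate: something has to put EU records on P0-P4 in the first place. A rough sketch of that choice is below. Everything in it is an assumption for illustration: the class and method names are hypothetical, non-EU traffic going to P5-P19 is a guess at the rest of the layout, and `String.hashCode()` stands in for whatever hash your producer really uses. In a Kafka producer, logic like this would live in an implementation of the `Partitioner` interface configured through `partitioner.class`.

```java
/** Hypothetical producer-side rule: keep each region inside its own partition range. */
final class RangePartitionerSketch {

    /** Picks a partition inside [firstPartition, lastPartition] based on the record key. */
    static int partitionInRange(String key, int firstPartition, int lastPartition) {
        int width = lastPartition - firstPartition + 1;
        // floorMod keeps the offset non-negative even when hashCode() is negative.
        return firstPartition + Math.floorMod(key.hashCode(), width);
    }

    /** Assumed layout: EU on P0-P4 (from the post), everything else on P5-P19. */
    static int partitionFor(String region, String key) {
        return "EU".equals(region)
                ? partitionInRange(key, 0, 4)
                : partitionInRange(key, 5, 19);
    }

    public static void main(String[] args) {
        System.out.println("EU order -> P" + partitionFor("EU", "order-1001"));
        System.out.println("US order -> P" + partitionFor("US", "order-1001"));
    }
}
```

Hashing the key inside the range keeps records with the same key on the same partition, so per-key ordering is preserved within a region.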
Final Thoughts
This approach saved us countless rollbacks and gave us confidence in our deployments.
Written by Vijay Belwal
