How we implemented canary releases in Kafka without breaking production

Vijay Belwal

The Problem: "Move Fast, But Don’t Break Things"

Our team needed a way to:

  1. Test new code safely (without impacting all users).

  2. Limit failures to a small subset of traffic.

  3. Roll back instantly if something went wrong.

Why Not Just Deploy to Staging?

Staging doesn’t catch everything. Real traffic has:

  • Unexpected message formats.

  • Scale issues (e.g., partition skew).

  • Third-party API quirks.

We needed to test against real production traffic.

The Solution: Kafka Canary Releases via Smart Partitioning

Step 1: Isolate Canary Traffic with Partitioning

We have:

  • 20 partitions (for parallelism).

  • 2 canary pods (running new code).

  • 5 stable pods (running old code).

Goal:

  • Route about 10% of traffic (2 of 20 partitions, assuming messages spread evenly across partitions) to canaries.

  • Distribute the rest evenly across stable pods.
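The arithmetic behind that goal is easy to check. Here is a minimal sketch; the class and method names are hypothetical and not taken from our codebase, and the canary share only equals the traffic share when keys are spread evenly across partitions.

```java
// Hypothetical helper for illustration; not part of the original system.
final class CanaryPlan {

    /** Percentage of partitions reserved for canaries. */
    static double canarySharePercent(int totalPartitions, int canaryPartitions) {
        return 100.0 * canaryPartitions / totalPartitions;
    }

    /** Partitions per stable pod: index 0 is the minimum, index 1 the maximum. */
    static int[] stablePartitionsPerPod(int totalPartitions, int canaryPartitions, int stablePods) {
        int remaining = totalPartitions - canaryPartitions;
        int min = remaining / stablePods;
        // If the partitions do not divide evenly, some pods carry one extra.
        int max = min + (remaining % stablePods == 0 ? 0 : 1);
        return new int[] {min, max};
    }

    public static void main(String[] args) {
        System.out.println(canarySharePercent(20, 2));      // prints 10.0
        int[] range = stablePartitionsPerPod(20, 2, 5);
        System.out.println(range[0] + "-" + range[1]);      // prints 3-4
    }
}
```

With 18 partitions left for 5 stable pods, some pods get 3 partitions and some get 4.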

Step 2: The Custom Partition Assignor

// Simplified logic, for illustration only:
if (consumerId.contains("-canary-")) {
    assignFixedPartitions(0, 1); // P0, P1 for canaries
} else {
    assignRemainingPartitionsEvenly(); // P2-P19 for stable
}

Why This Works:

  • Predictable: Canaries always get P0-P1 (easy debugging).

  • Fair: Stable pods get ~3-4 partitions each (no overload).

  • Zero Downtime: Just deploy canary pods. The resulting rebalance hands them P0-P1, with no config changes!
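To make the assignment rule concrete, here is a self-contained sketch in plain Java with no Kafka dependency. The class name, the contiguous-range layout, and the fallback behaviour (stable pods take every partition when no canary is running) are assumptions for illustration, not our production code. In a real consumer this logic would sit behind Kafka's ConsumerPartitionAssignor interface, registered through partition.assignment.strategy.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the canary-aware assignment rule.
final class CanaryAssignmentSketch {

    static Map<String, List<Integer>> assign(List<String> consumerIds,
                                             int partitionCount,
                                             int canaryPartitions) {
        List<String> canaries = new ArrayList<>();
        List<String> stable = new ArrayList<>();
        for (String id : consumerIds) {
            if (id.contains("-canary-")) canaries.add(id); else stable.add(id);
        }
        // Sort so every run produces the same assignment for the same members.
        Collections.sort(canaries);
        Collections.sort(stable);

        Map<String, List<Integer>> result = new LinkedHashMap<>();
        if (canaries.isEmpty() || stable.isEmpty()) {
            // Assumption: if one side is absent, the other side owns every
            // partition so that no partition is left unconsumed.
            spread(canaries.isEmpty() ? stable : canaries, 0, partitionCount, result);
        } else {
            spread(canaries, 0, canaryPartitions, result);            // P0..P(k-1)
            spread(stable, canaryPartitions, partitionCount, result); // the rest
        }
        return result;
    }

    // Hands out partitions [from, to) as contiguous ranges; when the split is
    // uneven, the last owners in the list receive one extra partition each.
    private static void spread(List<String> owners, int from, int to,
                               Map<String, List<Integer>> out) {
        if (owners.isEmpty()) return;
        int base = (to - from) / owners.size();
        int extra = (to - from) % owners.size();
        int next = from;
        for (int i = 0; i < owners.size(); i++) {
            int count = base + (i >= owners.size() - extra ? 1 : 0);
            List<Integer> owned = new ArrayList<>();
            for (int j = 0; j < count; j++) owned.add(next++);
            out.put(owners.get(i), owned);
        }
    }

    public static void main(String[] args) {
        List<String> ids = List.of(
            "orders-stable-1", "orders-stable-2", "orders-stable-3",
            "orders-stable-4", "orders-stable-5",
            "orders-canary-1", "orders-canary-2");
        assign(ids, 20, 2).forEach((id, parts) -> System.out.println(id + " -> " + parts));
    }
}
```

The fallback branch matters for rollback: once the canaries are gone, something still has to consume P0-P1.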

Example to visualise: Order Processing

Before (Risky!)

  • Push new code → All 20 partitions risk failures.

  • Rollback = Redeploy everything (slow!).

After (Safe Canary)

  1. Deploy 2 canary pods (assigned P0-P1).

  2. 10% of orders flow to canaries.

  3. Monitor for errors:

    • Latency spikes?

    • Processing failures?

  4. If all good, roll out to stable pods.

  5. If bad, kill canaries → only 10% impacted.
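The promote-or-kill decision in steps 3 to 5 can be written down as a simple comparison between canary and stable metrics. This is only a sketch: the class name, the thresholds, and the minimum sample size are hypothetical values chosen for illustration, and where the metrics come from is left out.

```java
// Hypothetical canary gate; thresholds are illustrative assumptions.
final class CanaryVerdict {

    enum Decision { PROMOTE, ROLLBACK, WAIT }

    // Canary error rate may exceed the stable rate by at most 1 percentage point.
    static final double MAX_ERROR_RATE_DELTA = 0.01;
    // Canary p99 latency may be at most 1.5x the stable p99.
    static final double MAX_LATENCY_RATIO = 1.5;
    // Do not judge the canary before it has processed this many messages.
    static final long MIN_SAMPLE = 100;

    static Decision evaluate(long canaryProcessed, long canaryFailed, double canaryP99Ms,
                             long stableProcessed, long stableFailed, double stableP99Ms) {
        if (canaryProcessed < MIN_SAMPLE) {
            return Decision.WAIT;
        }
        double canaryErr = rate(canaryFailed, canaryProcessed);
        double stableErr = rate(stableFailed, stableProcessed);
        boolean errorsOk = canaryErr - stableErr <= MAX_ERROR_RATE_DELTA;
        boolean latencyOk = canaryP99Ms <= stableP99Ms * MAX_LATENCY_RATIO;
        return (errorsOk && latencyOk) ? Decision.PROMOTE : Decision.ROLLBACK;
    }

    private static double rate(long failed, long processed) {
        return processed == 0 ? 0.0 : (double) failed / processed;
    }

    public static void main(String[] args) {
        // Canary: 2 failures in 1000 messages; stable: 18 in 9000; similar latency.
        System.out.println(evaluate(1000, 2, 120.0, 9000, 18, 110.0)); // prints PROMOTE
    }
}
```

Comparing against the stable pods, rather than against fixed numbers, keeps the gate meaningful when overall traffic is having a bad day for unrelated reasons.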

Lessons Learned

1. Dynamic > Static (Sometimes)

We later moved to header-based routing, which lets us pick the canary percentage more flexibly than whole partitions allow, but we started with partitions for simplicity.

2. Debugging Tip

Use kafka-consumer-groups to verify assignments:

kafka-consumer-groups --bootstrap-server <broker:port> --describe --group orders-group

Output (simplified):

orders-group    orders-canary-1      P0  
orders-group    orders-canary-2      P1  
orders-group    orders-stable-1      P2, P3, P4  
orders-group    orders-stable-2      P5, P6, P7  
...

3. Not Just for Canaries!

The same technique also works for:

  • A/B testing (P0-P1 = variant A).

  • Geo-routing (P0-P4 = EU traffic).
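Unlike the canary case, geo-routing also needs the producer to send each message to the right partition range. Here is a minimal sketch of that choice, assuming P0-P4 carry EU traffic as in the example above and P5-P19 carry everything else (the second half is my assumption). The class and method names are hypothetical, and String.hashCode() stands in for the murmur2 hash that Kafka's default partitioner uses. In a real producer this would live in a class implementing Kafka's Partitioner interface, configured through partitioner.class.

```java
// Hypothetical region-aware partition choice; not the original implementation.
final class RegionPartitionRouter {

    /**
     * Picks a partition for a message: EU keys land in [0, euPartitions),
     * all other regions land in [euPartitions, totalPartitions).
     */
    static int partitionFor(String region, String key, int totalPartitions, int euPartitions) {
        boolean eu = "EU".equalsIgnoreCase(region);
        int start = eu ? 0 : euPartitions;
        int size = eu ? euPartitions : totalPartitions - euPartitions;
        // floorMod keeps the result non-negative even for negative hash codes.
        return start + Math.floorMod(key.hashCode(), size);
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("EU", "order-1001", 20, 5));
        System.out.println(partitionFor("US", "order-1001", 20, 5));
    }
}
```

The same key always maps to the same partition within its range, so per-key ordering is preserved.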

Final Thoughts

This approach spared us many full rollbacks and gave us confidence in our deployments.
