Apache Kafka - A Gentle Introduction


What is Apache Kafka?
Developed by engineers at LinkedIn in 2010, Apache Kafka (or simply ‘Kafka’) was built to handle real-time tracking of streaming data (i.e. data that just keeps on coming) at a scale traditional systems couldn’t manage. Open-sourced in 2011, it's now the silent workhorse behind Netflix's recommendations and Uber's ride matching. Fun fact: the "Kafka" name is a hat tip to the writer Franz Kafka - the team liked his work, and a system optimized for writing arguably deserves a writer's name. Whatever the reasoning, it's better than "Project DatabaseSaver 3000".
Why do I need Kafka?
Picture this:
A team has built a food delivery app where thousands of delivery partners deliver food ordered via the app, which supports live tracking of orders from the restaurant to the delivery address. Every delivery partner’s device pings a location update every second. In the beginning, with only one or two delivery partners, the app feels butter smooth and everything works as planned. But once the business scales and onboards over five thousand delivery partners, the team notices that, out of the blue, their PostgreSQL database instance starts throwing these errors:
ERROR: too many connections
FATAL: remaining connection slots are reserved for non-replication superuser connections
Now why would this happen? Well, a root cause analysis might reveal some (if not all) of the following reasons:
~5K delivery partners × 1 update/sec = 300K writes/minute
Read queries getting trampled by write stampedes
DB replication lag makes the app feel like a 90s dial-up connection (the app is multi-region)
These frequent write operations severely impact database throughput, slowing down the entire application and potentially causing system failures during peak times.
Kafka to the rescue
Kafka is the answer for handling messaging between distributed systems in large-scale applications: it reduces database overload by temporarily holding data until it is either ‘used’ (consumed) or saved to the database. But here is where most people get confused: thinking Kafka replaces databases. Nope. Kafka handles the firehose (it stores data temporarily), and the DB handles persistence (it stores data permanently). Think of it like the reception staff in a hospital - regardless of the number of visitors, they collect the relevant data (symptoms, blood pressure, height and weight) and keep everyone calm until the queue clears and each visitor can meet their assigned doctor.
With Kafka, now we can -
Handle millions of events/sec
Order messages like an OCD librarian (within each partition)
Survive broker failures without losing data
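To make that concrete, here's a minimal sketch of the buffering pattern using the kafka-python client. The broker address (localhost:9092), the 'rider-locations' topic, and the save_batch() helper are all illustrative assumptions, not the app's actual code:

import json
import time

from kafka import KafkaConsumer, KafkaProducer

# --- Producer side: a delivery partner's phone firehosing updates ---
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def ping_location(partner_id: str, lat: float, lon: float) -> None:
    # Keying by partner_id keeps each partner's updates in order,
    # since messages with the same key land in the same partition.
    producer.send(
        "rider-locations",
        key=partner_id.encode("utf-8"),
        value={"partner_id": partner_id, "lat": lat, "lon": lon, "ts": time.time()},
    )

# --- Consumer side: drain the firehose, batch-write to the DB ---
def save_batch(rows):
    # Hypothetical stand-in for one bulk INSERT into PostgreSQL.
    print(f"persisting {len(rows)} location rows")

consumer = KafkaConsumer(
    "rider-locations",
    bootstrap_servers="localhost:9092",
    group_id="db-writers",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 500:  # one bulk write instead of 500 tiny ones
        save_batch(batch)
        batch.clear()

The database now sees one tidy bulk insert every few hundred updates instead of five thousand frantic writes per second.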
Key Components of Kafka
To understand Kafka, one needs to grasp its core components:
Broker: The main server that handles message storage and routing. Think of it as a ‘bouncer’ managing message flow
Producer: A thing that publishes messages to Kafka. For example - a delivery partner's phone screaming "HERE'S ANOTHER LOCATION UPDATE!"
Consumer: A thing that subscribes to Kafka and processes the incoming messages. It's like the workers processing those screams
Topic: The categories within Kafka where producers post their messages and consumers consume. It is like a table in a database or a folder in a file system.
Partition: Subdivisions of a topic that enable parallel processing. Partitions cap the maximum parallelism for processing messages. 8 partitions? Max 8 active consumers per group.
Consumer Groups: Team of consumers splitting the work. They provide fault tolerance and load balancing.
We will see what each of them does in detail later.
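To make the vocabulary concrete right away, here's a small sketch (again kafka-python, with an assumed broker at localhost:9092) that creates the hypothetical 'rider-locations' topic with 8 partitions:

from kafka.admin import KafkaAdminClient, NewTopic

# Broker: the server our admin client connects to.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Topic + Partitions: 8 partitions means at most 8 active
# consumers in any single consumer group.
admin.create_topics([
    NewTopic(name="rider-locations", num_partitions=8, replication_factor=1)
])
admin.close()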
‘ZooKeeper’ and ‘KRaft’
Kafka used to use ‘ZooKeeper’ behind the scenes. Think of it as a weird friend who organizes Kafka’s parties on its behalf. ‘ZooKeeper’ is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Old-school Kafka needed ZooKeeper, but newer versions (2.8+) can run in ‘KRaft’ mode instead - simpler setup, fewer moving parts.
Unless you’re maintaining a legacy system, Kafka doesn’t need ‘ZooKeeper’ any more. ‘KRaft’ is a consensus protocol that does what ZooKeeper did for Kafka, inside Kafka itself, eliminating the need for a separate ZooKeeper service. This change simplifies Kafka's architecture, improves scalability, and reduces operational overhead.
Auto-Balancing and Scalability
Kafka's power comes from its partitioning model (i.e. a topic being sub-divided into different partitions), which follows these important rules:
One consumer can read from multiple partitions
One partition can only be consumed by one consumer from the same consumer group
The number of active consumers in a group is limited by the number of partitions
These rules, though necessary, limit the partition-to-consumer ratio. This is where consumer groups become essential: they allow Kafka to distribute processing load and scale horizontally. Consider a topic with four partitions:
When you add more consumers to a group, Kafka automatically rebalances the partition assignments, allowing throughput to scale linearly until you have as many consumers as partitions. In consumer group #1, four of its five consumers consume one partition each, leaving consumer 5 idle. In consumer group #2, two of its three consumers consume one partition each, while the third consumes the remaining two. In consumer group #3, the two consumers consume two partitions each. And in consumer group #4, a single consumer consumes all four partitions. Kafka does this automatically - this is referred to as auto-balancing, and it is what lets Kafka scale.
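You can watch auto-balancing happen with a sketch like the one below (same assumed topic and broker as before). Run it in two terminals: because both processes share a group_id, Kafka splits the partitions between them, and if you kill one, its partitions are handed over to the survivor:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "rider-locations",
    bootstrap_servers="localhost:9092",
    group_id="trackers",  # same group => partitions are divided up
    auto_offset_reset="earliest",
)
for message in consumer:
    # message.partition shows which slice of the topic this process owns
    print(f"partition={message.partition} offset={message.offset}")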
How Kafka Stands Out Among Its Peers
There are many message handlers like Kafka (for example - RabbitMQ, Google Pub/Sub, Apache ActiveMQ, etc.) and most of them follow one of the following message handling models:
Queue Model: each message is processed by exactly one consumer (point-to-point)
Pub-Sub Model: one publisher publishes to many subscribers, and every subscriber gets a copy
Kafka elegantly supports both these message handling patterns. This flexibility makes Kafka suitable for a wide range of use cases, from simple message queuing to complex event streaming architectures.
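In kafka-python terms, the difference between the two models is just the group_id, as this quick sketch (same assumed topic and broker) shows:

from kafka import KafkaConsumer

# Queue model: give every consumer the SAME group_id, and each
# message is processed by exactly one of them.
queue_worker = KafkaConsumer(
    "rider-locations",
    bootstrap_servers="localhost:9092",
    group_id="db-writers",
)

# Pub-sub model: give each subscriber its OWN group_id, and every
# subscriber receives a full copy of the stream.
analytics_subscriber = KafkaConsumer(
    "rider-locations",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
)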
Conclusion
Kafka provides an elegant solution for high-throughput streaming data messaging needs in modern applications. By offloading real-time data processing to Kafka, you can significantly reduce database load while maintaining system responsiveness.
For delivery tracking, IoT applications, analytics pipelines, or any scenario requiring real-time data streams, Kafka offers the scalability and performance modern applications demand.
As you implement this architecture, remember that Kafka excels at handling real-time data flows, while databases remain essential for long-term data persistence and complex queries. By combining their strengths, you can build systems that are both responsive and reliable under high load.