Apache Kafka: Architectural Overview and Performance Mechanisms

Siddhartha S

Introduction

Kafka is a distributed event store and stream-processing platform originally developed at LinkedIn for real-time data processing. In 2011, Kafka was open-sourced under the Apache Software Foundation, and since then countless software development teams have adopted it for a wide range of use cases. In this article, we will examine the main components of Kafka, how they function, and what enables Kafka to remain highly available while handling millions of messages per second.

Components of Kafka

Topics

A Kafka topic can be likened to a database table; while it is not strictly a table, the analogy helps illustrate its function. Each topic is identified by a unique name and, unlike a database table, does not perform data validation. The sequence of messages within a topic is referred to as a data stream. Topics are append-only and immutable: once a record is written, it cannot be modified or deleted in place; it is removed only when the retention policy expires, as described below.

Data in Kafka topics does not get removed after consumption, distinguishing Kafka from traditional message queues. Instead, the data is persisted in topics for a limited time, with a default expiry period of seven days that can be configured based on requirements. Topics are fed by producers, and the data within them is consumed by consumers, as we will explore shortly.
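
To make this concrete, here is a minimal sketch using the Java AdminClient from kafka-clients that creates a topic and makes the seven-day retention explicit. The broker address, topic name, and partition and replication counts are illustrative assumptions.

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.config.TopicConfig;

    public class CreateTopicSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Assumed broker address; point this at your own cluster.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // Illustrative topic: 3 partitions, replication factor 2,
                // retention set explicitly to 7 days (604,800,000 ms).
                NewTopic topic = new NewTopic("truck_gps_location", 3, (short) 2)
                        .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG, "604800000"));
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }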

Partitions and Offsets

One reason Kafka achieves high throughput is that topics are split into partitions. A single topic can consist of multiple partitions. When a message is published to a topic, the client writing it, known as the producer (covered later), determines which partition will receive the message. Users can also instruct producers to route data to a specific partition by providing a partition key.

Within each partition, messages are assigned an incremental identifier known as an offset. The offset is meaningful only within a specific partition; thus, offset 2 in partition 1 is unrelated to offset 2 in partition 2. This design ensures that order is guaranteed only within partitions, not across the entire topic. Additionally, offsets will not be reused, even after data has been deleted.
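
As a rough sketch of how this looks in client code, the producer callback below prints the partition and offset assigned to each record; because all three records share the same key, they land in the same partition and their offsets increase within it. The broker address, topic, and key are assumptions for illustration.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class OffsetSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 3; i++) {
                    // Same key => same partition, so the offsets printed here grow within one partition.
                    ProducerRecord<String, String> record =
                            new ProducerRecord<>("truck_gps_location", "truck-42", "position " + i);
                    producer.send(record, (metadata, exception) -> {
                        if (exception == null) {
                            System.out.printf("partition=%d offset=%d%n",
                                    metadata.partition(), metadata.offset());
                        }
                    });
                }
            }
        }
    }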

Producers

A producer is a client that writes data to Kafka topics, which are composed of partitions. Producers have the following key attributes:

  • They perform load balancing across the partitions of a topic unless a specific partition key is provided.

  • If the broker hosting the partition being written to goes down, the producer can recover once a replica on another broker is elected as the new leader. We will delve deeper into replication in Kafka later in this article.

  • Producers know which broker (a physical machine where a topic partition resides) contains the partition they need to write to. A producer employs partitioner logic to determine which partition to use for a record. This logic hashes the serialized key with the Murmur2 algorithm, and the hash determines the target partition (a short sketch of the calculation follows this list). The calculation follows this formula:
    targetPartition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions
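
A small sketch of that calculation, assuming the Utils class shipped in kafka-clients (org.apache.kafka.common.utils.Utils); the key and partition count are made up for illustration.

    import java.nio.charset.StandardCharsets;

    import org.apache.kafka.common.utils.Utils;

    public class PartitionerSketch {
        // Mirrors the default key-based partitioning described above.
        static int targetPartition(String key, int numPartitions) {
            byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
            // toPositive masks the sign bit; unlike Math.abs it can never return a negative value.
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        }

        public static void main(String[] args) {
            // The same key always maps to the same partition for a fixed partition count.
            System.out.println(targetPartition("truck-42", 6));
            System.out.println(targetPartition("truck-42", 6));
        }
    }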

A Kafka message sent by the producer to the broker comprises the following components (a sketch after the list shows how they map onto the producer API):

  • Key: In binary format and can be null.

  • Value: The message itself, also in binary format and nullable.

  • Compression Type: The supported compression types include none, gzip, snappy, lz4, and zstd.

  • Headers: In key-value format, these are optional.

  • Partition: Contains information about the partition where the message will be written.

  • Timestamp: This can be either user-defined or system-generated.
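
As a sketch of how these fields appear in client code, the snippet below builds a ProducerRecord with an explicit partition, timestamp, key, value, and a header; all values are illustrative. Note that the compression type is configured on the producer (compression.type) rather than per record.

    import java.nio.charset.StandardCharsets;

    import org.apache.kafka.clients.producer.ProducerRecord;

    public class MessageAnatomySketch {
        public static void main(String[] args) {
            ProducerRecord<String, String> record = new ProducerRecord<>(
                    "truck_gps_location",              // topic
                    0,                                 // partition (null lets the partitioner decide)
                    System.currentTimeMillis(),        // timestamp (null means system-generated)
                    "truck-42",                        // key (nullable)
                    "{\"lat\":12.97,\"lon\":77.59}");  // value (nullable)

            // Optional headers in key-value form.
            record.headers().add("trace-id", "abc-123".getBytes(StandardCharsets.UTF_8));

            System.out.println(record);
        }
    }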

Kafka Message Serializer

Kafka topic partitions exclusively accept bytes from producers and return bytes to consumers (which we will discuss shortly). The key and value in each message are serialized before being pushed into the topics.

Commonly used serializers include:

  • String (including JSON)

  • Int, Float

  • Avro

  • Protobuf
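
A quick sketch of what serialization does: calling two of these serializers directly shows a typed key and value being turned into the raw bytes Kafka stores. In real code the serializers are normally set through the producer's key.serializer and value.serializer configuration rather than invoked by hand; the topic name here is illustrative.

    import org.apache.kafka.common.serialization.IntegerSerializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SerializerSketch {
        public static void main(String[] args) {
            // The typed key and value become the raw bytes that are written to the partition.
            byte[] keyBytes = new IntegerSerializer().serialize("truck_gps_location", 42);
            byte[] valueBytes = new StringSerializer().serialize("truck_gps_location", "hello kafka");

            System.out.println(keyBytes.length + " key bytes, " + valueBytes.length + " value bytes");
        }
    }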

Consumers

Consumers are clients that perform the opposite function of producers: they pull data from a topic. Data is read sequentially from low to high within each partition.
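
A minimal consumer sketch, assuming a broker at localhost:9092 and an illustrative group ID: it subscribes to a topic and prints each record's partition, offset, and value, showing that records arrive in offset order within a partition.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class SimpleConsumerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "dashboard-service");       // illustrative group
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singleton("truck_gps_location"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // Within a partition, records are delivered in offset order, low to high.
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }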

Consumer Deserializer

Just as producers have serializers, consumers utilize deserializers. Common deserializers include:

  • String (including JSON)

  • Int, Float

  • Avro

  • Protobuf

It is important to note that the serialization format of a topic should not change during its lifetime: consumers must deserialize records with the same format the producers used to serialize them, so switching formats mid-stream would break existing consumers.

Consumer Groups

Consumers read data as part of consumer groups; typically, all consumer instances of a single application share the same group.

💡
Each consumer within a group can read from multiple partitions, but a single partition is consumed by at most one consumer of the same group at a time. Refer to the diagram below.

A natural question arises: What happens if the number of consumers in a consumer group exceeds the number of available partitions?

In this case, the additional consumers will exist but remain inactive, as illustrated in the diagram below:

A topic can be connected to multiple consumer groups without any issue.

To create distinct consumer groups, the Kafka consumer framework provides a property called group.id. For example, you might have a notification service and a dashboard service that both listen to a topic called truck_gps_location. While the dashboard service updates a geographic map on the dashboard, the notification service is responsible for raising alerts for interested users via email or text. These two services belong to different consumer groups.
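
A minimal sketch of the configuration side of this example: the only difference between the two services is the group.id they use, and because the groups are distinct, each one independently receives every message from truck_gps_location. The broker address and group names are illustrative.

    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;

    public class GroupIdSketch {
        // Base settings shared by both services; the broker address is assumed.
        static Properties baseProps() {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            return props;
        }

        static Properties dashboardServiceProps() {
            Properties props = baseProps();
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "dashboard-service");
            return props;
        }

        static Properties notificationServiceProps() {
            Properties props = baseProps();
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "notification-service");
            return props;
        }

        public static void main(String[] args) {
            System.out.println(dashboardServiceProps().getProperty(ConsumerConfig.GROUP_ID_CONFIG));
            System.out.println(notificationServiceProps().getProperty(ConsumerConfig.GROUP_ID_CONFIG));
        }
    }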

Consumer Offsets

Kafka stores the read offsets for each consumer group in an internal topic called __consumer_offsets. This mechanism ensures that if a consumer in a group fails, Kafka allows it to resume data retrieval from the last position it accessed.

Consumers need to periodically commit the offsets they have read back to Kafka, which can be done either automatically or manually. Depending on when offsets are committed relative to processing, there are three delivery semantics to consider (a sketch of the first pattern follows the list):

  • At Least Once (usually preferred): Offsets are committed after the message is processed. If an error occurs during processing, the message will be read again, which might lead to duplicate processing. Therefore, consumers should handle messages idempotently.

  • At Most Once: Offsets are committed immediately upon receiving the messages. If an error occurs during processing, some messages may be lost and will not be read again.

  • Exactly Once: Recommended for Kafka-to-Kafka workflows, where data is read from one topic, processed, and the result is written to another topic. Kafka's Transactional API can be used for this purpose.
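
Below is a sketch of the at-least-once pattern mentioned above: auto-commit is disabled and offsets are committed only after the records in a batch have been processed. The broker address, group ID, and topic are illustrative, and process() stands in for real, idempotent business logic.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class AtLeastOnceSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "notification-service");    // illustrative group
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit offsets manually

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singleton("truck_gps_location"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record); // must be idempotent: a crash before the commit re-delivers
                    }
                    if (!records.isEmpty()) {
                        consumer.commitSync(); // commit only after processing: at-least-once
                    }
                }
            }
        }

        // Placeholder for real processing (e.g. sending a notification).
        private static void process(ConsumerRecord<String, String> record) {
            System.out.println(record.value());
        }
    }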

Kafka Brokers and Topic Replication Factor

Kafka is a distributed software platform that spans multiple nodes or servers, referred to as brokers. The partitions for a topic are distributed across different brokers to facilitate horizontal scaling, and they are also replicated to ensure high availability. The replication factor of a topic determines how many replicas of each partition will be stored.

For example, consider a Kafka cluster with three brokers and a replication factor of 2.

As illustrated in the above diagram, the partitions for both topics are distributed across the three brokers, with each partition's replication also stored across the brokers. This design allows Kafka topics to be both distributed and highly available.

When a client connects to any broker within the Kafka cluster, the broker acts as a bootstrap broker. It provides the client with information about all the other brokers, their addresses, and the partitions stored on them, allowing the client to know which broker to connect to for the specific partition it requires.

Regarding the replicas of a partition, only one copy, known as the leader replica, accepts writes from producers. Producers send messages only to the leader of a partition. Consumers, by default, also read from the leader, although since Kafka 2.4 they can be configured to fetch from the closest in-sync replica to reduce network latency and cross-datacenter traffic.
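
To see leaders and replicas for a topic, here is a sketch using the AdminClient's describeTopics call (assuming kafka-clients 3.1+ for allTopicNames() and an assumed broker at localhost:9092); it prints the leader, replica set, and in-sync replicas for each partition.

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class ReplicaInspectionSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Any reachable broker can act as the bootstrap broker.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription description = admin
                        .describeTopics(Collections.singleton("truck_gps_location"))
                        .allTopicNames().get()
                        .get("truck_gps_location");

                for (TopicPartitionInfo partition : description.partitions()) {
                    // The leader accepts writes; followers replicate it; the ISR is the in-sync set.
                    System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                            partition.partition(), partition.leader(),
                            partition.replicas(), partition.isr());
                }
            }
        }
    }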

ZooKeeper / KRaft

ZooKeeper, a distributed system in its own right, was initially used to coordinate the brokers in a Kafka cluster, enabling them to manage events such as new topic creation, leader elections across partitions, and broker failures (both the death of a broker and its recovery). However, ZooKeeper-based clusters ran into scaling limits at roughly 100,000 partitions. ZooKeeper mode was therefore deprecated during the 3.x releases and removed entirely in Kafka 4.0 in favor of KRaft, Kafka's own implementation of the Raft consensus protocol, which allows a cluster to scale to millions of partitions.

Conclusion

In this article, we examined the various components of Apache Kafka and gained an understanding of how they function. At a high level, we explored the factors that make Kafka both scalable and highly available. We began with topics and delved into components such as partitions. We also discussed producers, consumers, and consumer groups, highlighting the anatomy of a Kafka message sent by the producer to the partitions and how consumer groups allow different services to read from the same topic for various business purposes.

Kafka is continually being enhanced for performance improvements and remains a highly sought-after technology for high-speed event processing. I hope this article has clarified many of your questions about Kafka and has instilled confidence in you to consider it for your next project.

Written by

Siddhartha S

With over 18 years of experience in IT, I specialize in designing and building powerful, scalable solutions using a wide range of technologies like JavaScript, .NET, C#, React, Next.js, Golang, AWS, Networking, Databases, DevOps, Kubernetes, and Docker. My career has taken me through various industries, including Manufacturing and Media, but for the last 10 years, I’ve focused on delivering cutting-edge solutions in the Finance sector. As an application architect, I combine cloud expertise with a deep understanding of systems to create solutions that are not only built for today but prepared for tomorrow. My diverse technical background allows me to connect development and infrastructure seamlessly, ensuring businesses can innovate and scale effectively. I’m passionate about creating architectures that are secure, resilient, and efficient—solutions that help businesses turn ideas into reality while staying future-ready.