Learning Kafka: Week 1 of 14-Day Journey With Shubham Gore


Topics

  • What is Kafka?: Apache Kafka is a distributed streaming platform for publishing, subscribing to, storing, and processing streams of records in real time. It's designed for high-throughput, fault-tolerant, scalable data pipelines.

  • Why use Kafka?: It's a natural fit for real-time analytics, log aggregation, and message queuing, because it decouples producers from consumers and stores streams durably.

  • Key Use Cases:

    • Real-time data pipelines.

    • Log aggregation systems.

    • Microservices communication.

  • Core Components:

    • Brokers: A broker is a Kafka server that runs in a Kafka cluster. It receives messages from producers, assigns offsets to them, and commits the messages to storage on disk. It also services consumers, responding to fetch requests for partitions and responding with the messages that have been published.

    • Partition: A partition is an ordered, immutable sequence of records that is continually appended to. Each partition is a structured commit log, and records in the partitions are each assigned a sequential id number called the offset. Partitions allow Kafka to scale horizontally and provide parallel processing capabilities.

    • Producers/Consumers: Producers publish records to topics; consumers subscribe to topics and read those records.

    • Replication: Kafka replicates data by maintaining multiple copies of each partition across different brokers. One broker is designated as the leader for a partition, handling all read and write requests, while others are followers that replicate the leader's data. If a leader fails, one of the followers becomes the new leader. The number of replicas is configurable per topic.

  • Message Delivery Guarantees:

    • At-least-once: records may be redelivered after a failure, but are never lost.

    • At-most-once: records are never redelivered, but may be lost on failure.

    • Exactly-once: each record is processed exactly once, achieved with idempotent producers and transactions.

  • Producers:

    • Control acknowledgements (acks), retries, and batching (see the configuration sketch after this list).

  • Consumers:

    • Use consumer groups for scalability.

    • Manage offsets (auto or manual commit).
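
To make the producer controls and delivery-guarantee options above concrete, here is a minimal configuration sketch, assuming the kafka-clients library and a broker on localhost:9092; the specific values (retries, batch size, group id) are illustrative, not tuned recommendations:

    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public class DeliveryConfigSketch {
        public static void main(String[] args) {
            // Producer-side knobs that shape throughput and the delivery guarantee
            Properties producerProps = new Properties();
            producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            producerProps.put(ProducerConfig.ACKS_CONFIG, "all");              // wait for all in-sync replicas
            producerProps.put(ProducerConfig.RETRIES_CONFIG, 3);               // retry transient send failures
            producerProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true); // de-duplicate retried sends
            producerProps.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);    // batch up to 32 KB per partition
            producerProps.put(ProducerConfig.LINGER_MS_CONFIG, 10);            // wait up to 10 ms to fill a batch

            // Consumer-side: committing offsets manually after processing gives
            // at-least-once consumption; committing before processing leans at-most-once.
            Properties consumerProps = new Properties();
            consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "week1-demo-group"); // placeholder group id
            consumerProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

            System.out.println("Producer config: " + producerProps);
            System.out.println("Consumer config: " + consumerProps);
        }
    }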

Setup

  1. Downloaded the Apache Kafka binary from the official Apache Kafka downloads page.

  2. Extracted the file.

  3. Started ZooKeeper and the Kafka broker with the following commands:

      # Start ZooKeeper
      bin/zookeeper-server-start.sh config/zookeeper.properties

      # Start Kafka broker
      bin/kafka-server-start.sh config/server.properties

  4. Created my first topic:

      bin/kafka-topics.sh --create --topic my-first-topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
    
  5. Produced messages using Kafka CLI:

      bin/kafka-console-producer.sh --topic my-first-topic --bootstrap-server localhost:9092
    
  6. Consumed messages from the topic:

      bin/kafka-console-consumer.sh --topic my-first-topic --bootstrap-server localhost:9092 --from-beginning
    
  7. Created a topic with multiple partitions:

      bin/kafka-topics.sh --create --topic multi-partition-topic --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
    
  8. Produced messages and verified their partition distribution.

  9. Observed partition behaviour using:

      bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group console-consumer-group
    
  10. Wrote a simple Java producer to send messages (a sketch appears after this list).

  11. Created a Java consumer to read messages and manually commit offsets (also sketched after this list).

  12. Check out the full code on my GitHub.
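
Below is a minimal sketch of the kind of Java producer I used in steps 8 and 10: it sends keyed messages to the multi-partition topic and prints which partition each record landed on. The topic name, key format, and loop are illustrative rather than the exact code from my repo:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SimpleProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 10; i++) {
                    ProducerRecord<String, String> record =
                            new ProducerRecord<>("multi-partition-topic", "key-" + i, "message-" + i);
                    // Blocking on the Future keeps the demo simple; real code would use a callback.
                    RecordMetadata metadata = producer.send(record).get();
                    System.out.printf("key-%d -> partition %d, offset %d%n",
                            i, metadata.partition(), metadata.offset());
                }
            }
        }
    }

Because every record carries a key, the default partitioner hashes that key, so the printed output shows how records spread across the topic's three partitions.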
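
And a sketch in the spirit of step 11: a consumer that joins a consumer group, disables auto-commit, and commits offsets only after the records from a poll have been processed. The group id and topic name are placeholders:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ManualCommitConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "week1-java-consumer");   // placeholder group id
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");       // commit offsets manually
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");     // read from the start on first run

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("multi-partition-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition %d, offset %d: %s%n",
                                record.partition(), record.offset(), record.value());
                    }
                    if (!records.isEmpty()) {
                        consumer.commitSync(); // commit only after the batch has been processed
                    }
                }
            }
        }
    }

Committing after processing means a crash between processing and the commit can replay some records, which is exactly the at-least-once trade-off noted in the insights below.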

Insights

  • Kafka's CLI tools are intuitive for beginners.

  • Observing real-time message production and consumption was fascinating.

  • Partitioning improves scalability but requires careful design for ordering.

  • Consumer groups balance the load across consumers.

  • Managing offsets provides flexibility but requires careful implementation to avoid duplication.

Conclusion

This week, I explored the core concepts and practical aspects of Apache Kafka, focusing on topics like brokers, partitions, producers, consumers, and message delivery guarantees. Through hands-on exercises, I set up Kafka, created topics, and experimented with real-time message production and consumption.

Key takeaways:

  • Kafka's distributed architecture is powerful for handling large-scale, real-time data streams, with partitions ensuring scalability and fault tolerance.

  • Partitioning enables parallel processing but also presents challenges around message ordering, which I had to consider when working with multiple partitions.

  • Consumer groups offer a scalable approach to consuming messages, allowing multiple consumers to share the load, but offset management becomes crucial to avoid processing the same messages multiple times.

  • Message delivery guarantees such as at-least-once, at-most-once, and exactly-once give flexibility in how Kafka handles message reliability and duplication.

By setting up topics and consumers, observing partitioning behavior, and experimenting with Java producers and consumers, I gained a deeper understanding of Kafka's inner workings. Moving forward, I’ll continue to expand my knowledge of advanced features like Kafka Streams, Kafka Connect, and stream processing.

I am excited about what lies ahead in Week 2, as I dive deeper into Kafka's advanced topics and work on more sophisticated stream processing applications.


Engage with Me

  • Twitter: Follow my daily updates and join the discussion!

  • LinkedIn: Connect with me and share your Kafka journey.

  • Hashnode: Stay updated with my weekly learning journals.

Let me know your thoughts, questions, or feedback in the comments below. 🚀
