DataOps: Apache Kafka - Basic
Introduction
Technically speaking, event streaming is the practice of capturing data in real-time from event sources like databases, sensors, mobile devices, cloud services, and software applications in the form of streams of events; storing these event streams durably for later retrieval; manipulating, processing, and reacting to the event streams in real-time as well as retrospectively; and routing the event streams to different destination technologies as needed. Event streaming thus ensures a continuous flow and interpretation of data so that the right information is at the right place, at the right time.
How Does it Work?
Kafka provides three basic building blocks (functionalities):
To publish (write) and subscribe to (read) streams of events.
To store streams of events as long as one wants, durably.
To process streams of events.
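For orientation, here is a rough mapping of those three blocks to the command-line tools and settings used later in this article (stream processing itself is handled by the Kafka Streams Java library and is not covered here):
# 1. Publish / subscribe: the console producer and consumer
> bin/kafka-console-producer.sh --topic quickstart-events --bootstrap-server localhost:9092
> bin/kafka-console-consumer.sh --topic quickstart-events --bootstrap-server localhost:9092
# 2. Store: per-topic retention settings such as retention.ms
# 3. Process: the Kafka Streams API (Java), beyond the scope of this article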
Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol:
Servers
Kafka is run as a cluster of one or more servers that can span multiple datacenters or cloud regions. Some of these servers form the storage layer, called the brokers. Other servers run Kafka Connect to continuously import and export data as event streams to integrate Kafka with your existing systems such as relational databases as well as other Kafka clusters.
Clients
Clients allow you to write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner, even in the case of network problems or machine failures.
Notes
If a server fails, other servers will take over its work to limit data loss and ensure healthy operation.
Clients are available for many programming languages, including Java, Python, Go, and C, as well as via REST APIs.
Kafka's performance is effectively constant with respect to data size, so storing data for a long time is perfectly fine.
Descriptive Core Concepts
Events
An event records the fact that something happened; it is also called a record or a message. Every read and write in Kafka is done in the form of events. Conceptually, an event has a key, a value, and a timestamp.
Producers
Client applications that publish events.
Consumers
Client applications that subscribe to events.
Topic
Events are stored in topics, much like files populate a folder or images form an album. A topic may have many producers and many consumers, and each topic can be configured separately, giving full control over how long old events are kept.
Topics are partitioned, i.e. spread over a number of buckets located on different Kafka brokers. Events with the same event key are written to the same partition.
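As a quick sketch of what this looks like in practice (the topic name orders, the partition count, and the retention value are made up for illustration; the command itself is covered in the next section):
# Create a topic with 3 partitions whose events are kept for 7 days (604,800,000 ms)
> bin/kafka-topics.sh --create --topic orders --partitions 3 --replication-factor 1 --config retention.ms=604800000 --bootstrap-server localhost:9092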
Practical Core Concepts
Initialize Server
We'll be using KRaft instead of ZooKeeper, since ZooKeeper mode is deprecated and being phased out.
After downloading Kafka from the official website and extracting the archive:
# Navigate to the extracted folder. And then:
# Generate a random cluster ID
> KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
# Format the log directories with that cluster ID
> bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties
# Start the Kafka server in KRaft mode
> bin/kafka-server-start.sh config/kraft/server.properties
Now Kafka is ready to use. Open a new terminal session and let's create a topic.
Topics
Before publishing or subscribing, we should create a topic for our events, so that the Kafka server knows where to store and how to categorize the event data.
> bin/kafka-topics.sh --create --topic quickstart-events --bootstrap-server localhost:9092
Note: localhost:9092 is the default host and port that the Kafka server starts on.
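Optionally, you can verify the new topic and inspect its partition layout:
# Shows the partition count, replication factor, and per-partition leader assignments
> bin/kafka-topics.sh --describe --topic quickstart-events --bootstrap-server localhost:9092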
Now that the topic has been created, let's send and receive some events.
Producers & Consumers
Using the console producer:
> bin/kafka-console-producer.sh --topic quickstart-events --bootstrap-server localhost:9092
# Type a couple of events; each line you enter becomes a separate event:
>> Hello!
>> This is Andrew
Now in a new terminal session, let's inspect the consumer while the producer and the server are running.
> bin/kafka-console-consumer.sh --topic quickstart-events --from-beginning --bootstrap-server localhost:9092
# The consumer prints all events stored in the topic:
Hello!
This is Andrew
Event streaming happens in real time: you can publish events at the same time that you're subscribing to them, and each new event appears in the consumer terminal as soon as it is sent.
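To see event keys in action (recall that events with the same key land in the same partition), here is a rough sketch using the console tools' optional key properties; the choice of : as the key separator and the sample key user42 are arbitrary:
# Produce keyed events: everything before ':' is the key, the rest is the value
> bin/kafka-console-producer.sh --topic quickstart-events --property parse.key=true --property key.separator=: --bootstrap-server localhost:9092
>> user42:logged in
# Consume and print the key alongside each value
> bin/kafka-console-consumer.sh --topic quickstart-events --from-beginning --property print.key=true --bootstrap-server localhost:9092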
Strong Points
Scalable Event Reporting
Using Apache Kafka, you can create a pub-sub communication channel between any two components of a system, one that easily scales to any number of producers and consumers and allows a seamless flow of information for analysis, archiving, and API response generation, all at once.
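As a minimal sketch of that scaling, you can run the same console consumer command in several terminals with a shared group id (the group name analytics-team is just an example); Kafka then splits the topic's partitions across the members of the group:
# Run in two or more terminals; consumers in the same group divide the partitions between them
> bin/kafka-console-consumer.sh --topic quickstart-events --group analytics-team --bootstrap-server localhost:9092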
Real-time Handling
Kafka provides a very low-latency communication channel between producers and consumers, which makes it excellent for time-sensitive applications such as:
Self-driving cars and other autonomous service robots that interact directly with IoT devices or human beings and need a very fast feedback loop between their different components.
Security- and asset-management-related applications that need real-time tracking, reporting, and data processing to provide reliable support.
Centralized Universal Messaging Interface
Kafka has a simple and adaptable interface that can pass events to and from many different services. Using Kafka, you can flexibly gather reports from all of a company's assets and monitor or archive every status report, step by step, if need be.
Final Word
We covered the basic concepts of working with Apache Kafka. These are only the building blocks; how you plan your data ETL and logging procedures is up to you.