DataOps: Apache Kafka - Basic
Introduction
Technically speaking, event streaming is the practice of capturing data in real-time from event sources like databases, sensors, mobile devices, cloud services, and software applications in the form of streams of events; storing these event streams durably for later retrieval; manipulating, processing, and reacting to the event streams in real-time as well as retrospectively; and routing the event streams to different destination technologies as needed. Event streaming thus ensures a continuous flow and interpretation of data so that the right information is at the right place, at the right time.
How Does it Work?
Kafka provides three basic building blocks (functionalities):
To publish (write) and subscribe to (read) streams of events.
To store streams of events as long as one wants, durably.
To process streams of events.
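For orientation, here is a rough mapping of those three blocks to the command-line tools and settings used later in this article (stream processing itself is handled by the Kafka Streams Java library and is not covered here):
# 1. Publish / subscribe: the console producer and consumer
> bin/kafka-console-producer.sh --topic quickstart-events --bootstrap-server localhost:9092
> bin/kafka-console-consumer.sh --topic quickstart-events --bootstrap-server localhost:9092
# 2. Store: per-topic retention settings such as retention.ms
# 3. Process: the Kafka Streams API (Java), beyond the scope of this article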
Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol:
Servers
Kafka is run as a cluster of one or more servers that can span multiple datacenters or cloud regions. Some of these servers form the storage layer, called the brokers. Other servers run Kafka Connect to continuously import and export data as event streams to integrate Kafka with your existing systems such as relational databases as well as other Kafka clusters.
Clients
Clients allow you to write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner, even in the case of network problems or machine failures.
Notes
If a server fails, other servers will take over its work to limit data loss and ensure healthy operation.
Clients are available for many programming languages, including Java, Python, Go, and C, as well as via REST APIs.
Kafka's performance is effectively constant with respect to data size, so storing data for a long time is perfectly fine.
Descriptive Core Concepts
Events
An event records the fact that something happened; it is also called a record or a message. Every read and write in Kafka is done in the form of events. Conceptually, an event has a key, a value, and a timestamp.
Producers
Client applications that publish events.
Consumers
Client applications that subscribe to events.
Topic
Events are stored in topics, much like files populate a folder or images form an album. A topic may have many producers and many consumers, and each topic can be configured separately, giving full control over how long old events are kept.
Topics are partitioned, i.e. spread over a number of buckets located on different Kafka brokers. Events with the same event key are written to the same partition.
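As a quick sketch of what this looks like in practice (the topic name orders, the partition count, and the retention value are made up for illustration; the command itself is covered in the next section):
# Create a topic with 3 partitions whose events are kept for 7 days (604,800,000 ms)
> bin/kafka-topics.sh --create --topic orders --partitions 3 --replication-factor 1 --config retention.ms=604800000 --bootstrap-server localhost:9092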
Practical Core Concepts
Initialize Server
We'll be using KRaft instead of ZooKeeper, since ZooKeeper mode is deprecated and being phased out.
After downloading Kafka from the official website and extracting the archive:
# Navigate to the extracted folder. And then:
# Generate a random cluster ID
> KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
# Format the log directories with that cluster ID
> bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties
# Start the Kafka server in KRaft mode
> bin/kafka-server-start.sh config/kraft/server.properties
Now Kafka is ready to use. Open a new terminal session and let's create a topic.
Topics
Before publishing or subscribing, we should create a topic for our events, so that the Kafka server knows where to store and how to categorize the event data.
> bin/kafka-topics.sh --create --topic quickstart-events --bootstrap-server localhost:9092
Note: localhost:9092 is the default host and port that the Kafka server starts on.
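Optionally, you can verify the new topic and inspect its partition layout:
# Shows the partition count, replication factor, and per-partition leader assignments
> bin/kafka-topics.sh --describe --topic quickstart-events --bootstrap-server localhost:9092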
Now that the topic has been created, let's send and receive some events.
Producers & Consumers
Using the console producer:
> bin/kafka-console-producer.sh --topic quickstart-events --bootstrap-server localhost:9092
# Type a couple of events; each line you enter becomes a separate event:
>> Hello!
>> This is Andrew
Now in a new terminal session, let's inspect the consumer while the producer and the server are running.
> bin/kafka-console-consumer.sh --topic quickstart-events --from-beginning --bootstrap-server localhost:9092
# The consumer prints all events stored in the topic:
Hello!
This is Andrew
Event streaming happens in real time: you can publish events at the same time that you're subscribing to them, and each new event appears in the consumer terminal as soon as it is sent.
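To see event keys in action (recall that events with the same key land in the same partition), here is a rough sketch using the console tools' optional key properties; the choice of : as the key separator and the sample key user42 are arbitrary:
# Produce keyed events: everything before ':' is the key, the rest is the value
> bin/kafka-console-producer.sh --topic quickstart-events --property parse.key=true --property key.separator=: --bootstrap-server localhost:9092
>> user42:logged in
# Consume and print the key alongside each value
> bin/kafka-console-consumer.sh --topic quickstart-events --from-beginning --property print.key=true --bootstrap-server localhost:9092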
Strong Points
Scalable Event Reporting
Using Apache Kafka, you can create a pub-sub communication channel between any two components of a system, one that easily scales to any number of producers and consumers and allows a seamless flow of information for analysis, archiving, and API response generation, all at once.
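As a minimal sketch of that scaling, you can run the same console consumer command in several terminals with a shared group id (the group name analytics-team is just an example); Kafka then splits the topic's partitions across the members of the group:
# Run in two or more terminals; consumers in the same group divide the partitions between them
> bin/kafka-console-consumer.sh --topic quickstart-events --group analytics-team --bootstrap-server localhost:9092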
Real-time Handling
Kafka provides a very low-latency communication channel between producers and consumers, which makes it excellent for time-sensitive applications such as:
Self-driving cars and other autonomous service robots that interact directly with IoT devices or human beings and need a very fast feedback loop between their different components.
Security- and asset-management-related applications that need real-time tracking, reporting, and data processing to provide reliable support.
Centralized Universal Messaging Interface
Kafka has a simple and adaptable interface that can pass events to and from many different services. Using Kafka, you can flexibly gather reports from all of a company's assets and monitor or archive every status report, step by step, if need be.
Final Word
We covered the basic concepts of working with Apache Kafka. These are only the building blocks; how you plan your data ETL and logging procedures is up to you.