Cloud Event-Driven Architecture

Abhiram Puranik
12 min read

Definition of event-driven architecture

As the name suggests, event-driven architecture is a paradigm for building highly scalable applications in which each action is driven by an event that has occurred in the system, triggered either externally or internally.

Overview of cloud-native applications

Cloud-native apps are systems specifically designed to be deployed and run in a cloud environment, where the infrastructure the application runs on can scale dynamically.

Microservices are small, focused applications deployed in a cloud environment, each dedicated to performing a particular task. They are modular components that depend on infrastructure such as databases, caches, or other services to do their work, but they are designed to scale independently.

Characteristics of cloud-native apps

  1. Software-as-a-service: These apps are usually not shipped to customers; instead they are deployed on the provider's side and offer their functionality as a service.

  2. Microservice based: Made up of multiple services, each intended to perform a particular task in the application.

  3. Containerized: Docker-image-based systems that package code and all of its dependencies, making the app platform independent.

  4. DevOps based: Designed to be deployed through a CI/CD (continuous integration / continuous deployment) pipeline.

  5. APIs: Standard REST/gRPC-based communication brings modularity between services.

  6. Mostly stateless: The apps themselves don't need to hold application state, making instances easy to replace.

  7. Scalable: Statelessness makes them scalable by running replicas of the same instance.

  8. Resilient and fault-tolerant: Because these apps run on distributed systems, they stay resilient and fault-tolerant through node outages.

  9. Observable: Each app's performance can be monitored individually.

  10. Maintainable: Apps are modular, and adding new apps to the system is easy.

Role of message queues in event-based systems

Message queues are the conveyor belt of the whole "factory of services". They are the backbone of event-driven architectures.

Message queues are scalable systems that act as an intermediary through which services talk to each other. They are designed to be fault tolerant, to cope with large volumes of messages, and to be intelligent enough to hold and distribute messages based on how the system consumes them.

In an end-to-end (request-reply) design, each service sends and receives messages directly, e.g. over REST/gRPC. As the number of services grows linearly, the number of communication paths between them grows quadratically, making the load hard for the system to handle, reducing throughput, and slowing down responses and actions.
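A quick back-of-the-envelope sketch (the service counts are made up, purely to illustrate the scaling argument): with n services talking point-to-point, the number of communication paths grows quadratically, while a central broker keeps it linear.

```python
def point_to_point_links(n: int) -> int:
    # every pair of services needs its own direct connection
    return n * (n - 1) // 2

def broker_links(n: int) -> int:
    # each service only holds one connection, to the message queue
    return n

for n in (5, 10, 50):
    print(f"{n} services: {point_to_point_links(n)} direct links vs {broker_links(n)} broker links")
```

At 50 services that is 1225 direct links versus 50 broker connections, which is why the mesh becomes unmanageable long before the broker does.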

Fundamentals of Event-Based Architecture

The event-driven approach's main purpose, from the application's business end, is scalability.

From a development end, its main purpose is to let teams build services that are loosely coupled, asynchronous, and modular.

Key components

High level Components:

  • Message queue: The appliance over which the other components relay messages to each other.

  • Producers/Publishers: Components that create events and send them to the message queue.

  • Subscribers/Consumers: Components that read events from the message queue.

  • Contracts: Agreements between component owners on what each will produce or consume, for the purpose of building the application.

  • Event: The actual message transferred between publisher and subscriber. Events can be JSON, Protobufs, primitive data types, and many other formats.
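As a minimal sketch of what a contract plus event might look like in practice (the `OrderCreated` event and its fields are invented for illustration), a shared dataclass can serve as the agreed schema, with JSON as the wire format:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class OrderCreated:
    # the "contract": producer and consumer owners agree on these fields
    order_id: str
    amount_cents: int
    currency: str = "USD"

# producer side: serialize the event to JSON bytes for the queue
event = OrderCreated(order_id="ord-42", amount_cents=1999)
payload = json.dumps(asdict(event)).encode("utf-8")

# consumer side: deserialize back into the shared type
decoded = OrderCreated(**json.loads(payload))
print(decoded)
```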

Benefits and challenges

Like any other approach, this architecture has multiple advantages but also some drawbacks.

The benefits are scalability, modularity, and loosely coupled components.

Challenges occur when systems are not designed correctly, or when infrastructure hurdles arise from resource constraints or other limitations.

Incorrectly designed event-driven systems mostly lead to huge tech debt, lots of bugs, and lots of back and forth with service owners.

Comparison with traditional request-response models

Traditional request-response models use point-to-point communication, usually over REST or gRPC.

REST usually carries extra overhead on each request, while gRPC is comparatively lighter, but both behave quite similarly.

The problem with these models is that they can get overwhelmed when the number of messages or the number of services increases. Each message has to travel independently, putting stress on the network (here, the Kubernetes network).

A message queue, on the other hand, keeps messages on a single queue, like a river from which multiple households draw water through intake pipes dipped in from the bank.

Advantages of message queues:

  1. Services are not responsible for handling message movement.

  2. If a service that consumes messages goes down at some point, its messages wait in the queue until the service comes back up. In a request-response model, the messages sent during that window would be lost.
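This buffering behaviour can be sketched in a few lines, with an in-memory deque standing in for the real queue, purely for illustration:

```python
from collections import deque

queue = deque()  # stands in for the message queue

# the producer keeps publishing while the consumer is down
for i in range(3):
    queue.append({"id": i})

# once the consumer comes back up, it drains the backlog: nothing was lost
processed = []
while queue:
    processed.append(queue.popleft())

print(processed)
```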

Use cases in modern applications

Event-driven architecture is useful when it is acceptable for the application to behave asynchronously.

This is because messages are processed from queues first-come-first-served. If the architecture demands immediate action on a request, then this architecture may not be suitable.

Use cases:

  1. Analytics services for any app: analytics rarely requires immediate processing of the events/messages that occur.

  2. Requests that can take time to process: for example, reporting the timestamp at which a user is watching a Netflix show to the backend, so playback can resume from that point next time; delays here are acceptable.

  3. Job-based workloads: Actions on systems external to the application, like enterprise on-prem components talking to their cloud services.

  4. ChatGPT's image generation: As these workloads run async, event-driven is one of the ideal ways to schedule tasks across internal services.

  5. Cab-hailing geolocation services: A 30-40 second delay in the exact location of cabs is acceptable to users.

  6. Applications that should store a history of events/messages, e.g. for machine learning and AI training.

Examples where it shouldn't be used:

  1. Chat applications: Users will feel the delay if the messaging backend routes messages through queues.

  2. Trading: Asset trading needs immediate processing, so running things async would be a problem. These apps need millisecond precision, which message queues can't guarantee.

Overview of Apache Kafka

Kafka is an open-source project from Apache for implementing message queues.

Introduction to Apache Kafka

Apache Kafka acts as the message queue between services, applications, appliances, devices, and more.

When building scalable containerized applications, we use Kubernetes. Kubernetes is one of the ideal places to use event-driven architecture because of its scalable nature.

Kafka can run as a service inside Kubernetes itself, or it can be deployed as its own cluster for higher performance and throughput.

When deploying on the cloud, it is recommended to use a managed service from the provider or Confluent Kafka, since configuration can be an overhead. Like any other infrastructure component, it should be owned and maintained by the org's infrastructure team.

Kafka's architecture

In Kafka, messages are segregated by "topics". Topics are an abstraction of lanes on the queue along which messages pass. Each message/event is tagged with the name of the topic it is sent to, so that subscribers of that topic can pick it up and read it.

Kafka communication happens between two entities: the producer, which triggers the event and thereby originates the message, and the consumer, which reads it. Subscribers are consumers that opt in to read the messages on a topic; every message published to a topic is delivered to all of that topic's subscribers.
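The fan-out rule, where every subscriber of a topic gets its own copy of each message, can be sketched in plain Python with in-memory lists as stand-in inboxes (the topic and consumer names here are made up):

```python
from collections import defaultdict

subscribers = defaultdict(list)  # topic name -> list of subscriber inboxes

def subscribe(topic, inbox):
    subscribers[topic].append(inbox)

def publish(topic, event):
    # every subscriber of the topic receives its own copy of the event
    for inbox in subscribers[topic]:
        inbox.append(event)

billing, audit = [], []
subscribe("orders", billing)
subscribe("orders", audit)
publish("orders", {"order_id": "ord-1"})
print(billing, audit)  # both inboxes received the event
```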

This model is highly scalable in the cloud world simply because there is a central stream of data handled by one entity (Kafka), from which individual services can pick the messages on the topics they want to act upon and do their work.

Kafka, like any message queue, has its own software infrastructure: it is a distributed system for managing events. One main component is ZooKeeper, a distributed-coordination service that maintains the leader of the Kafka cluster. Kafka brokers are the components that actually handle the movement of messages.

AWS provides MSK (Managed Streaming for Apache Kafka), a managed Kafka service that pairs well with workloads on EKS (Elastic Kubernetes Service).

Consumer groups

A consumer group is a way of combining event subscribers into a set that behaves as a single logical consumer: each event delivered to the group is processed by only one of its members.

This helps us avoid consuming the same event twice in the application.

Kubernetes has the concept of replicas, which are used for scaling an application. Replicas of the same consumer service can be placed in a single consumer group, which helps scale the app when the rate of produced events rises above a threshold.

Partitions

Topics are divided into partitions, which enable scalability, fault tolerance, and parallel processing. Each consumer in a group latches onto one or more partitions of a topic.
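A rough sketch of the assignment idea (real Kafka uses configurable assignors such as range, round-robin, and sticky; this toy round-robin version only illustrates that each partition is owned by exactly one consumer in the group):

```python
def assign_partitions(partitions, consumers):
    # each partition goes to exactly one consumer in the group
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# with 1 partition, two of the three consumers sit idle
print(assign_partitions([0], ["c1", "c2", "c3"]))

# with 3 partitions, every consumer gets work
print(assign_partitions([0, 1, 2], ["c1", "c2", "c3"]))
```

This is exactly what the tutorial below demonstrates: with a single partition, only one of the three consumer replicas receives events.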

Designing the architecture

Event-based architecture should be implemented where data trends need attention, speed isn't the main goal, modularity of services is a concern, and services don't care about the response to a request.

The architecture has to be fault tolerant, with defensive code: less aggressive in taking action and more careful in considering that an event can be wrong or missed.
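One way to write such defensive consumer code (a sketch; the required field and the dead-letter list are invented for illustration) is to validate every event and park anything suspicious in a dead-letter store instead of crashing:

```python
import json
import logging

def handle(raw, dead_letter):
    # assume any incoming event can be malformed, wrong, or missing fields
    try:
        event = json.loads(raw)
        if "id" not in event:
            raise ValueError("event is missing required field 'id'")
        return event  # safe to act on
    except (ValueError, TypeError) as exc:
        # don't crash the consumer; park the bad event for later inspection
        logging.warning("skipping bad event: %s", exc)
        dead_letter.append(raw)
        return None

dead = []
print(handle('{"id": 1}', dead))  # a well-formed event is returned
print(handle('not json', dead))   # a malformed event goes to the dead-letter list
```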

Playing around with Kafka in Kubernetes: Step-by-step guide

Let's walk through a small event-driven tutorial.

For this tutorial, start with a kind cluster and deploy Kafka using the Kafka Kube manifests.
Set up two Python streaming services: a producer and a consumer.
The consumer service will run with 3 replicas, demonstrating Kafka's consumer groups in action.

Below are the YAML manifests for Zookeeper and Kafka along with sample Python code (producer/consumer), Dockerfiles, and Kubernetes manifests to deploy with replicas.

Start by creating a K8s cluster; we use kind here:

kind create cluster

Once the cluster is created, verify it:

kubectl get nodes

Arrange your files as follows:

eventdriven
├── consumer
│   ├── consumer.py
│   ├── deploy.yaml
│   └── Dockerfile
├── kafka
│   ├── kafka.yaml
│   └── zookeeper.yaml
├── Makefile
└── producer
    ├── deploy.yaml
    ├── Dockerfile
    └── producer.py

Makefile

PRODUCER_DIR=producer
CONSUMER_DIR=consumer
REGISTRY=eventdriven

.PHONY: all build-producer build-consumer deploy-apps clean-images deploy-kafka

all: build-producer build-consumer deploy-apps

build-images: build-producer build-consumer

build-producer:
    docker build -t $(REGISTRY)/producer $(PRODUCER_DIR)

build-consumer:
    docker build -t $(REGISTRY)/consumer $(CONSUMER_DIR)

load-images-to-kind: build-images
    kind load docker-image $(REGISTRY)/producer
    kind load docker-image $(REGISTRY)/consumer

deploy-apps: deploy-consumer deploy-producer

deploy-producer: load-images-to-kind
    kubectl apply -f $(PRODUCER_DIR)/deploy.yaml

deploy-consumer: load-images-to-kind
    kubectl apply -f $(CONSUMER_DIR)/deploy.yaml

delete-deploy-apps:
    kubectl delete -f $(PRODUCER_DIR)/deploy.yaml
    kubectl delete -f $(CONSUMER_DIR)/deploy.yaml

deploy-zookeeper:
    kubectl apply -f kafka/zookeeper.yaml

deploy-kafka:
    kubectl apply -f kafka/kafka.yaml

delete-kafka:
    kubectl delete -f kafka/zookeeper.yaml
    kubectl delete -f kafka/kafka.yaml

delete-all: delete-deploy-apps delete-kafka

Zookeeper Deployment (zookeeper.yaml):

apiVersion: v1
kind: Service
metadata:
  labels:
    app: zookeeper
  name: zookeeper-service
spec:
  type: ClusterIP
  ports:
    - name: client
      port: 2181
      targetPort: 2181
  selector:
    app: zookeeper
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: zookeeper
  name: zookeeper
spec:
  replicas: 1
  selector:
    matchLabels:
      app: zookeeper
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      containers:
        - name: zookeeper
          image: wurstmeister/zookeeper
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 2181

Kafka Deployment (kafka.yaml):

apiVersion: v1
kind: Service
metadata:
  labels:
    app: kafka
  name: kafka-service
spec:
  ports:
    - port: 9092
  selector:
    app: kafka
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: kafka
  name: kafka
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      hostname: kafka
      containers:
        - name: kafka
          image: wurstmeister/kafka
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 9092
          env:
            - name: KAFKA_BROKER_ID
              value: "1"
            - name: KAFKA_ZOOKEEPER_CONNECT
              value: zookeeper-service:2181
            - name: KAFKA_LISTENERS
              value: PLAINTEXT://:9092
            - name: KAFKA_ADVERTISED_LISTENERS
              value: PLAINTEXT://kafka-service:9092

Python Producer (producer.py):

import sys
from kafka import KafkaProducer
import json
import time
import os
import logging

KAFKA_BROKER = os.environ.get("KAFKA_BROKER", "kafka-service:9092")
TOPIC = os.environ.get("KAFKA_TOPIC", "demo-topic")

# Set up logging to stdout first so kubectl logs can capture it
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

logging.info("Starting Producer...")
producer = KafkaProducer(
    bootstrap_servers=[KAFKA_BROKER],
    value_serializer=lambda v: json.dumps(v).encode("utf-8")  # serialize dicts to JSON bytes
)

i = 0
while True:
    msg = {"number": i}
    producer.send(TOPIC, msg)
    logging.info(f"Produced: {msg}")
    time.sleep(5)
    i += 1

Dockerfile for Producer

FROM python:3.9-slim
WORKDIR /app
COPY producer.py .
RUN pip install kafka-python
CMD ["python", "producer.py"]

Producer deploy.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: producer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: producer
  template:
    metadata:
      labels:
        app: producer
    spec:
      containers:
        - name: producer
          imagePullPolicy: Never
          image: eventdriven/producer:latest
          env:
            - name: KAFKA_BROKER
              value: kafka-service:9092
            - name: KAFKA_TOPIC
              value: demo-topic

Python Consumer (consumer.py):

from kafka import KafkaConsumer
import json
import os
import logging
import sys

KAFKA_BROKER = os.environ.get("KAFKA_BROKER", "kafka-service:9092")
TOPIC = os.environ.get("KAFKA_TOPIC", "demo-topic")
GROUP = os.environ.get("KAFKA_GROUP", "demo-group")

# Set up logging to stdout so kubectl logs can capture it
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

logging.info("Starting Consumer...")
consumer = KafkaConsumer(
    TOPIC,
    group_id=GROUP,
    bootstrap_servers=[KAFKA_BROKER],
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    enable_auto_commit=True
)
for msg in consumer:
    logging.info(f"Consumed: {msg.value}")

Dockerfile for Consumer

FROM python:3.9-slim
WORKDIR /app
COPY consumer.py .
RUN pip install kafka-python
CMD ["python", "consumer.py"]

Consumer deploy.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: consumer
spec:
  replicas: 3
  selector:
    matchLabels:
      app: consumer
  template:
    metadata:
      labels:
        app: consumer
    spec:
      containers:
        - name: consumer
          imagePullPolicy: Never
          image: eventdriven/consumer:latest
          env:
            - name: KAFKA_BROKER
              value: kafka-service:9092
            - name: KAFKA_TOPIC
              value: demo-topic
            - name: KAFKA_GROUP
              value: demo-group

Commands

make deploy-zookeeper

# wait for zookeeper to come up

make deploy-kafka

# these deployments will take time as they need to pull the images

Kafka is now ready.

Bringing up the apps

make deploy-apps

# this takes time as docker needs to build the apps

Once the apps are up, check the logs using:

kubectl logs <pod-name> -f

Column 1: producer logs
Column 2: Consumer 1 logs
Column 3: Consumer 2 logs
Column 4: Consumer 3 logs

As you can see, all events are being read by only the first consumer. This is because demo-topic is created with 1 partition by default, so only one consumer can latch onto it.

To fix this, create 3 partitions. First, exec into the Kafka pod and check the current state of the consumer group:

kubectl exec -it <kafka-pod> -- bash
root@kafka:/# kafka-consumer-groups.sh --describe --group demo-group --bootstrap-server localhost:9092

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                              HOST            CLIENT-ID
demo-group      demo-topic      0          33              34              1               kafka-python-2.2.15-4773fc7e-6f67-4799-9dab-9364e9e8b7aa /10.244.0.111   kafka-python-2.2.15

Now reconfigure the topic to 3 partitions:

root@kafka:/# kafka-topics.sh --alter --topic demo-topic --partitions 3 --bootstrap-server localhost:9092

After around a minute, check the group

root@kafka:/# kafka-consumer-groups.sh --describe --group demo-group --bootstrap-server localhost:9092

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                              HOST            CLIENT-ID
demo-group      demo-topic      2          7               7               0               kafka-python-2.2.15-f0a58b19-5c29-4f66-9ddd-9da4e4d94573 /10.244.0.113   kafka-python-2.2.15
demo-group      demo-topic      0          66              66              0               kafka-python-2.2.15-4773fc7e-6f67-4799-9dab-9364e9e8b7aa /10.244.0.111   kafka-python-2.2.15
demo-group      demo-topic      1          2               2               0               kafka-python-2.2.15-8c5fefd1-d653-4b27-af1c-aa13e6b60032 /10.244.0.110   kafka-python-2.2.15

Events start flowing to the other consumers as well.

Event flow (from the demo video): the producer sends, and Kafka distributes to partition 0, then partition 1, and later partition 2.

Cleaning up everything:

make delete-all

Challenges faced

The tutorial above is a simple way to start seeing events move, but in a big, complex mesh of services, things can get very complicated because of the volume of data moving between services.

A few of the issues, such as:

  1. Event loss

  2. Data corruption

  3. Schema violation

  4. Event Process failure at consumer

can lead to time-consuming debugging.

Kafka's operational maintenance at scale is also a headache, and depending on replication and retention settings, your cloud bills can get huge.

P.S.: Application architecture plays a key role in how smoothly you can deploy and maintain your services on the cloud, and in whether you get peaceful weekends as a software engineer.

Conclusion

Event-driven architecture enables systems to be scalable, resilient, and responsive, making it a popular choice across the tech industry. Its loosely-coupled design is beginner-friendly and future-proof. Adopting this approach helps teams build robust applications that adapt to changing needs.
