Confluent Certified Developer for Apache Kafka (CCDAK) Study Guide

Table of contents
- Background
- Kafka Fundamentals & Architecture
- Producing Messages to Kafka (Producers)
- Consuming Messages from Kafka (Consumers & Consumer Groups)
- Kafka Data Retention and Log Compaction
- Kafka Streams (Streams API)
- ksqlDB (KSQL) – Streaming SQL for Kafka
- Kafka Connect
- Schema Registry and Data Serialization
- Confluent REST Proxy
- Kafka Security (TLS, Authentication, ACLs)
- Monitoring and Troubleshooting Kafka Applications
Background
I recently sat and passed the CCDAK exam and I wanted to share the study guide I compiled and used to pass the exam. I didn’t really use any other resource, except for a few practice questions from various online sources.
This guide covers all objectives and key topics for the CCDAK exam. It provides detailed explanations of Kafka’s architecture, core APIs (Producer, Consumer, Streams, Connect), Confluent platform components (Schema Registry, REST Proxy, ksqlDB), security features, and best practices for development, testing, and troubleshooting. The content is up-to-date (as of 2025) and is organized by high-level exam objectives, so you can use this as a single source to ace the CCDAK exam without needing any other reference.
Kafka Fundamentals & Architecture
Apache Kafka is a distributed event streaming platform built around a publish/subscribe model. At its core, Kafka is essentially a distributed commit log, storing sequences of records in a fault-tolerant, scalable way. Understanding Kafka’s fundamental concepts and architecture is critical:
Events (Records): The basic data units in Kafka, each with a timestamp, an optional key, a value (message payload), and optional headers. Events are immutable and appended to logs.
Topics: Categories or feed names to which records are published. A topic is analogous to a table in a database (but without a fixed schema). Topics are logically separated streams of data.
Partitions: Each topic is split into one or more partitions (ordered, immutable sequences of records). Partitions are the unit of parallelism and data distribution across the cluster. Each partition is an independent log, and records within a partition have a sequential offset.
Offsets: A monotonically increasing integer ID assigned to each record within a partition. Offsets uniquely identify records and preserve their order per partition. Consumers use offsets to track their read position.
Every Kafka cluster consists of brokers (servers). A cluster typically has multiple brokers working together to provide scalability and fault tolerance. Key aspects of the Kafka architecture include how data is stored and replicated, and how producers/consumers interact with brokers:
Brokers and Clusters: A broker is a Kafka server responsible for storing topic partitions and serving read/write requests. Brokers form a cluster and elect a controller (one broker that manages partition leadership and cluster metadata). Clients (producers/consumers) connect to any broker (called a bootstrap broker initially) and are redirected to the leader of the partition they need. For high availability, topics are configured with a replication factor > 1, meaning each partition’s data is replicated to multiple brokers.
Leader and Followers: For each partition, one broker is elected the leader, and the others are followers. All produce (write) and consume (read) operations go through the leader. Followers replicate the leader’s log. This replication is crucial for durability and availability. If a leader broker fails, one of its in-sync follower replicas is automatically promoted to leader to continue serving data.
In-Sync Replicas (ISR): The set of follower replicas that are fully caught-up with the leader’s data. Kafka only considers a write “committed” when all replicas in the ISR have the record. The high watermark is the offset up to which the leader knows that all ISR have replicated, and thus marks the record as committed (readable by consumers).
ZooKeeper vs KRaft (Controller): Older Kafka clusters use ZooKeeper as an external coordination service to store metadata (brokers, topics, configurations) and manage controller election. Newer versions of Kafka introduced KRaft (Kafka Raft) mode, which eliminates ZooKeeper by integrating metadata management into Kafka brokers via a Raft consensus quorum. KRaft uses an internal “metadata log” (a special topic) to replicate cluster metadata across controller nodes. In Kafka 4.0, ZooKeeper support is removed entirely, with Kafka’s built-in controller quorum handling all metadata and controller elections. For the exam, know that ZooKeeper was traditionally critical for Kafka’s control plane, but KRaft is the newer approach: self-contained cluster metadata management within Kafka itself.
Storage Mechanics: Kafka stores records in disk-based logs. Each partition’s log is segmented and persists records sequentially, allowing high-performance disk writes (sequential I/O) and zero-copy reads. Kafka can retain vast amounts of data by configurable retention policies. By default, data is retained for a certain time (e.g. 7 days) or until log size limits, after which old segments are deleted. Kafka can also retain data indefinitely or use log compaction (more on this later) for state-changelog topics. These features enable Kafka to act as a durable storage layer for events.
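To make the topic-level knobs concrete, here is a minimal sketch using the Java AdminClient; the broker address, topic name, and settings are placeholders for illustration, not recommendations:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3, and a 7-day retention override.
            NewTopic orders = new NewTopic("orders", 6, (short) 3)
                    .configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```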
How data flows through Kafka: When producers send messages and consumers fetch them, the brokers handle these requests in a scalable manner:
Producer Requests: Producers send records to a topic (the client library automatically finds the partition leader for the target partition). If a record has a key, Kafka’s default partitioner will hash the key to choose a partition; if no key, records are distributed in a round-robin fashion by the producer. Producers batch records together for efficiency and compression before sending. On the broker side, each incoming batch goes through network threads and I/O threads: the broker appends the batch to the partition log (sequential disk write) and then handles replication to followers. The request may wait in a short-term queue (sometimes called “purgatory”) until replication is done, depending on the acknowledgment settings (see Producer section below).
Consumer Requests: Consumers fetch records by periodically sending fetch requests to brokers, specifying topic, partition, and the next offset they want. The broker’s I/O thread reads from the log (using an index for efficiency) and returns the requested records. If a consumer asks for data beyond the log’s end, the broker can wait (long poll) until new events arrive or until a specified timeout, to efficiently stream data as it arrives. Kafka uses zero-copy transfer for sending data to consumers (avoiding unnecessary data copies in memory), which optimizes throughput.
Reliability and Failover: Kafka’s replication ensures that even if brokers crash, data is not lost as long as at least one replica is intact. If a leader broker goes down, one of the in-sync follower replicas will be chosen as the new leader (this election is managed by the controller, via ZooKeeper or KRaft). Followers that were not fully caught up (not in ISR) are not eligible for leadership to avoid data loss. When a new leader takes over, it may truncate its log to the last committed offset if it had extra uncommitted data, ensuring consistency with what was acknowledged to producers. This mechanism guarantees at-least-once delivery of messages in failure scenarios (duplicates can occur if a producer retried after a crash, but data loss is prevented when correctly configured).
Kafka Cluster Scaling and Advanced Features
Kafka’s design allows horizontal scaling by adding more brokers and more partitions:
Adding Brokers (Cluster Expansion): New brokers can join the cluster to increase capacity. Partitions can be reassigned to new brokers (using the `kafka-reassign-partitions` tool or Confluent’s auto-balancing features) to redistribute load. The exam may expect knowledge that data isn’t automatically moved when a broker is added; an admin action is needed to balance partitions (or use Confluent’s Self-Balancing Clusters in enterprise editions).
Elasticity Considerations: Increasing partitions increases throughput but also has limitations: the number of consumers in a group cannot exceed the number of partitions (since each partition can be consumed by only one group member at a time). Also, significantly increasing the partition count adds overhead (more open files, more metadata). Planning for an optimal partition count is part of Kafka architecture design.
Tiered Storage: A newer feature (production-ready since Kafka 3.9) that offloads older data from brokers’ local disks to a remote storage (e.g., cloud object storage like AWS S3). Tiered Storage addresses limitations of finite local disk size and cost: recent data stays on fast local storage (for quick access), while deep historical data is retained on cheaper remote storage. When consumers need older data, brokers fetch it from remote tier and cache it locally. Note: Tiered storage is not yet widely used in all deployments, but you should be aware of its high-level purpose: enabling virtually unlimited retention at lower cost. (It’s mostly relevant for Kafka operators, but as a developer it’s good to know if asked.)
Multi-Cluster and Disaster Recovery: Kafka supports multi-datacenter setups for fault tolerance and geo-distribution. Techniques include MirrorMaker 2 (an open-source tool based on Connect for replicating topics to another cluster) and Confluent’s Cluster Linking (an enterprise feature for directly linking clusters and replicating topic data in a byte-for-byte manner). MirrorMaker 2 runs as a Connect cluster and can do filtered or selective replication but requires offset translation for active/active scenarios (not automatic out-of-the-box). Cluster Linking creates a read-only mirror of a topic in a target cluster with the same offsets, simplifying consumer failover (this avoids the complexity of offset mapping that MirrorMaker requires). Additionally, Confluent offers Replicator, an enterprise connector for cross-cluster replication with additional integration (like monitoring via Control Center). While these are more admin-focused features, you should recognize their names and purposes (e.g., MirrorMaker2 vs Cluster Linking) in case the exam touches on cross-cluster data movement for disaster recovery.
By understanding the core architecture – brokers, partitions, replication, leaders/followers, and new improvements like KRaft and tiered storage – you have a solid foundation. Next, we dive into developing with Kafka’s APIs, starting with producers and consumers.
Producing Messages to Kafka (Producers)
A Kafka producer is a client application or library that publishes events to Kafka topics. As a CCDAK candidate, you should know how the producer API works, important configuration settings, data serialization, and delivery semantics (e.g., at-least-once, exactly-once). Given you have Kafka experience, we’ll focus on the critical points you must master for the exam:
Producer API Basics: Using the Producer API (available in Java and other languages), you create a `ProducerRecord` specifying a topic, an optional key, and a value (the message data). The producer sends records to Kafka asynchronously for high throughput, typically via a call like `send(record)` in Java. Under the covers, the producer accumulates records into batches per partition to optimize network use. If a key is present, Kafka guarantees that all records with the same key go to the same partition (using the default hash partitioner). If there is no key, the producer by default uses a round-robin or sticky partitioner (in newer Kafka versions) to distribute messages evenly while still benefiting from batching.
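A minimal sketch of that send path in Java (the broker address and topic name are placeholders):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class BasicProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The same key ("user-42") always hashes to the same partition, preserving per-key order.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "user-42", "order-created");
            // send() is asynchronous; the callback fires when the broker acknowledges (or fails).
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Written to %s-%d@%d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        }
    }
}
```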
Acknowledgment (`acks`) Setting: This is one of the most important producer configs for reliability. It controls how many broker acknowledgments the producer waits for before considering a send successful:
`acks=0` – The producer does not wait for any ack. It fires off the record and doesn’t care if it was received (highest throughput, lowest latency, but no durability guarantee – messages can be lost if a broker fails).
`acks=1` – The producer waits for the leader to ack the record (the leader wrote it to its local log). This was the default before Kafka 3.0. It ensures the message is stored on the leader but not necessarily on followers yet; if the leader dies right after, data could be lost (at most one broker had it).
`acks=all` (or `-1`) – The producer waits for the leader and all in-sync replicas (ISR) to ack. This guarantees that the record is on all replicas that are in sync (durable even if the leader crashes, since a follower has it). This is the safest mode (best durability) but can incur slightly higher latency; it is the default in Kafka 3.0+ clients. In practice, `acks=all` combined with a proper `min.insync.replicas` setting (see below) is required for strong durability.
`min.insync.replicas`: A broker/topic config that works with `acks=all`. It specifies the minimum number of in-sync replicas that must acknowledge a write for the leader to consider it successful. For example, in a cluster with replication factor 3, you might set `min.insync.replicas=2`. If two replicas are down (ISR size = 1) and you require 2 acks, then even with `acks=all` the produce will fail until the ISR grows back to at least 2 members. Essentially, `acks=all` + `min.insync.replicas > 1` prevents writes from being acknowledged when the cluster is in a state where it cannot guarantee durability. This is how you enforce no data loss even if a broker is down.
Producer Retries and Delivery Semantics: By default, the producer retries failed send attempts (the `retries` config, which is effectively unlimited by default in recent clients and bounded by `delivery.timeout.ms`). This improves reliability but can lead to duplicate messages if a retry succeeds after a network glitch (the broker may have stored the message but the acknowledgment was lost, so the retry produces a second copy). Kafka producers mitigate this with two key features:
Idempotent Producer: When enabled (`enable.idempotence=true`, the default in modern Kafka clients), the producer is assigned a unique producer ID and attaches sequence numbers to each message. The broker uses these to ensure it never writes the same sequence twice, eliminating duplicates on retry. Idempotent producers thus provide exactly-once insertion into a single partition (provided other conditions hold, such as the producer keeping the same producer ID). Idempotence requires `acks=all` and `max.in.flight.requests.per.connection` of at most 5 (ordering is still preserved thanks to the sequence numbers); the client enforces these settings automatically in idempotence mode.
Transactional Producer: Kafka transactions go a step further – they allow atomic writes to multiple partitions (and topics) and coordinate with consumers so that either all writes succeed and are visible or all are aborted. To use this, a producer must have a `transactional.id` configured, and you use the transaction API (begin transaction, send messages, commit or abort). Transactions guarantee exactly-once processing in an end-to-end sense when combined with read_committed consumers (discussed later). They ensure that if a set of writes needs to be atomic (e.g., a message and its corresponding DB update via a Kafka sink), consumers will see either all of them or none. Internally, Kafka’s transaction coordinator (running on brokers) tracks pending transactions and writes special markers into the log to commit or abort them. For the exam, know that enabling exactly-once semantics (EOS) involves idempotent producers plus transactions, and that it prevents duplicates even on retries and producer restarts, at the cost of some complexity and throughput. A common use case is Kafka Streams or custom apps that need to consume-process-produce without losing or duplicating data, even on failure.
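A sketch of the transactional produce flow; the transactional.id and topic names are placeholders:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TransactionalProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // A stable transactional.id enables transactions (and implies idempotence + acks=all).
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "orders-processor-1");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();          // register with the transaction coordinator
        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("orders", "user-42", "order-created"));
            producer.send(new ProducerRecord<>("audit", "user-42", "order-created"));
            producer.commitTransaction();     // both records become visible atomically
        } catch (KafkaException e) {
            producer.abortTransaction();      // neither record is exposed to read_committed consumers
        } finally {
            producer.close();
        }
    }
}
```

Real code would distinguish fatal errors (such as a fenced producer) from retriable ones, but the begin/commit/abort shape is the same.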
Batching & Compression: The producer batches records into chunks per partition. Two configs control this: `batch.size` (batch buffer size in bytes) and `linger.ms` (how long to wait for more messages before sending a batch). By default, producers send immediately (`linger.ms=0`), but a small linger (e.g., 5–100 ms) can greatly improve throughput by increasing batch size. Batching reduces overhead and allows compression to be effective. Kafka supports compressing batches with algorithms like gzip, LZ4, and zstd (config `compression.type`). Compression is very beneficial for throughput and disk space when messages are text or have repetitive structure.
Serializers (Data Serialization): Producers don’t send Java objects directly – they must serialize keys and values to byte arrays. Serializer implementations (e.g., `StringSerializer`, `IntegerSerializer`, `KafkaAvroSerializer`, etc.) handle this. In practice, many Kafka applications use Avro or JSON for message data to enforce schemas. With Confluent Schema Registry, you’d use the `KafkaAvroSerializer` so that the schema is registered and the message carries a schema ID (consumers then use the `KafkaAvroDeserializer`). The exam will likely expect knowledge of using Avro and Schema Registry (see the Schema Registry section), but fundamentally: the producer must be configured with appropriate key and value serializers. If a question describes strange data on the consumer end, the cause is often a mismatch or misuse of serializers/deserializers.
Partitioning & Order Guarantees: Kafka guarantees order per partition. The producer’s role in this is deciding which partition to send each message to:
By default, if a record has a non-null key, Kafka uses a hash of the key to choose a partition (deterministically, so same key always goes to same partition). This ensures key-based ordering – all messages for a given key arrive to the same partition and thus are strictly ordered.
If the key is null, the producer can either round-robin or use the sticky partitioner (which batches many consecutive messages to one partition then switches) to balance load while maintaining order for the batch. The exact behavior depends on client version (recent clients use sticky partitioner for better batching).
Custom partitioners can be implemented if needed (e.g., to route messages based on some dynamic cluster state or a portion of the key); see the sketch after this list.
Order guarantee: All messages sent by a single producer to a single partition are stored in the order they were sent, as long as `max.in.flight.requests.per.connection=1` or idempotence is enabled (to avoid reordering on retries). If you have multiple producers sending to the same partition (with the same key, for instance), Kafka does not order globally across producers – only each producer’s sequence is preserved. Typically, a single producer instance per keyspace is used to guarantee order for that keyspace.
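To illustrate the custom-partitioner hook mentioned in the list above, here is a sketch; the routing rule (keys with a "vip-" prefix go to partition 0) is invented purely for the example:

```java
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

import java.util.Map;

// Hypothetical rule: pin keys starting with "vip-" to partition 0, hash everything else.
public class VipPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (key instanceof String && ((String) key).startsWith("vip-")) {
            return 0; // dedicated partition for VIP traffic
        }
        // Simple hash for all other keys (the built-in default partitioner uses murmur2).
        return key == null ? 0 : (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void close() { }
}

// Registered on the producer with:
// props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, VipPartitioner.class.getName());
```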
Delivery Semantics Summary: By tuning the above settings, you can achieve different semantics:
At most once: Producer does not retry and possibly uses `acks=0`. If a send fails, the record is dropped. (This is rarely desired; it means maximum speed with a risk of loss.)
At least once (default): Producer with retries (possibly infinite) and `acks=1` or `acks=all`. This might introduce duplicates but won’t drop data – the most common scenario.
Exactly once: Producer with `acks=all`, `enable.idempotence=true` (on by default now), and, if spanning multiple topics/partitions, transactions (`transactional.id`). This ensures no duplicates and atomic multi-partition writes. The trade-off is throughput and complexity, but Kafka’s idempotent producer has very low overhead – so use it when in doubt for critical data.
Producer Performance Tuning: Some configurations to be aware of for throughput and latency tuning (the exam might present a scenario asking which config to adjust):
`linger.ms` – increases latency (waiting time) but improves batch size and therefore throughput.
`batch.size` – if your messages are small and throughput is low, you might increase this to allow bigger batches (default 16 KB; it can be raised to 100–200 KB if needed).
`compression.type` – using compression (e.g., lz4 or zstd) can drastically reduce I/O if messages are text or JSON. It trades CPU for I/O and is often worth enabling.
`acks` and `min.insync.replicas` – as discussed, these trade durability vs latency. Many production setups use `acks=all` for critical data.
`max.in.flight.requests.per.connection` – set to 1 if you need strict ordering with retries (especially if not using idempotence for some reason). Setting it higher can improve throughput by pipelining sends, but can cause out-of-order retries.
`buffer.memory` – total memory buffer for unsent records. If your producer is very fast or the broker is slow, this buffer can fill up and block the sending threads; tune it to prevent stalls.
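A throughput-oriented configuration sketch tying these knobs together; the values are illustrative starting points, not recommendations:

```java
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class ThroughputTunedConfig {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.ACKS_CONFIG, "all");                      // durability first
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);                    // wait up to 20 ms to fill batches
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);            // 64 KB batches
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");          // trade CPU for network/disk I/O
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 64L * 1024 * 1024); // 64 MB buffer for unsent records
        return props;
    }
}
```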
By mastering these producer concepts and configurations, you’ll be equipped to answer exam questions on how to ensure reliable delivery, avoid data loss, and optimize producer performance. Next, let’s move to the consuming side of Kafka.
Consuming Messages from Kafka (Consumers & Consumer Groups)
Kafka consumers read data from topics, typically as part of a consumer group. The consumer API and consumer groups are central to how Kafka achieves scalability and fault-tolerance on the consuming side. Here’s what you need to know:
Consumer Groups and Parallelism: A consumer group is a set of consumers (processes or threads) sharing a common group identifier. The group collectively subscribes to one or more topics. Kafka will assign partitions to consumers in the group such that each partition is consumed by exactly one consumer in the group. This mechanism provides horizontal scalability: if you have more partitions, you can have more consumers in a group sharing the load. If you have more consumers than partitions, some consumers will be idle (since a partition cannot be split among consumers). If you have fewer consumers than partitions, some consumers handle multiple partitions – which is fine, just ensure they can handle the throughput. The rule is #Consumers ≤ #Partitions to get full parallelism. Consumers in different groups are independent; each group gets its own copy of the data (this is how you do broadcast or fan-out consumption by using distinct group IDs).
Partition Assignment and Rebalancing: Kafka handles assignment of partitions to consumers automatically via the group coordinator (a broker designated per group). When a new consumer joins, an existing one leaves, or a topic’s partitions change, the coordinator triggers a rebalance – partitions are redistributed among consumers. Historically, Kafka stopped all consumers in the group during a rebalance (a stop-the-world approach). Newer protocols (since Kafka 2.4) introduced cooperative rebalancing to make it smoother:
Assignment Strategies: Common assignors include Range, RoundRobin, Sticky, and CooperativeSticky.
Range: Assigns each consumer a consecutive range of partitions (e.g., if 3 consumers and 9 partitions, each gets ~3 partitions). It can lead to uneven loads if topic subscription patterns differ.
RoundRobin: Distributes partitions evenly by cycling through consumers. Ensures a balanced load if all consumers subscribe to the same topics.
Sticky: Tries to preserve existing assignments as much as possible to avoid moving partitions between consumers during rebalances. It still does a complete rebalance but keeps it “sticky” to prior assignment to maximize continuity.
CooperativeSticky: Uses an incremental cooperative protocol where rebalances don’t revoke all partitions at once, but rather adjust incrementally. This allows consumers to keep processing some partitions while others are reassigned, reducing pause time. It’s the latest strategy (introduced via KIP-429) and is now the default in Java client when using subscribe-group mode.
The exam may ask which strategy avoids unnecessary movement – answer: Sticky or Cooperative Sticky since they minimize partition shuffling. Also know that cooperative rebalancing allows continuous consumption on partitions that aren’t affected by a rebalance, improving availability.
During a rebalance, Kafka triggers consumers to revoke and re-acquire partitions. This is signaled in the consumer client via the rebalance listener (the `onPartitionsRevoked` and `onPartitionsAssigned` callbacks in the Java client). Developers sometimes need to handle things like committing offsets on revoke to avoid duplicate processing.
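A sketch of such a listener, committing offsets when partitions are revoked (the topic name in the usage comment is a placeholder):

```java
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.util.Collection;
import java.util.Collections;

// Commit current positions before partitions are taken away,
// so the next owner resumes exactly where this consumer stopped.
public class CommitOnRevokeListener implements ConsumerRebalanceListener {
    private final KafkaConsumer<?, ?> consumer;

    public CommitOnRevokeListener(KafkaConsumer<?, ?> consumer) {
        this.consumer = consumer;
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        consumer.commitSync(); // flush offsets for everything processed so far
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Nothing to do in this sketch; per-partition state could be (re)loaded here.
    }
}

// Usage:
// consumer.subscribe(Collections.singletonList("orders"), new CommitOnRevokeListener(consumer));
```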
Consumer Offset Tracking: Consumers track their position in each partition by storing the offset of the next record to consume. Kafka’s default mechanism is to store these offsets in a special internal topic named `__consumer_offsets`, which acts as a partitioned key-value store (keys are group+topic+partition, values are the committed offset).
Auto Commit: By default, a consumer will periodically commit its offsets back to Kafka (`enable.auto.commit=true` with an interval `auto.commit.interval.ms`, typically 5 seconds). Auto-commit means the consumer’s position (latest processed offset + 1) is saved without developer intervention.
Manual Commit: Alternatively, apps can disable auto-commit and manually commit offsets (synchronously or asynchronously) after processing batches of messages. This gives more control – e.g., commit after fully processing a batch to ensure at-least-once processing (process, then commit).
At-Least & At-Most Once Semantics:
If you commit before processing messages (not typical in Kafka), you achieve at-most-once (if processing fails, you’ve already committed, so you skip those records on restart – effectively dropping them).
If you commit after processing (the usual case), you get at-least-once: if the consumer crashes before committing, it will re-process from the last committed offset (duplicates possible, but no data loss).
Exactly-once consumption is achieved at the application level by using transactions or idempotent consumer logic, but by Kafka’s consumer API alone, at-least-once is the default guarantee.
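A sketch of the usual at-least-once loop – disable auto-commit, process, then commit (the broker address, group id, and topic are placeholders):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-processor");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);     // commit manually, after processing
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);        // a crash before commitSync() means reprocessing, not loss
                }
                consumer.commitSync();      // commit only after the whole batch is processed
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("%s-%d@%d: %s%n",
                record.topic(), record.partition(), record.offset(), record.value());
    }
}
```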
The exam might ask about offset retention: Kafka retains committed offsets for a duration (`offsets.retention.minutes`, default 7 days on brokers) after a consumer group becomes inactive. If a group has no active consumers beyond that retention, its offsets may be expired (so if a consumer with that group ID comes back later, it will act like a new group).
auto.offset.reset: This setting controls what a consumer does on its first start when there is no committed offset for a partition (or if the offset is invalid, e.g., beyond the log retention). Common values are `earliest` or `latest`. `earliest` means start from the beginning of the partition (offset 0, or the log start if truncated) – i.e., consume all existing data. `latest` means start from the end (only new messages). By default this is `latest` for the Java client, meaning a new consumer will only get future messages. If an offset is out of range (e.g., the committed offset was deleted due to retention), this policy also kicks in. For example, if a consumer was down too long and its stored offset is no longer valid (older than retention), `auto.offset.reset` decides where to resume.
Flow Control and Polling Loop: Kafka consumers are pull-based. The application typically runs a loop calling `poll()` to fetch available records. Kafka requires that consumers poll at least every `max.poll.interval.ms` (default 5 minutes); otherwise it assumes the consumer is stuck and rebalances its partitions away. Additionally, the consumer sends heartbeats (either in the poll or in a background thread) to the group coordinator at a frequency determined by `heartbeat.interval.ms`; if the coordinator doesn’t receive heartbeats for `session.timeout.ms` (default ~45 s, with a maximum enforced by the broker unless changed), it considers the consumer dead and triggers a rebalance. To avoid this when processing takes time:
One can increase `max.poll.interval.ms` if processing a batch may take longer than the default.
Or use manual partition assignment without a group (not common unless it’s a special case).
Or in Kafka Streams (which builds on consumers), the framework manages heartbeats differently to avoid this scenario.
max.poll.records: This config limits how many records `poll()` returns at a time. Tuning it can help your processing logic (for example, poll 500 records, process, then commit). A smaller batch means less processing time per poll (and thus more frequent polls), whereas a very large batch could take long to process and risk hitting the poll-interval timeout. It’s a knob to balance throughput vs responsiveness.
Consumer Lag: Lag is the difference between the latest produced offset and the consumer’s committed (or current processing) offset. Monitoring lag is crucial: it tells you whether consumers are keeping up. If lag is growing, the consumer is slower than the producer or possibly stuck. Kafka does not automatically throttle producers or consumers if lag grows (that’s up to your design). The exam might not ask about specific monitoring tools, but conceptually, understanding consumer lag is key to troubleshooting throughput issues.
Exactly-Once Processing with Consumers: As noted, a plain consumer committing after processing is at-least-once. If you need exactly-once, one way is to use Kafka Streams or the transactional consume-process-produce pattern. Kafka’s read_committed mode for consumers (enabled by setting `isolation.level=read_committed`) ensures that consumers only read records from committed transactions, skipping those from aborted ones. However, read_committed is mainly relevant when the producers are using transactions. In normal consumer use, the burden of deduplication or exactly-once logic (like writing results to an idempotent store) falls on the application. Important: if the exam asks “how do you ensure a consumer does not reprocess messages after a failure?”, the expected answer may be about committing offsets after processing (accepting that you will reprocess on failure, which is fine if processing is idempotent or tolerable) or using transactions (where the consumer’s progress and the processing output are committed together).
Common Consumer Configs Quick List: Be familiar with these:
`group.id` – identifies the consumer group (must be set for group management).
`enable.auto.commit` – true/false for auto offset committing.
`auto.offset.reset` – earliest/latest (used when there is no committed offset).
`session.timeout.ms` and `heartbeat.interval.ms` – tune the group session; the heartbeat interval is usually about a third of the session timeout.
`max.poll.records` – as above, batch size per poll.
`max.poll.interval.ms` – maximum processing time allowed between polls before the consumer is considered failed.
`isolation.level` – “read_committed” or “read_uncommitted” (the latter is the default and means consume everything). For transactions, use read_committed to avoid seeing uncommitted data.
`partition.assignment.strategy` – if you need a specific assignor; defaults have shifted toward cooperative strategies in new versions.
Rebalance Listeners: Not heavily likely on exam, but know conceptually: you can implement a ConsumerRebalanceListener to handle events of partition revocation and assignment (e.g., to commit offsets on revoke to avoid processing duplicates after rebalancing).
Consumer Example Scenario: Suppose a question describes that after adding a new consumer to a group, all consumers pause for a moment and then resume with different assignments. This is the normal rebalance behavior. A follow-up might ask how to reduce the impact of rebalances – answer: use cooperative sticky assignor (if available) to make rebalances incremental and less disruptive. Or perhaps they ask what happens if a consumer crashes – Kafka detects it via missed heartbeats after the session timeout and reassigns its partitions to others (lag will accumulate until others catch up).
In summary, consumers in Kafka provide scalable, fault-tolerant processing of data. They rely on Kafka to manage group membership and partition assignments. For the exam, focus on understanding consumer groups, offset management, and configuration knobs that affect reliability and ordering (auto commit, offset reset, etc.). Also be prepared for questions on message ordering (Kafka guarantees order only within a partition, not across the whole topic) and how that relates to keys and partitioning strategy.
Kafka Data Retention and Log Compaction
Kafka’s ability to store data is a major differentiator (it’s not just a transient message queue). Retention policies determine how long data is stored, and compaction is an alternate way to store data by key rather than time. Understanding these is important:
Time/Size-Based Retention: By default, Kafka topics are configured to retain messages for a certain duration (e.g., 7 days) or until the log size exceeds a threshold. This is controlled by the topic configs `retention.ms` and `retention.bytes` (broker-wide defaults come from `log.retention.hours`/`log.retention.ms` and `log.retention.bytes`). Kafka brokers run a cleanup thread that deletes old log segments that exceed the retention age, or when the partition size is beyond the bytes limit (whichever condition is met). Deleting a segment simply removes it from disk (once the segment’s max timestamp is older than now minus retention). Consumers that have not yet read those messages (or come back after deletion) will find their offsets out of range, which triggers the `auto.offset.reset` logic discussed earlier. So ensure retention aligns with consumer needs (if you have a consumer that might be down for 2 weeks, a 1-week retention means data will be gone).
Log Compaction: Compaction is an alternative policy (enabled per topic via `cleanup.policy=compact`, or `compact,delete` for topics that do both). Compaction retains the latest record for each key and discards older records with the same key. This is extremely useful for scenarios like changelogs, snapshots, or state store topics (where you only need the current state for each key). For example, a compacted topic could hold database changelog events, and at any time you can reconstruct the current content of the database by reading the topic (since it holds the latest update for each primary key).
Compaction runs in the background on the broker. It doesn’t guarantee immediate removal of old records; rather, it gradually scans and compacts segments. The log is divided into a “head” (recent, not yet compacted) and a “tail” (the older, compacted portion). Kafka will not compact the latest (active) segment, since new records are still being appended to it.
Compaction process (simplified): The broker identifies log segments eligible for compaction (older segments). It creates a new, cleaned segment by going through the old one and retaining only the last update for each key, keeping an in-memory map of keys to latest offsets while cleaning. Once done, it swaps in the compacted segment and removes the old one.
Compaction does not guarantee only one record per key in the whole log – if a key hasn’t been seen in recent segments yet, older duplicates may still exist until those segments are compacted. But eventually only the latest record for each key is kept (plus any records with unique keys).
Tombstones (deletions): If an application wants to delete a key’s value in a compacted topic, it can send a special record with that key and a null value. This is called a tombstone. Compaction will eventually remove the key’s older records, but to avoid removing the tombstone itself too early (before consumers have seen it), the compaction process keeps tombstone records for at least a certain time (`delete.retention.ms`, default 24 hours) before purging the key entirely. This gives consumers time to see the delete marker. So the exam might ask: how do you permanently remove a key from a compacted topic? – Answer: produce a record with a null value (a tombstone); after compaction the older records for that key are removed, and the tombstone itself is removed after `delete.retention.ms`.
Compaction and Transactions: If a topic is both compacted and has transactional records (transaction markers), compaction must also handle those. Essentially, Kafka will not remove transactional markers that are needed for proper reads. Only records from committed transactions can be compacted; aborted transaction records are not exposed to consumers (in read_committed mode) and will eventually be cleaned up. This is a niche detail; just know compaction works with Kafka Streams and their state store changelogs (which use transactions and compaction together).
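Pulling the compaction pieces together, here is a sketch that creates a compacted topic and later deletes a key with a tombstone; the topic name, key, and settings are placeholders:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CompactedTopicExample {
    public static void main(String[] args) throws Exception {
        Properties adminProps = new Properties();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // 1) Create a compacted topic that keeps only the latest value per key.
        try (AdminClient admin = AdminClient.create(adminProps)) {
            NewTopic balances = new NewTopic("account-balances", 6, (short) 3)
                    .configs(Map.of(
                            "cleanup.policy", "compact",
                            "delete.retention.ms", "86400000")); // keep tombstones readable for 24h
            admin.createTopics(List.of(balances)).all().get();
        }

        // 2) Delete the key "acct-123" by producing a tombstone (null value).
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("account-balances", "acct-123", null));
        }
    }
}
```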
Use Cases: Expect a question like “Which Kafka feature allows storing only the latest value for each key?” – the answer is log compaction. Or a scenario: “You have a topic with account balance updates, and you always need the latest balance per account readily available. How to configure the topic?” – answer: make it a compacted topic (so it retains the latest update for each account key indefinitely).
Retention vs Compaction Combined: You can actually have both (`cleanup.policy=compact,delete`), which means Kafka will compact the log but also still enforce time-based retention, even on the latest keys. This is useful if you want to prevent the log from growing unbounded – e.g., keep the last year of updates per key and drop anything older. If both are set, Kafka first deletes entire segments older than the retention time, and within the remaining segments it compacts duplicates.
Transactional Topics & Offsets Topic: It is worth noting that some internal topics are compacted by default. `__consumer_offsets` is compacted (to always keep the latest offset per group-partition key). Kafka Streams uses compacted topics for its state stores. The transaction state topic `__transaction_state` is compacted as well, to maintain the current transaction status per transactional.id. So compaction is an integral part of Kafka’s design for keeping the latest state.
Understanding these data retention mechanisms ensures you can answer questions about how long data stays in Kafka and how Kafka can serve as a reliable storage for state. Remember: by default, Kafka topics expire old data by time, but with compaction, Kafka can retain the latest state per key essentially forever.
Kafka Streams (Streams API)
Kafka Streams is Kafka’s library for building stream processing applications in Java (and Scala). It allows you to consume, process, and produce data in real-time, all within your application, without needing a separate processing cluster (unlike Apache Spark or Flink). For the CCDAK exam, key points include Streams architecture, the types of streams/tables, and how it achieves reliability (plus any DSL basics):
Streams API Overview: Kafka Streams is essentially a client-side library that uses the producer and consumer APIs under the hood to do continuous processing of data. You write a Kafka Streams application, which defines a topology of transformations (using a high-level DSL or the low-level processor API). The app internally manages the subscribing to source topics, processing records as they arrive, and writing results to output topics. It handles aspects like group management, state management, and fault tolerance for you. The benefit is you leverage Kafka’s scalability: you can run multiple instances of your Streams application, and they will form a consumer group sharing the work of processing input partitions.
KStream vs KTable: Two fundamental data abstractions in Kafka Streams:
KStream: Represents an infinite stream of events (similar to a Kafka topic partition). Each record is a self-contained event. If you think in database terms, a KStream is like a table of facts (appends only, never updates in place).
KTable: Represents a table of evolving facts, i.e., an updatable collection of key-value pairs. A KTable is essentially a changelog stream compacted by key – it only keeps the latest value per key. If a KTable is derived from a KStream (through an aggregation), the KTable will continually keep the latest aggregated result per key. In Kafka Streams, a KTable update (for a key) implicitly replaces the old value with the new one for that key (like an UPDATE in a database). Under the hood, KTables are typically backed by compacted topics and local state stores to keep track of current values.
Many exam questions reduce to understanding the difference between streams and tables. For example, if asked “How would you compute a running count per key from a topic?” – you’d use a KTable via `KStream.groupByKey().count()`, which yields a KTable of counts (updated as new events come in). The KTable outputs a new record each time the count for a key changes (with the same key, so old counts can eventually be compacted).
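A sketch of that running count as a complete (if minimal) Streams application; the application id, broker address, and topic names are placeholders:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class RunningCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "page-view-counter"); // also the consumer group id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Running count per key, backed by a local state store plus a compacted changelog topic.
        KTable<String, Long> counts = builder
                .stream("pageviews", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .count();

        // Every count update is emitted downstream as a new record for the same key.
        counts.toStream().to("pageview-counts", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```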
Stateless vs Stateful Operations: Kafka Streams provides a rich DSL:
Stateless transformations (don’t require remembering past data): `map`, `filter`, `flatMap`, etc., which transform records independently.
Stateful transformations (require aggregation or joins): `groupByKey` followed by `aggregate`/`count`/`reduce` to produce tables, windowed aggregations (e.g., counts per one-minute window), and joins:
Stream-Stream Join: You can join two KStreams on key, typically within a time window (since both are infinite, you usually define a window such as “join stream A and B on key where events are within 5 minutes of each other”). This produces a new KStream.
Stream-Table Join: Join a KStream with a KTable (e.g., enrich an event stream with the latest reference data from a table). The table acts like a look-up store; the join emits an output each time a stream event comes, looking up the current table value for that key.
Table-Table Join: Joining two KTables on key essentially behaves like a continually updating join of two datasets: whenever one table’s key updates, the joined table output updates. This produces another KTable.
Windowing: Kafka Streams supports tumbling, hopping, sliding, and session windows for grouping records in time ranges. For example, “tumbling window of 1 minute” to count events per minute.
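For example, a tumbling one-minute count per key might look like this sketch (the topic name and the printout are illustrative):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

import java.time.Duration;

public class ClicksPerMinuteTopology {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Count clicks per pageId in non-overlapping (tumbling) 1-minute windows.
        KTable<Windowed<String>, Long> clicksPerMinute = builder
                .stream("clicks", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
                .count();

        // The result key is Windowed<String>: the original key plus the window boundaries.
        clicksPerMinute.toStream().foreach((windowedKey, count) ->
                System.out.printf("%s @ %s = %d%n",
                        windowedKey.key(), windowedKey.window().startTime(), count));

        return builder.build();
    }
}
```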
Processing Guarantee: Kafka Streams can be configured for at-least-once (default) or exactly-once processing. If you enable exactly-once (EOS) in Streams, it uses producer transactions and consumer read_committed under the hood to ensure that each input record affects the output exactly once, even if retries happen. In code, you set `processing.guarantee="exactly_once_v2"` (or similar, depending on version) in the StreamsConfig. Under the hood, Streams processes records in groups and commits transactions periodically, so that state stores and output topics stay consistent. The exam may not get very detailed here, but it could ask: “How do you achieve exactly-once in Kafka Streams?” – the answer is by enabling processing.guarantee=exactly_once (and Streams uses Kafka’s transactions to do this).
State Stores: Kafka Streams allows you to maintain state, e.g., counters or tables, via state stores (backed by RocksDB by default). Whenever you do an aggregation or join that requires state, Streams creates a state store locally on disk. To make this fault-tolerant, Streams uses changelog topics: every update to the state store is also written to an internal, compacted Kafka topic so that another instance can restore the state if needed. This is why, for example, a `count()` results in a backing store plus a changelog topic that keeps the counts. If an instance fails and its tasks move to another instance, the new instance consumes the changelog to rebuild the store to the latest state.
RocksDB and caching: By default, Streams caches some state in memory and flushes to RocksDB/the changelog periodically, to balance performance vs consistency. The commit interval (e.g., 500 ms) dictates how often the store is flushed and how often output records are emitted for aggregates.
Error Handling & Processing Order: By default, if a Kafka Streams application encounters a serialization exception or some processing error, it might halt the thread or the application (depending on config). Newer versions have options such as `default.deserialization.exception.handler` to continue past corrupt records or send them to a dead-letter topic. This might be too deep for the exam, but just note that bad data can be handled gracefully if configured.
Deployment and Scaling: A Kafka Streams application is typically run in multiple instances (JVMs). They all use the same `application.id` and thus form a consumer group. Kafka will assign partitions (and the corresponding processing tasks) across the instances. As you increase instances, tasks redistribute (with a rebalance). The number of stream tasks is determined by the input topic’s partition count (and the topology structure). For example, if you read from a topic with 10 partitions, you get 10 tasks; with 2 instances, each might run 5 tasks. Streams can also run multiple threads per instance (the `num.stream.threads` config) if you want multithreading within one process, but a common pattern is one thread per process and scaling out by adding processes (for simpler scaling).
Testing Streams: Kafka Streams provides the TopologyTestDriver, a testing utility that runs your topology without a real Kafka cluster. Using TopologyTestDriver, you can feed sample records into your processing logic and read the outputs, all in memory, extremely fast and deterministically. This is great for unit tests of your streaming logic. The test driver simulates time and the state stores so you can test windowing, punctuations, etc. In the exam context, they might ask how to test a Streams app – answer: TopologyTestDriver (and possibly mention using the `TestInputTopic` and `TestOutputTopic` classes to feed and read data in tests); see the sketch at the end of this section.
ksqlDB vs Streams: Sometimes folks confuse them: remember that Kafka Streams is a programming library (Java) for building stream processors, whereas ksqlDB (next section) is an SQL layer on streaming data. ksqlDB actually uses Kafka Streams under the hood to execute queries. So they are related, but one is code and the other is a SQL service. If the exam asks which to use, it depends on whether you want to write code (use Streams) or SQL (use ksqlDB).
In summary, Kafka Streams enables building real-time processing directly on Kafka. Key exam takeaways: difference between streams and tables, how stateful operations work (with state stores and changelogs), and the notion that it uses consumer groups for scaling and transactions for exactly-once support. Also recall the testing and windowing concepts. With this, let’s move to ksqlDB, which will leverage some Streams knowledge but from a SQL angle.
ksqlDB (KSQL) – Streaming SQL for Kafka
ksqlDB (formerly just KSQL) is a SQL query engine for Kafka that allows you to build streaming data pipelines and applications using SQL-like syntax instead of writing code. It’s part of Confluent’s platform and is very relevant for the CCDAK exam, as it demonstrates higher-level Kafka usage. Here are the key points:
What is ksqlDB? – It’s a database-like layer on top of Kafka that lets you create continuous queries. You can define streams and tables in SQL that correspond to Kafka topics and run persistent queries to transform or aggregate streams of data. Think of it as using SQL to do what Kafka Streams does in code. ksqlDB is often run as a separate server (or a cluster of servers) that connects to your Kafka cluster and processes data.
Streams and Tables in ksqlDB: Similar to Kafka Streams API, ksqlDB has the concept of STREAM and TABLE:
A STREAM in ksqlDB is like an unbounded, append-only flow of events (backed by a Kafka topic). In KSQL syntax, `CREATE STREAM pageviews (...) WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');` defines a stream. You can then query it with `SELECT * FROM pageviews EMIT CHANGES;`, which continuously outputs new events.
A TABLE in ksqlDB represents state (the latest value for each key). Tables are typically created by aggregating a stream or reading a compacted topic. Example: `CREATE TABLE user_counts AS SELECT userId, COUNT(*) FROM pageviews GROUP BY userId EMIT CHANGES;`. This creates a Kafka topic for the table and continuously updates counts per userId. In ksqlDB, tables are updated by key – if you query a TABLE, you’ll see the current values (like a snapshot of the latest counts, if you do a pull query).
Push vs Pull Queries:
Push Query: A query that subscribes to a stream or table and continuously outputs results as they arrive. In the KSQL CLI or API, you use `EMIT CHANGES` to indicate a push query. For example, `SELECT userId, count FROM user_counts EMIT CHANGES;` will push every update of the user_counts table to your console (this runs indefinitely until you terminate it). Push queries are useful for monitoring streams or feeding real-time updates to applications.
Pull Query: A request for the current value from a TABLE (since only tables have a notion of current state). Pull queries are like traditional database queries – they return a result and complete. In ksqlDB, you can run `SELECT * FROM user_counts WHERE userId='Bob';` (without EMIT CHANGES) as a pull query, and it will return the current count for Bob at that moment and finish. Pull queries require the table to be materialized (ksqlDB materializes tables internally). They are useful for request-response use cases (e.g., a web service that checks the current account balance from a ksqlDB table).
The exam may well ask the difference: Push queries continually stream changes (on streams or tables) and require `EMIT CHANGES`, while Pull queries are one-time lookups on tables only. Pull queries were introduced more recently (with ksqlDB) to let ksqlDB serve as a lightweight, read-only lookup database for stream processing results.
Common ksqlDB Operations:
Filtering: `CREATE STREAM vip_users AS SELECT * FROM users WHERE level='VIP' EMIT CHANGES;` creates a new stream/topic with only VIP users.
Transformations: You can select and manipulate fields, e.g., `SELECT UCASE(name) FROM users EMIT CHANGES;` to uppercase a field (ksqlDB supports many SQL functions).
Joins: ksqlDB supports stream-stream, stream-table, and table-table joins with SQL syntax. For example, joining a clickstream with a user profile table to enrich events.
Aggregations: Use `GROUP BY` on a stream to aggregate into a table (with functions like COUNT, SUM, AVG, etc.). You must use `EMIT CHANGES` to indicate a continuous aggregation. ksqlDB handles windows via special syntax, e.g., `SELECT count(*) FROM clicks WINDOW TUMBLING (SIZE 5 MINUTES) GROUP BY pageId EMIT CHANGES;` counts clicks per page in 5-minute windows.
User-Defined Functions (UDFs): You can extend ksqlDB with custom functions in Java if needed (maybe not in exam scope, but be aware that ksqlDB is extensible).
Materialization and Connectors: ksqlDB tightly integrates with Kafka Connect – you can create source connectors from within ksqlDB to pull data from external systems into Kafka, or sink connectors to push query results out, using SQL syntax (`CREATE SOURCE CONNECTOR`, etc.). This is fairly advanced, but it shows ksqlDB’s goal of being an all-in-one streaming database platform (ingest, process, serve).
State under the hood: ksqlDB is powered by Kafka Streams. Each KSQL query (especially persistent ones that create streams/tables) is implemented as a Kafka Streams topology under the covers. The ksqlDB servers run these topologies. They use Kafka topics (typically with names prefixed by the ksql service id) to materialize tables and exchange data. For example, when you run `CREATE TABLE agg ...`, it makes a Kafka topic for the table’s changelog and uses an internal Streams job to compute it.
Consistency and Guarantees: ksqlDB queries can be run with exactly-once semantics (since it uses Kafka Streams with EOS if configured). Confluent Cloud’s ksqlDB is usually exactly-once by default. The exam may not go there, but you can mention that ksqlDB inherits Streams’ fault tolerance – if a ksqlDB server dies, another takes over the processing task using the internal topics and state, so queries keep running.
Use cases: The exam might give a scenario: e.g., "We want to quickly detect fraud by correlating a stream of transactions with a table of known bad users. Should we use the consumer API, Streams, or ksqlDB?" – A valid answer: ksqlDB is well-suited for rapid development using SQL. You could create a stream for transactions, a table for bad users, and do a simple join/filter in SQL to flag suspicious transactions. Knowing when to use ksqlDB (quick development, SQL expertise, ad-hoc monitoring dashboards, etc.) vs when to code with Streams (complex logic, part of a larger app) can be useful.
Interactive Development: ksqlDB provides an interactive CLI and also a GUI (in Confluent Cloud) to run queries. So it’s often used by data engineers or analysts to quickly create stream processing logic without Java coding.
Remember for the exam: ksqlDB = SQL on streams/tables in Kafka. It makes things like filtering, joining, aggregating streams very accessible. It’s built on Kafka Streams internally. Concepts to highlight: streams vs tables in SQL context, push vs pull queries, and that it allows creating output streams/tables that correspond to Kafka topics with results. Also recall that ksqlDB, being part of Confluent, might not be available in pure Apache Kafka, but as CCDAK is Confluent’s exam, they include it.
Kafka Connect
Kafka Connect is a framework for streaming data between Kafka and other systems (databases, files, cloud services, etc.) via connectors (plugins that implement the source or sink logic). The CCDAK exam will test knowledge of Connect’s architecture, how to configure connectors, and potentially some specific features like schemas and error handling.
Connect Architecture: Kafka Connect can run in two modes: Standalone and Distributed.
Standalone: All connector tasks run in a single local JVM process (good for simple deployments or dev/testing). Configuration is done via a local properties file and launched via the `connect-standalone` script.
Distributed: Connect runs as a cluster of workers (processes) that coordinate via Kafka. You submit connector configurations via a REST API to one of the workers, and the config (connector name, settings) is stored in an internal Kafka topic (`connect-configs`). The cluster then automatically distributes tasks (parallel execution units) across the available workers. The offset tracking for source connectors is stored in another internal topic (`connect-offsets`), and connector status info in `connect-statuses`. Distributed mode is what you use in production, as it allows scaling out and fault tolerance (if a worker dies, its tasks are re-assigned to others).
Sources vs Sinks: A Source connector pulls data from an external system into Kafka (e.g., a JDBC source connector reads database rows and outputs to Kafka topics). A Sink connector pushes data from Kafka into an external target (e.g., an Elasticsearch sink writes topic events into an Elasticsearch index).
Connector and Task: In Connect’s model, a Connector is a logical job, and it may spawn multiple Tasks to actually move data. For example, a source connector reading from a database might create 4 tasks to parallelize reading 4 tables or partitions of data. The `tasks.max` setting in the connector config is the maximum number of tasks it will use. Kafka Connect will scale up to that many tasks if possible (depending on how the input can be split, or the number of topic partitions for a sink).
Worker: A Connect Worker is a running process (JVM) that runs connectors and tasks. In distributed mode, all workers join a group (using Kafka group coordination) to share the work. They heartbeat, and if one worker goes down, the others take over its tasks. All workers need the same plugins (JARs for connectors) installed.
Configs and Converters: Key configuration aspects:
You configure Connect via a JSON (or properties) that defines the connector class and its specific settings (like connection URLs, topic names, etc.).
Converters: Kafka Connect by default deals with data in a generic format (Connect uses an internal Struct object for records with schemas). Converters are what serialize/de-serialize the data to/from bytes in Kafka. The two main converters are the JSON Converter and Avro Converter (with Schema Registry). For example, a JDBC source connector will produce Connect Structs with schema. To send to Kafka, a converter must turn that into bytes – if using AvroConverter, it will register the schema in Schema Registry and send Avro bytes. If using JSONConverter, it will embed the schema in JSON (or can be schema-less JSON). So for exam: know that Converters are responsible for translating between Kafka Connect’s internal data format and the byte format in Kafka. Using the AvroConverter with Schema Registry is common in Confluent platform (so that all data in Kafka is Avro and schemas are tracked).
Schema Evolution: Connect integrates with Schema Registry for handling schema evolution. If a source connector sees a database schema change (new column), and using AvroConverter, it can register a new schema version. It’s important to set compatibility appropriately. The exam might ask something about how Connect handles schemas – essentially: Connect schemas can evolve and with Schema Registry you can enforce compatibility (backward, forward, etc.). Also, know that sink connectors can automatically convert from Avro/JSON into the target system’s format if properly configured with converters.
Single Message Transforms (SMTs): SMTs are pluggable transformations applied to each message as it passes through Connect. They are simple adjustments like masking a field, inserting metadata (e.g., a timestamp or the topic name into a field), routing to a different topic, etc. They’re configured in the connector config, for example transforms=MaskSSN and then transforms.MaskSSN.type=org.apache.kafka.connect.transforms.MaskField$Value to blank out an SSN field (see the config sketch after this list). For the exam, just recall that SMTs provide lightweight per-message transformation in Connect without needing a separate stream processing job.
Offset Management: Connect tracks the last processed offset for each source in the connect-offsets topic (or a local file in standalone mode). For example, a JDBC source connector remembers the last timestamp or ID it read so it can resume. Sink connectors track the last Kafka offset read for each partition of the input topic (so if rebalanced or restarted, they know where to continue). This is usually automatic.
Fault Tolerance: In distributed mode, tasks are restarted on failure or moved if a worker dies. Connect has configs for how many times to retry a failing task before giving up, etc. If an entire Connect cluster stops, source connectors will continue from their last committed offsets on restart (because of the stored offsets). Sink connectors may replay some recent messages after a failure if not using exactly-once, so design idempotent writes to external systems.
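To make the SMT example above concrete, here is a sketch of those properties as they would appear in a connector config; the field name ssn is illustrative.

```
# Mask the "ssn" field in record values as each message passes through Connect
transforms=MaskSSN
transforms.MaskSSN.type=org.apache.kafka.connect.transforms.MaskField$Value
transforms.MaskSSN.fields=ssn
```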
Exactly-Once Support in Connect: This is a somewhat advanced topic: Connect source connectors from Confluent can leverage Kafka transactions to achieve exactly-once delivery into Kafka for certain connectors. That means a source connector can group the records it produces in a transaction and commit them so that they are not read by consumers until fully written. Also, sink connectors can be configured to read in read_committed mode and use external systems’ transactions. Confluent Platform has introduced exactly-once support in Connect (called EOS Connect) but the specifics may be beyond exam scope. Just be aware that if asked “can Connect guarantee exactly-once?”, the answer is it can for source connectors when using transactions to Kafka (and the connector is designed for it), and generally connectors rely on idempotent or transactional writes on the sink side for EOS.
Common Connectors: You’re unlikely to need to memorize any specific connector, but know a few examples:
FileStream Source/Sink (comes with Apache Kafka, reads from or writes to a file – often used in quick demos).
JDBC Source/Sink (popular, to ingest from relational DBs or write to them).
Debezium CDC sources (to capture database change logs).
Elasticsearch, HDFS, S3 sinks (common sinks).
MQTT source, RabbitMQ, etc. (various).
Connect REST API: In distributed mode, you use the REST API to manage connectors: POST /connectors with JSON to create, GET /connectors to list, etc. (see the curl sketch after this list). An exam question might ask how to deploy connectors – the answer: use the REST API in distributed mode, or a local properties file with the connect-standalone script in standalone mode.
Errors and Dead Letter Queue: Newer Kafka Connect versions allow configuring error tolerance: e.g., if a message in a sink connector can’t be written due to malformed data, you can choose to ignore errors or send problematic records to a dead-letter queue (DLQ) Kafka topic for later analysis. Configs like errors.tolerance=all and errors.deadletterqueue.topic.name=... can be used. Good to know conceptually: Connect can route bad records to a DLQ instead of failing the connector.
Security in Connect: Connect can be configured to connect to a secure Kafka cluster (SSL/SASL) by providing the appropriate client configs in the worker config. Connectors themselves may also need credentials for source/target systems; those can be stored in plaintext or (better) pulled from secure vaults via the ConfigProvider interface (again, advanced).
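A hedged sketch of managing a connector through the distributed REST API with curl. The connector class is Confluent's Elasticsearch sink; the hostnames, topic, and DLQ settings are illustrative.

```
# Create a sink connector, routing bad records to a DLQ topic instead of failing
curl -X POST -H "Content-Type: application/json" http://connect:8083/connectors -d '{
  "name": "orders-es-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "orders",
    "connection.url": "http://elasticsearch:9200",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "orders-dlq"
  }
}'

# List connectors and check the status of one
curl http://connect:8083/connectors
curl http://connect:8083/connectors/orders-es-sink/status
```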
For CCDAK, focus on architecture and usage: a question might describe a scenario of needing to ingest data from a DB and ask what component to use – answer: Kafka Connect with a source connector (JDBC). Or ask how Connect ensures no duplicates – answer: via offset tracking and (if applicable) idempotent writes, plus mention that tasks.max and partitioning influence parallelism. Or a troubleshooting question: e.g., “My connector is running but I see no data in Kafka” – possible answers: check if offset is stuck (maybe initial import done), or wrong converter (data not readable), etc.
Remember: Kafka Connect is the standard way to integrate external systems with Kafka by configuring rather than coding, which makes it very powerful for building pipelines. It’s part of the exam’s scope because it’s a key “developer” tool in the Kafka ecosystem for data integration.
Schema Registry and Data Serialization
The Confluent Schema Registry is a critical component when building Kafka applications that use structured data formats (Avro, JSON Schema, or Protobuf). It manages schemas (data models) for messages, ensuring producers and consumers agree on the data structure, and it enforces compatibility rules as schemas evolve. Here’s what to know:
Role of Schema Registry: In simple terms, Schema Registry is a service that stores a history of schemas (versions) for each subject (which usually corresponds to a topic or a data entity). Producers consult it to register new schemas or get schema IDs, and consumers use it to fetch schemas by ID to deserialize data. This allows data to be sent in a compact binary form (e.g., Avro) without including the full schema every time – instead, a small schema ID is embedded.
Using Avro with Schema Registry: Avro is the most common format with Schema Registry in Kafka (particularly Confluent’s platform). When a producer sends an Avro-encoded message, the Avro serializer contacts Schema Registry to register the schema (if not already) and gets back a schema ID. The serialized message will start with a magic byte (0x0) and then the 4-byte schema ID, followed by the Avro serialized data. The consumer, using the Avro deserializer, will read the ID, fetch the schema from Schema Registry (if not cached), and then decode the data into an Avro GenericRecord or specific record object.
This way, schema changes (like adding a field) can be managed centrally. The Schema Registry will allow or reject new schema versions based on the compatibility rules you configure.
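To illustrate the registry interaction (which the serializer/deserializer normally does for you), here is a sketch of registering a schema and looking one up by ID via the registry's REST API; the host, subject, and schema are placeholders.

```
# Register a schema under the subject for a topic's values (response: {"id": <schema id>})
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema": "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}"}' \
  http://schema-registry:8081/subjects/my-topic-value/versions

# What a consumer's deserializer effectively does: fetch a schema by its ID
curl http://schema-registry:8081/schemas/ids/1
```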
Subjects and Naming Strategies: By default, Schema Registry uses the topic name as the subject under which schemas are versioned. For a topic my-topic, the subject would be my-topic-value (for value schemas) and my-topic-key (if you use Avro for keys as well). The naming strategy can be changed (e.g., to use the full type name of the Avro record as the subject, which allows sharing schemas across topics), but typically it’s the topic name.
Schema Evolution & Compatibility: Schema Registry supports multiple compatibility modes:
Backward (default): A new schema can be used by consumers reading old data – i.e., a consumer with the new schema can read records written with the previous schema version. In practice, this means you can add new optional fields, or remove fields that had a default, etc., and older records can still be parsed with the new schema. Backward compatibility is the default because it ensures you can always “replay” or rewind a topic and consumers (updated to the newest schema) can read from the beginning.
Forward: Opposite of backward – old consumers can read new data. This means new fields must be optional or have defaults (so old schema can ignore them), etc. Forward is used when producers update first and you want old consumers to not break.
Full: Combines both backward and forward – basically new and old schemas must be mutually compatible (usually a strict regime). Often means you can only add optional and remove optional (with default) fields – i.e., changes that neither break forward nor backward. “Full” is a strong guarantee if you have multiple independent producers and consumers upgrading at different times.
Transitive vs Non-Transitive: e.g., Backward vs Backward_Transitive: non-transitive means only comparing against the immediate previous version; transitive means compatible with all previous versions. By default, “backward” is non-transitive (just one version back). Many teams set Backward_Transitive to be safe.
None: No compatibility checks – any schema goes (not recommended unless schema changes are irrelevant or you’re careful out-of-band).
For the exam, know that Backward compatibility (non-transitive) is the default and typically what you use so consumers can read past data. Also note: Schema Registry defaults can be overridden per subject if needed.
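Compatibility is configured globally or per subject via the registry's REST API. A sketch (the subject name is illustrative):

```
# Check and then change the compatibility mode for one subject
curl http://schema-registry:8081/config/my-topic-value
curl -X PUT -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"compatibility": "BACKWARD_TRANSITIVE"}' \
  http://schema-registry:8081/config/my-topic-value
```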
Schema Types Supported: Avro, JSON Schema, and Protobuf. Avro is most mature in Schema Registry usage:
Avro: Schemas are in JSON format. Supports schema references, and robust schema evolution.
JSON Schema: Added later – allows you to enforce JSON structure with Schema Registry; compatibility rules are a bit different (e.g., removing a required field might not be allowed).
Protobuf: You register .proto definitions. Protobuf’s evolution rules differ (you can add fields, but adding a new message type might not be forward compatible, etc.). Best practice is to use backward (or full) compatibility for Protobuf to be safe.
The exam will likely stick to Avro for detailed questions.
Why Use Schema Registry: It prevents data inconsistencies and provides a contract between producers and consumers. Without it, you might have producers send JSON with loose structure, and a change could break consumers unexpectedly. Schema Registry ensures that changes are explicit and controlled. Also, it enables compact binary encoding (Avro, Proto), which is much smaller and faster than raw JSON, while still having self-describing data (via the schema ID).
Schema ID and Encoding: As described, each schema gets an ID (an integer). By default, Schema Registry uses a single global ID space (IDs 1, 2, 3, ... for each new schema across all subjects, though newer Confluent versions support schema contexts for scoping). The message payload includes the ID, so the consumer knows which schema to fetch. If the consumer already has that schema cached, no registry call is needed.
Schema Registry in Development: Typically, you run a Schema Registry service (which itself is a Java process that can be part of the Confluent Platform, or Confluent Cloud provides it as a service). It has a REST API where you can GET/POST schemas. Many exam questions revolve around scenarios:
E.g., Producer fails to send due to Schema being incompatible. What happened? Possibly someone changed the subject’s compatibility to FORWARD and then tried to remove a field without default – Schema Registry would reject that. The producer’s Avro serializer would throw an error.
Or “how can you allow consumers to read older data after a schema update?” – ensure backward compatibility mode (the default).
“What happens if a consumer tries to read a message with an unknown schema ID?” – it will query Schema Registry for the ID. If the schema isn’t found (perhaps registry data lost or different), it can’t deserialize. This typically doesn’t happen if using the same registry service that producers used.
Schema Evolution Example: Suppose the initial schema is {name (string)}. If we add a new field {age (int, optional with default)} in v2, this is backward compatible – new consumers know how to fill in the default for old records. If instead we removed name, that’s not backward compatible (consumers expecting name break on old messages). If we changed a field’s type, that is also likely incompatible. Schema Registry prevents these breaking changes depending on the mode.
Integration with Connect, REST Proxy, KSQL:
Kafka Connect can use Schema Registry via AvroConverter – then the data in Kafka topics is Avro and schemas stored centrally.
REST Proxy can produce or consume in Avro by also interfacing with Schema Registry (you post JSON with a schema or schema ID and the proxy will handle it).
ksqlDB can use Schema Registry for defining its stream/table schemas (especially with Avro and Protobuf topics).
Miscellaneous: Confluent Schema Registry also supports soft deletes of schemas, global mode changes, etc., but those are likely beyond exam scope. One perhaps notable is Schema References – Avro, JSON Schema, and Proto can reference other schemas (like imports). Registry can manage those and ensure referenced schemas are present.
To summarize for the exam: Schema Registry is your source of truth for managing schemas and ensuring compatibility in Kafka. Know the compatibility modes (backward, forward, full) and that backward (a consumer with the new schema can read older messages) is the default and preferred for Kafka replay scenarios. Understand how producers/consumers interact with it (via serializers/deserializers that use schema IDs). This will cover most Schema Registry questions.
Confluent REST Proxy
Confluent REST Proxy is an HTTP service that exposes Kafka’s produce and consume operations via a RESTful API. It allows clients to interact with Kafka using simple HTTP calls, which is useful for languages or environments where a Kafka client library may not be available or convenient. Key points for CCDAK:
Purpose: REST Proxy makes it easier to produce/consume messages from scripts, web applications, or systems where adding the Kafka client jar isn’t ideal. For example, you could POST JSON data to an HTTP endpoint and that becomes a Kafka message, or GET from an endpoint to consume messages in JSON form. It essentially translates HTTP calls to Kafka client actions.
Producing via REST Proxy: You typically do an HTTP POST to http://<proxy>/topics/<topic_name> with a JSON payload containing the messages. The payload format depends on whether you send binary, JSON, or Avro (see the curl sketch below). For example:
If sending JSON (no schema), you might set the header Content-Type: application/vnd.kafka.json.v2+json and send {"records":[{"value": {"foo": "bar"}}]}. The proxy will produce that to the topic (with a JSON serializer under the hood).
If sending Avro, you use application/vnd.kafka.avro.v2+json and include a schema or schema ID. For example, {"value_schema": "{\"type\":\"record\",\"name\":\"MyRecord\",\"fields\":[...]}", "records":[{"value": {...}}]}. The proxy will register the schema (or find the ID) and then produce the Avro bytes.
The proxy returns the offsets (or an error, if any) in a JSON response.
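A sketch of producing JSON records via the proxy with curl; the host and topic name are placeholders.

```
# Produce two JSON records to topic "clicks" through the REST Proxy (v2 API)
curl -X POST http://rest-proxy:8082/topics/clicks \
  -H "Content-Type: application/vnd.kafka.json.v2+json" \
  --data '{"records":[{"key":"user1","value":{"page":"/home"}},{"value":{"page":"/cart"}}]}'
# The response contains the partition and offset for each record (or an error)
```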
Consuming via REST Proxy: The REST Proxy provides two ways:
Consumer Groups API: You create a consumer instance in a group by POSTing to .../consumers/<group_name> with config (like the data format). The proxy responds with base URLs for that consumer instance. Then you GET from .../records to fetch new messages, commit offsets via POST to .../offsets, etc. This mode keeps some state in the proxy (the consumer instance stays alive, and you poll it). This is the recommended approach for sustained consumption (see the curl sketch below).
Simple Consumption API (deprecated): Older versions allowed a one-shot GET from a topic/partition via HTTP without managing a consumer group, but that was less safe and is being phased out.
The consumer API of REST Proxy will return records in JSON, e.g., including key, value, offset, partition.
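A sketch of the consumer-group flow against the v2 API; the group, instance, and topic names are illustrative, and the instance should be deleted when you are done.

```
# 1. Create a consumer instance in group "my-group" that reads JSON
curl -X POST http://rest-proxy:8082/consumers/my-group \
  -H "Content-Type: application/vnd.kafka.v2+json" \
  --data '{"name": "inst1", "format": "json", "auto.offset.reset": "earliest"}'

# 2. Subscribe it to a topic
curl -X POST http://rest-proxy:8082/consumers/my-group/instances/inst1/subscription \
  -H "Content-Type: application/vnd.kafka.v2+json" \
  --data '{"topics": ["clicks"]}'

# 3. Poll for records (repeat as needed)
curl http://rest-proxy:8082/consumers/my-group/instances/inst1/records \
  -H "Accept: application/vnd.kafka.json.v2+json"

# 4. Clean up the consumer instance
curl -X DELETE http://rest-proxy:8082/consumers/my-group/instances/inst1
```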
Scaling and Limitations: The REST Proxy is stateless for produce requests, but for consuming with the group API, the proxy instance actually runs a background KafkaConsumer for each created consumer instance. So if you have many clients, it can put significant load on the proxy. It’s not meant for extremely high-throughput use cases relative to native clients – it’s more for integration convenience.
Typical Uses: Integration with systems like web dashboards, or quick debugging (you can cURL to produce a message or see messages). Also language support – say you have a PHP app; instead of a Kafka client, it could just call REST Proxy.
Versioning and APIs: The content types include version (v2, and there’s a newer v3 API primarily for admin operations). The exam likely won’t dig into specifics, but maybe into what REST Proxy is for or a scenario like “How can a Python script without Kafka libraries send data to Kafka?” – answer: use REST Proxy (just an example, though nowadays there are libs for Python).
Schema Registry integration: As noted, the proxy can register or look up schemas if using Avro/JSON Schema/Protobuf content types, making it seamlessly integrate with Schema Registry.
Security: REST Proxy can be secured via HTTPS and by requiring credentials, and it can connect to a secure Kafka cluster with its own client configs. Possibly not in exam, but know it exists.
Admin via REST Proxy: Confluent v3 REST API (not to confuse with v2 data API) can also do admin operations like list topics, etc., but the primary focus is data.
For CCDAK, one scenario question could be: “Your IoT devices cannot run a Kafka client but can make HTTP calls. How can they send data to Kafka?” – Confluent REST Proxy is the answer. Or a question on how to consume via REST might check that you know it requires creating a consumer instance via the API (and that you should DELETE it when done to clean up).
One more thing: There is also a “Confluent Cloud REST API” which is not the same as REST Proxy (that’s more for management). The exam likely means the Kafka REST Proxy specifically. Just focus on it as an HTTP interface for producing/consuming messages.
Kafka Security (TLS, Authentication, ACLs)
Security is a significant topic: Kafka provides features to secure data in transit, authenticate clients, and authorize access to topics and operations. The exam will test understanding of how to set these up conceptually:
TLS Encryption (SSL): Kafka supports TLS for encrypting data in transit between clients and brokers, and between brokers themselves.
Enabling TLS involves configuring brokers with an SSL listener (via the listeners and security protocol settings). Typically, you create a key pair and certificate for each broker (either self-signed or from an internal CA), and import the CA cert into clients.
Clients then connect using SSL, optionally verifying the broker’s certificate (you can configure a truststore in clients).
This provides encryption and server authentication (clients ensure they talk to the right broker).
Kafka’s config uses the terms ssl.keystore (for the broker’s own keys) and ssl.truststore (for CA certs to trust others).
Note: TLS adds overhead, but ensures confidentiality. On the exam, expect a question like “How do you secure data between producers and Kafka brokers?” – answer: enable TLS on Kafka (SSL protocol) so that producers and consumers use TLS connections (see the client config sketch below).
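A minimal client-side sketch for one-way TLS, assuming the broker exposes an SSL listener; paths and passwords are placeholders.

```
# producer/consumer client properties for TLS-encrypted connections
security.protocol=SSL
ssl.truststore.location=/etc/kafka/secrets/client.truststore.jks   # contains the CA cert
ssl.truststore.password=changeit
# Only needed if the broker requires client certificates (ssl.client.auth=required):
# ssl.keystore.location=/etc/kafka/secrets/client.keystore.jks
# ssl.keystore.password=changeit
```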
Client Authentication: There are a few methods:
SSL Client Auth: The broker can be configured to require clients to present a valid certificate (two-way TLS). If enabled (ssl.client.auth=required), the broker authenticates the client based on its certificate (commonly by checking it against a CA or a list). The client certificate’s principal (like CN=app1) becomes the user name in Kafka for ACL purposes. This is very secure but requires managing client certificates.
SASL (Simple Authentication and Security Layer): Kafka supports SASL mechanisms on top of the network layer. Common SASL methods:
SASL/PLAIN: Username and password authentication. The broker must be configured with a password file or integrate with an external system (like JAAS file listing users, or integrate with LDAP). It’s simple but in plaintext (so typically used with TLS to avoid exposure of creds).
SASL/SCRAM: Uses salted challenge-response (SCRAM-SHA-256/512) for passwords, more secure storage (broker stores salted hashes). This is a recommended mechanism for username/password auth in Kafka. You set up user credentials in ZooKeeper (for ZK mode) or via broker config for KRaft mode.
SASL/GSSAPI (Kerberos): Enterprise integration with Kerberos (Kafka can use Kerberos tickets for auth – used in corporate environments with Active Directory or MIT Kerberos).
SASL/OAUTHBEARER: Token-based auth; allows custom OAuth/JWT tokens to be used. This is newer and allows central token auth.
To use SASL, you configure broker listeners with SASL (e.g., SASL_PLAINTEXT or SASL_SSL), set sasl.enabled.mechanisms, and provide login modules for how the broker verifies credentials (for PLAIN/SCRAM, usually via a JAAS config either in server.properties or a JAAS file).
Clients then provide their username/password or Kerberos config through their own JAAS settings or properties (like sasl.jaas.config) – see the client config sketch below.
The exam might ask which auth method is suitable: e.g., if you want username/password without managing certs, use SASL/SCRAM or PLAIN (prefer SCRAM for security). If you already have Kerberos, use GSSAPI.
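A sketch of client properties for SASL/SCRAM over TLS; the username and password are placeholders.

```
# Client authenticates with username/password (SCRAM) over a TLS connection
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-256
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="alice" password="alice-secret";
ssl.truststore.location=/etc/kafka/secrets/client.truststore.jks
ssl.truststore.password=changeit
```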
ACL Authorization: Kafka’s authorization (enabled by setting an authorizer on the brokers, e.g., kafka.security.authorizer.AclAuthorizer, formerly SimpleAclAuthorizer) uses Access Control Lists (ACLs) to determine which users (principals) can perform which operations on which resources.
ACLs are stored (in ZK mode) in ZooKeeper under /kafka-acl paths, or, in KRaft mode, in the metadata log. They can be managed via the kafka-acls.sh tool or programmatically via the Admin API.
An ACL example: User:Alice has ALLOW permission for READ on Topic XYZ (optionally restricted to specific hosts). Kafka ACLs can apply to Topics, Consumer Groups, the Cluster (e.g., the ability to create topics is a cluster-level permission), and a few other resources such as transactional IDs.
By default, if authorization is enabled and no ACL matches, access is denied. You typically need to create ACLs for every user covering exactly what they need. You can also have wildcard or prefixed ACLs (e.g., allow a user to consume from any topic with the prefix “public_”).
Managing ACLs: kafka-acls.sh --add --allow-principal User:Bob --operation READ --topic TestTopic would allow Bob to read TestTopic. Prefix matching is also possible (e.g., --topic TestTop with --resource-pattern-type prefixed).
The exam might present a scenario: e.g., “A consumer is failing to read messages with an authorization error. What could be wrong?” – most likely the ACLs are not set to allow that consumer’s user. Or “How do you restrict a producer to only write to a specific topic?” – answer: create an ACL allowing WRITE on that topic for that user, and do not grant any wildcard write.
There is also a notion of Super User in Kafka (set via config) which bypasses ACLs – typically not in exam but just in practice so that admins can always have access.
Another nuance: If using consumer groups, to join the group and commit offsets, the consumer’s user needs READ access on the group resource (Group:*). If omitted, you might get errors when committing offsets. So usually, give your consumer user both READ on topic and READ on the group (or wildcard group).
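For example, to let a consumer both read a topic and use its consumer group, you would typically add ACLs along these lines; the principal, topic, and group names are illustrative.

```
# READ on the topic plus READ on the consumer group
# (the --consumer shortcut grants both, plus DESCRIBE on the topic)
kafka-acls.sh --bootstrap-server localhost:9092 --add \
  --allow-principal User:alice \
  --consumer --topic orders --group orders-app
```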
Security for ZooKeeper: If Kafka is using ZooKeeper (not KRaft), securing the connection between Kafka and ZK (via SASL) and protecting ZK data (via ACLs on znodes) is also important. But exam likely focuses on Kafka’s own config, not ZK.
Encryption at Rest: Kafka doesn’t natively encrypt data at rest on disk; one would use OS-level encryption or disk encryption if needed. Confluent has an enterprise feature for transparent data encryption on brokers in some versions, but not common.
Confluent RBAC (Role-Based Access Control): In Confluent enterprise (esp. Confluent Cloud or Platform), there’s an RBAC feature where roles (like ClusterAdmin, DeveloperRead, DeveloperWrite) can be assigned to users/service accounts for resources. This might be more admin exam territory, but just be aware Confluent has an alternative to ACLs for managed installs. For CCDAK, probably not covered deeply; stick to ACLs.
Example scenario to illustrate security:
You want all data encrypted on the wire: enable TLS on all broker listeners (at least the ones clients use, and inter-broker if needed). Provide client certs or at least have them trust the broker cert.
You want to authenticate clients by username/password: enable SASL/SCRAM on the brokers (listeners like SASL_SSL) and create SCRAM credentials for each user (e.g., kafka-configs.sh --alter --add-config 'SCRAM-SHA-256=[password=...]' --entity-type users --entity-name <user>, or via Confluent tools).
You then enable authorization (set authorizer.class.name=kafka.security.authorizer.AclAuthorizer on the brokers and allow.everyone.if.no.acl.found=false to be strict). Then create ACLs: e.g., allow User:Alice to consume TopicX (READ on the topic, READ on the group).
Once set, a client that is not authorized will get an error like “Not authorized to access this resource”.
For the exam, know these steps in concept (don’t need exact commands, but understand the pieces). Possibly a question might be, “What security features does Kafka provide to ensure only authorized clients can publish or read messages?” – ideal answer touches on authentication (SSL certs or SASL), and ACLs for authorization. And “how to encrypt communication?” – enable SSL/TLS.
Monitoring and Troubleshooting Kafka Applications
Operating Kafka and Kafka applications requires monitoring key metrics and knowing how to diagnose issues. While CCDAK is a developer exam (not ops), you are expected to know how to ensure your applications run smoothly and how to detect common problems. Here’s a summary:
Monitoring Kafka Cluster Health: Key things to monitor on the brokers include:
Broker Availability: All expected brokers are up and in the cluster. Tools: check ZooKeeper, or use kafka-broker-api-versions.sh or JMX.
Under-Replicated Partitions (URP): Partitions where followers are lagging behind the leader. Ideally 0 URPs. A sustained URP count indicates a stuck follower or a slow network/disk. Monitor the kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions metric (see the CLI check below).
Consumer Lag: As mentioned, measure how far behind each consumer group is. Confluent Control Center or open-source tools like Burrow can track lag. High lag indicates consumers not keeping up, or possibly stuck.
Throughput metrics: Bytes in/out per broker, messages in/out per second – to understand load. Also producer request rates and consumer fetch rates.
Latency metrics: e.g., request latency, produce throttle time, etc. If latencies spike, it could be an issue with broker performance (GC, disk I/O).
System resources: CPU (especially with compression or heavy GC), disk usage (log dirs filling up is bad; watch broker logs for “No space left on device”), network I/O.
JMX: Kafka exposes a lot of metrics via JMX. Many monitoring setups rely on grabbing these (e.g., via the Prometheus JMX exporter or a Datadog integration).
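A quick way to check for under-replicated partitions from the CLI (the broker address is illustrative):

```
# List only partitions whose ISR is smaller than the replication factor
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
```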
Application Monitoring: For producers, track the rate of retries and errors (e.g., the producer’s record-error-rate metric, or logging when send callbacks return an exception). If producers start seeing TimeoutException on sends, it suggests brokers aren’t acknowledging in time (could be a backlog or ISR issues). For consumers, track how frequently you commit, how long processing takes, and also monitor consumer lag as part of the app’s SLAs.
Common Issues & Troubleshooting:
Consumer Lag continuously increasing: Could be that the consumer is too slow (it may need more instances or threads, or the per-message processing is heavy), or a consumer is stuck (a hung thread). If lag is high but stable (not growing), it may be acceptable – the group is simply running behind by a fixed amount. If it keeps increasing, the group may never catch up; consider adding consumers or optimizing processing. Also check whether the group is frequently rebalancing (visible in the logs), which slows consumption; if so, increase session/poll timeouts or use the cooperative assignor to mitigate thrashing.
Frequent Rebalances: If you see consumers rebalancing often (join/sync group log statements in consumer logs), find out why: possibly one consumer is too slow between polls (increase max.poll.interval.ms or reduce work per poll), or membership is unstable (maybe instances in Docker restarting often – fix the deployment issues).
Producer Failures: If producers log NotEnoughReplicasException or NotEnoughReplicasAfterAppendException, there are not enough in-sync replicas for the acks=all requirement – some brokers might be down or slow. If you see RecordTooLargeException, the message size exceeds the broker’s max.message.bytes or the producer’s max.request.size; either avoid huge messages or increase these configs (keeping the memory impact in mind). A TimeoutException from the producer means it waited delivery.timeout.ms without success – likely a busy broker or network issues.
Throughput Bottlenecks: If producers cannot push as fast as needed, consider enabling compression (to reduce bandwidth), batching (a slightly higher linger.ms), and ensure brokers aren’t throttling due to disk I/O (see the producer tuning sketch below). If consumers are the bottleneck, increase max.poll.records to get more data per poll or scale out consumers.
Memory/GC issues: Kafka brokers use a lot of memory for the OS page cache and heap for indexes, etc. If not tuned, heavy garbage collection can pause a broker (leading to timeouts). Monitoring GC logs and using G1 GC (the default in newer Java) is best practice.
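A sketch of producer settings touching the knobs mentioned above; the values are illustrative starting points, not recommendations.

```
# Producer throughput/durability knobs (illustrative values)
compression.type=lz4        # reduce bandwidth and broker I/O
linger.ms=10                # wait briefly so batches fill up
batch.size=65536            # max bytes per batch (per partition)
acks=all                    # durability; needs enough in-sync replicas
max.request.size=1048576    # raise only if you truly need larger messages
```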
Errors in Kafka logs: Keep an eye on broker logs for recurring errors: e.g., ERROR [ReplicaManager], LogFlushFailed, etc., which might indicate disk issues. Lots of “Leader not available” errors for a topic indicate no leader for a partition (maybe all replicas are down or stuck).
Troubleshooting Connectivity: If clients can’t connect, check advertised.listeners and network settings. Many deployment issues come from Kafka’s listener configuration (advertised hostnames not reachable by clients, etc.).
Testing environment: For development, use tools like kafka-console-producer/consumer to quickly test whether you can send/receive. Or kafkacat (now kcat) – a powerful CLI for producing/consuming and debugging.
Topic Inspection: Use kafka-topics.sh --describe --topic mytopic to see partition leaders and the ISR. If the ISR list is consistently shorter than the replication factor, some replica is down or slow.
Offset Management: Use kafka-consumer-groups.sh --describe --group mygroup to see the current offset, log-end offset, and lag for each partition (see the example below). This is essential for debugging consumption issues.
Data Verification: If you suspect data issues, you can temporarily consume from a topic with the console consumer to peek at records (or use ksqlDB to select from it).
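The describe command mentioned above prints one row per partition, roughly like this (the values and column widths are illustrative; exact columns vary slightly between versions):

```
$ kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group mygroup

GROUP    TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID   HOST        CLIENT-ID
mygroup  orders  0          1500            1600            100  consumer-1-…  /10.0.0.12  consumer-1
mygroup  orders  1          2000            2000            0    consumer-2-…  /10.0.0.13  consumer-2
```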
Testing Strategies: Already discussed how to test in dev (embedded cluster, TopologyTestDriver). Also consider testing error scenarios: e.g., what if one broker goes down – does your producer handle it (it should internally retry and reconnect to new leader)?
While you won’t be asked to interpret lengthy metrics output, expect conceptual questions like “what does it mean if under-replicated partitions is non-zero?” (it’s a reliability risk: some replicas are not fully synced) or “how to check consumer lag?” (by using consumer group describe, or a monitoring tool that tracks offset differences). Also, “how to ensure a Streams app is processing within bounds?” – answer might include monitoring its consumer lag or its output rates, and adjusting parallelism or capacity.
In troubleshooting questions, often the answer is to check logs and metrics, and knowledge of the above typical issues will guide what to check.
Lastly, be aware of software version compatibility – ensure your client and broker versions are compatible. Clients and brokers negotiate API versions, so newer clients generally work with somewhat older brokers (and vice versa), but a client that tries to use a feature the broker doesn’t support will hit errors (for example, exactly-once requires brokers >= 0.11; on older brokers the producer would fail to initialize idempotence).