What is Debezium? Architecture, Terminology, and Connectors


In the first article of this CDC series, we learned how Change Data Capture (CDC) works and why it matters.

Now let’s dive deeper into Debezium — one of the most popular open-source CDC tools — to see what it is, how it works, and what kind of data it produces.

What is Debezium?

Debezium is an open-source CDC platform built on top of Kafka Connect. It continuously monitors database transaction logs and streams every change (insert, update, delete) into Kafka topics.

Supported databases include:

  • Relational: MySQL, PostgreSQL, SQL Server, Oracle, Db2

  • NoSQL: MongoDB, Cassandra

  • Others: Vitess, Spanner, etc.

Instead of relying on periodic batch jobs, Debezium enables real-time, event-driven pipelines, making it a great fit for analytics, search, and microservices.

How Does Debezium Work?

At a high level (a minimal connector configuration sketch follows these steps):

  1. Debezium connects to a database’s transaction log (e.g., MySQL binlog, Postgres WAL).

  2. It captures changes row by row.

  3. It converts these into structured change events.

  4. Events are published to Kafka topics.

  5. Other systems (apps, warehouses, sinks) consume these events.
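
The steps above are driven by a connector configuration registered with Kafka Connect. Below is a minimal sketch of a MySQL source connector config; the hostnames, credentials, and the dbserver1 topic prefix are placeholder values, and the property names follow recent Debezium releases (older releases used database.server.name instead of topic.prefix):

{
  "name": "inventory-mysql-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",        // placeholder host
    "database.port": "3306",
    "database.user": "debezium",         // placeholder credentials
    "database.password": "dbz",
    "database.server.id": "184054",      // unique ID among binlog clients
    "topic.prefix": "dbserver1",         // prefix for all change topics
    "table.include.list": "inventory.orders,inventory.customers",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.inventory"
  }
}

Posting this JSON to the Kafka Connect REST API (usually on port 8083) starts the connector; from then on, every committed change to the included tables shows up as an event in Kafka.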

Debezium Architecture

[Diagram: data flows from MySQL and PostgreSQL through Debezium source connectors into Apache Kafka, then through JDBC and Elasticsearch sink connectors into a target database and an analytics viewer.]

Database → Debezium Connector (Kafka Connect) → Kafka Topics → Consumers/Sink Connectors

  • Source Database → e.g., MySQL, PostgreSQL

  • Debezium Connector → reads changes from transaction logs (MySQL binlog, PostgreSQL WAL).

  • Kafka Cluster → stores events in topics

  • Consumers/Sinks → JDBC sink connector for a relational database, Elasticsearch, a data warehouse, or microservices

This design makes Debezium scalable and fault-tolerant.

Terminology

  • Connector → plugin that knows how to read changes from a specific DB

  • Source Connector → captures changes (Debezium provides these)

  • Sink Connector → delivers changes to targets (from Kafka Connect ecosystem)

  • Change Event → structured JSON/Avro message containing before/after values

  • Offsets → checkpoints for connector progress in logs

  • Snapshotting → initial dump of existing data before streaming begins

  • Schema History Topic → Kafka topic where Debezium records schema/DDL changes

  • Tombstone Event → null-valued message emitted after a delete so that compacted topics can eventually drop the key (see the sketch below)
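
For example, after a row is deleted, the change event can be followed by a tombstone whose key identifies the row and whose value is null (a simplified sketch, assuming the primary key id is the record key):

Key:   { "id": 101 }
Value: null

With log compaction enabled on the topic, Kafka can then eventually discard every older record with that key.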

Benefits of Using Debezium

  • Near real-time CDC with low latency

  • Works with many databases

  • No changes needed in application code

  • Reliable (offset tracking, schema history, at-least-once delivery with Kafka)

  • Scales easily with Kafka

Features of Debezium

Debezium provides a rich set of features that make it one of the most widely used CDC platforms.

  1. Captures All Data Changes

    • Inserts, updates, and deletes are all captured reliably.

  2. Low Latency, High Efficiency

    • Produces change events with very low delay while avoiding heavy CPU usage (no expensive polling).

  3. No Data Model Changes Required

    • Works by reading the database’s transaction log, so you don’t need to modify existing tables or schemas.

  4. Captures Deletes

    • Supports “tombstone” events to reflect deleted records downstream.

  5. Captures Old State + Metadata

    • Provides both before and after row states.

    • Can include extra metadata such as transaction IDs, user queries, and timestamps (depending on DB).

  6. Advanced Filtering and Transformations

    • Built-in Single Message Transformations (SMTs) allow (see the configuration sketch after this list):

      • Filtering certain records

      • Masking sensitive fields (e.g., PII)

      • Routing records to different Kafka topics

      • Custom message transformations

  7. Fault-Tolerance & Recovery

    • Uses Kafka offsets to resume from the exact point of failure, so no events are lost; delivery is at-least-once, so consumers should tolerate occasional duplicates after a restart.
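
As an illustration of the SMT support in feature 6, here is a hedged configuration sketch that masks a sensitive field and re-routes topics. The transform aliases (mask, route), the email field, and the topic naming are placeholder assumptions; MaskField and RegexRouter are standard Kafka Connect transformations, and these properties would be added to the source connector's config:

"transforms": "mask,route",
"transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
"transforms.mask.fields": "email",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex": "dbserver1\\.inventory\\.(.*)",
"transforms.route.replacement": "cdc.$1"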

Why Log-Based CDC is Better

There are different CDC techniques (polling, triggers, log-based), but log-based CDC — which Debezium uses — is considered the best approach:

  • Low Overhead → No extra load on the database since it reads from the transaction log instead of querying live tables.

  • Complete Change History → Captures all changes, including deletes and before/after values.

  • Reliable & Consistent → Transaction ordering is preserved exactly as it happened in the database.

  • Non-Intrusive → No need to modify application code or database schema.

By comparison:

  • Polling adds query overhead, can miss changes, and introduces latency.

  • Triggers increase write latency and are harder to maintain at scale.

That’s why Debezium’s log-based CDC approach is widely used for real-time data pipelines.

Debezium Topics

When Debezium runs, it creates multiple Kafka topics:

  1. Table-specific topics

    • Each database table gets its own separate topic.

    • Example:

      • dbserver1.inventory.orders → streams changes from orders table

      • dbserver1.inventory.customers → streams changes from customers table

  2. Schema/History topic

    • Example: schema-changes.inventory

    • Stores schema (DDL) changes like ALTER TABLE.

    • Ensures consumers can interpret events even if the table structure evolves.

Sample Change Events (Records)

Let’s say we have a MySQL orders table.

Insert record sample data

{
  "before": null,
  "after": {
    "id": 101,
    "product": "Laptop",
    "amount": 1200
  },
  "source": {
    "db": "ecommerce",
    "table": "orders",
    "ts_ms": 1692956540000
  },
  "op": "c",   // c = create
  "ts_ms": 1692956540500
}

Update record sample data

{
  "before": {
    "id": 101,
    "product": "Laptop",
    "amount": 1200 // old price
  },
  "after": {
    "id": 101,
    "product": "Laptop",
    "amount": 1100 //new price
  },
  "source": {
    "db": "ecommerce",
    "table": "orders"
  },
  "op": "u",   // u = update
  "ts_ms": 1692956540700
}

Delete record sample data

{
  "before": {
    "id": 101,
    "product": "Laptop",
    "amount": 1100
  },
  "after": null,
  "source": {
    "db": "ecommerce",
    "table": "orders"
  },
  "op": "d",   // d = delete
  "ts_ms": 1692956540900
}

Notice how each event has:

  • before → row state before change

  • after → row state after change

  • op → operation type (c, u, d)

  • source → metadata (db, table, timestamp)

This is the heart of how Debezium streams data.
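
One detail worth knowing: the samples above show the logical event payload. With Kafka Connect's default JsonConverter (schemas.enable=true), the message on the wire wraps this payload in a schema/payload envelope, roughly like this simplified sketch:

{
  "schema": { "..." : "..." },   // field type definitions, omitted here
  "payload": {
    "before": null,
    "after": { "id": 101, "product": "Laptop", "amount": 1200 },
    "op": "c"
  }
}

Many sink connectors and the Avro converter handle this envelope transparently, but it matters when you write your own consumers.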

Connectors in Debezium

  • Source Connectors (provided by Debezium):

    • MySQL, PostgreSQL, MongoDB, SQL Server, Oracle, Db2, Cassandra

    • Read changes from transaction logs

  • Sink Connectors (provided by Kafka Connect ecosystem):

    • JDBC Sink → push to another DB

    • Elasticsearch Sink → for search indexing

    • S3 Sink → for archiving raw events

    • Others → Snowflake, BigQuery, etc.

Together, these connectors make Debezium a complete pipeline for database sync and streaming.
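
As a sketch of the sink side, here is a hedged JDBC sink configuration that writes the orders change stream into PostgreSQL. The connection URL, credentials, and topic name are placeholder assumptions; the ExtractNewRecordState transform (shipped with Debezium) flattens the before/after envelope into a plain row before the sink writes it:

{
  "name": "orders-jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "dbserver1.inventory.orders",
    "connection.url": "jdbc:postgresql://postgres:5432/analytics",   // placeholder target
    "connection.user": "postgres",
    "connection.password": "postgres",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "pk.fields": "id",
    "auto.create": "true",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
  }
}

A similar pattern applies to the Elasticsearch and S3 sinks: point the sink at the Debezium topics, unwrap the envelope if needed, and let the sink handle the target system.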

Real-World Use Cases

  • E-commerce → sync orders from MySQL to analytics DB

  • Search → stream product updates into Elasticsearch

  • Microservices → event-driven communication

  • Data migration → low-downtime DB migration

  • Audit Trails → stream all data changes with before and after payload into data lakes or history tables.

Limitations & Considerations

  • Requires a Kafka cluster (extra infra)

  • Snapshotting large tables can be expensive

  • Schema evolution needs planning

  • Sensitive data may need masking/transformations

  • Kafka topic retention must be tuned

Conclusion & Next Steps

Debezium makes CDC practical, reliable, and production-ready.
It turns every row change into a real-time event that downstream systems can consume.

In the next article, we’ll walk through a hands-on example: syncing changes from MySQL → Kafka → PostgreSQL using Debezium and Kafka Connect.
