What is Debezium? Architecture, Terminology, and Connectors


In the first article of this CDC series, we learned how Change Data Capture (CDC) works and why it matters.
Now let’s dive deeper into Debezium — one of the most popular open-source CDC tools — to see what it is, how it works, and what kind of data it produces.
What is Debezium?
Debezium is an open-source CDC platform built on top of Kafka Connect. It continuously monitors database transaction logs and streams every change (insert, update, delete) into Kafka topics.
Supported databases include:
Relational: MySQL, PostgreSQL, SQL Server, Oracle, Db2
NoSQL: MongoDB, Cassandra
Others: Vitess, Spanner, etc.
Instead of relying on batch jobs, Debezium enables real-time, event-driven pipelines, a perfect fit for analytics, search, and microservices.
How Does Debezium Work?
At a high level, Debezium works like this (a minimal connector config sketch follows the list):
Debezium connects to a database’s transaction log (e.g., MySQL binlog, Postgres WAL).
It captures changes row by row.
It converts these into structured change events.
Events are published to Kafka topics.
Other systems (apps, warehouses, sinks) consume these events.
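To make this concrete, here is a minimal sketch of the kind of configuration you would POST to the Kafka Connect REST API to register a MySQL connector. The hostnames, credentials, and table names are placeholders, and some property names (for example topic.prefix and the schema history settings) differ between Debezium versions, so treat it as an illustration rather than a copy-paste config.
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql", // placeholder host
    "database.port": "3306",
    "database.user": "debezium", // placeholder credentials
    "database.password": "dbz",
    "database.server.id": "184054", // unique ID for this binlog client
    "topic.prefix": "dbserver1", // prefix for all change topics
    "table.include.list": "inventory.orders,inventory.customers",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.inventory"
  }
}
Once registered, the connector takes an initial snapshot (depending on its snapshot settings) and then streams binlog changes into topics such as dbserver1.inventory.orders.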
Debezium Architecture
Database → Debezium Connector (Kafka Connect) → Kafka Topics → Consumers/Sink Connectors
Source Database → e.g., MySQL, PostgreSQL
Debezium Connector → reads changes from transaction logs (MySQL binlog, PostgreSQL WAL).
Kafka Cluster → stores events in topics
Consumers/Sinks → JDBC sink connectors for relational databases, Elasticsearch, data warehouses, or microservices
This design makes Debezium scalable and fault-tolerant.
Terminology
Connector → plugin that knows how to read changes from a specific DB
Source Connector → captures changes (Debezium provides these)
Sink Connector → delivers changes to targets (from Kafka Connect ecosystem)
Change Event → structured JSON/Avro message containing before/after values
Offsets → checkpoints for connector progress in logs
Snapshotting → initial dump of existing data before streaming begins
Schema History Topic → Kafka topic where Debezium records schema/DDL changes
Tombstone Event → a record with the deleted row's key and a null value, emitted after a delete so compacted topics can drop the row (see the sketch below)
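To illustrate the last term: a delete produces a normal change event with op set to "d", followed by a tombstone, which is a record with the same key but a null value so that log compaction can eventually remove the row from the topic. The key below assumes a table keyed on id.
// Kafka record key (identifies the deleted row)
{ "id": 101 }
// Kafka record value of the tombstone
null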
Benefits of Using Debezium
Near real-time CDC with low latency
Works with many databases
No changes needed in application code
Reliable (offset tracking, schema history, at-least-once delivery with Kafka)
Scales easily with Kafka
Features of Debezium
Debezium provides a rich set of features that make it one of the most widely used CDC platforms.
Captures All Data Changes
- Inserts, updates, and deletes are all captured reliably
Low Latency, High Efficiency
- Produces change events with very low delay while avoiding heavy CPU usage (no expensive polling).
No Data Model Changes Required
- Works by reading the database’s transaction log, so you don’t need to modify existing tables or schemas.
Captures Deletes
- Supports “tombstone” events to reflect deleted records downstream.
Captures Old State + Metadata
Provides both before and after row states.
Can include extra metadata such as transaction IDs, user queries, and timestamps (depending on DB).
Advanced Filtering and Transformations
Built-in Single Message Transformations (SMTs) allow:
Filtering certain records
Masking sensitive fields (e.g., PII)
Routing records to different Kafka topics
Custom message transformations (a config sketch follows this list)
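For illustration, here is a hedged sketch of how such transformations might be configured on a connector. ExtractNewRecordState is Debezium's SMT for flattening the before/after envelope, RegexRouter is a standard Kafka Connect SMT for renaming topics, and the regex, topic names, and column name are placeholders. Note that column masking is a connector-level option rather than an SMT.
{
  "transforms": "unwrap,route",
  "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState", // keep only the flattened "after" state
  "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
  "transforms.route.regex": "dbserver1.inventory.(.*)", // placeholder pattern
  "transforms.route.replacement": "ecommerce.$1", // publish to renamed topics
  "column.mask.with.12.chars": "inventory.customers.email" // mask a PII column with 12 '*' characters
}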
Fault-Tolerance & Recovery
- Uses Kafka Connect offsets to resume from the point of failure, ensuring no events are lost (delivery is at least once, so downstream consumers should tolerate occasional duplicates).
Why Log-Based CDC is Better
There are different CDC techniques (polling, triggers, log-based), but log-based CDC — which Debezium uses — is considered the best approach:
Low Overhead → No extra load on the database since it reads from the transaction log instead of querying live tables.
Complete Change History → Captures all changes, including deletes and before/after values.
Reliable & Consistent → Transaction ordering is preserved exactly as it happened in the database.
Non-Intrusive → No need to modify application code or database schema.
By comparison:
Polling adds query overhead, can miss changes, and introduces latency.
Triggers increase write latency and are harder to maintain at scale.
That’s why Debezium’s log-based CDC approach is widely used for real-time data pipelines.
Debezium Topics
When Debezium runs, it creates multiple Kafka topics:
Table-specific topics
Each database table gets its own separate topic.
Example:
dbserver1.inventory.orders → streams changes from the orders table
dbserver1.inventory.customers → streams changes from the customers table
Schema/History topic
Example:
schema-changes.inventory
Stores schema (DDL) changes such as ALTER TABLE statements. Ensures consumers can interpret events even if the table structure evolves.
Sample Change Events (Records)
Let’s say we have a MySQL orders table.
Insert record sample data
{
"before": null,
"after": {
"id": 101,
"product": "Laptop",
"amount": 1200
},
"source": {
"db": "ecommerce",
"table": "orders",
"ts_ms": 1692956540000
},
"op": "c", // c = create
"ts_ms": 1692956540500
}
Update record sample data
{
"before": {
"id": 101,
"product": "Laptop",
"amount": 1200 // old price
},
"after": {
"id": 101,
"product": "Laptop",
"amount": 1100 //new price
},
"source": {
"db": "ecommerce",
"table": "orders"
},
"op": "u", // u = update
"ts_ms": 1692956540700
}
Delete record sample data
{
"before": {
"id": 101,
"product": "Laptop",
"amount": 1100
},
"after": null,
"source": {
"db": "ecommerce",
"table": "orders"
},
"op": "d", // d = delete
"ts_ms": 1692956540900
}
Notice how each event has:
before → row state before change
after → row state after change
op → operation type (c, u, d)
source → metadata (db, table, timestamp)
This is the heart of how Debezium streams data.
Connectors in Debezium
Source Connectors (provided by Debezium):
MySQL, PostgreSQL, MongoDB, SQL Server, Oracle, Db2, Cassandra
These read changes directly from the database's transaction logs.
Sink Connectors (provided by Kafka Connect ecosystem):
JDBC Sink → push to another DB
Elasticsearch Sink → for search indexing
S3 Sink → for archiving raw events
Others → Snowflake, BigQuery, etc.
Together, these connectors make Debezium a complete pipeline for database sync and streaming.
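As an example of the sink side, here is a rough sketch of a Confluent JDBC sink connector configuration that would write the orders topic into a PostgreSQL database. The connection details are placeholders, and the exact properties depend on the sink connector and version you use (Debezium now also ships its own JDBC sink connector). In practice you would usually pair this with the ExtractNewRecordState SMT shown earlier so the sink receives flat rows instead of the before/after envelope.
{
  "name": "orders-jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "dbserver1.inventory.orders",
    "connection.url": "jdbc:postgresql://postgres:5432/analytics", // placeholder target DB
    "connection.user": "postgres",
    "connection.password": "postgres",
    "insert.mode": "upsert", // apply updates as upserts
    "pk.mode": "record_key", // use the Kafka record key as the primary key
    "pk.fields": "id",
    "delete.enabled": "true", // turn tombstones into DELETEs
    "auto.create": "true" // create the target table if it does not exist
  }
}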
Real-World Use Cases
E-commerce → sync orders from MySQL to analytics DB
Search → stream product updates into Elasticsearch
Microservices → event-driven communication
Data migration → low-downtime DB migration
Audit Trails → stream all data changes with before and after payload into data lakes or history tables.
Limitations & Considerations
Requires a Kafka cluster (extra infra)
Snapshotting large tables can be expensive (snapshot behavior is configurable; see the sketch after this list)
Schema evolution needs planning
Sensitive data may need masking/transformations
Kafka topic retention must be tuned
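On the snapshotting point, the initial snapshot is configurable. As a hedged sketch, connector properties along these lines control it (exact names and allowed values vary by connector and Debezium version), and incremental snapshots additionally require a signalling table that you create and register yourself.
{
  "snapshot.mode": "initial", // or a schema-only mode to skip copying existing rows
  "signal.data.collection": "inventory.debezium_signal" // placeholder signal table for incremental snapshots
}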
Conclusion & Next Steps
Debezium makes CDC practical, reliable, and production-ready.
It turns every row change into a real-time event that downstream systems can consume.
In the next article, we’ll walk through a hands-on example: syncing changes from MySQL → Kafka → PostgreSQL using Debezium and Kafka Connect.