What is Debezium? Architecture, Terminology, and Connectors


In the first article of this CDC series, we learned how Change Data Capture (CDC) works and why it matters.

Now let’s dive deeper into Debezium — one of the most popular open-source CDC tools — to see what it is, how it works, and what kind of data it produces.

What is Debezium?

Debezium is an open-source CDC platform built on top of Kafka Connect. It continuously monitors database transaction logs and streams every change (insert, update, delete) into Kafka topics.

Supported databases include:

  • Relational: MySQL, PostgreSQL, SQL Server, Oracle, Db2

  • NoSQL: MongoDB, Cassandra

  • Others: Vitess, Spanner, etc.

Instead of relying on periodic batch jobs, Debezium enables real-time, event-driven pipelines, making it a great fit for analytics, search, and microservices.

How Does Debezium Work?

At a high level (a minimal connector configuration sketch follows these steps):

  1. Debezium connects to a database’s transaction log (e.g., MySQL binlog, Postgres WAL).

  2. It captures changes row by row.

  3. It converts these into structured change events.

  4. Events are published to Kafka topics.

  5. Other systems (apps, warehouses, sinks) consume these events.
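
The steps above are driven by a connector configuration registered with Kafka Connect. Below is a minimal sketch of a MySQL source connector config; the hostnames, credentials, and the dbserver1 topic prefix are placeholder values, and the property names follow recent Debezium releases (older releases used database.server.name instead of topic.prefix):

{
  "name": "inventory-mysql-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",        // placeholder host
    "database.port": "3306",
    "database.user": "debezium",         // placeholder credentials
    "database.password": "dbz",
    "database.server.id": "184054",      // unique ID among binlog clients
    "topic.prefix": "dbserver1",         // prefix for all change topics
    "table.include.list": "inventory.orders,inventory.customers",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.inventory"
  }
}

Posting this JSON to the Kafka Connect REST API (usually on port 8083) starts the connector; from then on, every committed change to the included tables shows up as an event in Kafka.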

Debezium Architecture

[Diagram: data flows from MySQL and PostgreSQL through Debezium source connectors into Apache Kafka, then through JDBC and Elasticsearch sink connectors into a target database and an analytics viewer.]

Database → Debezium Connector (Kafka Connect) → Kafka Topics → Consumers/Sink Connectors

  • Source Database → e.g., MySQL, PostgreSQL

  • Debezium Connector → reads changes from transaction logs (MySQL binlog, PostgreSQL WAL).

  • Kafka Cluster → stores events in topics

  • Consumers/Sinks → JDBC sink connector for a relational database, Elasticsearch, a data warehouse, or microservices

This design makes Debezium scalable and fault-tolerant.

Terminology

  • Connector → plugin that knows how to read changes from a specific DB

  • Source Connector → captures changes (Debezium provides these)

  • Sink Connector → delivers changes to targets (from Kafka Connect ecosystem)

  • Change Event → structured JSON/Avro message containing before/after values

  • Offsets → checkpoints for connector progress in logs

  • Snapshotting → initial dump of existing data before streaming begins

  • Schema History Topic → Kafka topic where Debezium records schema/DDL changes

  • Tombstone Event → null-valued message emitted after a delete so that compacted topics can eventually drop the key (see the sketch below)
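
For example, after a row is deleted, the change event can be followed by a tombstone whose key identifies the row and whose value is null (a simplified sketch, assuming the primary key id is the record key):

Key:   { "id": 101 }
Value: null

With log compaction enabled on the topic, Kafka can then eventually discard every older record with that key.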

Benefits of Using Debezium

  • Near real-time CDC with low latency

  • Works with many databases

  • No changes needed in application code

  • Reliable (offset tracking, schema history, at-least-once delivery with Kafka)

  • Scales easily with Kafka

Features of Debezium

Debezium provides a rich set of features that make it one of the most widely used CDC platforms.

  1. Captures All Data Changes

    • Inserts, updates, and deletes are all captured reliably.

  2. Low Latency, High Efficiency

    • Produces change events with very low delay while avoiding heavy CPU usage (no expensive polling).

  3. No Data Model Changes Required

    • Works by reading the database’s transaction log, so you don’t need to modify existing tables or schemas.

  4. Captures Deletes

    • Supports “tombstone” events to reflect deleted records downstream.

  5. Captures Old State + Metadata

    • Provides both before and after row states.

    • Can include extra metadata such as transaction IDs, user queries, and timestamps (depending on DB).

  6. Advanced Filtering and Transformations

    • Built-in Single Message Transformations (SMTs) allow (see the configuration sketch after this list):

      • Filtering certain records

      • Masking sensitive fields (e.g., PII)

      • Routing records to different Kafka topics

      • Custom message transformations

  7. Fault-Tolerance & Recovery

    • Uses Kafka offsets to resume from the exact point of failure, so no events are lost; delivery is at-least-once, so consumers should tolerate occasional duplicates after a restart.
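
As an illustration of the SMT support in feature 6, here is a hedged configuration sketch that masks a sensitive field and re-routes topics. The transform aliases (mask, route), the email field, and the topic naming are placeholder assumptions; MaskField and RegexRouter are standard Kafka Connect transformations, and these properties would be added to the source connector's config:

"transforms": "mask,route",
"transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
"transforms.mask.fields": "email",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex": "dbserver1\\.inventory\\.(.*)",
"transforms.route.replacement": "cdc.$1"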

Why Log-Based CDC is Better

There are different CDC techniques (polling, triggers, log-based), but log-based CDC — which Debezium uses — is considered the best approach:

  • Low Overhead → No extra load on the database since it reads from the transaction log instead of querying live tables.

  • Complete Change History → Captures all changes, including deletes and before/after values.

  • Reliable & Consistent → Transaction ordering is preserved exactly as it happened in the database.

  • Non-Intrusive → No need to modify application code or database schema.

By comparison:

  • Polling adds query overhead, can miss changes, and introduces latency.

  • Triggers increase write latency and are harder to maintain at scale.

That’s why Debezium’s log-based CDC approach is widely used for real-time data pipelines.

Debezium Topics

When Debezium runs, it creates multiple Kafka topics:

  1. Table-specific topics

    • Each database table gets its own separate topic.

    • Example:

      • dbserver1.inventory.orders → streams changes from orders table

      • dbserver1.inventory.customers → streams changes from customers table

  2. Schema/History topic

    • Example: schema-changes.inventory

    • Stores schema (DDL) changes like ALTER TABLE.

    • Ensures consumers can interpret events even if the table structure evolves.

Sample Change Events (Records)

Let’s say we have a MySQL orders table.

Insert record sample data

{
  "before": null,
  "after": {
    "id": 101,
    "product": "Laptop",
    "amount": 1200
  },
  "source": {
    "db": "ecommerce",
    "table": "orders",
    "ts_ms": 1692956540000
  },
  "op": "c",   // c = create
  "ts_ms": 1692956540500
}

Update record sample data

{
  "before": {
    "id": 101,
    "product": "Laptop",
    "amount": 1200 // old price
  },
  "after": {
    "id": 101,
    "product": "Laptop",
    "amount": 1100 //new price
  },
  "source": {
    "db": "ecommerce",
    "table": "orders"
  },
  "op": "u",   // u = update
  "ts_ms": 1692956540700
}

Delete record sample data

{
  "before": {
    "id": 101,
    "product": "Laptop",
    "amount": 1100
  },
  "after": null,
  "source": {
    "db": "ecommerce",
    "table": "orders"
  },
  "op": "d",   // d = delete
  "ts_ms": 1692956540900
}

Notice how each event has:

  • before → row state before change

  • after → row state after change

  • op → operation type (c, u, d)

  • source → metadata (db, table, timestamp)

This is the heart of how Debezium streams data.
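
One detail worth knowing: the samples above show the logical event payload. With Kafka Connect's default JsonConverter (schemas.enable=true), the message on the wire wraps this payload in a schema/payload envelope, roughly like this simplified sketch:

{
  "schema": { "..." : "..." },   // field type definitions, omitted here
  "payload": {
    "before": null,
    "after": { "id": 101, "product": "Laptop", "amount": 1200 },
    "op": "c"
  }
}

Many sink connectors and the Avro converter handle this envelope transparently, but it matters when you write your own consumers.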

Connectors in Debezium

  • Source Connectors (provided by Debezium):

    • MySQL, PostgreSQL, MongoDB, SQL Server, Oracle, Db2, Cassandra

    • Read changes from transaction logs

  • Sink Connectors (provided by Kafka Connect ecosystem):

    • JDBC Sink → push to another DB

    • Elasticsearch Sink → for search indexing

    • S3 Sink → for archiving raw events

    • Others → Snowflake, BigQuery, etc.

Together, these connectors make Debezium a complete pipeline for database sync and streaming.
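
As a sketch of the sink side, here is a hedged JDBC sink configuration that writes the orders change stream into PostgreSQL. The connection URL, credentials, and topic name are placeholder assumptions; the ExtractNewRecordState transform (shipped with Debezium) flattens the before/after envelope into a plain row before the sink writes it:

{
  "name": "orders-jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "dbserver1.inventory.orders",
    "connection.url": "jdbc:postgresql://postgres:5432/analytics",   // placeholder target
    "connection.user": "postgres",
    "connection.password": "postgres",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "pk.fields": "id",
    "auto.create": "true",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
  }
}

A similar pattern applies to the Elasticsearch and S3 sinks: point the sink at the Debezium topics, unwrap the envelope if needed, and let the sink handle the target system.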

Real-World Use Cases

  • E-commerce → sync orders from MySQL to analytics DB

  • Search → stream product updates into Elasticsearch

  • Microservices → event-driven communication

  • Data migration → low-downtime DB migration

  • Audit Trails → stream all data changes with before and after payload into data lakes or history tables.

Limitations & Considerations

  • Requires a Kafka cluster (extra infra)

  • Snapshotting large tables can be expensive

  • Schema evolution needs planning

  • Sensitive data may need masking/transformations

  • Kafka topic retention must be tuned

Conclusion & Next Steps

Debezium makes CDC practical, reliable, and production-ready.
It turns every row change into a real-time event that downstream systems can consume.

In the next article, we’ll walk through a hands-on example: syncing changes from MySQL → Kafka → PostgreSQL using Debezium and Kafka Connect.
