Why Debezium Alone Is Not Enough for End-to-End Streaming ETL/ELT

Yingjun Wu

When building real-time data platforms, we hear the same question in almost every customer meeting and community discussion:

“Can we just use Debezium to build a complete end-to-end streaming ETL/ELT pipeline?”

Debezium is indeed the de facto standard for Change Data Capture (CDC). In the capture stage it is mature, stable, and widely adopted: it supports the mainstream databases, parses complex log formats, and handles many version differences and edge cases. However, when you try to run the full ETL/ELT process on it - especially in production - you quickly find that it lacks many of the engineering capabilities a production pipeline requires. This is exactly where RisingWave’s state-of-the-art streaming CDC solution comes in.


Limitations of Debezium

Debezium focuses solely on capturing changes from databases and delivering them reliably and in order to downstream systems. This focus makes it excellent at protocol parsing, log handling, and connector coverage. However, it has several inherent limitations.

First, it has no built-in data processing or transformation capabilities. Debezium does not perform SQL projection, cleaning, joins, aggregations, or window operations. It outputs raw change events, which must be consumed by a downstream compute engine to produce analytical results.

Second, it cannot guarantee end-to-end consistency. In scenarios with multi-table or multi-source synchronization, or where strict cross-system transaction consistency is required, event ordering alone is not enough. Without full pipeline checkpointing, offset alignment, and recovery, achieving exactly-once semantics is very difficult.

Third, it lacks efficient write support for data lakes or warehouses. When writing to Apache Iceberg, Hudi, or Delta Lake, Debezium does not handle small file compaction, equality delete merging, or transactional commits - critical factors for both query performance and storage cost.

Finally, its operational and observability features are limited. Although Debezium Server can run standalone, advanced features like memory and backpressure management, cross-cloud schema history management, or real-time DDL detection and synchronization are almost impossible without modifying the source code. Once your requirements go beyond “send data to Kafka,” operational complexity grows quickly.


Why We Chose the Embedded Engine

At RisingWave, we did not rewrite Debezium from scratch - that would be a high-cost, long-term maintenance effort, and we could not quickly match Debezium’s accumulated expertise in protocol parsing, version compatibility, and community maintenance. Nor did we choose Debezium Server, which runs as a black box and offers little room for deep customization.

Instead, we adopted Debezium Embedded Engine mode. This allows us to control the entire CDC lifecycle and inject logic at any point: managing memory and backpressure during ingestion, precisely controlling offsets during recovery, or augmenting and rewriting event data when needed. The embedded mode lets us integrate CDC tightly with RisingWave’s scheduling, checkpointing, and DAG topology, ensuring unified management of offsets and compute state, and directly combining it with our storage, compute, and sink layers.


Our Adaptation Strategy

Rather than rebuilding Debezium, we extended Debezium Embedded Engine with deep customizations. Debezium is already mature in CDC protocol parsing, connector coverage, and community support, so we did not reinvent the wheel. But to make it suitable for end-to-end streaming ETL/ELT, we needed to close its gaps in consistency, performance optimization, schema management, sink optimization, and operational visibility.

We treat Debezium as a stable ingestion layer and integrate it with RisingWave’s scheduling, storage, and compute system, adding controlled, extensible enhancements at every critical stage of the ingestion path. Below are the key improvements and their technical details.

Snapshot Boundaries with Lock-Free Initialization

For Postgres, we read the current Log Sequence Number (LSN) as the snapshot boundary; for MySQL, we record the current binlog file name and position. This ensures a consistent snapshot without holding long read locks. Both snapshot and incremental events are aligned to this boundary so that the final state matches the source at the same logical time.
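To illustrate the boundary mechanics, here is a minimal Python sketch (not RisingWave’s actual code; the function names are ours). A Postgres LSN like `0/16B3748` is a pair of hex words forming a 64-bit WAL position; comparing positions against the captured boundary tells us which incremental events the snapshot already covers.

```python
# Illustrative sketch: capturing a snapshot boundary as a comparable
# position and filtering incremental events against it.

def parse_pg_lsn(lsn: str) -> int:
    """Convert a Postgres LSN like '0/16B3748' into a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def after_boundary(event_lsn: str, boundary_lsn: str) -> bool:
    """Keep only WAL events strictly past the snapshot boundary;
    anything at or before it is already reflected in the snapshot."""
    return parse_pg_lsn(event_lsn) > parse_pg_lsn(boundary_lsn)
```

For MySQL the same comparison would use the (binlog file, position) pair instead of a single integer, but the alignment logic is identical.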

Dual-Channel Synchronization with Idempotent Merge

Once the boundary is set, the snapshot channel and incremental channel run in parallel: the snapshot channel scans static data, and the incremental channel consumes WAL or binlog in real time. We generate idempotent keys using the primary key and position (LSN or binlog offset) to merge the two streams, ensuring correct ordering and no duplicates or gaps. This enables a quick switchover to pure incremental mode.
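The merge can be sketched in a few lines of Python. This is a simplified model, not RisingWave’s implementation: events at or below the boundary are dropped (the snapshot already contains them), and a (primary key, position) idempotency key suppresses duplicate deliveries.

```python
def merge_streams(snapshot_rows, incremental_events, boundary):
    """Idempotently merge a snapshot scan with WAL/binlog events.
    `boundary` is the position captured before the snapshot began."""
    state = {row["pk"]: row for row in snapshot_rows}
    seen = set()  # (pk, position) idempotency keys
    for ev in sorted(incremental_events, key=lambda e: e["pos"]):
        if ev["pos"] <= boundary:
            continue  # already reflected in the snapshot
        key = (ev["pk"], ev["pos"])
        if key in seen:
            continue  # duplicate delivery, e.g. after a retry
        seen.add(key)
        if ev["op"] == "delete":
            state.pop(ev["pk"], None)
        else:
            state[ev["pk"]] = ev["row"]
    return state
```

Once the snapshot channel finishes, `state` matches the source at the latest applied position and the pipeline can switch to pure incremental consumption.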

Single-Table Parallel Backfill

Debezium’s native parallel snapshotting only works across tables, not within a single large table. We added primary-key range splitting, dividing large tables into shards for parallel backfill. The thread concurrency is adjusted dynamically based on the source load, significantly reducing initialization time without overloading the source database.
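The splitting step itself is simple; here is a hedged sketch of the idea (our own helper, not RisingWave code), assuming an integer primary key. Each shard becomes an independent backfill task, and the concurrency controller decides how many run at once.

```python
def split_pk_range(min_pk: int, max_pk: int, shards: int):
    """Split [min_pk, max_pk] into contiguous, non-overlapping shards
    so one large table can be backfilled in parallel."""
    total = max_pk - min_pk + 1
    step = -(-total // shards)  # ceiling division
    ranges = []
    lo = min_pk
    while lo <= max_pk:
        hi = min(lo + step - 1, max_pk)
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges
```

Non-numeric keys need range splitting based on sampled key distributions rather than arithmetic, but the shard-per-task structure is the same.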

Offset and Checkpoint Alignment

We bind source offsets and RisingWave compute checkpoints into atomic transaction units. When the compute graph issues a checkpoint (barrier), it records the current offsets for all tables and commits them together with the compute state. Only after a successful checkpoint commit is the new data visible downstream. In case of failure, we recover from the latest successful checkpoint and have Debezium replay events from the exact offset, enabling end-to-end exactly-once semantics.
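The invariant can be shown with a toy model (a dict stands in for durable transactional storage; the class and method names are illustrative): offsets and compute state either commit together or not at all, so recovery always finds a matching pair to replay from.

```python
class CheckpointStore:
    """Sketch of binding source offsets to compute state atomically.
    A real system writes both to durable storage in one transaction."""
    def __init__(self):
        self.committed = None  # last successful (offsets, state) pair

    def commit(self, offsets: dict, state: dict, ok: bool = True):
        # Only a successful barrier commit makes new data visible;
        # a failed checkpoint leaves the previous pair untouched.
        if ok:
            self.committed = ({**offsets}, {**state})

    def recover(self):
        """Return the offsets to replay from and the matching state."""
        return self.committed
```

Because Debezium replays from the recovered offsets while the engine restores the matching compute state, no event is applied twice and none is skipped.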

Cross-Cloud Schema History Management

We implemented cross-cloud schema history management that works with S3, GCS, Azure Blob, and Alibaba OSS. Schema history is segmented by time or size, with the most recent segments cached in memory. Older segments are loaded on demand and released when idle, eliminating the memory spikes common in long-running jobs.
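The cache behavior is essentially an LRU over segments, sketched below with a pluggable loader (all names are illustrative; the real loader would read from S3, GCS, Azure Blob, or OSS):

```python
from collections import OrderedDict

class SchemaHistoryCache:
    """Sketch: keep only the most recent schema-history segments in
    memory; older segments are fetched on demand and evicted when idle."""
    def __init__(self, loader, capacity=2):
        self.loader = loader        # fetches a segment from object storage
        self.capacity = capacity
        self.cache = OrderedDict()  # segment_id -> contents (LRU order)

    def get(self, segment_id):
        if segment_id in self.cache:
            self.cache.move_to_end(segment_id)  # refresh recency
        else:
            self.cache[segment_id] = self.loader(segment_id)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # release oldest segment
        return self.cache[segment_id]
```

With bounded residency, memory use no longer grows with the total schema history, only with the working set of recent segments.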

Real-Time DDL Detection and Synchronization

For Postgres, we use the replication protocol’s Relation (R) messages to detect schema changes in real time. When new, dropped, or altered columns are detected, the system updates the internal schema immediately, injects a schema-change checkpoint into the compute graph, and synchronizes metadata to downstream data lakes or message systems, ensuring schema consistency across the pipeline.
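The core of reacting to a Relation message is diffing the column list the source reports against the schema currently held, roughly like this (a simplified sketch; real handling must also cover types, nullability, and defaults):

```python
def diff_schema(old_cols, new_cols):
    """Diff two (name, type) column lists to classify a DDL change."""
    old, new = dict(old_cols), dict(new_cols)
    added   = [c for c in new if c not in old]
    dropped = [c for c in old if c not in new]
    altered = [c for c in new if c in old and old[c] != new[c]]
    return added, dropped, altered
```

The resulting change set is what gets injected into the compute graph as a schema-change checkpoint and propagated to downstream sinks.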

TOAST Value Restoration

When a Postgres UPDATE leaves a TOASTed column unchanged, its value is omitted from the WAL, so Debezium emits the placeholder __debezium_unavailable_value in its place. During ingestion, we restore the real values by querying the source database or reading from a local cache, with concurrency limits, retries, and caching in place to ensure completeness without overloading the source.
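The restoration logic looks roughly like this sketch (our own simplification, not RisingWave code; `fetch` stands in for a throttled source-database lookup):

```python
TOAST_PLACEHOLDER = "__debezium_unavailable_value"

def restore_toast(event: dict, cache: dict, fetch):
    """Replace the TOAST placeholder with the real value, preferring a
    local cache and falling back to querying the source database."""
    pk = event["pk"]
    for col, val in event["row"].items():
        if val == TOAST_PLACEHOLDER:
            if (pk, col) not in cache:
                cache[(pk, col)] = fetch(pk, col)  # rate-limited in practice
            event["row"][col] = cache[(pk, col)]
        else:
            cache[(pk, col)] = val  # remember the latest known value
    return event
```

Caching the last seen value per (row, column) means most placeholders are resolved locally, and only genuinely unknown values trigger a source query.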

Iceberg-Optimized Writes

Before writing to Iceberg, we cluster data by partition and sort keys, batching into near-target-size files to minimize small files. Each checkpoint is committed as a single Iceberg transaction, becoming visible only after a successful commit. Background compaction jobs merge small files and equality delete files into data files, reducing read amplification and storage cost. The compaction frequency and concurrency are tunable to balance data freshness against cost.
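The batching step can be sketched as follows (a toy model of the sizing logic only; partition clustering, sorting, and the actual Iceberg commit are out of scope here):

```python
def batch_for_iceberg(rows, target_bytes):
    """Group already-sorted rows into batches close to the target file
    size, so each checkpoint commits few, well-sized data files.
    `rows` is an iterable of (row, estimated_size_in_bytes)."""
    batches, current, size = [], [], 0
    for row, row_bytes in rows:
        if current and size + row_bytes > target_bytes:
            batches.append(current)      # flush a near-target-size file
            current, size = [], 0
        current.append(row)
        size += row_bytes
    if current:
        batches.append(current)          # final partial file
    return batches
```

Each batch then becomes one data file inside the checkpoint’s single Iceberg transaction, which is what keeps small-file counts low between compaction runs.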

End-to-End Observability

We provide comprehensive monitoring across the entire pipeline, including data arrival latency, snapshot and incremental throughput, DDL propagation delay, Iceberg file size distribution, transaction commit time, and compaction progress. These metrics allow operators to quickly locate issues and take corrective actions such as throttling, recovery, or rollback, ensuring stable operations.


Conclusion

Debezium is the de facto standard for CDC, but it is only a starting point. To turn it into a production-ready, end-to-end streaming ETL/ELT pipeline, you must add capabilities in consistency, computation and transformation, sink optimization, schema management, and operational monitoring.

By deeply customizing Debezium Embedded Engine, RisingWave integrates snapshot and incremental synchronization, parallel backfill, offset-checkpoint alignment, cross-cloud schema management, real-time DDL detection, TOAST restoration, Iceberg-optimized writes, and full observability into one cohesive system. The result is a state-of-the-art streaming CDC solution - mature, controllable, extensible, low-latency, and operationally friendly, built on the shoulders of a proven ecosystem.
