The Change Data Capture performance problem


Many Change Data Capture (CDC) tools propagate changes from sources to targets using an approach like the following:

  • Gather changes from the source dataworld using vendor-specific change capture methods (binlog reader APIs, replication streams, insert/update/delete table triggers, etc.).

  • Stream changes using something like Kafka from the source platform to the target platform.

  • Use a client-side loader that reads the events out of Kafka, converts them to the appropriate SQL DML - typically single-row INSERT/UPDATE/DELETE statements, plus BEGIN/COMMIT as necessary - and runs the statements against the target to apply the changes (a sketch of such a loader follows this list).
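
A minimal sketch of such a loader in Python, assuming the kafka-python client and a DB-API driver (the library choices, topic name, connection string, and event format here are illustrative assumptions, not any particular tool's implementation):

```python
import json

import psycopg2                      # any DB-API driver would do; psycopg2 is just an example
from kafka import KafkaConsumer      # kafka-python client

# Topic name, broker address, connection string, and event format are hypothetical.
consumer = KafkaConsumer("cdc-events", bootstrap_servers="broker:9092")
conn = psycopg2.connect("dbname=target")
cur = conn.cursor()

# One event at a time, one statement per event, single-threaded.
for msg in consumer:
    event = json.loads(msg.value)    # e.g. {"op": "I", "table": "t", "row": {...}, "commit": false}
    if event["op"] == "I":
        cols = ", ".join(event["row"])
        placeholders = ", ".join(["%s"] * len(event["row"]))
        cur.execute(
            f"INSERT INTO {event['table']} ({cols}) VALUES ({placeholders})",
            list(event["row"].values()),
        )
    elif event["op"] == "D":
        cur.execute(f"DELETE FROM {event['table']} WHERE id = %s", [event["row"]["id"]])
    # ... UPDATE handling and error handling omitted for brevity.
    if event.get("commit"):
        conn.commit()                # honor source transaction boundaries on the target
```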

This will “work correctly” in the sense that changes made on the source will make it to the target and be reflected there accurately.

But…

This method has a low performance ceiling for several reasons…

  1. CDC processing on the source is typically single-threaded, particularly if you intend to serialize transactions. To be fair, there isn’t much the CDC software can do about this, as it’s an artifact of the vendor’s CDC implementation.

  2. Data must be formatted into stream-friendly events and written to the streaming system, such as Kafka.

  3. On the client side, the events must be read out of the streaming system, converted to some sort of SQL DML, and written to the target. This is often done one query at a time in a single-threaded fashion.

  4. One complicating factor is that many cloud database engines don’t process single-row change DML efficiently because of their storage design.
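
As a rough, back-of-the-envelope illustration (the numbers here are assumptions, not measurements): if each single-row statement costs on the order of 2-5 ms of network round trip plus execution time, a single-threaded loader tops out at roughly 200-500 changes per second before commit overhead is even counted.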

This may be fine for environments that average a few to a few dozen changes per second. But if you have a very active dataworld with hundreds or more changes per second coming from dozens of active client connections, CDC tools built this way will quickly “fall behind”. And if the source has no “idle periods”, the CDC may fall far enough behind that it simply can’t recover.

Wirekite uses a faster method. We’re still serial on most extracts, as most CDC APIs are fundamentally serialized (and to be transactionally correct on the target, we need some serial unit of work), but we use a much simpler and faster transfer mechanism than event streaming, and we post changes to the target database with multi-row loading operations.
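
To illustrate the multi-row loading idea, here is a minimal Python sketch using a DB-API driver's batch helper (the driver, batch size, table, and column names are assumptions for illustration; this is not Wirekite's actual implementation):

```python
import psycopg2
from psycopg2.extras import execute_values   # expands one INSERT into a multi-row VALUES list

BATCH_SIZE = 5000                             # illustrative; tune for the target engine

def apply_insert_batch(conn, table, columns, rows):
    """Post a batch of captured INSERT rows as a single multi-row statement."""
    with conn.cursor() as cur:
        execute_values(
            cur,
            f"INSERT INTO {table} ({', '.join(columns)}) VALUES %s",
            rows,                             # list of tuples, sent in one round trip
        )
    conn.commit()

# Usage sketch: 'pending' accumulates rows from the change stream until
# BATCH_SIZE is reached (or a batch timeout fires), then is flushed at once.
conn = psycopg2.connect("dbname=target")
pending = [(1, "a", "2024-01-01"), (2, "b", "2024-01-02")]
apply_insert_batch(conn, "customers", ["id", "name", "updated_at"], pending)
```

Batched updates and deletes follow the same pattern: the changed rows are typically staged and applied with one multi-row statement (for example a MERGE or a join against a staging table) rather than one statement per row.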

This approach allows us to process over 180,000 changes per second on some particularly fast data sources and targets, as shown by this benchmark: 10 Million Changes: MySQL to Firebolt.

You won’t typically need this level of speed - few single-instance database engines can handle this much activity - but it’s good to know it’s there if needed…
