Zeroth Commandment: Thou Shall Not Copy

Wirekite

If you’ve written database backends, you’ve learned a crucial design doctrine: Excess Data Copying is Bad.

Wirekite is designed around a related doctrine: “Thou Shall Not Copy (Unnecessarily)”.

If you’re moving data from source to target, you need to do the following at a minimum:

  1. Extract the data from the source database. Some data sources (MySQL/MariaDB and PostgreSQL in particular) can dump data directly from the database to a file. Others (Oracle, Microsoft SQL Server, and many more) require you to stream data from the source database instance into the address space of the client, which then writes the data to a file. The former is a single copy**, while the latter is three copies: network (even for a localhost extract) → client address space → file.

  2. Transfer the data from the source environment to the target environment. In some cases, this is as simple as connecting to the target db and pointing it at the file dumped by the source application. In others - particularly cloud data warehouse targets such as Firebolt or Snowflake - you need to upload the output from (1) above to something like Amazon S3. Assuming the usual case, where the target db instance is on a different physical host than the source db instance, the optimal case involves two copies: a read of the file + a network transfer from the client to the target db instance. The less optimal case involves four copies: reading the file into the data transfer client’s address space + a network transfer + reading the bytes off the network at the receiving end + writing the bytes to remote storage (either an S3 object or a file on a remote compute instance, whether in the cloud or not).

  3. Load the data into the target. This involves getting the data from transient storage into the address space of the target db instance. In some cases, such as MySQL’s LOAD DATA INFILE, the db engine reads the stored file directly, which is a single copy**. In others, such as loading from S3 storage or from an application (i.e., using big INSERT statements or mechanisms such as PostgreSQL COPY FROM STDIN), the data is copied twice more: once into the address space of the client and once across the network.

**We’ve ignored copies inside the source and target DB instances themselves - and a bunch of potential copying done by networking/firewalls/etc - as the number of copies done by these is difficult to determine at the application level.

Depending on the specifics and networking of the source and target, Wirekite’s Extractors, Movers (if needed), and Loaders will execute 3 to 9 copies of the data while doing the data movement.
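
To make the arithmetic concrete, here is a rough tally in Python. It is only one possible reading of the per-stage counts above, not Wirekite's actual accounting: the stage names, the `is_plausible` rule, and the treatment of "no mover" as zero copies are all assumptions made for illustration. Under those assumptions the totals happen to span the same 3-to-9 range.

```python
from itertools import product

# Copies per stage, as read from the three steps above (an assumed interpretation).
EXTRACT = {"server_side_dump": 1, "stream_through_client": 3}
MOVE = {"no_mover": 0, "file_read_plus_network": 2, "upload_to_remote_storage": 4}
LOAD = {"engine_reads_file": 1, "through_client_or_s3": 2}


def is_plausible(move: str, load: str) -> bool:
    # Assumption: the target engine can only read the file directly
    # if a mover has already placed it where the engine can see it.
    return not (move == "no_mover" and load == "engine_reads_file")


def copy_totals() -> list[int]:
    """Return the sorted distinct copy totals over all plausible pipelines."""
    totals = set()
    for extract, move, load in product(EXTRACT, MOVE, LOAD):
        if is_plausible(move, load):
            totals.add(EXTRACT[extract] + MOVE[move] + LOAD[load])
    return sorted(totals)


totals = copy_totals()
print(min(totals), max(totals))  # prints: 3 9
```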

Other tools - particularly those that use Kafka or some other intermediate system as part of the data transfer process - will make many more copies of the data, especially if they do row-by-row data conversions. If they parse at the row and column level, there could be several memory copies during parsing, plus copies involved in building an intermediate format such as JSON or Parquet, plus at least some extra copying to persistent storage if they use a “database-backed” transfer system like Kafka.

On the target side, these tools have to convert from the intermediate format to something that the database instance can ingest. This involves several additional copies while parsing the JSON or Parquet into in-memory structures and reformatting these into output suitable for target ingestion.
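
The difference can be sketched with a toy example. The Python below is a simplified illustration, not a model of any particular tool: `pass_through` and `via_json` are hypothetical names, and the "stages" it records are only the points where this script builds a new in-memory representation of the payload. It moves the same CSV bytes two ways: once untouched, and once through a row-by-row JSON round trip like the one described above.

```python
import csv
import io
import json
import shutil


def pass_through(src: bytes):
    """Move the bytes unchanged: one buffer-to-buffer copy."""
    out = io.BytesIO()
    shutil.copyfileobj(io.BytesIO(src), out)
    return out.getvalue(), ["copy bytes"]


def via_json(src: bytes):
    """Round-trip the same data through a row-level JSON intermediate format."""
    stages = []
    text = src.decode("utf-8")
    stages.append("decode bytes to text")
    rows = list(csv.reader(io.StringIO(text)))
    stages.append("parse rows and columns")
    payload = json.dumps(rows)
    stages.append("encode rows as JSON")
    rows_back = json.loads(payload)  # what the target side would do
    stages.append("decode JSON into rows")
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows_back)
    stages.append("re-serialize for target ingestion")
    return buf.getvalue().encode("utf-8"), stages


sample = b"1,alice\n2,bob\n"
direct_out, direct_stages = pass_through(sample)
json_out, json_stages = via_json(sample)
print(len(direct_stages), len(json_stages))
```

Both paths deliver identical bytes, but the conversion path builds several more intermediate representations of the same data to get there, and that is before counting any persistence an intermediate system might add.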

The simplicity of our data flow - determined by source and target capabilities as well as network topology - is a big reason Wirekite benchmarks so much faster than many other data migration tools.

Another big reason is multithreaded extract and load, but that’s a topic for another blog post…
