When you have millions of INSERT statements in a file...
Let's say you have a script of one million INSERT statements with literal values. Of course, this is a bad idea. SQL is a language, not a data exchange format. You should have a file to import with COPY, with all values in CSV format for example. Or at least, if you are a big fan of INSERT statements, put thousands of row values in each one. Or prepare the statement once with parameters and execute it for each row.
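For example, either of these (a minimal sketch, using the same demo table that is created later in this post) parses far less:
-- one INSERT with many row values: parsed once for many rows
insert into demo (key,value) values (1,42),(2,7),(3,13);
-- or a prepared statement: parsed once, then executed with parameters
prepare ins(int,int) as insert into demo (key,value) values ($1,$2);
execute ins(1,42);
execute ins(2,7);
deallocate ins;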
However, this is the kind of thing we encounter in real life. I still remember 20 years ago when I got to look at a Mobile Telecom Billing application where the vendor specification required 6 large servers for a system that had at most one thousand post-paid customers to bill. Call Data Records came in a very efficient proprietary binary format from the network Mobile Switching Centre. They had a C program to decode it into... plain text INSERT statements to be "ingested" with sqlplus 😱
It is hard to imagine a worse design, and that was for the most critical part of the system, as it required near real-time processing. Of course, performance was poor. And you can imagine that there was no proper error handling either. And no security, with all call records in plain text. But they sold it to people who do not look at how it works, easily satisfied by nice PowerPoints. Today, I'm convinced that, whatever the reputation of the vendor, you should not put your data on software that is not open source. Look at the code, look at the design, and you will get a good idea of the quality of the software.
Back to our technical workaround, I'm generating one million INSERT statements:
for i in {1..1000000}
do
echo "insert into demo (key,value) values ($i,$RANDOM);"
done > inserts.sql
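The first lines of the generated file look like this (the values come from $RANDOM, so yours will differ):
head -3 inserts.sql
insert into demo (key,value) values (1,18467);
insert into demo (key,value) values (2,6334);
insert into demo (key,value) values (3,26500);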
PostgreSQL
Loading them as-is with psql is not very efficient because it has to send the commands one by one, parse each of them, and commit each row:
time psql <<SQL
drop table if exists demo;
create table demo (key int, value int);
\set QUIET on
\i inserts.sql
\set QUIET off
select count(*),sum(value) from demo;
\q
SQL
Here is the result, in 21 minutes, and my PostgreSQL database is local. It would be even worse with network roundtrips:
CREATE TABLE
count | sum
---------+-------------
1000000 | 16382993084
(1 row)
real 21m22.156s
I can get it much faster by sending them in a single transaction. By default, psql is in autocommit mode, which means that each statement runs in its own transaction. You get out of this by starting a transaction yourself:
time psql <<SQL
begin transaction;
drop table if exists demo;
create table demo (key int, value int);
\set QUIET on
\i inserts.sql
\set QUIET off
select count(*),sum(value) from demo;
commit;
\q
SQL
There, even if the rows are sent one by one, all are part of the same transaction, and the load takes only one minute:
BEGIN
DROP TABLE
CREATE TABLE
count | sum
---------+-------------
1000000 | 16382993084
(1 row)
COMMIT
real 1m13.517s
user 0m5.841s
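Note that psql can do this wrapping itself: the --single-transaction (-1) option wraps a script run with -f in a single BEGIN/COMMIT, and \set AUTOCOMMIT off achieves the same from within psql. A quick sketch of both, assuming an empty demo table already exists:
# same effect as the explicit begin/commit above
time psql -q --single-transaction -f inserts.sql
# or let psql open the transaction implicitly and commit at the end
psql <<SQL
\set AUTOCOMMIT off
\i inserts.sql
commit;
SQL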
I can even avoid the many roundtrips by enclosing this in a procedural block. The BEGIN keyword has another meaning here: it starts the PL/pgSQL block, not a transaction. There's no need to control the transaction from the client, as the whole block is sent as a single auto-commit statement:
cat > inserts.cmd <<SQL
do 'begin
$(cat inserts.sql)
commit;
end;';
SQL
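The generated inserts.cmd is simply all the INSERT statements wrapped into an anonymous PL/pgSQL block, something like this (abridged, the values come from $RANDOM):
do 'begin
insert into demo (key,value) values (1,18467);
insert into demo (key,value) values (2,6334);
...
insert into demo (key,value) values (1000000,15724);
commit;
end;';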
time psql <<SQL
drop table if exists demo;
create table demo (key int, value int);
\i inserts.cmd
select count(*),sum(value) from demo;
\q
SQL
I save the roundtrips, still run it as one transaction, and get it to run in 40 seconds:
DROP TABLE
CREATE TABLE
DO
count | sum
---------+-------------
1000000 | 16382993084
(1 row)
real 0m40.664s
YugabyteDB
OK, now my goal is to ingest into YugabyteDB. I will have some network calls because my database is on another node (you can't expect a distributed database to be local, as it has multiple nodes). I tried the same, but it took far too long:
time psql postgres://yugabyte:yugabyte@yb1.pachot.net:5433 <<SQL
drop table if exists demo;
create table demo (key int, value int);
\i inserts.cmd
select count(*),sum(value) from demo;
\q
SQL
I mentioned the cost of parsing each statement to prepare it, and this is worse in a distributed database because the metadata (aka dictionary, aka catalog, aka system information) is shared across the nodes. And, while this runs, DropCachedPlan shows up in perf (a reminder that perf shows only the CPU work, not the RPC waits).
Well, I canceled it. Row-by-row inserts are bad in a monolithic database, and even worse in a distributed one. Let's see the solutions, or workarounds, without having to modify the inserts.sql file.
YugabyteDB - Temporary Table
As the inserts into the monolithic PostgreSQL don't take too long, I tried the following: insert into a temporary table (local to the node I'm connected to), then CREATE TABLE ... AS SELECT or INSERT ... SELECT from it into my distributed table.
time psql postgres://yugabyte:yugabyte@yb1.pachot.net:5433/yugabyte <<SQL
drop table if exists demo_temp;
drop table if exists demo;
create temporary table demo_temp (key int, value int);
alter table demo_temp rename to demo;
\set QUIET on
\i inserts.sql
\set QUIET off
alter table demo rename to demo_temp;
create table demo as select * from demo_temp;
select count(*),sum(value) from demo;
\q
SQL
This is again too long, because the problem is not in the distributed storage. When we have a batch of inserts, the rows are optimally sent to each node and processed asynchronously. Here, the problem is the many small SQL statements to parse.
YugabyteDB - COPY
In YugabyteDB as in PostgreSQL, for fast ingest of data, the right tool is COPY. But I'm too lazy to format those inserts into a CSV. As it is quite fast to load into a local PostgreSQL database, as I did above, I'll generate the COPY commands with pg_dump:
pg_dump --table demo --data-only > inserts.dmp
Without the --inserts option, the generated file uses COPY FROM STDIN with TSV-formatted lines until the \. terminator.
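For illustration, the data section of inserts.dmp looks roughly like this (abridged, and the values will differ since they come from $RANDOM):
COPY public.demo (key, value) FROM stdin;
1	18467
2	6334
...
1000000	15724
\.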
No statement to parse, no dictionary to read, this is efficient:
time psql postgres://yugabyte:yugabyte@yb1.pachot.net:5433/yugabyte <<SQL
drop table if exists demo;
create table demo (key int, value int);
\i inserts.dmp
\c
select count(*),sum(value) from demo;
\q
SQL
Here is the result:
CREATE TABLE
SET
SET
SET
SET
SET
set_config
------------
(1 row)
SET
SET
SET
SET
COPY 1000000
You are now connected to database "yugabyte" as user "yugabyte".
count | sum
---------+-------------
1000000 | 16382993084
(1 row)
real 0m59.139s
This is about 1 minute, not bad compared to the local PostgreSQL run, given that I'm inserting into a Replication Factor 3 cluster here. But in a distributed database, performance comes with scalability.
Let's try again:
psql postgres://yugabyte:yugabyte@yb1.pachot.net:5433/yugabyte <<SQL
drop table if exists demo;
create table demo (key int, value int);
SQL
Now with 4 clients in parallel:
for i in {0..3} ; do
(
awk '
# out of data section
/^\\[.]/{data=0}
# when in data section, take only 1 row every `mod`
data!=1||NR%mod==num{print}
# data section begins after COPY
/^COPY /{data=1}
' mod=4 num=$i inserts.dmp > inserts.$i
psql -c "\\i inserts.$i" postgres://yugabyte:yugabyte@yb1.pachot.net:5433/yugabyte
) & done | grep ^COPY | ts
time wait
This is a small cluster but with parallelism, I can load a million rows in a shorter time:
Sep 22 20:34:30 COPY 250000
Sep 22 20:34:30 COPY 250000
Sep 22 20:34:30 COPY 250000
Sep 22 20:34:30 COPY 250000
real 0m39.396s
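To double-check that the four partial loads add up to the full data set, the same count and sum as before can be queried (same connection string as above):
psql -c "select count(*),sum(value) from demo" postgres://yugabyte:yugabyte@yb1.pachot.net:5433/yugabyte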
This is, in a remote, highly available, distributed database, the same time as with a local monolithic database. YugabyteDB has many optimizations to scale fully consistent data ingestion. The right way is to send data in bulk without parsing statements, as with COPY. The same PostgreSQL best practices apply in YugabyteDB, and they are even more critical as soon as cross-node calls are involved.