Migrating Data to AWS: Lessons Learned


Introduction
Hey everyone, recently I've been working on migrating 1.5TB of PostgreSQL data to AWS from another cloud provider. I wanted to document the journey: the different attempts that were made, what didn't work, and what did. This is going to be a short one because honestly I'm out of ideas at the moment, so forgive me 😅
Context
We were migrating the data because the goal was to move everything to AWS, and a big part of that was essentially migrating the database.
Using AWS DMS 💀
The first attempt was to migrate the data using AWS's Database Migration Service (DMS), and let me tell you, the mistake that happened before anything else is that we should have spiked it out for a couple of days before committing to this decision.
We started moving data using the DMS tool, so basically just migrating data from one place to another. At first we had no idea it doesn't copy the schema, so prepare yourself for the first headache.
Headache 1: There’s no schema 💀
After everything finished we tried it out and it had the data, but we couldn't create anything new (no sequences defined for the ids); there were no indexes, no constraints, nothing. Then we realized that the DMS tool doesn't copy the schema (wish I had read about that beforehand lol). But anyways, continuing the story, we decided to take a schema-only dump using pg_dump --schema-only, apply it on the new db, aaaand we ran into another problem.
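For reference, the schema-only dump and apply step looked roughly like this; the host, user and database names here are placeholders, and --no-owner is just a nice-to-have when the users on the target differ:

# Dump only the DDL (tables, indexes, sequences, constraints, triggers) from the source
pg_dump --schema-only --no-owner \
  --host=old-provider-host --username=app --dbname=appdb \
  --file=schema.sql

# Apply it on the new AWS database
psql --host=new-aws-host --username=app --dbname=appdb --file=schema.sql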
Headache 2: Default values for columns
Turns out the DMS tool creates the bare minimum schema it needs to be able to transfer the data (gotta have tables to be able to insert into them lol), and when it does this it skips the default values defined on the columns. When we then apply the schema, the CREATE TABLE commands (with the correct default values) don't get executed because the tables already exist. So indexes, sequences, constraints and triggers work, but the default values are gone.
Unless you decide to go through every single table and write ALTER statements to add the missing defaults, which really isn't best practice if you think about it.
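Just to illustrate what "going through every single table" means, re-adding a lost default is one ALTER per column; the users table, created_at column and users_id_seq sequence below are made-up examples:

-- Hypothetical example: restore a default that the DMS-created schema dropped
ALTER TABLE users
  ALTER COLUMN created_at SET DEFAULT now();

-- Same idea for a sequence-backed id column
ALTER TABLE users
  ALTER COLUMN id SET DEFAULT nextval('users_id_seq');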
This problem spiraled into us trying different techniques to make it work, some of which were the following:
Say screw it, add the schema first, and accept a slower load in exchange for keeping everything intact.
Add the schema but disable the indexes during the load, then enable them again afterwards (see the sketch after this list).
Try pg_dump using a .sql format, a .dump format, and I don't even remember the rest tbf.
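On the "disable indexes" idea: PostgreSQL doesn't have a supported switch to simply disable an index, so in practice it means dropping the heavy indexes before the bulk load and recreating them afterwards. A rough sketch, with made-up index and table names:

-- Before the load: drop the expensive indexes
DROP INDEX IF EXISTS users_email_idx;

-- ... bulk load the data ...

-- After the load: rebuild them
CREATE INDEX users_email_idx ON users (email);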
And congrats guys, we reached our 3rd, 4th and 5th headaches of the day.
Headache 3: Invalid command galore in .sql dumps
So we took a .sql dump, tried it, and well, a spam of invalid command \N errors started to appear, which means the file couldn't be parsed correctly when attempting to restore (usually this means psql has lost sync with the COPY data and is reading the \N null placeholders as meta-commands). I couldn't care less at this point and decided to just try a .dump format to see if the error still persisted, and it went away completely. But another one appeared (surprise).
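For context, the difference between the two flavours is just the format flag; the plain .sql dump is replayed with psql, while the custom-format .dump goes through pg_restore (file and database names here are placeholders):

# Plain-text .sql dump -- the one that spammed "invalid command \N" when replayed with psql
pg_dump --format=plain --dbname=appdb --file=appdb.sql

# Custom-format .dump -- compressed, restored with pg_restore, supports parallel restore
pg_dump --format=custom --dbname=appdb --file=appdb.dump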
Headache 4: pg_restore doesn’t work if you have generated columns
A generated column in PostgreSQL is a column whose value is computed from other columns in the table; with the STORED variant it's physically stored on disk, something like this:
CREATE TABLE users (
    first_name TEXT,
    last_name TEXT,
    full_name TEXT GENERATED ALWAYS AS (first_name || ' ' || last_name) STORED
);
And apparently, when restoring the .dump using pg_restore, it worked for all the tables that don't have generated columns, but for the ones that did:
pg_restore: error: could not execute query: ERROR: column "xx" is a generated column
DETAIL: Generated columns cannot be used in COPY.
Command was: COPY foo (id, name, ..., xx, ...) FROM stdin;
And the COPY command fails completely. Turns out you can't copy generated columns with pg_restore.
We thought of dropping the columns, transferring the data, and re-adding the columns afterwards, but that didn't work either.
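For completeness, the drop-and-re-add idea looked roughly like the sketch below (reusing the users example from above). It didn't pan out in our case, but this is the shape of the attempt; when a STORED generated column is re-added, Postgres recomputes the values for existing rows on its own:

-- Drop the generated column before loading the data
ALTER TABLE users DROP COLUMN full_name;

-- ... load the data without it ...

-- Re-add it afterwards
ALTER TABLE users
  ADD COLUMN full_name TEXT GENERATED ALWAYS AS (first_name || ' ' || last_name) STORED;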
What ended up happening
We decided to go through with a .dump of the whole database and schema, optimizing the database as much as we could for insertions and parallelizing the restore with pg_restore.
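Roughly, the final dump and restore looked like this; the host and database names are placeholders and the job count is illustrative, not a recommendation:

# Custom-format dump of schema + data from the source database
pg_dump --format=custom --host=old-provider-host --dbname=appdb --file=appdb.dump

# Parallel restore into the AWS database (--jobs = number of parallel restore workers)
pg_restore --jobs=8 --host=new-aws-host --dbname=appdb appdb.dump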
Some of the optimizations that were made (all in the postgresql.conf file):
Bumped shared_buffers.
Bumped maintenance_work_mem, which specifies the maximum amount of memory to be used by maintenance operations such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY.
Bumped checkpoint_timeout, which controls how often we go "okay, let's sync the actual database files with where the WAL is committed at" (it goes deeper than that, but that's the tl;dr).
Disabled fsync and synchronous_commit, which is not production friendly; this basically skips flushing the WAL to disk on every commit (sometimes it can get batched when they're on, but that's not the case here).
Disabled full_page_writes, which is not production friendly either; instead of writing the whole page image to the WAL the first time a page is touched after a checkpoint (which exists for crash-recovery safety), only the actual change gets written. This makes the WAL much smaller and writes faster.
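A rough sketch of what that looked like in postgresql.conf; the exact values here are illustrative, and everything was reverted after the migration (turning fsync and full_page_writes off was only acceptable because losing the database mid-load just meant redoing the restore):

# postgresql.conf tweaks for the bulk restore only -- values are illustrative
shared_buffers = 8GB            # bumped from the default
maintenance_work_mem = 2GB      # helps CREATE INDEX / ALTER TABLE ADD FOREIGN KEY
checkpoint_timeout = 30min      # checkpoint less often during the load
fsync = off                     # NOT production friendly
synchronous_commit = off        # NOT production friendly
full_page_writes = off          # NOT production friendly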
That's it I guess, I needed to rant about this so I wrote it up as an article.
Thanks for coming to my tech talk guys, hope you enjoyed! Till the next one.
