Migrating Data to AWS: Lessons Learned

Amr Elhewy
4 min read

Introduction

Hey everyone! Recently I’ve been working on migrating 1.5TB of PostgreSQL data to AWS from another cloud provider. I wanted to document the journey: the different attempts we made, what didn’t work, and what did. This is going to be a short one because honestly I’m out of ideas at the moment, so forgive me 😅

Context

The goal was to move everything to AWS, and a big part of that was migrating the database.

Using AWS DMS 💀

The first attempt was to migrate the data using AWS’s Database Migration Service (DMS), and let me tell you, the first mistake happened before anything else: we should have spiked it out for a couple of days before committing to this decision.

We started moving data with DMS, basically just copying rows from one place to another. At first we had no idea it doesn’t copy the schema, so prepare yourself for the first headache.

Headache 1: There’s no schema 💀

After everything finished we tried it out and the data was there, but we couldn’t create anything new (no sequences for ids were defined), and there were no indexes, no constraints, nothing. That’s when we realized DMS doesn’t copy the schema (wish I had read about that beforehand lol). Anyway, continuing the story, we decided to take a schema-only dump with pg_dump --schema-only and apply it on the new database. Aaaand we ran into another problem.
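For reference, a schema-only dump and restore looks roughly like this; the host names, user and database names here are placeholders, not our actual setup:

# Dump only the schema (tables, sequences, indexes, constraints, triggers), no data
pg_dump --schema-only --host=old-db.example.com --username=app_user app_production > schema.sql

# Apply it to the target database on the AWS side
psql --host=new-db.example.com --username=app_user --dbname=app_production -f schema.sql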

Headache 2: Default values for columns

Turns out DMS creates the bare minimum schema it needs to transfer the data (you have to have tables to insert into, after all), and in doing so it skips the default values defined on the columns. When we then applied our schema dump, the CREATE TABLE statements (which carry the correct defaults) didn’t run because the tables already existed. So indexes, sequences, constraints and triggers were applied, but the default values were still gone.

Unless, of course, you decide to go through every single table and write ALTER statements to add the missing defaults back, which really isn’t great if you think about it.
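For the curious, patching a missing default by hand looks like this; the table and column names below are made up for illustration:

-- Hypothetical example of re-adding defaults that DMS dropped
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';
ALTER TABLE orders ALTER COLUMN created_at SET DEFAULT now();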

This problem spiraled into us trying different techniques to make it work. Some of them were the following:

  1. Say screw it, apply the full schema first, and accept a slower but intact migration.

  2. Apply the schema but disable indexes during the load and re-enable them afterwards (there’s a sketch of this right after the list).

  3. Try pg_dump with a plain .sql format, a custom .dump format, and honestly I don’t even remember the rest.
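On option 2: PostgreSQL doesn’t have a first-class “disable index” switch, so in practice that idea boils down to dropping the indexes before the bulk load and recreating them afterwards. A rough sketch, with a hypothetical index name:

-- Drop heavy indexes before the bulk load...
DROP INDEX IF EXISTS idx_orders_created_at;

-- ...load the data, then rebuild them once it's done
CREATE INDEX idx_orders_created_at ON orders (created_at);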

And congrats everyone, we’ve reached the 3rd, 4th and 5th headaches of the day.

Headache 3: Invalid command galore in .sql dumps

So we took a .sql dump, tried restoring it, and a spam of invalid command \N errors started to appear, which means the file couldn’t be parsed correctly during the restore (usually this happens when an earlier COPY fails and psql then tries to interpret the raw data rows, where \N stands for NULL, as meta-commands). I couldn’t care less at this point and decided to just try the custom .dump format to see if the error persisted, and it went away completely. But another one appeared (surprise).
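For anyone trying the same thing, the difference is just the dump format; the paths and database names below are placeholders:

# Plain-text SQL dump (the kind that gave us the invalid command \N spam through psql)
pg_dump --dbname=app_production --file=app.sql

# Custom-format dump, restored with pg_restore instead of psql
pg_dump --format=custom --dbname=app_production --file=app.dump
pg_restore --dbname=app_production_new app.dump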

Headache 4: pg_restore doesn’t work if you have generated columns

A generated column in PostgreSQL is a column that is physically stored on disk but whose value is computed from other columns in the table, something like this:

CREATE TABLE users (
  first_name TEXT,
  last_name TEXT,
  full_name TEXT GENERATED ALWAYS AS (first_name || ' ' || last_name) STORED
);

And apparently, when restoring the .dump using pg_restore, it worked for all the tables without generated columns, but for the ones that had them we got:

pg_restore: error: could not execute query: ERROR:  column "xx" is a generated column
DETAIL:  Generated columns cannot be used in COPY.
Command was: COPY foo (id, name, ..., xx, ...) FROM stdin;

And the COPY command fails completely for that table. Turns out generated columns can’t be used as targets in COPY, which is what pg_restore uses under the hood to load the data.

We thought of dropping the columns, transferring the data and re-adding them afterwards, but that didn’t work either.
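For context, the attempt looked roughly like this, reusing the users table from above:

-- Drop the generated column before loading the data...
ALTER TABLE users DROP COLUMN full_name;

-- ...then re-add it afterwards; PostgreSQL backfills it from the existing rows
ALTER TABLE users ADD COLUMN full_name TEXT
  GENERATED ALWAYS AS (first_name || ' ' || last_name) STORED;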

What ended up happening

We decided to go through with a custom-format .dump of the whole database, schema and data included, while tuning the target database as much as we could for fast insertion and parallelizing the restore with pg_restore.
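The final run looked something like this; the host names, paths and job count below are placeholders rather than our exact values:

# Full custom-format dump of the source database (schema + data)
pg_dump --format=custom --host=old-db.example.com --dbname=app_production --file=app.dump

# Restore on the AWS side with several parallel jobs
pg_restore --host=new-db.example.com --dbname=app_production --jobs=8 app.dump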

Some of the optimizations we made (all in the postgresql.conf file, summarized in a snippet after this list):

  1. Bumped shared_buffers

  2. Bumped maintenance_work_mem, which sets the maximum amount of memory used by maintenance operations such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY

  3. Bumped checkpoint_timeout, which controls how often PostgreSQL runs a checkpoint, i.e. flushes dirty data pages to the data files so they catch up with what’s already committed in the WAL (it goes deeper than that, but tl;dr)

  4. Disabled fsync and synchronous_commit, which is not production friendly; this basically stops PostgreSQL from waiting for the WAL to be flushed to disk on every commit (with them on, flushes can sometimes get batched, but that wasn’t the case here)

  5. Disabled full_page_writes, which is also not production friendly. With it on, PostgreSQL writes the entire content of a page to the WAL the first time that page is modified after a checkpoint (it’s there for crash-recovery safety); with it off, only the actual changes are written.

    This makes the WAL much smaller and writes faster.
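Put together, the temporary restore-time settings looked something like the snippet below. The exact values are illustrative, not our real numbers, and the unsafe ones (fsync, synchronous_commit, full_page_writes) should be reverted as soon as the restore is done:

# postgresql.conf -- temporary settings for the bulk restore (values are illustrative)
shared_buffers = 8GB              # more memory for caching during the load
maintenance_work_mem = 2GB        # speeds up CREATE INDEX and ADD FOREIGN KEY
checkpoint_timeout = 30min        # checkpoint less frequently during the load
fsync = off                       # NOT production safe: don't force writes to disk
synchronous_commit = off          # commits don't wait for the WAL flush
full_page_writes = off            # NOT production safe: skip full-page images in WAL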

That’s it, I guess. I needed to rant about this, so I wrote it up as an article.

Thanks for coming to my tech talk, and I hope you enjoyed it! Till the next one.
