Boost PostgreSQL Export and Import with Parallelism


If you’ve ever had to move a hefty database, you know how painful it can be to sit through long export and import times. When speed is of the essence but configuration tweaks aren’t an option (maybe you don’t have permissions, or the team is on edge about potential risks), there’s a handy trick you can use to speed things up without touching server settings. This guide will walk you through how to get the job done using PostgreSQL’s pg_dump and pg_restore commands in parallel, which is a fantastic way to reduce processing time.

Imagine you’re working with a PostgreSQL database housing customer analytics data — let’s call it customer_db. This database has grown massive over time, and it’s time to move it to a development environment (customer_db_dev), where your team needs it for testing and analysis.


Step 1: Export the Database with pg_dump in Parallel Mode

Typically, exporting data is a straightforward pg_dump job, but when you're dealing with gigabytes (or even terabytes) of data, you're in for a wait. Even when changing PostgreSQL's configuration isn't an option, you can still leverage parallel processing in pg_dump to get things moving faster. Here's how.

Pro tip: Use pg_dump in directory format (-F d) with parallel jobs (-j). In directory format, pg_dump writes each table's data to its own file, so several tables can be dumped at once by separate worker processes, each with its own connection to the database. Parallel dumps are only supported in directory format; the plain, custom, and tar formats are always written single-threaded.

To kick things off, we’ll start with 4 parallel jobs, but feel free to tweak based on your system’s CPU power (a sketch after the breakdown below shows one way to derive the count automatically). Here’s what the command looks like:

pg_dump -h your_host -p 5432 -U your_username -d customer_db -F d -j 4 -f exports -v

Breakdown of the Command:

  • -F d: Specifies directory format, enabling multi-file output.

  • -j 4: Uses 4 parallel jobs, a good starting point for a speedy export. Each job opens its own database connection, so the server must allow that many connections; beyond that, more cores generally means more useful jobs.

  • -f exports: Defines the output directory for the exported files.

  • -v: Verbose mode, giving you live feedback on what’s happening.
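
As promised, here’s a minimal sketch for deriving the job count from the client machine’s core count instead of hard-coding 4. It assumes a Unix-like shell where nproc is available; the cap of 8 is an arbitrary courtesy so the server isn’t flooded with connections:

JOBS=$(nproc)                   # CPU cores on the client machine
[ "$JOBS" -gt 8 ] && JOBS=8     # arbitrary cap to limit concurrent server connections
pg_dump -h your_host -p 5432 -U your_username -d customer_db -F d -j "$JOBS" -f exports -v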

In practice, running pg_dump this way can cut export time significantly. For our customer_db example, it shaved hours off the process, making it not only more efficient but also way less stressful.
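
Before moving on, a quick sanity check is worth the few seconds: a directory-format dump should contain a toc.dat file plus one data file per table, compressed by default. The file names below are illustrative and will differ on your system:

ls exports/
# toc.dat  3001.dat.gz  3002.dat.gz  3003.dat.gz  ...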


Step 2: Import the Database with pg_restore in Parallel Mode

So, we’ve got our export, and it’s time to bring it into our target database, customer_db_dev. Normally, pg_restore runs a single-threaded job, but with parallel jobs it spawns multiple worker processes that restore different tables and build indexes concurrently.

The beauty of using pg_restore with parallel jobs (-j) is that it takes full advantage of your system’s resources without requiring changes to the database settings. Like pg_dump, pg_restore can work through a directory-format export piece by piece, letting independent tables and indexes load side by side.
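
One prerequisite worth calling out: pg_restore won’t create customer_db_dev for you unless you use its --create option (in which case you’d connect to an existing maintenance database instead). Assuming you have permission to create databases, the simplest route is to set up an empty target first:

createdb -h your_host -p 5432 -U your_username customer_db_dev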

Here’s the pg_restore command we’re using:

pg_restore -h your_host -p 5432 -U your_username -d customer_db_dev -j 4 -F d -v exports/

Breakdown of the Command:

  • -j 4: Runs 4 parallel jobs, increasing the speed of the import process.

  • -F d: Tells pg_restore the input is in directory format. (Strictly optional here, since pg_restore detects the archive format automatically; note that parallel restore also works with custom-format archives, not just directory ones.)

  • exports/: The path to our exported directory containing the dump files.

  • -v: Verbose mode, to keep an eye on the progress.
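
One variant worth keeping in your back pocket: if customer_db_dev doesn’t have the same roles as production, the restore can emit a stream of ownership and permission errors. The standard --no-owner and --no-privileges flags skip those statements, which is usually what you want in a dev environment:

pg_restore -h your_host -p 5432 -U your_username -d customer_db_dev -j 4 --no-owner --no-privileges -v exports/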

With this approach, the database import process is much faster and a lot smoother. In our case, the import into customer_db_dev was completed in nearly half the usual time, making it a lifesaver when working with massive databases.


Real-World Results: Speed Boost in Action

To give you an idea of the speed difference, here’s what happened with our customer_db to customer_db_dev move:

  1. Single-threaded Approach (standard, non-parallel): Export time of about 4 hours and import time close to 5 hours. Total transfer time? Nearly 9 hours.

  2. Parallel Approach (using -j 4): Export time dropped to around 2 hours, with the import time coming in at about 2.5 hours. Total transfer time? Just under 5 hours.

By leveraging parallelism, we cut nearly 50% off the total time for the export/import cycle — a huge improvement for teams needing quick database access in dev environments.


Pros and Cons of Using Parallelism in pg_dump and pg_restore

Like any approach, using parallelism in pg_dump and pg_restore comes with its advantages and disadvantages. Here’s a rundown to help you decide if it’s right for your database migration needs.

Pros

  • Reduced Processing Time: With multiple threads working simultaneously, you’re effectively cutting down the total time for both export and import, sometimes by half or more.

  • Flexible: The number of parallel jobs (-j) can be adjusted based on system resources, allowing for scalability on different hardware setups.

  • No Server Configuration Changes Needed: This approach leverages client-side commands, meaning you don’t need special permissions or configuration changes on the database server itself (a handy client-side companion for credentials is sketched after this list).

  • Less Downtime for Teams: Faster migrations mean that dev teams can access data sooner, making it ideal for time-sensitive migrations or testing scenarios.
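
As a small companion to the client-side point above: libpq’s standard ~/.pgpass file lets scripted dumps and restores run without password prompts, again with zero server-side changes. The host, username, and password below are placeholders:

echo 'your_host:5432:*:your_username:your_password' >> ~/.pgpass
chmod 600 ~/.pgpass     # libpq ignores the file unless its permissions are this restrictive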

Cons

  • High Resource Usage: Running multiple jobs in parallel can put a strain on CPU and memory, especially on lower-end systems. You’ll want to monitor resources to ensure your system can handle it without lagging.

  • Storage Requirements: A directory-format dump creates one file per table, and you need enough free disk space to hold the entire dump at once. Table data is compressed by default, which softens the blow, but check your headroom before starting to avoid interruptions (a quick check is shown after this list).

  • Limited Compatibility: Directory format (-F d) is required for parallel dumps, but it may not be compatible with backup or restore workflows that expect a single-file output.

  • Potential Locking Issues: pg_dump’s workers take shared locks on every table and hold them for the duration of the dump. Normal reads and writes proceed unhindered, but any session that requests an exclusive lock (DDL, for example) will queue behind the dump, and with parallel workers there is a documented risk of the dump failing with a deadlock, so schedule carefully on heavily used production databases.
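
For the storage point above, two quick checks before and during the run can save you from a half-finished dump; both are standard Unix tools:

df -h .             # free space on the target volume before you start
du -sh exports/     # size of the dump directory while the export runs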

Final Thoughts

Using pg_dump and pg_restore with parallelism is a fantastic workaround when you can’t change server settings but still need the process to be fast. It’s also flexible, allowing you to adjust the number of parallel jobs based on your system’s capabilities. So, next time you’re stuck with a massive database and need a swift export/import process, give this parallel approach a try—your future self will thank you.

Whether you’re running tests on a dev environment or preparing for a big data migration, parallelism in pg_dump and pg_restore can be a lifesaver, saving you time and sparing you stress. So go on, give it a shot, and let those CPUs get to work!


Written by Shreehari Vaasistha L