Finding a Needle in a Haystack: How to Diff 800M+ Records Across Two Databases Without Losing Your Mind

Amr Elhewy
5 min read

Introduction

Hello guys! This is going to be a quick but interesting one. Sometimes, for faster response times and aggregation purposes, we step away from traditional relational databases and head towards databases optimized for analytical processing (e.g. Clickhouse, which is a columnar data store).

Maintaining a source-of-truth database is important, and that will most probably be the relational one. We then sync the two, either synchronously or asynchronously, so that data in Postgres, for example, is always passed over to Clickhouse, which we can read directly for aggregates and such.

I'm not here to talk about how to sync them together; the main goal is simply to keep both databases in sync, so that both hold the exact same data either right now or eventually, but that's a different story.

Sometimes they'll deviate, meaning Postgres has more data than Clickhouse. That could happen because some messages weren't processed as they should have been, because of network errors, etc.

Now you've figured out the reason and pushed a fix, but you still want to re-sync the missing parts, and the next question arises…

How do we sync the missing data again?

The answer depends on the scale of the data involved. And let me tell you from experience, that's the most important thing to know in order to answer the question.

I'll explain the next parts as a series of questions and answers, and hopefully the answers help you reach a conclusion by the end of the article.

How big is the data?

Data sizes can range from a few hundred thousand rows, to a few million, to nearly a billion rows or more.

Smaller sizes leave you more options than larger ones, meaning it's easier and faster to re-sync when you don't have a lot of rows.

Larger sizes are where things get tough. You'll reach a point where you start questioning yourself, but hey, everything is a learning curve I guess.

After figuring this out, we move on to asking the second question.

How big is the delta?

The difference between the two databases, and the ratio between that delta and the total size of the data, are also important to know.

Let me tell you before moving on that a difference of 4,000 records in a dataset of 1B+ rows is something that'll cause you headaches. This article is aimed more at that scale of data.

Knowing the answers to the two questions above can already steer you towards a solution.

Small-ish data sizes

Small amounts of data (up to a few hundred thousand rows) can be re-synced easily. The most straightforward option is to fetch the ids (any unique key, really) from both sides and just compute the difference between the two sets. It will take some memory, but not a crazy amount. Once you have the delta, just re-insert the missing rows into your analytical database and move on with your day.
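
To make that concrete, here's a minimal sketch of the "dump and diff" idea, using the placeholder table names that appear later in this article (xx for Postgres, xxx for Clickhouse); the actual diffing can then happen in your app, a script, or even sort + comm:

-- Clickhouse: dump every id to a file
SELECT id FROM xxx INTO OUTFILE 'clickhouse_ids.csv' FORMAT CSV;

-- Postgres psql: dump every id to a file (client-side copy, single line)
\copy (SELECT id FROM xx) TO 'postgres_ids.csv' WITH CSV

-- Diff the two files (or the two in-memory id sets), then re-insert the
-- missing rows into Clickhouse.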

Larger data sizes

This is where the above solution will NEVER work under normal circumstances (unless you have a whopping 100 GB of RAM or something).

I'm going to stick with the 4,000-records-out-of-800M+ problem here (we love extremes). The answers to the questions below are the steps I actually took to solve this problem.

Now you're going to have to play it carefully and ask new questions, and those questions depend on your setup and how well-optimized your databases are.

The setup we'll assume has no partitioning and no sharding, just indexes and vibes (not the best setup for this amount of data).

The main goal here is to stay away from memory, and by staying away from memory I mean that any application-level logic you attempt will end up failing miserably.

Some questions to ask

  1. Is there a possibility of closing the search window down? For example, do I have to search from the first record to the last, or can I narrow that range a little? Indexes will help, but scanning the whole range still takes a long time.

For example, if I figure out that the 4,000 records got dropped only in the past month, then I can use some timestamp column (hopefully indexed) and work within that window for a start.

So the first task is to minimize the search window as much as you can.

  2. What are the strengths of each database? Knowing this can help us make better decisions.

a- For example, Clickhouse is very, very strong at aggregating and crunching numbers thanks to its columnar nature (data is physically stored together by column, not by row). Knowing this, querying the ids over the date range specified in step 1 is the key move (a query of a few seconds), but instead of pulling them into memory we write them to a CSV file on disk. Something like this:

-- Clickhouse query
SELECT id 
FROM xxx 
WHERE xxx_date BETWEEN '2025-01-01' AND '2025-01-31' 
INTO OUTFILE 'clickhouse_ids.csv'
FORMAT CSV;

This will output all the ids that exist in Clickhouse for that date range. They're still missing the 4,000 records, of course, but this is a solid step towards solving the problem.

b- Now in Postgres the goal is to take the ids from above and exclude them, finding the missing 4000.

Here comes a really cool trick I learned, which is creating a temporary Postgres table.

Temporary tables are tables that get dropped once the connection/session is closed. For example, if you open psql and create a temp table, it gets deleted when you exit.

The idea here is to create this table, dump the ids into it, and use it to query against the larger table, offloading the memory overhead to Postgres: its shared buffers, disk, and planner do the heavy lifting instead of your application.

Even if the table is big, Postgres manages the spillover efficiently, avoiding app memory pressure.
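
For completeness, the setup for that temp table might look like this in psql. A minimal sketch, assuming the id is a bigint (adjust the type to your schema) and that clickhouse_ids.csv is the file we exported above:

-- Postgres psql: the temp table only lives for this session
CREATE TEMP TABLE clickhouse_ids (id bigint);

-- Load the Clickhouse export into it (client-side copy, must be a single line)
\copy clickhouse_ids FROM 'clickhouse_ids.csv' WITH CSV

-- Index it and refresh statistics so the planner has something to work with
CREATE INDEX ON clickhouse_ids (id);
ANALYZE clickhouse_ids;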

Now executing something like this will be ideal:

-- Postgres psql (note: the \copy meta-command must be written on a single line)
\copy (SELECT p.id FROM xx p WHERE p.xx_date BETWEEN '2025-01-01' AND '2025-01-31' AND p.id NOT IN (SELECT id FROM clickhouse_ids)) TO 'missing_ids.csv' WITH CSV

Given that we have properly indexed the table on the date column and the id, everything falls into the hands of the planner anyway.

Once this query finishes, you'll have a 4,000-row CSV of the missing ids. Now all that's left is inserting those rows into Clickhouse.
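
One way to finish that off (a rough sketch, not necessarily how you'd wire it up in production): load missing_ids.csv back into another temp table in the same psql session, export the full rows for those ids, and feed the file to clickhouse-client. The table names are still the article's placeholders, and the column order is assumed to match between the two tables:

-- Postgres psql: load the missing ids, then export the full rows for them
CREATE TEMP TABLE missing_ids (id bigint PRIMARY KEY);
\copy missing_ids FROM 'missing_ids.csv' WITH CSV
\copy (SELECT p.* FROM xx p JOIN missing_ids m ON m.id = p.id) TO 'missing_rows.csv' WITH CSV

-- Then, from the shell, pipe the rows into Clickhouse:
-- clickhouse-client --query "INSERT INTO xxx FORMAT CSV" < missing_rows.csv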

Summary

Dealing with large data is definitely an experience. Application-level solutions have no power here, and thinking outside the box is a must. This was an approach I personally used that worked better than I expected, so I thought I'd share. Thank you guys for tuning in & see you in the next one!
