Compacting date intervals in BigQuery

Here's a practical BigQuery SQL exercise that highlights some important concepts as well is an interesting algorithm imho. I've pair programmed this with LLMs, if that's a thing 😎

Problem statement: compacting a SCD-2 table, essentially finding intervals that can be safely merged, turning two adjacent intervals with the same data into a single, bigger interval.

This particular input data guarantees these intervals cannot overlap (at the same grain), but there can be gaps. We're also talking about [left-inclusive, right-exclusive) intervals.

No alternative text description for this image

Here's a breakdown of how it all works:

1️⃣ we're starting by computing a hash of all the column of interest, excluding the grain (in my example: flag_a, flag_b)
2️⃣ then we use LAG() over grain window to detect whether the current row starts right after the previous one and if the hashes (so the 'payload' of the two rows) match
3️⃣ we mark the start of a new "segment" when either:
- attributes have changed (flags differ), so hash being different
- intervals are not adjacent (there are gaps)

4️⃣ use a cumulative SUM() over grain window to group rows into segment IDs

5️⃣ collapse each segment using MIN(valid_from) and MAX(valid_to)

We can now see that in our example that several intervals were merged into bigger ones.

Found it useful? Subscribe to my Analytics newsletter at notjustsql.com.

0
Subscribe to my newsletter

Read articles from Constantin Lungu directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Constantin Lungu
Constantin Lungu

Senior Data Engineer • Contractor / Freelancer • GCP & AWS Certified