Navigating dbt Incremental Models: The Double-Edged Sword of incremental_predicates

In the realm of modern data transformation, dbt’s incremental models are a cornerstone for processing large datasets efficiently. But like any powerful tool, they require careful handling—especially when leveraging the advanced configuration option called incremental_predicates.

In this post, I’ll unpack a real-world scenario from our imaginary CRHUB data pipeline that demonstrates both the risks and rewards of using incremental_predicates. We’ll explore how a subtle misalignment caused silent data duplication and how a small adjustment transformed our model into a high-performance workhorse.


🧱 The CRHUB Data Pipeline

Our team processes millions of CRHUB guest reviews. The pipeline follows a classic layered dbt approach:

  1. Source layer: Raw data in bronze_crhub.reviews

  2. Staging model: Cleans and renames fields

  3. Fact model: Applies business logic and loads incrementally

Let’s walk through a simplified version of this pipeline.

Staging Model

-- models/silver_crhub/stg_reviews.sql

WITH src_reviews AS (
    SELECT *
    FROM {{ source('bronze_crhub', 'reviews') }}
)

SELECT
    listing_id,
    date AS review_date,
    reviewer_name,
    comments AS review_text,
    sentiment AS review_sentiment
FROM src_reviews

Incremental Fact Model

-- models/gold_crhub/fact_reviews.sql

{{
    config(
        materialized='incremental',
        unique_key=['listing_id', 'reviewer_name', 'review_date'],
        incremental_strategy='merge'
    )
}}

WITH source_data AS (
    SELECT *
    FROM {{ ref('stg_reviews') }}
    {% if is_incremental() %}
    WHERE review_date > DATE('2018-01-01')
    {% endif %}
),

deduplicated AS (
    SELECT
        listing_id,
        reviewer_name,
        review_date,
        MAX(review_sentiment) AS review_sentiment,
        MAX(review_text) AS review_text
    FROM source_data
    GROUP BY 1, 2, 3
)

SELECT * FROM deduplicated

As the review table grew, our incremental model allowed us to process only new or updated records—a big win for speed and cost.
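As an aside, many time-series incrementals derive the cutoff dynamically from the target table instead of hard-coding a date; the fixed date above simply keeps the example easy to follow. A minimal sketch of that pattern, reusing the column names from this model:

-- Common alternative: derive the incremental cutoff from the target itself
SELECT *
FROM {{ ref('stg_reviews') }}
{% if is_incremental() %}
WHERE review_date > (SELECT MAX(review_date) FROM {{ this }})
{% endif %}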


⚠️ The Duplicate Trap with incremental_predicates

As our data scaled, we sought more performance by applying incremental_predicates:

config(
    materialized='incremental',
    unique_key=['listing_id', 'reviewer_name', 'review_date'],
    incremental_strategy='merge',
    incremental_predicates=["review_date > DATE('2019-01-01')"]
)

Seems harmless, right? Unfortunately, this introduced a silent duplication bug.

What Went Wrong?

Let’s break down the core issue:

  • Our source filter used 2018-01-01

  • Our incremental_predicates used 2019-01-01

Because dbt injects incremental_predicates directly into the ON clause of the MERGE statement, the generated SQL looked like this:

MERGE INTO fact_reviews AS DBT_INTERNAL_DEST
USING fact_reviews__dbt_tmp AS DBT_INTERNAL_SOURCE
ON (
    DBT_INTERNAL_DEST.review_date > DATE('2019-01-01') AND
    DBT_INTERNAL_SOURCE.listing_id = DBT_INTERNAL_DEST.listing_id AND
    DBT_INTERNAL_SOURCE.reviewer_name = DBT_INTERNAL_DEST.reviewer_name AND
    DBT_INTERNAL_SOURCE.review_date = DBT_INTERNAL_DEST.review_date
)

So source records with review dates before 2019 could never match, even when identical rows already existed in the destination, because the predicate excluded those destination rows from the join.
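The rest of the generated statement (sketched approximately below; the exact SQL depends on the adapter and dbt version) shows why that matters: any source row that fails the ON condition falls into the insert branch.

-- Approximate continuation of the generated MERGE
WHEN MATCHED THEN UPDATE SET
    review_sentiment = DBT_INTERNAL_SOURCE.review_sentiment,
    review_text = DBT_INTERNAL_SOURCE.review_text
WHEN NOT MATCHED THEN INSERT
    (listing_id, reviewer_name, review_date, review_sentiment, review_text)
VALUES (
    DBT_INTERNAL_SOURCE.listing_id,
    DBT_INTERNAL_SOURCE.reviewer_name,
    DBT_INTERNAL_SOURCE.review_date,
    DBT_INTERNAL_SOURCE.review_sentiment,
    DBT_INTERNAL_SOURCE.review_text
)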

As a result, older records (e.g., from 2018) were treated as new and re-inserted on every run, leading to exponential growth in duplicates:

23 → 46 → 92 → 184 ...

✅ Safe Usage Patterns for incremental_predicates

Once we identified the issue, we aligned the filter dates in both the source query and the predicate:

config(
    materialized='incremental',
    unique_key=['listing_id', 'reviewer_name', 'review_date'],
    incremental_strategy='merge',
    incremental_predicates=["DBT_INTERNAL_DEST.review_date > DATE('2019-01-01')"]
)

-- Source query
SELECT *
FROM {{ ref('stg_reviews') }}
{% if is_incremental() %}
WHERE review_date > DATE('2019-01-01')
{% endif %}

This alignment is critical. Every row the source query emits now has a review_date after 2019-01-01, so any matching row already in the destination also satisfies the predicate and is found by the merge. Now, only reviews dated after January 1, 2019 are:

  • Queried from the source

  • Matched against in the destination

No records fall through the cracks, so no duplicates occur.
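A quick sanity check confirms it. The query below returns any key that appears more than once; it could be run ad hoc or saved as a singular dbt test (the file name here is hypothetical) so that dbt test flags regressions after future predicate tweaks:

-- tests/assert_fact_reviews_unique.sql (hypothetical singular test)
-- dbt treats any rows returned by a singular test as failures
SELECT
    listing_id,
    reviewer_name,
    review_date,
    COUNT(*) AS duplicate_rows
FROM {{ ref('fact_reviews') }}
GROUP BY 1, 2, 3
HAVING COUNT(*) > 1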


🚀 Performance Gains

With this fix, our fact_reviews model gained major performance improvements:

  • Faster runs on large datasets

  • Micro-partition pruning on the destination table in Snowflake

  • Reduced memory usage in merge operations

And, most importantly, data integrity is preserved.


🧠 Best Practices for Time-Series Incrementals

Here’s a checklist based on our experience:

  1. Start without predicates: Get your model working with is_incremental() alone.

  2. Monitor performance: Use dbt logs and warehouse metrics.

  3. Align filters: Ensure source and predicate conditions match.

  4. Disambiguate columns: Use DBT_INTERNAL_DEST.column_name to avoid SQL errors.

  5. Validate after changes: Always check for duplication after predicate tweaks.

  6. Consider clustering: Use clustering keys on date fields for long-term performance (a config sketch follows this list).
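On Snowflake, dbt exposes clustering through the cluster_by config. A minimal sketch of how it could be added to the fact model; choosing review_date as the clustering key is an assumption based on the filters used above:

{{
    config(
        materialized='incremental',
        unique_key=['listing_id', 'reviewer_name', 'review_date'],
        incremental_strategy='merge',
        cluster_by=['review_date']
    )
}}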


🧭 When Source and Destination Filters Must Differ

There may be cases where the source filter and the merge matching criteria genuinely need to differ. In those situations, avoid incremental_predicates altogether and instead (a combined sketch follows this list):

  • Use source filtering only

  • Optimize with clustering keys
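Putting those two points together, a sketch of the model without any predicates, relying on a dynamic source-side filter plus a clustering key (same column-name assumptions as above):

{{
    config(
        materialized='incremental',
        unique_key=['listing_id', 'reviewer_name', 'review_date'],
        incremental_strategy='merge',
        cluster_by=['review_date']
    )
}}

SELECT *
FROM {{ ref('stg_reviews') }}
{% if is_incremental() %}
WHERE review_date > (SELECT MAX(review_date) FROM {{ this }})
{% endif %}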


🏁 Conclusion

incremental_predicates in dbt is a high-leverage optimization feature—but also a sharp tool that can silently break your model if misused.

To use it safely:

  • Align your filters

  • Understand how dbt compiles SQL

  • Test rigorously before and after

When configured correctly, incremental_predicates can transform sluggish incremental models into fast, scalable pipelines ready for big data production loads.
