Navigating dbt Incremental Models: The Double-Edged Sword of incremental_predicates


In the realm of modern data transformation, dbt’s incremental models are a cornerstone for processing large datasets efficiently. But like any powerful tool, they require careful handling—especially when leveraging the advanced configuration option incremental_predicates.
In this post, I’ll unpack a real-world scenario from our imaginary CRHUB data pipeline that demonstrates both the risks and rewards of using incremental_predicates. We’ll explore how a subtle misalignment caused silent data duplication and how a small adjustment transformed our model into a high-performance workhorse.
🧱 The CRHUB Data Pipeline
Our team processes millions of CRHub guest reviews. The pipeline is structured using a classic dbt layer approach:
Source layer: Raw data in bronze_crhub.reviews
Staging model: Cleans and renames fields
Fact model: Applies business logic and loads incrementally
Let’s walk through a simplified version of this pipeline.
Staging Model
-- models/silver_crhub/stg_reviews.sql
WITH src_reviews AS (
    SELECT *
    FROM {{ source('bronze_crhub', 'reviews') }}
)

SELECT
    listing_id,
    date AS review_date,
    reviewer_name,
    comments AS review_text,
    sentiment AS review_sentiment
FROM src_reviews
Incremental Fact Model
-- models/gold_crhub/fact_reviews.sql
{{
    config(
        materialized='incremental',
        unique_key=['listing_id', 'reviewer_name', 'review_date'],
        incremental_strategy='merge'
    )
}}

WITH source_data AS (
    SELECT *
    FROM {{ ref('stg_reviews') }}
    {% if is_incremental() %}
    WHERE review_date > DATE('2018-01-01')
    {% endif %}
),

deduplicated AS (
    SELECT
        listing_id,
        reviewer_name,
        review_date,
        MAX(review_sentiment) AS review_sentiment,
        MAX(review_text) AS review_text
    FROM source_data
    GROUP BY 1, 2, 3
)

SELECT * FROM deduplicated
As the review table grew, our incremental model allowed us to process only new or updated records—a big win for speed and cost.
⚠️ The Duplicate Trap with incremental_predicates
As our data scaled, we sought more performance by applying incremental_predicates:
config(
    materialized='incremental',
    unique_key=['listing_id', 'reviewer_name', 'review_date'],
    incremental_strategy='merge',
    incremental_predicates=["review_date > DATE('2019-01-01')"]
)
Seems harmless, right? Unfortunately, this introduced a silent duplication bug.
What Went Wrong?
Let’s break down the core issue:
Our source filter used 2018-01-01
Our incremental_predicates used 2019-01-01
Because incremental_predicates is injected into the ON clause of the MERGE statement, dbt generated SQL like this:
MERGE INTO fact_reviews AS DBT_INTERNAL_DEST
USING fact_reviews__dbt_tmp AS DBT_INTERNAL_SOURCE
ON (
    DBT_INTERNAL_DEST.review_date > DATE('2019-01-01') AND
    DBT_INTERNAL_SOURCE.listing_id = DBT_INTERNAL_DEST.listing_id AND
    DBT_INTERNAL_SOURCE.reviewer_name = DBT_INTERNAL_DEST.reviewer_name AND
    DBT_INTERNAL_SOURCE.review_date = DBT_INTERNAL_DEST.review_date
)
So any records with review dates before 2019 could never match their existing rows in the destination table, because the predicate excluded those rows from the merge join.
As a result, older records (e.g., from 2018) were treated as new and inserted again and again—leading to exponential growth in duplicates:
23 → 46 → 92 → 184 ...
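A duplicate check against the target table would have caught this much earlier. Here is a minimal sketch, reusing the table and grain columns from the model above, that returns any key appearing more than once:
-- Ad-hoc duplicate check: every (listing_id, reviewer_name, review_date)
-- should appear exactly once in fact_reviews.
SELECT
    listing_id,
    reviewer_name,
    review_date,
    COUNT(*) AS copies
FROM fact_reviews
GROUP BY 1, 2, 3
HAVING COUNT(*) > 1
ORDER BY copies DESC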
✅ Safe Usage Patterns for incremental_predicates
Once we identified the issue, we aligned the filter dates in both the source query and the predicate:
config(
    materialized='incremental',
    unique_key=['listing_id', 'reviewer_name', 'review_date'],
    incremental_strategy='merge',
    incremental_predicates=["DBT_INTERNAL_DEST.review_date > DATE('2019-01-01')"]
)
-- Source query
SELECT *
FROM {{ ref('stg_reviews') }}
{% if is_incremental() %}
WHERE review_date > DATE('2019-01-01')
{% endif %}
This alignment is critical. Now, only reviews after 2019 are:
Queried from the source
Matched against in the destination
No records fall through the cracks, so no duplicates occur.
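The same duplicate check can also run on every build. One option is a dbt singular test (a plain SQL file under tests/ that fails if it returns any rows). A sketch, with a hypothetical file name:
-- tests/assert_fact_reviews_unique.sql (hypothetical file name)
-- Fails the build if any grain key appears more than once in fact_reviews.
SELECT
    listing_id,
    reviewer_name,
    review_date,
    COUNT(*) AS copies
FROM {{ ref('fact_reviews') }}
GROUP BY 1, 2, 3
HAVING COUNT(*) > 1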
🚀 Performance Gains
With this fix, our fact_reviews
model gained major performance improvements:
Faster runs on large datasets
Partition pruning enabled in Snowflake
Reduced memory usage in merge operations
And—most importantly—data integrity was preserved.
🧠 Best Practices for Time-Series Incrementals
Here’s a checklist based on our experience:
Start without predicates: Get your model working with is_incremental() alone.
Monitor performance: Use dbt logs and warehouse metrics.
Align filters: Ensure source and predicate conditions match.
Disambiguate columns: Use DBT_INTERNAL_DEST.column_name to avoid SQL errors.
Validate after changes: Always check for duplication after predicate tweaks.
Consider clustering: Use clustering keys on date fields for long-term performance (a sketch follows below).
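On Snowflake, that last point can be as simple as adding cluster_by to the model config. Here is a sketch of what our fact model’s config might look like with clustering on the review date (not our exact production settings):
-- models/gold_crhub/fact_reviews.sql (config only, clustering added)
{{
    config(
        materialized='incremental',
        unique_key=['listing_id', 'reviewer_name', 'review_date'],
        incremental_strategy='merge',
        incremental_predicates=["DBT_INTERNAL_DEST.review_date > DATE('2019-01-01')"],
        cluster_by=['review_date']
    )
}}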
🧭 When Source and Destination Filters Must Differ
There may be cases where source filtering and merge matching criteria need to be different. In those situations, avoid incremental_predicates altogether and instead:
Use source filtering only (see the sketch below)
Optimize with clustering keys
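When the source filter carries all of the incremental logic, a common dbt pattern (shown here as a sketch, not what our pipeline actually shipped) is to filter on the latest date already loaded into the target via {{ this }}:
-- Source query: only pull rows newer than what the target already holds.
SELECT *
FROM {{ ref('stg_reviews') }}
{% if is_incremental() %}
WHERE review_date > (SELECT MAX(review_date) FROM {{ this }})
{% endif %}
The merge then matches on the unique_key alone, so no predicate can hide existing rows from the join.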
🏁 Conclusion
incremental_predicates in dbt is a high-leverage optimization feature—but also a sharp tool that can silently break your model if misused.
To use it safely:
Align your filters
Understand how dbt compiles SQL
Test rigorously before and after
When configured correctly, incremental_predicates can transform sluggish incremental models into fast, scalable pipelines ready for big data production loads.