Data contracts + dbt Lineage: Lessons from early experiments

Rocio Radu

Data lineage is the practice of tracking where data comes from, how it changes, and where it ends up.
For data engineers, it’s like a map showing every step the data takes, from ingestion at the very beginning, through transformations, to consumption, so you can understand dependencies and the impact of changes.

This becomes essential in decentralized data architectures (like data mesh, or similar architectures with independent data products), where datasets are produced and owned by different teams, yet consumed by many others across the company. In these kinds of setups, clear ownership and visibility are critical.

A big challenge

The idea behind implementing these contracts is that every dataset should:

  • Have a clear owner: define boundaries and team responsibilities for quality and changes

  • Be discoverable by other teams: provide an abstraction of the process and visibility of the main data, so nobody has to reinvent the wheel.

  • Show how changes might affect downstream processes: make the impact of a change visible before it ships. Without that, a small schema change could break dashboards or machine learning models without warning.

Possible Approach: Data Contracts + Data Lineage

We’re exploring a new way to strengthen collaboration between data producers and consumers:
tying data contracts directly into our dbt lineage.

In summary, the idea is:

  • Data contracts define the schema, SLAs, and quality expectations of every dataset.

  • dbt handles transformations and automatically builds a lineage graph between sources and downstream models.

  • The two combined could give us both clarity (what a dataset should look like) and visibility (how changes ripple through the system).

It’s not in production yet, but it’s really promising 😁

A simple example

For this post I want to show something really practical, to make the point that a data contract does not have to be complex. Let’s take a customer_orders dataset as an example.
Here’s how we imagine defining its contract and linking it to dbt.

# contracts/customer_orders.yaml
dataset_name: customer_orders
owner_team: sales_data_engineering
owner_contact: sales-data-eng@company.com
description: >
  Contains all customer order records from the e-commerce platform. Updated daily at midnight UTC.
schema:
  - name: order_id
    type: STRING
    constraints:
      required: true
      unique: true
  - name: customer_id
    type: STRING
    constraints:
      required: true
  - name: order_date
    type: TIMESTAMP
    constraints:
      required: true
  - name: total_amount
    type: DECIMAL(10,2)
    constraints:
      required: true
      min: 0
  - name: status
    type: STRING
    constraints:
      allowed_values:
        - pending
        - shipped
        - completed
        - cancelled
update_frequency: daily
sla:
  freshness: 24h
  availability: 99.9%
quality_checks:
  - check: "No nulls in required fields"
  - check: "status is in allowed_values"

From here we add the dbt components: a source.yml (where we define customer_orders as a source), a model transformation, and dbt tests that implement the quality checks from the contract. The dbt DAG then shows the full transformation lineage, from the raw source tables, through clean models, to the models that feed downstream dashboards.
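As a rough sketch of the source definition and tests mentioned above, the rules in the contract could map to dbt like this. The file path, the source name `ecommerce` and the `raw` schema are assumptions made for illustration, and the `min: 0` rule relies on the `dbt_utils` package being installed, since dbt has no built-in range test. In dbt 1.8 and later the `tests:` key is spelled `data_tests:`, with `tests:` still accepted.

```yaml
# models/staging/sources.yml (hypothetical path)
version: 2

sources:
  - name: ecommerce                 # assumed source name, not part of the contract
    schema: raw                     # assumed schema where ingestion lands the table
    meta:                           # ownership copied from the contract
      owner_team: sales_data_engineering
      owner_contact: sales-data-eng@company.com
    tables:
      - name: customer_orders
        description: Customer order records from the e-commerce platform.
        columns:
          - name: order_id
            tests:                  # contract: required + unique
              - not_null
              - unique
          - name: customer_id
            tests:                  # contract: required
              - not_null
          - name: order_date
            tests:                  # contract: required
              - not_null
          - name: total_amount
            tests:
              - not_null            # contract: required
              - dbt_utils.accepted_range:   # contract: min 0 (needs dbt_utils)
                  min_value: 0
          - name: status
            tests:                  # contract: allowed_values
              - accepted_values:
                  values: ['pending', 'shipped', 'completed', 'cancelled']
```

Each test corresponds to one line of the contract, so a rule that exists in the contract but has no test here is simply not checked.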

Limitations of using dbt for lineage

While dbt is great at showing how datasets connect inside our transformation layer, it’s important to remember that it only covers what’s inside dbt. On its own, it won’t show the steps before and after: the ingestion pipelines that bring the data in, the BI dashboards that consume it, or the machine learning models that depend on it.
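One partial workaround inside dbt itself is an exposure, which lets us declare a downstream consumer by hand so that it appears at the end of the DAG. dbt does not discover these consumers automatically, so the entry is only as accurate as we keep it. In this sketch the dashboard name, the owning team and its email, and the `stg_customer_orders` model are all hypothetical.

```yaml
# models/exposures.yml (hypothetical path)
version: 2

exposures:
  - name: sales_orders_dashboard        # hypothetical dashboard
    type: dashboard                     # other types include ml, notebook, analysis, application
    maturity: low
    description: Daily order volume and revenue, built on customer_orders.
    depends_on:
      - ref('stg_customer_orders')      # hypothetical staging model over the source
    owner:
      name: BI team                     # hypothetical consuming team
      email: bi-team@company.com        # hypothetical contact
```

This covers the consumption side only. Ingestion pipelines still sit outside the graph.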

To get that full end-to-end view, we’d still need another tool, like OpenLineage or DataHub, to tie everything together (not explored yet). Another thing is that dbt won’t automatically enforce a data contract unless we explicitly write tests for every rule, and its freshness checks run on a schedule rather than in real time.
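To make the freshness point concrete, here is a minimal sketch of how the contract’s 24h freshness SLA could be declared on the source. The column `_loaded_at` is an assumption: the contract does not define a load timestamp, so the ingestion process would have to add one. The source name `ecommerce` is also assumed. Recent dbt versions prefer these keys nested under `config:`, so the exact placement depends on the version in use.

```yaml
# Fragment of a source definition (hypothetical names, see above)
version: 2

sources:
  - name: ecommerce
    tables:
      - name: customer_orders
        loaded_at_field: _loaded_at     # assumed load timestamp column
        freshness:
          # contract SLA: freshness 24h
          error_after: {count: 24, period: hour}
```

The check only runs when someone, or a scheduler, executes `dbt source freshness`, which is why a breach is detected at the next scheduled run and not at the moment the data goes stale.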

With all this in mind, dbt is more of a strong building block in a bigger lineage strategy than a complete solution by itself.
