Use data contracts to automate data workflows - part 1

Preface 📚

I’ve actually written a post on data contracts before, so have a quick scan here if you want to see a project I created on them using Python, AWS S3 and libraries like Selenium and Soda.

What is a contract? 🤔

Let’s first talk about what a contract is, and what problems it solves.

A contract is an agreement between at least two parties about what they’re going to do for each other.

In a workplace, this could be a written agreement between you and your employer on

  • what your role is,

  • how much you’ll be paid for it, and

  • what responsibilities you and your employer have

Contracts exist to set clear expectations from the very beginning. Both parties can discuss and agree on what’s expected before the agreement gets signed. Then once signed, this agreement becomes the single (and legal) source of truth the relationship depends on.

If there’s ever a misunderstanding or dispute, both sides can go back to the contract to figure out

  • who’s responsible for what

  • how to handle these issues

  • what the agreed outcome should be

If done right, a contract can be a tool for creating a fair relationship between parties built on trust.

So…what is a data contract? 📝

A data contract is like any other contract, but instead of covering people and jobs, it’s about data.

It’s an agreement between

  • the people who create the data (producers)

  • the people who use the data (consumers)

It explains

  • how the data should look

  • how the data should be delivered

  • who to talk to if something goes wrong

Problems before data contracts⚠️

Now let’s talk about what problems existed before data contracts

1. Miscommunication between producers and consumers

  • A producer might rename a column or change the data format without telling anyone

  • The consumer still assumes the data is in its original format, which breaks workflows that depend on that particular format or naming structure

Example:

A producer changes a date column’s format from YYYY-MM-DD to DD-MM-YYYY in a SQL Server table, but the consumer’s pipeline only works with the original format. This could break a Power BI dashboard built around that format and lead to hours lost to troubleshooting.

2. No single source of truth

When something went wrong with the data, it wasn’t clear

  • who to contact,

  • whether the issue was with the data or how it was used/interpreted

  • how to fix it quickly

3. Mismatched expectations

Teams often had different assumptions about the data, which led to

  • inaccurate reports

  • broken dashboards

  • time wasted fixing problems instead of creating value

Example:

A producer may assume it’s fine to leave null values in a critical field. However, the consumer’s system isn’t designed to handle nulls, so it fails.

This could result in:

  • firefighting - fixing problems reactively instead of proactively
  • lost productivity - spending time cleaning the data instead of using it
  • reputational damage - teams losing trust in the data and the systems that build it

How a data contract solves these problems✅

A data contract sets clear expectations between the producers and consumers.

For data producers…

They know exactly

  • what the data should contain

  • the format and structure they need to follow

  • how often the data needs to be updated

For example,

The data must include these columns:

  • transaction_id (string)

  • created_date (date, YYYY-MM-DD format)

  • product_name (string)

  • amount (decimal, 2 decimal places)

For data consumers…

They know

  • what data to expect

  • where to find it

  • when + how it will be delivered

  • who to contact if there’s a problem

Example: The data must be delivered as a CSV file every Monday at 8am containing transactions from the previous week.

Content of a data contract📝

This section covers what a typical data contract defines, starting with the rules about the data itself.

1. Schema📐

The schema defines what the data should look like.

It includes

  • Column names

  • Data types (e.g. strings, integers, timestamps)

Here’s an example:

Column name       Data type   Example value
transaction_id    string      TXN123
created_date      date        2024-12-17
product_name      string      iPhone 15 Pro
amount            decimal     11.99

2. Constraints🔒

A constraint is a rule the values in each column must follow.

Here are a few examples

  • character limits - The product_name field must be less than 100 characters

  • valid ranges - The amount field must be a positive number smaller than 100

  • uniqueness - There must be no duplicate values in the transaction_id column

  • nullability - The amount field must not contain any nulls

  • formatting - Date columns must use YYYY-MM-DD format
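
To make these concrete, here’s a minimal sketch of what checking these constraints could look like in Python, assuming pandas and a hypothetical transactions.csv with the example columns used throughout this post:

import pandas as pd

# Hypothetical file with the example columns (transaction_id, created_date, product_name, amount)
df = pd.read_csv("transactions.csv")

checks = {
    "product_name under 100 characters": df["product_name"].str.len().lt(100).all(),
    "amount positive and below 100": df["amount"].between(0, 100, inclusive="neither").all(),
    "transaction_id has no duplicates": df["transaction_id"].is_unique,
    "amount has no nulls": df["amount"].notna().all(),
    "created_date uses YYYY-MM-DD": pd.to_datetime(df["created_date"], format="%Y-%m-%d", errors="coerce").notna().all(),
}

failed = [name for name, passed in checks.items() if not passed]
if failed:
    raise ValueError(f"Constraint violations: {failed}")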

Having the right amount of constraints

  • ensures the data is accurate and clean

  • ensures the business logic is incorporated into the data

  • reduces the chances of errors found in downstream reports and systems

3. Delivery rules🚚

These define how, when and where the data should be delivered

It covers

  • Data sources - Where the data comes from e.g. databases, APIs

  • Transformations - How the data is cleaned + enriched (e.g. joins, filters, aggregations)

  • Destinations - where the data ends up e.g. dashboard, reports, storage

  • Delivery schedules - When and how the data is refreshed and scheduled e.g. daily, weekly

  • File format e.g. CSV, JSON, etc

Example:

The data pipeline drops a CSV file into an S3 bucket every Monday at 8am.

The file contains:

  • transactions from the past week
  • clean + validated data
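
As a rough illustration (assuming the boto3 library and the fictional bucket above), a consumer could check that the weekly file has actually landed in the agreed location:

import boto3

# Check that this week's file exists in the agreed S3 location
s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="fake-company-bucket",
    Prefix="sales/2024/12/16/",  # illustrative Monday partition
)

if response.get("KeyCount", 0) == 0:
    print("No file delivered yet - contact the producer named in the contract")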

4. Data validation 🧪

This defines the agreed rules the data must pass before it’s delivered

Validation checks include:

  • Schema validation - are the columns and data types correct?

  • Constraints validation - are there any duplicates or nulls?

  • Freshness - when was the last time the data was refreshed, and does it include all the records up to yesterday?

  • Row + column count - do the numbers of records and fields match the expected counts?

  • Business logic validation - do the calculated fields and metrics align with the expected business logic?
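
Here’s a minimal sketch of what a pre-delivery validation step could look like, assuming pandas, the example schema from earlier and an illustrative minimum row count (not taken from the contract):

import pandas as pd
from datetime import date, timedelta

EXPECTED_COLUMNS = ["transaction_id", "created_date", "product_name", "amount"]
MIN_EXPECTED_ROWS = 1_000  # illustrative threshold

df = pd.read_csv("transactions.csv", parse_dates=["created_date"])

# Schema validation - are the columns correct?
assert list(df.columns) == EXPECTED_COLUMNS, "Columns don't match the contract"

# Constraint validation - any duplicates or nulls?
assert df["transaction_id"].is_unique and df["amount"].notna().all(), "Duplicates or nulls found"

# Freshness - does the data include records up to yesterday?
assert df["created_date"].max().date() >= date.today() - timedelta(days=1), "Data looks stale"

# Row count - does the number of records match expectations?
assert len(df) >= MIN_EXPECTED_ROWS, "Fewer rows than expected"

# Business logic validation - e.g. weekly revenue should never be negative
assert df["amount"].sum() >= 0, "Negative weekly revenue"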

5. Service level agreements (SLAs) 📈

This defines the performance expectations we guarantee to end users

Examples:

  • Availability - Data pipeline must have an uptime of 99%

  • Latency - Data must be delivered within 15 minutes of the scheduled time, i.e. by 8:15am at the latest

  • Error rate - Less than 1% of records can contain errors

  • Completion time - Pipelines must not run for more than 45 mins
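
As a hedged sketch (with made-up numbers and a hypothetical delivery timestamp), monitoring code could compare an actual run against these SLAs:

from datetime import datetime, time

# Illustrative values - in practice these would come from pipeline logs/metrics
delivered_at = datetime(2024, 12, 16, 8, 9)   # when the file actually landed
runtime_minutes = 38                          # how long the pipeline ran
total_records, bad_records = 50_000, 120

sla_breaches = []
if delivered_at.time() > time(8, 15):
    sla_breaches.append("Latency: delivered after 8:15am")
if bad_records / total_records >= 0.01:
    sla_breaches.append("Error rate: 1% or more of records contain errors")
if runtime_minutes > 45:
    sla_breaches.append("Completion time: pipeline ran for more than 45 mins")

if sla_breaches:
    print("SLA breached:", sla_breaches)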

6. Ownership👤

This defines who’s responsible for different aspects of the data and who the consumers can contact for support

  • Producers - who creates/updates the data?

  • Consumers - who uses or relies on this data?

  • Support contacts - who should we reach out to when there are issues?

Example

  • Primary Owner:

    • Key Personnel: Tom (senior data engineer)

    • Responsibilities:

      • Checks pipeline runtime and data health checks every morning

      • Communicates schema changes to Data Science and BI teams via Slack

  • Secondary Owner:

    • Entity: Data Platforms team

    • Responsibilities:

      • Monitors pipeline errors

      • Provides escalated support

  • Support contacts

7. Change management policy🔄

This deals with how updates to the data contract (like schema changes + new constraints) are handled.

This covers key policies like

  • Notification period - producers must give one week’s notice before schema changes

  • Approval process - Changes must be approved by appointed stakeholders

  • Versioning - All changes to the data contracts must include version numbers (v1.1 to v1.2)

  • Deprecation policy - Consumers must be given a grace period to adapt to changes made

  • Retirement policy - There should be a clear process for retiring columns no longer needed by downstream systems
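
For illustration, a simple script (using two made-up schema versions) could flag breaking changes before they’re rolled out:

# Hypothetical v1.1 and v1.2 schemas from the data contract
old_schema = {"transaction_id": "string", "created_date": "date",
              "product_name": "string", "amount": "decimal"}
new_schema = {"transaction_id": "string", "created_date": "date",
              "amount": "decimal", "discount": "decimal"}

removed = set(old_schema) - set(new_schema)
retyped = {c for c in old_schema.keys() & new_schema.keys() if old_schema[c] != new_schema[c]}

if removed or retyped:
    print(f"Breaking change detected - removed: {removed}, retyped: {retyped}")
    print("Notify consumers a week in advance and agree on a deprecation window")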

Why does this work?🌟

1. Transparency

Everyone knows what is expected

2. Accountability

The producers and consumers have a clear understanding of their responsibilities

3. Efficiency

Issues are easier to resolve quickly because everyone is on the same page

Formats of a data contract📂

A data contract can come in different formats. Here are some to try out:

A piece of paper 📝

Use if you care about simple agreements that do not need to be automated or scaled. (Ideal for individuals + small teams.)

  • Pros -

    • Easy and quick to create

    • Easy for everyone to understand

  • Cons -

    • Not scalable

    • Easy to lose and overlook

    • Hard to version control or track changes

    • Can’t be automated or integrated into workflows

It can look as simple as this:

Data Contract Agreement

Producer: Data Engineering Team
Consumer: Business Intelligence Team

Data Schema:
- transaction_id (string)
- created_date (YYYY-MM-DD)
- product_name (string)
- amount (decimal)

Delivery Rules:
- Delivered as a CSV file
- Frequency: Weekly (Every Monday at 8:00 AM)
- Delivery Location: s3://fake-company-bucket/sales/YYYY/MM/DD/*.csv

Signed:
Producer: ___________________
Consumer: ___________________
Date: ___________________

CSV 📄

Best for tabular data when teams are comfortable with tools like Excel and Python

  • Pros -

    • Supported by Excel

    • Easy to parse programmatically with Python

    • Familiar and straightforward for most users

  • Cons -

    • Only works for tabular data

    • Hard to enforce strict data validation

    • Hard to manage nested data and complex schema

    • Risk of parsing errors if data contains commas or special characters

Here’s an example:

Column Name,Data Type,Constraints
transaction_id,string,Unique
created_date,date,YYYY-MM-DD Format
product_name,string,Max 100 Characters
amount,decimal,Positive Numbers Only
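
For example, here’s a minimal sketch of reading that contract with Python’s built-in csv module (assuming it’s saved as contract.csv):

import csv

# Load the contract into a lookup keyed by column name
with open("contract.csv", newline="") as f:
    contract = {row["Column Name"]: row for row in csv.DictReader(f)}

print(contract["amount"]["Data Type"])     # decimal
print(contract["amount"]["Constraints"])   # Positive Numbers Only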

JSON 🔷

For modern systems and APIs working with nested + hierarchical data

  • Pros -

    • Supports complex, nested data

    • Easy for humans and machines to read

    • Compatible with most modern tools, APIs and systems

  • Cons -

    • Hard to read for non-technical users

    • Gets harder to read once complexity grows

Here’s an example:

{
  "schema": {
    "transaction_id": "string",
    "created_date": "date (YYYY-MM-DD)",
    "product_name": "string",
    "amount": "decimal"
  },
  "delivery": {
    "frequency": "weekly",
    "delivery_time": "Monday at 8:00 AM",
    "location": "s3://fake-company-bucket/sales/YYYY/MM/DD/*.csv"
  }
}
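
A quick sketch of how a pipeline could read this contract (assuming it’s saved as contract.json):

import json

with open("contract.json") as f:
    contract = json.load(f)

# Pull the expected columns and delivery details straight from the contract
expected_columns = list(contract["schema"].keys())
print(expected_columns)                    # ['transaction_id', 'created_date', 'product_name', 'amount']
print(contract["delivery"]["frequency"])   # weekly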

YAML 🛠️

A cleaner alternative to JSON for configuration-driven workflows or teams familiar with DevOps practices

  • Pros -

    • Easy to read, even for non-technical users

    • Cleaner syntax than JSON, especially for nested data

    • Easy to edit manually

  • Cons -

    • Prone to errors if indentations are not correct

    • Not as widely supported as JSON in some ecosystems

Example:

schema:
  transaction_id: string
  created_date: date (YYYY-MM-DD)
  product_name: string
  amount: decimal

delivery:
  frequency: weekly
  delivery_time: Monday at 8:00 AM
  location: s3://fake-company-bucket/sales/YYYY/MM/DD/*.csv
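
Loading it in Python is just as straightforward, assuming the PyYAML library and a file called contract.yaml:

import yaml  # PyYAML

with open("contract.yaml") as f:
    contract = yaml.safe_load(f)

# Parses into the same dictionary structure as the JSON version
print(contract["delivery"]["location"])   # s3://fake-company-bucket/sales/YYYY/MM/DD/*.csv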

Protobuf (Protocol Buffers)⚙️

If you want to serialize large data efficiently with low latency and strict schema enforcement

  • Pros -

    • Enforces strict schema validation

    • Optimized for speed and efficient/compact data storage

    • Great for machine-to-machine communication, especially in distributed systems

  • Cons -

    • Needs special tools to parse or edit

    • May be difficult for non-technical users to read

syntax = "proto3";

message Transaction {
  string transaction_id = 1;
  string created_date = 2; // Format: YYYY-MM-DD
  string product_name = 3;
  double amount = 4; // Must be positive
}
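
As a rough sketch, if the definition above were saved as transaction.proto and compiled with protoc --python_out=. transaction.proto, the generated class could be used like this:

from transaction_pb2 import Transaction  # module generated by protoc

txn = Transaction(
    transaction_id="TXN123",
    created_date="2024-12-17",
    product_name="iPhone 15 Pro",
    amount=11.99,
)

payload = txn.SerializeToString()         # compact binary, ideal for machine-to-machine transfer
decoded = Transaction.FromString(payload)
print(decoded.amount)                     # 11.99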

Avro🗂️

For teams with big data workflows where schema enforcement and compatibility matter

  • Pros -

    • Efficiently serializes data

    • Supports schema evolution (forward and backward compatibility)

    • Works great with data tools like Hadoop and Hive

  • Cons -

    • Can be hard for non-technical users to read

    • Needs special tools to edit and view

Avro schemas are defined using JSON syntax (the data itself is stored in a compact binary format). Here’s an example schema:

{
  "type": "record",
  "name": "Transaction",
  "fields": [
    { "name": "transaction_id", "type": "string" },
    { "name": "created_date", "type": { "type": "string", "logicalType": "date" } },
    { "name": "product_name", "type": "string" },
    { "name": "amount", "type": "double" }
  ]
}
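
For illustration, assuming the schema above is saved as transaction.avsc and the fastavro library is installed, records could be written (and schema-checked) like this:

import json
from datetime import date
from fastavro import parse_schema, writer

with open("transaction.avsc") as f:
    schema = parse_schema(json.load(f))

records = [{
    "transaction_id": "TXN123",
    "created_date": date(2024, 12, 17),   # the date logical type maps to a Python date
    "product_name": "iPhone 15 Pro",
    "amount": 11.99,
}]

# Writing enforces the schema - records that don't match it raise an error
with open("transactions.avro", "wb") as out:
    writer(out, schema, records)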

Open table formats (e.g. Delta, Iceberg)❄️

If you want your data to be version-controlled and used by modern data platforms

  • Pros -

    • Supports ACID transactions and schema evolution

    • Can integrate with big data tools (like Databricks, Snowflake)

    • Can version control and upsert data to support SCD type 2 use cases

  • Cons -

    • Needs specialized ecosystems to work well (like Databricks, Snowflake, etc.)

    • Can be complex to set up and maintain from scratch

If you’re in Databricks for example, you can create something like this:

from delta.tables import DeltaTable
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

# --- 1. Define the schema
schema = StructType([
    StructField("transaction_id", StringType(), False),
    StructField("created_date", DateType(), False),
    StructField("product_name", StringType(), False),
    StructField("amount", DoubleType(), False)
])

# --- 2. Create the Delta table with the contract's schema (if it doesn't already exist)
DeltaTable.createIfNotExists(spark) \
    .tableName("transactions") \
    .addColumns(schema) \
    .location("s3://fake-company-bucket/sales/") \
    .execute()

There’s no right or wrong format to use - it all depends on

  • how complex the schema is

  • how it will be used (human-readable vs machine-readable)

  • the ecosystem it works in (e.g. big data tools, APIs, legacy system)

Each format carries a trade-off, so select the one that makes the most sense for your team’s and end users’ needs.

When you should NOT use a data contract🚫

You may not need a data contract if

1. The overhead outweighs the benefits

If the effort to set up and maintain a data contract is greater than the value it brings, it may not be worth investing in.

For example,

  • two teams collaborating on a one-off data exchange that has no future use

  • a small team of 3-5 people, where quick, direct communication is more efficient than formal agreements

2. The same team produces and consumes the data

If the same team is responsible for creating and using the data, the expectations and responsibilities may already be well understood internally, so there’s no need for the extra overhead.

For example,

  • a data science team that generates data for its own machine learning models

3. The data is used in non-critical scenarios

If the data is used for low-stakes purposes where mistakes don’t have serious consequences, a data contract adds little value.

For example,

  • a dashboard created for a brainstorming or exploratory session

4. Speed + agility matters more

In situations where quick iteration is prioritized over long-term reliability, data contracts may slow down progress

For example,

  • during the early stages of prototyping and experimenting with a new idea, the goal is to validate the idea quickly rather than to guarantee the data is accurate

Conclusion 🔚

In this post, we introduced the idea of data contracts and how they solve common problems in data workflows.

In part 2, we’ll walk through a real example of a Python data pipeline that uses data contracts to validate and transform data across 3 AWS S3 buckets (bronze, silver, gold) and then uploads the final version to a Postgres database.

