Use data contracts to automate data workflows - part 1

Preface 📚
I’ve actually written a post on data contracts before, so have a quick scan here if you want to see a project I created on them using Python, AWS S3 and other libraries (like Selenium and Soda).
What is a contract? 🤔
Let’s first talk about what a contract is, and what problems it solves.
A contract is an agreement between at least two parties about what they’re going to do for each other.
In a workplace, this could be a written agreement between you and your employers on
what your role is,
how much you’ll be paid for it, and
what responsibilities you and your employer have
Contracts exist to set clear expectations from the very beginning. Both parties can discuss and agree on what’s expected before any agreement gets signed. Then once signed, this agreement becomes the single (and legal) source of truth the relationship depends on.
If there’s ever a misunderstanding or dispute, both sides can go back to the contract to figure out
who’s responsible for what
how to handle these issues
what the agreed outcome should be
If done right, a contract can be a tool for creating a fair relationship between parties built on trust.
So…what is a data contract? 📝
A data contract is like any other contract, but instead of people and jobs, it’s about data.
It’s an agreement between
the people who create the data (producers)
the people who use the data (consumers)
It explains
how the data should look
how the data should be delivered
who to talk to if something goes wrong
Problems before data contracts⚠️
Now let’s talk about what problems existed before data contracts
1. Miscommunication between producers and consumers
A producer might rename a column or change the data format without telling anyone
The consumer would assume the data is still in its original format, which results in broken workflows that depend on that particular format or naming structure
Example:
A producer changes a date column’s format from YYYY-MM-DD to DD-MM-YYYY in the SQL Server table, but the consumer’s pipeline only works with the original format. This could break a Power BI dashboard designed around the original format and lead to hours lost troubleshooting.
2. No single source of truth
When something went wrong with the data, it wasn’t clear
who to contact,
whether the issue was with the data or how it was used/interpreted
how to fix it quickly
3. Mismatched expectations
Teams often had different assumptions about the data, which led to
inaccurate reports
broken dashboards
time wasted fixing problems instead of creating value
Example:
A producer may assume it’s fine to leave null values in a critical field. However, the consumer’s system isn’t designed to handle nulls, so it fails.
This could result in:
- firefighting - fixing problems reactively instead of proactively
- lost productivity - spending time cleaning the data instead of using it
- reputational damage - teams losing trust in the data and the systems that build it
How a data contract solves these problems✅
A data contract sets clear expectations between the producers and consumers.
For data producers…
They know exactly
what the data should contain
the format and structure they need to follow
how often the data needs to be updated
For example,
The data must include these columns:
transaction_id (string)
created_date (date, YYYY-MM-DD format)
product_name (string)
amount (decimal, 2 decimal places)
For data consumers…
They know
what data to expect
where to find it
when + how it will be delivered
who to contact if there’s a problem
Example: The data must be delivered as a CSV file every Monday at 8am containing transactions from the previous week.
Content of a data contract📝
This defines the rules about the data itself.
1. Schema📐
The schema defines what the data should look like.
It includes
Column names
Data types (e.g. strings, integers, timestamps)
Here’s an example:
Column name | Data type | Example value
transaction_id | string | TXN123
created_date | date | 2024-12-17
product_name | string | iPhone 15 Pro
amount | decimal | 11.99
2. Constraints🔒
A constraint is a rule the values in each column must follow.
Here are a few examples
character limits - The product_name field must be less than 100 characters
valid ranges - The amount field must be a positive number smaller than 100
uniqueness - There must be no duplicate values in the transaction_id column
nullable - The amount field must not have any nulls
formatting - Date columns must use YYYY-MM-DD format
Having the right amount of constraints
ensures the data is accurate and clean
ensures the business logic is incorporated into the data
reduces the chance of errors in downstream reports and systems
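To make this less abstract, here’s a minimal sketch of how these constraints could be checked with pandas (the DataFrame below is just hypothetical sample data, not part of any real contract):

import pandas as pd

# Hypothetical batch of transactions to check against the contract's constraints
df = pd.DataFrame({
    "transaction_id": ["TXN123", "TXN124"],
    "created_date": ["2024-12-17", "2024-12-18"],
    "product_name": ["iPhone 15 Pro", "AirPods Pro"],
    "amount": [11.99, 24.50],
})

# Character limits - product_name must be less than 100 characters
assert df["product_name"].str.len().lt(100).all()

# Valid ranges - amount must be a positive number smaller than 100
assert df["amount"].between(0, 100, inclusive="neither").all()

# Uniqueness - no duplicate values in transaction_id
assert df["transaction_id"].is_unique

# Nullable - amount must not contain any nulls
assert df["amount"].notna().all()

# Formatting - created_date must follow YYYY-MM-DD (raises if it doesn't)
pd.to_datetime(df["created_date"], format="%Y-%m-%d")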
3. Delivery rules🚚
These define how, when and where the data should be delivered
It covers
Data sources - Where the data comes from e.g. databases, APIs
Transformations - How the data is cleaned + enriched (e.g. joins, filters, aggregations)
Destinations - where the data ends up e.g. dashboard, reports, storage
Delivery schedules - When and how the data is refreshed and scheduled e.g. daily, weekly
File format e.g. CSV, JSON, etc
Example:
The data pipeline drops a CSV file into an S3 bucket every Monday at 8am.
The file contains:
- transactions from the past week
- clean + validated data
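As a rough sketch, the delivery step for that example could look like this (assuming boto3, with the bucket name and key layout treated as placeholders rather than anything defined in this post):

from datetime import date

import boto3

# Placeholder bucket + key layout - swap in whatever the contract actually specifies
bucket = "fake-company-bucket"
key = f"sales/{date.today():%Y/%m/%d}/transactions.csv"

# The weekly job (scheduled for Mondays at 8am) writes last week's clean,
# validated transactions to a local CSV, then drops it into the S3 bucket
s3 = boto3.client("s3")
s3.upload_file("transactions.csv", bucket, key)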
4. Data validation 🧪
This defines the agreed rules the data must pass before it’s delivered
Validation checks include:
Schema validation - are the columns and data types correct?
Constraints validation - are there any duplicates or nulls?
Freshness - when was the last time the data was refreshed, and does it include all the records up to yesterday?
Row + column count - do the number of records and fields match the expected counts?
Business logic validation - do the calculated fields and metrics align with the expected business logic?
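Here’s a minimal sketch of what some of these checks might look like in a Python pipeline (pandas is assumed, and the expected columns and row threshold are hypothetical values a contract might define):

import pandas as pd

EXPECTED_COLUMNS = {"transaction_id", "created_date", "product_name", "amount"}
EXPECTED_MIN_ROWS = 1  # hypothetical threshold agreed in the contract

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of contract violations (an empty list means the data passes)."""
    errors = []

    # Schema validation - are the agreed columns present?
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        return [f"Missing columns: {sorted(missing)}"]

    # Constraint validation - duplicates and nulls
    if df["transaction_id"].duplicated().any():
        errors.append("Duplicate transaction_id values found")
    if df["amount"].isna().any():
        errors.append("Null values found in amount")

    # Freshness - does the data include records up to yesterday?
    latest = pd.to_datetime(df["created_date"]).max()
    if latest < pd.Timestamp.today().normalize() - pd.Timedelta(days=1):
        errors.append(f"Data looks stale - latest created_date is {latest.date()}")

    # Row count - does the batch meet the expected volume?
    if len(df) < EXPECTED_MIN_ROWS:
        errors.append(f"Only {len(df)} rows delivered, expected at least {EXPECTED_MIN_ROWS}")

    return errors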
5. Service level agreements (SLAs) 📈
This defines the performance expectations guaranteed to the end users
Examples:
Availability - Data pipeline must have an uptime of 99%
Latency - Data must be delivered within 15 minutes of the scheduled time, i.e. by 8:15am at the latest
Error rate - Less than 1% of records can contain errors
Completion time - Pipelines must not run for more than 45 mins
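These thresholds can be checked automatically too. Here’s a minimal sketch using hypothetical run metrics (the numbers are made up purely for illustration):

from datetime import datetime, timedelta

# Hypothetical metrics captured for a single pipeline run
scheduled_at = datetime(2024, 12, 16, 8, 0)
delivered_at = datetime(2024, 12, 16, 8, 9)
runtime = timedelta(minutes=38)
error_records, total_records = 42, 10_000

# Latency - was the data delivered within 15 minutes of the scheduled time?
assert delivered_at - scheduled_at <= timedelta(minutes=15), "Latency SLA breached"

# Error rate - do less than 1% of records contain errors?
assert error_records / total_records < 0.01, "Error rate SLA breached"

# Completion time - did the pipeline finish within 45 minutes?
assert runtime <= timedelta(minutes=45), "Completion time SLA breached"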
6. Ownership👤
This defines who’s responsible for different aspects of the data and who the consumers can contact for support
Producers - who creates/updates the data?
Consumers - who uses or relies on this data?
Support contacts - who should we reach out to when there are issues?
Example:
Primary owner
- Key personnel: Tom (senior data engineer)
- Responsibilities:
  - Checks pipeline runtime and data health every morning
  - Communicates schema changes to the Data Science and BI teams via Slack
Secondary owner
- Entity: Data Platforms team
- Responsibilities:
  - Monitors pipeline errors
  - Provides escalated support
Support contacts
- Communication channels:
  - Slack: #data-support
7. Change management policy🔄
This deals with how updates to the data contract (like schema changes + new constraints) are handled.
This covers key policies like
Notification period - producers must give one week’s notice before schema changes
Approval process - Changes must be approved by appointed stakeholders
Versioning - All changes to the data contract must include version numbers (e.g. v1.1 to v1.2)
Deprecation policy - Consumers must be given a grace period to adapt to changes made
Retirement policy - There should be a clear process for retiring columns no longer needed by downstream systems
Why does this work?🌟
1. Transparency
Everyone knows what is expected
2. Accountability
The producers and consumers have a clear understanding of their responsibilities
3. Efficiency
Issues are easier to resolve quickly because everyone is on the same page
Formats of a data contract 📂
A data contract can come in different formats; here are some to try out:
A piece of paper 📝
Use this for simple agreements that don’t need to be automated or scaled. (Ideal for individuals + small teams.)
Pros -
Easy and quick to create
Easy for everyone to understand
Cons -
Not scalable
Easy to lose and overlook
Hard to version control or track changes
Can’t be automated or integrated into workflows
It can be something as simple as this:
Data Contract Agreement
Producer: Data Engineering Team
Consumer: Business Intelligence Team
Data Schema:
- transaction_id (string)
- created_date (YYYY-MM-DD)
- product_name (string)
- amount (decimal)
Delivery Rules:
- Delivered as a CSV file
- Frequency: Weekly (Every Monday at 8:00 AM)
- Delivery Location: s3://fake-company-bucket/sales/YYYY/MM/DD/*.csv
Signed:
Producer: ___________________
Consumer: ___________________
Date: ___________________
CSV 📄
Best for tabular data when teams know tools like Excel and Python
Pros -
Supported by Excel
Easy to parse programmatically with Python
Familiar and straightforward for most users
Cons -
Only works for tabular data
Hard to enforce strict data validation
Hard to manage nested data and complex schema
Risk of parsing errors if data contains commas or special characters
Here’s an example:
Column Name,Data Type,Constraints
transaction_id,string,Unique
created_date,date,YYYY-MM-DD Format
product_name,string,Max 100 Characters
amount,decimal,Positive Numbers Only
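Because the contract is plain CSV, it’s easy to read back programmatically. Here’s a minimal sketch (assuming the table above is saved as a hypothetical contract.csv):

import csv

# Load the contract into a dict of column name -> (data type, constraints)
with open("contract.csv", newline="") as f:
    contract = {
        row["Column Name"]: (row["Data Type"], row["Constraints"])
        for row in csv.DictReader(f)
    }

print(contract["transaction_id"])  # ('string', 'Unique')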
JSON 🔷
For modern systems and APIs working with nested + hierarchical data
Pros -
Supports complex, nested data
Easy for humans and machines to read
Compatible with most modern tools, APIs and systems
Cons -
Hard to read for non-technical users
Gets harder to read once complexity grows
Here’s an example:
{
  "schema": {
    "transaction_id": "string",
    "created_date": "date (YYYY-MM-DD)",
    "product_name": "string",
    "amount": "decimal"
  },
  "delivery": {
    "frequency": "weekly",
    "delivery_time": "Monday at 8:00 AM",
    "location": "s3://fake-company-bucket/sales/YYYY/MM/DD/*.csv"
  }
}
YAML 🛠️
A cleaner alternative to JSON for configuration-driven workflows or teams familiar with DevOps practices
Pros -
Easy to read, even for non-technical users
Cleaner syntax than JSON, especially for nested data
Easy to edit manually
Cons -
Prone to errors if indentations are not correct
Not as widely supported as JSON in some ecosystems
Example:
schema:
  transaction_id: string
  created_date: date (YYYY-MM-DD)
  product_name: string
  amount: decimal
delivery:
  frequency: weekly
  delivery_time: Monday at 8:00 AM
  location: s3://fake-company-bucket/sales/YYYY/MM/DD/*.csv
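Here’s a minimal sketch of how a consumer could load that YAML contract and check an incoming file against its schema (assuming PyYAML and pandas, with hypothetical file names):

import pandas as pd
import yaml

# Load the contract (hypothetical file name)
with open("data_contract.yaml") as f:
    contract = yaml.safe_load(f)

# Check the delivered file has exactly the columns declared in the contract
df = pd.read_csv("transactions.csv")
expected = set(contract["schema"])
actual = set(df.columns)

if expected != actual:
    raise ValueError(
        f"Schema mismatch - missing: {expected - actual}, unexpected: {actual - expected}"
    )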
Protobuf (Protocol Buffers)⚙️
If you want to serialize large data efficiently with low latency and strict schema enforcement
Pros -
Enforces strict schema validation
Optimized for speed and efficient/compact data storage
Great for machine-to-machine communication, especially in distributed systems
Cons -
Needs special tools to parse or edit
May be difficult for non-technical users to read
syntax = "proto3";

message Transaction {
  string transaction_id = 1;
  string created_date = 2;  // Format: YYYY-MM-DD
  string product_name = 3;
  double amount = 4;        // Must be positive
}
Avro🗂️
For teams with big data workflows where schema enforcement and compatibility matter
Pros -
Efficiently serializes data
Supports schema evolution (forward and backward compatibility)
Works great with data tools like Hadoop and Hive
Cons -
Can be hard for non-technical users to read
Needs special tools to edit and view
Avro schemas are defined using JSON syntax. Here’s an example of one:
{
  "type": "record",
  "name": "Transaction",
  "fields": [
    { "name": "transaction_id", "type": "string" },
    { "name": "created_date", "type": { "type": "int", "logicalType": "date" } },
    { "name": "product_name", "type": "string" },
    { "name": "amount", "type": "double" }
  ]
}
Open table formats (e.g. Delta, Iceberg)❄️
If you want your data to be version-controlled and used by modern data platforms
Pros -
Supports ACID transactions and schema evolution
Can integrate with big data tools (like Databricks, Snowflake)
Can version control and upsert data to support SCD type 2 use cases
Cons -
Needs specialized ecosystems to work well (like Databricks, Snowflake, etc.)
Can be complex to set up and maintain from scratch
If you’re in Databricks for example, you can create something like this:
from delta.tables import DeltaTable
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

# --- 1. Define the schema (nullable=False enforces NOT NULL on every column)
schema = StructType([
    StructField("transaction_id", StringType(), False),
    StructField("created_date", DateType(), False),
    StructField("product_name", StringType(), False),
    StructField("amount", DoubleType(), False)
])

# --- 2. Create the Delta table with the schema (if it doesn't already exist)
DeltaTable.createIfNotExists(spark) \
    .tableName("transactions") \
    .addColumns(schema) \
    .location("s3://fake-company-bucket/sales/") \
    .execute()
There’s no right or wrong format to use - it all depends on
how complex the schema is
how it will be used (human-readable vs machine-readable)
the ecosystem it works in (e.g. big data tools, APIs, legacy systems)
Each format carries a trade-off, so select the one that makes the most sense for your team and your end users’ needs.
When you should NOT use a data contract🚫
You may not need a data contract if
1. The overhead outweighs the benefits
If the effort to set up and maintain a data contract is greater than the value it brings, it may not be worth investing in.
For example,
two teams collaborating on a one-off data exchange that has no future use
a small team of 3-5 people where quick, direct communication is more efficient than formal agreements
2. The same team produces and consumes the data
If the same team is responsible for creating and using the data, the expectations and responsibilities may already be well understood internally, so there’s no need for the extra overhead.
For example,
- a data science team that generates data for its own machine learning models
3. The data is used in non-critical scenarios
If the data is used for low-stakes purposes where mistakes have little consequence, a data contract wouldn’t add much value here either.
For example,
- a dashboard created for a brainstorming or exploratory session
4. Speed + agility matters more
In situations where quick iteration is prioritized over long-term reliability, data contracts may slow down progress
For example,
- during the early stages of prototyping and experimenting with a new idea, the goal is to validate it quickly rather than to ensure the data is accurate
Conclusion 🔚
In this post, we introduced the idea of data contracts and how they solve common problems in data workflows.
In part 2, we’ll walk through a real example of a Python data pipeline that uses data contracts to validate and transform data across 3 AWS S3 buckets (bronze, silver, gold) and then uploads the final version to a Postgres database.
Written by
Stephen David-Williams
I am a developer with 5+ years of data engineering experience in the financial + professional services sector. Feel free to drop a message for any questions! https://medium.com/@sdw-online