Use data contracts to automate data workflows - part 1

Preface 📚

I’ve actually written a post on data contracts before, so have a quick scan here if you want to see a project I created on them using Python, AWS S3 and libraries like Selenium and Soda.

What is a contract? 🤔

Let’s first talk about what a contract is, and what problems it solves.

A contract is an agreement between at least two parties about what they’re going to do for each other.

In a workplace, this could be a written agreement between you and your employer on

  • what your role is,

  • how much you’ll be paid for it, and

  • what responsibilities you and your employer have

Contracts exist to set clear expectations from the very beginning. Both parties can discuss and agree on what’s expected before the agreement gets signed. Then once signed, this agreement becomes the single (and legal) source of truth the relationship depends on.

If there’s ever a misunderstanding or dispute, both sides can go back to the contract to figure out

  • who’s responsible for what

  • how to handle these issues

  • what the agreed outcome should be

If done right, a contract can be a tool for creating a fair relationship between parties built on trust.

So…what is a data contract? 📝

A data contract is like any other contract, but instead of covering people and jobs, it’s about data.

It’s an agreement between

  • the people who create the data (producers)

  • the people who use the data (consumers)

It explains

  • how the data should look

  • how the data should be delivered

  • who to talk to if something goes wrong

Problems before data contracts⚠️

Now let’s talk about what problems existed before data contracts

1. Miscommunication between producers and consumers

  • A producer might rename a column or change the data format without telling anyone

  • The consumer still assumes the data is in its original format, which breaks workflows that depend on that particular format or naming structure

Example:

A producer changes a date column’s format from YYYY-MM-DD to DD-MM-YYYY in a SQL Server table, but the consumer’s pipeline only works with the original format. This could break a Power BI dashboard built around that format and lead to hours lost to troubleshooting.

2. No single source of truth

When something went wrong with the data, it wasn’t clear

  • who to contact,

  • whether the issue was with the data or how it was used/interpreted

  • how to fix it quickly

3. Mismatched expectations

Teams often had different assumptions about the data, which led to

  • inaccurate reports

  • broken dashboards

  • time wasted fixing problems instead of creating value

Example:

A producer may assume it’s fine to leave null values in a critical field. However, the consumer’s system isn’t designed to handle nulls, so it fails.

This could result in:

  • firefighting - fixing problems reactively instead of proactively
  • lost productivity - spending time cleaning the data instead of using it
  • reputational damage - teams losing trust in the data and the systems that build it

How a data contract solves these problems✅

A data contract sets clear expectations between the producers and consumers.

For data producers…

They know exactly

  • what the data should contain

  • the format and structure they need to follow

  • how often the data needs to be updated

For example,

The data must include these columns:

  • transaction_id (string)

  • created_date (date, YYYY-MM-DD format)

  • product_name (string)

  • amount (decimal, 2 decimal places)

For data consumers…

They know

  • what data to expect

  • where to find it

  • when + how it will be delivered

  • who to contact if there’s a problem

Example: The data must be delivered as a CSV file every Monday at 8am containing transactions from the previous week.

Content of a data contract📝

This section covers what a typical data contract defines, starting with the rules about the data itself.

1. Schema📐

The schema defines what the data should look like.

It includes

  • Column names

  • Data types (e.g. strings, integers, timestamps)

Here’s an example:

Column name       Data type   Example value
transaction_id    string      TXN123
created_date      date        2024-12-17
product_name      string      iPhone 15 Pro
amount            decimal     11.99

2. Constraints🔒

A constraint is a rule the values in each column must follow.

Here are a few examples

  • character limits - The product_name field must be less than 100 characters

  • valid ranges - The amount field must be a positive number smaller than 100

  • uniqueness - There must be no duplicate values in the transaction_id column

  • nullability - The amount field must not contain any nulls

  • formatting - Date columns must use YYYY-MM-DD format
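
To make these concrete, here’s a minimal sketch of what checking these constraints could look like in Python, assuming pandas and a hypothetical transactions.csv with the example columns used throughout this post:

import pandas as pd

# Hypothetical file with the example columns (transaction_id, created_date, product_name, amount)
df = pd.read_csv("transactions.csv")

checks = {
    "product_name under 100 characters": df["product_name"].str.len().lt(100).all(),
    "amount positive and below 100": df["amount"].between(0, 100, inclusive="neither").all(),
    "transaction_id has no duplicates": df["transaction_id"].is_unique,
    "amount has no nulls": df["amount"].notna().all(),
    "created_date uses YYYY-MM-DD": pd.to_datetime(df["created_date"], format="%Y-%m-%d", errors="coerce").notna().all(),
}

failed = [name for name, passed in checks.items() if not passed]
if failed:
    raise ValueError(f"Constraint violations: {failed}")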

Having the right amount of constraints

  • ensures the data is accurate and clean

  • ensures the business logic is incorporated into the data

  • reduces the chances of errors found in downstream reports and systems

3. Delivery rules🚚

These define how, when and where the data should be delivered

It covers

  • Data sources - Where the data comes from e.g. databases, APIs

  • Transformations - How the data is cleaned + enriched (e.g. joins, filters, aggregations)

  • Destinations - where the data ends up e.g. dashboard, reports, storage

  • Delivery schedules - When and how the data is refreshed and scheduled e.g. daily, weekly

  • File format e.g. CSV, JSON, etc

Example:

The data pipeline drops a CSV file into an S3 bucket every Monday at 8am.

The file contains:

  • transactions from the past week
  • clean + validated data
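
As a rough illustration (assuming the boto3 library and the fictional bucket above), a consumer could check that the weekly file has actually landed in the agreed location:

import boto3

# Check that this week's file exists in the agreed S3 location
s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="fake-company-bucket",
    Prefix="sales/2024/12/16/",  # illustrative Monday partition
)

if response.get("KeyCount", 0) == 0:
    print("No file delivered yet - contact the producer named in the contract")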

4. Data validation 🧪

This defines the agreed rules the data must pass before it’s delivered

Validation checks include:

  • Schema validation - are the columns and data types correct?

  • Constraints validation - are there any duplicates or nulls?

  • Freshness - when was the last time the data was refreshed, and does it include all the records up to yesterday?

  • Row + column count - do the numbers of records and fields match the expected counts?

  • Business logic validation - do the calculated fields and metrics align with the expected business logic?
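
Here’s a minimal sketch of what a pre-delivery validation step could look like, assuming pandas, the example schema from earlier and an illustrative minimum row count (not taken from the contract):

import pandas as pd
from datetime import date, timedelta

EXPECTED_COLUMNS = ["transaction_id", "created_date", "product_name", "amount"]
MIN_EXPECTED_ROWS = 1_000  # illustrative threshold

df = pd.read_csv("transactions.csv", parse_dates=["created_date"])

# Schema validation - are the columns correct?
assert list(df.columns) == EXPECTED_COLUMNS, "Columns don't match the contract"

# Constraint validation - any duplicates or nulls?
assert df["transaction_id"].is_unique and df["amount"].notna().all(), "Duplicates or nulls found"

# Freshness - does the data include records up to yesterday?
assert df["created_date"].max().date() >= date.today() - timedelta(days=1), "Data looks stale"

# Row count - does the number of records match expectations?
assert len(df) >= MIN_EXPECTED_ROWS, "Fewer rows than expected"

# Business logic validation - e.g. weekly revenue should never be negative
assert df["amount"].sum() >= 0, "Negative weekly revenue"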

5. Service level agreements (SLAs) 📈

This defines the performance expectations we guarantee to end users

Examples:

  • Availability - Data pipeline must have an uptime of 99%

  • Latency - Data must be delivered within 15 minutes of the scheduled time, i.e. by 8:15am at the latest

  • Error rate - Less than 1% of records can contain errors

  • Completion time - Pipelines must not run for more than 45 mins
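
As a hedged sketch (with made-up numbers and a hypothetical delivery timestamp), monitoring code could compare an actual run against these SLAs:

from datetime import datetime, time

# Illustrative values - in practice these would come from pipeline logs/metrics
delivered_at = datetime(2024, 12, 16, 8, 9)   # when the file actually landed
runtime_minutes = 38                          # how long the pipeline ran
total_records, bad_records = 50_000, 120

sla_breaches = []
if delivered_at.time() > time(8, 15):
    sla_breaches.append("Latency: delivered after 8:15am")
if bad_records / total_records >= 0.01:
    sla_breaches.append("Error rate: 1% or more of records contain errors")
if runtime_minutes > 45:
    sla_breaches.append("Completion time: pipeline ran for more than 45 mins")

if sla_breaches:
    print("SLA breached:", sla_breaches)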

6. Ownership👤

This defines who’s responsible for different aspects of the data and who the consumers can contact for support

  • Producers - who creates/updates the data?

  • Consumers - who uses or relies on this data?

  • Support contacts - who should we reach out to when there are issues?

Example

  • Primary Owner:

    • Key Personnel: Tom (senior data engineer)

    • Responsibilities:

      • Checks pipeline runtime and data health checks every morning

      • Communicates schema changes to Data Science and BI teams via Slack

  • Secondary Owner:

    • Entity: Data Platforms team

    • Responsibilities:

      • Monitors pipeline errors

      • Provides escalated support

  • Support contacts

7. Change management policy🔄

This deals with how updates to the data contract (like schema changes + new constraints) are handled.

This covers key policies like

  • Notification period - producers must give one week’s notice before schema changes

  • Approval process - Changes must be approved by appointed stakeholders

  • Versioning - All changes to the data contracts must include version numbers (v1.1 to v1.2)

  • Deprecation policy - Consumers must be given a grace period to adapt to changes made

  • Retirement policy - There should be a clear process for retiring columns no longer needed by downstream systems
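
For illustration, a simple script (using two made-up schema versions) could flag breaking changes before they’re rolled out:

# Hypothetical v1.1 and v1.2 schemas from the data contract
old_schema = {"transaction_id": "string", "created_date": "date",
              "product_name": "string", "amount": "decimal"}
new_schema = {"transaction_id": "string", "created_date": "date",
              "amount": "decimal", "discount": "decimal"}

removed = set(old_schema) - set(new_schema)
retyped = {c for c in old_schema.keys() & new_schema.keys() if old_schema[c] != new_schema[c]}

if removed or retyped:
    print(f"Breaking change detected - removed: {removed}, retyped: {retyped}")
    print("Notify consumers a week in advance and agree on a deprecation window")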

Why does this work?🌟

1. Transparency

Everyone knows what is expected

2. Accountability

The producers and consumers have a clear understanding of their responsibilities

3. Efficiency

Issues are easier to resolve quickly because everyone is on the same page

Formats of a data contract📂

A data contract can come in different formats. Here are some to try out:

A piece of paper 📝

Use if you care about simple agreements that do not need to be automated or scaled. (Ideal for individuals + small teams.)

  • Pros -

    • Easy and quick to create

    • Easy for everyone to understand

  • Cons -

    • Not scalable

    • Easy to lose and overlook

    • Hard to version control or track changes

    • Can’t be automated or integrated into workflows

It can look as simple as this:

Data Contract Agreement

Producer: Data Engineering Team
Consumer: Business Intelligence Team

Data Schema:
- transaction_id (string)
- created_date (YYYY-MM-DD)
- product_name (string)
- amount (decimal)

Delivery Rules:
- Delivered as a CSV file
- Frequency: Weekly (Every Monday at 8:00 AM)
- Delivery Location: s3://fake-company-bucket/sales/YYYY/MM/DD/*.csv

Signed:
Producer: ___________________
Consumer: ___________________
Date: ___________________

CSV 📄

Best for tabular data when teams are comfortable with tools like Excel and Python

  • Pros -

    • Supported by Excel

    • Easy to parse programmatically with Python

    • Familiar and straightforward for most users

  • Cons -

    • Only works for tabular data

    • Hard to enforce strict data validation

    • Hard to manage nested data and complex schema

    • Risk of parsing errors if data contains commas or special characters

Here’s an example:

Column Name,Data Type,Constraints
transaction_id,string,Unique
created_date,date,YYYY-MM-DD Format
product_name,string,Max 100 Characters
amount,decimal,Positive Numbers Only
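
For example, here’s a minimal sketch of reading that contract with Python’s built-in csv module (assuming it’s saved as contract.csv):

import csv

# Load the contract into a lookup keyed by column name
with open("contract.csv", newline="") as f:
    contract = {row["Column Name"]: row for row in csv.DictReader(f)}

print(contract["amount"]["Data Type"])     # decimal
print(contract["amount"]["Constraints"])   # Positive Numbers Only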

JSON 🔷

For modern systems and APIs working with nested + hierarchical data

  • Pros -

    • Supports complex, nested data

    • Easy for humans and machines to read

    • Compatible with most modern tools, APIs and systems

  • Cons -

    • Hard to read for non-technical users

    • Gets harder to read once complexity grows

Here’s an example:

{
  "schema": {
    "transaction_id": "string",
    "created_date": "date (YYYY-MM-DD)",
    "product_name": "string",
    "amount": "decimal"
  },
  "delivery": {
    "frequency": "weekly",
    "delivery_time": "Monday at 8:00 AM",
    "location": "s3://fake-company-bucket/sales/YYYY/MM/DD/*.csv"
  }
}
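
A quick sketch of how a pipeline could read this contract (assuming it’s saved as contract.json):

import json

with open("contract.json") as f:
    contract = json.load(f)

# Pull the expected columns and delivery details straight from the contract
expected_columns = list(contract["schema"].keys())
print(expected_columns)                    # ['transaction_id', 'created_date', 'product_name', 'amount']
print(contract["delivery"]["frequency"])   # weekly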

YAML 🛠️

A cleaner alternative to JSON for configuration-driven workflows or teams familiar with DevOps practices

  • Pros -

    • Easy to read, even for non-technical users

    • Cleaner syntax than JSON, especially for nested data

    • Easy to edit manually

  • Cons -

    • Prone to errors if indentations are not correct

    • Not as widely supported as JSON in some ecosystems

Example:

schema:
  transaction_id: string
  created_date: date (YYYY-MM-DD)
  product_name: string
  amount: decimal

delivery:
  frequency: weekly
  delivery_time: Monday at 8:00 AM
  location: s3://fake-company-bucket/sales/YYYY/MM/DD/*.csv
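
Loading it in Python is just as straightforward, assuming the PyYAML library and a file called contract.yaml:

import yaml  # PyYAML

with open("contract.yaml") as f:
    contract = yaml.safe_load(f)

# Parses into the same dictionary structure as the JSON version
print(contract["delivery"]["location"])   # s3://fake-company-bucket/sales/YYYY/MM/DD/*.csv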

Protobuf (Protocol Buffers)⚙️

If you want to serialize large data efficiently with low latency and strict schema enforcement

  • Pros -

    • Enforces strict schema validation

    • Optimized for speed and efficient/compact data storage

    • Great for machine-to-machine communication, especially in distributed systems

  • Cons -

    • Needs special tools to parse or edit

    • May be difficult for non-technical users to read

syntax = "proto3";

message Transaction {
  string transaction_id = 1;
  string created_date = 2; // Format: YYYY-MM-DD
  string product_name = 3;
  double amount = 4; // Must be positive
}
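
As a rough sketch, if the definition above were saved as transaction.proto and compiled with protoc --python_out=. transaction.proto, the generated class could be used like this:

from transaction_pb2 import Transaction  # module generated by protoc

txn = Transaction(
    transaction_id="TXN123",
    created_date="2024-12-17",
    product_name="iPhone 15 Pro",
    amount=11.99,
)

payload = txn.SerializeToString()         # compact binary, ideal for machine-to-machine transfer
decoded = Transaction.FromString(payload)
print(decoded.amount)                     # 11.99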

Avro🗂️

For teams with big data workflows where schema enforcement and compatibility matter

  • Pros -

    • Efficiently serializes data

    • Supports schema evolution (forward and backward compatibility)

    • Works great with data tools like Hadoop and Hive

  • Cons -

    • Can be hard for non-technical users to read

    • Needs special tools to edit and view

Avro schemas are defined using JSON syntax (the data itself is stored in a compact binary format). Here’s an example schema:

{
  "type": "record",
  "name": "Transaction",
  "fields": [
    { "name": "transaction_id", "type": "string" },
    { "name": "created_date", "type": { "type": "string", "logicalType": "date" } },
    { "name": "product_name", "type": "string" },
    { "name": "amount", "type": "double" }
  ]
}
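
For illustration, assuming the schema above is saved as transaction.avsc and the fastavro library is installed, records could be written (and schema-checked) like this:

import json
from datetime import date
from fastavro import parse_schema, writer

with open("transaction.avsc") as f:
    schema = parse_schema(json.load(f))

records = [{
    "transaction_id": "TXN123",
    "created_date": date(2024, 12, 17),   # the date logical type maps to a Python date
    "product_name": "iPhone 15 Pro",
    "amount": 11.99,
}]

# Writing enforces the schema - records that don't match it raise an error
with open("transactions.avro", "wb") as out:
    writer(out, schema, records)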

Open table formats (e.g. Delta, Iceberg)❄️

If you want your data to be version-controlled and used by modern data platforms

  • Pros -

    • Supports ACID transactions and schema evolution

    • Can integrate with big data tools (like Databricks, Snowflake)

    • Can version control and upsert data to support SCD type 2 use cases

  • Cons -

    • Needs specialized ecosystems to work well (like Databricks, Snowflake, etc.)

    • Can be complex to set up and maintain from scratch

If you’re in Databricks for example, you can create something like this:

from delta.tables import DeltaTable
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

# --- 1. Define the schema
schema = StructType([
    StructField("transaction_id", StringType(), False),
    StructField("created_date", DateType(), False),
    StructField("product_name", StringType(), False),
    StructField("amount", DoubleType(), False)
])

# --- 2. Create the Delta table with the contract's schema (if it doesn't already exist)
DeltaTable.createIfNotExists(spark) \
    .tableName("transactions") \
    .addColumns(schema) \
    .location("s3://fake-company-bucket/sales/") \
    .execute()

There’s no right or wrong format to use - it all depends on

  • how complex the schema is

  • how it will be used (human-readable vs machine-readable)

  • the ecosystem it works in (e.g. big data tools, APIs, legacy system)

Each format carries a trade-off, so select the one that makes the most sense for your team’s and end users’ needs.

When you should NOT use a data contract🚫

You may not need a data contract if

1. The overhead outweighs the benefits

If the effort to set up and maintain a data contract is greater than the value it brings, it may not be worth investing in.

For example,

  • two teams collaborating on a one-off data exchange that has no future use

  • a small team of 3-5 people, where quick, direct communication is more efficient than formal agreements

2. The same team produces and consumes the data

If the same team is responsible for creating and using the data, the expectations and responsibilities may already be well understood internally, so there’s no need for the extra overhead.

For example,

  • a data science team that generates data for its own machine learning models

3. The data is used in non-critical scenarios

If the data is used for low-stakes purposes where mistakes don’t have serious consequences, a data contract adds little value.

For example,

  • a dashboard created for a brainstorming or exploratory session

4. Speed + agility matters more

In situations where quick iteration is prioritized over long-term reliability, data contracts may slow down progress

For example,

  • during the early stages of prototyping and experimenting with a new idea, the goal is to validate the idea quickly rather than to guarantee the data is accurate

Conclusion 🔚

In this post, we introduced the idea of data contracts and how they solve common problems in data workflows.

In part 2, we’ll walk through a real example of a Python data pipeline that uses data contracts to validate and transform data across 3 AWS S3 buckets (bronze, silver, gold) and then uploads the final version to a Postgres database.

