Type-Safe ML Configs with Hydra + Pydantic (Step by Step)

Repo (optional): https://github.com/siddhi47/pydantic-hydra

Managing configurations in machine learning projects can get messy—fast. What starts as a few command-line arguments or a small JSON file (better than having no configuration at all) often grows into a tangled mess of hard-coded values, inconsistent file paths, and mysterious hyperparameter changes that are impossible to track. I briefly introduced configuration files and Hydra in my previous posts.

In this tutorial, we’ll combine Hydra—a powerful framework for composing and overriding YAML configs—with Pydantic, which enforces strict type validation and catches mistakes before they crash your training run. Together, they give you flexible, readable, and type-safe configurations that scale from a single experiment to a full production ML pipeline.


Why Not Just Use argparse or a Plain Old Config Dict?

  • No type safety → Typos or wrong data types silently break your experiment.

  • No validation → Missing fields or invalid values fail at runtime (or worse, not at all).

  • Poor maintainability → Big projects with multiple config sections become unreadable.

Pydantic solves these problems by binding your config to a schema.
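For a quick taste (a minimal standalone sketch, unrelated to the project we build below), here is Pydantic turning a silent typo into an immediate, readable failure:

from pydantic import BaseModel, ValidationError

class TrainingConfig(BaseModel):
    learning_rate: float
    batch_size: int

try:
    # "32x" cannot be coerced to int, so Pydantic raises here,
    # instead of the typo surfacing halfway through training
    TrainingConfig(learning_rate=0.001, batch_size="32x")
except ValidationError as e:
    print(e)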

What you’ll build

  • A small ML-ready config system where:

    • Configs live in readable YAML files (organized by groups like data/, model/, training/).

    • Hydra composes and overrides configs from the command line.

    • Pydantic validates the composed config (types, required fields, bounds).

  • (BONUS) Includes a COCO dataset variant with file/path validation and safe defaults.


Prereqs

  • Python

  • Git

  • Basic familiarity with virtual envs


1) Project structure

mkdir pydantic-hydra && cd pydantic-hydra
mkdir -p conf/data conf/model conf/training src
touch main.py src/schema.py conf/config.yaml \
      conf/data/coco.yaml conf/data/generic.yaml \
      conf/model/resnet.yaml conf/training/default.yaml

Your directory structure should now look like this:

pydantic-hydra/
├─ conf/
│  ├─ config.yaml
│  ├─ data/
│  │  ├─ coco.yaml
│  │  └─ generic.yaml
│  ├─ model/
│  │  └─ resnet.yaml
│  └─ training/
│     └─ default.yaml
├─ src/
│  └─ schema.py
└─ main.py

2) Install deps

Option A: pip

python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -U pip
pip install hydra-core omegaconf "pydantic>=2"

Option B: uv (if you prefer it)

uv venv && source .venv/bin/activate
uv pip install hydra-core omegaconf "pydantic>=2"

3) Write the Pydantic schema (src/schema.py)

We’ll model two dataset types—generic and coco—using a discriminated union so Pydantic knows which schema to apply based on the type field.

from pydantic import BaseModel, Field
from typing import Literal, Union


class ModelConfig(BaseModel):
    name: str
    hidden_units: int
    dropout: float


class DataConfig(BaseModel):
    type: Literal["generic"] = "generic"
    path: str
    shuffle: bool = True


class COCOConfig(DataConfig):
    type: Literal["coco"] = "coco"
    annotation_file: str
    image_size: int
    allowed_classes: list[str]


class TrainingConfig(BaseModel):
    learning_rate: float
    batch_size: int
    epochs: int


class LoggingConfig(BaseModel):
    log_dir: str
    log_interval: int


class PipelineConfig(BaseModel):
    model: ModelConfig
    data: Union[DataConfig, COCOConfig] = Field(discriminator="type")
    training: TrainingConfig
    logging: LoggingConfig
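Before wiring this into Hydra, you can smoke-test the discriminated union directly. The values below are placeholders; the point is that the type field alone decides which schema Pydantic applies:

# hypothetical smoke test, run from the project root
from src.schema import COCOConfig, PipelineConfig

cfg = PipelineConfig(
    model={"name": "resnet50", "hidden_units": 256, "dropout": 0.4},
    data={"type": "coco", "path": "/tmp/coco", "annotation_file": "ann.json",
          "image_size": 640, "allowed_classes": ["person"]},
    training={"learning_rate": 5e-4, "batch_size": 64, "epochs": 20},
    logging={"log_dir": "./logs", "log_interval": 50},
)

# type="coco" made Pydantic build a COCOConfig, not a plain DataConfig
assert isinstance(cfg.data, COCOConfig)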

4) Write Hydra configs (YAML)

conf/config.yaml

defaults:
  - model: resnet
  - data: coco
  - training: default
  - _self_

logging:
  log_dir: ./logs
  log_interval: 50

conf/model/resnet.yaml

name: resnet50
hidden_units: 256
dropout: 0.4

conf/training/default.yaml

learning_rate: 0.0005
batch_size: 64
epochs: 20

conf/data/coco.yaml

type: coco
path: /mnt/datasets/coco2017
annotation_file: /mnt/datasets/coco2017/annotations/instances_train2017.json
image_size: 640
allowed_classes: [person, bicycle, car]
shuffle: true

conf/data/generic.yaml

type: generic
path: /mnt/datasets/mydataset
shuffle: true

5) Glue it together with Hydra (main.py)

# main.py
import hydra
from omegaconf import OmegaConf
from src.schema import COCOConfig, PipelineConfig

@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg):
    # 1) Hydra gives an OmegaConf object
    # 2) Convert to a plain dict
    cfg_dict = OmegaConf.to_container(cfg, resolve=True)
    # 3) Validate with Pydantic
    validated = PipelineConfig(**cfg_dict)

    # Example usage
    print("Model:", validated.model.name)
    print("Data path:", validated.data.path)
    if isinstance(validated.data, COCOConfig):
        print("Annotation:", validated.data.annotation_file)
        print("Classes:", validated.data.allowed_classes)
        print("Image size:", validated.data.image_size)

if __name__ == "__main__":
    main()

Run it:

python main.py
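With the COCO defaults above, the output should look roughly like this:

Model: resnet50
Data path: /mnt/datasets/coco2017
Annotation: /mnt/datasets/coco2017/annotations/instances_train2017.json
Classes: ['person', 'bicycle', 'car']
Image size: 640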

6) Override anything from the CLI (Hydra superpower)

No file edits needed—compose on the fly:

# Switch to generic dataset
python main.py data=generic data.path=/data/custom

# Keep COCO, bump image size
python main.py data.image_size=1024

# Change allowed classes inline
python main.py data.allowed_classes='[person,car,dog]'
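Because validation happens after Hydra composes the config, a bad override fails fast instead of corrupting a run. For example (the exact error text depends on your Pydantic version):

# image_size must be an int, so Pydantic rejects this before training starts
python main.py data.image_size=not_a_number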

Hydra's multirun mode (-m) launches one run per combination of overrides:

python main.py -m training.batch_size=32,64 training.learning_rate=0.001,0.0005

This spawns 4 runs:

(32, 0.001)  (32, 0.0005)
(64, 0.001)  (64, 0.0005)

7) Production tips

  • Keep configs modular: prefer many small files (e.g., data/coco.yaml, data/generic.yaml) over one giant YAML.

  • Validate paths & bounds: use Path fields and Field(ge=..., le=...) to catch mistakes early (see the sketch after this list).

  • Stable defaults: avoid mutable defaults; prefer default_factory for lists like allowed_classes.

  • Reproducibility: the .hydra/ folder per run captures the composed config—commit it or store it with artifacts.
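Here is how the path/bounds and stable-defaults tips might look applied to the COCO schema; a hedged sketch assuming Pydantic v2, shown standalone (without the DataConfig base class) for brevity:

from pathlib import Path
from typing import Literal

from pydantic import BaseModel, Field, field_validator


class COCOConfig(BaseModel):
    type: Literal["coco"] = "coco"
    path: Path                                  # Path instead of str
    annotation_file: Path
    image_size: int = Field(gt=0, le=4096)      # bounds catch typos like 64000
    # default_factory avoids sharing one mutable list across instances
    allowed_classes: list[str] = Field(default_factory=lambda: ["person"])
    shuffle: bool = True

    @field_validator("annotation_file")
    @classmethod
    def annotation_file_must_exist(cls, v: Path) -> Path:
        if not v.is_file():
            raise ValueError(f"annotation file not found: {v}")
        return v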

Wrap-up

You now have a clean, composable, and type-safe configuration system:

  • Hydra for composition & overrides

  • Pydantic for validation & helpful errors

  • YAML for readability & version control

This pattern scales from a single script to a full ML platform without turning your configs into spaghetti.

Note: I used uv to create the package instead of building the structure manually, so the GitHub repo may look a bit different from the project structure shown here. Refer to my post here to learn more about this!
