Experiment tracking with ClearML and Hydra

Introduction

Untitled.ipynb, final.ipynb, final_last.ipynb, final_with_lr_09.ipynb, latest_model.pt, latest_model_2025_01_19.pt.

Sound familiar? If so, you are not alone. We’ve all been there, struggling with an improvised experiment tracking method that only makes sense to us (until it doesn’t). But what if there was a better way? Let me introduce you to ClearML, an end-to-end AI platform with a rich feature set. The best part? ClearML is free for personal use, so you can start organizing your experiments today!

One of the main reasons you end up with multiple files is the sheer number of configurations required by different algorithms. But don't worry—Hydra is here to save the day! And no, I'm not talking about the legendary multi-headed beast from Greek mythology. Hydra is a powerful tool designed to simplify and manage complex configurations effortlessly.

Messy experiment tracking and scattered configuration files can make ML workflows chaotic and hard to reproduce. But with ClearML handling experiment tracking and Hydra managing configurations, you can bring structure, automation, and clarity to your projects.

By integrating these tools, you gain:
Automatic experiment logging – No more manually saving metrics or model versions.
Config-driven workflows – Easily tweak hyperparameters and settings without modifying code.
Reproducibility – Effortlessly track, compare, and rerun experiments.

So why stick with the old, disorganized way? Start using ClearML and Hydra today and take your ML experiment management to the next level!

In this article, I will walk through a basic image classification task using PyTorch. The tutorial assumes some experience with Python, virtual environments, and common Python libraries. We will train a simple custom CNN, and artifacts produced by the experiments, such as models and metrics, will be tracked, versioned, and stored in S3. We will use the popular CIFAR-10 dataset, which can be downloaded directly through PyTorch.

Setting Up the Environment

ClearML

ClearML is very easy to set up. All you have to do is open a free account here. The free account is enough to get started with tracking your experiments and models.

I recommend setting up a new virtual environment to follow along.

Once you have set up the free account, you can install ClearML with pip:

pip install clearml

Once ClearML is installed, start the configuration wizard:

clearml-init

You will be asked to add credentials, which you can create from your account settings. The wizard then creates a configuration file at ~/clearml.conf. This file not only links your workspace to the ClearML server, but can also be used to configure git credentials, cloud storage credentials, and much more. You can check out their getting started page for details.
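For reference, the api section that clearml-init generates typically looks roughly like this (the values below are placeholders, not real credentials):

api {
    web_server: https://app.clear.ml
    api_server: https://api.clear.ml
    files_server: https://files.clear.ml
    credentials {
        "access_key" = "<ACCESS_KEY>"
        "secret_key" = "<SECRET_KEY>"
    }
}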

In this project, we will use S3 to store our models. You can configure S3 access by adding the following to ~/clearml.conf:

    aws {
        s3 {
            # S3 credentials, used for read/write access by various SDK elements

            # The following settings will be used for any bucket not specified 
            # below in the "credentials" section
            # ---------------------------------------------------------------------------------------------------
            region: "" #Optionally specify region
            # Specify explicit keys
            key: "<KEY>"
            secret: "<SECRET>"
            # Or enable credentials chain to let Boto3 pick the right credentials.
            # This includes picking credentials from environment variables,
            # credential file and IAM role using metadata service.
            # Refer to the latest Boto3 docs
            use_credentials_chain: false
            # Additional ExtraArgs passed to boto3 when uploading files. 
            # Can also be set per-bucket under "credentials".
            extra_args: {}
            # ---------------------------------------------------------------------------------------------------


            credentials: [
                # specifies key/secret credentials to use when handling         
                # s3 urls (read or write)
                 {
                     bucket: "cifar" #bucket to use
                     key: "<KEY>"
                     secret: "<SECRET>"
                 },
                # {
                #     # This will apply to all buckets in this host 
                # (unless key/value is specifically provided for a given bucket)
                #     host: "my-minio-host:9000"
                #     key: "12345678"
                #     secret: "12345678"
                #     multipart: false
                #     secure: false
                # }
            ]
        }
        boto3 {
            pool_connections: 512
            max_multipart_concurrency: 16
            multipart_threshold: 8388608 # 8MB
            multipart_chunksize: 8388608 # 8MB
        }
    }

You may also want to look through the rest of the configuration file; several settings are pre-generated for you.
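If you want to make sure the S3 credentials above actually work before training anything, you can upload a small test file through ClearML's StorageManager. A minimal sketch, assuming the cifar bucket from the snippet above (the local file name is just an example):

from clearml import StorageManager

# upload any small local file to the bucket configured in clearml.conf
remote_url = StorageManager.upload_file(
    local_file="check.txt",  # hypothetical local file
    remote_url="s3://cifar/sanity_check/check.txt",
)
print(remote_url)  # the destination URL is returned on success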

You can now start tracking by adding just a couple of lines of code:

from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="CIFAR10 PyTorch",
    output_uri="s3://cifar/models/",
)

Hydra

Hydra, like the multi-headed beast, lets you run your experiments with multiple hierarchical configurations driven by a single config file. You can override these configurations from the command line as well, which reduces the need to maintain many separate configuration files. Furthermore, it allows setting up hyper-parameter sweeps directly from the config.
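For example, "hierarchical" means you can split settings into config groups and pick one from the command line. Here is a minimal sketch of that idea (not used in the rest of this tutorial), where conf/config.yaml selects one file from the conf/model group via the defaults list:

# conf/config.yaml
defaults:
  - model: custom

# conf/model/custom.yaml
type: "custom"
lr: 0.1

# conf/model/resnet.yaml
type: "resnet"
lr: 0.01

Running python train.py model=resnet would then swap in the second file without touching any code.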

To get started, just install hydra using pip:

pip install hydra-core --upgrade

Let’s create a config file called hydra_config.yaml.

#hydra_config.yaml
model:
  type: "custom"
  lr: 0.1
  epoch: 2  
  momentum: 0.1
  batch_size: 4

As you can see, it’s a configuration file with basic deep learning parameters. You can load them in your program:

#train.py
import hydra

@hydra.main(config_path=".", config_name="hydra_config.yaml")
def my_app(cfg):
    print(cfg.model.type) # custom
    print(cfg.model.lr) # 0.1

if __name__ == "__main__":
    my_app()

Now you can just run the script normally or override them by passing command line arguments:

python train.py # normal
python train.py model.lr=0.01 model.epoch=200 #override
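If you want to see exactly which configuration a run ended up with after overrides, Hydra’s cfg object is an OmegaConf DictConfig, so you can dump it as YAML. A small sketch:

from omegaconf import OmegaConf

@hydra.main(config_path=".", config_name="hydra_config.yaml")
def my_app(cfg):
    # print the fully resolved configuration, including any CLI overrides
    print(OmegaConf.to_yaml(cfg))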

Bringing It All Together

NOTE: Instead of copying these code blocks individually, check out the GitHub repository. Treat the blocks below as explanation.

Let’s begin the actual coding part! As always, we start with the necessary imports. These are basic libraries that you should already be familiar with.

import torch
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import sklearn.metrics
import torch.optim as optim
from clearml import Task, OutputModel
from argparse import ArgumentParser
import hydra

Organize the CNN architecture into a class:


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
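A quick way to sanity-check the architecture is to push a dummy CIFAR-sized batch through it and confirm the output shape:

net = Net()
dummy = torch.randn(4, 3, 32, 32)   # CIFAR-10 images are 3x32x32
print(net(dummy).shape)             # expected: torch.Size([4, 10]), one logit per class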

Load the CIFAR-10 dataset and create the train and test loaders. Notice that we use the Hydra configuration (the cfg variable) to get the batch_size.

transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
)
# loaded from hydra
batch_size = cfg.model.batch_size

trainset = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=transform
)
trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=batch_size, shuffle=True, num_workers=2
)

testset = torchvision.datasets.CIFAR10(
    root="./data", train=False, download=True, transform=transform
)
testloader = torch.utils.data.DataLoader(
    testset, batch_size=batch_size, shuffle=False, num_workers=2
)

Create a list with all the classes:

classes = (
    "plane",
    "car",
    "bird",
    "cat",
    "deer",
    "dog",
    "frog",
    "horse",
    "ship",
    "truck",
)

Create a ClearML logger. The logger lets you report various data types to ClearML.

logger = Task.current_task().get_logger()
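Besides the scalars used below, the same logger can report other data types, for example plain text or debug images. A short sketch (purely illustrative):

# log a plain text message to the task's console output
logger.report_text("Starting training with the custom CNN")

# log one (un-normalized) training image as a debug sample
sample_img, _ = trainset[0]                                 # tensor of shape [3, 32, 32]
img = ((sample_img / 2 + 0.5) * 255).clamp(0, 255).byte()   # back to uint8 pixel values
logger.report_image("samples", "first training image", iteration=0,
                    image=img.permute(1, 2, 0).numpy())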

Start the training:

dataiter = iter(trainloader)
images, labels = next(dataiter)

net = Net()

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(
    net.parameters(), lr=cfg.model.lr, momentum=cfg.model.momentum
)
EPOCH = cfg.model.epoch
for epoch in range(EPOCH):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:  # print every 2000 mini-batches
            print(f"[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}")
            logger.report_scalar("loss", "train", running_loss / 2000, iteration=i)
            running_loss = 0.0

Test your model and report the metrics (in this case, a confusion matrix):

predictions = []
actuals = []
for data in testloader:
    images, labels = data
    outputs = net(images)
    _, predicted = torch.max(outputs, 1)
    predictions.extend(predicted)
    actuals.extend(labels)

confusion_matrix = sklearn.metrics.confusion_matrix(actuals, predictions)
logger.report_matrix("confusion matrix", "test", confusion_matrix, iteration=0)
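In the same spirit, you can collapse the predictions into an overall test accuracy and report it as a scalar:

correct = sum(int(p == a) for p, a in zip(predictions, actuals))
accuracy = correct / len(actuals)
print(f"Test accuracy: {accuracy:.3f}")
logger.report_scalar("accuracy", "test", value=accuracy, iteration=0)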

Save your model and push it to the ClearML model repository:

PATH = "./cifar_net.pth"
    torch.save(net.state_dict(), PATH)

    output_model = OutputModel(
        task=task,
        framework="pytorch",
    )
    output_model.set_upload_destination("s3://cifar/models/")
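Later, you can pull the uploaded model back from ClearML for inference. A minimal sketch, assuming you copy the model ID from the Models page of the dashboard (the ID below is a placeholder):

from clearml import InputModel

model = InputModel(model_id="<your_model_id>")   # placeholder ID from the dashboard
weights_path = model.get_weights()               # downloads the weights and returns a local path

net = Net()
net.load_state_dict(torch.load(weights_path))
net.eval()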

Now let’s wrap all of this code in a single function (our main function) decorated with hydra.main.

@hydra.main(config_path=".", config_name="hydra_config.yaml")
def main(cfg):
    ...
    #rest of the code here. refer to github repository for full code.

if __name__ == "__main__":
    main()

You can now run the Python script normally, or pass arguments to override the Hydra configuration.

python train.py
python train.py model.lr=0.01 model.epoch=200

BONUS: Hyper-parameter search using Hydra

Add search parameters to your hydra_config.yaml:

hydra:
  sweeper:
    params:
      model.lr: 0.001,0.01,0.1
      model.momentum: 0.2,0.5

Now, just run:

python train.py --multirun
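If you prefer not to edit the config file, the basic sweeper accepts the same comma-separated values directly on the command line:

python train.py --multirun model.lr=0.001,0.01,0.1 model.momentum=0.2,0.5

Either way, Hydra runs main() once per combination (six runs in this example).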

Once you run these, your ClearML dashboard should start populating with metrics and models. Go to the Projects page, and you will see the list of experiments there.

Overview tab

Configuration tab

You can see the configuration in the Configuration tab.

Artifacts and model tab

You can see your model in the Artifacts tab as well as on the Models page. You may also want to store other relevant files as artifacts.

Plots

You can also see different plots under the Scalars and Plots tabs. There are many plot types you can integrate with ClearML; check out this page.

Conclusion

That’s a lot of functionality for very little effort. I hope you will now drop the odd naming conventions and start using these tools to track your experiments, code, configurations, and models. Additionally, you can even compare experiments side by side! I will leave you to it. As promised, here is the GitHub link; you may want to ⭐ it.

Happy experimenting!
