Implementing End-to-End MLOps
In this blog post, I will show you how to use two popular tools for MLOps: DVC and MLflow. DVC is a version control system for data and models, while MLflow is a platform for managing the ML lifecycle. Together, they create an organized and reproducible MLOps pipeline in which data and models are managed, tracked, and deployed efficiently.
Prerequisites for smooth sailing are: basic knowledge of Git commands, familiarity with Python libraries such as pandas and scikit-learn, and an open browser tab.
What is DVC?
DVC stands for Data Version Control. It is an open-source tool that integrates with Git and helps you manage large data files, ensuring that changes to datasets are tracked and reproducible. DVC allows you to:
Store your data and models in a remote storage of your choice (e.g., S3, Google Cloud Storage, etc.)
Track the changes in your data and models using Git-like commands
Reproduce your experiments by linking your code, data, and models
Collaborate with your team members using Git workflows
What is MLflow?
MLflow is an open-source platform that covers the entire ML lifecycle, from experimentation and training to packaging and deployment. It consists of four components:
Tracking: This allows you to record and compare your experiments, including parameters, metrics, artifacts, and source code.
Projects: This allows you to package your code and dependencies in a reusable and reproducible way.
Models: This allows you to package models from various frameworks (FastAI, MXNet Gluon, PyTorch, TensorFlow, XGBoost, CatBoost, H2O, etc.) in a standard format and deploy them to a range of platforms.
Model Registry: This allows you to store, manage, and serve your models in a centralized place.
Demo: Using DVC and MLflow together
DVC and MLflow are not mutually exclusive. In fact, they can complement each other very well. DVC handles the data and pipeline versioning, while MLflow handles the experiment tracking and model deployment.
For this demo, we will use the Car Evaluation dataset from the UCI Machine Learning Repository as our use case.
NB: EDA was performed beforehand. Feature engineering and model training were first written and tested in a notebook and later converted to .py files to keep the code organized. Find the full work here.
The DVC part
Awesome! Without further ado, below are the steps you can follow to keep track of data changes using DVC:
Install DVC:
pip install dvc
Initializing Git and DVC:
git init && dvc init
Commit the files generated by dvc init:
git commit -m "initialize repo"
Tell DVC where you want to store your data (the remote storage):
dvc remote add -d dvc-remote /tmp/dvc-storage
NB: The path to the remote storage is saved in .dvc/config
Commit the new remote storage configuration to Git:
git commit -m "configure remote storage" .dvc/config
Create a data/ directory and load in your data. In this tutorial, we are using the Car Evaluation data, which you can get from the UCI Machine Learning Repository.
Track your data with DVC:
dvc add data/car_evaluation_processed.csv
NB: EDA was performed in a separate notebook, where the data was processed and exported as car_evaluation_processed.csv
A new car_evaluation_processed.csv.dvc file is created in data/. Most importantly, this file contains an md5 hash of the data, which uniquely identifies this specific version of the dataset.
The original data file is now listed in data/.gitignore (it will not be pushed to the Git repository; car_evaluation_processed.csv.dvc will be pushed instead). This is the DVC-only layer.
Add and commit both the .gitignore and the tracked .dvc file:
git add data/.gitignore data/car_evaluation_processed.csv.dvc
git commit -m "track the data with DVC"
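Under the hood, that md5 key is just a content hash of the file. A minimal Python sketch of the idea (the file name and contents here are illustrative, not part of the tutorial):

```python
import hashlib

def file_md5(path):
    """Compute a file's md5 hash, the way DVC fingerprints data."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Write a tiny illustrative file; any change to its contents
# produces a completely different hash.
with open("demo.csv", "w") as f:
    f.write("buying,maint,doors\nvhigh,vhigh,2\n")
print(file_md5("demo.csv"))
```

Identical files always hash to the same value, which is why the hash alone is enough to identify a dataset version in the cache.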
Create a tag for each version of the dataset. This will help us access whichever version we want in MLflow:
git tag -a "v1" -m "original data"
The tag subcommand creates tags that reference specific points in Git history, and the -a flag makes an annotated tag carrying a message that describes it.
Send the data from data/ to the remote storage:
dvc push
Check for a "1 file pushed" message in your terminal.
Verify that a copy of the dataset is now in the remote storage:
ls -l /tmp/dvc-storage
Since we have two copies of the original data, the one in data/ can be deleted; with dvc pull we can always retrieve it again.
Edit the dataset by adding/removing a couple of rows. This is purely so we have a second version of the dataset to track.
Track the data again:
dvc add data/car_evaluation_processed.csv
Add and commit the new changes (from the updated dataset):
git add data/car_evaluation_processed.csv.dvc
git commit -m "10 new rows added"
Create a new tag for the latest version (v2) of the dataset:
git tag -a "v2" -m "10 new rows added"
Push the new version to our remote storage:
dvc push
Remove the unused local copies of our dataset:
rm -rf data/car_evaluation_processed.csv
rm -rf .dvc/cache
Finally, run git log to see our commit messages and the tags assigned to each version of the dataset; this is how we keep track of the data.
The Linking (MLflow) part
Great! Since we have two different versions of the dataset, we can use either version to track our ML experiments with MLflow. Below are sample code snippets showing how the data versioned by DVC can be pulled into MLflow and the experiments tracked.
First, we will need to install MLflow by running pip install mlflow
in our environment. As for the imports, we will need two packages:
import mlflow
import mlflow.sklearn
The dvc.api module has a handy get_url()
function, which returns the URL of the storage location of a data file or directory tracked in a DVC repo. This is very important!
import dvc.api
data_url = dvc.api.get_url(
path=path, # path of our original data
repo=repo, # location of the DVC repository (in our case it's in the current directory)
rev=version # version of data (remember the tags (v1 or v2) we gave each version)
)
By default, the experiment name would be Default, but we will create our own using mlflow.set_experiment("car-evaluation")
Another important aspect of this workflow is logging parameters and artifacts. This helps you follow up on what's changing and easily share results with others.
# log parameters
def log_data_params(data_url, data_version, data):
"""
Logging data parameters to MLflow
Inp: data url, version, dataframe
Out: none
"""
mlflow.log_param("data_url", data_url)
mlflow.log_param("data_version", data_version)
mlflow.log_param("num_rows", data.shape[0])
mlflow.log_param("num_cols", data.shape[1])
# log artifacts: the training features
# (assumes pandas is imported as pd and an artifacts/ directory exists)
X_cols = pd.DataFrame(list(X_train.columns))
X_cols.to_csv('artifacts/features.csv', header=False, index=False)
mlflow.log_artifact('artifacts/features.csv')
After feature engineering and model training, the model and its metric can be logged as well:
# training the model
dtc = DecisionTreeClassifier(criterion=criterion, max_depth=max_depth, random_state=42)
dtc.fit(X_train, y_train)
# log the model
mlflow.sklearn.log_model(dtc, "car-evaluation-model")
# predict on the test set and measure accuracy
y_pred = dtc.predict(X_test)
accuracy = model_eval(y_test, y_pred)
# log the metric
mlflow.log_metric("accuracy", accuracy)
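The model_eval helper above is not shown in this post; for plain accuracy, a version of it could be as simple as the following sketch:

```python
def model_eval(y_true, y_pred):
    """Accuracy: the fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# e.g. two out of three predictions correct
print(model_eval(["acc", "good", "unacc"], ["acc", "good", "good"]))
```

sklearn.metrics.accuracy_score computes the same quantity if you prefer a library call.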
To train the model and execute the workflow, run python train.py.
By default, it will use criterion="entropy" and max_depth=3 as the model's hyperparameters.
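One common way a script like train.py exposes such defaults is via argparse; this is only a sketch, and the flag names are assumptions rather than the post's actual interface:

```python
import argparse

def parse_args(argv=None):
    """Parse the model hyperparameters, falling back to the defaults above."""
    parser = argparse.ArgumentParser(description="Train the car-evaluation model")
    parser.add_argument("--criterion", default="entropy",
                        choices=["gini", "entropy"])
    parser.add_argument("--max_depth", type=int, default=3)
    return parser.parse_args(argv)

args = parse_args([])  # no flags given: fall back to the defaults
print(args.criterion, args.max_depth)
```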
Finally, you can visualize and track the experiments over time by running mlflow ui.
Conclusion
DVC and MLflow are powerful tools that can help you manage your ML projects more efficiently. By combining them, you can leverage the best of both worlds: data versioning with DVC, and experiment tracking and model deployment with MLflow. I hope this blog post has taught you a thing or two on how to use them together.
If you want to learn more about DVC, MLflow, or MLOps in general, the official DVC and MLflow documentation are good places to start.
You can also find the full code for this blog post on GitHub.
Sayonara! Happy hacking 🚀
Written by Jean Nshuti