Implementing End-to-End MLOps
In this blog post, I will show you how to use two popular tools for MLOps: DVC and MLflow. DVC is a version control system for data and models, while MLflow is a platform for managing the ML lifecycle. Together, they create an organized and reproducible MLOps pipeline in which data and models are managed, tracked, and deployed efficiently.
Prerequisites for smooth sailing are: basic knowledge of Git commands, familiarity with Python libraries such as pandas and scikit-learn, and an open browser tab.
What is DVC?
DVC stands for Data Version Control. It is an open-source tool that integrates with Git and helps you manage large data files, ensuring that changes to datasets are tracked and reproducible. DVC allows you to:
Store your data and models in a remote storage of your choice (e.g., S3, Google Cloud Storage, etc.)
Track the changes in your data and models using Git-like commands
Reproduce your experiments by linking your code, data, and models
Collaborate with your team members using Git workflows
What is MLflow?
MLflow is an open-source platform that covers the entire ML lifecycle, from experimentation and training to packaging and deployment. It consists of four components:
Tracking: This allows you to record and compare your experiments, including parameters, metrics, artifacts, and source code.
Projects: This allows you to package your code and dependencies in a reusable and reproducible way.
Models: This allows you to package models from various frameworks (FastAI, MXNet Gluon, PyTorch, TensorFlow, XGBoost, CatBoost, H2O, etc.) in a standard format and deploy them to a range of platforms.
Model Registry: This allows you to store, manage, and serve your models in a centralized place.
Demo: Using DVC and MLflow together
DVC and MLflow are not mutually exclusive. In fact, they can complement each other very well. DVC handles the data and pipeline versioning, while MLflow handles the experiment tracking and model deployment.
For this demo, we will use the Car Evaluation dataset from the UCI Machine Learning Repository as our use case.
NB: EDA was performed beforehand. Feature engineering and model training were first written and tested in a notebook and later converted to .py files to keep the code organized. Find the full work here.
The DVC part
Awesome! Without further ado, below are the steps you can follow to keep track of data changes using DVC:
Install DVC:
pip install dvc
Initializing Git and DVC:
git init && dvc init
Commit the files generated by dvc init:
git commit -m "initialize repo"
Tell DVC where you want to store your data (the remote storage):
dvc remote add -d dvc-remote /tmp/dvc-storage
NB: The path to the remote storage is saved in .dvc/config
Commit the new remote storage configuration to Git:
git commit -m "configure remote storage" .dvc/config
Create a data/ directory and load in your data. In this tutorial, we are using the Car Evaluation data, which you can get from the UCI Machine Learning Repository.
Track your data with DVC:
dvc add data/car_evaluation_processed.csv
NB: EDA was performed in a separate notebook, where the data was processed and exported as car_evaluation_processed.csv
A new car_evaluation_processed.csv.dvc file is created in data/. Most importantly, this file contains an md5 hash of the data, which uniquely identifies this specific version of the dataset.
The original data file is now listed in data/.gitignore (it will not be pushed to the Git repository; car_evaluation_processed.csv.dvc will be pushed instead). This is the DVC-only layer.
Add and commit both the .gitignore and the tracked .dvc file:
git add data/.gitignore data/car_evaluation_processed.csv.dvc
git commit -m "track the data with DVC"
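Under the hood, that md5 key is just a content hash of the file. A minimal Python sketch of the idea (the file name and contents here are illustrative, not part of the tutorial):

```python
import hashlib

def file_md5(path):
    """Compute a file's md5 hash, the way DVC fingerprints data."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Write a tiny illustrative file; any change to its contents
# produces a completely different hash.
with open("demo.csv", "w") as f:
    f.write("buying,maint,doors\nvhigh,vhigh,2\n")
print(file_md5("demo.csv"))
```

Identical files always hash to the same value, which is why the hash alone is enough to identify a dataset version in the cache.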
Create a tag for each version of the dataset. This will help us access whichever version we want in MLflow:
git tag -a "v1" -m "original data"
The tag subcommand creates tags that reference specific points in Git history, and the -a flag makes an annotated tag carrying a message that describes it.
Send the data from data/ to the remote storage:
dvc push
Check for a "1 file pushed" message in your terminal.
Verify that a copy of the dataset is now in the remote storage:
ls -l /tmp/dvc-storage
Since we have two copies of the original data, the one in data/ can be deleted; with dvc pull we can always retrieve it again.
Edit the dataset by adding/removing a couple of rows. This is purely so we have a second version of the dataset to track.
Track the data again:
dvc add data/car_evaluation_processed.csv
Add and commit the new changes (from the updated dataset):
git add data/car_evaluation_processed.csv.dvc
git commit -m "10 new rows added"
Create a new tag for the latest version (v2) of the dataset:
git tag -a "v2" -m "10 new rows added"
Push the new version to our remote storage:
dvc push
Remove the unused local copies of our dataset:
rm -rf data/car_evaluation_processed.csv
rm -rf .dvc/cache
Finally, run git log to see our commit messages and the tags assigned to each version of the dataset; this is how we keep track of the data.
The Linking (MLflow) part
Great! Since we have two different versions of the dataset, we can use either version to track our ML experiments with MLflow. Below are sample code snippets showing how the data versioned by DVC can be pulled into MLflow and the experiments tracked.
First, we will need to install MLflow by running pip install mlflow
in our environment. As for the imports, we will need two packages:
import mlflow
import mlflow.sklearn
The dvc.api module has a handy get_url()
function, which returns the URL of the storage location of a data file or directory tracked in a DVC repo. This is very important!
import dvc.api
data_url = dvc.api.get_url(
path=path, # path of our original data
repo=repo, # location of the DVC repository (in our case it's in the current directory)
rev=version # version of data (remember the tags (v1 or v2) we gave each version)
)
By default, the experiment name would be Default, but we will create our own using mlflow.set_experiment("car-evaluation")
Another important aspect of this workflow is logging parameters and artifacts. This helps you follow up on what's changing and easily share results with others.
# log parameters
def log_data_params(data_url, data_version, data):
"""
Logging data parameters to MLflow
Inp: data url, version, dataframe
Out: none
"""
mlflow.log_param("data_url", data_url)
mlflow.log_param("data_version", data_version)
mlflow.log_param("num_rows", data.shape[0])
mlflow.log_param("num_cols", data.shape[1])
# log artifacts: the training features
# (assumes pandas is imported as pd and an artifacts/ directory exists)
X_cols = pd.DataFrame(list(X_train.columns))
X_cols.to_csv('artifacts/features.csv', header=False, index=False)
mlflow.log_artifact('artifacts/features.csv')
After feature engineering and model training, the model and its metric can be logged as well:
# training the model
dtc = DecisionTreeClassifier(criterion=criterion, max_depth=max_depth, random_state=42)
dtc.fit(X_train, y_train)
# log the model
mlflow.sklearn.log_model(dtc, "car-evaluation-model")
# predict on the test set and measure accuracy
y_pred = dtc.predict(X_test)
accuracy = model_eval(y_test, y_pred)
# log the metric
mlflow.log_metric("accuracy", accuracy)
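The model_eval helper above is not shown in this post; for plain accuracy, a version of it could be as simple as the following sketch:

```python
def model_eval(y_true, y_pred):
    """Accuracy: the fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# e.g. two out of three predictions correct
print(model_eval(["acc", "good", "unacc"], ["acc", "good", "good"]))
```

sklearn.metrics.accuracy_score computes the same quantity if you prefer a library call.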
To train the model and execute the workflow, run python train.py.
By default, it will use criterion="entropy" and max_depth=3 as the model's hyperparameters.
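One common way a script like train.py exposes such defaults is via argparse; this is only a sketch, and the flag names are assumptions rather than the post's actual interface:

```python
import argparse

def parse_args(argv=None):
    """Parse the model hyperparameters, falling back to the defaults above."""
    parser = argparse.ArgumentParser(description="Train the car-evaluation model")
    parser.add_argument("--criterion", default="entropy",
                        choices=["gini", "entropy"])
    parser.add_argument("--max_depth", type=int, default=3)
    return parser.parse_args(argv)

args = parse_args([])  # no flags given: fall back to the defaults
print(args.criterion, args.max_depth)
```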
Finally, you can visualize and track the experiments over time by running mlflow ui.
Conclusion
DVC and MLflow are powerful tools that can help you manage your ML projects more efficiently. By combining them, you can leverage the best of both worlds: data versioning with DVC, and experiment tracking and model deployment with MLflow. I hope this blog post has taught you a thing or two on how to use them together.
If you want to learn more about DVC, MLflow, or MLOps in general, the official DVC and MLflow documentation are good places to start.
You can also find the full code for this blog post on GitHub.
Sayonara! Happy hacking 🚀
Written by Jean Nshuti