MLOps Zoomcamp: week 2

Wonhyeong Seo

Introduction

In this technical blog, we will explore the world of MLflow, a powerful tool for experiment tracking and model management. MLflow offers a streamlined workflow to manage machine learning projects, from data preprocessing to model deployment. We will dive into a homework assignment that showcases the key features of MLflow, including package installation, data preprocessing, model training, hyperparameter tuning, and model registry.

Q1. Install the package

To embark on our MLflow journey, we first need to install the MLflow package. The recommended approach is to create a separate Python environment, such as a conda environment, and install the package using pip or conda. Once installed, we will verify the installation by checking the version of MLflow.

The installation steps involve cloning the repository and installing the required packages from the provided requirements.txt file. The output confirms the successful installation of MLflow, displaying the version number.
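As a quick sanity check, the installed version can also be read from Python itself. A minimal sketch, assuming MLflow was installed into the active environment:

```python
# Print the MLflow version; equivalent to running `mlflow --version`
# on the command line.
import mlflow

print(mlflow.__version__)
```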

Q2. Download and preprocess the data

Our next challenge involves working with real-world data: the Green Taxi Trip Records dataset. We will predict the tip amount for each trip using MLflow. To begin, we download the January, February, and March 2022 records in parquet format from the provided source. To preprocess the data, we execute a Python script named preprocess_data.py, which loads the data, fits a DictVectorizer on the training set, and saves the preprocessed datasets and the fitted DictVectorizer to disk.

To complete this step, we download the datasets and execute the provided command, ensuring to replace <TAXI_DATA_FOLDER> with the correct location where the data is saved. After successful execution, we determine the size of the saved DictVectorizer file.
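Once the script finishes, the file size can be checked from Python. A small sketch, assuming the vectorizer is written as dv.pkl to an ./output folder as in the course repository; adjust the path to match your --dest_path:

```python
# Inspect the size of the saved DictVectorizer on disk.
import os

size_bytes = os.path.getsize("./output/dv.pkl")
print(f"dv.pkl: {size_bytes:,} bytes ({size_bytes / 1024:.1f} KiB)")
```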

Q3. Train a model with autolog

With the preprocessed data in hand, we move on to training a machine learning model using MLflow's autologging capabilities. We train a RandomForestRegressor on the taxi dataset by modifying the provided train.py script. It loads the datasets, trains the model, and calculates the RMSE score on the validation set.

To enable autologging, we call mlflow.autolog() before training and wrap the training code in a with mlflow.start_run(): block, then execute the modified script. Next, we launch the MLflow UI to verify that the experiment run was properly tracked. In this step, we identify the value of the max_depth hyperparameter.
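The core of the modified script looks roughly like the sketch below. The file paths, the load_pickle helper, and the hyperparameter values are assumptions based on the course starter code, not a verbatim copy:

```python
import pickle

import mlflow
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


def load_pickle(filename):
    # The (features, target) tuples were pickled by preprocess_data.py.
    with open(filename, "rb") as f_in:
        return pickle.load(f_in)


# Enable autologging for scikit-learn: parameters, metrics, and the
# fitted model are captured without explicit log_param/log_metric calls.
mlflow.sklearn.autolog()

X_train, y_train = load_pickle("./output/train.pkl")
X_val, y_val = load_pickle("./output/val.pkl")

with mlflow.start_run():
    rf = RandomForestRegressor(max_depth=10, random_state=0)
    rf.fit(X_train, y_train)
    # squared=False returns the RMSE (scikit-learn < 1.6).
    rmse = mean_squared_error(y_val, rf.predict(X_val), squared=False)
```

With autologging enabled, max_depth appears under the run's parameters in the MLflow UI, which is where the answer to this question comes from.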

Q4. Tune model hyperparameters

To further improve our model's performance, we delve into hyperparameter tuning with Optuna. We modify the hpo.py script so that the validation RMSE is logged to the tracking server for each run of the hyperparameter optimization. Running the script without passing any parameters, we then open the MLflow UI to explore the runs from the "random-forest-hyperopt" experiment and find the best validation RMSE.
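The only change the assignment asks for is logging the validation RMSE once per trial. A hedged sketch of the relevant part, assuming the Optuna-based hpo.py from the course repository (the search space and data loading here are illustrative):

```python
import mlflow
import optuna
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

mlflow.set_experiment("random-forest-hyperopt")

# X_train, y_train, X_val, y_val are loaded as in the Q3 sketch.

def objective(trial):
    # Illustrative search space; the real one comes from the course's hpo.py.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 10, 50),
        "max_depth": trial.suggest_int("max_depth", 1, 20),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 4),
        "random_state": 42,
    }
    with mlflow.start_run():
        mlflow.log_params(params)
        rf = RandomForestRegressor(**params)
        rf.fit(X_train, y_train)
        rmse = mean_squared_error(y_val, rf.predict(X_val), squared=False)
        mlflow.log_metric("rmse", rmse)  # the metric we sort by in the UI
    return rmse

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=10)
```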

Q5. Promote the best model to the model registry

Having achieved promising results through hyperparameter optimization, we aim to promote the best model to the model registry. The register_model.py script helps us in this task. It selects the top 5 runs based on the previous step's results, calculates the RMSE on the test set, and saves the results to a new experiment called "random-forest-best-models".

By updating the script, we ensure that the model with the lowest RMSE on the test set is registered to the model registry. This involves the search_runs method of MlflowClient to find the best run and the mlflow.register_model function to register it.
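A minimal sketch of that selection-and-registration step. The experiment name comes from the assignment; the test_rmse metric key and the registered model name are assumptions for illustration:

```python
import mlflow
from mlflow.entities import ViewType
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Pick the run with the lowest test RMSE among the candidate runs.
experiment = client.get_experiment_by_name("random-forest-best-models")
best_run = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    run_view_type=ViewType.ACTIVE_ONLY,
    max_results=1,
    order_by=["metrics.test_rmse ASC"],
)[0]

# Register the winning run's logged model in the model registry.
mlflow.register_model(
    model_uri=f"runs:/{best_run.info.run_id}/model",
    name="green-taxi-rf-best",  # illustrative registry name
)
```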

Q6. Model metadata

In the final step, we explore the best model in the model registry using the MLflow UI and examine the information stored about each registered model. The registry records essential details such as the version number, the source experiment, and the model signature, so for this question, all of the above answers are correct.
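For readers who prefer code to clicking through the UI, the same metadata can be pulled programmatically. A small sketch, reusing the illustrative registry name from the previous step:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()

# List every version of the registered model with its source run and stage.
for mv in client.search_model_versions("name='green-taxi-rf-best'"):
    print(f"version={mv.version}, run_id={mv.run_id}, stage={mv.current_stage}")
```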

Conclusion

In this blog, we embarked on a journey through MLflow, a powerful tool for experiment tracking and model management. We completed a homework assignment that involved installing MLflow, downloading and preprocessing data, training models with autolog, launching a tracking server, tuning model hyperparameters, and promoting the best model to the model registry. Throughout the assignment, we explored various features and capabilities of MLflow, empowering us to efficiently manage machine learning projects from start to finish.

By harnessing the capabilities of MLflow, data scientists and machine learning practitioners can streamline their workflows, track experiments, manage models, and deploy them to production with confidence. MLflow's integration with popular machine learning libraries, its intuitive UI, and its powerful model registry make it an indispensable tool in the modern machine learning landscape.

We hope this blog has provided valuable insights into the practical application of MLflow and its role in experiment tracking and model management. By embracing MLflow, data scientists can focus more on innovation, collaboration, and delivering impactful machine learning solutions.

References

  1. MLflow Documentation: https://mlflow.org/docs/home/

  2. NYC Taxi and Limousine Commission: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

  3. Conda Environments Documentation: https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html#managing-envs

Addendum: Introducing DAGsHub and Using it with Google Colab

In addition to exploring the powerful capabilities of MLflow, we would like to introduce you to DAGsHub, a collaborative platform for managing and versioning data science projects. DAGsHub seamlessly integrates with Git and enables teams to track experiments, collaborate on code, and easily share project results. In this addendum, we will demonstrate how to use DAGsHub with Google Colab, a popular cloud-based Jupyter Notebook environment.

Introducing DAGsHub

DAGsHub is a platform specifically designed for data science collaboration and version control. It provides a unified environment where data scientists can manage their code, track experiments, and collaborate with team members effectively. DAGsHub builds on the power of Git, offering additional features tailored to the needs of the data science community.

With DAGsHub, you can:

  1. Track and version your experiments: DAGsHub seamlessly integrates with Git, enabling you to track and version your code, data, and models. You can easily compare different versions of your experiments, understand the changes, and keep a complete history of your project.

  2. Collaborate with your team: DAGsHub provides a collaborative environment where team members can work together on data science projects. You can share code, notebooks, and experiment results with your colleagues, facilitating knowledge sharing and fostering a collaborative culture.

  3. Reproducibility and provenance: DAGsHub ensures that your experiments are reproducible by capturing the full provenance of your project. You can easily reproduce any previous experiment by accessing the exact code, data, and environment used during that experiment.

  4. Integrations and continuous integration (CI): DAGsHub integrates with popular tools such as MLflow, Jupyter Notebooks, and GitLab, enabling smooth workflows and automating processes. It also supports continuous integration (CI), allowing you to automate testing, model training, and deployment pipelines.

Using DAGsHub with Google Colab

To leverage the power of DAGsHub in your Google Colab environment, use the DAGsYard setup notebook after creating a cookie-cutter-mlops repository on DAGsHub.
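The essential step is pointing MLflow at the DAGsHub tracking server for your repository. A minimal sketch with placeholder credentials: <DAGSHUB_USER> and <DAGSHUB_TOKEN> stand in for your username and an access token from your DAGsHub account settings, and the repository name follows the cookie-cutter-mlops example above:

```python
import os

import mlflow

# Authenticate against the DAGsHub-hosted MLflow tracking server.
os.environ["MLFLOW_TRACKING_USERNAME"] = "<DAGSHUB_USER>"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "<DAGSHUB_TOKEN>"
mlflow.set_tracking_uri("https://dagshub.com/<DAGSHUB_USER>/cookie-cutter-mlops.mlflow")

# Any run logged from Colab now appears in the DAGsHub experiment view.
mlflow.set_experiment("colab-runs")
with mlflow.start_run():
    mlflow.log_param("environment", "google-colab")
```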

By combining the power of DAGsHub with the convenience of Google Colab, you can enjoy a seamless and efficient workflow for your data science projects. You can leverage the collaborative features of DAGsHub while utilizing the computational resources and interactive environment of Google Colab.

Conclusion

In this addendum, we introduced DAGsHub, a collaborative platform for managing and versioning data science projects. We discussed its features and how it enhances collaboration and reproducibility in data science workflows. Additionally, we explored how to use DAGsHub with Google Colab, enabling seamless integration of version control and collaboration capabilities within the Colab environment.

By leveraging the power of DAGsHub and Google Colab together, data scientists can efficiently track experiments, collaborate with team members, and manage their projects with ease. DAGsHub provides the necessary tools to ensure reproducibility, version control, and collaboration, making it an invaluable asset for data science teams.

With the combined strength of MLflow for experiment tracking and model management, and DAGsHub for collaboration and version control, data scientists can streamline their workflows, foster collaboration, and deliver impactful machine learning solutions with confidence.

References

  1. Google Colab: https://colab.research.google.com/

  2. DAGsHub Documentation: https://docs.dagshub.com/

  3. Setting Up MLflow on Google Colab (StackOverflow): https://stackoverflow.com/questions/61615818/setting-up-mlflow-on-google-colab
