MLOps Zoomcamp: week 3

Wonhyeong Seo

In the dynamic world of data science, efficient and reliable workflow orchestration is crucial. Recently, I embarked on an enlightening journey exploring data workflows using Prefect, a workflow management system designed for modern infrastructure. My task was to build, execute, and monitor a data workflow for predicting taxi ride durations using machine learning.

Data Preparation and Feature Engineering

The first stage in any data science project is gathering and preparing the data. I used the read_data task in Prefect to load the Green Taxi data from January and February 2023. This task reads the data into a pandas DataFrame, computes trip durations, filters out rides that lasted less than one minute or more than an hour, and converts certain features to the correct data types.
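A sketch of what that task looks like, assuming the standard Green Taxi parquet schema (lpep_pickup_datetime and lpep_dropoff_datetime timestamps plus the location ID columns):

```python
import pandas as pd
from prefect import task


@task
def read_data(filename: str) -> pd.DataFrame:
    """Load Green Taxi data and keep rides lasting 1-60 minutes."""
    df = pd.read_parquet(filename)

    # Trip duration in minutes from the pickup/dropoff timestamps
    df["duration"] = df.lpep_dropoff_datetime - df.lpep_pickup_datetime
    df["duration"] = df.duration.dt.total_seconds() / 60

    # Filter out rides shorter than one minute or longer than an hour
    df = df[(df.duration >= 1) & (df.duration <= 60)]

    # Location IDs are categories, not numbers
    categorical = ["PULocationID", "DOLocationID"]
    df[categorical] = df[categorical].astype(str)
    return df
```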

Next, I performed feature engineering with the add_features task. This task creates a new feature, "PU_DO", by concatenating the pickup and dropoff location IDs, and then vectorizes the categorical features with scikit-learn's DictVectorizer. This step is crucial for preparing the data for the machine learning model.
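In code, the step looks roughly like the following; the trip_distance numerical feature is my assumption about what accompanies PU_DO:

```python
import pandas as pd
from prefect import task
from sklearn.feature_extraction import DictVectorizer


@task
def add_features(df_train: pd.DataFrame, df_val: pd.DataFrame):
    """Create the PU_DO feature and one-hot encode it with a DictVectorizer."""
    # Concatenate pickup and dropoff location IDs into a single feature
    df_train["PU_DO"] = df_train["PULocationID"] + "_" + df_train["DOLocationID"]
    df_val["PU_DO"] = df_val["PULocationID"] + "_" + df_val["DOLocationID"]

    categorical = ["PU_DO"]
    numerical = ["trip_distance"]

    dv = DictVectorizer()
    train_dicts = df_train[categorical + numerical].to_dict(orient="records")
    X_train = dv.fit_transform(train_dicts)  # fit the vocabulary on training data only

    val_dicts = df_val[categorical + numerical].to_dict(orient="records")
    X_val = dv.transform(val_dicts)  # reuse the fitted vocabulary for validation

    y_train = df_train["duration"].values
    y_val = df_val["duration"].values
    return X_train, X_val, y_train, y_val, dv
```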

Model Training and Evaluation

After data preparation and feature engineering, I used the train_best_model task to train an XGBoost model with a set of previously tuned hyperparameters. The model was trained on the prepared data, and its performance was evaluated with the root mean square error (RMSE) metric.
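A minimal sketch of that task; the hyperparameter values below are illustrative stand-ins for the tuned set, not the exact numbers I used:

```python
import numpy as np
import xgboost as xgb
from prefect import task
from sklearn.metrics import mean_squared_error


@task(log_prints=True)
def train_best_model(X_train, X_val, y_train, y_val) -> float:
    """Train XGBoost with a fixed set of tuned hyperparameters and report RMSE."""
    train = xgb.DMatrix(X_train, label=y_train)
    valid = xgb.DMatrix(X_val, label=y_val)

    best_params = {
        "learning_rate": 0.1,  # illustrative values; substitute the tuned set
        "max_depth": 30,
        "min_child_weight": 1.0,
        "objective": "reg:squarederror",
        "seed": 42,
    }

    booster = xgb.train(
        params=best_params,
        dtrain=train,
        num_boost_round=100,
        evals=[(valid, "validation")],
        early_stopping_rounds=20,
    )

    # Root mean square error on the validation set
    y_pred = booster.predict(valid)
    rmse = float(np.sqrt(mean_squared_error(y_val, y_pred)))
    print(f"validation RMSE: {rmse:.3f}")
    return rmse
```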

To keep track of the model's performance and the hyperparameters used, I integrated the workflow with MLflow, an open-source platform for managing the end-to-end machine learning lifecycle. This integration allowed me to log the hyperparameters and RMSE of the model, providing a clear record of the model's performance for future reference.
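Inside the training task, the logging amounts to a few MLflow calls, reusing best_params, rmse, and booster from the sketch above; the local SQLite tracking URI and experiment name are placeholders for your own setup:

```python
import mlflow

mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("nyc-taxi-experiment")

with mlflow.start_run():
    mlflow.log_params(best_params)   # hyperparameters used for this run
    mlflow.log_metric("rmse", rmse)  # validation score
    mlflow.xgboost.log_model(booster, artifact_path="models_mlflow")
```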

Data Workflow Orchestration with Prefect

One of the most empowering aspects of Prefect is its flexibility and control over task execution. In Prefect, tasks are Python functions that can take inputs, perform work, and return an output. They also receive metadata about upstream dependencies and their states, which enables tasks to wait for the completion of other tasks before executing.
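A minimal, self-contained example of this wiring: the second task waits for the first simply because it consumes its output.

```python
from prefect import flow, task


@task
def extract() -> list:
    return [1, 2, 3]


@task
def transform(values: list) -> list:
    return [v * 2 for v in values]


@flow
def pipeline():
    raw = extract()           # runs first
    doubled = transform(raw)  # waits on extract's result
    print(doubled)


if __name__ == "__main__":
    pipeline()
```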

By dividing the workflow into small tasks, Prefect gives fine-grained control over task failures and, in turn, improves the reliability of the entire data pipeline. Tasks can also be customized through optional arguments, such as the number of retries on failure and the delay before retrying, both of which proved valuable for managing my data workflow.
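For example, decorating a flaky task this way makes Prefect retry it up to three times, pausing two seconds between attempts:

```python
from prefect import task


@task(retries=3, retry_delay_seconds=2)
def read_data(filename: str):
    ...  # any exception here triggers up to three retries
```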

The Road Ahead

My journey with Prefect and MLflow has been insightful and enriching. The ability to manage data workflows effectively and monitor the machine learning lifecycle has made me appreciate the importance of these tools in the realm of data science.

As I move forward, I aim to delve deeper into more advanced features of Prefect and MLflow, such as parallel task execution, state handlers, and model versioning. The adventure of data science continues, and I'm excited about the new challenges and opportunities that lie ahead.

