MLflow: An Introduction

What is MLflow?

MLflow is an open-source platform designed to streamline the machine learning (ML) lifecycle, including experimentation, reproducibility, deployment, and a central model registry. It enables data scientists and ML engineers to track experiments, version models, and deploy them efficiently.

Why Use MLflow?

  • Experiment Tracking: Log parameters, code versions, metrics, and output files when running machine learning code.

  • Reproducibility: Reproduce runs to understand how models were trained.

  • Model Management: Register, track, and manage ML models.

  • Deployment: Deploy models in diverse environments using MLflow Models.


Key Components of MLflow

  1. MLflow Tracking: Records and queries experiments; logs parameters, metrics, and artifacts (output files).

  2. MLflow Projects: Packages ML code in a reusable, reproducible form with environment specification.

  3. MLflow Models: A standard format for packaging ML models to deploy in various environments.

  4. MLflow Model Registry: A centralized store to manage the full lifecycle of MLflow models.


MLflow with Code Examples

Example: Training a Simple Linear Regression Model with MLflow Tracking

Objective: Demonstrate how to use MLflow to track experiments by training a linear regression model on the Diabetes dataset.

Step 1: Install Required Libraries

Ensure you have the necessary libraries installed:

pip install mlflow scikit-learn pandas

Step 2: Import Libraries

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd

Step 3: Load and Prepare Data

# Load dataset
data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 4: Set Up MLflow Tracking

# Set the MLflow experiment (optional)
mlflow.set_experiment("Diabetes_LR_Experiment")

# Start an MLflow run
with mlflow.start_run():
    # Instantiate the model with a parameter
    model = LinearRegression(fit_intercept=True)

    # Log model parameters
    mlflow.log_param("fit_intercept", model.fit_intercept)

    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Calculate metrics
    mse = mean_squared_error(y_test, y_pred)
    rmse = mse ** 0.5

    # Log metrics
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("rmse", rmse)

    # Log the model
    mlflow.sklearn.log_model(model, "linear_regression_model")

    print(f"Logged data and model in run {mlflow.active_run().info.run_id}")

Explanation:

  • mlflow.set_experiment: Sets the experiment name under which runs are grouped.

  • mlflow.start_run(): Starts a new MLflow run.

  • mlflow.log_param: Logs a parameter used in the model.

  • mlflow.log_metric: Logs a metric from model evaluation.

  • mlflow.sklearn.log_model: Logs the trained model.

Step 5: View the Logged Data

Start the MLflow UI to visualize the experiment data:

mlflow ui

By default, the UI will be available at http://localhost:5000. Navigate there to see the logged parameters, metrics, and model artifacts.
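
If you want runs stored somewhere other than the default local mlruns directory (for example, on a shared tracking server), point MLflow at it before starting runs. A minimal sketch; the server address below is an assumption, so replace it with your own:

import mlflow

# Send runs to a tracking server instead of the local ./mlruns directory
mlflow.set_tracking_uri("http://localhost:5000")  # assumed address
mlflow.set_experiment("Diabetes_LR_Experiment")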


End-to-End Functionality of MLflow

Let's delve deeper into MLflow's functionality, covering each component with code examples and commands.

1. MLflow Tracking

Logging Additional Parameters and Metrics

You can log any number of parameters and metrics:

# Log more parameters (these calls must run inside an active run)
mlflow.log_param("copy_X", True)
mlflow.log_param("positive", False)

# Log more metrics
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
mlflow.log_metric("r2_score", r2)

Logging Artifacts

Artifacts are output files or data you want to associate with a run:

# Save predictions to a CSV file
predictions = pd.DataFrame({"Actual": y_test, "Predicted": y_pred})
predictions.to_csv("predictions.csv", index=False)

# Log the artifact
mlflow.log_artifact("predictions.csv")

2. MLflow Projects

Creating an MLproject File

An MLproject file names the project, points to its environment specification, and declares its entry points and their parameters:

name: Diabetes_LR_Project

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      fit_intercept: {type: string, default: "True"}
    command: "python train.py --fit_intercept {fit_intercept}"

Creating a Conda Environment File (conda.yaml)

name: mlflow-env
channels:
  - defaults
dependencies:
  - python=3.8
  - scikit-learn
  - pandas
  - pip
  - pip:
    - mlflow

Running the MLflow Project

mlflow run . -P fit_intercept=False

This command tells MLflow to run the project in the current directory (.) with the parameter fit_intercept set to False.
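
The entry point above calls train.py, which is not shown in the project layout. A minimal sketch of what it could look like, mirroring the training code from Step 4 (the argument parsing is an assumption; the parameter arrives as a string):

# train.py
import argparse

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

parser = argparse.ArgumentParser()
parser.add_argument("--fit_intercept", type=str, default="True")
args = parser.parse_args()
fit_intercept = args.fit_intercept.lower() == "true"  # MLproject passes the value as a string

data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run():
    model = LinearRegression(fit_intercept=fit_intercept)
    mlflow.log_param("fit_intercept", fit_intercept)
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    mlflow.log_metric("mse", mse)
    mlflow.sklearn.log_model(model, "linear_regression_model")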

3. MLflow Models

Saving and Loading Models

You can save the model locally and load it later:

Saving the Model

mlflow.sklearn.save_model(model, "model")

Loading the Model

loaded_model = mlflow.sklearn.load_model("model")

Making Predictions with the Loaded Model

loaded_model.predict(X_test)
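
Because MLflow Models are stored in a standard format with multiple "flavors", the same saved model can also be loaded through the generic pyfunc interface, independent of the training framework:

import mlflow.pyfunc

# Load the model saved above as a generic Python function
pyfunc_model = mlflow.pyfunc.load_model("model")
pyfunc_model.predict(X_test)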

4. MLflow Model Registry

The Model Registry allows you to manage model versions and stages.

Registering a Model

# Get the ID of the run that logged the model
# (mlflow.active_run() works only while the run is active; otherwise reuse the run ID printed earlier)
run_id = mlflow.active_run().info.run_id

# Register the model
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/linear_regression_model",
    name="DiabetesLinearRegressionModel"
)
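
Alternatively, a model can be registered at logging time by passing registered_model_name to log_model, which avoids the separate registration step:

# Inside an active run: log and register in one call
mlflow.sklearn.log_model(
    model,
    "linear_regression_model",
    registered_model_name="DiabetesLinearRegressionModel"
)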

Transitioning Model Versions Between Stages

from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="DiabetesLinearRegressionModel",
    version=1,
    stage="Staging"
)

Listing Registered Models

for rm in client.search_registered_models():
    print(rm)

Loading a Registered Model

model_name = "DiabetesLinearRegressionModel"
model_stage = "Staging"

loaded_model = mlflow.pyfunc.load_model(
    model_uri=f"models:/{model_name}/{model_stage}"
)
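
A specific version can be loaded instead of a stage by putting the version number in the model URI:

# Load version 1 of the registered model explicitly
loaded_model = mlflow.pyfunc.load_model(
    model_uri=f"models:/{model_name}/1"
)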

Making Predictions

predictions = loaded_model.predict(X_test)

5. Deployment

MLflow Models can be deployed to various platforms. Here's an example of deploying a model as a local REST API using mlflow models serve.

Serving the Model Locally

mlflow models serve -m "models:/DiabetesLinearRegressionModel/Staging" -p 1234

Making a Prediction via REST API

curl -X POST -H "Content-Type: application/json" --data '{"dataframe_split": {"columns": [...], "data": [...]}}' http://127.0.0.1:1234/invocations

Replace ... with the actual column names and rows of feature values. Recent MLflow versions expect the JSON payload wrapped in a key such as dataframe_split, dataframe_records, or inputs.
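
The same request can be made from Python with the requests library (an extra dependency; this sketch sends the first five test rows using the dataframe_split payload shown above):

import requests

# Build the payload from the test set and call the local scoring endpoint
payload = {"dataframe_split": {
    "columns": list(X_test.columns),
    "data": X_test.head(5).values.tolist(),
}}
response = requests.post("http://127.0.0.1:1234/invocations", json=payload)
print(response.json())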


All Commands Explained

  • Install MLflow and Dependencies

      pip install mlflow scikit-learn pandas
    
  • Set Experiment

      mlflow.set_experiment("Experiment_Name")
    
  • Start an MLflow Run

      with mlflow.start_run():
          # Your code here
    
  • Log Parameters

      mlflow.log_param("param_name", param_value)
    
  • Log Metrics

      mlflow.log_metric("metric_name", metric_value)
    
  • Log Artifacts

      mlflow.log_artifact("path/to/file")
    
  • Log a Model

      mlflow.sklearn.log_model(model, "model_name")
    
  • Save a Model

      mlflow.sklearn.save_model(model, "path/to/save")
    
  • Load a Model

      model = mlflow.sklearn.load_model("path/to/save")
    
  • Register a Model

      mlflow.register_model(model_uri, "Registered_Model_Name")
    
  • Transition Model Stage

      client.transition_model_version_stage(
          name="Registered_Model_Name",
          version=version_number,
          stage="Production"
      )
    
  • Serve a Model Locally

      mlflow models serve -m "models:/Registered_Model_Name/Stage" -p 1234
    
  • Run an MLflow Project

      mlflow run . -P param_name=param_value
    
  • Start the MLflow UI

      mlflow ui
    

Quick Revision Notes

  • MLflow Components:

    • Tracking: Log and query experiments.

    • Projects: Reproducible runs with environment specifications.

    • Models: Package models in a standard format.

    • Model Registry: Centralized model store.

  • MLflow Tracking:

    • Parameters: Model inputs (e.g., hyperparameters).

    • Metrics: Evaluation results (e.g., accuracy, loss).

    • Artifacts: Output files (e.g., models, datasets).

    • Commands:

      • mlflow.log_param("param_name", value)

      • mlflow.log_metric("metric_name", value)

      • mlflow.log_artifact("path/to/artifact")

  • MLflow Projects:

    • Define project structure with MLproject file.

    • Specify dependencies using conda.yaml.

    • Run projects with mlflow run.

  • MLflow Models:

    • Log models with mlflow.<framework>.log_model.

    • Load models with mlflow.<framework>.load_model.

    • Serve models locally or deploy to platforms.

  • Model Registry:

    • Register models using mlflow.register_model.

    • Manage model versions and stages (Staging, Production).

    • Transition stages with client.transition_model_version_stage.

  • Best Practices:

    • Consistent Experimentation: Use mlflow.set_experiment to organize runs.

    • Parameter Logging: Log all relevant parameters for reproducibility.

    • Metric Tracking: Log key performance metrics.

    • Artifact Management: Log models and important files.

    • Version Control: Keep code in Git or another VCS.

    • Environment Specification: Use conda.yaml for dependency management.

