MLflow: An Introduction
What is MLflow?
MLflow is an open-source platform designed to streamline the machine learning (ML) lifecycle, including experimentation, reproducibility, deployment, and a central model registry. It enables data scientists and ML engineers to track experiments, version models, and deploy them efficiently.
Why Use MLflow?
Experiment Tracking: Log parameters, code versions, metrics, and output files when running machine learning code.
Reproducibility: Reproduce runs to understand how models were trained.
Model Management: Register, track, and manage ML models.
Deployment: Deploy models in diverse environments using MLflow Models.
Key Components of MLflow
MLflow Tracking: Records and queries experiments; logs parameters, metrics, and artifacts (output files).
MLflow Projects: Packages ML code in a reusable, reproducible form with environment specification.
MLflow Models: A standard format for packaging ML models to deploy in various environments.
MLflow Model Registry: A centralized store to manage the full lifecycle of MLflow models.
MLflow with Code Examples
Example: Training a Simple Linear Regression Model with MLflow Tracking
Objective: Demonstrate how to use MLflow to track experiments by training a linear regression model on the Diabetes dataset.
Step 1: Install Required Libraries
Ensure you have the necessary libraries installed:
pip install mlflow scikit-learn pandas
Step 2: Import Libraries
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd
Step 3: Load and Prepare Data
# Load dataset
data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 4: Set Up MLflow Tracking
# Set the MLflow experiment (optional)
mlflow.set_experiment("Diabetes_LR_Experiment")
# Start an MLflow run
with mlflow.start_run():
    # Instantiate the model with a parameter
    model = LinearRegression(fit_intercept=True)
    # Log model parameters
    mlflow.log_param("fit_intercept", model.fit_intercept)
    # Train the model
    model.fit(X_train, y_train)
    # Make predictions
    y_pred = model.predict(X_test)
    # Calculate metrics
    mse = mean_squared_error(y_test, y_pred)
    rmse = mse ** 0.5
    # Log metrics
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("rmse", rmse)
    # Log the model
    mlflow.sklearn.log_model(model, "linear_regression_model")
    # Print the run ID while the run is still active
    print(f"Logged data and model in run {mlflow.active_run().info.run_id}")
Explanation:
mlflow.set_experiment: Sets the experiment name under which runs are grouped.
mlflow.start_run(): Starts a new MLflow run.
mlflow.log_param: Logs a parameter used in the model.
mlflow.log_metric: Logs a metric from model evaluation.
mlflow.sklearn.log_model: Logs the trained model.
Step 5: View the Logged Data
Start the MLflow UI to visualize the experiment data:
mlflow ui
By default, the UI is available at http://localhost:5000. Navigate there to see the logged parameters, metrics, and model artifacts.
End-to-End Functionality of MLflow
Let's delve deeper into MLflow's functionality, covering each component with code examples and commands.
1. MLflow Tracking
Logging Additional Parameters and Metrics
You can log any number of parameters and metrics:
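# Log more parameters (call these inside the active run)
# Note: LinearRegression no longer has a "normalize" argument in scikit-learn 1.2+
mlflow.log_param("copy_X", model.copy_X)
mlflow.log_param("test_size", 0.2)
# Log more metrics
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
mlflow.log_metric("r2_score", r2)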
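To make the logged numbers easier to interpret, here is a small pure-Python sketch of how MSE, RMSE, and R² relate to each other. The helper name regression_metrics and the toy values are illustrative assumptions; in practice the scikit-learn functions used above do the same job.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (mse, rmse, r2) for two equal-length sequences of numbers."""
    n = len(y_true)
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    mse = sum(squared_errors) / n          # mean squared error
    rmse = math.sqrt(mse)                  # same units as the target
    mean_true = sum(y_true) / n
    total_variation = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - sum(squared_errors) / total_variation
    return mse, rmse, r2


# Toy values chosen for illustration only
mse, rmse, r2 = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(mse, rmse, r2)
```

RMSE is often the easier metric to read in the UI because it is expressed in the same units as the target.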
Logging Artifacts
Artifacts are output files or data you want to associate with a run:
# Save predictions to a CSV file
predictions = pd.DataFrame({"Actual": y_test, "Predicted": y_pred})
predictions.to_csv("predictions.csv", index=False)
# Log the artifact
mlflow.log_artifact("predictions.csv")
2. MLflow Projects
Creating an MLproject File
An MLproject file defines the project structure and dependencies:
name: Diabetes_LR_Project
conda_env: conda.yaml
entry_points:
  main:
    parameters:
      fit_intercept: {type: bool, default: True}
    command: "python train.py --fit_intercept {fit_intercept}"
Creating a Conda Environment File (conda.yaml)
name: mlflow-env
channels:
  - defaults
dependencies:
  - python=3.8
  - scikit-learn
  - pandas
  - pip
  - pip:
    - mlflow
Running the MLflow Project
mlflow run . -P fit_intercept=False
This command tells MLflow to run the project in the current directory (.) with the parameter fit_intercept set to False.
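MLflow substitutes the parameter into the command string, so train.py receives the text False on its command line rather than a Python boolean. The article does not show train.py, so the following is only a hypothetical sketch of how its argument parsing might look. The helper names str_to_bool and parse_args are assumptions.

```python
import argparse


def str_to_bool(text: str) -> bool:
    """Convert command-line text such as 'True' or 'false' to a bool."""
    value = text.strip().lower()
    if value in {"true", "1", "yes"}:
        return True
    if value in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {text!r}")


def parse_args(argv=None):
    """Parse the arguments that the MLproject command passes to train.py."""
    parser = argparse.ArgumentParser(description="Train the diabetes regression model")
    parser.add_argument("--fit_intercept", type=str_to_bool, default=True)
    return parser.parse_args(argv)


# Simulate the command line produced by: mlflow run . -P fit_intercept=False
args = parse_args(["--fit_intercept", "False"])
print(f"fit_intercept={args.fit_intercept}")
```

A custom converter is used because argparse's type=bool would treat any non-empty string, including "False", as True.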
3. MLflow Models
Saving and Loading Models
You can save the model locally and load it later:
Saving the Model
mlflow.sklearn.save_model(model, "model")
Loading the Model
loaded_model = mlflow.sklearn.load_model("model")
Making Predictions with the Loaded Model
loaded_model.predict(X_test)
4. MLflow Model Registry
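The Model Registry allows you to manage model versions and stages. Note that MLflow 2.9 and later deprecate stages in favor of model version aliases, so the stage-based calls in this section may emit deprecation warnings on newer releases.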
Registering a Model
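# Use the ID of the run that logged the model.
# mlflow.active_run() returns None once the run has ended, so use the last run instead.
run_id = mlflow.last_active_run().info.run_id
# Register the model
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/linear_regression_model",
    name="DiabetesLinearRegressionModel"
)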
Transitioning Model Versions Between Stages
from mlflow.tracking import MlflowClient
client = MlflowClient()
client.transition_model_version_stage(
    name="DiabetesLinearRegressionModel",
    version=1,
    stage="Staging"
)
Listing Registered Models
# list_registered_models() was removed in MLflow 2.0; search_registered_models() replaces it
for rm in client.search_registered_models():
    print(rm)
Loading a Registered Model
model_name = "DiabetesLinearRegressionModel"
model_stage = "Staging"
loaded_model = mlflow.pyfunc.load_model(
    model_uri=f"models:/{model_name}/{model_stage}"
)
Making Predictions
predictions = loaded_model.predict(X_test)
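The registry examples above rely on two URI schemes: runs:/ points at a model logged under a specific run, and models:/ points at a registered model by version number or stage. The helper functions below are hypothetical conveniences, not MLflow APIs, and the run ID is a placeholder. They only show how the strings are put together.

```python
def run_model_uri(run_id: str, artifact_path: str) -> str:
    """Build a URI for a model logged under a run."""
    return f"runs:/{run_id}/{artifact_path}"


def registry_model_uri(name: str, version_or_stage) -> str:
    """Build a URI for a registered model, by version number or stage name."""
    return f"models:/{name}/{version_or_stage}"


# "abc123" is a placeholder run ID used only for illustration
print(run_model_uri("abc123", "linear_regression_model"))
print(registry_model_uri("DiabetesLinearRegressionModel", 1))
print(registry_model_uri("DiabetesLinearRegressionModel", "Staging"))
```

Pinning a version number gives a reproducible reference, while a stage name follows whichever version currently holds that stage.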
5. Deployment
MLflow Models can be deployed to various platforms. Here's an example of deploying a model as a local REST API using mlflow models serve.
Serving the Model Locally
mlflow models serve -m "models:/DiabetesLinearRegressionModel/Staging" -p 1234
Making a Prediction via REST API
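curl -X POST -H "Content-Type: application/json" --data '{"dataframe_split": {"columns":[...],"data":[...]}}' http://127.0.0.1:1234/invocations
Replace ... with the actual column names and data. MLflow 2.x expects the dataframe_split wrapper shown here; MLflow 1.x servers accept the bare {"columns":[...],"data":[...]} object instead.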
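Writing the request body by hand is error-prone, so it can help to generate it. MLflow 2.x scoring servers expect the columns and rows wrapped in a dataframe_split key, while 1.x releases accepted the bare columns/data object. The sketch below builds either form with the standard library. The helper name build_payload and the feature values are assumptions made for illustration.

```python
import json

# Feature names of scikit-learn's diabetes dataset
COLUMNS = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]


def build_payload(columns, rows, wrapper="dataframe_split"):
    """Serialize rows in pandas 'split' orientation for the /invocations endpoint.

    Pass wrapper=None to emit the bare object accepted by MLflow 1.x servers.
    """
    body = {"columns": list(columns), "data": [list(row) for row in rows]}
    return json.dumps({wrapper: body} if wrapper else body)


# Placeholder row: made-up feature values, not taken from the dataset
payload = build_payload(COLUMNS, [[0.01] * len(COLUMNS)])
print(payload)
```

The resulting string can be passed to curl's --data option or sent with any HTTP client.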
All Commands Explained
Install MLflow and Dependencies
pip install mlflow scikit-learn pandas
Set Experiment
mlflow.set_experiment("Experiment_Name")
Start an MLflow Run
with mlflow.start_run():
    pass  # Your code here
Log Parameters
mlflow.log_param("param_name", param_value)
Log Metrics
mlflow.log_metric("metric_name", metric_value)
Log Artifacts
mlflow.log_artifact("path/to/file")
Log a Model
mlflow.sklearn.log_model(model, "model_name")
Save a Model
mlflow.sklearn.save_model(model, "path/to/save")
Load a Model
model = mlflow.sklearn.load_model("path/to/save")
Register a Model
mlflow.register_model(model_uri, "Registered_Model_Name")
Transition Model Stage
client.transition_model_version_stage(
    name="Registered_Model_Name",
    version=version_number,
    stage="Production"
)
Serve a Model Locally
mlflow models serve -m "models:/Registered_Model_Name/Stage" -p 1234
Run an MLflow Project
mlflow run . -P param_name=param_value
Start the MLflow UI
mlflow ui
Quick Revision Notes
MLflow Components:
Tracking: Log and query experiments.
Projects: Reproducible runs with environment specifications.
Models: Package models in a standard format.
Model Registry: Centralized model store.
MLflow Tracking:
Parameters: Model inputs (e.g., hyperparameters).
Metrics: Evaluation results (e.g., accuracy, loss).
Artifacts: Output files (e.g., models, datasets).
Commands:
mlflow.log_param("param_name", value)
mlflow.log_metric("metric_name", value)
mlflow.log_artifact("path/to/artifact")
MLflow Projects:
Define project structure with an MLproject file.
Specify dependencies using conda.yaml.
Run projects with mlflow run.
MLflow Models:
Log models with mlflow.<framework>.log_model.
Load models with mlflow.<framework>.load_model.
Serve models locally or deploy to platforms.
Model Registry:
Register models using mlflow.register_model.
Manage model versions and stages (Staging, Production).
Transition stages with client.transition_model_version_stage.
Best Practices:
Consistent Experimentation: Use mlflow.set_experiment to organize runs.
Parameter Logging: Log all relevant parameters for reproducibility.
Metric Tracking: Log key performance metrics.
Artifact Management: Log models and important files.
Version Control: Keep code in Git or another VCS.
Environment Specification: Use conda.yaml for dependency management.