04 What You Need: Prerequisite Files Overview

Building robust machine learning pipelines is more than just model training—it's about maintainability, traceability, and reproducibility. In this article, I'll walk you through how to set up structured logging and utility functions for configuration and file management, using a real-world project as an example.
We'll cover:
Why structured logging matters
How to implement logging in your Python package
Utility functions for YAML, JSON, and binary files
Managing schema and parameters with YAML
Example project structure
Let's dive in!
Why Structured Logging?
When your ML pipeline grows, simple print() statements just won't cut it. You need to know what happened, when, and where, especially when debugging or running in production. Structured logging provides:
Timestamps for every event
Log levels (INFO, WARNING, ERROR, etc.)
Module names for context
Multiple outputs (console and file)
Setting Up Structured Logging
Let's create a logging setup that writes logs to both the console and a file, with clear formatting.
# src/Predict_Pipe/logging/__init__.py
import os
import sys
import logging

logging_str = "[%(asctime)s: %(levelname)s: %(module)s]: %(message)s"

log_dir = "logs"
log_filepath = os.path.join(log_dir, "running_logs.log")
os.makedirs(log_dir, exist_ok=True)

logging.basicConfig(
    level=logging.INFO,
    format=logging_str,
    handlers=[
        logging.FileHandler(log_filepath),
        logging.StreamHandler(sys.stdout)
    ]
)

logger = logging.getLogger("Predict_Pipe_logger")
What this does:
Creates a logs directory (if it doesn't exist)
Logs to both a file and the console
Adds timestamps, log levels, and module names to every log message
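With this in place, any other module in the package can import the shared logger instead of scattering print() calls. Here's a minimal sketch, assuming the package is importable as Predict_Pipe; the data-ingestion module and its messages are made up for illustration:

# hypothetical: src/Predict_Pipe/components/data_ingestion.py
from Predict_Pipe.logging import logger

def ingest_data():
    logger.info("Starting data ingestion")
    try:
        # ... read or download the raw dataset here ...
        logger.info("Data ingestion completed")
    except Exception as e:
        logger.error(f"Data ingestion failed: {e}")
        raise

Each call then shows up in both the console and logs/running_logs.log, formatted roughly like [2025-01-01 10:00:00,000: INFO: data_ingestion]: Starting data ingestion.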
Utility Functions: YAML, JSON, and Binary Files
ML projects often juggle multiple config files and artifacts. Let's build utility functions for:
Reading YAML configs (for schema, parameters, etc.)
Creating directories
Saving/loading JSON and binary files
# src/Predict_Pipe/utils/common.py
import os
import json
from pathlib import Path
from typing import Any

import yaml
import joblib
from ensure import ensure_annotations
from box import ConfigBox
from box.exceptions import BoxValueError

from ..logging import logger


@ensure_annotations
def read_yaml(path_to_yaml: Path) -> ConfigBox:
    """Read a YAML file and return its contents as a ConfigBox."""
    try:
        with open(path_to_yaml) as yaml_file:
            content = yaml.safe_load(yaml_file)
            logger.info(f"yaml file: {path_to_yaml} loaded successfully")
            # ConfigBox raises BoxValueError when the file is empty (content is None)
            return ConfigBox(content)
    except BoxValueError:
        raise ValueError("yaml file is empty")
    except Exception as e:
        raise e


@ensure_annotations
def create_directories(path_to_directories: list, verbose=True):
    """Create each directory in the list (no error if it already exists)."""
    for path in path_to_directories:
        os.makedirs(path, exist_ok=True)
        if verbose:
            logger.info(f"created directory at: {path}")


@ensure_annotations
def save_json(path: Path, data: dict):
    """Save a dict as pretty-printed JSON."""
    with open(path, "w") as f:
        json.dump(data, f, indent=4)
    logger.info(f"json file saved at: {path}")


@ensure_annotations
def load_json(path: Path) -> ConfigBox:
    """Load a JSON file and return it as a ConfigBox."""
    with open(path) as f:
        content = json.load(f)
    logger.info(f"json file loaded successfully from: {path}")
    return ConfigBox(content)


@ensure_annotations
def save_bin(data: Any, path: Path) -> Any:
    """Persist any Python object to disk with joblib."""
    joblib.dump(data, path)
    logger.info(f"binary file saved at: {path}")
    return data
Highlights:
All actions are logged for traceability.
Uses ConfigBox for dot-accessible configs.
Type annotations and error handling included.
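Here's a quick sketch of how these helpers fit together in a pipeline stage. The config/config.yaml file, its artifacts_root key, and the metric values are assumptions for illustration, not part of the code above:

# hypothetical usage inside a pipeline stage script
from pathlib import Path
from Predict_Pipe.utils.common import read_yaml, create_directories, save_json

config = read_yaml(Path("config/config.yaml"))   # returns a ConfigBox
create_directories([config.artifacts_root])      # dot access thanks to ConfigBox
save_json(Path("artifacts/metrics.json"), {"rmse": 0.71, "r2": 0.25})  # illustrative metrics

Because read_yaml() returns a ConfigBox, you can write config.artifacts_root instead of config["artifacts_root"], which keeps pipeline code short and readable.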
Managing Schema and Parameters with YAML
YAML files are perfect for storing schema definitions and model parameters.
Example: schema.yaml
COLUMNS:
  fixed acidity: float64
  volatile acidity: float64
  citric acid: float64
  residual sugar: float64
  chlorides: float64
  free sulfur dioxide: float64
  total sulfur dioxide: float64
  density: float64
  pH: float64
  sulphates: float64
  alcohol: float64
  quality: int64
TARGET_COLUMN: "quality"
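A typical use for this schema is a data validation step that checks the incoming dataset before training. Here's a minimal sketch; the wine-quality CSV path is illustrative and this step isn't part of the code above:

# hypothetical validation step driven by schema.yaml
from pathlib import Path
import pandas as pd
from Predict_Pipe.utils.common import read_yaml

schema = read_yaml(Path("schema.yaml"))
df = pd.read_csv("artifacts/data_ingestion/winequality-red.csv")  # illustrative path

missing = set(schema.COLUMNS.keys()) - set(df.columns)
if missing:
    raise ValueError(f"Dataset is missing expected columns: {missing}")

Because the schema lives in YAML rather than code, adding or renaming a column is a one-line change that the validation step picks up automatically.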
Example: params.yaml
ElasticNet:
  alpha: 0.2
  l1_ratio: 0.1
These files are loaded with the read_yaml() utility, making your pipeline flexible and easy to update.
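For instance, a model-training step can pull its hyperparameters straight out of params.yaml. The scikit-learn ElasticNet call below is a sketch of that idea, not code from this article:

# hypothetical training snippet driven by params.yaml
from pathlib import Path
from sklearn.linear_model import ElasticNet
from Predict_Pipe.utils.common import read_yaml

params = read_yaml(Path("params.yaml"))
model = ElasticNet(
    alpha=params.ElasticNet.alpha,        # 0.2 from params.yaml
    l1_ratio=params.ElasticNet.l1_ratio,  # 0.1 from params.yaml
)

Tuning the model then means editing params.yaml, not the training code.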
Final Thoughts
By combining structured logging with robust utility functions and YAML-based configuration, you set your ML pipeline up for success:
Easier debugging: Every step is logged with context.
Reproducibility: Configs are version-controlled and human-readable.
Scalability: Utilities can be reused across projects.
Ready to try it?
Clone the structure above, adapt the utilities to your needs, and watch your ML projects become more maintainable and production-ready!
Let me know in the comments:
What logging or configuration tricks do you use in your ML pipelines?
Happy coding! 🚀
Tags: #Python #MachineLearning #Logging #YAML #MLOps #BestPractices
Next, we'll walk through running each module of the ML pipeline.