04 What You Need: Prerequisite Files Overview

Building robust machine learning pipelines is more than just model training—it's about maintainability, traceability, and reproducibility. In this article, I'll walk you through how to set up structured logging and utility functions for configuration and file management, using a real-world project as an example.
We'll cover:
Why structured logging matters
How to implement logging in your Python package
Utility functions for YAML, JSON, and binary files
Managing schema and parameters with YAML
Example project structure
Let's dive in!
Why Structured Logging?
When your ML pipeline grows, simple print() statements just won't cut it. You need to know what happened, when, and where, especially when debugging or running in production. Structured logging provides:
Timestamps for every event
Log levels (INFO, WARNING, ERROR, etc.)
Module names for context
Multiple outputs (console and file)
Setting Up Structured Logging
Let's create a logging setup that writes logs to both the console and a file, with clear formatting.
# src/Predict_Pipe/logging/__init__.py
import os
import sys
import logging

logging_str = "[%(asctime)s: %(levelname)s: %(module)s]: %(message)s"

log_dir = "logs"
log_filepath = os.path.join(log_dir, "running_logs.log")
os.makedirs(log_dir, exist_ok=True)

logging.basicConfig(
    level=logging.INFO,
    format=logging_str,
    handlers=[
        logging.FileHandler(log_filepath),
        logging.StreamHandler(sys.stdout)
    ]
)

logger = logging.getLogger("Predict_Pipe_logger")
What this does:
Creates a logs directory (if it doesn't exist)
Logs to both a file and the console
Adds timestamps, log levels, and module names to every log message
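With this in place, any other module in the package can import the shared logger instead of scattering print() calls. Here's a minimal sketch, assuming the package is importable as Predict_Pipe; the data-ingestion module and its messages are made up for illustration:

# hypothetical: src/Predict_Pipe/components/data_ingestion.py
from Predict_Pipe.logging import logger

def ingest_data():
    logger.info("Starting data ingestion")
    try:
        # ... read or download the raw dataset here ...
        logger.info("Data ingestion completed")
    except Exception as e:
        logger.error(f"Data ingestion failed: {e}")
        raise

Each call then shows up in both the console and logs/running_logs.log, formatted roughly like [2025-01-01 10:00:00,000: INFO: data_ingestion]: Starting data ingestion.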
Utility Functions: YAML, JSON, and Binary Files
ML projects often juggle multiple config files and artifacts. Let's build utility functions for:
Reading YAML configs (for schema, parameters, etc.)
Creating directories
Saving/loading JSON and binary files
# src/Predict_Pipe/utils/common.py
import os
import json
from pathlib import Path
from typing import Any

import yaml
import joblib
from ensure import ensure_annotations
from box import ConfigBox
from box.exceptions import BoxValueError

from ..logging import logger


@ensure_annotations
def read_yaml(path_to_yaml: Path) -> ConfigBox:
    """Read a YAML file and return its contents as a ConfigBox."""
    try:
        with open(path_to_yaml) as yaml_file:
            content = yaml.safe_load(yaml_file)
            logger.info(f"yaml file: {path_to_yaml} loaded successfully")
            # ConfigBox raises BoxValueError when the file is empty (content is None)
            return ConfigBox(content)
    except BoxValueError:
        raise ValueError("yaml file is empty")
    except Exception as e:
        raise e


@ensure_annotations
def create_directories(path_to_directories: list, verbose=True):
    """Create each directory in the list (no error if it already exists)."""
    for path in path_to_directories:
        os.makedirs(path, exist_ok=True)
        if verbose:
            logger.info(f"created directory at: {path}")


@ensure_annotations
def save_json(path: Path, data: dict):
    """Save a dict as pretty-printed JSON."""
    with open(path, "w") as f:
        json.dump(data, f, indent=4)
    logger.info(f"json file saved at: {path}")


@ensure_annotations
def load_json(path: Path) -> ConfigBox:
    """Load a JSON file and return it as a ConfigBox."""
    with open(path) as f:
        content = json.load(f)
    logger.info(f"json file loaded successfully from: {path}")
    return ConfigBox(content)


@ensure_annotations
def save_bin(data: Any, path: Path) -> Any:
    """Persist any Python object to disk with joblib."""
    joblib.dump(data, path)
    logger.info(f"binary file saved at: {path}")
    return data
Highlights:
All actions are logged for traceability.
Uses ConfigBox for dot-accessible configs.
Type annotations and error handling included.
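Here's a quick sketch of how these helpers fit together in a pipeline stage. The config/config.yaml file, its artifacts_root key, and the metric values are assumptions for illustration, not part of the code above:

# hypothetical usage inside a pipeline stage script
from pathlib import Path
from Predict_Pipe.utils.common import read_yaml, create_directories, save_json

config = read_yaml(Path("config/config.yaml"))   # returns a ConfigBox
create_directories([config.artifacts_root])      # dot access thanks to ConfigBox
save_json(Path("artifacts/metrics.json"), {"rmse": 0.71, "r2": 0.25})  # illustrative metrics

Because read_yaml() returns a ConfigBox, you can write config.artifacts_root instead of config["artifacts_root"], which keeps pipeline code short and readable.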
Managing Schema and Parameters with YAML
YAML files are perfect for storing schema definitions and model parameters.
Example: schema.yaml
COLUMNS:
  fixed acidity: float64
  volatile acidity: float64
  citric acid: float64
  residual sugar: float64
  chlorides: float64
  free sulfur dioxide: float64
  total sulfur dioxide: float64
  density: float64
  pH: float64
  sulphates: float64
  alcohol: float64
  quality: int64
TARGET_COLUMN: "quality"
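A typical use for this schema is a data validation step that checks the incoming dataset before training. Here's a minimal sketch; the wine-quality CSV path is illustrative and this step isn't part of the code above:

# hypothetical validation step driven by schema.yaml
from pathlib import Path
import pandas as pd
from Predict_Pipe.utils.common import read_yaml

schema = read_yaml(Path("schema.yaml"))
df = pd.read_csv("artifacts/data_ingestion/winequality-red.csv")  # illustrative path

missing = set(schema.COLUMNS.keys()) - set(df.columns)
if missing:
    raise ValueError(f"Dataset is missing expected columns: {missing}")

Because the schema lives in YAML rather than code, adding or renaming a column is a one-line change that the validation step picks up automatically.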
Example: params.yaml
ElasticNet:
  alpha: 0.2
  l1_ratio: 0.1
These files are loaded with the read_yaml() utility, making your pipeline flexible and easy to update.
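For instance, a model-training step can pull its hyperparameters straight out of params.yaml. The scikit-learn ElasticNet call below is a sketch of that idea, not code from this article:

# hypothetical training snippet driven by params.yaml
from pathlib import Path
from sklearn.linear_model import ElasticNet
from Predict_Pipe.utils.common import read_yaml

params = read_yaml(Path("params.yaml"))
model = ElasticNet(
    alpha=params.ElasticNet.alpha,        # 0.2 from params.yaml
    l1_ratio=params.ElasticNet.l1_ratio,  # 0.1 from params.yaml
)

Tuning the model then means editing params.yaml, not the training code.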
Final Thoughts
By combining structured logging with robust utility functions and YAML-based configuration, you set your ML pipeline up for success:
Easier debugging: Every step is logged with context.
Reproducibility: Configs are version-controlled and human-readable.
Scalability: Utilities can be reused across projects.
Ready to try it?
Clone the structure above, adapt the utilities to your needs, and watch your ML projects become more maintainable and production-ready!
Let me know in the comments:
What logging or configuration tricks do you use in your ML pipelines?
Happy coding! 🚀
Tags: #Python #MachineLearning #Logging #YAML #MLOps #BestPractices
Next, we'll walk through running each module of the ML pipeline.