07 Data Transformation

Overview

The Data Transformation stage is responsible for splitting the input dataset into training and testing sets. This is a crucial step in preparing your data for machine learning, ensuring that your model can be trained and evaluated effectively.

Configuration

The pipeline uses a YAML configuration file to specify paths for storing artifacts and locating the dataset.

# config/config.yaml

data_transformation:
  root_dir: artifacts/data_transformation
  data_path: artifacts/data_ingestion/winequality-red.csv
  • root_dir: Directory where transformed data (train/test splits) will be saved.

  • data_path: Path to the input CSV data file.
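As a quick sanity check, you can load this file with PyYAML and inspect the data_transformation block. This standalone snippet is for illustration only; the pipeline itself reads the file through its own configuration manager, shown further below.

import yaml

with open("config/config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["data_transformation"]["root_dir"])   # artifacts/data_transformation
print(cfg["data_transformation"]["data_path"])  # artifacts/data_ingestion/winequality-red.csv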

Entity Definition

We use a data class to define the configuration entity for data transformation. This ensures type safety and easy access to configuration parameters.

# src/Predict_Pipe/entity/config_entity.py

from dataclasses import dataclass
from pathlib import Path

@dataclass
class DataTransformationConfig:
    root_dir: Path
    data_path: Path
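The loaded YAML values map directly onto this entity. As a minimal sketch (values copied from config.yaml above), constructing it by hand looks like this:

from pathlib import Path

config = DataTransformationConfig(
    root_dir=Path("artifacts/data_transformation"),
    data_path=Path("artifacts/data_ingestion/winequality-red.csv"),
)
print(config.data_path)  # artifacts/data_ingestion/winequality-red.csv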

Configuration Manager

The configuration manager reads the YAML file and creates the necessary directories for storing artifacts.

# src/Predict_Pipe/config/configuration.py

def get_data_transformation_config(self) -> DataTransformationConfig:
    config = self.config.data_transformation

    create_directories([config.root_dir])

    data_transformation_config = DataTransformationConfig(
        root_dir=config.root_dir,
        data_path=config.data_path
    )

    return data_transformation_config
  • Ensures the root_dir exists before proceeding (a sketch of a create_directories helper appears after this list).

  • Returns a DataTransformationConfig instance for use in the pipeline.
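create_directories is a small project utility whose implementation is not shown in this article; a minimal sketch of what such a helper could look like (the real one in the repo may differ) is:

import os

def create_directories(paths: list):
    # Create each directory, including parents; do nothing if it already exists
    for path in paths:
        os.makedirs(path, exist_ok=True)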

Data Transformation Component

This component handles the actual splitting of the data.

# src/Predict_Pipe/components/data_transformation.py

import os
import pandas as pd
from sklearn.model_selection import train_test_split
from src.Predict_Pipe.entity.config_entity import DataTransformationConfig
from src.Predict_Pipe.logging import logger

class DataTransformation:
    def __init__(self, config: DataTransformationConfig):
        self.config = config

    def train_test_spliting(self):
        # Load the ingested dataset produced by the data ingestion stage
        data = pd.read_csv(self.config.data_path)

        # Split rows into train/test; with no arguments, scikit-learn's defaults apply
        train, test = train_test_split(data)

        # Persist both splits inside root_dir for the later training stage
        train.to_csv(os.path.join(self.config.root_dir, "train.csv"), index=False)
        test.to_csv(os.path.join(self.config.root_dir, "test.csv"), index=False)

        logger.info("Split data into train and test sets")
        logger.info(train.shape)
        logger.info(test.shape)
        print(train.shape)
        print(test.shape)
  • Reads the dataset from the specified path.

  • Splits the data into training and testing sets (scikit-learn's defaults, since no arguments are passed; see the note after this list).

  • Saves the splits as CSV files in the root_dir.

  • Logs the shapes of the resulting datasets.
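Note that train_test_split is called without arguments, so scikit-learn's defaults apply: rows are shuffled randomly and 25% of them go to the test set. If you want a reproducible split of a specific size, a small variation (not part of the original component) would be:

from sklearn.model_selection import train_test_split
import pandas as pd

data = pd.read_csv("artifacts/data_ingestion/winequality-red.csv")

# 80/20 split with a fixed seed so the same rows land in train and test on every run
train, test = train_test_split(data, test_size=0.2, random_state=42)
print(train.shape, test.shape)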

Pipeline Orchestration

The pipeline ensures that data transformation only occurs if data validation has succeeded.

# src/Predict_Pipe/pipeline/data_transformation.py

from pathlib import Path
from src.Predict_Pipe.config.configuration import ConfigurationManager
from src.Predict_Pipe.components.data_transformation import DataTransformation
from src.Predict_Pipe.logging import logger

STAGE_NAME = "Data Transformation stage"

class DataTransformationTrainingPipeline:
    def __init__(self):
        pass

    def initiate_data_transformation(self):
        try:
            status_file = Path("artifacts/data_validation/status.txt")
            if not status_file.exists():
                logger.error(f"Status file not found at {status_file}")
                raise Exception("Data validation status file not found")

            with open(status_file, "r") as f:
                content = f.read()
                logger.info(f"Status file content: '{content}'")
                status = content.split(" ")[-1]
                logger.info(f"Extracted status: '{status}'")

            if status.strip() == "True":
                config = ConfigurationManager()
                data_transformation_config = config.get_data_transformation_config()
                data_transformation = DataTransformation(data_transformation_config)
                data_transformation.train_test_spliting()
            else:
                logger.error(f"Data validation status is not True. Got: '{status}'")
                raise Exception("Data validation failed")

        except Exception as e:
            logger.error(f"Error in data transformation: {str(e)}")
            raise e
  • Checks that the data validation status is True before proceeding (an example of the status file format appears after this list).

  • Logs errors and raises exceptions if validation fails.
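The check above relies on the data validation stage writing a status.txt whose last whitespace-separated token is True or False. The exact wording of that file is produced by the earlier stage, so the content below is only an assumed example of the format:

content = "Validation status: True"   # assumed example of the file's content
status = content.split(" ")[-1]
print(status.strip() == "True")       # True, so the transformation proceeds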

Main Execution

The main script runs the pipeline and logs the progress.

# main.py

from src.Predict_Pipe.logging import logger
from src.Predict_Pipe.pipeline.data_transformation import DataTransformationTrainingPipeline

STAGE_NAME = "Data Transformation stage"
try:
    logger.info(f">>>>>> stage {STAGE_NAME} started <<<<<<")
    obj = DataTransformationTrainingPipeline()
    obj.initiate_data_transformation()
    logger.info(f">>>>>> stage {STAGE_NAME} completed <<<<<<\n\nx==========x")
except Exception as e:
    logger.exception(e)
    raise e
  • Starts the data transformation stage.

  • Logs the start and completion of the stage.

  • Handles and logs any exceptions.

Artifacts and Logs

  • Artifacts: Transformed data (train.csv, test.csv) is saved in the artifacts/data_transformation directory.

  • Logs: All logs are stored in the logs directory for easy debugging and tracking.


In the next section, we train the model.
