07 Data Transformation

Overview
The Data Transformation stage is responsible for splitting the input dataset into training and testing sets. This is a crucial step in preparing your data for machine learning, ensuring that your model can be trained and evaluated effectively.
Configuration
The pipeline uses a YAML configuration file to specify paths for storing artifacts and locating the dataset.
# config/config.yaml
data_transformation:
  root_dir: artifacts/data_transformation
  data_path: artifacts/data_ingestion/winequality-red.csv
root_dir: Directory where the transformed data (train/test splits) will be saved.
data_path: Path to the input CSV data file.
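The ConfigurationManager shown later in this section is what actually reads this file. As a rough sketch of what that lookup boils down to, the snippet below loads the same two keys with plain PyYAML (using yaml.safe_load here is an illustration, not necessarily how the project parses the file):

# sketch: reading the data_transformation block from config/config.yaml
import yaml

with open("config/config.yaml") as f:
    config = yaml.safe_load(f)

dt_cfg = config["data_transformation"]
print(dt_cfg["root_dir"])   # artifacts/data_transformation
print(dt_cfg["data_path"])  # artifacts/data_ingestion/winequality-red.csv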
Entity Definition
We use a data class to define the configuration entity for data transformation. This ensures type safety and easy access to configuration parameters.
# src/Predict_Pipe/entity/config_entity.py
from dataclasses import dataclass
from pathlib import Path
@dataclass
class DataTransformationConfig:
    root_dir: Path
    data_path: Path
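For a quick illustration, the entity can also be constructed by hand with the same values the YAML provides; in the pipeline itself this is handled by the ConfigurationManager described next:

# sketch: constructing the entity directly (normally done by ConfigurationManager)
from pathlib import Path

cfg = DataTransformationConfig(
    root_dir=Path("artifacts/data_transformation"),
    data_path=Path("artifacts/data_ingestion/winequality-red.csv"),
)
print(cfg.root_dir, cfg.data_path)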
Configuration Manager
The configuration manager reads the YAML file and creates the necessary directories for storing artifacts.
# src/Predict_Pipe/config/configuration.py
def get_data_transformation_config(self) -> DataTransformationConfig:
    config = self.config.data_transformation

    create_directories([config.root_dir])

    data_transformation_config = DataTransformationConfig(
        root_dir=config.root_dir,
        data_path=config.data_path
    )
    return data_transformation_config
Ensures the root_dir exists before proceeding.
Returns a DataTransformationConfig instance for use in the pipeline.
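Assuming ConfigurationManager() reads config.yaml from its default location (as in the earlier pipeline stages), using it looks roughly like this:

# sketch: fetching the data transformation config (assumes the default config.yaml path)
from src.Predict_Pipe.config.configuration import ConfigurationManager

config_manager = ConfigurationManager()
dt_config = config_manager.get_data_transformation_config()
print(dt_config.root_dir)   # artifacts/data_transformation
print(dt_config.data_path)  # artifacts/data_ingestion/winequality-red.csv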
Data Transformation Component
This component handles the actual splitting of the data.
# src/Predict_Pipe/components/data_transformation.py
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from src.Predict_Pipe.entity.config_entity import DataTransformationConfig
from src.Predict_Pipe.logging import logger
class DataTransformation:
    def __init__(self, config: DataTransformationConfig):
        self.config = config

    def train_test_spliting(self):
        # Load the validated dataset
        data = pd.read_csv(self.config.data_path)

        # Split into training and test sets (scikit-learn's default 75/25 split)
        train, test = train_test_split(data)

        # Persist both splits to the artifacts directory
        train.to_csv(os.path.join(self.config.root_dir, "train.csv"), index=False)
        test.to_csv(os.path.join(self.config.root_dir, "test.csv"), index=False)

        logger.info("Split data into training and test sets")
        logger.info(train.shape)
        logger.info(test.shape)

        print(train.shape)
        print(test.shape)
Reads the dataset from the specified path.
Splits the data into training and testing sets.
Saves the splits as CSV files in the root_dir.
Logs the shapes of the resulting datasets.
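Note that train_test_split(data) with no extra arguments uses scikit-learn's defaults: roughly a 75/25 split with a new random shuffle on every run. If you want a fixed, reproducible split, a variant like the one below pins test_size and random_state (a sketch only; these values are not part of the project's code):

# sketch: a reproducible 80/20 split (test_size and random_state are assumptions)
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("artifacts/data_ingestion/winequality-red.csv")
train, test = train_test_split(data, test_size=0.2, random_state=42)
print(train.shape, test.shape)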
Pipeline Orchestration
The pipeline ensures that data transformation only occurs if data validation has succeeded.
# src/Predict_Pipe/pipeline/data_transformation.py
from pathlib import Path
from src.Predict_Pipe.config.configuration import ConfigurationManager
from src.Predict_Pipe.components.data_transformation import DataTransformation
from src.Predict_Pipe.logging import logger
STAGE_NAME = "Data Transformation stage"
class DataTransformationTrainingPipeline:
    def __init__(self):
        pass

    def initiate_data_transformation(self):
        try:
            # The data validation stage writes its result to this status file
            status_file = Path("artifacts/data_validation/status.txt")
            if not status_file.exists():
                logger.error(f"Status file not found at {status_file}")
                raise Exception("Data validation status file not found")

            with open(status_file, "r") as f:
                content = f.read()
                logger.info(f"Status file content: '{content}'")
                status = content.split(" ")[-1]
                logger.info(f"Extracted status: '{status}'")

            if status.strip() == "True":
                config = ConfigurationManager()
                data_transformation_config = config.get_data_transformation_config()
                data_transformation = DataTransformation(data_transformation_config)
                data_transformation.train_test_spliting()
            else:
                logger.error(f"Data validation status is not True. Got: '{status}'")
                raise Exception("Data validation failed")
        except Exception as e:
            logger.error(f"Error in data transformation: {str(e)}")
            raise e
Checks whether the data validation status is True before proceeding.
Logs errors and raises exceptions if validation fails.
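The check hinges on content.split(" ")[-1] taking the last whitespace-separated token from status.txt. Assuming the validation stage writes something like "Validation status: True" (the exact wording is an assumption; only the trailing True/False matters), the extraction behaves as follows:

# sketch: how the status token is extracted (the example file content is assumed)
content = "Validation status: True"
status = content.split(" ")[-1]
print(status.strip() == "True")  # prints True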
Main Execution
The main script runs the pipeline and logs the progress.
# main.py
from src.Predict_Pipe.pipeline.data_transformation import DataTransformationTrainingPipeline
from src.Predict_Pipe.logging import logger

STAGE_NAME = "Data Transformation stage"
try:
    logger.info(f">>>>>> stage {STAGE_NAME} started <<<<<<")
    obj = DataTransformationTrainingPipeline()
    obj.initiate_data_transformation()
    logger.info(f">>>>>> stage {STAGE_NAME} completed <<<<<<\n\nx==========x")
except Exception as e:
    logger.exception(e)
    raise e
Starts the data transformation stage.
Logs the start and completion of the stage.
Handles and logs any exceptions.
Artifacts and Logs
Artifacts: The transformed data (train.csv, test.csv) is saved in the artifacts/data_transformation directory.
Logs: All logs are stored in the logs directory for easy debugging and tracking.
In the next section, we train the model.