06 Data Validation


Hey everyone! 👋
Ever wondered how data scientists make sure their data is actually good before training a model? Let’s break down a simple but cool data validation pipeline using Python, YAML config, and some logging magic. Ready? Let’s go!

1. Configuration: config/config.yaml

First, we tell our pipeline where to find stuff and where to save results. This is our config.yaml (we'll pair it with a schema file right after):

data_validation:
  root_dir: artifacts/data_validation
  unzip_data_dir: artifacts/data_ingestion/winequality-red.csv
  STATUS_FILE: artifacts/data_validation/status.txt

  • root_dir: Where validation results go.

  • unzip_data_dir: Where the ingested CSV file lives.

  • STATUS_FILE: Where we write whether validation passed or failed.
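Paths alone aren't enough, though: the validation step also needs to know which columns and data types to expect. That comes from a separate schema file, which the ConfigurationManager loads and hands to the component as all_schema. Here's a rough sketch of what such a schema.yaml could look like; the column names and dtypes below are assumed from the standard UCI winequality-red.csv, so double-check them against your own data:

COLUMNS:
  fixed acidity: float64
  volatile acidity: float64
  citric acid: float64
  residual sugar: float64
  chlorides: float64
  free sulfur dioxide: float64
  total sulfur dioxide: float64
  density: float64
  pH: float64
  sulphates: float64
  alcohol: float64
  quality: int64

The COLUMNS key is the one the validation code looks up later with all_schema.get('COLUMNS', {}).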

2. The Config Entity: src/Predict_Pipe/entity/config_entity.py

We use a Python dataclass to keep things neat:

from dataclasses import dataclass
from pathlib import Path

@dataclass
class DataValidationConfig:
    root_dir: Path
    STATUS_FILE: str
    unzip_data_dir: Path
    all_schema: dict

  • This class holds all the config info we need for validation (quick example below).
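Just to make that concrete, here's how you could build one by hand. The values are lifted straight from config.yaml above, the tiny schema dict is only a placeholder, and the import path assumes the src/Predict_Pipe package layout:

from pathlib import Path

from Predict_Pipe.entity.config_entity import DataValidationConfig  # assumed import path

config = DataValidationConfig(
    root_dir=Path("artifacts/data_validation"),
    STATUS_FILE="artifacts/data_validation/status.txt",
    unzip_data_dir=Path("artifacts/data_ingestion/winequality-red.csv"),
    all_schema={"COLUMNS": {"pH": "float64", "quality": "int64"}},
)

print(config.all_schema["COLUMNS"])  # {'pH': 'float64', 'quality': 'int64'}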

3. Configuration Manager: src/Predict_Pipe/config/configuration.py

This method of the ConfigurationManager grabs the data_validation block from the YAML and builds the validation config:

def get_data_validation_config(self) -> DataValidationConfig:
    config = self.config.data_validation
    create_directories([config.root_dir])  # Make sure the folder exists!

    data_validation_config = DataValidationConfig(
        root_dir=config.root_dir,
        STATUS_FILE=config.STATUS_FILE,
        unzip_data_dir=config.unzip_data_dir,
        all_schema=self.schema
    )

    return data_validation_config

  • create_directories: Makes sure the output folder exists (no "folder not found" errors!).

  • all_schema: Holds the columns and data types we expect, loaded from the schema file (see the constructor sketch below).
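The constructor itself isn't shown above, so here's a minimal sketch of how it could look. The read_yaml helper, its module path, and the file-path constants are assumptions (only create_directories appears in the original code), so adapt them to your own utils module; the key point is that it loads both YAML files and keeps them around as self.config and self.schema:

from pathlib import Path

# Assumed helper/module names -- only create_directories is used in the snippet above.
from Predict_Pipe.utils.common import read_yaml, create_directories

CONFIG_FILE_PATH = Path("config/config.yaml")   # assumed default locations
SCHEMA_FILE_PATH = Path("schema.yaml")

class ConfigurationManager:
    def __init__(self,
                 config_filepath: Path = CONFIG_FILE_PATH,
                 schema_filepath: Path = SCHEMA_FILE_PATH):
        # read_yaml is assumed to return an attribute-accessible mapping
        # (e.g. a ConfigBox), since the method above does self.config.data_validation.
        self.config = read_yaml(config_filepath)
        self.schema = read_yaml(schema_filepath)

    # ... get_data_validation_config (shown above) lives here too.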

4. Data Validation Logic: src/Predict_Pipe/components/data_validation.py

Here’s the real magic! We check if all columns are there and if their types match what we expect.

import pandas as pd

# Project imports (the exact logger import path is an assumption; adjust to your package).
from Predict_Pipe import logger
from Predict_Pipe.entity.config_entity import DataValidationConfig


class DataValidation:
    def __init__(self, config: DataValidationConfig):
        self.config = config

    def validate_all_columns(self) -> bool:
        try:
            validation_status = True
            data = pd.read_csv(self.config.unzip_data_dir)
            all_cols = list(data.columns)
            all_schema = self.config.all_schema.get('COLUMNS', {})

            invalid_cols = []
            # Every column in the CSV must exist in the schema with the right dtype.
            for col in all_cols:
                if col not in all_schema:
                    validation_status = False
                    invalid_cols.append(col)
                else:
                    expected_type = all_schema[col]
                    actual_type = data[col].dtype
                    if expected_type != actual_type:
                        validation_status = False
                        invalid_cols.append(f"{col} (expected: {expected_type}, found: {actual_type})")

            # Every column in the schema must also be present in the CSV.
            for col in all_schema:
                if col not in all_cols:
                    validation_status = False
                    invalid_cols.append(f"{col} (missing from data)")

            with open(self.config.STATUS_FILE, 'w') as f:
                f.write(f"Validation status: {validation_status}")

            if not validation_status:
                logger.error(f"Validation failed. Columns not in schema, missing, or type mismatch: {invalid_cols}")

            return validation_status

        except Exception as e:
            logger.exception(e)
            raise

  • Reads the CSV and checks columns/types against the schema.

  • Writes the pass/fail status to a file.

  • Logs an error if something's wrong (see the quick demo below).
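If you want to poke at this class without running the whole pipeline, here's a small throwaway demo. It writes a three-row CSV whose quality column is float instead of the expected int64, points a DataValidationConfig at it, and runs the check. The paths, sample values, and import paths are all just for illustration:

from pathlib import Path
import pandas as pd

# Assumed import paths for the classes defined above.
from Predict_Pipe.entity.config_entity import DataValidationConfig
from Predict_Pipe.components.data_validation import DataValidation

root = Path("demo_validation")
root.mkdir(exist_ok=True)
csv_path = root / "sample.csv"

# 'quality' is written as floats on purpose, so the int64 check should fail.
pd.DataFrame({"pH": [3.2, 3.4, 3.1], "quality": [5.0, 6.0, 5.0]}).to_csv(csv_path, index=False)

config = DataValidationConfig(
    root_dir=root,
    STATUS_FILE=str(root / "status.txt"),
    unzip_data_dir=csv_path,
    all_schema={"COLUMNS": {"pH": "float64", "quality": "int64"}},
)

print(DataValidation(config).validate_all_columns())  # False
print((root / "status.txt").read_text())              # Validation status: False

Swap the quality values to plain integers and the same script prints True instead.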

5. Running the Pipeline: src/Predict_Pipe/pipeline/data_validation.py

This is the script that runs the whole thing:

# Imports follow the src/Predict_Pipe layout above; the logger import path is an assumption.
from Predict_Pipe import logger
from Predict_Pipe.config.configuration import ConfigurationManager
from Predict_Pipe.components.data_validation import DataValidation


class DataValidationTrainingPipeline:
    def __init__(self):
        pass

    def initiate_data_validation(self):
        config = ConfigurationManager()
        data_validation_config = config.get_data_validation_config()
        data_validation = DataValidation(config=data_validation_config)
        data_validation.validate_all_columns()


if __name__ == '__main__':
    try:
        logger.info("Data Validation stage started")
        obj = DataValidationTrainingPipeline()
        obj.initiate_data_validation()
        logger.info("Data Validation stage completed")
    except Exception as e:
        logger.exception(e)
        raise e

  • Logs when validation starts and ends.

  • Runs the validation using the config.
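Because of the if __name__ == '__main__': block, you can also run this stage on its own, for example with python src/Predict_Pipe/pipeline/data_validation.py (assuming the package is importable, e.g. installed or on your PYTHONPATH). That's handy when you only want to iterate on the validation step.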

6. Main Entry Point: main.py

This is how you’d kick off the pipeline from your main script:

# Imports follow the src/Predict_Pipe layout above; the logger import path is an assumption.
from Predict_Pipe import logger
from Predict_Pipe.pipeline.data_validation import DataValidationTrainingPipeline

STAGE_NAME = "Data Validation"
try:
    logger.info(f">>>>>> stage {STAGE_NAME} started <<<<<<")
    obj = DataValidationTrainingPipeline()
    obj.initiate_data_validation()
    logger.info(f">>>>>> stage {STAGE_NAME} completed <<<<<<\n\nx==========x")
except Exception as e:
    logger.exception(e)
    raise e

7. Artifacts and Logs

  • Artifacts: Outputs (like status files) are saved in the artifacts/ directory.

  • Logs: All actions, errors, and statuses are logged so you know what happened (super helpful for debugging!).
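Concretely, after a clean run you should find artifacts/data_validation/status.txt containing the single line "Validation status: True", plus the started/completed log lines from main.py in your log file. If validation fails, the status flips to False and the error log lists exactly which columns were missing, unexpected, or had the wrong dtype.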

NOTES:

  1. Config tells the pipeline where stuff is.

  2. Python classes read the config and run validation.

  3. Validation checks columns and types in your CSV.

  4. Status is saved to a file and logs are updated.

  5. Easy to run—just execute your main script!


Next up, we'll look at data transformation.
