06 Data Validation

Hey everyone! đź‘‹
Ever wondered how data scientists make sure their data is actually good before training a model? Let’s break down a simple but cool data validation pipeline using Python, YAML config, and some logging magic. Ready? Let’s go!
1. Configuration: config/config.yaml
First, we tell our pipeline where to find stuff and where to save results. This is our config.yaml:
data_validation:
  root_dir: artifacts/data_validation
  unzip_data_dir: artifacts/data_ingestion/winequality-red.csv
  STATUS_FILE: artifacts/data_validation/status.txt
- root_dir: where validation results go.
- unzip_data_dir: where the CSV file lives.
- STATUS_FILE: where we record whether validation passed or failed.
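If you ever want to sanity-check what the pipeline sees, here's a quick sketch (assuming PyYAML is available in the environment) that loads config/config.yaml and prints the data_validation block:

import yaml  # PyYAML; assumed to be installed in the project environment

# Load the config file and peek at the data_validation section.
with open("config/config.yaml") as f:
    config = yaml.safe_load(f)

print(config["data_validation"])
# {'root_dir': 'artifacts/data_validation', 'unzip_data_dir': 'artifacts/data_ingestion/winequality-red.csv', 'STATUS_FILE': 'artifacts/data_validation/status.txt'}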
2. The Config Entity: src/Predict_Pipe/entity/config_entity.py
We use a Python dataclass to keep things neat:
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DataValidationConfig:
    root_dir: Path
    STATUS_FILE: str
    unzip_data_dir: Path
    all_schema: dict
- This class holds all the config info we need for validation.
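To make that concrete, here's a tiny hypothetical instance. The values below are hand-filled for illustration only; in the real pipeline they come from the YAML files, and the import path is assumed from the src/Predict_Pipe layout shown above.

from pathlib import Path

from Predict_Pipe.entity.config_entity import DataValidationConfig  # assumed import path

# Hypothetical, hand-filled instance for illustration only.
example = DataValidationConfig(
    root_dir=Path("artifacts/data_validation"),
    STATUS_FILE="artifacts/data_validation/status.txt",
    unzip_data_dir=Path("artifacts/data_ingestion/winequality-red.csv"),
    all_schema={"COLUMNS": {"fixed acidity": "float64", "quality": "int64"}},
)
print(example.all_schema["COLUMNS"])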
3. Configuration Manager: src/Predict_Pipe/config/configuration.py
This method on the ConfigurationManager class grabs the data_validation block from the YAML and builds the validation config:
def get_data_validation_config(self) -> DataValidationConfig:
    config = self.config.data_validation

    create_directories([config.root_dir])  # make sure the folder exists!

    data_validation_config = DataValidationConfig(
        root_dir=config.root_dir,
        STATUS_FILE=config.STATUS_FILE,
        unzip_data_dir=config.unzip_data_dir,
        all_schema=self.schema,
    )

    return data_validation_config
- create_directories: makes sure the output folder exists (no "folder not found" errors!).
- all_schema: holds the columns and data types we expect (see the sketch below).
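self.schema is usually loaded from a separate schema file at start-up (commonly named schema.yaml in this kind of project layout; the exact file name here is an assumption). Whatever it's called, all the validation code needs is a COLUMNS mapping of column name to pandas dtype string, something like:

# schema.yaml (assumed name) - expected columns of winequality-red.csv and their pandas dtypes
COLUMNS:
  fixed acidity: float64
  volatile acidity: float64
  citric acid: float64
  residual sugar: float64
  chlorides: float64
  free sulfur dioxide: float64
  total sulfur dioxide: float64
  density: float64
  pH: float64
  sulphates: float64
  alcohol: float64
  quality: int64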
4. Data Validation Logic: src/Predict_Pipe/components/data_validation.py
Here’s the real magic! We check if all columns are there and if their types match what we expect.
import pandas as pd

from Predict_Pipe import logger  # assumed import paths, following the src/Predict_Pipe layout shown above
from Predict_Pipe.entity.config_entity import DataValidationConfig


class DataValidation:
    def __init__(self, config: DataValidationConfig):
        self.config = config

    def validate_all_columns(self) -> bool:
        try:
            validation_status = True

            data = pd.read_csv(self.config.unzip_data_dir)
            all_cols = list(data.columns)
            all_schema = self.config.all_schema.get('COLUMNS', {})

            missing_cols = []  # collects unknown columns and dtype mismatches
            for col in all_cols:
                if col not in all_schema:
                    validation_status = False
                    missing_cols.append(col)
                else:
                    expected_type = all_schema[col]
                    actual_type = data[col].dtype
                    if expected_type != actual_type:
                        validation_status = False
                        missing_cols.append(f"{col} (expected: {expected_type}, found: {actual_type})")

            # Record the overall result so later stages (and humans) can check it.
            with open(self.config.STATUS_FILE, 'w') as f:
                f.write(f"Validation status: {validation_status}")

            if not validation_status:
                logger.error(f"Validation failed. Columns not in schema or type mismatch: {missing_cols}")

            return validation_status

        except Exception as e:
            raise e
- Reads the CSV and checks columns and types against the schema.
- Writes the overall status to a file.
- Logs an error if something's wrong.
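One subtle detail: the schema stores dtypes as strings (like float64), while data[col].dtype is a NumPy dtype object. The != comparison still works because NumPy dtypes compare equal to their string names; here's a quick sketch to convince yourself:

import pandas as pd

df = pd.DataFrame({"quality": [5, 6, 7], "alcohol": [9.4, 9.8, 10.0]})

print(df["quality"].dtype == "int64")    # True: a dtype compares equal to its string name
print(df["alcohol"].dtype != "float32")  # True: a mismatched name is caught as a difference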
5. Running the Pipeline: src/Predict_Pipe/pipeline/data_validation.py
This is the script that runs the whole thing:
from Predict_Pipe import logger  # assumed import paths, following the project layout shown above
from Predict_Pipe.config.configuration import ConfigurationManager
from Predict_Pipe.components.data_validation import DataValidation


class DataValidationTrainingPipeline:
    def __init__(self):
        pass

    def initiate_data_validation(self):
        config = ConfigurationManager()
        data_validation_config = config.get_data_validation_config()
        data_validation = DataValidation(config=data_validation_config)
        data_validation.validate_all_columns()


if __name__ == '__main__':
    try:
        logger.info("Data Validation stage started")
        obj = DataValidationTrainingPipeline()
        obj.initiate_data_validation()
        logger.info("Data Validation stage completed")
    except Exception as e:
        logger.exception(e)
        raise e
- Logs when validation starts and ends.
- Runs the validation using the config.
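One thing to notice: initiate_data_validation ignores the boolean that validate_all_columns returns. If you'd rather have the pipeline stop when validation fails, a small optional tweak (not part of the original code) is to bubble the result up and raise:

from Predict_Pipe.config.configuration import ConfigurationManager
from Predict_Pipe.components.data_validation import DataValidation


class DataValidationTrainingPipeline:
    def initiate_data_validation(self):
        config = ConfigurationManager()
        data_validation_config = config.get_data_validation_config()
        data_validation = DataValidation(config=data_validation_config)

        status = data_validation.validate_all_columns()
        if not status:
            # Fail fast instead of letting later stages train on bad data.
            raise ValueError("Data validation failed; check status.txt and the logs.")
        return status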
6. Main Entry Point: main.py
This is how you’d kick off the pipeline from your main script:
from Predict_Pipe import logger  # assumed import paths, following the project layout shown above
from Predict_Pipe.pipeline.data_validation import DataValidationTrainingPipeline

STAGE_NAME = "Data Validation stage"

try:
    logger.info(f">>>>>> stage {STAGE_NAME} started <<<<<<")
    obj = DataValidationTrainingPipeline()
    obj.initiate_data_validation()
    logger.info(f">>>>>> stage {STAGE_NAME} completed <<<<<<\n\nx==========x")
except Exception as e:
    logger.exception(e)
    raise e
7. Artifacts and Logs
- Artifacts: outputs (like the status file) are saved in the artifacts/ directory.
- Logs: all actions, errors, and statuses are logged so you know what happened (super helpful for debugging!).
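The logger used throughout these snippets is just Python's standard logging module shared across the package. The project's actual setup may differ, but a minimal sketch that writes to both a log file and the console looks like this:

import logging
import os

# Minimal sketch of a shared logger; the real project's format and file names may differ.
log_dir = "logs"
os.makedirs(log_dir, exist_ok=True)

logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s: %(levelname)s: %(module)s: %(message)s]",
    handlers=[
        logging.FileHandler(os.path.join(log_dir, "running_logs.log")),
        logging.StreamHandler(),
    ],
)

logger = logging.getLogger("Predict_Pipe")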
NOTES:
- Config tells the pipeline where stuff is.
- Python classes read the config and run validation.
- Validation checks columns and types in your CSV.
- Status is saved to a file and logs are updated.
- Easy to run: just execute your main script!
Next up, we'll look at data transformation.