Utility for Data Reconciliation: Comparing Table Extracts and Files

VipinVipin
2 min read

In data reconciliation projects, it's crucial to ensure the integrity and consistency of data between various sources. To address this need, I have developed a Python utility that compares extracts from tables and files. This utility simplifies the process of identifying discrepancies and ensures data accuracy.

Key Features:

  • Compatibility with CSV and Excel: The utility can handle both CSV and Excel file formats, making it versatile for different data sources.

  • Visual Differentiation: It highlights matching cells in green and differing cells in red, providing clear visual feedback.

  • Error Handling: The script includes error handling to manage scenarios where input files may be missing.

Python Utility Code:

import pandas as pd
from openpyxl import Workbook, load_workbook
from openpyxl.styles import PatternFill

def read_file(file):
    if file.endswith('.csv'):
        return pd.read_csv(file, header=None)
    elif file.endswith('.xlsx'):
        return pd.read_excel(file, header=None)
    else:
        raise ValueError("Unsupported file format. Please provide a .csv or .xlsx file.")

def compare_files(file1, file2, file3):
    try:
        # Load the data from the files
        df1 = read_file(file1)
        df2 = read_file(file2)
    except FileNotFoundError as e:
        print(f"Error: {e}")
        print(f"Please check if the files {file1} and {file2} exist.")
        return

    # Create a new workbook for the result
    wb = Workbook()
    ws = wb.active

    green_fill = PatternFill(start_color="C6EFCE", end_color="C6EFCE", fill_type="solid")
    red_fill = PatternFill(start_color="FFC7CE", end_color="FFC7CE", fill_type="solid")

    # Compare each cell
    for i in range(df1.shape[0]):
        for j in range(df1.shape[1]):
            cell_value1 = df1.iloc[i, j] if i < len(df1) and j < len(df1.columns) else None
            cell_value2 = df2.iloc[i, j] if i < len(df2) and j < len(df2.columns) else None

            if cell_value1 == cell_value2:
                ws.cell(row=i+1, column=j+1).value = f"{file1}: {cell_value1}\n{file2}: {cell_value2}"
                ws.cell(row=i+1, column=j+1).fill = green_fill
            else:
                ws.cell(row=i+1, column=j+1).value = f"{file1}: {cell_value1}\n{file2}: {cell_value2}"
                ws.cell(row=i+1, column=j+1).fill = red_fill

    # Save the result to File3
    wb.save(file3)

# Example usage
file1 = 'File1.csv'  # or 'File1.xlsx'
file2 = 'File2.xlsx'  # or 'File2.xlsx'
file3 = 'comparison_results.xlsx'  # result file (should be Excel to handle cell colors)

compare_files(file1, file2, file3)

Result File(comparison_results.xlsx):

Github:

https://github.com/vipinputhanveetil/data-reconcil-util

How to Use:

  1. Install Required Libraries:

    bash

     pip install pandas openpyxl
    
  2. Prepare Your Files:

    • Ensure your input files (File1 and File2) are in either .csv or .xlsx format and place them in the same directory as your script.
  3. Run the Script:

    • Update the file1, file2, and file3 variables with your actual file names.

    • Execute the script to generate the comparison result in File3.

This utility is designed to make data reconciliation more efficient and error-free, saving you valuable time and effort. If you have any questions or need further customization, feel free to reach out!

0
Subscribe to my newsletter

Read articles from Vipin directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Vipin
Vipin

Highly skilled Data Test Automation professional with over 10 years of experience in data quality assurance and software testing. Proven ability to design, execute, and automate testing across the entire SDLC (Software Development Life Cycle) utilizing Agile and Waterfall methodologies. Expertise in End-to-End DWBI project testing and experience working in GCP, AWS, and Azure cloud environments. Proficient in SQL and Python scripting for data test automation.