Utility for Data Reconciliation: Comparing Table Extracts and Files

VipinVipin
3 min read

In data reconciliation projects, it's crucial to ensure the integrity and consistency of data between various sources. To address this need, I have developed a Python utility that compares extracts from tables and files. This utility simplifies the process of identifying discrepancies and ensures data accuracy.

Key Features:

  • Compatibility with CSV and Excel: The utility can handle both CSV and Excel file formats, making it versatile for different data sources.

  • Visual Differentiation: It highlights matching cells in green and differing cells in red, providing clear visual feedback.

  • Error Handling: The script includes error handling to manage scenarios where input files may be missing.

Python Utility Code:

import openpyxl
from openpyxl import load_workbook
from openpyxl.styles import PatternFill
import csv
import os
import time
def load_file(file_path):
    try:
        if file_path.endswith('.csv'):
            with open(file_path, newline='') as csvfile:
                reader = list(csv.reader(csvfile))
            return reader
        elif file_path.endswith('.xlsx'):
            wb = load_workbook(file_path)
            sheet = wb.active
            return [[cell.value for cell in row] for row in sheet.iter_rows()]
        else:
            raise ValueError("Unsupported file format")
    except FileNotFoundError as fnf_error: 
        print(f"Error: {fnf_error}") 
    except Exception as e: 
        print(f"An unexpected error occurred: {e}")
def compare_and_highlight(input_src_file, input_tgt_file, xlsx_file3_path,rextract_ref):
    src_file = input_src_file.split('\\')
    tgt_file = input_tgt_file.split('\\')
    print(f"{rextract_ref} comparison started")
    log_file.write(f"\nComparing {rextract_ref}: Source Extract:{src_file[2]}, Target Extract:{tgt_file[2]}\n")
    #time.sleep(1)
    # Load the files
    flag=0
    data1 = load_file(input_src_file)
    data2 = load_file(input_tgt_file)
    # Create a new Excel workbook
    wb3 = openpyxl.Workbook()
    sheet3 = wb3.active
    # Define the fill colors
    green_fill = PatternFill(start_color="C6EFCE", end_color="C6EFCE", fill_type="solid")
    red_fill = PatternFill(start_color="FFC7CE", end_color="FFC7CE", fill_type="solid")
    # Get the max row and column count
    max_rows = max(len(data1), len(data2))
    max_cols = max(len(data1[0]), len(data2[0]))
    # Compare the cells and write to the new Excel file
    for row in range(max_rows):
        for col in range(max_cols):
            cell1 = data1[row][col] if row < len(data1) and col < len(data1[row]) else None
            cell2 = data2[row][col] if row < len(data2) and col < len(data2[row]) else None

            if cell1 == cell2:
                sheet3.cell(row=row+1, column=col+1).value = f"{input_src_file}: {cell1}\n {input_tgt_file}: {cell2}"
                sheet3.cell(row=row+1, column=col+1).fill = green_fill
            else:
                sheet3.cell(row=row+1, column=col+1).value = f"{input_src_file}: {cell1}\n {input_tgt_file}: {cell2}"
                sheet3.cell(row=row+1, column=col+1).fill = red_fill
                flag=1

    # Save the results to the Excel file
    wb3.save(xlsx_file3_path)
    if flag==0:
        #print("No difference observed.")
        log_file.write(f"Source and Target extract matching. SUCCESS.\n")
        log_file.write(f"--------")
    else:
        log_file.write(f"####################################Mismatch between source and target extract. FAIL####################################\n")
        log_file.write(f"--------")
    print(f"{rextract_ref} comparison completed")
with open("utility_execution_logs.txt", "w") as log_file:
    with open('config.txt', 'r') as file:
        # Read each line in the file
        for line in file:
            # Remove any leading/trailing whitespace, including newline characters
            line = line.strip()
            words = line.split(',')
            rextract_ref=words[0]
            input_src_file =words[1]
            input_tgt_file=words[2]
            output_result_file=words[3]
            compare_and_highlight(input_src_file, input_tgt_file, output_result_file,rextract_ref)

Result File(comparison_results.xlsx):

Github:

https://github.com/vipinputhanveetil/data-reconcil-util

How to Use:

  1. Install Required Libraries:

    bash

     pip install pandas openpyxl
    
  2. Prepare Your Files:

    • Ensure your input files (File1 and File2) are in either .csv or .xlsx format and place them in the same directory as your script.
  3. Run the Script:

    • Update the file1, file2, and file3 variables with your actual file names.

    • Execute the script to generate the comparison result in File3.

This utility is designed to make data reconciliation more efficient and error-free, saving you valuable time and effort. If you have any questions or need further customization, feel free to reach out!

0
Subscribe to my newsletter

Read articles from Vipin directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Vipin
Vipin

Highly skilled Data Test Automation professional with over 10 years of experience in data quality assurance and software testing. Proven ability to design, execute, and automate testing across the entire SDLC (Software Development Life Cycle) utilizing Agile and Waterfall methodologies. Expertise in End-to-End DWBI project testing and experience working in GCP, AWS, and Azure cloud environments. Proficient in SQL and Python scripting for data test automation.