AInvoice

Project Type: Data Engineering + Dashboarding + NLP/OCR
Tech Stack: Python, OpenAI API, Tesseract OCR, SQLite, Power BI
Project Focus: Multicurrency invoice parsing, FX normalization, interactive dashboard


๐Ÿš€ Project Overview

AInvoice is a fully automated pipeline to extract structured data from scanned invoices or images (in multiple currencies), normalize the amounts via live FX conversion, and store the results in a SQLite database. The pipeline outputs a .csv file ready for visualization in Power BI, featuring a dashboard that tracks monthly expenses, high-value invoices, and spending per supplier.

This project combines OCR, NLP, FX APIs, and data modeling โ€” all orchestrated in Python, and visualized in a polished dashboard.


๐Ÿงพ Folder Structure & Components

AInvoice/
โ”‚
โ”œโ”€โ”€ main.py               # Main execution script
โ”œโ”€โ”€ functions.py          # OCR, API, and processing functions
โ”œโ”€โ”€ prompt.py             # Prompt template for OpenAI extraction
โ”œโ”€โ”€ queries.sql           # SQL file with useful queries
โ”œโ”€โ”€ environment.yml       # Conda environment config
โ”œโ”€โ”€ .env                  # Template for API keys (OpenAI + FX)
โ”‚
โ”œโ”€โ”€ invoices/             # Monthly folders with invoice files (add your invoices manually here)
โ”‚   โ”œโ”€โ”€ 01_January/
โ”‚   โ”œโ”€โ”€ ...
โ”‚   โ””โ”€โ”€ 12_December/
โ”‚
โ”œโ”€โ”€ PowerBI/                     # Power BI visual templates
โ”‚   โ”œโ”€โ”€ Dashboard_AInvoice.pbix               # Power BI clean template
โ”‚   โ”œโ”€โ”€ Dashboard_AInvoice (reference).pbix   # Layout reference
โ”‚   โ””โ”€โ”€ IMPORTANT - READ.txt                  # Dashboard instructions
โ”‚
โ””โ”€โ”€ invoices_export.csv          # Optional CSV export from SQLite queries for Power BI

๐Ÿ“Œ Important: Make sure the .env file is configured with your OpenAI and FX API keys.


โš™๏ธ Environment Setup

1. Install prerequisites:

  • Tesseract OCR

  • sqlite3 (already included in most Python installs)

2. Create the Conda environment:

conda env create -f environment.yml

3. Activate the environment and install Jupyter kernel (for interactive development):

conda activate invoice-extractor
pip install ipykernel
python -m ipykernel install --user --name=invoice-extractor

๐Ÿง  How the Pipeline Works

  1. OCR Extraction
    Each file in /invoices/ is parsed using Tesseract and OpenAI to identify fields like supplier, description, date, import, currency.

  2. Currency Conversion
    A live FX rate API (like Fixer.io) is used to convert all amounts into euros.

  3. Data Storage
    Results are saved into a local SQLite database (invoices.db) with automatic schema creation.

  4. CSV Export
    A final invoices_export.csv is created for use in Power BI or Excel.

๐Ÿ“Œ Execution of main.py in VS Code (Interactive Window):

Execution of main.py in VS Code (Interactive Window)


๐Ÿงช SQLite Queries

Use the queries.sql file to interact with your database. To launch SQLite in Anaconda Powershell or CMD (from the project root where invoices.db is located):

sqlite3 invoices.db
.read queries.sql

Example queries include:

  • Selecting total spending

  • Filtering invoices by date or supplier

  • Exporting to .csv (optional)


๐Ÿ“Š Power BI Integration

1. Load invoices_export.csv

Go to Home > Get Data > Text/CSV and select the file. Configure encoding as 65001: Unicode (UTF-8).

2. Handle Import Formatting

To avoid issues with decimal/comma separators and scientific notation:

  1. After selecting invoices_export.csv, click Transform Data instead of Load.

  2. In Power Query Editor, change the import column type to Text first.

    • When prompted, click Replace current.
  3. Then open the type dropdown again and choose Using Locale....

    • Set the type to Fixed decimal number

    • Set Locale to English (United States)

This double conversion (Text โ†’ Fixed decimal with Locale) ensures proper parsing of both scientific notation and decimal separators across regional settings.

3. Close & Apply

Click Close & Apply to load the cleaned data into Power BI.

4. Format Column

In the Column Tools tab:

  • Data type: Fixed decimal number

  • Format: Currency

  • Decimal places: 2

๐Ÿ“Œ Formatted table in Power BI:

Formatted table in Power BI


๐Ÿ“ˆ Dashboard Output

The dashboard shows:

  • ๐Ÿ”ข Number of invoices

  • ๐Ÿ’ฐ Total spending

  • ๐Ÿ“Š Monthly trend of expenses

  • ๐Ÿข Breakdown by supplier

  • ๐Ÿงพ Highest value invoices

All visuals are updated dynamically based on the contents of invoices_export.csv.

๐Ÿ“Œ Dashboard Preview:

Dashboard Preview


โœ… Final Notes

  • Ensure monthly folders are not empty, or they'll be skipped automatically.

  • Unsupported or missing currency labels will be logged and excluded.

  • Add or remove currencies from currency_map in main.py.

  • The project supports adding more FX providers if needed.

  • You can modify the OpenAI model in functions.py (e.g. gpt-4, gpt-3.5-turbo).

  • Add or change fields by editing the prompt template in prompt.py.


๐Ÿ” Security Tip

Never upload your real .env file to GitHub. Always include only the .env.example template in version control.


๐ŸŒ Explore the Project


๐Ÿ’ฌ Feedback & Contact

If you found this useful or have questions, feel free to reach out or fork the repo on GitHub!


Built with ๐Ÿ’ป Python, ๐Ÿ“Š Power BI, and ๐Ÿค– OpenAI โ€” by Tayfur Akkaya

10
Subscribe to my newsletter

Read articles from Tayfur Akkaya Clavijo directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Tayfur Akkaya Clavijo
Tayfur Akkaya Clavijo

I'm an AI, data and finance enthusiast. Passionate about lifelong learning and cross-sector innovation, Iโ€™m always looking to collaborate on impactful projects at the intersection of data, strategy, and technology.