Automating Data Extraction from PDF Invoices Using Python

🧩 The Problem

In many organizations, particularly those managing recurring telecom or utility services, it's common to receive dozens, or even hundreds, of invoices in PDF format. These documents often contain critical information, yet they are not structured in a way that supports efficient analysis. Manually reviewing and consolidating this data is not only time-consuming but also highly prone to error.

Such was the case for a company that needed to maintain a detailed monthly log of mobile data consumption by user. However, the platform they relied on could only generate a single consolidated report in PDF format. As a result, the team was forced to manually extract the relevant data and enter it into Excel spreadsheets, a process that typically took two to four weeks to complete for each billing cycle.

🎯 The Objective

The goal was to design an automated solution capable of:

  • Reading and extracting text from multiple local PDF files.

  • Parsing and normalizing key information (phone number, data usage in GB, billing month).

  • Consolidating the results into a structured table suitable for analysis.

  • Optionally exporting the processed data into a JSON file.

πŸš€ Impact

This automation reduces processing time from over an hour of manual work per batch of files to just a few seconds. It also enables more reliable audits and the possibility of integrating real-time dashboards.

πŸ› οΈ The Solution

1. Retrieve PDF File Paths πŸ“

This function scans a specific directory and returns the full paths of all PDF files found within it.

It is used as the first step in the pipeline, assuming that all PDF files are consistently stored in a designated folder for batch processing.

Args:

directory (str): The path to the folder containing PDF files.

Returns:

list: A list of full file paths to all PDF documents in the specified directory.

def get_pdf_file_paths(directory):
    file_paths = []

    # Iterate through all files in the specified directory
    for file in os.listdir(directory):
        full_path = os.path.join(directory, file)

        # Check if the item is a file and ends with .pdf (case-insensitive)
        if os.path.isfile(full_path) and file.lower().endswith('.pdf'):
            file_paths.append(full_path)  # Add the valid PDF path to the list

    return file_paths
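
A quick usage sketch (the folder path below is just a placeholder; point it at wherever the invoices are stored):

pdf_paths = get_pdf_file_paths(r"C:\invoices\2023")  # placeholder directory
print(f"Found {len(pdf_paths)} PDF files")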

2. Extract Text from PDF Files πŸ“„

This function reads a list of local PDF files and extracts their full text content. It also attempts to infer the billing month from each filename, based on an assumed naming convention. It is worth clarifying that the platform follows a consistent file-naming standard, which makes it possible to extract the month directly from the PDF's filename.

Args:

pdf_path_list (list): A list of full file paths to PDF documents.

Returns:

dict: A dictionary where each key is the PDF filename, and the value is another dictionary containing the extracted text ("content") and the inferred month ("month").

def read_local_pdfs(pdf_path_list):
    content_dict = {}

    for pdf_path in pdf_path_list:
        try:
            # Get just the file name from the full path
            file_name = os.path.basename(pdf_path)
            text = ""

            # Open and read the PDF content using PyMuPDF (fitz)
            with fitz.open(pdf_path) as doc:
                for page in doc:
                    text += page.get_text()

            # Split the file name using "_" or "-" to extract metadata like month
            name_parts = re.split(r'[_\-]', file_name)

            # Try to extract the month (assumed to be the third element)
            if len(name_parts) >= 3:
                month = name_parts[2]
                file_type = name_parts[-1].replace('.pdf', '')  # Optional: identify file type
            else:
                month = ""

            # Store the result in a dictionary with filename as key
            content_dict[file_name] = {
                "content": text.strip(),
                "month": month,
            }

        except Exception as e:
            # Catch and print any errors that occur while reading the file
            print(f"Error reading {pdf_path}: {e}")

    return content_dict
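
For illustration, assuming a hypothetical filename such as invoice_data_january_usage.pdf (where the month is the third underscore-separated token, as the function expects), the returned structure would look roughly like this:

# Hypothetical example based on a filename like "invoice_data_january_usage.pdf"
parsed_pdfs = read_local_pdfs(pdf_paths)
# parsed_pdfs == {
#     "invoice_data_january_usage.pdf": {
#         "content": "<full extracted text>",
#         "month": "january"
#     },
#     ...
# }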

3. Extract Usage Data (No "GB" label present) πŸ“„

This function extracts mobile data usage by phone number from unstructured PDF text content. It is designed for invoices where the usage value appears near the phrase 'Total usage' without an explicit 'GB' unit.

Args:

content_dict (dict): A dictionary where keys are PDF filenames and values are dictionaries containing 'content' (text) and 'month'.

Returns:

list: A list of dictionaries, each containing 'Month', 'Phone Number', and 'Usage' in float format.

def extract_usage_by_number(content_dict):

    table = []

    # Regex pattern to capture phone number and total usage
    # Format: 000.000.0000 followed (within ~100 characters) by 'Total usage' and a number
    pattern = re.compile(
        r'(\d{3}\.\d{3}\.\d{4})[\s\S]{0,100}?Total usage\s+([0-9.,]+)',
        re.IGNORECASE
    )

    # Loop through each PDF entry in the dictionary
    for file_name, data in content_dict.items():
        text = data["content"]
        month = data.get("month", "")

        # Apply regex to find matches in the PDF text
        matches = pattern.findall(text)

        for phone_number, usage in matches:
            # Remove commas (in case of thousands separators)
            usage = usage.replace(',', '')

            try:
                # Convert usage to float and store result
                usage_float = float(usage)
                table.append({
                    "Month": month,
                    "Phone Number": phone_number,
                    "Usage": usage_float
                })
            except ValueError:
                # If conversion fails (e.g., due to a malformed number), skip the entry
                continue

    return table
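
A quick sanity check against a fabricated snippet of invoice text (purely illustrative, not taken from a real invoice):

# Purely illustrative text that mimics the "Total usage" layout
sample = {
    "sample.pdf": {
        "content": "Line 555.123.4567\nPlan details\nTotal usage 12.5",
        "month": "january"
    }
}
print(extract_usage_by_number(sample))
# [{'Month': 'january', 'Phone Number': '555.123.4567', 'Usage': 12.5}]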

4. Extract Usage Data (With "GB" explicitly mentioned) πŸ“„

Extracts mobile data usage for formats where the value includes the 'GB' unit, e.g., "xxx.xxx.xxxx ... xxx.xxGB".

Args:

content_dict (dict): Dictionary with PDF contents and month metadata.

Returns:

list: A structured list of records with phone numbers and usage in GB.

def extract_usage_with_gb(content_dict):

    usage_table = []

    pattern = re.compile(
        r'(\d{3}\.\d{3}\.\d{4})[\s\S]{0,100}?([0-9.,]+)GB',
        re.IGNORECASE
    )

    for file_name, data in content_dict.items():
        text = data["content"]
        month = data.get("month", "")

        matches = pattern.findall(text)

        for phone, usage in matches:
            usage = usage.replace(',', '')
            try:
                usage_float = float(usage)
                usage_table.append({
                    "Month": month,
                    "Phone Number": phone,  # same key as extract_usage_by_number so the two result sets align
                    "Usage": usage_float
                })
            except ValueError:
                continue

    return usage_table
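
And the equivalent check for the 'GB' format (again, fabricated text for illustration only):

# Purely illustrative text that mimics the "...GB" layout
sample_gb = {
    "sample_gb.pdf": {
        "content": "User 555.987.6543 consumed 3.75GB this period",
        "month": "february"
    }
}
print(extract_usage_with_gb(sample_gb))
# [{'Month': 'february', 'Phone Number': '555.987.6543', 'Usage': 3.75}]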

5. Optional: Export to JSON πŸ’Ύ

Saves the given data (a dictionary or a list of records) to a JSON file with UTF-8 encoding.

Args:

data (dict or list): The data to save.

output_path (str): File path where the JSON file will be written.

Returns:

None

def save_to_json(data, output_path):
    try:
        with open(output_path, 'w', encoding='utf-8') as json_file:
            json.dump(data, json_file, ensure_ascii=False, indent=4)
        print(f"JSON saved to: {output_path}")
    except Exception as e:
        print(f"Error saving JSON: {e}")

6. Main πŸ’»

Full pipeline to process invoice PDFs, extract usage data from both formats (with and without 'GB'), and export the results to JSON.

if __name__ == "__main__":
    # Set the path where all invoice PDFs are stored
    input_directory = r'Directory'

    # Step 1: Get all PDF file paths
    pdf_paths = get_pdf_file_paths(input_directory)

    # Step 2: Read and parse all PDFs
    parsed_pdfs = read_local_pdfs(pdf_paths)

    # Step 3: Extract usage from both formats
    usage_no_gb = extract_usage_by_number(parsed_pdfs)   # Pattern without "GB"
    usage_with_gb = extract_usage_with_gb(parsed_pdfs)   # Pattern with "GB"

    # Step 4: Combine results into a single DataFrame
    df_v1 = pd.DataFrame(usage_no_gb)
    df_v2 = pd.DataFrame(usage_with_gb)
    final_df = pd.concat([df_v1, df_v2], ignore_index=True)

    # Step 5: Save final results
    save_to_json(final_df.to_dict(orient="records"), "usage_data.json") 

    # Step 6: Display preview
    print("βœ… Combined usage data preview:")
    print(final_df.head())

7. Power BI Visualization – Monthly Mobile Data Usage πŸ“Š

This simple bar chart displays the average monthly mobile data usage per phone number. While it may seem like a basic visualization, it serves as a foundational metric for understanding overall consumption trends.
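
For reference, the measure behind the chart can be computed directly from the consolidated DataFrame produced by the pipeline above; a minimal sketch, assuming final_df is still in memory:

# Average data usage (GB) per phone number across all processed months
avg_usage = final_df.groupby("Phone Number")["Usage"].mean().reset_index()
print(avg_usage.sort_values("Usage", ascending=False).head())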

When integrated with a CRM system in the near future, this dataset could be enriched with user location data, allowing organizations to:

  • Identify cities or regions with the highest data usage.

  • Make informed decisions on network coverage expansion.

  • Detect usage anomalies that could indicate fraud or system inefficiencies.

  • Optimize marketing and service offerings based on user behavior and location.

In other words, this initial dashboard is not just a static report; it is the first step toward building a data-driven infrastructure that aligns operational strategy with user behavior in real time.

Git Project
