I have written several blogs about using LLMs, especially in Fabric, to extract structured data from unstructured data. In my last blog on this topic, I explained how to extract structured data from PDF invoices. PDFs are a type of unstructured data source, but the data doesn't always have to be in text or PDF format. In this blog, I will demonstrate how we can use the same techniques to extract tabular data from an Excel file where the tables lack a common structure and are poorly formatted. Who hasn't dealt with such Excel data?

I presented this use case at FabCon Vegas.

The Badly Formatted Excel

I came across this post on the Power BI forum. The user wanted to import the Excel sheet below into Power BI and create a model for reporting. Sounds simple, right? If you know Power Query and some M, you can easily turn this into a table for all Excel files on a schedule—as long as all the files have the same structure, the same column names, and the same columns. If there is any difference, you'll either get an error or incorrect results. Plus, imagine if below purchase orders were in different languages from different suppliers!

This is exactly what we will try to fix using Fabric AI Functions.

Another user shared the M code to clean the data but also noted that this will work if the form doesn’t change.

Synthetic Data

I don't have the data from this user, so I created an Excel file with four variations of the purchase order data mentioned above. I randomly placed data in different cells, added extra columns and rows, changed the layout, and used different column names to create messy layout. Although the data is in a table format, the formatting is unstructured and almost impossible to process using Power Query or Python.

Sheet 1:

Sheet 2:

Compared to sheet 1, in Sheet 2, following changes were made:

Split the PO header details in two columns instead of in one
Excluded JOB ID from PO header and instead added it as a table header in a merged cell (typically merged cells are hard to deal with)
PO date is in a slightly different format (mm-dd-yyyy instead of mm dd yyyy)

Sheet 3

Randomly changed the header layout. Notice that the positions are different, and instead of having the column:value pairs in two columns next to each other, they are now one below the other.
Added an extract NUMBER column
Added a merged cell table header for PO details
Changed column names
PO date is in different format
Missing phone number

Sheet 4

Added some text at the top of the table irrelevant to the data
Changed the column order and column header
Changed date format

As you can see - all the PO forms are different and although they have similar data, the layouts are totally different. There is no way Power Query or any Python code with regex pattern matching can extract data in a structured form from this Excel.

You can download the Excel from here : snippets/complete_purchase_order.xlsx at main · pawarbi/snippets

Goals

As mentioned above, the main goal is to extract the data, but we also we want to :

evaluate the accuracy of extraction to ensure AI Function can be used reliably.
perform the extraction efficiently with minimal CU consumption
make the process reproducible to ensure we can trace the results and performance in the future
identify risks, errors and solution space

Fabric AI Function

Fabric AI functions use LLM (Azure Open AI deployment) for text summarization, extraction, classification etc with convenient Python functions. It can be used with any F SKU in a Fabric PySpark notebook. We will start with an easy/baseline solution first and progressively work towards the goals mentioned above to achieve the final production-ready solution.

To get started, install AI functions in a Fabric PySpark notebook with runtime 1.3

## Fabric PySpark notebook with runtimr 1.3
%pip install -q tiktoken deepdiff tabulate openai==1.30 > /dev/null 2>&1
%pip install -q --force-reinstall httpx==0.27.0 > /dev/null 2>&1
%pip install -q --force-reinstall https://mmlspark.blob.core.windows.net/pip/1.0.9/synapseml_core-1.0.9-py2.py3-none-any.whl > /dev/null 2>&1
%pip install -q --force-reinstall https://mmlspark.blob.core.windows.net/pip/1.0.10.0-spark3.4-5-a5d50c90-SNAPSHOT/synapseml_internal-1.0.10.0.dev1-py2.py3-none-any.whl > /dev/null 2>&1

#optional - I will explain later
!wget -O /synfs/nb_resource/builtin/json_diff_extraction_accuracy.py https://raw.githubusercontent.com/pawarbi/snippets/refs/heads/main/json_diff_extraction_accuracy.py

Since we want to extract the data, we can use the .extract method in AI functions to specify the data we want to extract. Note that AI functions work on text data so we need to convert the excel to some text format. In my previous blogs, I have shared how LLM love markdown format. So we will convert the dataframe to markdown before passing it to AI functions.

import os
import re
import json
import time
from datetime import datetime
import tiktoken 
import pandas as pd
from tqdm import tqdm
import openai
from notebookutils import fs
import numpy as np
from synapse.ml.aifunc import Conf
from deepdiff import DeepDiff
from pydantic import BaseModel, ValidationError, field_validator
import builtin.json_diff_extraction_accuracy as diff

path = "/lakehouse/default/Files/complete_purchase_order.xlsx"
df = pd.read_excel(path)

# convert data to markdown
md = pd.DataFrame({"data":[df.to_markdown()]})

# define fields to extract
df = md["data"].ai.extract(
    'job_id',
    'customer_id',
    'po_date',
    'name',
    'phone_number',
    'delivery_type',
    'delivery_date',
    'delivery_time',
    'delivery_address',
    'eir_code',
    'type',
    'material_code',
    'material_description',
    'quantity',
    'uom'
)
display(df)

And just like that without any prompts or efforts, AI function extracted the fields correctly from the first sheet.

We can loop over the other sheets and be done with it. But not so fast. Note that we only got one row back corresponding to the PO header and didn’t get all the rows in the details table. That’s because in the .extract() we can only extract text and not complex data types like dictionaries and lists. To achieve that we will need to write a prompt and use .generate_response() function. To do so, we will:

write a prompt which includes the schema of the expected response we want
one of our goals is to achieve the solution by minimizing the CU consumption. AI functions are billed by the spark meter but measured in input and output tokens. We will count the tokens to track this.
process the response to create two dataframes - one for order detail and another for materials table.

Prompt

In the below prompt, I give instructions and define the JSON schema the LLM should use to provide the response. Notice the data types defined as well as enum (options to choose from, any other value would be invalid). This is a fairly straightforward prompt.

prompt = """
  Extract purchase order details as a valid JSON and material list from the originally from an Excel. Return ONLY the JSON without any explanation or details based on below schema.
  Ignore extraneous details.
  "response_schema": {
    "purchase_order": {
      "job_id": "string",
      "customer_id": "string",
      "po_date": "string (date format: DD/MM/YYYY)",
      "name": "string",
      "phone_number": "string",
      "delivery_type": "string" (enum: Delivery, Pickup, Shipping)
      "delivery_date": "string (date format: DD/MM/YYYY)",
      "delivery_time": "string",
      "delivery_address": "string",
      "eir_code": "string"
    },
    "materials": [
      {
        "type": "string",
        "material_code": "string",
        "material_description": "string",
        "quantity": "integer",
        "uom": "string" (unit of measure, null if not found)
      }
    ]
  }
"""

To calculate the number of tokens, we can use the tiktoken library. Currently, AI function uses gpt-3.5 model. If it changes to any other model, be sure update the model accordingly below (each model uses a different tokenizer).

model = "gpt-3.5-turbo"
tokenizer= tiktoken.encoding_for_model(model)
def get_token_count(obj):
    if not isinstance(obj, str):
        obj = str(obj)
    return len(tokenizer.encode(obj))

Extract the data

path = "/lakehouse/default/Files/complete_purchase_order.xlsx"


sheet_names = ['PO1', 'PO2', 'PO3', 'PO4', 'PO5']
df_dict = {
    sheet: pd.read_excel(path, sheet_name=sheet).dropna(how='all').dropna(how='all', axis=1)
    for sheet in sheet_names
}

df_extract = pd.DataFrame({
    "json_data": [json.loads(df_dict[f'PO{i}'].to_json()) for i in range(1, 6)],
    "string_data": [df_dict[f'PO{i}'].to_string() for i in range(1, 6)],
    "markdwn_data": [df_dict[f'PO{i}'].to_markdown() for i in range(1, 6)]
})
col = "string_data"
df_extract['data'] = df_extract[[col]].ai.generate_response(prompt, conf=Conf(seed=0, max_concurrency=50))
df_extract['input_tokens'] = df_extract[col].apply(get_token_count)
df_extract['output_tokens'] = df_extract["data"].apply(get_token_count)

# df_extract
#### Extract orders and materials

order_dfs = []
material_dfs = []

for num in range(4):
    po = f"PO{num+1}"
    current_data = json.loads(df_extract['data'][num])
    order_df = pd.json_normalize(current_data)
    order_df['source_po'] = po
    order_dfs.append(order_df.drop(columns=['materials'], errors='ignore'))

    materials_df = pd.json_normalize(current_data.get('materials', []))
    materials_df['source_po'] = po
    material_dfs.append(materials_df)

order_df = pd.concat(order_dfs, ignore_index=True).sort_values('purchase_order.customer_id').reset_index(drop=True)
materials_df = pd.concat(material_dfs, ignore_index=True).sort_values(['source_po','type','material_code']).reset_index(drop=True)

To get the number of tokens, use the .ai.stats method:

In this case, we used 3761 input tokens (prompt + input data) and 3773 output tokens (the JSON returned).

Next we still need to clean up the above dataframes a bit more to make it consumption ready by cleaning the column names, fixing the date format and data types etc.

def clean_order_df(df):
    """
    Clean the order dataframe with robust error handling
    """
    cleaned_df = df.copy()

    # Remove prefix from column names
    new_columns = {col: col.replace('purchase_order.', '') for col in cleaned_df.columns if col.startswith('purchase_order.')}
    cleaned_df = cleaned_df.rename(columns=new_columns)

    # Convert customer_id to integer with error handling
    if 'customer_id' in cleaned_df.columns:
        def safe_int_convert(x):
            if pd.isna(x) or x == '':
                return None
            try:
                return int(x)
            except (ValueError, TypeError):
                return None

        cleaned_df['customer_id'] = cleaned_df['customer_id'].apply(safe_int_convert)

    # Convert date columns with error handling
    date_columns = ['po_date', 'delivery_date']
    for date_col in date_columns:
        if date_col in cleaned_df.columns:
            def safe_date_convert(x):
                if pd.isna(x) or x == '':
                    return None
                try:
                    return pd.to_datetime(x, dayfirst=True).strftime('%d-%m-%Y')
                except (ValueError, TypeError):
                    return None

            cleaned_df[date_col] = cleaned_df[date_col].apply(safe_date_convert)

    # Convert text columns with error handling
    text_columns = ['name', 'delivery_type', 'delivery_time', 'delivery_address']
    for text_col in text_columns:
        if text_col in cleaned_df.columns:
            def safe_text_convert(x):
                if pd.isna(x) or x == '':
                    return None
                try:
                    return str(x).title()
                except (ValueError, TypeError):
                    return None

            cleaned_df[text_col] = cleaned_df[text_col].apply(safe_text_convert)

    # Sort and reset index
    try:
        cleaned_df = cleaned_df.sort_values('customer_id', na_position='last').reset_index(drop=True)
    except Exception as e:
        print(f"Warning: Sorting failed with error: {e}")
        cleaned_df = cleaned_df.reset_index(drop=True)

    return cleaned_df

final_order_df = clean_order_df(order_df)
display(final_order_df)

For materials data:

def clean_materials_df(df):
    df['quantity'] = df['quantity'].astype('float')
    df['uom'] = df['uom'].str.lower()
    return df.sort_values(['source_po','type','material_code']).reset_index(drop=True)

final_materials_df = clean_materials_df(materials_df)
display(final_materials_df)

Validation

We got the result we wanted but is it accurate? Let’s find out. To find the accuracy, we need to compare the final results with the actual data. There are many accuracy metrics but we will keep it simple. I will convert the dataframe to a JSON and then compare each field with the JSON of the original excel sheet I manually created for validation purposes.

Create a JSON of the above dfs:

purchase_order = {
    "job_id": final_order_df["job_id"].iloc[0] if "job_id" in final_order_df.columns else "",
    "customer_id": final_order_df["customer_id"].iloc[0] if "customer_id" in final_order_df.columns else "",
    "po_date": final_order_df["po_date"].iloc[0] if "po_date" in final_order_df.columns else "",
    "name": final_order_df["name"].iloc[0] if "name" in final_order_df.columns else "",
    "phone_number": final_order_df["phone_number"].iloc[0] if "phone_number" in final_order_df.columns else "",
    "delivery_type": final_order_df["delivery_type"].iloc[0] if "delivery_type" in final_order_df.columns else "",
    "delivery_date": final_order_df["delivery_date"].iloc[0] if "delivery_date" in final_order_df.columns else "",
    "delivery_time": final_order_df["delivery_time"].iloc[0] if "delivery_time" in final_order_df.columns else "",
    "delivery_address": final_order_df["delivery_address"].iloc[0] if "delivery_address" in final_order_df.columns else "",
    "eir_code": final_order_df["eir_code"].iloc[0] if "eir_code" in final_order_df.columns else ""
}


materials = final_materials_df.to_dict(orient='records')

order_json = {"purchase_order": purchase_order}
materials_json = {"materials": materials}

Fix the data types so they conform to JSON and load the ground truth as JSON.


def convert_numpy_types(obj):
    if isinstance(obj, dict):
        return {k: convert_numpy_types(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [convert_numpy_types(i) for i in obj]
    elif isinstance(obj, np.integer):
        return int(obj)
    elif isinstance(obj, np.floating):
        return float(obj)
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    else:
        return obj


order_json = convert_numpy_types(order_json)
materials_json = convert_numpy_types(materials_json)

with open("/lakehouse/default/Files/ground_truth/order_ground_truth.json", 'r') as f:
    order_groundtruth = json.load(f)

#load materials
with open("/lakehouse/default/Files/ground_truth/materials_ground_truth.json", 'r') as f:
    material_groundtruth  = json.load(f)

To calculate accuracy by comparing each field I wrote Python function using deepdiff which you can get it from my repo.

result = diff.calculate_json_accuracy( material_groundtruth, materials_json)
accuracy_score = result['score']
total_fields = result['total_fields']
diff_stats = result['json_diff_stats']

print(f"Accuracy score: {result['score']}")
print(f"Total fields: {result['total_fields']}")
print(f"Diff stats: {result['json_diff_stats']}")

💡

You can save this diff function as a Fabric UDF to use it in various projects

In the materials dataframe, we compared total 324 fields and all of them are valid and match the actual data. Keep in mind that this is despite missing values, different column names, column orders etc. LLM was able to understand the context and overall structure of the text to extract the fields correctly.

Other Goals

We achieved one of the goals - to extract the data accurately. But what about minimizing the cost, i.e. CU consumption?

I omitted one step above intentionally to keep things simple. In the above code, I first converted the data as string before sending it to LLM via AI function. I could have converted the data to other formats as well like markdown I did in the first baseline or as JSON as well. Let’s see what happens if I do that.

#using string
display(df_extract)

Use markdown instead:

As you can see above, the number of input tokens are about ~10-20% more ! So using string instead of markdown is better in this case example without sacrificing accuracy.

AIOps

In data science, you usually run many experiments to find the right hyperparameters, configurations, and features, including feature engineering, before deciding on the best setup to meet your goals. LLM responses are stochastic, so they should be handled like any other data science project, applying the same principles for reproducibility and accuracy. Fabric has built-in MLflow integration, allowing us to run all the experiments and finalize our solution for production. This blog is already lengthy, so I won't cover all the details here, but at a high level : I set up MLflow experiment, instrument it with metrics I want to capture (inputs, tokens, reponses, accuracy etc.), vary the formatting used, log the runs and compare the results. In a real world project, you will run many experiments with different prompts, temperatures, seeds, model configurations etc but I am skipping it here.

Putting it all together:

import os
import re
import json
import time
import mlflow
import pandas as pd
import numpy as np
import tiktoken
import openai
from tqdm import tqdm
from datetime import datetime
from synapse.ml.aifunc import Conf
from deepdiff import DeepDiff
import builtin.json_diff_extraction_accuracy as diff
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

experiment_name = "purchase_order_extraction_format_test"
mlflow.set_experiment(experiment_name)

prompt = """
  Extract purchase order details as a valid JSON and material list from the originally from an Excel. Return ONLY the JSON without any explanation or details based on below schema.
  Ignore extraneous details.
  "response_schema": {
    "purchase_order": {
      "job_id": "string",
      "customer_id": "string",
      "po_date": "string (date format: DD/MM/YYYY)",
      "name": "string",
      "phone_number": "string",
      "delivery_type": "string" (enum: Delivery, Pickup, Shipping)
      "delivery_date": "string (date format: DD/MM/YYYY)",
      "delivery_time": "string",
      "delivery_address": "string",
      "eir_code": "string"
    },
    "materials": [
      {
        "type": "string",
        "material_code": "string",
        "material_description": "string",
        "quantity": "integer",
        "uom": "string" (unit of measure, null if not found)
      }
    ]
  }
"""


model = "gpt-3.5-turbo"
tokenizer = tiktoken.encoding_for_model(model)

def get_token_count(obj):
    if not isinstance(obj, str):
        obj = str(obj)
    return len(tokenizer.encode(obj))

## attach a lakehouse and upload the ground truth to teh below folder
def load_groundtruth():
    with open("/lakehouse/default/Files/ground_truth/order_ground_truth.json", 'r') as f:
        order_groundtruth = json.load(f)

    with open("/lakehouse/default/Files/ground_truth/materials_ground_truth.json", 'r') as f:
        material_groundtruth = json.load(f)

    return order_groundtruth, material_groundtruth

## this is for converting to int for deepdiff, otherwise not needed
def convert_numpy_types(obj):
    if isinstance(obj, dict):
        return {k: convert_numpy_types(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [convert_numpy_types(i) for i in obj]
    elif isinstance(obj, np.integer):
        return int(obj)
    elif isinstance(obj, np.floating):
        return float(obj)
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    else:
        return obj


def clean_order_df(df):
    cleaned_df = df.copy()
    new_columns = {col: col.replace('purchase_order.', '') for col in cleaned_df.columns if col.startswith('purchase_order.')}
    cleaned_df = cleaned_df.rename(columns=new_columns)

    if 'customer_id' in cleaned_df.columns:
        def safe_int_convert(x):
            if pd.isna(x) or x == '':
                return None
            try:
                return int(x)
            except (ValueError, TypeError):
                return None

        cleaned_df['customer_id'] = cleaned_df['customer_id'].apply(safe_int_convert)


    date_columns = ['po_date', 'delivery_date']
    for date_col in date_columns:
        if date_col in cleaned_df.columns:
            def safe_date_convert(x):
                if pd.isna(x) or x == '':
                    return None
                try:
                    return pd.to_datetime(x, dayfirst=True).strftime('%d-%m-%Y')
                except (ValueError, TypeError):
                    return None

            cleaned_df[date_col] = cleaned_df[date_col].apply(safe_date_convert)


    text_columns = ['name', 'delivery_type', 'delivery_time', 'delivery_address']
    for text_col in text_columns:
        if text_col in cleaned_df.columns:
            def safe_text_convert(x):
                if pd.isna(x) or x == '':
                    return None
                try:
                    return str(x).title()
                except (ValueError, TypeError):
                    return None

            cleaned_df[text_col] = cleaned_df[text_col].apply(safe_text_convert)

    try:
        cleaned_df = cleaned_df.sort_values('customer_id', na_position='last').reset_index(drop=True)
    except Exception as e:
        print(f"Warning: Sorting failed with error: {e}")
        cleaned_df = cleaned_df.reset_index(drop=True)

    return cleaned_df

def clean_materials_df(df):
    df['quantity'] = df['quantity'].astype('float')
    df['uom'] = df['uom'].str.lower()
    return df.sort_values(['source_po','type','material_code']).reset_index(drop=True)

def process_format_test(format_col, df_extract, mlflow_run_name, temp=0.0):
    order_groundtruth, material_groundtruth = load_groundtruth()

    with mlflow.start_run(run_name=mlflow_run_name):
        mlflow.log_param("model", model)
        mlflow.log_param("format", format_col)
        mlflow.log_param("prompt", prompt)
        mlflow.log_param("timestamp", datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

        start_time = time.time()
        df_extract['data'] = df_extract[[format_col]].ai.generate_response(prompt, conf=Conf(seed=0, max_concurrency=50, temperature=temp))
        execution_time = time.time() - start_time

        stats = df_extract.ai.stats
        mlflow.log_metric("num_successful", stats['num_successful'])
        mlflow.log_metric("num_exceptions", stats['num_exceptions'])
        mlflow.log_metric("num_unevaluated", stats['num_unevaluated'])
        mlflow.log_metric("prompt_tokens", stats['prompt_tokens'])
        mlflow.log_metric("completion_tokens", stats['completion_tokens'])
        mlflow.log_metric("execution_time", execution_time)

        df_extract['input_tokens'] = df_extract[format_col].apply(get_token_count)
        df_extract['output_tokens'] = df_extract["data"].apply(get_token_count)

        mlflow.log_metric("avg_input_tokens", df_extract['input_tokens'].mean())
        mlflow.log_metric("avg_output_tokens", df_extract['output_tokens'].mean())

        order_dfs = []
        material_dfs = []

        for num in range(min(4, len(df_extract))):
            po = f"PO{num+1}"
            try:
                current_data = json.loads(df_extract['data'][num])

                # order data
                order_df = pd.json_normalize(current_data)
                order_df['source_po'] = po
                order_dfs.append(order_df.drop(columns=['materials'], errors='ignore'))

                # materials data
                materials_df = pd.json_normalize(current_data.get('materials', []))
                materials_df['source_po'] = po
                material_dfs.append(materials_df)
            except Exception as e:
                mlflow.log_param(f"error_po{num+1}", str(e))
                continue

        if not order_dfs or not material_dfs:
            mlflow.log_metric("accuracy_score", 0.0)
            return

        order_df = pd.concat(order_dfs, ignore_index=True)
        materials_df = pd.concat(material_dfs, ignore_index=True)

        final_order_df = clean_order_df(order_df)
        final_materials_df = clean_materials_df(materials_df)

        # this is shaping the data for deepdiff, to convert into a record format
        purchase_order = {
            "job_id": final_order_df["job_id"].iloc[0] if "job_id" in final_order_df.columns and len(final_order_df) > 0 else "",
            "customer_id": final_order_df["customer_id"].iloc[0] if "customer_id" in final_order_df.columns and len(final_order_df) > 0 else "",
            "po_date": final_order_df["po_date"].iloc[0] if "po_date" in final_order_df.columns and len(final_order_df) > 0 else "",
            "name": final_order_df["name"].iloc[0] if "name" in final_order_df.columns and len(final_order_df) > 0 else "",
            "phone_number": final_order_df["phone_number"].iloc[0] if "phone_number" in final_order_df.columns and len(final_order_df) > 0 else "",
            "delivery_type": final_order_df["delivery_type"].iloc[0] if "delivery_type" in final_order_df.columns and len(final_order_df) > 0 else "",
            "delivery_date": final_order_df["delivery_date"].iloc[0] if "delivery_date" in final_order_df.columns and len(final_order_df) > 0 else "",
            "delivery_time": final_order_df["delivery_time"].iloc[0] if "delivery_time" in final_order_df.columns and len(final_order_df) > 0 else "",
            "delivery_address": final_order_df["delivery_address"].iloc[0] if "delivery_address" in final_order_df.columns and len(final_order_df) > 0 else "",
            "eir_code": final_order_df["eir_code"].iloc[0] if "eir_code" in final_order_df.columns and len(final_order_df) > 0 else ""
        }

        materials = final_materials_df.to_dict(orient='records')

        order_json = {"purchase_order": purchase_order}
        materials_json = {"materials": materials}

        order_json = convert_numpy_types(order_json)
        materials_json = convert_numpy_types(materials_json)

        # claculate accuracy
        result = diff.calculate_json_accuracy(material_groundtruth, materials_json)
        accuracy_score = result['score']
        total_fields = result['total_fields']
        diff_stats = result['json_diff_stats']

        # Log metrics
        mlflow.log_metric("accuracy_score", accuracy_score)
        mlflow.log_metric("total_fields", total_fields)
        mlflow.log_metric("diff_additions", diff_stats['additions'])
        mlflow.log_metric("diff_deletions", diff_stats['deletions'])
        mlflow.log_metric("diff_modifications", diff_stats['modifications'])
        mlflow.log_metric("diff_total", diff_stats['total'])

        with open("order_extracted.json", "w") as f:
            json.dump(order_json, f, indent=2)

        with open("materials_extracted.json", "w") as f:
            json.dump(materials_json, f, indent=2)

        mlflow.log_artifact("order_extracted.json")
        mlflow.log_artifact("materials_extracted.json")

        # optional i you want to log the data for tracing       
        # mlflow.log_artifact("order_dataframe.csv")
        # mlflow.log_artifact("materials_dataframe.csv")

        return {
            "format": format_col,
            "accuracy": accuracy_score,
            "execution_time": execution_time,
            "diff_total": diff_stats['total']
        }

Run the MLFlow experiments:


path = "/lakehouse/default/Files/complete_purchase_order.xlsx"
sheet_names = ['PO1', 'PO2', 'PO3', 'PO4', 'PO5']

# Load Excel data
df_dict = {
    sheet: pd.read_excel(path, sheet_name=sheet).dropna(how='all').dropna(how='all', axis=1)
    for sheet in sheet_names
}


df_extract = pd.DataFrame({
    "json_data": [json.loads(df_dict[f'PO{i}'].to_json()) for i in range(1, 6)],
    "string_data": [df_dict[f'PO{i}'].to_string() for i in range(1, 6)],
    "markdwn_data": [df_dict[f'PO{i}'].to_markdown() for i in range(1, 6)]
})


formats = ["string_data", "json_data", "markdwn_data"]
results = []

for fmt in formats:
    result = process_format_test(fmt, df_extract.copy(), f"format_test_{fmt}")
    if result:
        results.append(result)

if results:
    results_df = pd.DataFrame(results)
    results_df.to_csv("format_comparison.csv", index=False)

    with mlflow.start_run(run_name="summary"):
        mlflow.log_artifact("format_comparison.csv")
        best_format = results_df.loc[results_df['accuracy'].idxmax(), 'format']
        mlflow.log_param("best_format", best_format)

        for col in results_df.columns:
            if col != 'format':
                mlflow.log_metric(f"avg_{col}", results_df[col].mean())

MLFLow inline run comparison:

In the run comparison we can see that both text and markdown resulted in 100% accuracy however markdown generated more tokens. JSON on the other hand is <100% accurate.

Why does this matter? Because, as you receive more data, varied data, as LLMs used in AI function changes, you need to be able to trace this back and reproduce/benchmark the results (as much as possible) so you can productionize the solution confidently. Using AI is easy, building evals is hard but necessary.

Risks, Limitations

I created four layout examples. In the real world, you might encounter a layout different from the four I used. We can address this in a few ways. We can extract features from the current forms (Azure Doc Intelligence, for example, also provides the layout). We can train a machine learning model on many examples to create a classifier or use a large language model as a classifier for incoming forms. (e.g. a layout drift model)
Rate limits and context size: GPT-3.5 has a context window of only 16K tokens (including both input and output) and a maximum output of less than 4K tokens. This means it can't handle large Excel tables. In the example above, the average output was 1000 tokens, so for GPT-3.5, the data can't be more than about four times what I used. However, this is temporary. It was announced at FabCon that the model will be updated to newer GPT models with a larger context window (128K instead of 16K). AI functions also have a rate limit of 1000 TPM. For long documents, we can split the data into chunks to work around these limits, but that will require extra effort.
LLMs are bound to hallucinate and are subject to prompt injection, e.g. what if in the excel sheet there is text that intentionally or unintentionally instructs the models to behave differently and fudge the numbers?
The eval harness I created checked for two things - schema and values. But we could build more evaluation metrics for specific cases to ensure the accuracy risks are mitigated.
If Microsoft changes the underlying model, it may affect your results. You can always use a Azure Open AI custom deployment to control for this.
Always be aware of risks associated with sending sensitive data to LLMs and AI services.

To estimate cost, you can use my Fabric AI & Copilot cost calculator.

Unstructured to Structured : Extracting Data From Messy Excel Sheets Using Fabric AI Function