Unstructured to Structured Data: Which Engine Was Used to Create a Delta Table?

In previous blogs, I discussed using Polars, DuckDB, Daft, Spark, and others to create Delta Lake tables in Fabric. You can also use other Fabric engines such as Dataflow Gen2, Pipeline, and the DWH to create Delta tables, or create a shortcut to a Delta table made outside of Fabric. But how do you know which engine was used to create a table? Knowing this can help with performance debugging. I wrote a blog last year about how the layout of data in Delta tables varies based on the engine used and how that affects Direct Lake performance. Fabric has advanced a lot since then, so that blog may be outdated, but the main idea still holds. If you look at the Delta log, sometimes you can see the engineInfo, other times you might not, or it could be nested in another field. There is no consistency because it depends on the engine used. So, what's the solution? Well, the Delta log is a JSON file, meaning it's unstructured data, so I can pass the latest log to an LLM and have it extract the engine used: a classic entity extraction problem.
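To illustrate (a hand-written sample, not an actual log from my lakehouse): a Spark commit typically carries the engine in commitInfo, e.g. {"commitInfo": {"operation": "WRITE", "engineInfo": "Apache-Spark/3.4.1 Delta-Lake/2.4.0", ...}}, whereas delta-rs based writers (Polars, DuckDB, etc.) report it as "clientVersion": "delta-rs.0.18.1" inside commitInfo.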
This is the second blog in my Unstructured To Structured series. The first blog is here.
The Prompt:
Here is the prompt I created:
"Given a Delta table log JSON, return ONLY the engine used to create or modify the table. The engine information can be found in fields like 'engineInfo' or 'clientVersion' or similar. Just return the exact engine string with no additional explanation or text. Examples of valid responses: e.g. delta-rs.0.18.1, Mashup Engine , Apache-Spark/3.3.1.5.2-108696741 Delta-Lake/2.2.0.9, DataFactoryCopy, KQL etc. Do not provide any explanation or additional context - just output the engine name exactly as found in the log. If you are unsure return "NA"."
LLM:
You can use Azure OpenAI in Fabric, which is available in paid F64+ capacities. I am using my personal trial capacity, so the next best option is Google AI Studio because it's free. If you want to follow along, just create an API key and use it below.
Code:
In the code below, I:
- Get a list of tables in a lakehouse
- Get the latest log for each table
- Send each log to the model and extract the engine
## I am using a Fabric Python notebook; install the Google Generative AI SDK
%pip install google-generativeai -q

import google.generativeai as genai
from textwrap import dedent

import pandas as pd
from tqdm import tqdm
## Configure the model
genai.configure(api_key="<key>")

generation_config = {"temperature": 1, "response_mime_type": "text/plain"}

model = genai.GenerativeModel(
    model_name="gemini-2.0-flash-exp",
    generation_config=generation_config,
    system_instruction=(
        "Given a Delta table log JSON, return ONLY the engine used to create or modify the table. "
        "The engine information can be found in fields like 'engineInfo' or 'clientVersion'. "
        "Just return the exact engine string with no additional explanation or text. "
        "Examples of valid responses: delta-rs.0.18.1, Mashup Engine, "
        "Apache-Spark/3.3.1.5.2-108696741 Delta-Lake/2.2.0.9. "
        "Do not provide any explanation or additional context - just output the engine name "
        'exactly as found in the log. If you are unsure return "NA".'
    ),
)
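## Optional sanity check before hitting real tables. The sample string below
## is a made-up, minimal commitInfo (an illustrative assumption, not a real
## log); uncomment to verify the API key and model respond:
# sample = '{"commitInfo": {"engineInfo": "Apache-Spark/3.4.1 Delta-Lake/2.4.0"}}'
# print(model.generate_content(f"Extract the engine info from:\n{sample}").text)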
def get_latest_delta_log(table_path):
    """Get the latest Delta log file content for a table path"""
    # Keep only the JSON commit files (skips checkpoints, crc files, _last_checkpoint)
    log_files = [f.path for f in notebookutils.fs.ls(f"{table_path}/_delta_log") if f.path.endswith('.json')]
    # Commit files are named as zero-padded version numbers, e.g. 00000000000000000007.json
    latest_version = max(int(f.split('/')[-1].replace('.json', '')) for f in log_files)
    latest_log = f"{table_path}/_delta_log/{str(latest_version).zfill(20)}.json"
    return dedent(notebookutils.fs.head(latest_log).replace("\\", ""))
def get_engine(table_path, model):
    """Extract engine information from a Delta table log using the Gemini model"""
    try:
        log_content = get_latest_delta_log(table_path)
        # A single generate_content call per table is simpler than a chat session;
        # the system instruction already tells the model what to extract
        response = model.generate_content(f"Extract the engine info from:\n{log_content}")
        return response.text.strip()
    except Exception as e:
        print(f"Error @ {table_path}: {str(e)}")
        return "NA"
def get_table_engines(base_abfs_path, model):
    """Get engine information for all Delta tables in a directory"""
    try:
        tables = [f.name for f in notebookutils.fs.ls(base_abfs_path)]
        df = pd.DataFrame(tables, columns=["table"])
        print(f"Processing {len(tables)} tables...")
        # One LLM call per table; tqdm shows progress
        df['engine'] = [get_engine(base_abfs_path + table, model) for table in tqdm(df['table'])]
        return df[['table', 'engine']]
    except Exception as e:
        print(f"Error: {str(e)}")
        return pd.DataFrame()
path = "abfss://<ws>@onelake.dfs.fabric.microsoft.com/<lakehouse>/Tables/"
results_df = get_table_engines(path, model)
There you have it: for each table, the engine is identified, e.g. Spark, delta-rs (Polars, DuckDB, etc.), Mashup Engine (Dataflow Gen2), DataFactory, and so on. This is an easy entity extraction problem, so you can use any LLM you have access to.
Note: If you have many tables, you may hit the API rate limit. In that case, instead of iterating over the rows, I suggest concatenating the logs and sending them in a single request. Gemini Flash has a context window of 1M tokens, so the combined logs will most likely fit, and latency will be reduced.
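Here is a minimal sketch of that batching approach. The table labels, delimiter, and 'table,engine' output format are my own assumptions; you would also tweak the system instruction above, since it currently asks for a single engine string per request:

## A minimal sketch of the batching approach; the labels, delimiter, and
## 'table,engine' output format are illustrative assumptions
def get_engines_batched(base_abfs_path, model):
    tables = [f.name for f in notebookutils.fs.ls(base_abfs_path)]
    # Label each log with its table name so the model can map results back
    logs = [f"TABLE: {t}\nLOG: {get_latest_delta_log(base_abfs_path + t)}" for t in tables]
    prompt = (
        "For each Delta table log below, return one line per table "
        "in the format 'table,engine':\n\n" + "\n\n".join(logs)
    )
    response = model.generate_content(prompt)
    rows = [line.split(",", 1) for line in response.text.strip().splitlines() if "," in line]
    return pd.DataFrame(rows, columns=["table", "engine"])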