Identify Column Mismatch Between Lakehouse Tables And SQL Endpoint Tables In Fabric

Sandeep Pawar

As you may know, the Lakehouse (Delta Lake) and SQL endpoint are two different engines, and there are differences in the features each engine supports. You can review them here. A common issue I've encountered is writing a Delta table to the Lakehouse and later discovering that a column isn't supported by the SQL Endpoint. TIMESTAMP_NTZ is one example, and an array type column is another.

For example, below I am using a Python notebook with Polars to create a Delta table.

import polars as pl

data = {
    "id": [1, 2, 3],
    "names": ["Survey1", "Survey2", "Survey3"],
    "scores": [[85, 90, 78], [88, 92, 81], [72, 75, 80]]
}
df = pl.DataFrame(data)
df.write_delta('abfss://<>/Tables/dbo/survey')

The Lakehouse shows the table without any errors. Notice that the scores column is of type array.

However, if I check the SQL Endpoint, the survey table shows only two columns; the scores column is excluded because it’s not supported.

So I created a utility in a Python notebook (you can refactor it for a PySpark notebook) to compare the two and identify the issue early.

notebookutils.data()

There are a number of different ways to do this, but I also wanted to highlight a recent addition to my favorite notebookutils library: you can now use notebookutils.data to connect to the SQL endpoint of a Lakehouse or a Warehouse. This only works in a Python notebook (in a PySpark notebook you can use the Warehouse connector or Semantic Link Labs). With Polars, I am scanning the table schemas without loading the tables into memory.
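
Here is a minimal sketch of just that connection step, using the same placeholder Lakehouse and workspace names as the full utility below; conn.query() returns a pandas dataframe.

import sempy.fabric as fabric

# placeholder names - replace with your own Lakehouse and workspace
ws_id = fabric.resolve_workspace_id("Sales")
conn = notebookutils.data.connect_to_artifact("MyLakehouse", ws_id, "Lakehouse")

# run T-SQL against the SQL endpoint; the result is a pandas dataframe
tables = conn.query("SELECT TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES")
display(tables)

The full utility below builds on this pattern to compare every table's Delta schema with the columns the SQL endpoint exposes.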

import sempy.fabric as fabric
import polars as pl

lakehouse_name = "MyLakehouse"
lakehouse_workspace = "Sales"


# resolve the workspace id and the Lakehouse abfss path
lh_workspaceid = fabric.resolve_workspace_id(lakehouse_workspace)
abfs = notebookutils.lakehouse.getWithProperties(lakehouse_name, lh_workspaceid)['properties']['abfsPath']

# list the schema folders under /Tables and connect to the SQL endpoint
schemas = [f.name for f in notebookutils.fs.ls(f"{abfs}/Tables")]
conn = notebookutils.data.connect_to_artifact(lakehouse_name, lh_workspaceid, "Lakehouse")

for schema in schemas:
    try:
        tables = [t.name for t in notebookutils.fs.ls(f"{abfs}/Tables/{schema}")]        
        for table in tables:
            try:
                print(f"Table name: {schema}.{table}")                              
                # read column names from the Delta table schema (lazy scan, no data loaded)
                delta_columns = pl.scan_delta(f"{abfs}/Tables/{schema}/{table}").collect_schema().names()

                try:
                    sql_query = f"""
                        SELECT COLUMN_NAME
                        FROM INFORMATION_SCHEMA.COLUMNS
                        WHERE TABLE_NAME = '{table}' AND TABLE_SCHEMA = '{schema}';
                    """
                    result = conn.query(sql_query)

                    if result is None:
                        print("❌ SQL query returned None. Check SQL connection and query.")
                        continue

                    t_sql_cols = result['COLUMN_NAME'].to_list()

                    print(f"Delta tables has {len(delta_columns)} columns and SQL EP has {len(t_sql_cols)} columns")

                    if len(delta_columns) != len(t_sql_cols):
                        delta_only = set(delta_columns) - set(t_sql_cols)
                        sql_only = set(t_sql_cols) - set(delta_columns)
                        if delta_only:
                            print(f"❌Columns in Delta but not in SQL: {delta_only}")
                        if sql_only:
                            print(f"❌Columns in SQL but not in Delta: {sql_only}")
                    else:
                        print("✅Columns match")
                except Exception as sql_err:
                    print(f"❌SQL query error: {str(sql_err)}")
                print("-"*80)
            except Exception as e:
                print(f"Error reading table: {schema}.{table}")
                print(f"Error details: {str(e)}")
    except Exception as e:
        print(f"Error reading schema: {schema}")
        print(f"Error details: {str(e)}")

Example Output:

The above function will only work for a Lakehouse with schemas enabled. If you want a more comprehensive function, use this instead.

Notes

  • To fix the above, either un-nest/explode the array or, if you want to keep the array (e.g. a dictionary), cast it as a string. To fix TIMESTAMP_NTZ, define the timezone in the timestamp if you are using the delta-lake library (e.g. Polars, the Rust engine); Spark doesn’t have this issue. See the sketch after these notes.

  • At FabCon it was announced that you will soon be able to use a T-SQL cell in a Python notebook and bind the T-SQL result to a pandas df. Currently you can’t mix Python and T-SQL.

  • SQL EP may not show the latest data if there is a sync issue. Just wait a few minutes in that case, or sync manually or programmatically. You can also use this utility to detect sync issues.

  • If you have created a Direct Lake model off of this Lakehouse and see the CannotRetrieveModelException error below, an unsupported data type is likely the culprit. Fixing that and reframing the model should resolve it.
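
Here is the sketch referenced in the first note: it reuses the survey example and shows one way to apply both fixes in Polars. The created_at column is added purely for illustration, and the abfss path is the same placeholder as earlier.

import polars as pl
from datetime import datetime

data = {
    "id": [1, 2, 3],
    "names": ["Survey1", "Survey2", "Survey3"],
    "scores": [[85, 90, 78], [88, 92, 81], [72, 75, 80]],
    "created_at": [datetime(2025, 1, 1), datetime(2025, 1, 2), datetime(2025, 1, 3)],  # illustrative only
}

df = (
    pl.DataFrame(data)
    # cast the array to a string; alternatively .explode("scores") un-nests it into rows
    .with_columns(pl.col("scores").cast(pl.List(pl.Utf8)).list.join(", "))
    # attach a timezone so the column is not written as TIMESTAMP_NTZ
    .with_columns(pl.col("created_at").dt.replace_time_zone("UTC"))
)

# overwrite the earlier table (placeholder path, sketch only)
df.write_delta('abfss://<>/Tables/dbo/survey', mode="overwrite")

Rerun the comparison utility afterwards; both columns should now show up in the SQL endpoint.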
