QuickTest: Switching Between Fabric Python And PySpark Notebooks

Sandeep Pawar
4 min read

File this under a test I have been wanting to do for some time. If I am exploring data in a Fabric notebook using PySpark, can I switch between the Python and PySpark engines with minimal code changes in an interactive session? The goal is to use the Python notebook for exploration, to reuse existing PySpark/Spark SQL code, or to develop the logic in a low-compute environment (to save CUs) and then scale it out in a distributed Spark environment. Understandably, there will be limitations with this approach given the differences in environments, configs etc., but can it be done?

Code

It’s straightforward: detect the environment using os.environ. If it’s Jupyter (the Python notebook), install delta-spark and create a Spark session. If it’s a Spark environment, no additional setup is required.

## Sandeep Pawar | fabric.guru 

def is_python():
    """
    Detect Fabric enviroment

    """
    import os

    if 'SPARK_HOME' in os.environ and 'HADOOP_CONF_DIR' in os.environ:
        return False

    elif 'JUPYTER_SERVER_HOME' in os.environ and 'SPARK_HOME' not in os.environ:
        return True

    return "NA"


if is_python() is True:
    """
    If Python, install delta-spark and initialize a spark session. 
    If PySpark, dont do anything.

    """
    try:
        import sys
        import subprocess
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'delta-spark', '-q'])

        import pyspark
        from delta import configure_spark_with_delta_pip

        # Create Spark session
        builder = (pyspark.sql.SparkSession.builder.appName("FabricPythonNotebook") 
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") 
            .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        )

        spark = configure_spark_with_delta_pip(builder).getOrCreate()
        print(f"-----------------------------------------------------")
        print(f"Python PySpark session created, spark {spark.version}")


    except Exception as e:
        print(f"Error setting up Spark: {str(e)}")
else:
    print("already Spark")



## load data from an attached lakehouse
from pyspark.sql.functions import col
table_name = "lineitem"
# in the Python notebook use the mounted lakehouse file path; in Spark use the relative Tables/ path (or an abfss path)
path = f"/lakehouse/default/Tables/{table_name}" if is_python() else f"Tables/{table_name}"
df = spark.read.format("delta").load(path)
df.count()
## explore data
df.where(col('l_quantity')<10).groupby(col('l_returnflag')).count().toPandas().plot(kind='barh');

Python Notebook: single node 2 core/16GB

PySpark Notebook: Default starter pool, RT1.3

It works!

Notes:

  • In the above code, I installed delta-spark to be able to read and write Delta Lake tables. If you don’t need it, you can skip the installation.

  • In the Python notebook, you can only read and write if a Lakehouse is attached. Unlike Spark, you can’t use abfs paths. Not sure if there is a way.

  • In the Python notebook, I can use PySpark but will not get all the goodness of the Fabric Spark environment: the extra optimizations and configurations like NEE, autotune, VORDER, Vegas cache etc. that are specific and proprietary to the Fabric Spark engine. You also don’t get the Spark monitoring UI for tracking jobs. So the above is only good for exploration in some specific scenarios.

  • There is a slight difference in the Spark version (3.5 vs 3.5.3) and also in the Delta Lake version, which may lead to incompatibilities. I haven’t tested it. A quick way to check the versions in each environment is sketched after this list.

  • You can create a Delta table using PySpark in the Python notebook, but again, because of the difference in Spark configurations, the Delta tables may be sub-optimal and/or may have incompatible features.

  • Nothing prevents you from setting any Spark config you want. In the Python notebook’s PySpark session, I can set the VORDER config with spark.conf.set('spark.sql.parquet.vorder.enabled', 'true'), but that doesn’t mean VORDER will be applied. It’s only available in the Fabric Spark runtime.

  • If you notice in the above contrived example, Spark took 14s vs 99s in Python. Though Spark is using 8 cores and Python 2 cores, you may still end up burning more CUs with the Python notebook. Not always, it depends on the query. Plus, with Fabric Spark optimizations like the Vegas cache, autotune etc., Spark may be more economical for repeated queries and scans. If you are using a little Spark and mostly Python engines like pandas or Polars, then the Python notebook may be better. Data science use cases can also benefit from the above approach. Test and find out. All of this dravidipranayam is to save a few CUs for customers who are on low F SKUs or on a budget, but depending on the queries it may or may not be worth it. This only works in an interactive session and you can’t use %%configure. But now I at least know what works and what doesn’t.

  • Another option that I have not tested: Ibis. It’s a wrapper rather than an engine, so you can keep the same code and just swap the backend with minimal changes. A rough sketch of what that could look like follows below.
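
For reference, here is a rough, untested sketch of the Ibis route. Treat it as an assumption-laden illustration: it assumes the ibis-framework package (with the DuckDB and PySpark backends) and the deltalake package are installed, and the exact connect/read_delta signatures depend on the Ibis version.

import ibis
from ibis import _

table_name = "lineitem"

if is_python():
    # Python notebook: use an in-process backend such as DuckDB and read the
    # Delta table through the lakehouse mount (assumes deltalake is installed)
    con = ibis.duckdb.connect()
    t = con.read_delta(f"/lakehouse/default/Tables/{table_name}")
else:
    # Spark notebook: wrap the existing Fabric Spark session
    con = ibis.pyspark.connect(session=spark)
    t = con.table(table_name)

# the expression below is identical for either backend
(
    t.filter(_.l_quantity < 10)
    .group_by(_.l_returnflag)
    .aggregate(n=_.count())
    .to_pandas()
)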

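On the version-mismatch note above, a quick way to see what each environment is actually running is to print the versions before writing anything. This is a minimal check using only the standard library and the active Spark session:

import importlib.metadata

# Spark version of the active session
print("spark:", spark.version)

# delta-spark is only pip-installed in the Python notebook path above, so guard the lookup
try:
    print("delta-spark:", importlib.metadata.version("delta-spark"))
except importlib.metadata.PackageNotFoundError:
    print("delta-spark: not installed via pip in this environment")
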
Thanks for reading, this is the last blog of the year!
