Streamlining Your Databricks Environment Setup

Akash Desarda
4 min read

I'm pretty sure that if you're using Databricks to run your PySpark jobs, your typical steps look something like this:

  • Design and develop the business logic.

  • Put that logic into a notebook.

  • Run the notebook using a Databricks Workflow.

This is easier said than done. The logistics of running these notebooks is one of the biggest headaches. You probably aren't running just one or two notebooks, and setting up the environment correctly for every one of them can be cumbersome. It becomes even tougher if you are using proprietary/private libraries.

The following is the system I came up with, which has proven to be the most practical solution for me:

  1. Step 1: Universal environment setup notebook.

    • Create a universal environment setup notebook in your repo.

    • This notebook can be placed at the root of your repo so that all other notebooks can easily access it (a sample layout is sketched below).
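
    • For reference, here is one hypothetical layout (the folder and notebook names are placeholders, not a requirement); a job notebook nested two levels deep like this reaches the setup notebook with %run ../../databricks_environment_setup, as shown in Step 5:

        repo_root/
        ├── databricks_environment_setup        <- universal setup notebook
        └── jobs/
            └── sales/
                └── daily_sales_job             <- downstream job notebook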

  2. Step 2: Secrets, Environment variables, Constants, etc

    • Secrets, variables, and constants change depending on the environment in which the notebook is running, so they must be set accordingly.

    • We use dbutils.secrets to fetch all the secrets. Let's see it in action:

        # Getting all secrets
        # Note - It is important to make sure that the scope is appropriately mapped to
        # the secret store assigned to environment-specific workspace
        PAT = dbutils.secrets.get("<your_secret_scope>", "<pat_secret_name>")
        SECRET_1 = dbutils.secrets.get("<your_secret_scope>", "<secret_1_name>")
        SECRET_2 = dbutils.secrets.get("<your_secret_scope>", "<secret_2_name>")
        SECRET_3 = dbutils.secrets.get("<your_secret_scope>", "<secret_3_name>")
      
        # Environment specific constants
        databricks_host = spark.conf.get("spark.databricks.workspaceUrl")
      
        if databricks_host == "<your dev workspace host url>":
            environment = "dev"
            CONSTANT_1 = "dev constant 1"
            CONSTANT_2 = "dev constant 2"
            volume_path = "/Volumes/dev/path"
            catalog_name = "dev_catalog"
        elif databricks_host == "<your uat workspace host url>":
            environment = "uat"
            CONSTANT_1 = "uat constant 1"
            CONSTANT_2 = "uat constant 2"
            volume_path = "/Volumes/uat/path"
            catalog_name = "uat_catalog"
        elif databricks_host == "<your prod workspace host url>":
            environment = "prd"
            CONSTANT_1 = "prod constant 1"
            CONSTANT_2 = "prod constant 2"
            volume_path = "/Volumes/prod/path"
            catalog_name = "prod_catalog"
        else:
            raise NameError("Incorrect databricks workspace")
      
  3. Step 3: Passing all variables to the downstream notebook. There are a couple of ways to let the downstream notebook access these environment-specific values. Here are two methods I use:

    • DBFS JSON file: Save all the values in a JSON file under /dbfs/tmp/. This way, the file is deleted once the notebook job is done. I used this method for a while, but Databricks recently changed the permission rules, and now only admins can access that location (a minimal sketch of this approach appears after the temp view example below).

    • Temp view: Using a TEMP VIEW is even better than a file, and it doesn't require admin permissions. So, create a TEMP VIEW that includes all these variables:

        # Creating a dict of all variables
        env_vars = {
            "secret_1": SECRET_1,
            "secret_2": SECRET_2,
            "secret_3": SECRET_3,
            "constant_1": CONSTANT_1,
            "constant_2": CONSTANT_2,
            "catalog_name": catalog_name
        }

        # Write the environment variables to a TEMP VIEW
        spark.createDataFrame([env_vars]).createOrReplaceTempView("env_vars")
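
    • For completeness, here is a minimal sketch of the older DBFS JSON approach (the file path is just a placeholder):

        # Minimal sketch of the DBFS JSON approach (path is a placeholder)
        import json

        # /dbfs/... is the driver-local mount of DBFS
        with open("/dbfs/tmp/env_vars.json", "w") as f:
            json.dump(env_vars, f)

        # The downstream notebook reads it back
        with open("/dbfs/tmp/env_vars.json") as f:
            env_vars = json.load(f)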
    
  4. Step 4: Installing libraries, especially private ones

    • You should definitely package your codebase as a Python library, then install and use it just like any other open-source library. This way you won't have to worry about path issues, relative/absolute import errors, etc.

    • You need separate packages for dev/qa (which can be experimental) and uat/prod (which must be stable).

    • Follow PEP 440 guidelines to version your code.

      • Use an X.Y.devN version for the package published from the develop branch.

      • Use an X.Y.N version for the package published from the master branch (a quick check of how pip orders these is shown after the install snippet below).

    • Use %pip to install the environment-specific private library:

        if environment == "dev" or environment == "qa":
            # --pre flag will install package having 'dev' label
            # NOTE - use your repository URL appropriately
            %pip install --extra-index-url https://databricks-user:{PAT}@pkgs.dev.azure.com/dp-us-bus/_packaging/dp-pip/pypi/simple/ --pre --upgrade "<your package name>"
      
        if environment == "uat" or environment == "prod":
            %pip install --extra-index-url https://databricks-user:{PAT}@pkgs.dev.azure.com/dp-us-bus/_packaging/dp-pip/pypi/simple/ --upgrade "<your package name>"
      
        # Note - Databricks recommends restarting Python to make sure we'll
        # be using the libraries that were just installed
        dbutils.restartPython()
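
    • If you're unsure how pip orders these versions, here's a quick, hypothetical sanity check (it assumes the packaging library is available, which it normally is on Databricks runtimes):

        # PEP 440: a dev release sorts *before* the corresponding final release,
        # which is why pip ignores X.Y.devN builds unless --pre is passed
        from packaging.version import Version

        assert Version("1.4.dev2") < Version("1.4.0")   # develop-branch build < master-branch build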
      
  5. Step 5: Running the setup notebook in the downstream job notebook

    • Use Databricks' %run magic command.

    • Read the TEMP VIEW and set the values as OS environment variables (a short usage example follows the snippet):

        # Running the notebook. Make sure to use correct relative path
        %run ../../databricks_environment_setup
      
        # Read environment variables from the TEMP VIEW & set the environment variables
        # for use in this notebook
        import os
      
        os.environ.update(spark.table("env_vars").first().asDict())
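
    • At this point the downstream notebook can read everything from os.environ. A minimal sketch (it assumes you are on Unity Catalog, since the setup notebook sets catalog_name):

        # Values come back as strings from os.environ
        catalog_name = os.environ["catalog_name"]
        spark.sql(f"USE CATALOG {catalog_name}")

        # Secrets set in the setup notebook are available the same way,
        # e.g. token = os.environ["secret_1"]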
      