A Guide to Building a Databricks App


If you’re a data engineer (or other data professional), work within Databricks, and haven’t utilized their apps yet, you really are missing out. Once I heard they were releasing a way to build and host apps (that utilizes their existing authentication), I got quite excited to find a good use case to start dipping my toes in.

This offers a good way to give your team, business consumers, or others the ability to interact with data, and potentially even to develop something like a CRUD data app (to avoid things like manually loading Excel files - or to even allow business consumers to load the Excel files themselves).

However, development and deployment (depending on your level of experience) might be quite intimidating, especially if you’ve not worked extensively in app development in Python. As well, not to trash Databricks, but sometimes their documentation is base level at best, which can leave you lost when you’re neck deep in developing something. This isn’t atypical for most software or service providers, though, as they can’t cover every use case or development flow!

Given that, I figured I’d create this tutorial, showing the steps needed and giving some guidance on how to build your first Databricks app!

Let’s get started!

Requirements

The requirements are pretty extensive, but not overwhelming. All of them are pretty straightforward, so let’s go through them!

Databricks

Pretty self-explanatory, but you need a Databricks environment to develop an app. Thankfully Databricks has a pretty extensive Free Edition that will allow you to do MOST things for free. As long as your data isn’t huge, you should have all you need to practice!

You can get started here utilizing either a Google or Microsoft account: https://docs.databricks.com/aws/en/getting-started/free-edition

Next, I’d HIGHLY recommend the Databricks CLI. While you can probably find your way around without it, if you’re going to work on Databricks, you need to learn to utilize the CLI. Documentation on the tool (including installation) is located here: https://docs.databricks.com/aws/en/dev-tools/cli/

Go through setup, and authenticate to your workspace. A quick way to authenticate, once the CLI is installed, is to utilize the command:

databricks configure

It will then take you through setting up your default profile to authenticate to your workspace. You’ll need your workspace URL and a personal access token.

To create a token, in your Databricks workspace go to: Settings > Developer > Access Tokens (Manage) > Generate New Token. Choose a decent amount of time for the token to live and add a comment that it’s for your Databricks CLI; then you should be able to utilize the CLI freely!

💡
If you need to authenticate as a different user (or service principal), you can set up your CLI to authenticate via multiple accounts or methods - you’ll just need to add an additional profile to your .databrickscfg file. Check here for more: https://docs.databricks.com/aws/en/dev-tools/cli/profiles
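
For reference, a .databrickscfg with an extra profile looks roughly like this (the hosts, tokens, and the "secondary" profile name are all placeholders - use your own values):

[DEFAULT]
host  = https://your-host.cloud.databricks.com
token = your_token

[secondary]
host  = https://your-other-host.cloud.databricks.com
token = your_other_token

You can then pass --profile secondary to CLI commands to use the second identity.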

Python

You will also need Python - unless you’d like to go the JavaScript route, but I’ll be utilizing Python for this tutorial. Databricks utilizes 3.11; you’d probably be fine with a higher version, but if you can, I’d recommend installing 3.11 to ensure consistency between the version you develop with and the version that will be deployed.

Modules

For the purpose of this post, these are some recommended modules:

  • Streamlit - python -m pip install streamlit - this is the framework we’ll utilize to build our application

    💡
    There are LOTS of options for frameworks within Databricks, so always feel free to utilize a different framework (Flask, Dash, etc.)
  • Databricks SDK - python -m pip install databricks-sdk - this is the Databricks SDK for Python; it has quite a few features (it also offers a decent alternative to the CLI within Python), but we’ll primarily utilize it to authenticate to Databricks within our application and to load data to the workspace

  • Databricks SQL connector - python -m pip install databricks-sql-connector - this is the Databricks Python module for interacting with Databricks via SQL; we’ll utilize it to execute queries

  • Polars - python -m pip install polars - this is a newer dataframe library built in Rust that can handle QUITE a bit of data efficiently; it’s become my new favorite compared to Pandas (however, you can always use Pandas if you prefer)

  • Dotenv (Optional) - python -m pip install python-dotenv - this is for setting up our .env file and storing our configuration/secrets there versus storing them in code

    💡
    This one is HIGHLY recommended, otherwise you’d need to store your secrets in code, and that’s never recommended
  • Pip-chill (Optional) - python -m pip install pip-chill - this is a very NICE alternative to pip freeze that writes out only the modules you actually installed (without all of their dependencies)

Code Editor

Goes without saying, but you’ll need somewhere to edit code. VSCode is my preferred editor; it’s got a lot of features and plugins, all in a lightweight package. However, there are always other options/alternatives like JetBrains IDEs, Cursor, or many others.

Getting Started

Getting started can take several paths. You could create a template app from Databricks, then sync it to your local machine with the CLI; however, we’re not going to go that way. We’re going to go through the entire process, with some added features that may seem complicated but make things WAY simpler later on.

So, let’s go through each step, but before we get started:

  1. Make sure you have the Databricks CLI installed, and authentication set up

  2. Python installed

  3. Open your IDE of choice

  4. Create a local folder where you’ll store your app

  5. Open a terminal in your working folder

  6. Create/activate a virtual environment (optional but HIGHLY recommended)

    To create a virtual environment, open a terminal and navigate to your working folder, then run the command: python3 -m venv .venv - this will create a virtual environment in the .venv folder. You can then activate it, which varies per operating system.

    Run the following commands in your terminal:

    • For linux/mac: source .venv/bin/activate

    • For windows: .venv\Scripts\activate

  7. Install the modules listed in the requirements section above

Once that’s done, we’ll jump into setting it up from scratch!

Setting up a DAB (Databricks Asset Bundle)

Databricks Asset Bundles are Databricks’ solution for infrastructure as code. They help define a group of resources to deploy. As well, they make it INCREDIBLY convenient to deploy resources differently based on environment (dev vs. prod). They also offer a convenient way to run an app!

So to create our DAB, we just create a file named databricks.yml - then we’re going to configure our resources within the yml file:

bundle:
  name: my-app

#setting up variables to be used in other places
variables:
  app_name: 
    description: "Name of the app"
    default: "my-app"
  sql_warehouse:
    description: "SQL warehouse ID"
    #this is the ID from the warehouse "overview" page - next to name
    default: "warehouse_id" 
  bundle_path:
    description: "Path to the application code"
    #this is the path where the bundle and files will be stored
    default: "/Workspace/Shared/${var.app_name}/${bundle.target}"


resources:
  apps:
    app:
      name: ${bundle.target}-${var.app_name}
      #when deploying a bundle, the files themselves will be stored in /files
      source_code_path: ${var.bundle_path}/files
      description: "My Application"
      resources:
        - name: "sql-warehouse"
          description: "SQL Warehouse for the application"
          sql_warehouse:
            id: ${var.sql_warehouse}
            permission: "CAN_USE"

targets:
  prod:
    mode: production
    default: true
    workspace:
      host: https://your-host.cloud.databricks.com
      root_path: ${var.bundle_path}

I’m not going to dive into all of the components of DABs, but I would highly recommend reading up on asset bundles: https://docs.databricks.com/aws/en/dev-tools/bundles/

However, at a high level, there are several components:

  • variables - under the variables section, we declare dynamic/reused components of the asset bundles

  • resources - this defines the resources for the asset bundle, since we’re developing an app, we build an “app” resource

    • source_code_path - this is the path the app points to for pulling its code - when you deploy an asset bundle, the files of the asset bundle will always be contained in the /files folder

    • resources - these are resources ASSOCIATED to the app - this is where you’ll associate a warehouse, or keys from a secret scope

  • targets - these define the deployment destinations (dev vs. prod vs. stage vs. int). You can define a LOT of things within these sections, and if you’re in an enterprise/production environment, I highly recommend taking advantage of them (and reading much more on them). However, for our application we’re just going to deploy straight to prod.

Once our databricks.yml is complete, we’ll test/validate our bundle: databricks bundle validate. This will validate our bundle and ensure it works. You might get a warning since we’re deploying the bundle to a Shared space - that’s normal. If you want to change the destination to something more locked down, always feel free!

Setting up the app.yml file

Next we’ll set up our app.yml file. This gives Databricks the directions for how to run our application. We’re building a Streamlit app, so the contents are pretty straightforward:

#this is the command the app will run on launch
#notice it's exactly the same as the `streamlit run app.py` command we'll run locally later
command:
  - "streamlit"
  - "run"
  - "app.py"

#environmental variables passed to the application
env:
  - name: "DATABRICKS_WAREHOUSE_ID"
    valueFrom: "sql-warehouse"
  - name: STREAMLIT_BROWSER_GATHER_USAGE_STATS
    value: "false"

This will tell Databricks the “command” to run to launch the app, and the env section configures the environmental variables our app has access to. Really the only one we need for this is sql-warehouse, which will pull the SQL warehouse ID from the app resource that was configured within our DAB.

💡
Do NOT store sensitive data within the env section here. Utilize secret scopes to access those, and make the keys/secrets available to the app under the resources for the app

For more information on the app.yml file you can refer to the documentation: https://docs.databricks.com/aws/en/dev-tools/databricks-apps/app-runtime

Databricks templates for apps will also create an app.yml file depending on the framework utilized, so if you need a different framework, you can always create a new app based on their templates.

Building the Application

We’re going to utilize Streamlit for our app example. So first, we need to create a Streamlit app!

💡
This post is focused on Databricks apps, so I won’t be going into extensive Streamlit functionality - just enough to demonstrate some features and how it integrates with Databricks. If you need more information on Streamlit, definitely do some research!

We’ll create an app.py file in our working directory, then within it we’ll just add some boilerplate code:

import streamlit as st

st.header("My Databricks App")
st.write("This is the main page of my Databricks application.")

We’ll then run our application with streamlit run app.py - the application should run, and you can navigate to the page (there will be a link within your terminal that shows where the app is running).

Configuring .env file

Next we need to configure our .env file. This will allow us to pass in, while developing locally, the same environmental variables our app will have access to once it’s deployed on Databricks.

Create the .env file within your folder, and add the following keys:

DATABRICKS_WAREHOUSE_ID=warehouse_id
DATABRICKS_HOST=your_host
DATABRICKS_TOKEN=your_token

You’ll need to populate the SQL warehouse ID from your databricks.yml file, as well as the host, which will be the URL of your Databricks workspace. Then you’ll need a token similar to the one we created when configuring the Databricks CLI. I’d recommend using a different token than the one created for the CLI.

If you are syncing to a repo, MAKE SURE YOU IGNORE the .env file. You do not want to sync sensitive values to git. You can always use this template .gitignore file (just create it in your working folder):

.databricks/
build/
dist/
__pycache__/
*.venv/*
*.env

Querying Data From Databricks

We’re going to create a page next to show how to query data from Databricks. Let’s create a file called query_page.py, we’ll start by putting some basic code inside:

import streamlit as st

st.header("Page for querying data")
st.write("This is the page where users can query data.")

Next we’ll add this as a page within our app. We’ll alter our app.py file to add some navigation and remove our initial code:

import streamlit as st

pages = {
    "Databricks App Demo": [
        st.Page("./query_page.py",title="Query Page")
    ]
}

pg = st.navigation(pages)
pg.run()

Since this is our only page, it’s the only page we’ll see within our app while it’s running. Next we’ll extend our query_page.py page by querying data from Databricks and rendering the data.

import streamlit as st
st.header("Page for querying data")
st.write("This is the page where users can query data.")

import os
import dotenv
#loading the .env file
dotenv.load_dotenv()

#importing required modules to query and read data
import polars as pl
from databricks import sql
from databricks.sdk.config import Config

#creating configuration for authentication to databricks
config = Config()
#pulling our required variables - host, warehouse, and creating http_path
host = os.getenv("DATABRICKS_HOST")
warehouse_id = os.getenv("DATABRICKS_WAREHOUSE_ID")
http_path = f"/sql/1.0/warehouses/{warehouse_id}"

#connecting to our databricks warehouse
with sql.connect(
    server_hostname=host,
    http_path=http_path,
    credentials_provider=lambda: config.authenticate,
) as con:
    #creating cursor object
    cursor = con.cursor()
    #executing our query
    cursor.execute("select * from samples.nyctaxi.trips")
    #returning our result as pyarrow data
    arrow_results = cursor.fetchall_arrow()
    #creating a dataframe from our pyarrow data
    df = pl.from_arrow(arrow_results)

#rendering the data
st.data_editor(df)

That’s it! Run the app with streamlit run app.py, and the data should load! That’s all that’s required to query data from Databricks and render it in Streamlit!
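
As a side note, if you later want to let users filter what gets queried (say, from a Streamlit input), it’s worth using the connector’s query parameters instead of building SQL strings by hand. Here’s a rough sketch that could be appended to query_page.py, assuming a recent connector version with native named-parameter support (pickup_zip is a column in the same sample table):

#hypothetical filter on the sample data - reuses host, http_path, and config from above
pickup_zip = st.text_input("Pickup zip code", value="10001")

with sql.connect(
    server_hostname=host,
    http_path=http_path,
    credentials_provider=lambda: config.authenticate,
) as con:
    cursor = con.cursor()
    #named parameters (:zip) are bound by the connector, which avoids SQL injection
    cursor.execute(
        "select * from samples.nyctaxi.trips where pickup_zip = :zip",
        {"zip": int(pickup_zip)},
    )
    filtered_df = pl.from_arrow(cursor.fetchall_arrow())

st.data_editor(filtered_df)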

Uploading Data to Databricks

There are several ways to upload data to Databricks. You can always insert a single row at a time, or you can upload a file and then merge it in. I’d recommend the second method, as it’s a bit more scalable. This will require you to upload the data as a parquet file to a volume, then utilize that parquet file to create a table or merge it in.

We’ll start by building a new page upload_page.py, then registering it in the app:

import streamlit as st

st.header("Upload Page")
st.write("This is the page where users can upload data.")

The updated app.py:

import streamlit as st

pages = {
    "Databricks App Demo": [
        st.Page("./query_page.py",title="Query Page")
        , st.Page("./upload_page.py",title="Upload Page")
    ]
}

pg = st.navigation(pages)
pg.run()

Next, we’re going to create an upload feature in the page utilizing Streamlit’s file uploader by adding this to our upload_page.py:

#importing polars for manipulating/reading data
import polars as pl

#uploader
file = st.file_uploader("Upload a file",type=["csv"],accept_multiple_files=False)
#if someone passed in a file, read it and render it
if file:
    df = pl.read_csv(file)
    st.data_editor(df)

I have a file I’m going to test uploading that I’ve included in the repo, but you can upload any CSV file (the uploader above is restricted to CSVs):

id,age,state,a1c
1,45,TN,4.8
2,51,TX,5.4
3,35,TX,7.1
4,23,GA,6.3
5,43,AL,4.9
6,55,MS,5.2
7,65,AK,5.1
8,71,FL,7.2
9,14,ME,4.7

Before we upload the data, we also need to create the required destinations in Databricks: a catalog, schema, and volume. Execute this SQL within your workspace (assuming you don’t already have an apps catalog):

create catalog if not exists apps;
create schema if not exists apps.my_app;
create volume if not exists apps.my_app.tmp;

After we create the uploader, we’re then going to create a submit button that will post the file to Databricks and create a destination table. In our upload_page.py:

import os
import dotenv
#loading the .env file
dotenv.load_dotenv()

#import tempfile to manage writing the table from the app to databricks
import tempfile
#importing required modules to query data
from databricks import sql
from databricks.sdk import WorkspaceClient
from databricks.sdk.config import Config

#creating configuration for authentication to databricks
config = Config()
#create workspace client for uploading file to volumes
workspace_client = WorkspaceClient()
#pulling our required variables - host, warehouse, and creating http_path
host = os.getenv("DATABRICKS_HOST")
warehouse_id = os.getenv("DATABRICKS_WAREHOUSE_ID")
http_path = f"/sql/1.0/warehouses/{warehouse_id}"

#if a file has been uploaded
if file:
    #if the submit button has been clicked
    if st.button("Submit"):
        #will show a spinner while the rest of the code is executed
        with st.spinner("Uploading file and creating table..."):
            #creating a temporary file for the data to be stored before it's written to databricks
            with tempfile.NamedTemporaryFile(delete=True,dir="./",suffix=".parquet") as tmp_file:
                file_path = tmp_file.name
                file_name = file_path.split("/")[-1]

                #creating destination volume path for the file
                destination_volume_file = f"/Volumes/apps/my_app/tmp/{file_name}"

                #writing the parquet file out
                df.write_parquet(file_path,compression="gzip")

                #uploading the file to the databricks volume
                with open(file_path,"rb") as f:
                    workspace_client.files.upload(file_path=destination_volume_file,contents=f.read(),overwrite=True)

            with sql.connect(
                server_hostname=host,
                http_path=http_path,
                credentials_provider=lambda: config.authenticate,
            ) as con:
                #create your cursor
                cursor = con.cursor()
                #creating sql to create a databricks table utilizing the parquet
                create_table_sql = f"""
                    create or replace table apps.my_app.a1c_data as
                        select *
                        from parquet.`{destination_volume_file}`
                """
                #trying to create the table
                exception = None
                try:
                    cursor.execute(create_table_sql)
                except Exception as ex:
                    exception = ex
                finally:
                    workspace_client.files.delete(destination_volume_file)
                    if exception: raise exception
        #creating success message
        st.success("File uploaded and table created successfully!")
💡
You can also utilize the Databricks SQL connector to upload files to volumes utilizing the ‘PUT INTO’ command. I did not utilize that here as I started to run into issues when deploying the app where it would time out for some strange reason (even though I use it in an app for work with no problems - and it worked locally with no issues). So just know there are several ways to upload files to Databricks volumes.
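
If you do want to try that route, here’s a rough sketch of the PUT approach - it would replace the workspace_client.files.upload call inside the tempfile block, and it assumes your connector version supports volume ingestion commands (note the extra staging_allowed_local_path argument, which whitelists the local paths the connector is allowed to read from):

#alternative to workspace_client.files.upload: stage the local parquet file via SQL
with sql.connect(
    server_hostname=host,
    http_path=http_path,
    credentials_provider=lambda: config.authenticate,
    staging_allowed_local_path="./",
) as con:
    cursor = con.cursor()
    #file_path and destination_volume_file are the same variables from the code above
    cursor.execute(
        f"PUT '{file_path}' INTO '{destination_volume_file}' OVERWRITE"
    )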

That’s it! Run the app, and test loading data to Databricks! Next we’ll deploy our application!

Deploying the Application

Now that our application is all built, let’s deploy it!

Freeze Requirements

First you need to freeze the requirements. This stores the Python modules installed for your application, telling the app which specific modules it needs to install prior to launching.

If you installed pip-chill (from the requirements above), you can utilize pip-chill > requirements.txt; if you did not, you can always utilize pip: python3 -m pip freeze > requirements.txt. This will create a requirements.txt file that stores the modules you utilized.
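
With the modules from this tutorial, the resulting requirements.txt should look roughly like this (versions omitted - yours will vary, and pip freeze will also list every dependency):

databricks-sdk
databricks-sql-connector
pip-chill
polars
python-dotenv
streamlit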

Deploy the Application DAB

This is where DABs come in handy. We’re going to utilize the CLI to deploy our files/bundle automatically to Databricks:

databricks bundle deploy -t prod

Depending on where you want to deploy it, you can change the target by utilizing the -t argument.

This will deploy the files to the location within your databricks.yml file.

Granting the Application Permissions

You must ensure the application has permissions to access the required data (catalogs, schemas, tables, volumes). To do this, go to the Catalog Explorer and grant access to the required objects. Always go with the principle of least privilege here: ensure the application ONLY has the permissions required to perform the actions of your app. For this use case, I’m going to grant the app full access on the schema I created for it. However, each use case varies, so just be wise when granting permissions.

Link here: https://docs.databricks.com/aws/en/data-governance/unity-catalog/manage-privileges/

Run the Application

Once the application has the required permissions, now we can finally run the app! This is where the DAB and CLI come in clutch one more time!

From command line, run the following:

databricks bundle run app -t prod

You should see your app starting. Once started, you should be able to navigate to the app. The app URL should print within the command line; however, you can also go to the Databricks workspace and navigate to Compute > Apps - you should see your app there. Open the app up, and you’ll see the corresponding link to the app!

That’s it! You’ve deployed your first Databricks app!

Other Considerations

That’s it for developing and deploying a basic Databricks app. That being said, there are always many other considerations when developing an app like this!

Extract out Functionality

First and foremost, create functions to reuse across your app. In our example here we did not do that, but you definitely should.

You can create functions to:

  • Establish a connection to Databricks SQL warehouse

  • Execute a query against Databricks

  • Upload a file

  • Read a file

  • Many many more…

Goes without saying, but if you want your app to scale (and not have to constantly change code everywhere), always take a modular approach and extract the functionality you utilize regularly into something reusable.
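
For example, the connection and query logic we repeated on both pages could live in a shared module - here’s a rough sketch (the utils.py name and function names are just suggestions, not anything Databricks requires):

#utils.py - hypothetical shared helpers for the app
import os
import dotenv
import polars as pl
from databricks import sql
from databricks.sdk.config import Config

dotenv.load_dotenv()
config = Config()

def get_connection():
    #build a SQL warehouse connection from the same env vars used in the pages
    return sql.connect(
        server_hostname=os.getenv("DATABRICKS_HOST"),
        http_path=f"/sql/1.0/warehouses/{os.getenv('DATABRICKS_WAREHOUSE_ID')}",
        credentials_provider=lambda: config.authenticate,
    )

def run_query(query: str) -> pl.DataFrame:
    #execute a query and return the result as a polars dataframe
    with get_connection() as con:
        cursor = con.cursor()
        cursor.execute(query)
        return pl.from_arrow(cursor.fetchall_arrow())

A page then only needs from utils import run_query and a single function call.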

Folder Structure

Our application was very simple, but as you build larger applications, I highly recommend coming up with a folder structure that makes sense.

You may want to create a folder for:

  • Pages for your Streamlit app

  • Common functionality/utils

  • Themes (if you want to customize your app’s appearance)

Organization is key to having a scalable app, so just fiddle with it and organize it in a way that works for your application.
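
As a purely illustrative example, a layout along these lines keeps things tidy:

my-app/
├── app.py
├── app.yml
├── databricks.yml
├── requirements.txt
├── pages/
│   ├── query_page.py
│   └── upload_page.py
├── utils/
│   └── connections.py
└── .streamlit/
    └── config.toml    #Streamlit theme/config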

Streamlit

Streamlit has its own set of considerations. Outside of the two points above, there are also very awesome features such as data caching, session management, and navigation to consider when developing a Streamlit app. In particular, data caching is incredibly valuable, as it allows you to cache data returned from a function so you don’t have to constantly query it over and over again. So if you go the Streamlit route, make sure you read up on the documentation!
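
As a quick illustration, caching a query with Streamlit looks something like this (a sketch that assumes a run_query helper like the one sketched earlier):

import streamlit as st
from utils import run_query  #hypothetical helper module from the earlier sketch

#cache the result for 10 minutes so reruns don't hit the warehouse every time
@st.cache_data(ttl=600)
def load_trips():
    return run_query("select * from samples.nyctaxi.trips")

df = load_trips()
st.data_editor(df)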

Cost

Databricks apps utilize a serverless cluster to run. For enterprise/premium tiers within AWS, Azure, and GCP it’s ~$0.475 per hour, which equates to ~$12 a day (~$360 a month) if the app runs continuously. While this might not seem like a lot, if you’re in an enterprise where costs are monitored and controlled (depending on the size of your budget), running an app 24/7 might raise some eyebrows.

So a recommended solution (one we’ve taken within my organization) is to build a job that turns the app on/off at specific intervals utilizing the Databricks API:

We turn our app on at ~7am and turn it off at ~6pm, Monday through Friday, via a Databricks workflow. This allows us to minimize the cost during off hours when no one will be using the app.
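
A rough sketch of what that scheduled job could run via the Python SDK (which wraps the API) - this assumes your SDK version exposes the apps start/stop methods, and the app name is just the placeholder from our bundle:

from databricks.sdk import WorkspaceClient

#inside a Databricks job/notebook, WorkspaceClient() picks up authentication automatically
w = WorkspaceClient()

#placeholder: the deployed app name is ${bundle.target}-${var.app_name}, e.g. "prod-my-app"
APP_NAME = "prod-my-app"

#run the start call in the morning job and the stop call in the evening job
w.apps.start(name=APP_NAME)
#w.apps.stop(name=APP_NAME)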

Conclusion

Databricks apps are an amazing tool in any Databricks-focused professional’s toolbelt. They allow you to build an application that might feel a bit better for some stakeholders to utilize, as well as build some functionality that might not be as intuitive in a notebook/SQL-only UI. Even though there are quite a few steps, once you go through them a few times, it follows a very similar pattern and becomes a bit of rinse and repeat.

Hope you found this helpful, and good luck in developing your Databricks app!

Full Solution

Located here!
