Secure Databricks Access to Azure Data Lake Gen2 via Service Principal and Azure Key Vault

Introduction

In the world of big data analytics, securing access to your data storage is paramount. As organizations increasingly adopt cloud-based solutions, the need for robust, scalable, and secure data access mechanisms becomes crucial. This blog post will guide you through the process of setting up secure access from Azure Databricks to Azure Data Lake Storage Gen2 using a service principal and Azure Key Vault. To help explain key concepts and provide a visual understanding, we'll walk through the steps using the Azure portal. This approach will make it easier to follow along and understand the relationships between different Azure services. Importantly, we'll emphasize that this entire setup can and should be implemented using Infrastructure as Code (IaC). By leveraging IaC tools like Terraform or Azure Resource Manager templates, you can ensure reproducibility, version control, and easier management of your cloud resources.

We'll cover:

  1. Creating an Azure Data Lake Storage Gen2 account

  2. Setting up a service principal for secure, non-interactive authentication

  3. Utilizing Azure Key Vault to safely store and manage secrets

  4. Configuring Databricks to access Data Lake Storage using the service principal

  5. Writing code to securely read data from your storage account

Whether you're a data engineer, a cloud architect, or a curious developer, this guide will provide you with the knowledge to implement a secure, scalable solution for accessing your data lake from Databricks. Let's dive in and explore how to leverage Azure's powerful tools to create a secure data analytics environment.

Step 1: Create Azure Data Lake Storage Gen2

  1. Log in to the Azure Portal (https://portal.azure.com).

  2. Click "Create a resource" and search for "Storage account".

  3. Click "Create" on the Storage account page.

  4. Fill in the basics:

    • Choose your subscription and resource group

    • Enter a unique storage account name

    • Select a region

    • Choose Performance option:

      • Standard (HDD-based): Suitable for most data lake scenarios, cost-effective for large datasets.

      • Premium (SSD-based): For high-performance needs, but more expensive.

    • Select Redundancy option:

      • Locally Redundant Storage (LRS)

      • Zone-Redundant Storage (ZRS)

      • Geo-Redundant Storage (GRS)

      • Geo-Zone-Redundant Storage (GZRS)

  5. On the "Advanced" tab, enable "Hierarchical namespace" under "Data Lake Storage Gen2".

  6. For this demo, choose Standard performance with Locally Redundant Storage (LRS) to save on costs. Review and create your storage account. (If you prefer to script this step, a hedged SDK sketch follows below.)
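If you want the IaC-style approach mentioned in the introduction, the same account can be provisioned programmatically instead of through the portal. The following is a minimal sketch using the Azure Python SDK (azure-identity and azure-mgmt-storage); the subscription ID, resource group, account name, and region are placeholders you would replace with your own values.

# Minimal sketch: create an ADLS Gen2 account (Standard, LRS, hierarchical namespace)
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

subscription_id = "<subscription-id>"            # placeholder
resource_group = "<resource-group-name>"         # placeholder
account_name = "<unique-storage-account-name>"   # placeholder: 3-24 lowercase letters/digits

client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

params = StorageAccountCreateParameters(
    location="eastus",            # placeholder region
    sku=Sku(name="Standard_LRS"), # Standard performance + locally redundant storage
    kind="StorageV2",
    is_hns_enabled=True,          # hierarchical namespace = Data Lake Storage Gen2
)

# begin_create returns a poller; result() waits for the account to be provisioned
account = client.storage_accounts.begin_create(resource_group, account_name, params).result()
print(account.name, account.provisioning_state)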

Step 2: Create a Service Principal

  1. In the Azure Portal, go to Azure Active Directory (now called Microsoft Entra ID).

  2. Navigate to "App registrations" and click "New registration".

  3. Give your app a name.

  4. Under "Supported account types", you'll see several options:

    a) Accounts in this organizational directory only (Single tenant)

    b) Accounts in any organizational directory (Multi-tenant)

    c) Accounts in any organizational directory (Multi-tenant) and personal Microsoft accounts

    d) Personal Microsoft accounts only

    For this tutorial, choose "Accounts in this organizational directory only (Single tenant)". To learn more about these account types, visit https://learn.microsoft.com/en-us/security/zero-trust/develop/identity-supported-account-types

  5. Click "Register". Once the app is created, note down the Application (client) ID and Directory (tenant) ID.

  6. Go to "Certificates & secrets", create a new client secret, and immediately copy its value; it is shown only once. You will store this value in Key Vault in Step 4 (item 3.c, "Client Secret"). If you want to sanity-check the new credentials outside the portal, a hedged sketch follows below.

Step 3: Grant Access to the Storage Account

  1. Go to your storage account in the Azure Portal.

  2. Click on "Access Control (IAM)" and then "Add" > "Add role assignment".

  3. Choose "Storage Blob Data Contributor" role.

  4. In "Assign access to", choose "User, group, or service principal".

  5. Search for and select your service principal.

  6. Click "Review + assign" to finish.

Step 4: Set Up Azure Key Vault

  1. In the Azure Portal, create a new Key Vault resource if you haven't already.

  2. Once created, go to "Secrets" in your Key Vault.

  3. Add new secrets for your service principal:

    a) Client ID Secret:

    • Click "Generate/Import"

    • Name: "demo-client-id" (or your preferred name)

    • Value: Paste your service principal's Application (client) ID

    • Click "Create"

    b) Tenant ID Secret:

    • Click "Generate/Import"

    • Name: "demo-tenant-id" (or your preferred name)

    • Value: Paste your Azure AD tenant ID

    • Click "Create"

    c) Client Secret:

    • Click "Generate/Import"

    • Name: "demo-client-secret" (or your preferred name)

    • Value: Paste your service principal's secret value (copied in Step 2, item 6, under "Certificates & secrets")

    • Click "Create"

  4. Ensure all three secrets are successfully created and listed in your Key Vault. (An optional sketch for creating them with the Python SDK follows below.)
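If you'd rather not click through the portal, the same secrets can be created with the Azure Python SDK. This is an optional sketch assuming the azure-identity and azure-keyvault-secrets packages; the vault name and secret values are placeholders.

# Optional sketch: create the three secrets in Key Vault programmatically
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

vault_url = "https://<key-vault-name>.vault.azure.net/"  # placeholder vault URI
client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())

# Store the service principal details under the names used later from Databricks
client.set_secret("demo-client-id", "<application-client-id>")
client.set_secret("demo-tenant-id", "<directory-tenant-id>")
client.set_secret("demo-client-secret", "<client-secret-value>")

# Confirm all three secrets are listed
for secret in client.list_properties_of_secrets():
    print(secret.name)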

Step 5: Configure Databricks Secret Scope

  1. Go to your Databricks workspace.

  2. Navigate to https://<your-databricks-url>#secrets/createScope (replace with your actual Databricks URL).

  3. Create a new scope:

    • Enter a name for your scope

    • Choose "Azure Key Vault" as the backend

    • Enter your Key Vault DNS name and resource ID

To find the Key Vault DNS name and resource ID:

a) Key Vault DNS name:

  • In the Azure Portal, go to your Key Vault resource

  • In the Overview page, look for "Vault URI"

  • The DNS name is this Vault URI (it looks like https://<key-vault-name>.vault.azure.net/)

b) Resource ID:

  • In the Azure Portal, go to your Key Vault resource

  • In the Overview page, look for "Resource ID"

  • Click the "Copy to clipboard" icon next to the Resource ID

  • It should look something like this: "/subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.KeyVault/vaults/<key-vault-name>"

Enter these values in the Databricks secret scope creation form to link your Databricks workspace with your Azure Key Vault.
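Once the scope is created, a quick way to confirm the link works is to list it from a notebook. This is an optional check; "your-scope-name" stands in for whatever scope name you chose.

# The new scope should appear in this list
print(dbutils.secrets.listScopes())

# Lists the secret names backed by the Key Vault (values are never shown)
print(dbutils.secrets.list("your-scope-name"))

# Secret values are redacted if you try to print them in a notebook
print(dbutils.secrets.get(scope="your-scope-name", key="demo-client-id"))  # prints [REDACTED]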

Step 6: Access Data Lake from Databricks

Now, in your Databricks notebook, you can use the following code to access your Data Lake Storage:

# Set up the service principal credentials from the Databricks secret scope
# (use the scope name from Step 5 and the secret names you created in Step 4)
client_id = dbutils.secrets.get(scope="your-scope-name", key="demo-client-id")
client_secret = dbutils.secrets.get(scope="your-scope-name", key="demo-client-secret")
tenant_id = dbutils.secrets.get(scope="your-scope-name", key="demo-tenant-id")

storage_account_name = "your-storage-account-name"

# Configure Spark to authenticate to ADLS Gen2 with the service principal (OAuth client credentials)
spark.conf.set(f"fs.azure.account.auth.type.{storage_account_name}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account_name}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account_name}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account_name}.dfs.core.windows.net", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# Now you can access your storage
container_name = "your-container-name"
file_name = "your-file-name"

# Read and display the first few rows of the CSV file
df = (spark.read.format("csv")
      .option("header", True)
      .option("sep", ",")
      .load(f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/{file_name}.csv"))
display(df.limit(5))  # Display the first 5 rows

Conclusion

By following these steps, you've created a secure pipeline to access your Azure Data Lake Storage Gen2 from Databricks. Using a service principal and Azure Key Vault ensures that your credentials are kept secure and can be easily rotated without changing your code. This approach is scalable and suitable for production environments. Remember to always follow security best practices, such as regularly rotating your secrets and applying the principle of least privilege when assigning roles and permissions.

