Strategies for Retrieving Files From Azure Cloud in Databricks
Introduction
If you search your preferred search engine for how to access ADLS Gen2 from a Databricks notebook, it will probably point you to a great reference already. In this blog, we aim to consolidate and simplify these methods, providing clear instructions and insights into various access techniques. Whether you're accessing data from a Databricks notebook or any other environment, this guide aims to be your comprehensive reference point.
This article explains how to connect to Azure Data Lake Storage Gen2 and Blob Storage from Azure Databricks.
Prerequisites
For this tutorial, you will need:
An Azure Data Lake Storage Gen2 or Blob Storage account (reference: Create a storage account).
An Azure Databricks service (reference: Create an Azure Databricks workspace).
A Databricks Cluster (compute).
A Databricks Notebook.
Integrate ADLS with Databricks:
There are four ways of accessing Azure Data Lake Storage Gen2 in Databricks:
Mount an Azure Data Lake Storage Gen2 filesystem to DBFS.
Use a service principal directly.
Use the Azure Data Lake Storage Gen2 storage account access key or a shared access signature (SAS) token directly.
Pass your Azure Active Directory credentials, also known as a credential passthrough.
In this article, we will integrate using a SAS token and an account access key (the third option above).
The following credentials can be used to access Azure Data Lake Storage Gen2 or Blob Storage:
OAuth 2.0 with a Microsoft Entra ID Service Principal: Databricks recommends using OAuth 2.0 with a Microsoft Entra ID (formerly Azure Active Directory) service principal for connecting to Azure Storage. This method provides secure access by leveraging authentication tokens and fine-grained access control. To create a service principal, users must have the necessary permissions, such as the Application Administrator role or the Application.ReadWrite.All permission in Microsoft Entra ID. A minimal configuration sketch is shown after this list.
To create a Microsoft Entra ID service principal and provide it access to Azure storage accounts, go through this reference.
Shared Access Signatures (SAS): Another option for accessing Azure Storage is using shared access signature (SAS) tokens. SAS tokens allow users to restrict access to storage accounts using temporary tokens with specific permissions. However, users can only grant SAS token permissions that they themselves have on the storage account, container, or file. Reference: Generate a SAS token.
Account Keys: While storage account access keys provide full access to the configuration and data of a storage account, Databricks advises against using them due to security concerns. Instead, Databricks recommends using service principals or SAS tokens for connecting to Azure Storage. To view an account's access keys, users must have the Owner, Contributor, or Storage Account Key Operator Service role on the storage account.
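For completeness, here is a minimal sketch of what the service principal configuration looks like in a notebook. The placeholders such as <application-id>, <directory-id>, and the secret scope names are assumptions you would replace with your own values; a follow-up post will cover this option in more depth.
# Minimal sketch: OAuth 2.0 with a Microsoft Entra ID service principal (all placeholder values are assumptions)
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net",
               dbutils.secrets.get(scope="<scope>", key="<service-credential-key>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net",
               "https://login.microsoftonline.com/<directory-id>/oauth2/token")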
Part 1: Quick and Insecure
Direct Access with a SAS Token or Access Key
# Directly using the account key
spark.conf.set("fs.azure.account.key.<storage-account>.dfs.core.windows.net",
               "<access-key>")
# Directly using a SAS token (the ABFS driver uses the SAS-specific settings below)
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", "<sas-token>")
For the account access key, replace the following placeholders:
<storage-account> with the Azure Storage account name.
<access-key> with the storage account access key.
For the SAS token, replace the following placeholders:
<storage-account> with the Azure Storage account name.
<sas-token> with the Azure storage SAS token.
If you keep the SAS token in a Databricks secret scope instead (see Part 2), replace <scope> with the secret scope name and <sas-token-key> with the name of the key containing the SAS token, and fetch it with dbutils.secrets.get.
Note that in Databricks documentation, if something is shown between <> brackets, you are supposed to replace the whole thing, including the <> brackets, with something specific to your project. If something is shown between "<>", the double quotes stay and only the <> brackets have to go.
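As a concrete example of that convention, with a purely hypothetical storage account named mystorageacct and a shortened fake key:
# Placeholder form, as shown in the documentation:
# spark.conf.set("fs.azure.account.key.<storage-account>.dfs.core.windows.net", "<access-key>")
# After replacement -- the angle brackets go, the double quotes stay:
spark.conf.set("fs.azure.account.key.mystorageacct.dfs.core.windows.net", "q1w2e3r4...==")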
Now you can read the files you stored in an ADLS container from your notebook:
# List the contents of a directory in Azure Data Lake Storage Gen2
# (container_name and storage_account_name are Python variables holding your container and storage account names)
dbutils.fs.ls('abfss://' + container_name + '@' + storage_account_name + '.dfs.core.windows.net/dump/')
# For example, here I'm reading a CSV file
df = spark.read.csv("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path/to/file>")
# Actual code
df = spark.read.csv('abfss://databricksdemo@bhanustoragedemo.dfs.core.windows.net/dump/file.csv')
The methods above are not secure because the access key and SAS token are present directly in the code. Storing credentials and sensitive information inside code is not advisable from a data security and privacy standpoint, since anyone with access to the code can get hold of the authentication keys, which would allow them to fetch or update data without your permission.
Part 2: Quick, a Little Less Dirty, and More Secure
Access Files Using a Secret Scope
To gain a deeper understanding of Databricks Secret Scopes, please refer to the linked article.
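As a quick sanity check before wiring the scope into your Spark configuration, you can list the scopes and keys visible to your workspace (a small sketch; <scope_name> is a placeholder for your own scope):
# List all secret scopes available in this workspace
dbutils.secrets.listScopes()
# List the keys stored in a specific scope (secret values themselves are never printed)
dbutils.secrets.list("<scope_name>")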
Access files using a secret scope (without a mount point):
spark.conf.set("fs.azure.account.key.<storage_account>.dfs.core.windows.net", dbutils.secrets.get(scope = "<scope_name>", key = "<scope_key>"))
After executing the above command, you can read files from ADLS.
Now, we are going to create a DataFrame using the spark object, apply a simple transformation, and save it back (in a different file format) to Azure Data Lake, which can store any type of file.
# Reading a file as a DataFrame
df = spark.read.csv("abfss://<container>@<storage_account>.dfs.core.windows.net/path/to/file", header=True)
# Taking the first 10 rows and displaying them
df1 = df.limit(10)
display(df1)
# Writing the DataFrame back to ADLS in Parquet format
df1.write.format('parquet').save("abfss://<container>@<storage_account>.dfs.core.windows.net/output")
Once it executes successfully, you can see a new folder named 'output' created in Azure Storage under the container you specified.
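You can also confirm the write directly from the notebook (a short sketch reusing the same placeholders as above):
# Verify that the Parquet output exists
dbutils.fs.ls("abfss://<container>@<storage_account>.dfs.core.windows.net/output")
# Optionally read it back to check the contents
parquet_df = spark.read.parquet("abfss://<container>@<storage_account>.dfs.core.windows.net/output")
parquet_df.show(5)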
Similarly, we can read a file using a mount point name:
# Reading a CSV file through an existing DBFS mount point
csv_df = spark.read.csv(path='dbfs:/mnt/HRM_3000/flat_files/customers.csv', inferSchema=True, header=True)
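For reference, a mount point such as /mnt/HRM_3000 is typically created once per workspace with dbutils.fs.mount. The sketch below shows one common pattern that passes the storage account key from a secret scope via extra_configs (all names are placeholders, and mounting with OAuth 2.0 and a service principal will be covered in the follow-up posts):
# One-time setup sketch: mount an ADLS Gen2 container to DBFS (placeholder names are assumptions)
dbutils.fs.mount(
    source="abfss://<container>@<storage_account>.dfs.core.windows.net/",
    mount_point="/mnt/HRM_3000",
    extra_configs={
        "fs.azure.account.key.<storage_account>.dfs.core.windows.net":
            dbutils.secrets.get(scope="<scope_name>", key="<scope_key>")
    }
)
# Once mounted, the files are available under dbfs:/mnt/HRM_3000/ as used above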
Conclusion
In conclusion, accessing files from Azure Data Lake Storage (ADLS) in Databricks can be achieved using various methods, including direct SAS token or storage account access key, as well as leveraging Databricks secret scopes for secure credential management. Each approach offers its own set of advantages and considerations. Direct SAS tokens provide temporary and scoped access to specific resources, ideal for temporary tasks or one-time access. On the other hand, storage account access keys offer long-term access but require careful management to maintain security. Databricks secret scopes provide a secure and centralized way to manage credentials, enhancing security and compliance while simplifying access control. By understanding these methods, users can choose the approach that best fits their requirements for accessing and processing data in Databricks from Azure Data Lake Storage.
Stay Connected
I am working on follow-up posts to give more details on the options of mounting and accessing storage using OAuth 2.0 and a service principal.
In the meantime: if you're stuck, or some of the above doesn't work for you, feel free to leave a comment with your question, as I or other Medium readers might be able to help out! :)
Written by
Kiran Reddy
Passionate Data Engineer with a degree from Lovely Professional University. Enthusiastic about leveraging data to drive insights and solutions.