Mount Points in Databricks
What is DBFS?
DBFS stands for Databricks File System. It's a distributed file system that's part of the Databricks Unified Data Analytics Platform. DBFS provides a scalable and reliable way to store data across various Databricks clusters. It's designed to be compatible with the Hadoop Distributed File System (HDFS), making it easy to migrate existing Hadoop workloads to Databricks.
Overall, DBFS simplifies data management and access within the Databricks environment, enabling users to focus on data analysis and insights rather than worrying about underlying data storage and infrastructure.
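To make this concrete, here is a minimal sketch of interacting with DBFS from a notebook using dbutils and Spark; the /tmp/dbfs_demo path is just an illustrative placeholder.
# Write a small text file to DBFS (dbutils is available in Databricks notebooks)
dbutils.fs.put("/tmp/dbfs_demo/hello.txt", "Hello, DBFS!", True)
# List the directory and read the file back with Spark
display(dbutils.fs.ls("/tmp/dbfs_demo/"))
df = spark.read.text("/tmp/dbfs_demo/hello.txt")
df.show()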
What are Mount Points in Databricks?
I'll illustrate the concept of mount points in Databricks by drawing an analogy to bridges connecting your Databricks File System (DBFS) to cloud object storage.
Think of your Databricks workspace as an island (DBFS) floating in the middle of a vast ocean of data (cloud object storage like AWS S3, Azure Blob Storage, or Google Cloud Storage). To access the resources on this island, you need bridges (mount points) connecting it to the mainland (cloud storage).
Each bridge (mount point) serves as a connection between the island (DBFS) and a specific location on the mainland (cloud object storage). When you create a bridge (mount point), you're essentially building a path that allows you to walk back and forth between the island and the mainland, bringing resources from the mainland into the island's territory.
Once a bridge (mount point) is set up, the data on the mainland (cloud storage) becomes easily reachable from the island (DBFS). You can explore the island's landscape (DBFS directory structure) and interact with the mainland data as if it were part of the island itself.
Azure Databricks mounts create a link between a workspace and cloud object storage, which enables you to interact with cloud object storage using familiar file paths relative to the Databricks file system. Mounts work by creating a local alias under the /mnt directory (a virtual directory within DBFS) that stores the following information (illustrated in the short sketch after this list):
- The location of the cloud object storage.
- Driver specifications to connect to the storage account or container.
- Security credentials required to access the data.
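As a quick illustration (a sketch; /mnt/sales is a hypothetical mount name), you can list the mounts defined in a workspace and browse a mounted container exactly like any other DBFS folder:
# List every mount in the workspace, including the default DBFS mounts
for m in dbutils.fs.mounts():
    print(m.mountPoint, "->", m.source)
# Browse a mounted container as if it were a local DBFS directory
display(dbutils.fs.ls("/mnt/sales/"))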
How Can I Mount Cloud Object Storage on DBFS?
In this context, we're primarily focused on Azure Storage, one of the cloud storage providers supported by Databricks.
When you create a mount point in Databricks, you typically have the option to authenticate using an Access Key, Microsoft Entra ID, or a Shared Access Signature (SAS) token, depending on the cloud storage provider you're connecting to.
What is the syntax for mounting storage?
dbutils.fs.mount(
    source: str,
    mount_point: str,
    encryption_type: Optional[str] = "",
    extra_configs: Optional[dict[str:str]] = None
)
- source: the URI of the object storage.
- mount_point: the local path in the /mnt directory.
- extra_configs: optional configuration specifications passed as a dictionary; certain source formats and access patterns require them (for example, driver settings and credentials).
Creating a mount point
1.1) Creating a mount point directly with an Access Key or SAS token
# Define your storage account name, container name, and credentials
storage_account_name = "<storage-account-name>"
container_name = "<container-name>"
access_key = "<storage-account-access-key>"
sas_token = "<shared-access-signature>"
# Define the path for the mount point
mount_point_name = "/mnt/<mount-point-name>"
# Define the source URL with the container and storage account name (ABFS driver, ADLS Gen2)
url = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/"
# Option 1: mount using the storage account access key
dbutils.fs.mount(
    source = url,
    mount_point = mount_point_name,
    extra_configs = {f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net": access_key}
)
# Option 2: mount using a SAS token (Blob endpoint with the WASB driver)
# source = f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net"
# dbutils.fs.mount(
#     source = source,
#     mount_point = mount_point_name,
#     extra_configs = {f"fs.azure.sas.{container_name}.{storage_account_name}.blob.core.windows.net": sas_token}
# )
Two authentication methods are shown above: an Access Key (Option 1) and a SAS token (Option 2). Use one or the other: run the option you need and leave the other commented out.
1.2) Creating a mount point using a Secret Scope
# Define your storage account, container, and secret scope names
storage_account_name = "<storage-account-name>"
container_name = "<container-name>"
secret_scope_name = "<secret-scope-name>"
# Retrieve the storage account access key from the secret scope
access_key = dbutils.secrets.get(scope=secret_scope_name, key="<storage-account-access-key-name>")
# Define the path for the mount point
mount_point_name = "/mnt/<mount-point-name>"
# Define the source URL with the container and storage account name
url = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/"
# Mount the storage using the access key retrieved from the secret scope
dbutils.fs.mount(
    source = url,
    mount_point = mount_point_name,
    extra_configs = {f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net": access_key}
)
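As a quick sanity check (a sketch; replace the placeholder scope name), you can confirm that the secret scope and key you reference actually exist. Secret values themselves are redacted when printed from a notebook.
# List the secret scopes visible to the workspace
print([s.name for s in dbutils.secrets.listScopes()])
# List the keys stored in the scope used above
print([k.key for k in dbutils.secrets.list("<secret-scope-name>")])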
Remember, the configuration above is specific to an Azure ADLS Gen2 storage account; adjustments are necessary for other cloud providers.
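The Microsoft Entra ID option mentioned earlier uses a service principal and OAuth instead of a key or SAS token. The sketch below assumes you have registered an application, granted it access to the storage account, and stored its client secret in a secret scope; all angle-bracket names are placeholders.
# OAuth configuration for ADLS Gen2 using a Microsoft Entra ID service principal
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-client-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<secret-scope-name>", key="<service-credential-key-name>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-tenant-id>/oauth2/token"
}
# Mount the container using the service principal credentials
dbutils.fs.mount(
    source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point = "/mnt/<mount-point-name>",
    extra_configs = configs
)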
1.3) Creating a mount point to connect to an AWS S3 bucket
# Define the S3 bucket and mount point names
bucket_name = "ctsharedcontainer"
mount_point_name = "/mnt/HRM_3000_S3"
# Retrieve the AWS credentials from a secret scope
aws_s3_access_key = dbutils.secrets.get(scope="ADLS-SHARED", key="aws-access-key")
aws_s3_secret_key = dbutils.secrets.get(scope="ADLS-SHARED", key="aws-secret-key")
# Mount the S3 bucket using the S3A driver
dbutils.fs.mount(
    source = f"s3a://{bucket_name}",
    mount_point = mount_point_name,
    extra_configs = {
        "fs.s3a.access.key": aws_s3_access_key,
        "fs.s3a.secret.key": aws_s3_secret_key}
)
For additional information, please refer to the Databricks documentation on connecting to AWS resources.
How Can I Access Data Stored in Cloud Object Storage Using Mount Points?
Once mounted, accessing your data (e.g., Delta Table) is as straightforward as referencing the mount point in your data operations:
# Using Spark, read the Delta table at the mounted path
df = spark.read.format("delta").load("/mnt/my_mount_point/my_data")
# Using Spark, write a Delta table back to the mounted path
df.write.format("delta").mode("overwrite").save("/mnt/my_mount_point/delta_table")
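The same mounted path can also be queried with Spark SQL using the delta.`<path>` syntax; a brief sketch, assuming the Delta table written above exists:
# Query the Delta table at the mounted path with Spark SQL
row_count = spark.sql("SELECT COUNT(*) AS row_count FROM delta.`/mnt/my_mount_point/delta_table`")
row_count.show()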
Unmount a mount point
To unmount a mount point, use the following command:
dbutils.fs.unmount("/mnt/<mount-point-name>")
You can list your existing mount points using the below dbutils command:
dbutils.fs.mounts()
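Calling unmount on a path that is not mounted raises an error, so a defensive pattern (a sketch; the mount name is a placeholder) is to check the mount table first:
# Unmount only if the mount point actually exists
mount_point_name = "/mnt/<mount-point-name>"
if any(m.mountPoint == mount_point_name for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_point_name)
else:
    print(f"{mount_point_name} is not mounted")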
Key points
- When performing mount operations, manage your credentials with secret scopes and never expose raw secrets in notebook code.
- Keep your mount points up to date: if a source no longer exists in the storage account, remove the corresponding mount point from Databricks as well (see the short sketch after this list).
- Using the same mount point name as your container name can make things easier if you have many mount points; especially if you come back to your workspace after some time, you can easily match them in Azure Storage Explorer.
- Don't put non-mount-point folders or other files in the /mnt/ directory; they will only cause confusion.
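Related to keeping mounts up to date: clusters that were already running when a mount was added or removed may still see the old mount table. A small sketch of forcing a refresh on the current cluster:
# Refresh the mount cache so this cluster picks up the latest mount information
dbutils.fs.refreshMounts()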
Conclusion
In summary, mount points in Databricks act as bridges between the platform and external data sources like AWS S3, Azure Blob Storage, or Google Cloud Storage. They enable seamless integration, allowing users to access and analyze data stored in these sources directly within Databricks. Mount points streamline data workflows, enhance collaboration, and maximize the platform's analytical capabilities.