Calculating Folder Size In The Lakehouse

This is more of a personal note to self than a blog post. When running various test scenarios, I measure the size of the lakehouse by adding up the sizes of its folders (Files and Tables). The code I use is below. I'm sharing it here to save myself the trouble of searching for it and to help others who may find it useful. It's nothing special; I simply use multiprocessing's Pool to compute the folder sizes in parallel, which helps when there are a large number of folders.
import os
import pandas as pd
from multiprocessing import Pool, cpu_count

def get_size_of_folder(folder_path):
    """
    fabric.guru | 08-09-2023
    Calculate the total size of all files in the given folder.

    Args:
    - folder_path (str): Path to the folder.

    Returns:
    - tuple: (folder_path, size in MB)
    """
    # Walk the folder recursively and add up the size of every file found
    total_size = sum(
        os.path.getsize(os.path.join(dirpath, f))
        for dirpath, _, filenames in os.walk(folder_path)
        for f in filenames
    )
    size_in_mb = total_size / (1024 * 1024)
    return folder_path, size_in_mb

def get_folder_sizes(base_path):
    """
    fabric.guru | 08-09-2023
    Get the sizes of all folders in the given base path.

    Args:
    - base_path (str): Base directory path.

    Returns:
    - DataFrame: DataFrame with columns 'Folder' and 'Size (MB)'.
    """
    # Only the immediate subfolders of base_path are measured;
    # files sitting directly in base_path are not counted
    folders = [
        os.path.join(base_path, folder)
        for folder in os.listdir(base_path)
        if os.path.isdir(os.path.join(base_path, folder))
    ]
    # One worker process per CPU core, one folder per task
    with Pool(cpu_count()) as p:
        sizes = p.map(get_size_of_folder, folders)
    df = pd.DataFrame(sizes, columns=['Folder', 'Size (MB)'])
    return df

## Return a pandas dataframe containing two columns Folder & Size (MB)
## This will scan the folders from the lakehouse mounted in the notebook
## Use File API path and not the ABFSS or https path
df = get_folder_sizes("/lakehouse/default/Files")
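
Since the goal is the size of the whole lakehouse (Files and Tables), here is a minimal sketch of how the two totals could be combined. It is not the code above: it swaps the process Pool for a thread pool from concurrent.futures (file-size lookups are mostly I/O bound, so threads are likely sufficient), and it skips files that cannot be read instead of raising. The helper names folder_size_mb and lakehouse_size_mb are hypothetical, and the /lakehouse/default/Tables path is an assumption that mirrors the Files path used above.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def folder_size_mb(folder_path):
    """Total size in MB of all files under folder_path (hypothetical helper)."""
    total = 0
    for dirpath, _, filenames in os.walk(folder_path):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # File vanished or is unreadable (e.g. a broken symlink); skip it
                continue
    return total / (1024 * 1024)


def lakehouse_size_mb(roots):
    """Map each root folder that exists to its size in MB (hypothetical helper)."""
    existing = [r for r in roots if os.path.isdir(r)]
    # Threads instead of processes: an assumption that the work is I/O bound
    with ThreadPoolExecutor() as pool:
        sizes = list(pool.map(folder_size_mb, existing))
    return dict(zip(existing, sizes))


if __name__ == "__main__":
    # Assumed File API paths of the default lakehouse mounted in the notebook
    sizes = lakehouse_size_mb(
        ["/lakehouse/default/Files", "/lakehouse/default/Tables"]
    )
    for root, mb in sizes.items():
        print(f"{root}: {mb:.2f} MB")
    print(f"Total: {sum(sizes.values()):.2f} MB")
```

Roots that do not exist are silently left out of the result, so the same call works whether or not both folders are present.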
Written by

Sandeep Pawar
Microsoft MVP with expertise in data analytics, data science and generative AI using Microsoft data platform.