Uploading Large Files To Google Bucket Storage Using Django
As the systems we work with generate files and their data flow grows constantly, the size of these files also increases, and we can run into timeout or memory issues when uploading them. A solution is to upload these files in chunks: break them into small pieces and upload one piece at a time. Here you will learn how to upload large files to Google Cloud Storage in chunks using Python, with Django.
Before anything else, you must have a Google Cloud Platform (GCP) account, a GCP project, and Google Cloud Storage enabled. It also helps to have some knowledge of the Django framework.
To make it easier, let's assume that you already have a Django project and need to read a large amount of data from your database to create the file.
Install the Google Cloud Storage dependency in your project:
pip install google-cloud-storage
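Note that storage.Client() needs credentials to reach your project. A common approach, shown as a minimal sketch here, is to point the GOOGLE_APPLICATION_CREDENTIALS environment variable at a service account key file (the path below is a placeholder, not a real one):

import os

# Placeholder path -- point this at your own service account key file.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service-account.json'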
First, we will create a function that sends the pieces to the bucket. We define the size of each piece in the CHUNK_SIZE variable as 30 MB.
from google.cloud import storage

def upload_file_chunk(export_file_path, file, bucket_name, chunk_size):
    try:
        storage_client = storage.Client()
        bucket = storage_client.bucket(bucket_name)
        file.seek(0)
        # chunk_size must be a multiple of 256 KB
        blob = bucket.blob(export_file_path, chunk_size=chunk_size)
        blob.upload_from_file(file)
        return True
    except Exception:
        # in production, log the exception before returning
        return False
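Before wiring this to the database, you can sanity-check the function with a small in-memory file. The file path and bucket name below are placeholders:

import io

sample = io.BytesIO(b'id,name,price\n1,notebook,999.90\n')
ok = upload_file_chunk(
    export_file_path='exports/sample.csv',
    file=sample,
    bucket_name='your_bucket_name',
    chunk_size=256 * 1024,  # smallest valid value: one multiple of 256 KB
)
print('uploaded:', ok)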
Now, since we are reading the data from the database, we will create a function that fetches it and sends it to the function that writes to the bucket.
The size of the data sent to the upload_file_chunk function should match the value defined in the CHUNK_SIZE variable. If the data sent is smaller than chunk_size, Google Storage understands that it is the last chunk. This will be shown in the function below.
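Since the client library only accepts chunk sizes that are multiples of 256 KB, it is worth validating the value up front. The is_valid_chunk_size helper below is written for this article, not part of the library:

def is_valid_chunk_size(chunk_size):
    # Google Cloud Storage requires chunk sizes in multiples of 256 KB
    return chunk_size % (256 * 1024) == 0

print(is_valid_chunk_size(1024 * 1024 * 30))  # True: 30 MB = 120 x 256 KB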
In this example, we have a PostgreSQL database, where we will run a paginated query.
import io
import psycopg2

CHUNK_SIZE = 1024 * 1024 * 30  # should be a multiple of 256 KB
PAGE_SIZE = 10000

def fetch_products_with_pagination(page, page_size, db_config):
    conn = psycopg2.connect(**db_config)
    cursor = conn.cursor()
    offset = (page - 1) * page_size
    # ORDER BY keeps the pagination deterministic across queries
    query = """
        SELECT id, name, price
        FROM products
        ORDER BY id
        LIMIT %s OFFSET %s
    """
    cursor.execute(query, (page_size, offset))
    results = cursor.fetchall()
    cursor.close()
    conn.close()
    return results
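As a quick usage sketch, you can fetch the first page directly. The connection values below are placeholders for your own settings:

# placeholder connection settings -- replace with your own
db_config = {
    'host': 'localhost',
    'dbname': 'mydb',
    'user': 'postgres',
    'password': 'postgres',
}

first_page = fetch_products_with_pagination(1, PAGE_SIZE, db_config)
print(f'{len(first_page)} rows fetched')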
def create_chunks(chunk_size, page_size, db_config):
    page = 1
    last_fetch = False
    len_size_encoded = 0
    rest = io.BytesIO()
    while True:
        # the function will bring one page of results until the last page
        result = fetch_products_with_pagination(page, page_size, db_config)
        size_result = len(result)
        page += 1
        str_lines = ''
        for line in result:  # here each row is converted to a CSV string
            product_id, name, price = line
            str_lines += f'{product_id},{name},{price}\n'
        if size_result < page_size:
            last_fetch = True
        position_bytes = 0  # cursor position in the file object
        encoded = str_lines.encode()  # convert the string to bytes
        len_size_encoded += len(encoded)
        # append the new bytes to the buffer; this buffer may still hold
        # a part that was not yet uploaded
        rest.write(encoded)
        code_bytes = rest
        code_bytes.seek(0)  # moves the cursor to the first position
        while len_size_encoded > chunk_size:
            upload_file_chunk(
                export_file_path='your_path',
                file=code_bytes,
                bucket_name='your_bucket_name',
                chunk_size=chunk_size,
            )
            len_size_encoded -= chunk_size
            position_bytes += chunk_size
            code_bytes.seek(position_bytes)
        rest = io.BytesIO()
        # the remaining content not yet uploaded is carried over here,
        # as it is smaller than the chunk_size
        rest.write(code_bytes.read())
        # on the last fetch, the content still in the buffer is smaller
        # than chunk_size; sending a final piece smaller than chunk_size
        # tells Google Storage that it is the last part of the file
        if last_fetch:
            rest.seek(0)
            upload_file_chunk(
                export_file_path='your_path',
                file=rest,
                bucket_name='your_bucket_name',
                chunk_size=chunk_size,
            )
            break

create_chunks(CHUNK_SIZE, PAGE_SIZE, db_config)
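Since this runs inside a Django project, the export can also be triggered from a view instead of at module level. The view below is a minimal sketch for illustration, not part of the original code; for very large exports you would typically move the call into a background task (Celery, for example):

from django.http import JsonResponse

def export_products_view(request):
    # runs the full chunked export synchronously, then reports the result
    create_chunks(CHUNK_SIZE, PAGE_SIZE, db_config)
    return JsonResponse({'status': 'export finished'})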
This way you can upload your files to a safe place like Google Cloud Storage without worrying about upload timeouts or memory issues.
Thank you!
Written by Pedro Henrique
Pedro has a degree in Information Systems and is a Software Engineer with more than 6 years of experience who wants to share some of the knowledge acquired over that time.