Uploading Large Files To Google Bucket Storage Using Django

Pedro Henrique
3 min read

As we work with systems that generate files, and the volume of data flowing through those systems grows constantly, the size of the generated files also increases. As a result, we can run into timeout or memory issues when uploading these big files. One way to deal with this is to upload the files in chunks: break each file into small pieces and upload one piece at a time. Here you will learn how to upload large files to Google Cloud Storage in chunks using Python and Django.

Before starting, you need a Google Cloud Platform (GCP) account, a GCP project, and Google Cloud Storage enabled. Some knowledge of the Django framework also helps.

To make it easier, let's assume you already have a Django project and need to read a large amount of data from your database to create the file.

Install the Google Cloud Storage client library in your project.

pip install google-cloud-storage
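
The client authenticates through Application Default Credentials. A minimal sketch of pointing it at a service-account key, assuming a hypothetical key path, is to set the GOOGLE_APPLICATION_CREDENTIALS environment variable before the client is created (running gcloud auth application-default login locally also works):

import os

# Hypothetical path to a service-account JSON key; adjust to your environment.
os.environ.setdefault('GOOGLE_APPLICATION_CREDENTIALS', '/path/to/service-account.json')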

First, we will create a function that sends the pieces to the bucket.

We define the size of each piece in the CHUNK_SIZE variable as 30 MB. The chunk size must be a multiple of 256 KB.

from google.cloud import storage

def upload_file_chunk(export_file_path, file, bucket_name, chunk_size):
  try:
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)

    file.seek(0)
    # chunk_size tells the client to send the data in pieces of this size
    blob = bucket.blob(export_file_path, chunk_size=chunk_size)
    blob.upload_from_file(file)
    return True
  except Exception:
    # in real code, log the exception instead of silently returning False
    return False
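
As a quick sanity check, here is a minimal sketch of calling the function with an in-memory buffer; the object path and bucket name are placeholders, not values from this project:

import io

# hypothetical object path and bucket name, shown only to illustrate the call
buffer = io.BytesIO(b'id,name,price\n')
uploaded = upload_file_chunk(
  export_file_path='exports/products.csv',
  file=buffer,
  bucket_name='your_bucket_name',
  chunk_size=1024 * 1024 * 30,  # 30 MB, a multiple of 256 KB
)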

Now, since we are getting the data from the database, we will create a function that fetches it and sends it to the function that writes to the bucket.

The amount of data sent to the upload_file_chunk function should match the value defined in the CHUNK_SIZE variable. If the data sent is smaller than chunk_size, Google Storage understands that it is the last chunk. You will see this in the function below.
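
Since the chunk size has to be a multiple of 256 KB, and 30 MB works out to 120 × 256 KB, a quick check can be added before uploading. This is only a sketch of such a check:

CHUNK_SIZE = 1024 * 1024 * 30

# 31,457,280 bytes / 262,144 bytes = 120, so 30 MB is a valid chunk size
assert CHUNK_SIZE % (256 * 1024) == 0, 'chunk size must be a multiple of 256 KB'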

In this example, we have a PostgreSQL database, which we will query with pagination.

import io

import psycopg2

CHUNK_SIZE = 1024 * 1024 * 30  # should be a multiple of 256 KB
PAGE_SIZE = 10000

# hypothetical connection settings; adjust them to your environment
db_config = {
    'host': 'localhost',
    'dbname': 'your_database',
    'user': 'your_user',
    'password': 'your_password',
}

def fetch_products_with_pagination(page, page_size, db_config):
    conn = psycopg2.connect(**db_config)
    cursor = conn.cursor()

    offset = (page - 1) * page_size
    # ORDER BY keeps the pagination stable between queries
    query = """
      SELECT id, name, price
      FROM products
      ORDER BY id
      LIMIT %s OFFSET %s
    """
    cursor.execute(query, (page_size, offset))
    results = cursor.fetchall()
    cursor.close()
    conn.close()
    return results

def create_chunks(chunk_size, page_size):
  page = 1
  last_fetch = False
  CHUNK_SIZE = chunk_size
  len_size_encoded = 0
  rest = io.BytesIO()

  while True:
    # each call brings one page of results until the last page is reached
    result = fetch_products_with_pagination(page, page_size, db_config)
    size_result = len(result)
    page += 1

    str_lines = ''
    for line in result:  # here each row is converted to a string
      product_id, name, price = line
      str_lines += f'{product_id},{name},{price}\n'

    if size_result < page_size:
      last_fetch = True

    # append the encoded page to the buffer of data not yet uploaded
    encoded = str_lines.encode()
    len_size_encoded += len(encoded)
    rest.write(encoded)

    code_bytes = rest
    position_bytes = 0  # cursor position inside the buffer

    # while the buffer holds at least one full chunk, upload chunk by chunk
    while len_size_encoded > CHUNK_SIZE:
      chunk = io.BytesIO(
        code_bytes.getvalue()[position_bytes:position_bytes + CHUNK_SIZE]
      )
      upload_file_chunk(
        export_file_path='your_path',
        file=chunk,
        bucket_name='your_bucket_name',
        chunk_size=chunk_size
      )
      len_size_encoded -= CHUNK_SIZE
      position_bytes += CHUNK_SIZE

    # the remaining content not yet uploaded is kept here,
    # as it is smaller than the chunk_size
    rest = io.BytesIO()
    rest.write(code_bytes.getvalue()[position_bytes:])

    # on the last fetch, the remaining content is smaller than chunk_size,
    # so we can send all of it; being smaller than chunk_size,
    # Google Storage understands that it is the last part of the file
    if last_fetch:
      rest.seek(0)
      upload_file_chunk(
        export_file_path='your_path',
        file=rest,
        bucket_name='your_bucket_name',
        chunk_size=chunk_size
      )
      break


create_chunks(CHUNK_SIZE, PAGE_SIZE)
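
For completeness, newer versions of the google-cloud-storage library can also handle the chunked upload for you through a writable file handle. The sketch below shows that alternative, reusing the pagination function above and the same placeholder bucket and path; it is not the approach used in this article, just a simpler option when your library version supports blob.open:

from google.cloud import storage

def stream_products_to_bucket(bucket_name, export_file_path, chunk_size, page_size, db_config):
  client = storage.Client()
  blob = client.bucket(bucket_name).blob(export_file_path)

  # blob.open('wb') returns a writer that sends the data in chunk_size
  # pieces as it is written, so no manual buffer management is needed
  with blob.open('wb', chunk_size=chunk_size) as writer:
    page = 1
    while True:
      rows = fetch_products_with_pagination(page, page_size, db_config)
      for product_id, name, price in rows:
        writer.write(f'{product_id},{name},{price}\n'.encode())
      if len(rows) < page_size:
        break
      page += 1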

This way you can upload your files to a safe place like Google Cloud Storage without worrying about upload timeouts or memory issues.

Thank you!
