Bulk Uploading 25 Million Files to OCI Object Storage

Jon Dixon

Introduction

This post describes an approach I used when bulk uploading 25 million files (amounting to 6 terabytes of storage) to the Oracle Cloud Infrastructure (OCI) Object Store.

Use Case

A recent project involved archiving Accounts Payable (AP) invoice data and attachments from a legacy Lawson ERP to an Archive solution. The UI for the Archive solution was built in Oracle APEX and ran on an OCI Autonomous Transaction Processing (ATP) database. The attachment files were stored in OCI Object Storage. The 24 million invoice metadata records were loaded using DBMS_CLOUD.COPY_DATA, which left us needing to copy 25 million invoice attachments (mainly PDF and Excel files) to Object Storage. You can read more about this project here «TBD».

OCI Configs

Before we can upload files to Object Storage, we must complete some setup in OCI.

Bucket

Create a new object store bucket. In my examples, the bucket is called ASCEND_INVOICES.

User and Group

Create a user and a group, and assign the user to the new group. Edit the user’s capabilities so that only ‘API keys’ is enabled.

Generate API Keys

Next, we need to generate an API key for the OCI CLI. Navigate to your new user, click ‘API keys’, then click ‘Download private key’. After the key downloads, click the ‘Add’ button.

After clicking the ‘Add’ button, the console shows a ‘Configuration file preview’; capture these details in a text file.

💡
We will use this information when configuring the OCI Client later in this post.

Policy

The final step on the OCI Console is to create a policy allowing the new group (and user) access to read and manage objects in your new bucket.

Allow group ascend_bucket_rw to read objects in tenancy 
  where target.bucket.name = 'ASCEND_INVOICES'
Allow group ascend_bucket_rw to manage objects in tenancy where all 
  {target.bucket.name='ASCEND_INVOICES', 
   any {request.permission='OBJECT_CREATE', request.permission='OBJECT_INSPECT', 
   request.permission='OBJECT_READ',request.permission='OBJECT_OVERWRITE'}}
Allow group ascend_bucket_rw to read buckets in tenancy 
  where all {target.bucket.name='ASCEND_INVOICES'}

Install & Set up OCI CLI

Install

I followed this QuickStart guide when installing the CLI on Windows. If you use Homebrew on Mac, installing is even easier using this Homebrew Formula.

Set up the Config File

Setting up the config file is also straightforward. Just follow the steps outlined here.

If this is a fresh install of the CLI tool, the easiest way to set up the config file is to run the following command:

oci setup config

This will walk you through the configuration process. You will need the information from the ‘OCI Configs’ section above, specifically the ‘Configuration file preview’ and the Private Key you downloaded.

Result

If the setup goes well, you should end up with a hidden folder called ‘.oci’ in your user folder.

# Change to the .oci folder in my home directory
cd ~/.oci
# List the contents of the directory
ls -ltr
drwxr-xr-x  3 jdixon  staff    96 Feb 25 05:57 sessions
-rw-------@ 1 jdixon  staff  1742 Mar 30 08:52 ascend_bucket_rw_2025-03-30.pem
-rw-------  1 jdixon  staff   330 Mar 30 08:54 config
# Note: I copied the private key '.pem' file I downloaded from OCI to this folder.

# Show contents of the config file. Actual values replaced with 'YYY'
cat config 
[DEFAULT]
user = ocid1.user.oc1..YYY
fingerprint = cb:cd:YYY
tenancy = ocid1.tenancy.oc1.YYY
region = us-chicago-1
key_file = /Users/jdixon/.oci/ascend_bucket_rw_2025-03-30.pem

Test

You can run a quick test by using the CLI to get details about your bucket:

# Get details about the bucket
oci os bucket get --name "ASCEND_INVOICES" --fields approximateCount --fields approximateSize
# The result is JSON that starts 
{
  "data": {
    "approximate-count": 9663,
    "approximate-size": 1598743463,
# additional JSON with details about the bucket removed for brevity.

# You can get fancy and output details about the bucket as a table.
oci os bucket get --name "ASCEND_INVOICES" --query "data.{\"count\":\"approximate-count\",\"size\":\"approximate-size\",\"name\":name}" --output table --fields approximateCount --fields approximateSize
# The result is something like this:
+-------+-----------------+------------+
| count | name            | size       |
+-------+-----------------+------------+
| 9663  | ASCEND_INVOICES | 1598743463 |
+-------+-----------------+------------+

Bulk Uploading 25 Million Files

Now that the CLI is set up and working, let’s discuss the best way to upload 25 million files to Object Storage.

Break up the Upload

The first thing I did was break the source folders into roughly equal chunks for upload. Kicking off one massive bulk upload carries the following risks if it fails:

  • Figuring out where it failed will be challenging.

  • Every restart means re-running the bulk upload of all 25 million files.

Source Folder Structure

In my case, the source folder structure looked like this:

E:\Ascend\M01\...
E:\Ascend\M02\...
E:\Ascend\M03\...
...
F:\Ascend\M01\...
F:\Ascend\M02\...
...

I used the ‘M’ folders to break the copy into 70 separate bulk upload commands.
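
Before splitting the work, it helps to confirm the ‘M’ folders are of comparable size. Here is a minimal PowerShell sketch for that check; the drive letter and folder layout are illustrative and match the structure shown above.

Get-ChildItem "E:\Ascend" -Directory | ForEach-Object {
    # Count every file under this 'M' folder (recursively)
    $files = (Get-ChildItem $_.FullName -Recurse -File).Count
    Write-Output ("{0,-6} {1,10} files" -f $_.Name, $files)
}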

Bulk Upload Commands

The OCI documentation provides full details of the bulk upload CLI command.

Here is an example of the bulk upload command I ended up using:

oci os object bulk-upload -bn "ASCEND_INVOICES" --src-dir "H:\Ascend\M10\" --overwrite --parallel-upload-count 500 --object-prefix "H/Ascend/M10/" --exclude "*.Log" > H_M10.json

Parameters Details

  • --src-dir - Indicates the source directory (in this case, on a Windows file system).

  • --overwrite - If we need to re-run the upload for a folder, this option tells the CLI to overwrite any existing files.

  • --parallel-upload-count - Indicates the degree of parallelism to use. The maximum value for this parameter is 1000 (more on why I chose 500 later in the post).

  • --object-prefix - Indicates how the files should be prefixed in the Object Store bucket. I wanted to transform the Windows drive into a folder in the object store. So H:\ became H/.

  • --exclude - Indicates any files you want to exclude from the copy. In my case, there were millions of ‘.Log’ files in the same folders as the PDF files. Excluding the files you don’t need reduces both the upload time and the Object Storage costs on OCI. You can add multiple exclude parameters (see the scripted example below).
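
Since each ‘M’ folder maps to one bulk upload command, the 70 runs can be scripted. The following PowerShell sketch mirrors the command above and shows multiple --exclude flags; the drive letter, the ‘*.tmp’ pattern, and the log-file naming are illustrative assumptions.

# One bulk upload per 'M' folder, each writing its own JSON log.
$drive  = "H"
$bucket = "ASCEND_INVOICES"

Get-ChildItem "$($drive):\Ascend" -Directory | ForEach-Object {
    $srcDir  = $_.FullName                      # e.g. H:\Ascend\M10
    $prefix  = "$drive/Ascend/$($_.Name)/"      # e.g. H/Ascend/M10/
    $logFile = "$($drive)_$($_.Name).json"      # e.g. H_M10.json

    oci os object bulk-upload -bn $bucket --src-dir $srcDir `
        --overwrite --parallel-upload-count 500 `
        --object-prefix $prefix `
        --exclude "*.Log" --exclude "*.tmp" > $logFile
}

Scripting the runs this way also keeps one JSON log per folder, which makes the reconciliation step later in this post easier.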

Output

You may have noticed I am redirecting the output of the CLI command to a file called ‘H_M10.json’. When the bulk upload runs, it generates JSON that includes details of every file it uploads. The content of this file will be used during the reconciliation step. Note: This JSON file can get pretty big.

{
  "skipped-objects": [],
  "upload-failures": {},
  "uploaded-objects": {
    "MyFile.txt": {
      "etag": "e25f95e6-a2bd-435c-83d6-785f838134d5",
      "last-modified": "Sat, 12 Dec 2020 11:31:36 GMT",
      "opc-content-md5": "vqglL/ToD0FxnqE83wBycw=="
    },
    "logFile.log": {
      "etag": "bbcf33dd-a177-4406-bed1-a4f7125da800",
      "last-modified": "Sat, 12 Dec 2020 11:31:36 GMT",
      "opc-content-md5": "K8vB8NVASIvtL2BE5ksUjw=="
    }
  }
}
💡
The key field in the output JSON is upload-failures. It lists any files that could not be uploaded; ideally, it is empty.

Optimizing Parallel Upload Count

In the early stages of the project, I ran several simulations using different values for the parameter --parallel-upload-count. Of course, the first value I tried was 1000 (the maximum), but in the end, I found that 500 provided the best performance on the Windows machine from which I was uploading the files.

💡
This is related to the number of CPUs on the machine and the number of parallel threads it can handle. Ensure you test with several different values to determine the optimal number for your machine.
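
One way to run these simulations is to time a sample upload at a few parallelism levels. Here is a sketch of that approach; the sample folder, bucket name, and ‘perf-test’ prefix are illustrative, and you should use a small but representative folder.

# Time a sample bulk upload at different parallelism settings.
$sampleDir = "H:\Ascend\M01"
$bucket    = "ASCEND_INVOICES"

foreach ($parallel in 250, 500, 1000) {
    $elapsed = Measure-Command {
        oci os object bulk-upload -bn $bucket --src-dir $sampleDir `
            --object-prefix "perf-test/$parallel/" --overwrite `
            --parallel-upload-count $parallel | Out-Null
    }
    Write-Output "parallel-upload-count=$parallel : $([math]::Round($elapsed.TotalMinutes, 1)) minutes"
}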

Reconciliation

Of course, we need to verify that we have the same files in Object Storage that we started with on the source machine.

The first thing to check after each bulk upload is that there were no upload errors. This means grepping the JSON file created by the bulk-upload process to verify that there are no upload failures. On the Windows machine, I used the following PowerShell command for this:

Select-String -Path "H_M10.json" -Pattern "upload-failures"

Next, we can count the etag fields in the output JSON to get the number of objects that were uploaded:

(Get-Content "H_M10.json" | Select-String "etag").Count

Then, all you need to do is compare the resulting count to the number of files in the source folder (a scripted version of this check follows below).
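
If you prefer to parse the JSON rather than grep it, here is a sketch of the same reconciliation using ConvertFrom-Json; file and folder paths are illustrative. For very large logs, the simple Select-String counts above may be faster than loading the whole file.

# Reconcile one batch by parsing the bulk-upload JSON log.
$result = Get-Content "H_M10.json" -Raw | ConvertFrom-Json

$uploaded = @($result.'uploaded-objects'.PSObject.Properties).Count
$failed   = @($result.'upload-failures'.PSObject.Properties).Count
$source   = (Get-ChildItem "H:\Ascend\M10" -Recurse -File |
             Where-Object { $_.Extension -ne ".Log" }).Count

Write-Output "Uploaded: $uploaded  Failed: $failed  Source files (excluding .Log): $source"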

Estimate Timings

The final step for me was to bulk upload three source folders and measure the average upload rate. Divide the total number of files (or total data volume) by this rate, and you get an estimated duration for your upload. In my case, I achieved an upload rate of 125 files per second, or 58 MB per second, resulting in an overall estimated duration of 30 hours.
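
As a back-of-the-envelope check, the 30-hour figure lines up with the throughput numbers quoted above (roughly 6 TB at 58 MB per second); the figures below are taken from this post, not measured values you should reuse.

# Estimate the upload duration from the measured throughput.
$totalMB     = 6 * 1024 * 1024    # ~6 TB of attachments
$mbPerSecond = 58                 # measured upload rate
$hours = [math]::Round($totalMB / $mbPerSecond / 3600, 1)
Write-Output "Estimated upload time: $hours hours"   # roughly 30 hours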

The overall elapsed time was longer because you must start each bulk upload and verify the results before starting the next one.

Observations

  • Running a bulk upload where the files already exist in object storage takes about 40% longer than when the files are not there. This will happen if a batch fails and you re-run it with the --overwrite flag.

Conclusion

If you are doing a lot of work with OCI services, then the Command Line Interface is an essential tool for productivity. As we saw from this post, even if you only use OCI services occasionally, the CLI can come in very handy.

1
Subscribe to my newsletter

Read articles from Jon Dixon directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Jon Dixon
Jon Dixon

Hi, thanks for stopping by! I am focused on designing and building innovative solutions using the Oracle Database, Oracle APEX, and Oracle REST Data Services (ORDS). I hope you enjoy my blog.