Quick tip: Selecting from an Oracle table and creating a Parquet datafile

There are many sites on how to connect to an Oracle database with Python and using Pandas to create a parquet format datafile. This example is using the python-oracledb library to establish the connection. Searches for Python libraries for Oracle database connections included cx_Oracle but this library was obsoleted in 2022 https://oracle.github.io/python-cx_Oracle/ This example required additional features provided in the Oracle Client libraries and is located in the lib_dir directory.

In this example, I am using Visual Studio Code with with Jupyter Notebook on my Mac. Python 3.12.2 with Anaconda is being used as the kernel for the Jupyter notebook functionality. I also set-up a virtual environment for the Python development.

When you're working with large datasets in enterprise environments, it’s often helpful to pull data from an Oracle database and convert it into a Parquet file for analysis, especially with Python and Pandas. Parquet is an efficient, compressed columnar format that plays nicely with big data tools like Spark, DuckDB, and Dask.

Here’s a quick way to do just that using Python.

# Oracle Data Frames Writing Apache Parquet Files
import getpass
import oracledb
import pyarrow
import pyarrow.parquet as pq

oracledb.init_oracle_client(lib_dir="/opt/oracle/instantclient_23_3")
# pw = getpass.getpass("Enter password: ")

connection = oracledb.connect(
    user="demopython",
    password="***",
    dsn="localhost/freepdb1")

FILE_NAME = "sample.parquet"

# Tune the fetch batch size for your query
BATCH_SIZE = 10000

print("Successfully connected to Oracle Database")
    # Get a PyArrow table from the query results
sql = "select * from dba_users"
pqwriter = None
for odf in connection.fetch_df_batches(statement=sql, size=BATCH_SIZE):

    # Get a PyArrow table from the query results
    pyarrow_table = pyarrow.Table.from_arrays(
        arrays=odf.column_arrays(), names=odf.column_names()
    )

    if not pqwriter:
        pqwriter = pq.ParquetWriter(FILE_NAME, pyarrow_table.schema)

    pqwriter.write_table(pyarrow_table)

pqwriter.close()
0
Subscribe to my newsletter

Read articles from Christopher Youll directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Christopher Youll
Christopher Youll