How to effectively work with Databases in Python

Akash Desarda
7 min read

Introduction

The age-old debate over raw SQL vs. ORM is still very much alive today. Let's look at some points of comparison:

| Aspect | Raw SQL | ORM (Object-Relational Mapping) |
| --- | --- | --- |
| Ease of use | Requires knowledge of SQL syntax and database-specific features. | Abstracts SQL into Python objects, making it easier for those familiar with Python. |
| Flexibility | Offers complete control over the SQL queries and database interactions. | Limited by the ORM's capabilities, but can be extended with raw SQL if needed. |
| Performance | Can be optimized for performance by writing efficient SQL queries. | May introduce overhead due to abstraction, but often optimized for common use cases. |
| Portability | Tied to specific SQL dialects, making it less portable across databases. | Generally more portable, as it abstracts database-specific details. |
| Learning curve | Steeper learning curve for those unfamiliar with SQL. | Easier for Python developers, but requires learning the ORM's API. |
| Maintenance | Can be harder to maintain due to verbose SQL code. | Easier to maintain, as changes in the database schema can be managed through models. |
| Security | Prone to SQL injection if not handled properly (see the sketch below). | Provides built-in protection against SQL injection through parameterized queries. |
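
To make the "Security" row concrete, here is a minimal sketch of handling user input safely in raw SQL with a parameterized query. It uses Psycopg 3 (which appears later in this article); the connection string and table here are placeholders, not the benchmark setup.

import psycopg

user_input = "INFY'; DROP TABLE ticker_history; --"  # hostile input

# Placeholder connection string, for illustration only
with psycopg.connect("postgresql://user:secret@localhost/playground") as conn:
    with conn.cursor() as cur:
        # Unsafe: f-string interpolation would let the input rewrite the query
        # cur.execute(f"SELECT * FROM ticker_history WHERE ticker = '{user_input}'")

        # Safe: the value is sent as a bound parameter, never parsed as SQL
        cur.execute(
            "SELECT * FROM ticker_history WHERE ticker = %s", (user_input,)
        )
        rows = cur.fetchall()

This is the kind of protection an ORM gives you by default; with raw SQL you have to remember to do it yourself.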

A seasoned database expert might claim that an ORM or a programming language isn't necessary for working with a database. However, in practice, this approach has significant downsides, such as:

  • For any complex problem, you might end up with multiple nested SELECT queries, which can be really tricky for others to debug and understand.

  • SQL has a fixed set of keywords, so you have to work within those limits. This means missing out on all the possibilities that a general-purpose programming language like Python offers.

  • You can't build a data pipeline using just SQL.

Therefore, the use of SQL should generally be limited to performing data analytics rather than developing data applications.

The World of ORMs

Python, being a popular programming language, provides many ORM options. Let's see them in action.

Tech stack

  • For this article, I am using a Postgres database & Python 3.13

  • Typically, an ORM transforms SQL result data into Python data structures, organized as dicts, tuples (i.e. a list of tuples), or even namedtuples (a list of namedtuples).

  • These basic data structures may not simplify handling real-world problems.

  • I prefer using a Dataframe as the data structure to store the data. A Dataframe closely resembles a Database Table.

  • You can choose between Pandas and Polars. Personally, I use Polars. I won't dive into why I prefer Polars over Pandas here, as that's a topic for another time. But trust me, switching to Polars can really make your life easier! (See the short sketch after this list.)
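
To see why I prefer a DataFrame over the basic structures, here is a minimal sketch with a hypothetical result set, showing the same rows as namedtuples versus a Polars DataFrame:

from collections import namedtuple
import polars as pl

# Hypothetical rows, as a DB driver would typically return them: a list of tuples
rows = [
    ("2024-01-01", "INFY", 1500.0, 1512.5),
    ("2024-01-01", "TCS", 3800.0, 3815.0),
]

# Option 1: namedtuples are readable, but still row-oriented
Row = namedtuple("Row", ["date", "ticker", "open", "close"])
records = [Row(*r) for r in rows]

# Option 2: a Polars DataFrame is columnar and immediately filterable/joinable
df = pl.DataFrame(rows, schema=["date", "ticker", "open", "close"], orient="row")
print(df.filter(pl.col("ticker") == "INFY"))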

Benchmark

The environment that I use includes:

  • Postgres 17

  • Python 3.13

  • A table with 1,977,823 rows

  • This is the benchmark query:
SELECT
    *
FROM
    FACTOR_INVESTING.TICKER_HISTORY
WHERE
    TICKER IN ('INFY', 'TCS')
    AND DATE BETWEEN CAST('2010-01-01' AS DATE) AND CAST('2024-01-01' AS DATE)
ORDER BY
    DATE DESC NULLS LAST
  • The benchmark query returns 6,910 rows

1. SQLAlchemy + Polars

from datetime import date
from about_time import about_time
from alive_progress import alive_it
from sqlalchemy.orm import declarative_base, mapped_column, Mapped, Session
from sqlalchemy import String, Date, DOUBLE_PRECISION, select, desc, create_engine
import polars as pl


engine = create_engine('postgresql+psycopg://akash:0330@localhost/playground')
Base = declarative_base()

class TickerHistory(Base):
    __tablename__ = "ticker_history"
    __table_args__ = {"schema": "factor_investing", "extend_existing": True}

    date: Mapped[date] = mapped_column(Date)
    ticker: Mapped[str] = mapped_column(String)
    key: Mapped[str] = mapped_column(String, primary_key=True, nullable=False)
    open: Mapped[float] = mapped_column(DOUBLE_PRECISION)
    high: Mapped[float] = mapped_column(DOUBLE_PRECISION)
    low: Mapped[float] = mapped_column(DOUBLE_PRECISION)
    close: Mapped[float] = mapped_column(DOUBLE_PRECISION)

query = (
    select(TickerHistory)
    .where(TickerHistory.ticker.in_(["INFY", "TCS"]))
    .where(TickerHistory.date.between(date(2010, 1, 1), date(2024, 1, 1)))
    .order_by(desc(TickerHistory.date))
)

with about_time() as t:
    with Session(engine) as session:
        # Running the same query 100 times
        for i in alive_it(range(100)):
            # Directly reading query in polars dataframe
            df = pl.read_database(query, session)

print(f"Total time taken: {t.duration_human}")

The result

So that's 44.41 seconds for 100 queries, with a throughput of 2.25 queries per second.

2. Pony ORM, Peewee ORM, SQLModel ORM

Pony ORM and Peewee currently do not support Psycopg 3 and still require Psycopg 2. The Psycopg authors recommend using Psycopg 3 going forward, as mentioned here. Because of this, I decided to skip both of these ORMs.

SQLModel, meanwhile, is itself built on top of SQLAlchemy, so we wouldn't see any significant difference in results.

But this is not the end of the road. There are other potentially good options too. Let's check them out.

You may have already gotten the feeling that I prefer ORM-based solutions over raw SQL. But theoretically, raw SQL execution should have an edge over ORM execution.

To combine the best of both worlds, I use the SQLGlot library. It lets you build queries programmatically and iteratively, and it eliminates issues like typos/spelling errors, SQL injection, etc. It supports 24 dialects. A short sketch of this style of query building is shown below.
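
As a quick illustration (separate from the benchmarks below), here is how the same SQLGlot expression can be built up step by step and rendered for different dialects; the table and filters are simply the ones used in the benchmark, and the DuckDB output is only there to show transpilation:

from sqlglot import select, condition

# Build the query incrementally; each builder call returns a new expression
query = select("*").from_("factor_investing.ticker_history")
query = query.where(condition("ticker").isin("INFY", "TCS"))
query = query.order_by("date DESC")

# Render the same expression for different SQL dialects
print(query.sql(dialect="postgres"))
print(query.sql(dialect="duckdb"))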

3. Psycopg3 + Sqlglot + Polars

from datetime import date
from about_time import about_time
from alive_progress import alive_it
from psycopg import connect
import polars as pl
from sqlglot import select, condition, Dialects

query = (
    select("*")
    .from_("factor_investing.ticker_history")
    .where(condition("ticker").isin("INFY", "TCS"))
    .where(condition("date").between(date(2010, 1, 1), date(2024, 1, 1)))
    .order_by("date DESC")
    .sql(Dialects.POSTGRES)
)

conn = connect(
    host="localhost", port=5432, dbname="playground", user="akash", password="0330"
)

with about_time() as t:
    print("Starting benchmark...")
    with conn.cursor() as cursor:
        # Running the same query 100 times
        for i in alive_it(range(100)):
            # Directly reading query in polars dataframe
            df = pl.read_database(query, cursor)

print(f"Total time taken: {t.duration_human}")

So that's 20.04 seconds for 100 queries, with a throughput of about 5 queries per second. This is an improvement over SQLAlchemy.

4. ADBC + Sqlglot + Polars

Polars DataFrames are backed by Arrow tables and interoperate with PyArrow. Using ADBC (Arrow Database Connectivity) can benefit from zero-copy transfer, since Polars doesn't need to convert the table data returned by ADBC drivers. The short sketch below illustrates the idea; the benchmark code follows.
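
To make the zero-copy path concrete, here is a minimal sketch (not the benchmark itself; the LIMIT query is just for demonstration) that pulls an Arrow table straight from an ADBC cursor and wraps it in Polars:

import polars as pl
from adbc_driver_postgresql.dbapi import connect

uri = "postgresql://akash:0330@localhost/playground"
sql = "SELECT * FROM factor_investing.ticker_history LIMIT 5"

with connect(uri) as conn:
    cursor = conn.cursor()
    cursor.execute(sql)
    # ADBC returns the result set in Arrow format
    arrow_table = cursor.fetch_arrow_table()
    cursor.close()

# Polars can wrap the Arrow data, typically without copying the buffers
df = pl.from_arrow(arrow_table)
print(df)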

from adbc_driver_postgresql.dbapi import connect
from datetime import date
from about_time import about_time
from alive_progress import alive_it
import polars as pl
from sqlglot import select, condition, Dialects

query = (
    select("*")
    .from_("factor_investing.ticker_history")
    .where(condition("ticker").isin("INFY", "TCS"))
    .where(condition("date").between(date(2010, 1, 1), date(2024, 1, 1)))
    .order_by("date DESC")
    .sql(Dialects.POSTGRES)
)

uri = "postgresql://akash:0330@localhost/playground"

with about_time() as t:
    with connect(uri) as conn:
       # Running the same query 100 times
        for i in alive_it(range(100)):
            # Directly reading query in polars dataframe
            df = pl.read_database(query, conn)

print(f"Total time taken: {t.duration_human}")

So that's 17.94 seconds for 100 queries, with a throughput of 5.68 queries per second. This improves on both SQLAlchemy & Psycopg.

5. ConnectorX + Sqlglot + Polars

ConnectorX is yet another SQL driver/library that has been making the rounds. It is written in Rust. Let's see it in action. Note that it takes a slightly different approach: there is no connection or cursor concept here, you just pass a connection URI (see the short sketch below, followed by the benchmark code).
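
For reference, here is a minimal sketch of calling ConnectorX directly (the LIMIT query is just illustrative; the actual benchmark below goes through pl.read_database_uri instead):

import connectorx as cx

uri = "postgresql://akash:0330@localhost/playground"
sql = "SELECT * FROM factor_investing.ticker_history LIMIT 5"

# No connection or cursor objects: ConnectorX opens, reads and closes per call,
# and can return the result directly as a Polars DataFrame
df = cx.read_sql(uri, sql, return_type="polars")
print(df)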

from datetime import date
from about_time import about_time
from alive_progress import alive_it
import polars as pl
from sqlglot import select, condition, Dialects

query = (
    select("*")
    .from_("factor_investing.ticker_history")
    .where(condition("ticker").isin("INFY", "TCS"))
    .where(condition("date").between(date(2010, 1, 1), date(2024, 1, 1)))
    .order_by("date DESC")
    .sql(Dialects.POSTGRES)
)

uri = "postgresql://akash:0330@localhost/playground"
with about_time() as t:
    # Running the same query 100 times
    for i in alive_it(range(100)):
        # Directly reading query in polars dataframe
        df = pl.read_database_uri(query, uri, engine="connectorx")

print(f"Total time taken: {t.duration_human}")

So that's 1:22.4 (82.4 seconds) for 100 queries, with a throughput of about 1.21 queries per second. This is the worst of the lot.

Conclusion

  • The idea that raw SQL can give you better results holds true. But at the same time, using it directly in Python is rarely a good idea.

  • SQLGlot fits this scenario nicely: it lets us build raw SQL queries programmatically and combine them with DataFrames.

  • In terms of benchmarks, the combination of ADBC + SQLGlot + Polars is the winner because of its tight Arrow integration, so it should be your first pick.

  • At the same time, sticking with Psycopg is also not a bad idea.
