Optimizing SQL: Sometimes It’s About the Data, Not Just the Query

I recently came across a LinkedIn post where someone used two CTEs to pull DISTINCT merchant_ids from the same table, split by year, and then joined everything back to the orders table. It worked! But it got me thinking. Sometimes, the real challenge isn't the query but the data we're working with. I built a quick PostgreSQL lab to explore the performance and clarity of different approaches using a realistic dataset. Here's how I did it:
Setup
Create a merchant table with 5,000 companies
Create an orders table with 1,000,000 orders
#!/usr/bin/env python
import psycopg2
from faker import Faker
import random
from tqdm import tqdm
import datetime

fake = Faker()

# Use the service name defined in ~/.pg_service.conf
conn = psycopg2.connect(service="pg_lab")
cur = conn.cursor()

# Drop and recreate tables
cur.execute("DROP TABLE IF EXISTS orders")
cur.execute("DROP TABLE IF EXISTS merchant")
conn.commit()

cur.execute("""
    CREATE TABLE merchant (
        id SERIAL PRIMARY KEY,
        name TEXT NOT NULL
    )
""")
cur.execute("""
    CREATE TABLE orders (
        id SERIAL PRIMARY KEY,
        merchant_id INTEGER REFERENCES merchant(id),
        order_date DATE NOT NULL,
        amount NUMERIC(10, 2) NOT NULL
    )
""")
conn.commit()

# Insert merchants
merchant_count = 5000
print("Inserting merchants...")
cur.executemany(
    "INSERT INTO merchant (name) VALUES (%s)",
    [(fake.company(),) for _ in range(merchant_count)]
)
conn.commit()

# Insert orders
order_count = 1_000_000
batch_size = 10000
merchant_ids = list(range(1, merchant_count + 1))
print("Inserting orders...")
for _ in tqdm(range(order_count // batch_size)):
    batch = []
    for _ in range(batch_size):
        merchant_id = random.choice(merchant_ids) if random.random() > 0.10 else None
        year = random.choice([2016, 2017])
        order_date = fake.date_between(
            start_date=datetime.date(year, 1, 1),
            end_date=datetime.date(year, 12, 31)
        )
        amount = round(random.uniform(10.0, 1000.0), 2)
        batch.append((merchant_id, order_date, amount))
    cur.executemany(
        "INSERT INTO orders (merchant_id, order_date, amount) VALUES (%s, %s, %s)",
        batch
    )
    conn.commit()

print("✅ Done!")
cur.close()
conn.close()
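Before timing anything, I like to sanity-check the load. The checks below are a quick sketch against the tables created above; the expected numbers come straight from the script (5,000 merchants, 1,000,000 orders, roughly 10% of orders with a NULL merchant_id, and dates spread across 2016 and 2017):
-- rough sanity checks after the load; targets come from the script above
select count(*) from merchant;                          -- expect 5000
select count(*) from orders;                            -- expect 1000000
select count(*) from orders where merchant_id is null;  -- roughly 10% of orders
select min(order_date), max(order_date) from orders;    -- 2016-01-01 .. 2017-12-31
analyze merchant;
analyze orders;                                         -- refresh planner statistics before timing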
Query Patterns:
Approach 1: Common but Heavier Method
create index cvr_merchant_id_order_dt_idx on orders ( id, merchant_id, order_date );
\timing
explain analyze
WITH m2016 AS (
    SELECT DISTINCT merchant_id
    FROM orders
    WHERE order_date BETWEEN '2016-01-01' AND '2016-12-31'
),
m2017 AS (
    SELECT DISTINCT merchant_id
    FROM orders
    WHERE order_date BETWEEN '2017-01-01' AND '2017-12-31'
)
SELECT o.merchant_id, date_part('year', o.order_date) AS order_year, sum(amount)
FROM orders o
LEFT JOIN m2016 USING (merchant_id)
LEFT JOIN m2017 USING (merchant_id)
GROUP BY o.merchant_id, date_part('year', o.order_date)
HAVING date_part('year', o.order_date) IN (2016, 2017);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Finalize HashAggregate (cost=15620.42..15787.75 rows=9562 width=44) (actual time=9077.377..9195.207 rows=10002 loops=1)
Group Key: o.merchant_id, (date_part('year'::text, (o.order_date)::timestamp without time zone))
Batches: 1 Memory Usage: 5521kB
-> Gather (cost=14630.75..15537.08 rows=8334 width=44) (actual time=8165.085..8701.543 rows=30006 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=13630.75..13703.68 rows=4167 width=44) (actual time=8151.833..8269.454 rows=10002 loops=3)
Group Key: o.merchant_id, date_part('year'::text, (o.order_date)::timestamp without time zone)
Batches: 1 Memory Usage: 4241kB
Worker 0: Batches: 1 Memory Usage: 4241kB
Worker 1: Batches: 1 Memory Usage: 4241kB
-> Parallel Seq Scan on orders o (cost=0.00..13599.50 rows=4167 width=18) (actual time=0.094..4098.985 rows=333333 loops=3)
Filter: (date_part('year'::text, (order_date)::timestamp without time zone) = ANY ('{2016,2017}'::double precision[]))
Planning Time: 0.181 ms
Execution Time: 9298.977 ms
(15 rows)
Time: 9300.379 ms (00:09.300)
This works, and sometimes it's all you need to get the job done. But it's also heavier than it needs to be as written: it asks for two extra scans of the table, a DISTINCT over each, and two joins back to orders.
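A side note on that index: the plan filters on date_part('year', order_date), which the b-tree on (id, merchant_id, order_date) cannot serve, so the parallel sequential scan is expected. In this lab it would not matter anyway, because every order falls in 2016 or 2017, but on a table spanning many more years an expression index like the sketch below (my assumption, not something I benchmarked here) would at least make the year predicate indexable:
-- hypothetical: an expression index that matches the date_part() predicate;
-- only worthwhile when the target years are a small slice of the table
create index orders_order_year_idx
    on orders (date_part('year', order_date));

-- alternatively, keep the predicate as a plain range on order_date
-- (order_date >= '2016-01-01' and order_date < '2018-01-01')
-- so an ordinary index on (order_date) stays usable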
Approach 2: Consolidation of Database Calls for Efficient SQL
I can simplify the query down to its basic pieces; however, I could not find a better plan, because Postgres 17 appears to be much more intelligent than earlier versions: the rows returned and the costs are identical in this case, which tells me Postgres is rewriting the CTE version into something very close to this one. The benefit of the query below over the one above is organization; it states the question directly instead of splitting the data across CTEs. Sometimes, when the optimizer does not rewrite for you, you may be out of luck, and then the simpler form is also the cheaper one.
-- Why get a DISTINCT in the first place? You auto-magically get it from the
-- GROUP BY. You also don't need a fancy CTE, because your constraints will
-- protect you from getting any orphan records. If your data is wonky, that
-- means constraints weren't put in when they should have been.
explain analyze
select o.merchant_id
     , date_part('year', o.order_date) as order_year
     , sum(amount)
from orders o
group by o.merchant_id
       , date_part('year', o.order_date)
having date_part('year', o.order_date) in (2016, 2017);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Finalize HashAggregate (cost=15620.42..15787.75 rows=9562 width=44) (actual time=9178.468..9300.326 rows=10002 loops=1)
Group Key: merchant_id, (date_part('year'::text, (order_date)::timestamp without time zone))
Batches: 1 Memory Usage: 5521kB
-> Gather (cost=14630.75..15537.08 rows=8334 width=44) (actual time=8240.512..8791.474 rows=30006 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=13630.75..13703.68 rows=4167 width=44) (actual time=8226.467..8349.513 rows=10002 loops=3)
Group Key: merchant_id, date_part('year'::text, (order_date)::timestamp without time zone)
Batches: 1 Memory Usage: 4241kB
Worker 0: Batches: 1 Memory Usage: 4241kB
Worker 1: Batches: 1 Memory Usage: 4241kB
-> Parallel Seq Scan on orders o (cost=0.00..13599.50 rows=4167 width=18) (actual time=0.115..4138.469 rows=333333 loops=3)
Filter: (date_part('year'::text, (order_date)::timestamp without time zone) = ANY ('{2016,2017}'::double precision[]))
Planning Time: 0.181 ms
Execution Time: 9407.145 ms
(15 rows)
Time: 9408.159 ms (00:09.408)
This second pattern:
Is simpler to write and, as the plans show, no heavier for the database to execute.
Is easier to read and maintain.
Still gives us clear insight into which merchants were active in which years (a pivoted variant is sketched below).
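If you want the two years side by side for each merchant, conditional aggregation is another single-pass shape for the same question. This is only a sketch of an alternative I find readable, not something from the original post:
-- one pass over orders, one row per merchant, the two years pivoted with FILTER
select o.merchant_id
     , sum(amount) filter (where date_part('year', o.order_date) = 2016) as total_2016
     , sum(amount) filter (where date_part('year', o.order_date) = 2017) as total_2017
from orders o
where o.order_date >= date '2016-01-01'
  and o.order_date <  date '2018-01-01'
group by o.merchant_id;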
💡 Takeaway
SQL is a powerful tool, and it's fantastic to see people share what they have learned. As we grow, we notice that some patterns show SQL prowess, while others point to a need for better data maintenance and architecture. Sometimes we should fix the data problem rather than write a query to work around it. I always aim for clean, efficient queries and database architecture, but I also recognize that messy data is part of real-world systems. I lean toward ideal solutions, but not at the expense of reality. Thanks to that original post for sparking the idea! I would love to hear how others handle these kinds of challenges.
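To make the "fix the data, not just the query" point concrete: this lab deliberately loads roughly 10% of orders with a NULL merchant_id. Here is a sketch of one possible cleanup, assuming the business decides those rows should hang off an explicit 'Unknown' merchant (that policy is my assumption, not something the original post called for):
-- hypothetical cleanup: park orphan orders under an explicit 'Unknown' merchant,
-- then tighten the schema so the problem cannot silently come back
with unknown as (
    insert into merchant (name) values ('Unknown') returning id
)
update orders o
   set merchant_id = (select id from unknown)
 where o.merchant_id is null;

alter table orders
    alter column merchant_id set not null;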