Postgres Indexes

saiful islamsaiful islam
9 min read

We all know about postgres Indexes. It’s one of the most sought after topics that are discussed on tech interviews. But we in the industry sometimes underestimate just how powerful postgres indexes can be and just how easier things get with each iteration of Postgres. I think this is why index designing is often ignored in favor with other scaling ideas when application lifecycle moves towards the later stages.

Postgres rows are saved as heaps meaning unordered collection of data. Without indexes, postgres insert data wherever there’s space and when searching it has to go through unordered location each time. When you create an index, Postgres builds a separate data structure (like a B-tree) from the table itself that stores the indexed column(s), stores the pointer to the actual row. Hence this tree is used to speed up data retrieval. It helps avoid full table scans and improve performance on lookups, filters, joins, sorting, full-text searches.

Now before we dig deep on some postgres index capabilitites, we first need to clarify we will only be taking the B-Tree index into consideration. There are other types of indexes available on Postgres such as Hash, Gist, SP-Gist, GIN, BRIN. But B-tree is by far the most used and supports the most operators. Most of these features can be used for other types, but be vigilant of index types when looking to apply these techniques in real world scenarios.

Multicolumn Indexes

An index can be defined on multiple columns of a table. If you frequent queries like:

SELECT * FROM test WHERE a = 5 AND b >= 42 AND c < 77;

then it might be appropriate to define an index on the columns a, b & c.

CREATE INDEX idx_abc ON test (a, b, c);

A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns. Query efficiency depends heavily on constraints involving the leading (leftmost) columns. Equality constraints (=) on leading columns help the most.

Now, for the condition WHERE a = 5 AND b >= 42 AND c < 77, the a = 5 in the first query clause allows the engine to jump directly to the subtree of the B-tree where a = 5. The b >= 42 clause helps it start the scan at (a = 5, b = 42). You can visualize the index pattern the below way:

//B-Tree
(a=5, b=40, c=100)
(a=5, b=41, c=60)
(a=5, b=42, c=70) ← Query can START HERE
(a=5, b=43, c=50)
(a=5, b=44, c=90)
...
(a=5, b=99, c=10)
(a=6, b=1, c=5)    ← END SCAN here (because a != 5)

The c < 77 condition does not affect the scan range — the database engine still has to walk through all (a = 5, b >= 42) entries and check each c value manually.

Lets then take into account another example that start with our leftmost column a with an inequality.

SELECT * FROM test WHERE a > 5 AND b >= 42 AND c < 77;

The a > 5 help to limit the scan range, but because the planner cannot jump directly to b = 42 without fixing a first — and since a is a range, we’re scanning multiple a values.

So, when equality filters are not present on the leading column, only the first inequality condition helps narrow the scan range. Remaining conditions do not reduce how much of the index is scanned — they just filter during the scan.

According to Postgres docs, Multicolumn indexes should be used sparingly. Indexes with more than three columns are unlikely to be helpful unless the usage of the table is extremely stylized.

Combining Multiple Indexes

If theres only one index avaialable on a table eg

CREATE INDEX idx_ab ON test (a, b);

This works well only within certain conditions:

  1. Query predicates involve only columns in the index.

  2. Conditions are combined with AND.

     SELECT * FROM table WHERE a = 5 AND b = 6;
    
  3. Operators used are supported by the index.

If we happen to have queries of the structure

-- Given index on (a, b)
SELECT * FROM table WHERE a = 5 OR b = 6;

then we wont be able to use the index directly. But Postgres can combine multiple indexes(and also use the same index multiple times) to handle such cases.

Consider this example

-- Given index on x
SELECT * FROM table WHERE x = 42 OR x = 47 OR x = 53 OR x = 99;

postgres will break down the query into four separate scan on the index on x and OR the results together. Another such example

-- CREATE INDEX idx_x ON table (x);
-- CREATE INDEX idx_y ON table (y);
SELECT * FROM table WHERE x = 42 AND y = 7;

postgres will use each index with the appropriate query clause and then AND together the index results to identify the result rows.

So, when required PostgreSQL scans each relevant index. It builds a bitmap of row locations for each scan.These bitmaps are ANDed or ORed. The actual rows are then fetched from disk in physical order.

When designing databases, we must make trade-offs to decide on a indexing strategy. The base guidelines are:

  1. Multicolumn indexes (INDEX(x, y)) are efficient for queries involving both x and y, but not for y alone.
  1. Single-column indexes more flexible when queries are mixed (some on x, some on y, some on both).

Indexes on Expressions

Postgres allows you to create indexes on a function or a scalar expressions computed on one or more columns. If you have frequent queries like:

SELECT * FROM people WHERE (first_name || ' ' || last_name) = 'John Doe';

This query can benefit from an index defined like below:

CREATE INDEX people_names ON people ((first_name || ' ' || last_name));

Index expressions are relatively expensive because because the derived expression(s) must be computed for each row insertion or update. However, the index expressions are not recomputed during a search since they are already stored in the index at the time of insertion. Thus, indexes on expressions are useful when retrieval speed is more important than insertion and update speed.

Partial Indexes

A partial index is an index that only covers a subset of a column value. The subset is defined by a conditiponal expression. An use case of a partial index is to avoid indexing common values. Since a query searching for a common value (one that accounts for more than a few percent of all the table rows) will not use the index anyway, there is no point in keeping those rows in the index at all.

Lets consider an example. If we have a table that contains both billed and unbilled orders, where the unbilled orders take up a small fraction of the total table and yet those are the most-accessed rows, we can improve performance by creating an index on just the unbilled rows.

CREATE INDEX orders_unbilled_index ON orders (order_nr) WHERE billed is not true;

Notice how the indexed column(order_nr) does not have to match the column on the predicate condition(billed).

Queries that use this index will look like this

SELECT * FROM orders WHERE billed is not true AND order_nr < 10000;

The index can also be used in queries that do not involve order_nr at all. ie

SELECT * FROM orders WHERE billed is not true AND amount > 5000.00;

But this query cannot use this index:

SELECT * FROM orders WHERE order_nr = 3501;

PostgreSQL supports partial indexes with arbitrary predicates, so long as only columns of the table being indexed are involved. The predicate must match the conditions used in the queries.

Index Only Scans

As mentioned, Postgres stores indexes separately from the main data area or heap. So, for a query to fetch results, postgres needs to fetch from both the index and heap. As heap stores data randomly so scanning for specific rows can be expensive.

To solve this performance problem, PostgreSQL supports index-only scans. The basic idea is to return values directly out of each index entry instead of consulting the associated heap entry.

We must meet 2 basic consitions for index only scan to work:

  1. Our index type must suport index only scans. B-tree indexes supports this. Other indexes have limited to no support.

  2. The query must reference only the columns from the index. Example below:

For index like the follwing

CREATE INDEX idx_xy ON test (x, y);

These follwing queries will be able to use index only scans since we query either x or x,y both cases have data present on the index itself.

SELECT x, y FROM tab WHERE x = 'key';
SELECT x FROM tab WHERE x = 'key' AND y < 42;

But these cant. since column z doesnt exist on the index.

SELECT x, z FROM tab WHERE x = 'key';
SELECT x FROM tab WHERE x = 'key' AND z < 42;

Covering Indexes

This is a feature designed to rise over limitations of index only scans. It is specifically designed to include the columns needed by a particular type of query that you run frequently. Queries often need to fetch more columns(select part) than the columns they search on (when part), Postgres allows you to create indexes where some columns are payload and not part of the search key.

If we have to frequently run queries like

SELECT first_name, last_name FROM users WHERE email = 'user@example.com';

then it might help to have covering indexes containing columns frequently fetched

CREATE INDEX idx_user_lookup ON users (email) INCLUDE (first_name, last_name);

There are understandably some caveats:

  1. Include columns cannot be used for index scans. the follwing query wont be able to use index scans.

     SELECT first_name, email FROM users WHERE last_name = 'Doe';
    
  2. Include columns cannot be used in index only scans unless the index covers all requested columns. Consider the follwing query for such as case:

     SELECT address FROM users WHERE email = 'user@example.com';
    
  3. Indexes with many INCLUDE columns can become larger and slower to update.

Official Postgres tips for Index Design

Postgres recommeds some experimentations or trial & error to decide on a specific index strategy. But some common tips are listed:

  1. Always run ANALYZE

    PostgreSQL uses statistics to estimate row counts and costs. If you don’t run ANALYZE, the planner uses default assumptions, and may avoid indexes entirely (thereby making indexes useless).

    ANALYZE my_table;

  2. Force index usage for diagnostics

    We can disable plan types like sequential scan or nested loop joins to test if PostgreSQL can use an index.

SET enable_seqscan TO off; 
SET enable_nestloop TO off;

If PostgreSQL still doesn’t use the index, the query likely doesn't match the index structure. If it does use the index, we can compare performance with and without it.

  1. Use EXPLAIN ANALYZE to measure actual performance

EXPLAIN ANALYZE SELECT ...;

This shows query plan and real execution time. Studying the data might give you more insight like weather the current indexing is indeed providing positive results. That might drive you to try out an alternative indexing strategy and compare performances.

0
Subscribe to my newsletter

Read articles from saiful islam directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

saiful islam
saiful islam