Optimizing Query Performance in Snowflake

Victor Ohachor

Overview

Snowflake's architecture is made up of three layers: the storage layer, the compute layer, and the cloud services layer. The cloud services layer has query optimization built into it, which irons out some inefficiencies in your queries to yield quicker and more cost-effective results.

But it doesn't prevent you from unknowingly planting landmines or atomic explosives in your queries. This article aims (and might fail) to point out some of the common pitfalls your queries might be running toward because of your ignorance.

💡
Don't sweat it! We are all trying to save ourselves from our ignorance.

Common Query Performance Pitfalls (and How to Avoid Them)

What is a query performance pitfall?
"This refers to common mistakes or inefficiencies in writing and structuring database queries that can lead to suboptimal performance. This can cause queries to run slower, consume more resources, and ultimately increase costs," courtesy of ChatGPT.

Why don't we make our subheadings (from now on) more interesting?

SELECT name FROM queries_performance.pitfalls WHERE id = 1 => Exploding JOINs

This is usually caused by a missing ON condition, or by joining on columns whose values repeat on both sides of the join. In the worst case, every record from one table matches every record from the other.

Avoiding this particular pitfall is quite simple. All you need are:

  1. Clean join keys: Snowflake has no traditional indexes, so make sure the columns you join on don't contain unexpected duplicates that multiply rows.

  2. Appropriate JOIN types: For example, don't use LEFT JOIN when you ought to use INNER JOIN, and don't use NATURAL JOIN when you ought to use LATERAL JOIN.

  3. JOIN conditions based directly on the join columns; avoid wrapping them in functions or expressions. And hey, don't ever forget to add a JOIN condition, okay?

    You can easily save yourself from Cartesian products this way.
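As a sketch, here is what the pitfall and its fix look like (the `orders` and `customers` tables are hypothetical):

```sql
-- EXPLODING JOIN: no join condition at all, so every order is paired
-- with every customer. 1M orders x 100K customers = 100 billion rows.
SELECT o.order_id, c.customer_name
FROM orders o
CROSS JOIN customers c;

-- FIXED: an explicit equality condition on the join keys.
SELECT o.order_id, c.customer_name
FROM orders o
INNER JOIN customers c
  ON o.customer_id = c.customer_id;
```

The same explosion happens more subtly when `customer_id` is duplicated on both sides, so it's worth checking key uniqueness before blaming the join type.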

SELECT name FROM queries_performance.pitfalls WHERE id = 2 => Using UNION When UNION ALL Could Be Faster

UNION combines the results of two queries and removes duplicate rows, which requires additional processing to check for and eliminate these duplicates. While I do like using UNION for this, it tends to be resource-intensive and slow, especially when dealing with large datasets.

UNION ALL is UNION without the extra checks for duplicates, making it faster and more efficient.

When you are confident that the results of the queries you are combining do not contain duplicates, take advantage of the efficiency offered by UNION ALL.
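With hypothetical yearly tables, the two variants look like this:

```sql
-- UNION deduplicates the combined result, which forces an extra
-- (often expensive) duplicate-elimination step:
SELECT email FROM customers_2023
UNION
SELECT email FROM customers_2024;

-- UNION ALL simply concatenates the two results. If you already know
-- the inputs are disjoint (or duplicates are acceptable), prefer it:
SELECT email FROM customers_2023
UNION ALL
SELECT email FROM customers_2024;
```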

SELECT name FROM queries_performance.pitfalls WHERE id = 3 => "Filters!" "Absent SIR!" "Limits!" "Absent SIR!!"

Filtering and limiting are essential parts of data retrieval. They enable you to extract ONLY the portion of the dataset you need and generally lead to faster queries. With filters and limits, queries are cost-effective and yield quicker results even for very large datasets (big data).

Use them as much as possible; as long as they don't change the result you actually need, there is no harm in using them.
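A quick sketch with a hypothetical `events` table:

```sql
-- Unfiltered: scans and returns the entire table.
SELECT * FROM events;

-- Filtered and limited: Snowflake can prune micro-partitions using the
-- WHERE predicate, and LIMIT caps the rows returned.
SELECT event_id, event_type, created_at
FROM events
WHERE created_at >= '2024-01-01'
LIMIT 100;
```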

SELECT name FROM queries_performance.pitfalls WHERE id = 4 => Overuse of SELECT *

SELECT * is quite appealing and powerful. It's so tempting to use because with less text, you get more. Doesn't it align with the "less is more" philosophy, which emphasizes simplicity and minimalism in software design?

💡
With greater power comes greater responsibility.

Although I am not a data engineer (yet), I have come to realize one thing: writing complex queries is a common task for data engineers and ensuring that these queries are highly performant is just as important as writing them.

Data pipelines rely on this. Data warehouses rely on this. Data marts used by departments rely on this. Business decisions rely on this. Therefore, it is crucial that these queries are fast.

To achieve this, data engineers focus only on what is needed. Often, the number of columns in a table is large, and your queries typically don't require every single column. In this case, "less is more" **means restricting the columns you select to only what you truly need.** Don't be a visionary! No, not with the SELECT clause.
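A sketch with a hypothetical `customer_profiles` table. Because Snowflake stores data by column, naming only the columns you need means only those columns are scanned:

```sql
-- Pulls every column of every row, including wide columns the
-- downstream job never reads:
SELECT * FROM customer_profiles;

-- Scans and transfers only what the consumer actually uses:
SELECT customer_id, email, signup_date
FROM customer_profiles;
```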

SELECT name FROM queries_performance.pitfalls WHERE id = 5 => Late Filtering

We discussed filtering above and described succinctly how important it is to writing cost-effective queries. Yet late filtering is another explosive that slows down queries. In fact, I once thought such queries were beautiful until they weren't.

What do you think is the difference between the two code snippets below?

// Late filtering: map every item to its ID first, then use the index
// to check validity (the mapped values are bare IDs, so `is_valid`
// must be looked up on the original items).
const result = items.map(item => item.id).filter((id, i) => items[i].is_valid)

// Early filtering: discard invalid items first, then map the survivors.
const result = items.filter(item => item.is_valid).map(item => item.id)

To a beginner, there is hardly any difference except that the horse is placed behind the cart in one and vice versa in the other.

However, the difference between the two snippets above can determine whether the result is fast or slow if items is a large array. How is that?

💡
Take a moment to consider this: What is the rate of growth (or Big O value) for each of the snippets?

$$n = \text{items.length}$$

💡
The Big O notation for both is O(n). Surprising? Not really.

Let's Go Through the First Snippet

Assume that during the filtering process, at least half of the items are removed, with items.length being 1e4 or 10,000. This means JavaScript would iterate through all 10,000 items to return their IDs before filtering out approximately 5,000 invalid items.

To filter out these invalid items, JavaScript would still need to process all 10,000 items.

Let's Compare This with the Second Snippet

JavaScript would iterate through all 10,000 items, filter out approximately 5,000 invalid items, and then return the IDs of the remaining 5,000 items.

What did you notice?

In this average case scenario, the first snippet performs n operations in the map pass and another n in the filter pass, a growth rate of n + n (or 2n). The second snippet filters all n items first, then maps only the roughly n/2 survivors: n + n/2 (or 1.5n) operations.

Both are O(n), so the asymptotic class is the same, but the second snippet simply does less work: by filtering early, every subsequent pass runs over a smaller dataset. The more rows the filter removes, the bigger the savings.

Now, apply this way of thinking to your queries.

  1. Why apply filters after JOINing your datasets (or tables) when you could do it beforehand?

  2. Why apply filters after aggregation when you could do it earlier?

  3. If your filter condition can be placed in the WHERE clause, why put it in the HAVING clause?

  4. Why apply filters after writing complex subqueries when you could do it during their construction?

The solution to this?

💡
EARLY FILTERING!
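Applied to SQL, the same idea looks like this (hypothetical tables; Snowflake's optimizer often pushes predicates down for you, but writing the filter early makes the intent explicit and keeps intermediate results small):

```sql
-- Late filtering: aggregate every customer, then discard most groups
-- in HAVING.
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id
HAVING customer_id IN (SELECT customer_id FROM vip_customers);

-- Early filtering: restrict the rows in WHERE before the aggregation
-- ever runs.
SELECT customer_id, SUM(amount) AS total
FROM orders
WHERE customer_id IN (SELECT customer_id FROM vip_customers)
GROUP BY customer_id;
```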

Conclusion

Another way to optimize your query, independent of the pitfalls discussed above, is by using query history. Snowflake offers a query_history view accessible through snowflake.account_usage. This view provides information about query start times, end times, and execution durations. By analyzing this data, you can identify slow queries and explore ways to enhance their performance.
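For example, a query along these lines surfaces candidates for tuning (TOTAL_ELAPSED_TIME in this view is reported in milliseconds):

```sql
-- The ten slowest queries over the last 7 days, as a starting point
-- for optimization work:
SELECT query_id,
       query_text,
       warehouse_name,
       total_elapsed_time / 1000 AS elapsed_seconds
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time DESC
LIMIT 10;
```

Note that account_usage views can lag behind real time by up to a couple of hours, so they are best suited to retrospective analysis rather than live debugging.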

Thanks for reading!


Written by

Victor Ohachor

I am a software engineer with nearly two years of professional experience. I specialize as a backend engineer but also work in a full-stack capacity. I use JavaScript/TypeScript, Python, and PHP to solve real-world problems every day.