Optimizing Query Performance in Snowflake
Table of contents
- Overview
- Common Query Performance Pitfalls (and How to Avoid Them)
- SELECT name FROM queries_performance.pitfalls WHERE id = 1 => Exploding JOINs
- SELECT name FROM queries_performance.pitfalls WHERE id = 2 => Using UNION When UNION ALL Could Be Faster
- SELECT name FROM queries_performance.pitfalls WHERE id = 3 => "Filters!" "Absent SIR!" "Limits!" "Absent SIR!!"
- SELECT name FROM queries_performance.pitfalls WHERE id = 4 => Overuse of SELECT *
- SELECT name FROM queries_performance.pitfalls WHERE id = 5 => Late Filtering
- Conclusion
Overview
Snowflake's architecture is made up of three layers: the storage layer, the compute layer, and the cloud services layer. The cloud services layer has query optimization built in, which irons out some of the rough edges in your queries to yield quicker and more cost-effective results.
But it doesn't prevent you from unknowingly planting landmines or atomic explosives in your queries. This article aims (and might fail) to point out some of the common pitfalls your queries might be running toward without you knowing it.
Common Query Performance Pitfalls (and How to Avoid Them)
What is a query performance pitfall? In short, it is a query pattern that forces Snowflake to do more work (and burn more credits) than the result actually requires.
Why don't we make our subheadings (from now on) more interesting?
SELECT name FROM queries_performance.pitfalls WHERE id = 1 => Exploding JOINs
This is usually caused by missing `ON` conditions or by join keys that match many rows on both sides. It causes every record from one table to match every record from another.
It is quite simple to avoid this particular pitfall. All you need are:
- Explicit join conditions. Hey, don't ever forget to add a `JOIN` condition, okay? A join without one is a cartesian product waiting to happen.
- Appropriate JOIN types: for example, neither use `LEFT JOIN` when you ought to use `INNER JOIN`, nor use `NATURAL JOIN` when you ought to use `LATERAL JOIN`.
- Clean join keys. Snowflake has no traditional indexes, so ensure that your `JOIN` conditions compare the actual key columns (clustering large tables on those keys plays a similar role), and avoid wrapping them in functions or expressions.

You can easily save yourself from cartesian products this way, as the sketch below shows.
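Here is a minimal sketch. The `orders` and `customers` tables and their columns are hypothetical, purely for illustration:

```sql
-- Exploding JOIN: Snowflake accepts a JOIN without an ON clause and treats
-- it as a cross join, pairing every order with every customer.
SELECT o.order_id, c.name
FROM orders o
JOIN customers c;

-- Safe JOIN: an explicit condition on the key columns, with no functions
-- or expressions wrapped around them.
SELECT o.order_id, c.name
FROM orders o
INNER JOIN customers c
    ON o.customer_id = c.customer_id;
```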
SELECT name FROM queries_performance.pitfalls WHERE id = 2 => Using UNION When UNION ALL Could Be Faster
`UNION` combines the results of two queries and removes duplicate rows, which requires additional processing to check for and eliminate these duplicates. While I do like using `UNION` for this, it tends to be resource-intensive and slow, especially when dealing with large datasets.
`UNION ALL` is `UNION` without the extra checks for duplicates, making it faster and more efficient.
When you are confident that the results of the queries you are combining do not contain duplicates, take advantage of the efficiency offered by `UNION ALL`.
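For example, assuming two hypothetical yearly tables, `orders_2023` and `orders_2024`, whose order IDs never overlap:

```sql
-- UNION deduplicates the combined result, paying for an extra
-- sort-and-compare pass over every row
SELECT order_id FROM orders_2023
UNION
SELECT order_id FROM orders_2024;

-- UNION ALL simply concatenates the two results -- no duplicate check,
-- so it is faster when you already know there is no overlap
SELECT order_id FROM orders_2023
UNION ALL
SELECT order_id FROM orders_2024;
```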
SELECT name FROM queries_performance.pitfalls WHERE id = 3 => "Filters!" "Absent SIR!" "Limits!" "Absent SIR!!"
Filtering and limiting are essential parts of data retrieval. They enable you to extract ONLY the portion of the dataset you need and generally lead to faster queries. With filters and limits, queries are cost-effective and yield quicker results even on very large datasets (big data).
Use them wherever you can; there is no harm in doing so, and there is plenty to gain.
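A quick sketch, using a hypothetical `orders` table:

```sql
-- No filter, no limit: scans the entire table and returns every row
SELECT order_id, amount FROM orders;

-- Filtered and limited: the WHERE clause lets Snowflake prune
-- micro-partitions that cannot contain matching rows, and LIMIT caps
-- the number of rows returned
SELECT order_id, amount
FROM orders
WHERE order_date >= '2024-01-01'
LIMIT 100;
```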
SELECT name FROM queries_performance.pitfalls WHERE id = 4 => Overuse of SELECT *
`SELECT *` is quite appealing and powerful. It's so tempting to use because with less text, you get more. Doesn't it align with the Unix philosophy of "less is more," which emphasizes simplicity and minimalism in software design?
Although I am not a data engineer (yet), I have come to realize one thing: writing complex queries is a common task for data engineers and ensuring that these queries are highly performant is just as important as writing them.
Data pipelines rely on this. Data warehouses rely on this. Data marts used by departments rely on this. Business decisions rely on this. Therefore, it is crucial that these queries are fast.
To achieve this, data engineers focus only on what is needed. Often, the number of columns in a table is large, and your queries typically don't require every single column. In this case, "less is more" **means restricting the columns you select to only what you truly need.** Don't be a visionary! No, not with the `SELECT` clause.
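Again as a sketch, with the hypothetical `orders` table:

```sql
-- SELECT * drags every column through the warehouse, even ones you never use
SELECT * FROM orders;

-- Naming only the columns you need lets Snowflake's columnar storage
-- read just those columns
SELECT order_id, customer_id, amount
FROM orders;
```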
SELECT name FROM queries_performance.pitfalls WHERE id = 5 => Late Filtering
We discussed filtering above and described succinctly how important filters are to writing cost-effective queries. Yet late filtering is another explosive that slows down queries. In fact, I once thought late filters were beautiful until they weren't.
What do you think is the difference between the two code snippets below?
const result = items.map(item => item.id).filter((id, i) => items[i].is_valid) // map every item first, then filter using the original item's flag
const result = items.filter(item => item.is_valid).map(item => item.id) // filter first, then map only the surviving items
To a beginner, there is hardly any difference except that the horse is placed behind the cart in one and vice versa in the other.
However, the difference between the two snippets above can determine whether the result comes back fast or slow if `items` is a large array. How is that?
Let $$n$$ be the number of elements in `items`. Both snippets run in O(n) time. Surprising? Not really.
Let's Go Through the First Snippet
Assume that during the filtering process, at least half of the items are removed, with `items.length` being `1e4`, or 10,000. The first snippet iterates through all 10,000 items to produce their IDs, and then iterates through all 10,000 IDs again to filter out the roughly 5,000 invalid ones: 10,000 + 10,000 = 20,000 operations in total.
Let's Compare This with the Second Snippet
The second snippet iterates through all 10,000 items once, filters out the roughly 5,000 invalid ones, and then iterates only through the 5,000 survivors to produce their IDs: 10,000 + 5,000 = 15,000 operations in total.
What did you notice?
In this average-case scenario, the first snippet performs n + n (or 2n) operations, while the second performs n + n/2 (roughly 1.5n).
Both scale linearly with the size of the input, i.e. both are O(n), but the second snippet does a quarter less work because the mapping step only touches the rows that survived the filter. The more selective the filter, the bigger the savings.
Now, apply this way of thinking to your queries.
- Why apply filters after `JOIN`ing your datasets (or tables) when you could do it beforehand?
- Why apply filters after aggregation when you could do it earlier?
- If your filter condition can be placed in the `WHERE` clause, why put it in the `HAVING` clause?
- Why apply filters after writing complex subqueries when you could do it during their construction?

The solution to all of these? Filter as early as the query allows, as in the sketch below.
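Here is a minimal sketch of the `WHERE` versus `HAVING` case, again with the hypothetical `orders` table:

```sql
-- Late filtering: every row is aggregated, then whole groups are discarded
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING customer_id BETWEEN 1000 AND 1999;  -- row-level condition hiding in HAVING

-- Early filtering: rows are discarded before the aggregation ever sees them.
-- (Snowflake's optimizer can often push such predicates down for you,
-- but it is better not to rely on that.)
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
WHERE customer_id BETWEEN 1000 AND 1999
GROUP BY customer_id;
```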
Conclusion
Another way to optimize your queries, independent of the pitfalls discussed above, is by using query history. Snowflake offers a `query_history` view accessible through `snowflake.account_usage`. This view provides information about query start times, end times, and execution durations. By analyzing this data, you can identify slow queries and explore ways to enhance their performance.
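For example, here is one way to surface last week's ten slowest queries (the `account_usage` schema requires appropriate privileges, and its data can lag behind real time):

```sql
-- TOTAL_ELAPSED_TIME is reported in milliseconds
SELECT query_text,
       warehouse_name,
       total_elapsed_time / 1000 AS elapsed_seconds,
       start_time
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time DESC
LIMIT 10;
```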
Thanks for reading!
Written by Victor Ohachor
I am a software engineer with nearly two years of professional experience. I specialize as a backend engineer but also work across the full stack. I use JavaScript/TypeScript, Python, and PHP to solve real-world problems every day.