Optimizing SQL queries in BigQuery
When it comes to tuning a #SQL query in #BigQuery for top performance and cost-efficiency, here's my starting point:
π 1. Early Filtering:
- Filter out unnecessary rows at the earliest.
- Only select columns that are required.
- Remove all the redundant tables.
π 2. Early Aggregation:
If you're joining tables that need to be aggregated, always aggregate to the target granularity early on, before the join.
π’ 3. De-duplication:
- De-duplicate early to reduce row count and operate at the desired granularity.
- Use explicit de-duplication, such as ROW_NUMBER() OVER (PARTITION BY a,b,c ORDER BY d asc/desc) coupled with QUALIFY for a compact form.
π 4. Partitioning & Clustering:
- Ensure joined tables are partitioned and clustered properly.
- Confirm partition pruning takes place: when using the partitioned column in a JOIN or WHERE clause, avoid on-the-fly transformations like DATETIME(left_table.timestamp_partitioning_column, timezone) = right_table.datetime_partitioning_column. This will hinder partition elimination.
- If clustering by multiple columns, ensure it mirrors the join and filter usage to be effective. If needed and possible, adjust clustering.
π 5. Data Types:
- Prefer integer keys over strings for joins.
- Transform columns to the appropriate data type: e.g. timestamps containing only a date to the date type.
π 6. Common Table Expressions (CTEs):
- Reuse of non-recursive CTE multiple times? Consider moving it to separate staging tables.
- Heavy operations in a CTE? Think about extracting it as a staging table, then partition and cluster it for subsequent joins.
π 7. Unnesting:
- While nested data is powerful, avoid unnesting when you can.
For intervals like [valid_from, valid_to], refrain from unnesting them and work with the interval if possible.
β οΈ 8. Cross Joins:
Limit their usage. BigQuery isn't fond of joins with more outputs than inputs.
π 9. Order By:
Only use ORDER BY in the last query or when necessary in window functions.
π 10. Leverage Nested Data:
Aggregate large rowsets of the same type into one ARRAY using ARRAY_AGG to capitalize on compression benefits.
π¬ 11. Experiment, Experiment, Experiment:
The optimization journey is paved with trials and iterations. Aim for the best results by employing multiple strategies and seeing which one emerges as the most efficient in terms of runtime, slot time usage, and bytes processed. Often, hands-on experimentation reveals insights that theory might miss.
Always remember, each query is unique. While this checklist provides a solid foundation, fine-tuning will often be specific to your individual use case.
Make sure to check out BigQuery Documentation on the best practices and the BigQuery Anti-pattern recognition tool.
Happy querying! πΌπ
Found it useful? Subscribe to my Analytics newsletter at notjustsql.com.
Subscribe to my newsletter
Read articles from Constantin Lungu directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
Constantin Lungu
Constantin Lungu
Senior Data Engineer β’ Contractor / Freelancer β’ GCP & AWS Certified