BigQuery Study Reference

Beshoy Sabri
38 min read

I have just finished a study guide on BigQuery on GCP. You can find the full version on the Heptabase whiteboard below.

In summary, here it is:

BigQuery SQL Statement Guide (Standard SQL & Legacy SQL)

Google BigQuery’s SQL dialect (Google Standard SQL) extends traditional SQL with unique features. Below, we explore BigQuery-specific SQL statements in the classic categories – DDL, DQL, DML, DCL, TCL – with examples and notes on legacy SQL where applicable, followed by additional topics like aggregations, window functions, and performance tips.

1. DDL – Data Definition Language

Definition: DDL statements are used to define or alter the structure of the database objects. In BigQuery, DDL can create, modify, and delete datasets (schemas), tables, views, materialized views, table snapshots, table clones, functions (UDFs), procedures, and row-level access policies (Learn All About DDL Commands). DDL operations affect the schema/metadata but not the actual data contents (Create and Delete Tables in BigQuery: A 2025 DDL Guide).

BigQuery’s legacy SQL did not support DDL in queries (table management had to be done via the UI or CLI). BigQuery Standard SQL (also called GoogleSQL) introduced in-query DDL support. Key BigQuery DDL statements include CREATE, ALTER, and DROP, each extended with BigQuery-specific options:

  • CREATE – Used to create new datasets, tables, views, routines, etc. BigQuery’s CREATE TABLE supports additional clauses for partitioning and clustering data for performance. You can also use CREATE TABLE ... AS SELECT (CTAS) to create a table from a query result in one statement (How to Create a Table in BigQuery: A Step-by-Step Guide). For example, the following creates a partitioned, clustered table:

      -- Create a partitioned table for sales data, partitioned by date and clustered by product
      CREATE TABLE `project_id.my_dataset.sales` (
        sale_date DATE,
        product STRING,
        quantity INT64,
        price NUMERIC
      )
      PARTITION BY sale_date
      CLUSTER BY product;
    

    In the above, PARTITION BY and CLUSTER BY are BigQuery-specific extensions that define date or integer-range partitions and column clustering, both of which help optimize query performance (BigQuery Best Practices to Optimize Cost and Performance). BigQuery also allows adding a table description or labels via the OPTIONS clause in DDL. For example, you could append OPTIONS(description="Sales data", labels=[("team","finance")]) to the CREATE TABLE to document the table (see the sketch below).

    BigQuery datasets are analogous to schemas: use CREATE SCHEMA dataset_name (Standard SQL) to create a new dataset (Create and Delete Tables in BigQuery: A 2025 DDL Guide). Similarly, CREATE VIEW creates a logical view from a query, and CREATE MATERIALIZED VIEW creates a view that caches results for faster reuse.
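
    As a quick, minimal sketch (the dataset, table, and view names here are hypothetical), the OPTIONS clause, CREATE SCHEMA, and CREATE MATERIALIZED VIEW can be combined like this:

      -- Create a dataset with a description
      CREATE SCHEMA IF NOT EXISTS `project_id.reporting`
      OPTIONS(description = "Reporting dataset");

      -- Create a documented, partitioned table inside it
      CREATE TABLE `project_id.reporting.daily_sales` (
        sale_date DATE,
        product STRING,
        amount NUMERIC
      )
      PARTITION BY sale_date
      OPTIONS(description = "Daily sales", labels = [("team", "finance")]);

      -- Cache a frequently used aggregation
      CREATE MATERIALIZED VIEW `project_id.reporting.daily_totals` AS
      SELECT sale_date, SUM(amount) AS total_amount
      FROM `project_id.reporting.daily_sales`
      GROUP BY sale_date;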

  • ALTER – Used to modify existing objects. BigQuery allows altering table schemas flexibly. For example, you can add a column to a table without rewriting it:

      ALTER TABLE `my_dataset.sales` 
      ADD COLUMN IF NOT EXISTS comments STRING;
    

    This adds a new nullable field comments to the sales table. Other BigQuery ALTER operations include ALTER TABLE SET OPTIONS (to change table metadata like description or default expiration), ALTER COLUMN (to change column options like descriptions), or renaming tables/columns. You can also alter datasets (schemas) – e.g. ALTER SCHEMA my_dataset SET OPTIONS(default_table_expiration_days=90) to set a default expiration for tables in a dataset.
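
    As a small sketch of those variants (the option values are made up for illustration):

      -- Update table-level metadata
      ALTER TABLE `my_dataset.sales`
      SET OPTIONS(description = "Sales transactions, one row per line item");

      -- Document a single column
      ALTER TABLE `my_dataset.sales`
      ALTER COLUMN comments SET OPTIONS(description = "Free-text notes");

      -- Set a default expiration for new tables in the dataset
      ALTER SCHEMA my_dataset
      SET OPTIONS(default_table_expiration_days = 90);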

  • DROP – Removes BigQuery objects. For example, DROP TABLE my_dataset.sales; will delete the sales table. BigQuery also supports DROP SCHEMA my_dataset CASCADE; to delete a dataset along with all its tables. (Legacy SQL required using the API or console for such operations, since it lacked DDL.)

Example – Creating and Dropping a Table:

-- Create a new table from a query (CTAS example):
CREATE TABLE `project_id.my_dataset.top_customers` AS
SELECT customer_id, SUM(total_spend) AS total_spend
FROM `project_id.my_dataset.orders`
GROUP BY customer_id
HAVING SUM(total_spend) > 1000;

-- Later, drop the table if no longer needed:
DROP TABLE `project_id.my_dataset.top_customers`;

The above CREATE TABLE ... AS SELECT will create top_customers with the results of the query (How to Create a Table in BigQuery: A Step-by-Step Guide). The DROP TABLE statement then deletes it. BigQuery DDL statements can include IF EXISTS/IF NOT EXISTS to guard against errors when objects may or may not exist (Create and Delete Tables in BigQuery: A 2025 DDL Guide).

2. DQL – Data Query Language

Definition: DQL refers to data query statements, primarily the SELECT statement and its clauses used to retrieve data. In BigQuery, SELECT queries support standard SQL semantics and many BigQuery-only extensions. (Legacy SQL mode also allowed SELECT queries but with different functions and handling of nested data.)

A basic BigQuery SELECT query supports the usual clauses: SELECT ... FROM ... JOIN ... WHERE ... GROUP BY ... HAVING ... ORDER BY ... LIMIT. BigQuery’s Standard SQL has full support for SQL joins, subqueries, and set operations. The WHERE clause filters rows before aggregation, and HAVING filters groups after aggregation (more on HAVING in Section 7). BigQuery’s implementation follows standard evaluation order: WHERE filters, then GROUP BY aggregates, then HAVING filters aggregated groups (A Complete Guide to WHERE, HAVING, and QUALIFY in SQL).

BigQuery-specific Query Features:

  • SELECT * EXCEPT / REPLACE: BigQuery extends SELECT * with the ability to exclude or replace specific columns. For example, SELECT * EXCEPT(password) will select all columns except password, and SELECT * REPLACE(customer_id * 100 AS customer_id) would select all columns but use a calculated value in place of customer_id. This is unique to BigQuery’s SQL (and now adopted by some other databases). Usage is straightforward: SELECT * EXCEPT(col1, col2) FROM ... to drop columns (Select All Columns Except Some in Google BigQuery? - Stack Overflow). You can even combine them, e.g. SELECT * EXCEPT(id) REPLACE("widget" AS product_name) FROM products; (Select All Columns Except Some in Google BigQuery? - Stack Overflow). This saves time when you need most columns but not all.
  • Window Functions (OVER ... PARTITION BY): BigQuery fully supports SQL window functions for running totals, ranking, lag/leads, etc. You can include expressions like SUM(amount) OVER(PARTITION BY region ORDER BY date) in the SELECT clause to compute a running total, or RANK() OVER(PARTITION BY region ORDER BY sales DESC) to rank rows within partitions. (We cover window functions in detail in Section 8.) These use the OVER() syntax with optional PARTITION BY and ORDER BY. BigQuery even lets you filter on these results in the same query via the QUALIFY clause. Legacy SQL offered only a limited set of analytic functions; the full standard behavior described here comes with Standard SQL.
  • Arrays and Structs (Nested Data): BigQuery can store nested and repeated fields (arrays and structs) in tables, and its SQL can directly query such structures. For example, a table can have a column which is an ARRAY of STRUCTs. In Standard SQL, you use the UNNEST() function in the FROM clause to flatten arrays for analysis. BigQuery’s ability to handle nested data is a distinguishing feature – it allows denormalized schemas (e.g., an order with an array of line-items) that you can still query with SQL. (Legacy SQL had a different approach using FLATTEN() for repeated fields.) For instance:

      SELECT order_id, item.product_name, item.quantity
      FROM `project.dataset.orders`, 
           UNNEST(items) AS item
      WHERE item.product_category = 'Electronics';
    

    Here, items is an ARRAY field in orders; UNNEST(items) produces a table of item structs, which we alias as item. This query will output one row per array element. BigQuery’s SQL treats each element as a row but still associates it with its parent order. This nested data capability allows BigQuery to avoid expensive joins by keeping related data together (BigQuery Best Practices to Optimize Cost and Performance) (the data is physically nested). It’s a powerful feature not found in many SQL dialects.

  • Wildcard Table Queries: BigQuery can query multiple tables in one go using wildcards or table decorators. For example, you might have monthly sharded tables like sales_2019Jan, sales_2019Feb, ... and run a query across all of them with FROM `project.dataset.sales_2019*`. BigQuery provides a pseudo-column _TABLE_SUFFIX to filter which underlying tables to include (see the sketch after this list). (In legacy SQL, functions like TABLE_DATE_RANGE() were used for similar effect.) This helps when data is sharded into many tables by date.
  • Set Operators: BigQuery supports standard SQL set operations: UNION [ALL], EXCEPT, and INTERSECT. These allow combining result sets from multiple SELECT queries. BigQuery follows the standard behavior (e.g., UNION deduplicates unless ALL is specified).
  • Advanced Grouping: BigQuery Standard SQL (recent versions) supports grouping sets, rollups, and cubes for aggregation. For example, you can do GROUP BY ROLLUP(region, product) to get subtotals by region and product combinations, or use GROUP BY CUBE or explicit GROUPING SETS. These were not available in legacy SQL. (Support for GROUP BY ROLLUP, CUBE, GROUPING SETS was added to BigQuery’s SQL (Google BigQuery now supports Cubes | by Christianlauer - Medium).)
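
Below is a small sketch combining two of the features above – SELECT * EXCEPT and a wildcard table query filtered with _TABLE_SUFFIX. The sharded table name and the internal_notes/region columns are hypothetical:

-- Read only the Jan–Mar 2019 shards and drop a column we don't need
SELECT * EXCEPT(internal_notes)
FROM `project_id.my_dataset.sales_2019*`
WHERE _TABLE_SUFFIX IN ('Jan', 'Feb', 'Mar')
  AND region = 'US';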

Example – BigQuery Query with Unique Features:

SELECT 
  category,
  product,
  SUM(sales) AS total_sales,
  RANK() OVER(PARTITION BY category ORDER BY SUM(sales) DESC) AS product_rank_in_cat
FROM `my_dataset.product_sales`
GROUP BY category, product
HAVING total_sales > 1000
QUALIFY product_rank_in_cat = 1

In this (contrived) example, we use GROUP BY to aggregate sales by category and product, HAVING to keep only products with >1000 sales in a category, and a window function (RANK() OVER(PARTITION BY category ORDER BY SUM(sales) DESC)) to rank products by sales within each category. Finally, QUALIFY product_rank_in_cat = 1 filters the results to return only the top product per category. This single query finds the top-selling product in each category with over 1000 sales, demonstrating BigQuery’s ability to combine aggregation, window functions, and qualify filtering. (Note: The SUM(sales) inside the RANK() would actually be computed as an analytic function – BigQuery allows using aggregations in analytic functions by computing them over the partition – here it effectively ranks by the same SUM per partition.)

3. DML – Data Manipulation Language

Definition: DML statements modify table data (insert, update, delete rows). BigQuery’s DML enables adding or changing data in BigQuery tables via SQL, which was a major enhancement over the early append-only model. Standard DML in BigQuery includes INSERT, UPDATE, DELETE, and the combined MERGE statement (Bigquery SQL how to update rows AND Insert new data - Stack Overflow). These statements operate in BigQuery with some constraints (for example, BigQuery executes DML on a snapshot of the table to avoid conflicts, and prior to 2020 there were quotas on the number of DML operations per day, which have since been removed (DML without limits, now in BigQuery | Google Cloud Blog)).

INSERT: Adds new rows to a table. BigQuery supports two forms: inserting explicit values, or inserting the results of a query. For example:

-- Insert a single row with explicit values:
INSERT INTO `my_dataset.users` (user_id, name, signup_date)
VALUES (123, 'Alice', CURRENT_DATE);

-- Insert multiple rows using a subquery:
INSERT INTO `my_dataset.gold_customers` (customer_id, total_spend)
SELECT customer_id, SUM(amount)
FROM `my_dataset.sales`
GROUP BY customer_id
HAVING SUM(amount) > 10000;

The first INSERT adds one row to users. The second inserts the results of a query (all customers with >10000 total spend) into the gold_customers table. BigQuery’s Standard SQL does not require a VALUES list for each row if using a SELECT; it will append all rows from the query. (Legacy SQL did not support DML; data was usually loaded in bulk or via streaming inserts.)

UPDATE: Modifies existing rows, setting new values for some columns where a condition is met. BigQuery’s UPDATE syntax allows a FROM clause for more complex updates (e.g. updating a table based on a join with another table). Example:

UPDATE `my_dataset.users`
SET last_login = CURRENT_TIMESTAMP
WHERE user_id = 123;

This updates the last_login timestamp for the user with ID 123. If we needed to update based on another table (say we have a staging table of latest logins), we could do something like:

UPDATE `my_dataset.users` AS u
SET last_login = s.new_login_time
FROM `my_dataset.login_updates` AS s
WHERE u.user_id = s.user_id;

BigQuery executes the update in a single pass (it's atomic). Under the hood, BigQuery may rewrite the entire table or the modified partitions, but the user sees it as an in-place update.

DELETE: Removes rows that meet a condition. For example:

DELETE FROM `my_dataset.users`
WHERE is_active = FALSE AND last_login < DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR);

This would delete users who are marked inactive and have not logged in for over a year. As with updates, deletes in BigQuery are atomic. (BigQuery also supports TRUNCATE TABLE to remove all rows from a table at once; alternatively you can DELETE all rows with WHERE TRUE, or DROP the table entirely.)
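
As a quick sketch (the staging table name is hypothetical), emptying a table looks like this:

-- Remove every row but keep the table and its schema
TRUNCATE TABLE `my_dataset.staging_events`;

-- Equivalent row removal expressed as DML
DELETE FROM `my_dataset.staging_events` WHERE TRUE;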

MERGE: BigQuery’s MERGE statement combines insert, update, and delete logic in one operation – essentially an “upsert” capability. MERGE allows you to compare a target table with a source (such as a stage or delta table) and specify actions for when the rows match or don’t match. BigQuery performs the specified inserts/updates/deletes atomically as one transaction (Bigquery SQL how to update rows AND Insert new data - Stack Overflow). This is particularly useful for incremental data pipelines where you apply changes (new records and updates) to a master table.

A typical MERGE example is merging daily new data into a master table:

MERGE `my_dataset.Inventory` AS T
USING `my_dataset.NewArrivals` AS S
ON T.product_id = S.product_id
WHEN MATCHED THEN 
  UPDATE SET T.quantity = T.quantity + S.quantity
WHEN NOT MATCHED THEN 
  INSERT(product_id, product_name, quantity) 
  VALUES(S.product_id, S.product_name, S.quantity);

In this example, the Inventory table is updated by adding quantities for products that already exist (matched on product_id), and inserting new rows for products that are not yet in Inventory (Transaction Management in Google Bigquery - Stack Overflow). The MERGE can have multiple WHEN clauses, including WHEN MATCHED [AND condition] THEN DELETE to delete rows that meet some condition. All these changes (updates to some rows, inserts of others) happen in one combined operation. This MERGE effectively keeps the Inventory table in sync with the NewArrivals table.

BigQuery’s MERGE syntax and behavior are similar to ANSI SQL MERGE (as in Oracle or SQL Server). Note that when using MERGE, the source and target can each be a table or subquery, giving flexibility to, for example, merge the result of an aggregation into a table. Because MERGE is atomic, it can be used within a multi-statement transaction (see Section 5) or as a standalone way to ensure data consistency.
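
As a further sketch (hypothetical tables, reusing the Inventory/NewArrivals names from above), a MERGE can also take a subquery as its source and include a conditional DELETE clause:

MERGE `my_dataset.Inventory` AS T
USING (
  -- Aggregate the arrivals so each product appears once in the source
  SELECT product_id, ANY_VALUE(product_name) AS product_name, SUM(quantity) AS quantity
  FROM `my_dataset.NewArrivals`
  GROUP BY product_id
) AS S
ON T.product_id = S.product_id
WHEN MATCHED AND S.quantity = 0 THEN
  DELETE   -- treat a zero-quantity arrival as a discontinuation
WHEN MATCHED THEN
  UPDATE SET T.quantity = T.quantity + S.quantity
WHEN NOT MATCHED THEN
  INSERT (product_id, product_name, quantity)
  VALUES (S.product_id, S.product_name, S.quantity);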

Legacy SQL vs Standard SQL for DML: Legacy SQL did not support these DML statements in queries. Data manipulation had to be done outside of query jobs (e.g., via the BigQuery API, or by writing query results to a new table). With Standard SQL, BigQuery can handle transactional modifications. Initially BigQuery imposed a limit of 1,000 DML statements per table per day (to maintain performance) (Google Big Query: "Exceed rate limits" - How to upload large tables), but as of 2020 this limit has been removed, allowing unlimited DML on tables (DML without limits, now in BigQuery | Google Cloud Blog). Still, very large or complex DML operations may be slower than using BigQuery’s fast load jobs or using MERGE to batch changes, so usage should be planned accordingly.

4. DCL – Data Control Language

Definition: DCL statements control access permissions on database objects (tables, views, etc.) – typically GRANT and REVOKE. In BigQuery, access control is usually managed by Google Cloud IAM (Identity and Access Management), where you assign roles like BigQuery Data Viewer or Data Editor at the project or dataset level. However, BigQuery has introduced SQL DCL syntax to allow granting and revoking fine-grained permissions at the SQL level for convenience. These statements effectively map to IAM under the hood and let you manage table or dataset access with SQL commands (How to Create a Table in BigQuery: A Step-by-Step Guide).

Using BigQuery’s DCL requires appropriate permissions (you generally need to be a BigQuery admin or the owner of the dataset/table to change its IAM policy). You can grant permissions to users, groups, or service accounts.

  • GRANT: Gives a user, group, or service account access to a BigQuery resource by granting an IAM role on it (such as read access on a table). Unlike dialects that grant individual privileges like SELECT, BigQuery’s GRANT takes a role name. For example, to grant read access on a table to a specific user:

      GRANT `roles/bigquery.dataViewer` ON TABLE `my_project.my_dataset.sales` 
      TO "user:john.doe@example.com";
    

    This statement grants read/query access on the sales table to the user john.doe@example.com by giving them the BigQuery Data Viewer role on that table. In effect, this user would be able to query the table. Similarly, you could grant at the dataset (schema) level. For instance, to allow a group to read all tables in a dataset:

      GRANT `roles/bigquery.dataViewer` ON SCHEMA `my_project.finance_data` 
      TO "group:analysts@example.com";
    

    This would grant read access on all current and future tables in the finance_data dataset to all members of the analysts@example.com group. (Behind the scenes, BigQuery creates an entry in the dataset’s access list for that group.)

    BigQuery’s GRANT works in terms of IAM roles rather than individual privileges such as SELECT, INSERT, or UPDATE: for read access you grant roles/bigquery.dataViewer, and for read/write access you grant roles/bigquery.dataEditor, at either the table or the dataset level. (In the GRANT syntax, BigQuery identifies tables vs. schemas using the ON TABLE or ON SCHEMA keywords.)
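
    Access control can also be applied at the row level. As a minimal sketch (the policy name, filter column, and group are hypothetical), a row-level access policy restricts which rows a grantee can see:

      CREATE ROW ACCESS POLICY us_only
      ON `my_project.my_dataset.sales`
      GRANT TO ("group:us-analysts@example.com")
      FILTER USING (region = "US");

      -- Remove the policy later:
      DROP ROW ACCESS POLICY us_only ON `my_project.my_dataset.sales`;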

  • REVOKE: Removes previously granted permissions. For example, if we want to revoke the read access we granted above:

      REVOKE `roles/bigquery.dataViewer` ON TABLE `my_project.my_dataset.sales` 
      FROM "user:john.doe@example.com";
    

    This will strip that user’s access to the table (assuming they didn’t have access via some other route). Similarly, REVOKE ... ON SCHEMA ... FROM ... would remove dataset-level access. Revokes take effect immediately – the next query the user tries on that table will fail with a permission error.

Managing Permissions Example: Suppose we have a dataset analytics and we want to grant a data scientist read-only access to a specific table analytics.marketing_data without giving access to the whole project. We can run:

GRANT `roles/bigquery.dataViewer` ON TABLE `my_project.analytics.marketing_data` 
TO "user:data.scientist@company.com";

Now that user can query marketing_data but no other tables (unless separately granted). If later we decide they shouldn’t see it, we do:

REVOKE `roles/bigquery.dataViewer` ON TABLE `my_project.analytics.marketing_data` 
FROM "user:data.scientist@company.com";

Under the hood, these DCL statements are updating BigQuery’s IAM policy for that table (or dataset). Note that some privileges (like dataset-level roles) might map to broader IAM roles. Also, BigQuery has the concept of authorized views (a view that exposes certain data and can be shared instead of the raw table) – those are managed by granting a view access to a dataset, which can also be done via DCL by granting to a special identifier representing the view.

Legacy SQL: Legacy BigQuery had no SQL GRANT/REVOKE; all permission management was via Cloud IAM in the console or command-line. The introduction of DCL in Standard SQL makes it easier to script and automate access control. It’s important to adhere to the principle of least privilege (only grant the minimum required access) (How to use grant role in BigQuery?), especially since BigQuery often holds sensitive data.

5. TCL – Transaction Control Language

Definition: TCL statements manage transactions – units of work that can be committed or rolled back together. In traditional SQL, this includes commands like BEGIN (start transaction), COMMIT (save changes), and ROLLBACK (undo changes). BigQuery historically operated on an append-only model without transactions (each query was its own atomic operation). However, BigQuery now supports multi-statement transactions inside BigQuery scripting (and within stored procedures or via API) for executing multiple DML statements as a single atomic unit (Transaction Management in Google Bigquery - Stack Overflow).

BigQuery Multi-statement Transactions: In BigQuery scripting, you can start a transaction with BEGIN TRANSACTION;, execute a series of statements (DML or even certain DDL on temporary tables), then end with COMMIT TRANSACTION; to apply changes atomically, or ROLLBACK TRANSACTION; to cancel if something went wrong (Now Available: Multi Statements Transactions in BigQuery - Medium) (BigQuery transactions over multiple queries, with sessions). All the statements in between will either all succeed (on commit) or all be undone (on rollback). This is crucial when you need to ensure consistency across multiple operations – for example, updating two tables in sync, or doing a “delete then insert” (replace) safely.

Example Transaction: Suppose we want to transfer inventory from one warehouse to another. We need to deduct from one table and add to another, and ensure both succeed or neither does. In a BigQuery script, we could do:

BEGIN TRANSACTION;

-- 1. Insert new stock into Warehouse B's table
INSERT INTO `my_dataset.warehouseB_inventory` (product_id, qty)
SELECT product_id, qty 
FROM `my_dataset.new_stock_delivery`
WHERE warehouse = 'B';

-- 2. Remove that stock from Warehouse A's table
DELETE FROM `my_dataset.warehouseA_inventory`
WHERE product_id IN (
    SELECT product_id FROM `my_dataset.new_stock_delivery` WHERE warehouse = 'B'
);

COMMIT TRANSACTION;

In this example, we start a transaction. Then we INSERT rows for products delivered to Warehouse B and DELETE those products from Warehouse A’s inventory. Only when both statements execute successfully do we call COMMIT to finalize the changes. If any error occurred in between (or if a condition we check indicates a problem), we could issue ROLLBACK TRANSACTION; instead, and BigQuery would undo any partial changes from the insert/delete (in practice, BigQuery would not have made them permanent yet until commit).
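
In a BigQuery script you can make the rollback path explicit with an exception handler. A minimal sketch of that pattern (using the same hypothetical tables as above):

BEGIN
  BEGIN TRANSACTION;

  INSERT INTO `my_dataset.warehouseB_inventory` (product_id, qty)
  SELECT product_id, qty
  FROM `my_dataset.new_stock_delivery`
  WHERE warehouse = 'B';

  DELETE FROM `my_dataset.warehouseA_inventory`
  WHERE product_id IN (
    SELECT product_id FROM `my_dataset.new_stock_delivery` WHERE warehouse = 'B'
  );

  COMMIT TRANSACTION;

EXCEPTION WHEN ERROR THEN
  -- Undo any partial changes if either statement failed, then re-raise the error
  ROLLBACK TRANSACTION;
  RAISE;
END;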

BigQuery’s multi-statement transactions can span multiple tables (even across datasets or projects) and multiple DML operations (Transaction Management in Google Bigquery - Stack Overflow). All locks are managed behind the scenes by BigQuery. Note that BigQuery transactions are scoped within a single script execution or session – you cannot yet have an interactive multi-step transaction across separate query jobs; it must be done in one script or procedure call.

Inside a transaction, you can also create temporary tables or use SELECT queries to assist your logic. BigQuery currently does not allow arbitrary DDL on permanent tables inside a transaction (you can only create temp tables or do certain DDL like creating a temp function).

If you don’t explicitly use BEGIN/COMMIT, each DML statement in BigQuery is by default atomic on its own. For many use cases, a single MERGE is enough to apply multiple changes atomically (obviating the need for an explicit transaction). But if you do need to break a complex operation into multiple steps, BigQuery transactions ensure all-or-nothing execution (Transaction Management in Google Bigquery - Stack Overflow).

Legacy SQL: There was no concept of user-controlled transactions in legacy BigQuery. Every query was standalone. The introduction of scripting and transactions in Standard SQL (around 2020) was a significant improvement for ETL workflows that require multi-step operations with rollback on failure.

6. Aggregations and GROUP BY in BigQuery

Aggregations (using functions like SUM, AVG, COUNT, etc. along with GROUP BY clauses) work as in standard SQL, with some BigQuery enhancements.

When you use an aggregate function in a SELECT, you normally need a GROUP BY clause to define how rows are grouped (except when aggregating the entire table). BigQuery Standard SQL follows the standard rule: every non-aggregated select expression must be either in the GROUP BY or be an aggregation of a group. (Legacy SQL was more permissive: it would implicitly treat non-grouped fields as ANY_VALUE, but this could lead to indeterminate results. Standard SQL is stricter, enforcing proper grouping or aggregation.)

Basic Example:

SELECT department, AVG(salary) AS avg_salary, MAX(salary) AS max_salary
FROM `company.employees`
GROUP BY department;

This returns one row per department with the average and maximum salary in each. BigQuery handles large-scale aggregations efficiently in its distributed engine.

BigQuery-Specific Behavior and Functions:

  • Aggregates over zero rows: If no rows match the WHERE clause and there is no GROUP BY, COUNT returns 0 (not NULL), while functions like SUM and AVG return NULL; with a GROUP BY, no groups exist, so no rows are returned at all. This is standard SQL behavior, and BigQuery adheres to it.
  • Approximate Aggregation: For very large datasets, exact distinct counts or quantiles can be expensive. BigQuery provides approximate aggregate functions that trade a tiny error for big speed gains. For example, APPROX_COUNT_DISTINCT(column) uses HyperLogLog++ to estimate the number of distinct values in a column (How to reduce BigQuery costs: 10 effective life hacks for optimization | Lead Panda Media). The result is usually within about 1% of the true value, but the computation is faster and uses less memory, which can be crucial at petabyte scale. Similarly, BigQuery has APPROX_TOP_COUNT and APPROX_TOP_SUM for approximate heavy-hitter analysis. Use these when exact precision isn’t required – it can dramatically reduce query cost on huge tables (How to reduce BigQuery costs: 10 effective life hacks for optimization | Lead Panda Media). (Legacy SQL had COUNT(DISTINCT x) but would error if the number of distinct elements was too high; approximate functions solve that problem in Standard SQL.)
  • COUNTIF and conditional aggregation: BigQuery includes the convenient conditional aggregate COUNTIF(condition), which counts rows where the condition is true (How to reduce BigQuery costs: 10 effective life hacks for optimization | Lead Panda Media). For example, COUNTIF(status = "ERROR") counts only error status rows. For conditional sums, use SUM(IF(condition, expr, 0)) or SUM(CASE WHEN ... END); a short example follows after this list.
  • ANY_VALUE(): BigQuery supports ANY_VALUE(field) which returns an arbitrary value of field from within each group. This is useful when you know all values in the group are the same (or you don’t care which one is picked) and you want to avoid grouping by it. (MySQL has a similar function.) For example, if you group by user_id and want to pick any single email for that user (assuming email doesn’t vary per row), you could do: SELECT user_id, ANY_VALUE(email) FROM ... GROUP BY user_id;. This function was introduced to help with migrating queries from legacy SQL’s behavior of non-grouped columns (Select All Columns Except Some in Google BigQuery? - Stack Overflow).
  • Grouping Sets / Rollup / Cube: As mentioned, BigQuery allows advanced grouping. For instance:

      SELECT region, product, SUM(sales) as total_sales
      FROM sales
      GROUP BY ROLLUP(region, product);
    

    This query would produce aggregated totals at multiple levels: by (region, product), by region (overall product total per region), and a grand total (all regions/products) (Google BigQuery now supports Cubes | by Christianlauer - Medium). GROUP BY CUBE(a, b) would produce all combinations (by a, by b, by both, by neither). And GROUPING SETS allows explicit list of group combinations. BigQuery also provides the GROUPING() function to identify subtotal rows (returning 1 for subtotal vs 0 for detail row). These features let you produce pivot-table style summaries in a single query. (These capabilities did not exist in legacy SQL; they are part of Standard SQL improvements.)

  • Distinct Aggregates: BigQuery can do COUNT(DISTINCT col) like other SQLs. One nuance: BigQuery in some cases allows multiple distinct aggregates in the same query (e.g., COUNT(DISTINCT col1), COUNT(DISTINCT col2)), which some SQL databases do not unless you use more complex workarounds. BigQuery can handle multiple distinct counts by internally using approximation or additional query stages. Just be mindful of performance: each distinct aggregate may add overhead.
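
Example – Conditional Aggregation (a short sketch; the web_logs columns used here are assumed):

SELECT 
  country,
  COUNTIF(status_code >= 500) AS error_requests,               -- count only error rows
  SUM(IF(status_code >= 500, bytes_sent, 0)) AS error_bytes,   -- conditional sum
  ANY_VALUE(region_name) AS region_name                        -- any value; same per country
FROM `my_dataset.web_logs`
GROUP BY country;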

Example – Using APPROX_COUNT_DISTINCT:

SELECT 
  country, 
  APPROX_COUNT_DISTINCT(user_id) AS approx_unique_users
FROM `my_dataset.web_logs`
GROUP BY country;

This query quickly estimates the number of unique users per country in the web_logs table using HyperLogLog++ (How to reduce BigQuery costs: 10 effective life hacks for optimization | Lead Panda Media). The approximation significantly reduces computational load for extremely large log tables with minimal loss of accuracy (usually ~1% error). If exact counts are needed, you could use COUNT(DISTINCT user_id) at higher cost.

Example – Grouping Sets (Rollup):

SELECT 
  region, product, 
  SUM(sales) AS total_sales,
  GROUPING(region) AS region_total_flag,
  GROUPING(product) AS product_total_flag
FROM `my_dataset.sales`
GROUP BY ROLLUP(region, product);

This will produce rows for each (region, product) combination, plus subtotal rows where product is NULL (total per region) and a grand total row where both region and product are NULL (Google BigQuery now supports Cubes | by Christianlauer - Medium). The GROUPING() function returns 1 when the column is NULL because of a subtotal. For example, the grand total row will have region_total_flag=1 and product_total_flag=1. Such SQL constructs can eliminate the need for manual UNION of multiple grouping queries.

7. Using HAVING with Aggregations

The HAVING clause is used to filter aggregated results. It is applied after the GROUP BY step, unlike WHERE which filters before grouping. In BigQuery (as in standard SQL), you use HAVING to impose conditions on aggregate values (SUM, COUNT, AVG, etc.) for each group (A Complete Guide to WHERE, HAVING, and QUALIFY in SQL).

For example, if we want to find departments with more than 3 employees, we would use HAVING on a COUNT:

SELECT department, COUNT(employee_id) AS num_employees
FROM `owox-analytics.myDataset.employee_data`
WHERE status = 'active'
GROUP BY department
HAVING COUNT(employee_id) > 3;

This query groups employees by department (considering only active employees due to the WHERE clause), then the HAVING clause filters out any groups that have 3 or fewer employees (A Complete Guide to WHERE, HAVING, and QUALIFY in SQL). The result will only include departments with count > 3.

Let’s break down the logic:

  • The WHERE filter (status = 'active') runs first, limiting rows to active employees before grouping.
  • GROUP BY department then aggregates the data so we have one row per department.
  • COUNT(employee_id) computes the number of employees in each department.
  • The SELECT outputs the department name and the count (alias num_employees).

So, if “HR” had 5 active employees and “Sales” had 2, the result would include “HR, 5” but not “Sales, 2”.

Important notes about HAVING in BigQuery Standard SQL:

  • You can reference aggregate expressions by alias in HAVING. In our example, we could have written HAVING num_employees > 3 because we alias COUNT(employee_id) as num_employees. BigQuery (like many SQLs) allows this alias usage in HAVING.
  • If there is no GROUP BY, HAVING can still be used – it treats the entire result as one group. E.g., SELECT SUM(x) as total FROM table HAVING total > 100 would return nothing if the total is not >100. But usually, HAVING is paired with GROUP BY.
  • Legacy SQL in BigQuery also had HAVING, with similar usage, but in legacy SQL you might see HAVING used without GROUP BY as a workaround to filter on an aggregate. In Standard SQL, you can often use a window function + QUALIFY or a subquery instead, but HAVING remains the direct way to filter grouped results.

Example – HAVING vs WHERE:

Suppose you want all products that have total sales > 1000 in a sales table:

-- Incorrect: This will fail or return wrong results
SELECT product, SUM(amount) 
FROM sales
WHERE SUM(amount) > 1000   -- not allowed, aggregate in WHERE
GROUP BY product;

This is invalid because you cannot use an aggregate (SUM(amount)) in a WHERE clause. The correct approach is to use HAVING:

SELECT product, SUM(amount) as total_sales
FROM sales
GROUP BY product
HAVING total_sales > 1000;

Now, HAVING total_sales > 1000 will filter the grouped products, and only those with sum > 1000 remain. In BigQuery, this works as expected (you could also write HAVING SUM(amount) > 1000 directly) (A Complete Guide to WHERE, HAVING, and QUALIFY in SQL). If you had an initial filter on individual rows (say, WHERE region = 'US' to consider only US sales), that would be applied before the grouping.

To summarize, use WHERE for conditions on raw rows (especially non-aggregated columns) and HAVING for conditions on aggregated results. BigQuery enforces this order of execution, just like standard SQL, ensuring that HAVING only sees grouped results (A Complete Guide to WHERE, HAVING, and QUALIFY in SQL).

8. Window Functions (OVER, PARTITION BY) – Advanced SQL Analytics

Window functions (also known as analytic functions) are powerful in BigQuery for performing calculations across sets of rows related to the current row, without collapsing those rows into a single result. They use the OVER(...) clause to define a “window” of rows for the calculation. BigQuery’s implementation is fully compliant with standard SQL window functions and adds a few functions of its own.

What Are Window Functions?
Unlike GROUP BY aggregations which reduce rows (one result per group), window functions produce a value for each row while looking at a window of multiple rows (which you define using PARTITION BY and ORDER BY in the OVER clause). This lets you compute running totals, ranks, moving averages, percentiles, lead/lag comparisons, etc., all while retaining the detail of each row (Using Window Functions in BigQuery: A 2025 Guide). They are called “window” functions because each calculation considers a frame of rows (the window) relative to the current row’s position.

Basic Syntax:

function_name(expression) 
OVER (
  [PARTITION BY partition_columns...] 
  [ORDER BY sort_columns [ASC|DESC]] 
  [window_frame_clause]
)

  • PARTITION BY splits the rows into separate partitions; the function is evaluated independently within each partition (omit it to treat the whole result set as one partition).
  • ORDER BY defines the ordering within each partition that the window function will use (e.g., for running totals or ranking).
  • The optional frame clause (like ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) further refines which rows in the partition are considered for that row’s calculation.

BigQuery supports three main categories of window functions (Using Window Functions in BigQuery: A 2025 Guide):

  1. Aggregate functions as window functions: e.g. SUM(), AVG(), MIN(), MAX(), COUNT() can be used as window functions by adding OVER(). This produces a running or total aggregate value per row rather than one per group. By default, without a frame clause, SUM() OVER (PARTITION BY X) gives a total sum per partition (like a group total attached to each row of the group), and SUM() OVER (ORDER BY Y) gives a cumulative sum from the start up to the current row (the default frame is UNBOUNDED PRECEDING ... CURRENT ROW if ORDER BY is present).
  2. Ranking functions: ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE(n). These assign rank numbers based on ordering within partitions. For example, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) will give 1 to the highest-paid person in each department, 2 to the next, and so on.
  3. Analytic functions for navigation: e.g. LAG(value, N), LEAD(value, N) to pull data from previous or next rows, FIRST_VALUE(value)/LAST_VALUE(value) to get first/last in the window, etc. These allow comparisons across row boundaries (e.g., compare this row’s value to last week’s value in a time series).

Example 1 – Ranking: Let’s rank employees by salary within each department:

SELECT 
  employee_id,
  department,
  salary,
  RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank
FROM `company.employees`;

This will output each employee and their rank within their department by salary (1 = highest salary in that dept, etc.). The PARTITION BY department means ranking restarts for each department (Using Window Functions in BigQuery: A 2025 Guide). The ORDER BY salary DESC means highest salary gets rank 1. Employees with equal salaries get the same rank number (and rank numbers will have gaps if there's a tie, since RANK is being used; use DENSE_RANK() if you want no gaps). If we wanted a global rank ignoring departments, we’d omit the partition clause. If we wanted row number instead (no gaps, strict ordering), we’d use ROW_NUMBER(). This kind of query could help, for example, to find the top 3 earners in each department (you would then add QUALIFY salary_rank <= 3 in BigQuery to filter the top 3 per dept).
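
For instance, a quick sketch of that top-3 filter using QUALIFY (same employees table as above):

SELECT 
  employee_id,
  department,
  salary,
  RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank
FROM `company.employees`
QUALIFY salary_rank <= 3
ORDER BY department, salary_rank;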

Example 2 – Running Total: Suppose we have a table of daily sales and we want a running cumulative sales amount by date:

SELECT 
  sales_date,
  amount,
  SUM(amount) OVER (ORDER BY sales_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM `my_dataset.daily_sales`
ORDER BY sales_date;

This uses SUM(amount) OVER (ORDER BY sales_date ... CURRENT ROW) to calculate a cumulative sum up to the current row (assuming one row per date, this is a running total over time). A frame from UNBOUNDED PRECEDING to CURRENT ROW is the default whenever an ORDER BY is present (technically RANGE rather than ROWS, which is equivalent here with one row per date), so we could omit it and BigQuery would still compute a cumulative sum from the start. The result will list each date, the sales for that date, and the total sales from the beginning up through that date. Unlike a GROUP BY, we still see one row per date (not one final total). This is very useful for time-series analysis. In BigQuery, window aggregate functions like this also allow moving averages by using a frame of a certain width (e.g., last 7 days), as sketched below.
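
For example, a sketch of a 7-day moving average over the same daily_sales table (assuming one row per date) uses a bounded frame:

SELECT 
  sales_date,
  amount,
  AVG(amount) OVER (
    ORDER BY sales_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW   -- the current day plus the 6 previous days
  ) AS moving_avg_7d
FROM `my_dataset.daily_sales`
ORDER BY sales_date;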

Example 3 – Partitioned Window: Using the same sales example, if we partition by year and compute running totals within each year:

SELECT 
  EXTRACT(YEAR FROM sales_date) AS year,
  sales_date,
  amount,
  SUM(amount) OVER (
       PARTITION BY EXTRACT(YEAR FROM sales_date) 
       ORDER BY sales_date 
       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS year_running_total
FROM `my_dataset.daily_sales`
ORDER BY sales_date;

Here PARTITION BY EXTRACT(YEAR FROM sales_date) means the running total resets at the start of each year (each year treated separately) (Using Window Functions in BigQuery: A 2025 Guide). So Jan 1 of each year starts a new cumulative sum. Partitioning is helpful when you want window calculations within subgroups (like per department, per year, per region, etc.).

Window vs. GROUP BY: It’s worth noting the difference: GROUP BY collapses rows, but window functions produce values that are attached to each row. For instance, a GROUP BY year to get total sales per year would return 1 row per year. The window SUM...PARTITION BY year returns the total per year on each row of that year’s data (Using Window Functions in BigQuery: A 2025 Guide). You might use that to calculate each day’s share of the annual total, for example. Window functions let you mix detail and summary in one result.

BigQuery Specifics: BigQuery supports all standard window functions. In terms of performance, window functions are efficient in BigQuery, but do consider the data volume – partitioning by a high-cardinality field or not partitioning at all (window over the entire table) means a lot of data to process in each function. If you only need a grouped result, use GROUP BY; use window functions when you need the detail with the analytic result. BigQuery also now supports the QUALIFY clause (as discussed) to filter the output of window functions easily, which is particularly useful with ranking and row_number queries (e.g., QUALIFY RANK() ... = 1 to get top N per group).

Practical Example – Combining Window and Aggregation:

A common use-case is to find the contribution of each row to a group total. We can use a window sum (total per group on each row) along with the row’s value:

SELECT 
  department,
  employee,
  salary,
  SUM(salary) OVER (PARTITION BY department) AS total_dept_salary,
  salary * 100.0 / SUM(salary) OVER (PARTITION BY department) AS percent_of_department
FROM `company.employees`;

This produces each employee’s salary, the total salary of their department (repeated for all in dept), and the percentage of the department’s payroll that the employee’s salary represents. Here SUM(salary) OVER (PARTITION BY department) gives the department total attached to each row (Using Window Functions in BigQuery: A 2025 Guide), and we use it in a calculation. This kind of query would be difficult to do with pure GROUP BY (you’d have to join the aggregated results back to the detail), but window functions make it straightforward.

Legacy SQL: Legacy BigQuery offered only a limited set of analytic (window) functions, so users often fell back on self-joins or subqueries to achieve results like these, which is more cumbersome and sometimes less efficient. The move to Standard SQL with full window function support greatly simplified analytical queries in BigQuery.

9. BigQuery Performance Optimization Tips

BigQuery is a columnar, massively parallel query engine. Query cost and performance are largely determined by how much data you scan and process. Here are some BigQuery-specific optimization tips to make queries efficient:

  • Filter Early and Specifically: Use WHERE clauses to restrict data as much as possible. BigQuery will prune partitions (if partitioned) and skip irrelevant data. Also, filtering on clustered columns leverages sorted storage to read less. For example, if your table is partitioned by date, always include a date range filter in the WHERE clause so BigQuery can scan only the needed partitions.
  • Partition Your Tables: When creating tables, use BigQuery’s partitioning feature for large fact tables. Partition by a date or integer range that you commonly filter on (e.g., PARTITION BY date for a daily log table). Queries with a filter on the partition column will then scan only the relevant partition rather than the entire table, greatly reducing I/O (BigQuery Best Practices to Optimize Cost and Performance). Time-unit partitioning (DAY, MONTH, etc. on a TIMESTAMP/DATE) or integer range partitioning can be chosen based on data. For example, a web logs table partitioned by date might cut a year’s data (365 days) into 365 partitions; a query on one month would only read ~30 partitions (1/12 of data) (BigQuery Best Practices to Optimize Cost and Performance). Partitioning also helps manage data lifecycle (you can set expiration for partitions).
  • Cluster Your Tables: Clustering sorts data on specified columns, which can dramatically speed up filtering and aggregating on those columns (BigQuery Best Practices to Optimize Cost and Performance). For example, if you cluster a table by user_id, all rows with the same user_id are stored close together. A query like WHERE user_id = X will only read a small portion of each partition (only the blocks for that user). Clustering is often used in combination with partitioning: e.g., partition by date and cluster by user_id or product category. This way, BigQuery first prunes partitions by date, then within each partition it can binary search on the clustered column to find relevant data (BigQuery Best Practices to Optimize Cost and Performance). Clustering also helps GROUP BY performance on the clustered columns, because data comes pre-sorted (reducing shuffle).
  • Denormalize Data (Use Nested Fields): BigQuery performs best with fewer large tables rather than many joins. Where appropriate, denormalize your schema – e.g., instead of storing user info in one table and events in another (and joining), you might store user info as repeated/nested fields within the events table. BigQuery’s storage can handle repeated (array) fields efficiently. This avoids join overhead; querying nested data is essentially a local unpacking operation, not a distributed join (BigQuery Best Practices to Optimize Cost and Performance) (Transaction Management in Google Bigquery - Stack Overflow). Joins in BigQuery are still fine for reasonably sized dimensions, but for very large datasets, reducing the number of joins (through nesting or pre-joining data) can improve performance. As a rule: small lookup tables (dimensions) are fine to join, but consider denormalizing big fact tables or using nested structures for one-to-many relations (like an order with an array of items) (BigQuery Best Practices to Optimize Cost and Performance). This trades some storage space for speed, which is usually worthwhile in BigQuery’s cost model (storage is cheap; computation is comparatively expensive) (Transaction Management in Google Bigquery - Stack Overflow).
  • Avoid Excessive JOINs on Huge Tables: If you must join large tables, ensure the join keys are well-partitioned or clustered if possible. In legacy SQL there was a “JOIN EACH” hint required for big joins; in Standard SQL no hint is needed – BigQuery automatically shuffles data for large joins. However, joining two enormous tables will multiply the data that needs to be processed. Wherever possible, filter both sides of the join heavily and only select necessary columns before joining (possibly using subqueries or CTEs for pre-filtering).
  • Use WITH (CTE) or Temp Tables to Break Queries: BigQuery will happily execute extremely complex queries in one go, but sometimes breaking a query into stages can help the optimizer or reduce duplicate work. For example, if you have a subquery that is used multiple times, consider materializing it as a CTE (Common Table Expression) or temp table so it’s computed once and reused. BigQuery will not automatically reuse results from identical subqueries unless you explicitly CTE them. Using CTEs can also improve readability, but note BigQuery does not materialize CTEs by default – they are inlined. If you want to materialize intermediate results (to reduce data scanned in subsequent steps), you might need to write to a temporary table in one step, then query it in the next.
  • Use Materialized Views for Frequent Aggregations: BigQuery supports materialized views which automatically cache the results of a query (typically an aggregation on a table) and incrementally update as the base table changes. If you have a very expensive aggregation that many queries run (like daily totals per category), a materialized view can serve those results quickly and save cost. Queries that can use the materialized view (e.g., they request data that the view pre-computed) will automatically do so. Design the materialized view on the heavy computation part of your data.
  • Consider BI Engine for Dashboards: If you are doing a lot of repetitive queries (especially dashboard queries that hit the same tables with filter variations), Google’s BI Engine can cache data in-memory for super-fast responses. It’s not a query tip per se, but a feature to be aware of for performance tuning in BigQuery when used with tools like Data Studio/Looker.
  • Inspect the Query Execution Plan: BigQuery exposes each query’s execution plan and per-stage statistics (in the console’s “Execution details” tab and via the Jobs API). This can help identify if a query is doing a large shuffle or scan that you didn’t expect – for instance, whether a partition filter is actually being applied or a join is causing a huge data shuffle. The stage timings show where the time and slot resources go, which is useful for performance debugging: if a certain stage takes 90% of the time, focus optimization efforts there.
  • Use TABLESAMPLE for Testing: When developing queries on huge tables, use the TABLESAMPLE SYSTEM() clause to run on a fraction of the data (How to reduce BigQuery costs: 10 effective life hacks for optimization | Lead Panda Media). For example, FROM big_table TABLESAMPLE SYSTEM (1 PERCENT) will read a 1% random sample of the table. This lets you test query logic quickly and cheaply on large data (with the caveat that results are approximate because it’s a sample); see the combined sketch after this list. Once you are satisfied with the query on 1%, you can run it on 100%. This can save a lot of time and cost during development iterations.
  • Approximate Results for Big Data: As mentioned, if you only need an estimate (like approx distinct count, top frequencies, etc.), use BigQuery’s approximate functions. For example, APPROX_QUANTILES can get percentile estimates without sorting all data, and HLL_COUNT.INIT / HLL_COUNT.MERGE functions allow creating your own HyperLogLog distinct counts over streams of data (advanced usage). Using these can cut down CPU time for large-scale analytics (How to reduce BigQuery costs: 10 effective life hacks for optimization | Lead Panda Media).
  • Monitoring and Slot Tuning: BigQuery automatically manages resources, but if you have a critical query or consistent workload, consider reserving slots (with BigQuery Reservations) and using concurrency tuning. For one-off performance, ensure you’re using the default parallelism effectively by not having overly complex single-threaded UDFs or external calls that serialize things.
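
To make a few of these tips concrete, here is a sketch of a development-time query against a hypothetical web_logs table that is partitioned by a DATE column log_date and clustered by country – it selects only the needed columns, filters on the partition and cluster columns, and samples the data while iterating:

SELECT 
  user_id,
  event_type,
  COUNT(*) AS events
FROM `project_id.my_dataset.web_logs` TABLESAMPLE SYSTEM (1 PERCENT)
WHERE log_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31'  -- partition pruning
  AND country = 'US'                                            -- clustered column filter
GROUP BY user_id, event_type;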

In summary, BigQuery performance is about scanning less data and distributing work efficiently. Techniques like partitioning, clustering, filtering, and pre-aggregating data (or using materialized views) directly reduce the amount of data scanned. Others, like using denormalized schemas and caching, reduce the amount of work per query. Following these best practices can lead to massive improvements in query speed and cost – for example, a query that originally scanned 1 TB daily could scan only 100 GB with proper partitioning and clustering, and further drop to 10 GB if you only select needed columns and use a WHERE clause, resulting in faster execution and 1/100th the cost.

Lastly, always test your optimizations. BigQuery’s web UI or CLI will show you how many bytes a query will process before you run it (when you click “Query Validator” or use dry-run flag). Use that as a guide: small changes in the query can sometimes accidentally increase bytes scanned. The goal is to minimize that while still getting correct results. BigQuery’s design (columnar + parallel) will handle large data if you follow these patterns to help it avoid unnecessary scans and shuffles.
