Mastering Transformations in Spark SQL

Tech Academy
51 min read

Table of contents

Spark SQL Transformations

1. Spark SQL vs. PySpark DataFrames: Key Differences

| Feature | Spark SQL | PySpark DataFrames |
| --- | --- | --- |
| Interface | SQL-based (SELECT * FROM table) | Python-based (df.select(), df.filter()) |
| Performance | Optimized for complex queries (BI tools, dashboards) | Faster for iterative processing (ML, transformations) |
| Ease of Use | Easier for SQL users | More flexible for developers (Python, Scala) |
| Integration | Works with Hive, JDBC, BI tools | Works with Python ML libraries (scikit-learn, TensorFlow) |
| Best Use Cases | Ad-hoc queries, reporting, BI dashboards | ETL, Data Science, Machine Learning |

    Logical Flow Diagram: How Spark SQL and PySpark Work

    Spark SQL Workflow

    1️⃣ Load data into a table or DataFrame.
    2️⃣ Run SQL queries on the table.
    3️⃣ Spark converts SQL into an optimized execution plan.
    4️⃣ The query executes in a distributed cluster.

    Example Workflow:

     CREATE TABLE sales (id INT, amount DOUBLE);
     SELECT * FROM sales WHERE amount > 1000;
    

    PySpark DataFrame Workflow

    1️⃣ Load data into a DataFrame.
    2️⃣ Use PySpark functions (select, filter, groupBy).
    3️⃣ Spark optimizes execution using the Catalyst engine.
    4️⃣ Executes in a distributed cluster.

    Example Workflow:

     df = spark.read.csv("sales.csv", header=True)
     df.filter(df.amount > 1000).show()
    

    Both Spark SQL and PySpark DataFrames use the Catalyst optimizer to improve performance.

When to Use Spark SQL & PySpark Together

Many use cases require both Spark SQL & PySpark DataFrames.

Use Case 1: SQL Query with PySpark

Read a CSV file and filter results using Spark SQL:

df = spark.read.csv("sales.csv", header=True)
df.createOrReplaceTempView("sales_table")

high_sales = spark.sql("SELECT * FROM sales_table WHERE amount > 5000")
high_sales.show()

Uses SQL for filtering, then PySpark DataFrame for further processing.

Use Case 2: Writing PySpark DataFrame Results to a SQL Table

Transform a DataFrame and write results as a table:

df_filtered = df.filter(df.amount > 5000)
df_filtered.write.format("parquet").saveAsTable("high_sales")

Uses PySpark transformations but stores the output as a SQL table.

Use Case 3: Real-Time Streaming with Spark SQL + PySpark

Read real-time sales data from Kafka, process with PySpark, and store in a SQL table.

# Kafka source; the broker address below is a placeholder for your environment
df_stream = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "sales_topic") \
    .load()

# Kafka values arrive as bytes; assuming JSON payloads, extract a numeric amount field
df_sales = df_stream.selectExpr(
    "CAST(value AS STRING) AS raw_value",
    "CAST(get_json_object(CAST(value AS STRING), '$.amount') AS DOUBLE) AS amount"
)
df_sales.createOrReplaceTempView("sales_stream")

query = spark.sql("SELECT * FROM sales_stream WHERE amount > 1000")
query.writeStream.format("console").start()

Combines PySpark DataFrames for real-time streaming with SQL queries.

Which One to Use?

| Use Case | Use Spark SQL? | Use PySpark DataFrames? |
| --- | --- | --- |
| Data Exploration | Best choice (simple queries) | ❌ Not needed |
| ETL Pipelines | ⚠️ Can be used | Best choice (Python-based processing) |
| BI Dashboards (Power BI, Tableau) | Best choice | ❌ Not needed |
| Machine Learning (MLlib, TensorFlow) | ❌ Not preferred | Best choice |
| Streaming (Kafka, EventHub) | For SQL queries on stream | For transformations |

Spark SQL is best for SQL-heavy workflows, BI dashboards, and analytics.
PySpark DataFrames are best for ETL, ML, real-time streaming, and complex transformations.

2. Types of Transformations in Spark: Narrow vs. Wide Transformations

Transformations in Apache Spark are operations applied on DataFrames or RDDs to create a new dataset. They are lazy, meaning they do not execute immediately but build a logical execution plan.

There are two main types of transformations:
Narrow Transformations (map, filter, etc.)
Wide Transformations (groupBy, join, aggregate, etc.)


What are Narrow Transformations?

Narrow transformations are operations where each partition’s output depends only on a single input partition.
These operations do not require data shuffling across nodes, making them faster & more efficient.

Examples of Narrow Transformations:
1️⃣ map() → Applies a function to each row
2️⃣ filter() → Removes unwanted rows
3️⃣ flatMap() → Expands one row into multiple rows
4️⃣ drop() → Removes a column

Example: Narrow Transformations in PySpark

df = spark.createDataFrame([(1, 'John', 5000), (2, 'Alice', 7000)], ["id", "name", "salary"])

# Map transformation: Convert salary to INR
df_mapped = df.withColumn("salary_inr", df.salary * 82)

# Filter transformation: Get employees with salary > 6000
df_filtered = df_mapped.filter(df_mapped.salary > 6000)

df_filtered.show()

No data shuffle happens! The transformations apply within the same partition.

What are Wide Transformations?

Wide transformations are operations where data from multiple partitions is needed to compute the result.
These transformations require data shuffling, making them slower but necessary for aggregations.

Examples of Wide Transformations:
1️⃣ groupBy() → Groups data based on a column
2️⃣ join() → Combines data from two DataFrames
3️⃣ orderBy() → Sorts data across partitions
4️⃣ aggregate() → Computes sum, avg, count across partitions

Example: Wide Transformations in PySpark

df_sales = spark.createDataFrame([
    (1, 'John', 'Electronics', 5000),
    (2, 'Alice', 'Clothing', 7000),
    (3, 'Bob', 'Electronics', 6000)
], ["id", "name", "category", "amount"])

# GroupBy transformation: Total sales per category
df_grouped = df_sales.groupBy("category").sum("amount")

df_grouped.show()

Here, Spark needs to shuffle data across nodes to group by category.
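
To see the shuffle that a wide transformation introduces, you can inspect the physical plan; the Exchange operator in the output marks where Spark redistributes data across the cluster. A minimal sketch, reusing the df_sales DataFrame from the example above:

from pyspark.sql import functions as F

# groupBy() is a wide transformation: the plan contains an Exchange (shuffle) step
df_grouped = df_sales.groupBy("category").agg(F.sum("amount").alias("total_amount"))
df_grouped.explain()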

Narrow vs. Wide Transformations: Key Differences

| Feature | Narrow Transformations | Wide Transformations |
| --- | --- | --- |
| Definition | Each output partition depends on only one input partition | Each output partition depends on multiple input partitions |
| Requires Shuffling? | No (Faster) | Yes (Slower) |
| Performance | High performance | Requires network shuffle (Slower) |
| Examples | map(), filter(), flatMap() | groupBy(), join(), orderBy() |

Logical Diagram: Narrow vs. Wide Transformations

Narrow Transformation (map())

Input Partition 1 → [Row1, Row2, Row3]  --(map)-->  Output Partition 1 → [Row1, Row2, Row3]

Each input partition maps directly to one output partition (No Shuffling).


Wide Transformation (groupBy())

Input Partition 1 → [Electronics, Clothing]  
Input Partition 2 → [Electronics, Sports]  
       ⬇ (Shuffle Data) ⬇  
Output Partition 1 → [Electronics]  
Output Partition 2 → [Clothing, Sports]

Data must be shuffled between nodes before aggregation happens.

Summary: Optimizing Transformations in Spark

| Optimization Tip | Why? |
| --- | --- |
| Prefer Narrow Transformations (map, filter) | No shuffle = faster execution |
| Minimize Wide Transformations (join, groupBy) | Reduces network overhead |
| Use broadcast() for small DataFrames before joins | Avoids large shuffle operations (see the sketch below) |
| Use repartition() wisely | Helps optimize partition size before wide transformations |

Use Narrow Transformations when possible for speed.
Use Wide Transformations only when needed for aggregations.
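
As the optimization table above suggests, broadcasting a small DataFrame before a join avoids shuffling the large side. A minimal sketch, assuming a small hypothetical lookup table df_categories that maps each category to a region:

from pyspark.sql.functions import broadcast

# Ship the small lookup table to every executor so df_sales is not shuffled for the join
df_joined = df_sales.join(broadcast(df_categories), on="category")
df_joined.show()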

3. Lazy Execution in PySpark

Creating a DataFrame and Applying Transformations (Lazy Execution)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LazyExecutionExample").getOrCreate()

# Create a DataFrame
data = [(1, "Alice", 5000), (2, "Bob", 7000), (3, "Charlie", 8000)]
columns = ["id", "name", "salary"]

df = spark.createDataFrame(data, columns)

# Apply Transformations (Nothing Executes Yet!)
df_filtered = df.filter(df.salary > 6000)
df_mapped = df_filtered.withColumn("bonus", df_filtered.salary * 0.10)

# Still Lazy Execution - Spark has NOT run anything yet

At this point, Spark has only recorded the transformations but has NOT executed them.

When Does Execution Actually Happen?

Triggering an Action

df_mapped.show()  # Action (Executes the Query)

Now Spark executes all transformations in one go!

Other Actions That Trigger Execution

| Action | Purpose |
| --- | --- |
| .show() | Displays the DataFrame |
| .collect() | Brings data to the driver |
| .count() | Counts rows |
| .write.format("parquet").save("path") | Writes data to storage |

Transformations execute only when an action is called.

Logical Diagram: Lazy Execution Workflow

1️⃣ Transformation 1: Apply filter() → (Queued in Logical Plan)
2️⃣ Transformation 2: Apply map() → (Queued in Logical Plan)
3️⃣ Action: Call .show() → (Executes everything together)

Spark optimizes execution to run all transformations in a single stage.
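
You can confirm that only a plan exists (and nothing has run) by asking Spark for the plan itself; explain() prints the logical and physical plans without triggering a job. A small sketch using the df_mapped DataFrame from the example above:

# Prints the parsed, analyzed, optimized logical plans and the physical plan,
# without running the query (no action is triggered)
df_mapped.explain(extended=True)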

Lazy Execution in Spark

| Feature | Lazy Execution |
| --- | --- |
| Execution Timing | Runs only when an action is triggered |
| Performance | Optimized execution by combining transformations |
| Fault Tolerance | Keeps a logical plan, allowing recovery from failure |
| Example Transformations | map(), filter(), withColumn() (Lazy) |
| Example Actions | .show(), .count(), .collect() (Triggers Execution) |

Lazy Execution makes Spark faster & more efficient!


4. Working with Structured Data: Reading and Writing Parquet, ORC, Avro Files in Spark SQL

Structured data is commonly stored in optimized file formats like Parquet, ORC, and Avro, which help improve query performance and storage efficiency in big data applications.

This guide covers:
What are Parquet, ORC, and Avro?
How to read & write these formats in Spark SQL and PySpark?
Performance considerations & best practices


4.1 Understanding Parquet, ORC, and Avro

| Format | Best For | Advantages | Compression |
| --- | --- | --- | --- |
| Parquet | Analytics & Queries | Columnar storage, schema evolution | High (Snappy, Gzip) |
| ORC | Hive & Spark SQL | Best for Hive-based queries, column pruning | High (Zlib, Snappy) |
| Avro | Data Serialization & Schema Evolution | Row-based, good for streaming | Medium (Deflate, Snappy) |

Parquet and ORC → Best for columnar analytics
Avro → Best for row-based serialization & schema evolution

| Feature | Parquet & ORC (Columnar) | Avro (Row-Based) |
| --- | --- | --- |
| Best For | Large-scale analytics (OLAP) | Streaming & Transactions (OLTP) |
| Storage Layout | Column-wise (stores individual columns separately) | Row-wise (stores full rows together) |
| Speed | Faster for queries on specific columns | Faster for inserting & updating rows |
| Use Case | Data lakes, analytics platforms (BigQuery, Redshift) | Kafka streaming, log storage, messaging systems |

4.2 Reading & Writing Parquet Files in Spark SQL

Parquet is a columnar format designed for fast querying and compression.

A) Reading Parquet Files

Read a single Parquet file

df = spark.read.parquet("s3://data/sales.parquet")
df.show()

Read multiple Parquet files

df = spark.read.parquet("s3://data/parquet_files/*")

Using Spark SQL to Query Parquet Files

SELECT * FROM parquet.`s3://data/sales.parquet` WHERE amount > 5000;

B) Writing Data to Parquet

Save DataFrame as a Parquet File

df.write.parquet("s3://data/output/sales_parquet")

Save as Table (Partitioned)

df.write.partitionBy("year").parquet("s3://data/output/sales_partitioned")

Best Practice: Use partitioning (and, if you are on Delta Lake, Z-Ordering) for faster query execution.
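
Partitioning pays off at read time: when a query filters on the partition column, Spark prunes the directories it scans. A minimal sketch, assuming the partitioned output path written above and a year partition column:

# Only the year=2024 subdirectory is scanned thanks to partition pruning
df_2024 = spark.read.parquet("s3://data/output/sales_partitioned") \
    .filter("year = 2024")
df_2024.show()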


4.3 Reading & Writing ORC Files in Spark SQL

ORC (Optimized Row Columnar) is optimized for Hive and often achieves compression comparable to or better than Parquet.

Reading ORC Files

df = spark.read.orc("s3://data/sales.orc")
df.show()

Using Spark SQL to Query ORC Files

SELECT * FROM orc.`s3://data/sales.orc` WHERE category = 'Electronics';

Writing Data to ORC

Save DataFrame as ORC File

df.write.orc("s3://data/output/sales_orc")

Save with Compression

df.write.option("compression", "zlib").orc("s3://data/output/sales_compressed_orc")

ORC is ideal for Hive-based querying and columnar storage.


4.4 Reading & Writing Avro Files in Spark SQL

Avro is row-based and best for schema evolution and data serialization.

Reading Avro Files

df = spark.read.format("avro").load("s3://data/sales.avro")
df.show()

Using Spark SQL to Query Avro Files

SELECT * FROM avro.`s3://data/sales.avro` WHERE customer_id = 1234;

Writing Data to Avro

Save DataFrame as Avro File

df.write.format("avro").save("s3://data/output/sales_avro")

Save with Schema Evolution Support

df.write.option("avroSchema", "path/to/schema.avsc").format("avro").save("s3://data/output/sales_avro_v2")

Avro is best for schema evolution and real-time processing.
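
Note that Avro support ships as a separate module (spark-avro), so it must be on the classpath before format("avro") works. A sketch of one way to pull it in when building the session; the artifact version below is an assumption and should match your Spark and Scala build:

from pyspark.sql import SparkSession

# spark-avro is an external module; add it via spark.jars.packages
# (adjust the coordinate to your Spark/Scala version)
spark = SparkSession.builder \
    .appName("AvroExample") \
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.5.0") \
    .getOrCreate()

df = spark.read.format("avro").load("s3://data/sales.avro")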

Performance Considerations: Choosing the Right Format

| Use Case | Best Format |
| --- | --- |
| Big Data Analytics (Columnar Queries) | Parquet, ORC |
| Streaming & Schema Evolution | Avro |
| Hive-Based Processing | ORC |
| Data Lake Storage | Parquet |

Use Parquet for analytics (better query speed & compression).
Use ORC for Hive-based workloads (optimized columnar storage).
Use Avro for schema evolution (best for streaming & Kafka integration).


Summary: How to Work with Parquet, ORC, Avro in Spark SQL

| Operation | Parquet | ORC | Avro |
| --- | --- | --- | --- |
| Read File | spark.read.parquet("path") | spark.read.orc("path") | spark.read.format("avro").load("path") |
| Write File | df.write.parquet("path") | df.write.orc("path") | df.write.format("avro").save("path") |
| Best Use Case | Data Lakes, Analytics | Hive, Columnar Processing | Streaming, Schema Evolution |

5. Aggregations in Spark SQL

COUNT()

Returns the number of rows in a dataset.

    • Example:

         SELECT COUNT(*) FROM sales;
      

      Tricky SQL Concepts with COUNT(*)

| Concept | Explanation |
| --- | --- |
| COUNT(*) vs COUNT(column) | COUNT(*) includes NULLs, COUNT(column) ignores NULLs |
| COUNT(DISTINCT column) | Counts only unique non-null values |
| COUNT(1) vs. COUNT(*) | No difference in modern databases |
| HAVING COUNT(*) | Filters groups based on count conditions |
| COUNT(*) in JOINs | Joins may create duplicate rows, inflating the count |
| COUNT(column_name) in LEFT JOIN | NULLs are ignored, leading to possible 0 counts |
| COUNT(*) OVER() | Running total count using window functions |
| COUNT(*) in UNION vs UNION ALL | UNION removes duplicates, UNION ALL keeps them |
| COUNT(*) FILTER (WHERE condition) | Conditional counting within aggregations |
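
A small sketch contrasting the first three behaviors on the sales_table temp view registered earlier; it assumes the table has a customer_id column that may contain NULLs:

# COUNT(*) counts every row; COUNT(col) skips NULLs; COUNT(DISTINCT col) skips NULLs and duplicates
spark.sql("""
    SELECT COUNT(*)                    AS all_rows,
           COUNT(customer_id)          AS non_null_customers,
           COUNT(DISTINCT customer_id) AS unique_customers
    FROM sales_table
""").show()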

SUM()

    • Computes the sum of a numeric column.

      • Examples:

         SELECT SUM(amount) FROM sales;
        

        More tricky scenarios:

        1. What Does SUM(amount) Really Do?

        • Adds up all values in the amount column.

        • Ignores NULL values automatically.

        • Returns NULL if no rows exist or all values are NULL.

Example:

        SELECT SUM(amount) FROM sales;
| sale_id | amount |
| --- | --- |
| 1 | 100 |
| 2 | 200 |
| 3 | NULL |
| 4 | 300 |

SUM(amount) = 100 + 200 + 300 = 600 (NULL is ignored).
If all amount values were NULL, SUM(amount) would return NULL.


2. SUM(amount) vs. COALESCE(SUM(amount), 0)

✅ If SUM(amount) returns NULL, it may cause issues in calculations.
Using COALESCE() ensures it returns 0 instead of NULL.

Example:

        SELECT COALESCE(SUM(amount), 0) AS total_amount FROM sales;
  • If all rows contain NULL values, it will return 0 instead of NULL.

3. SUM(DISTINCT amount) – Summing Unique Values

Using DISTINCT ensures duplicate values are summed only once.

Example:

        SELECT SUM(DISTINCT amount) FROM sales;
| sale_id | amount |
| --- | --- |
| 1 | 100 |
| 2 | 200 |
| 3 | 100 |
| 4 | 300 |

SUM(amount) = 100 + 200 + 100 + 300 = 700
SUM(DISTINCT amount) = 100 + 200 + 300 = 600 (duplicates ignored).


4. SUM() with GROUP BY

Grouping results before summing them can produce unexpected results.

Example:

        SELECT category, SUM(amount) FROM sales GROUP BY category;
| category | amount |
| --- | --- |
| Electronics | 200 |
| Clothing | 150 |
| Electronics | 300 |

GROUP BY groups data before summing:

| category | SUM(amount) |
| --- | --- |
| Electronics | 200 + 300 = 500 |
| Clothing | 150 |

5. SUM() with HAVING Clause

Used to filter aggregated results.

Example:

        SELECT category, SUM(amount) AS total_sales
        FROM sales
        GROUP BY category
        HAVING SUM(amount) > 500;
  • Filters only categories where total sales exceed 500.

6. SUM() in Joins – Unexpected Duplicates

If a JOIN causes duplicate rows, SUM() may return inflated values.

Example:

        SELECT c.customer_id, SUM(s.amount)
        FROM customers c
        JOIN sales s ON c.customer_id = s.customer_id
        GROUP BY c.customer_id;
  • If a customer has multiple rows due to the JOIN, SUM() may double-count sales.

Fix: Use SUM(DISTINCT s.amount) if needed.


7. SUM() in LEFT JOIN – Handling NULLs

✅ If there are NULL values due to LEFT JOIN, SUM() may return NULL.

Example:

        SELECT c.customer_id, SUM(s.amount)
        FROM customers c
        LEFT JOIN sales s ON c.customer_id = s.customer_id
        GROUP BY c.customer_id;
| customer_id | amount |
| --- | --- |
| 1 | 500 |
| 2 | NULL |

SUM(amount) ignores NULLs, but customer 2 gets NULL.
Fix: Use COALESCE(SUM(amount), 0) AS total_amount to return 0 instead of NULL.


8. SUM() with CASE – Conditional Summing

Summing only specific values within a column.

Example:

        SELECT 
            SUM(CASE WHEN category = 'Electronics' THEN amount ELSE 0 END) AS electronics_sales,
            SUM(CASE WHEN category = 'Clothing' THEN amount ELSE 0 END) AS clothing_sales
        FROM sales;
  • Computes total sales separately for each category.

9. SUM() with FILTER() – Alternative to CASE

A cleaner way to apply conditional aggregation.

Example:

        SELECT 
            SUM(amount) FILTER (WHERE category = 'Electronics') AS electronics_sales,
            SUM(amount) FILTER (WHERE category = 'Clothing') AS clothing_sales
        FROM sales;
  • Same as CASE method but more concise.

10. SUM() in Window Functions – Running Total

Using OVER() for cumulative summing.

Example:

        SELECT customer_id, order_date, 
               SUM(amount) OVER(PARTITION BY customer_id ORDER BY order_date) AS running_total
        FROM sales;
| customer_id | order_date | amount | running_total |
| --- | --- | --- | --- |
| 1 | 2024-01-01 | 100 | 100 |
| 1 | 2024-01-02 | 200 | 300 |
| 1 | 2024-01-03 | 150 | 450 |

Computes a running total of sales per customer.


11. SUM() with UNION vs. UNION ALL

UNION removes duplicates, affecting SUM().
UNION ALL keeps duplicates, increasing SUM().

Example:

        SELECT SUM(amount) FROM (
            SELECT amount FROM sales
            UNION
            SELECT amount FROM refunds
        ) AS combined_data;
  • Removes duplicate amounts before summing.

  • If we use UNION ALL, duplicates remain.


12. SUM() with Negative Values

Handles both positive and negative numbers.

Example:

        SELECT SUM(amount) FROM transactions;
| transaction_id | amount |
| --- | --- |
| 1 | 500 |
| 2 | -200 |
| 3 | 300 |

SUM(amount) = 500 + (-200) + 300 = 600
Negative values affect totals, useful in accounting and refunds.


Tricky SQL Concepts with SUM(amount)

| Concept | Explanation |
| --- | --- |
| SUM(amount) ignores NULLs | NULL values are excluded from summation |
| SUM(DISTINCT amount) | Sums only unique values |
| SUM() with GROUP BY | Computes totals for each group |
| SUM() with HAVING | Filters aggregated results |
| SUM() in JOIN issues | Joins may duplicate rows, inflating totals |
| SUM() in LEFT JOIN | NULLs may lead to unexpected results |
| SUM() with CASE | Conditional summing |
| SUM() with FILTER() | A cleaner conditional sum approach |
| SUM() with OVER() | Running total calculation |
| SUM() in UNION vs. UNION ALL | UNION removes duplicates, affecting the sum |
| SUM() with Negative Values | Important for transactions, refunds |
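
For completeness, the same conditional-sum patterns are available through the DataFrame API; a minimal sketch using when(), with column names following the sales examples above:

from pyspark.sql import functions as F

# DataFrame-API equivalent of SUM(CASE WHEN ...) / SUM(...) FILTER (WHERE ...)
df_sales.agg(
    F.sum(F.when(F.col("category") == "Electronics", F.col("amount")).otherwise(0)).alias("electronics_sales"),
    F.sum(F.when(F.col("category") == "Clothing", F.col("amount")).otherwise(0)).alias("clothing_sales")
).show()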

AVG()

    • Returns the average (mean) value of a numeric column.

      • Example:

         SELECT AVG(price) FROM products;
        

        Tricky SQL Concepts with AVG(price)

| Concept | Explanation |
| --- | --- |
| AVG(price) ignores NULLs | NULL values are excluded from the calculation |
| AVG(DISTINCT price) | Averages only unique values |
| Integer Division Issue | Convert to FLOAT or DECIMAL for correct results |
| AVG() with GROUP BY | Computes averages for each group |
| AVG() with HAVING | Filters groups based on the average |
| AVG() in Joins | Joins may create duplicate rows, affecting averages |
| AVG() in LEFT JOIN | NULLs may lead to unexpected results |
| AVG() with CASE | Conditional averaging |
| AVG() with FILTER() | A cleaner conditional average approach |
| AVG() with OVER() | Moving average calculation |
| AVG() in UNION vs. UNION ALL | UNION removes duplicates, affecting averages |
| AVG() with Negative Values | Important for transactions, refunds |

MIN()

Finds the minimum value in a column.

Example:

SELECT MIN(salary) FROM employees;

MAX()

Finds the maximum value in a column.

Example:

SELECT MAX(salary) FROM employees;

COUNT(DISTINCT column_name)

Returns the count of distinct values in a column.

Example:

SELECT COUNT(DISTINCT category) FROM products;

GROUP BY with Aggregations

    • Used to apply aggregate functions on grouped data.

      • Example:

         SELECT department, AVG(salary) FROM employees GROUP BY department;
        

        AVG(salary) GROUP BY department

| Concept | Explanation |
| --- | --- |
| AVG(salary) ignores NULLs | NULL values are excluded from the calculation |
| AVG(DISTINCT salary) | Averages only unique values |
| Integer Division Issue | Convert to FLOAT or DECIMAL for correct results |
| AVG() with GROUP BY | Computes averages for each department |
| AVG() with HAVING | Filters groups based on the average |
| AVG() in Joins | Joins may create duplicate rows, affecting averages |
| AVG() in LEFT JOIN | NULLs may lead to unexpected results |
| AVG() with CASE | Conditional averaging |
| AVG() with FILTER() | A cleaner conditional average approach |
| AVG() with OVER() | Finds average salary per department without GROUP BY |
| AVG() in UNION vs. UNION ALL | UNION removes duplicates, affecting averages |
| AVG() with Negative Values | Important for finance, refunds |

HAVING with Aggregations

    • Example:

         SELECT department, SUM(salary) 
         FROM employees 
         GROUP BY department 
         HAVING SUM(salary) > 50000;
      

      1. How HAVING Works in SUM(salary) > 50000

         SELECT department, SUM(salary) 
         FROM employees 
         GROUP BY department 
         HAVING SUM(salary) > 50000;
      
      • Groups data by department.

      • Computes SUM(salary) for each department.

      • Filters departments where the total salary exceeds 50000.

Example Data:

| department | salary |
| --- | --- |
| HR | 30000 |
| HR | 25000 |
| IT | 40000 |
| IT | 10000 |
| Finance | 20000 |

Query Execution:

| department | SUM(salary) |
| --- | --- |
| HR | 30000 + 25000 = 55000 ✅ (kept) |
| IT | 40000 + 10000 = 50000 ❌ (removed) |
| Finance | 20000 ❌ (removed) |

2. HAVING vs. WHERE – Key Difference

WHERE filters rows before aggregation.
HAVING filters after aggregation.

Incorrect Use of WHERE (Causes Error)

        SELECT department, SUM(salary) 
        FROM employees 
        WHERE SUM(salary) > 50000  -- ❌ INVALID
        GROUP BY department;
  • Error: SUM(salary) is an aggregate function, and WHERE cannot use aggregate functions.

Correct Use with HAVING

        SELECT department, SUM(salary) 
        FROM employees 
        GROUP BY department 
        HAVING SUM(salary) > 50000;

3. HAVING with COUNT() – Filtering Groups by Count

Find departments with more than 2 employees.

        SELECT department, COUNT(*) 
        FROM employees 
        GROUP BY department 
        HAVING COUNT(*) > 2;
  • Counts employees per department.

  • Filters only departments with more than 2 employees.


4. HAVING with Multiple Conditions

Find departments where total salary is over 50000 AND average salary is above 25000.

        SELECT department, SUM(salary), AVG(salary)
        FROM employees 
        GROUP BY department 
        HAVING SUM(salary) > 50000 AND AVG(salary) > 25000;
  • Filters on both SUM(salary) and AVG(salary).

5. HAVING with DISTINCT – Counting Unique Values

Find departments where unique salaries exceed 2.

        SELECT department, COUNT(DISTINCT salary)
        FROM employees 
        GROUP BY department 
        HAVING COUNT(DISTINCT salary) > 2;
  • Counts unique salary values per department.

  • Filters departments with more than 2 unique salary values.


6. HAVING with CASE – Conditional Aggregation

Find departments where Managers earn over 40000 in total.

        SELECT department, SUM(CASE WHEN job_title = 'Manager' THEN salary ELSE 0 END) AS manager_salary
        FROM employees 
        GROUP BY department 
        HAVING SUM(CASE WHEN job_title = 'Manager' THEN salary ELSE 0 END) > 40000;
  • Filters only departments where total salary for Managers exceeds 40000.

7. HAVING with FILTER() – Alternative to CASE

Find departments where total Manager salary exceeds 40000 (simplified).

        SELECT department, SUM(salary) FILTER (WHERE job_title = 'Manager') AS manager_salary
        FROM employees 
        GROUP BY department 
        HAVING SUM(salary) FILTER (WHERE job_title = 'Manager') > 40000;
  • Same as CASE but more readable.

8. HAVING with ORDER BY – Sorting Aggregated Data

Find departments with total salary > 50000 and sort by total salary.

        SELECT department, SUM(salary) AS total_salary
        FROM employees 
        GROUP BY department 
        HAVING SUM(salary) > 50000
        ORDER BY total_salary DESC;
  • Sorts only departments that meet the HAVING condition.

9. HAVING with JOIN – Filtering After Aggregation

Find customers who spent over $1000 across all orders.

        SELECT c.customer_id, SUM(o.amount) AS total_spent
        FROM customers c
        JOIN orders o ON c.customer_id = o.customer_id
        GROUP BY c.customer_id
        HAVING SUM(o.amount) > 1000;
  • Filters only customers who spent more than $1000.

10. HAVING in Window Functions – Why It Doesn’t Work

HAVING cannot be used in OVER() window functions.

        SELECT employee_id, department, salary, 
               SUM(salary) OVER(PARTITION BY department) AS dept_salary
        FROM employees
        HAVING dept_salary > 50000;  -- ❌ INVALID
  • Error: HAVING works on grouped data, not windowed data.

  • Fix: Use a subquery or WHERE instead.

Correct Approach:

        SELECT * FROM (
            SELECT employee_id, department, salary, 
                   SUM(salary) OVER(PARTITION BY department) AS dept_salary
            FROM employees
        ) subquery
        WHERE dept_salary > 50000;
  • Filters after window function execution.

Summary of Tricky HAVING Scenarios

| Scenario | Explanation |
| --- | --- |
| HAVING vs. WHERE | WHERE filters before aggregation, HAVING filters after |
| HAVING with COUNT() | Filters groups based on row counts |
| HAVING with multiple conditions | Combines filters on multiple aggregate functions |
| HAVING with DISTINCT | Counts only unique values in groups |
| HAVING with CASE | Filters based on conditional aggregation |
| HAVING with FILTER() | A cleaner alternative to CASE |
| HAVING with ORDER BY | Sorts aggregated results after filtering |
| HAVING in Joins | Filters after aggregation in JOIN queries |
| HAVING with Window Functions | Does not work; use subqueries instead |
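
In the DataFrame API there is no HAVING keyword; the same effect comes from filtering after the aggregation. A minimal sketch, assuming a hypothetical df_employees DataFrame with department and salary columns:

from pyspark.sql import functions as F

# GROUP BY department HAVING SUM(salary) > 50000, expressed as agg() followed by filter()
df_employees.groupBy("department") \
    .agg(F.sum("salary").alias("total_salary")) \
    .filter(F.col("total_salary") > 50000) \
    .show()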

VARIANCE() / VAR_POP()

    • Computes the variance of a column.

      • Example:

         SELECT VARIANCE(salary) FROM employees;
        

Alternative Functions in Spark SQL

  • STDDEV(salary) – Returns the standard deviation (square root of variance).

  • VAR_SAMP(salary) – Computes sample variance instead of population variance.

  • VAR_POP(salary) – Computes population variance.

Tricky SQL Concepts with VARIANCE(salary)

| Scenario | Explanation |
| --- | --- |
| VARIANCE() ignores NULLs | NULL values are excluded from calculation |
| VAR_POP() vs. VAR_SAMP() | Population variance divides by n, sample variance by n - 1 |
| Integer Division Issue | Convert to FLOAT or DECIMAL for correct results |
| VARIANCE() with GROUP BY | Computes variance per department |
| VARIANCE() with HAVING | Filters groups based on variance |
| VARIANCE() in Joins | Duplicates may inflate variance, use DISTINCT |
| VARIANCE() with CASE | Conditional variance calculation |
| VARIANCE() with FILTER() | A cleaner conditional variance approach |
| VARIANCE() with OVER() | Running variance calculation |
| VARIANCE() in UNION vs. UNION ALL | UNION removes duplicates, affecting variance |

STDDEV() / STDDEV_POP()

Computes the standard deviation.

Example:

SELECT STDDEV(salary) FROM employees;

FIRST() / LAST()

Retrieves the first or last value in a group.

Example:

SELECT department, FIRST(employee_name) FROM employees GROUP BY department;
  • Groups data by department.

  • Retrieves the first employee_name in each department.

  • The "first" value depends on how the database processes grouped data.

    Example Data:

| employee_id | department | employee_name |
| --- | --- | --- |
| 1 | HR | Alice |
| 2 | HR | Bob |
| 3 | IT | Charlie |
| 4 | IT | David |

    Expected Result:

| department | FIRST(employee_name) |
| --- | --- |
| HR | Alice |
| IT | Charlie |

COLLECT_SET()

Returns a set of unique values as an array.

SELECT department, COLLECT_SET(employee_name) FROM employees GROUP BY department;

Purpose

  • COLLECT_SET(employee_name) returns a unique list (set) of employee names within each department.

  • It removes duplicates and returns an array of unique values.

How It Works

  1. GROUP BY department → Groups employees by department.

  2. COLLECT_SET(employee_name) → Collects unique employee names in each department into an array.

Example Dataset: employees Table

| employee_name | department |
| --- | --- |
| Alice | IT |
| Bob | HR |
| Charlie | IT |
| Alice | IT |
| David | Finance |
| Emma | HR |

Query Output

| department | COLLECT_SET(employee_name) |
| --- | --- |
| IT | ["Alice", "Charlie"] |
| HR | ["Bob", "Emma"] |
| Finance | ["David"] |
  • Duplicates are removed: "Alice" appears twice in IT but is stored only once.

Alternative Function

  • COLLECT_LIST(employee_name) → Collects all employee names as a list including duplicates.

COLLECT_LIST()

Returns all values as an array (including duplicates).

Example:

SELECT department, COLLECT_LIST(employee_name) FROM employees GROUP BY department;

Purpose

  • COLLECT_LIST(employee_name) aggregates all employee names into a list for each department.

  • Unlike COLLECT_SET(), it does not remove duplicates and preserves the order of occurrence.

How It Works

  1. GROUP BY department → Groups employees by department.

  2. COLLECT_LIST(employee_name) → Collects all employee names into an array with duplicates and order preserved.

Example Dataset: employees Table

| employee_name | department |
| --- | --- |
| Alice | IT |
| Bob | HR |
| Charlie | IT |
| Alice | IT |
| David | Finance |
| Emma | HR |

Query Output

| department | COLLECT_LIST(employee_name) |
| --- | --- |
| IT | ["Alice", "Charlie", "Alice"] |
| HR | ["Bob", "Emma"] |
| Finance | ["David"] |
  • Duplicates are included: "Alice" appears twice in IT and is not removed.

Difference Between COLLECT_SET() and COLLECT_LIST()

| Function | Removes Duplicates | Maintains Order |
| --- | --- | --- |
| COLLECT_SET() | Yes | No |
| COLLECT_LIST() | No | Yes |
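
The same functions exist in the DataFrame API; a short sketch assuming a hypothetical df_employees DataFrame with department and employee_name columns:

from pyspark.sql import functions as F

# collect_set() drops duplicates, collect_list() keeps them
df_employees.groupBy("department").agg(
    F.collect_set("employee_name").alias("unique_names"),
    F.collect_list("employee_name").alias("all_names")
).show(truncate=False)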

PERCENTILE() / APPROX_PERCENTILE()

Computes percentile values.

Example:

SELECT APPROX_PERCENTILE(salary, 0.5) FROM employees;

Purpose

  • The function APPROX_PERCENTILE(column, percentile) computes the approximate percentile value for a given column.

  • 0.5 represents the 50th percentile (median salary).

  • It is faster than exact percentile calculations and is useful for large datasets.


How It Works

  1. APPROX_PERCENTILE(salary, 0.5)

    • Estimates the median salary (50th percentile).

    • It does not scan the full dataset, making it efficient for big data.

  2. Alternative Percentile Values:

    • 0.25 → 25th percentile (lower quartile).

    • 0.75 → 75th percentile (upper quartile).

    • [0.25, 0.5, 0.75] → Returns multiple percentiles.


Example Dataset

| Employee | Salary |
| --- | --- |
| A | 3000 |
| B | 5000 |
| C | 6000 |
| D | 7000 |
| E | 10000 |
  • Sorted Salaries: [3000, 5000, 6000, 7000, 10000]

  • 50th Percentile (Median Salary): 6000

  • Query Output: 6000


Alternative Function

  • PERCENTILE_APPROX(salary, 0.5, 1000)

    • More accurate percentile approximation.

    • The third argument (1000) is the number of samples (higher is more accurate but slower).
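
The DataFrame API exposes the same computation; a sketch assuming PySpark 3.1+ (where percentile_approx is available in pyspark.sql.functions) and a hypothetical df_employees DataFrame with a salary column:

from pyspark.sql import functions as F

# Approximate median and quartiles of salary; the last argument is the accuracy parameter
df_employees.agg(
    F.percentile_approx("salary", 0.5, 1000).alias("median_salary"),
    F.percentile_approx("salary", [0.25, 0.5, 0.75], 1000).alias("quartiles")
).show()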

MODE() (Most Frequent Value - Custom Implementation)

Spark SQL doesn’t have a direct MODE() function, but it can be calculated using:

SELECT value, COUNT(*) AS frequency 
FROM dataset 
GROUP BY value 
ORDER BY frequency DESC 
LIMIT 1;

RANK() – Assigns ranks with gaps.

SELECT name, salary, RANK() OVER (ORDER BY salary DESC) FROM employees;

DENSE_RANK() – Assigns ranks without gaps.

SELECT name, salary, DENSE_RANK() OVER (ORDER BY salary DESC) FROM employees;

ROW_NUMBER() – Assigns unique row numbers.

SELECT name, salary, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) FROM employees;

NTILE(n) – Divides data into n equal parts.

SELECT name, salary, NTILE(4) OVER (ORDER BY salary DESC) FROM employees;
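
These ranking functions are also available in the DataFrame API through a window specification; a minimal sketch assuming a hypothetical df_employees DataFrame with name, department, and salary columns:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank employees by salary within each department
w = Window.partitionBy("department").orderBy(F.col("salary").desc())

df_employees.select(
    "name", "department", "salary",
    F.rank().over(w).alias("rank"),
    F.dense_rank().over(w).alias("dense_rank"),
    F.row_number().over(w).alias("row_number"),
    F.ntile(4).over(w).alias("quartile")
).show()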

Statistical Aggregations

CORR(col1, col2) – Correlation between two columns.

SELECT CORR(salary, age) FROM employees;

Purpose

  • The function CORR(column1, column2) calculates the Pearson correlation coefficient between two numerical columns.

  • It measures the linear relationship between salary and age.

  • The result is a value between -1 and 1:

    • 1 → Perfect positive correlation (as age increases, salary increases).

    • -1 → Perfect negative correlation (as age increases, salary decreases).

    • 0 → No correlation (no relationship between age and salary).

COVAR_POP(col1, col2) – Population covariance.

SELECT COVAR_POP(salary, age) FROM employees;

Purpose

  • COVAR_POP(column1, column2) calculates the population covariance between salary and age.

  • Covariance measures the direction of the relationship between two numeric columns but not the strength (unlike correlation).

  • The result can be:

    • Positive → When age increases, salary tends to increase.

    • Negative → When age increases, salary tends to decrease.

    • Zero or close to zero → No significant relationship.

Difference Between COVAR_POP() and COVAR_SAMP()

| Function | Formula | Use Case |
| --- | --- | --- |
| COVAR_POP(x, y) | Uses population formula (N in denominator) | When working with full data (entire population) |
| COVAR_SAMP(x, y) | Uses sample formula (N-1 in denominator) | When working with a sample (not full population) |

COVAR_SAMP(col1, col2) – Sample covariance.

SELECT COVAR_SAMP(salary, age) FROM employees;

Purpose

  • COVAR_SAMP(column1, column2) calculates the sample covariance between salary and age.

  • It measures how salary and age vary together in a sample (not the entire population).

  • Unlike COVAR_POP(), this function divides by (N-1) instead of N, making it more suitable for sample data.

Difference Between COVAR_SAMP() and COVAR_POP()

| Function | Formula | Use Case |
| --- | --- | --- |
| COVAR_POP(x, y) | Uses N in the denominator | When working with full population data |
| COVAR_SAMP(x, y) | Uses N-1 in the denominator | When working with sample data |

Array-Based Aggregations

ARRAY_AGG(column) – Aggregates data into an array.

SELECT department, ARRAY_AGG(employee_name) FROM employees GROUP BY department;

Purpose

  • The ARRAY_AGG(column) function aggregates values into an array for each group.

  • It preserves duplicates and maintains the order of appearance in each department.

  • The result is a single array per department, containing all employee names.

EXPLODE(array_column) – Converts array elements into separate rows.

SELECT EXPLODE(array_column) FROM table;

How It Works

  1. GROUP BY department → Groups employees by department.

  2. ARRAY_AGG(employee_name) → Collects all employee names in each department into an array.


Example Dataset: employees Table

| employee_name | department |
| --- | --- |
| Alice | IT |
| Bob | HR |
| Charlie | IT |
| Alice | IT |
| David | Finance |
| Emma | HR |

Query Output

| department | ARRAY_AGG(employee_name) |
| --- | --- |
| IT | ["Alice", "Charlie", "Alice"] |
| HR | ["Bob", "Emma"] |
| Finance | ["David"] |
  • Duplicates are included: "Alice" appears twice in IT and is not removed.

  • Order is preserved.


Difference Between ARRAY_AGG(), COLLECT_LIST(), and COLLECT_SET()

| Function | Removes Duplicates? | Maintains Order? |
| --- | --- | --- |
| ARRAY_AGG() | No | Yes |
| COLLECT_LIST() | No | Yes |
| COLLECT_SET() | Yes | No |
  • ARRAY_AGG() behaves similarly to COLLECT_LIST() but is more SQL-standard.

Alternative Approaches

  • COLLECT_LIST(employee_name) → Similar to ARRAY_AGG(), collects values into a list.

  • COLLECT_SET(employee_name) → Removes duplicates while collecting values.

Advanced Aggregations

Grouping Sets and Rollup

GROUPING SETS() – Computes multiple groupings in a single query.

SELECT department, job_role, SUM(salary) 
FROM employees 
GROUP BY GROUPING SETS ((department), (job_role), (department, job_role));

Purpose

  • GROUPING SETS allows multiple grouping combinations in a single query.

  • It is more flexible than GROUP BY, producing subtotal and grand total rows.

  • Each tuple in GROUPING SETS defines a specific aggregation level.


How It Works

GROUPING SETS ((department), (job_role), (department, job_role))

(department) → Groups only by department (ignores job role).

(job_role) → Groups only by job role (ignores department).

(department, job_role) → Groups by both department and job role (normal GROUP BY).

This produces multiple aggregation levels.

Example Dataset: employees Table

| employee_name | department | job_role | salary |
| --- | --- | --- | --- |
| Alice | IT | Developer | 5000 |
| Bob | HR | Manager | 7000 |
| Charlie | IT | Developer | 6000 |
| David | IT | Analyst | 8000 |
| Emma | HR | Analyst | 6000 |

Query Output

| department | job_role | SUM(salary) |
| --- | --- | --- |
| IT | NULL | 19000 |
| HR | NULL | 13000 |
| NULL | Developer | 11000 |
| NULL | Manager | 7000 |
| NULL | Analyst | 14000 |
| IT | Developer | 11000 |
| IT | Analyst | 8000 |
| HR | Manager | 7000 |
| HR | Analyst | 6000 |

Understanding NULL Values in Output

  • When job_role is NULL, it means the row represents a department-level total.

  • When department is NULL, it means the row represents a job role-level total.

Tricky Edge Cases

  1. Using GROUPING_ID() to Identify Aggregate Levels

     SELECT department, job_role, SUM(salary), GROUPING_ID(department, job_role) 
     FROM employees 
     GROUP BY GROUPING SETS ((department), (job_role), (department, job_role));
    
    • This adds a column (GROUPING_ID) that identifies the level of aggregation.
  2. Adding a Grand Total (Full Aggregation)

     SELECT department, job_role, SUM(salary) 
     FROM employees 
     GROUP BY GROUPING SETS ((department), (job_role), (department, job_role), ());
    
    • Adding () at the end includes a grand total row.

Alternative Approaches

| Approach | Use Case |
| --- | --- |
| GROUPING SETS() | Selective groupings (flexible subtotals) |
| ROLLUP() | Hierarchical aggregation (e.g., department → job role → total) |
| CUBE() | All possible group combinations |

CUBE() – Aggregates across all possible combinations of specified columns.

SELECT department, job_role, SUM(salary) 
FROM employees 
GROUP BY CUBE(department, job_role);

Purpose

  • The CUBE() function computes all possible combinations of groupings.

  • It generates totals (aggregates) at multiple levels, including:

    • Individual department totals

    • Individual job_role totals

    • Combinations of both

    • Grand total (sum of all salaries)


How It Works

  • CUBE(department, job_role) creates a power set of grouping columns, meaning:

    1. (department, job_role) → Normal GROUP BY on both columns.

    2. (department, NULL) → Groups only by department (ignores job_role).

    3. (NULL, job_role) → Groups only by job_role (ignores department).

    4. (NULL, NULL) → Grand total (sum of all salaries).


Example Dataset: employees Table

| employee_name | department | job_role | salary |
| --- | --- | --- | --- |
| Alice | IT | Developer | 5000 |
| Bob | HR | Manager | 7000 |
| Charlie | IT | Developer | 6000 |
| David | IT | Analyst | 8000 |
| Emma | HR | Analyst | 6000 |

Query Output (CUBE(department, job_role))

| department | job_role | SUM(salary) |
| --- | --- | --- |
| IT | Developer | 11000 |
| IT | Analyst | 8000 |
| HR | Manager | 7000 |
| HR | Analyst | 6000 |
| IT | NULL | 19000 |
| HR | NULL | 13000 |
| NULL | Developer | 11000 |
| NULL | Manager | 7000 |
| NULL | Analyst | 14000 |
| NULL | NULL | 32000 |

Difference Between CUBE(), ROLLUP(), and GROUPING SETS()

| Function | Generates Grand Total? | Generates Subtotals? | Best Use Case |
| --- | --- | --- | --- |
| GROUPING SETS() | No (unless manually included) | Custom groupings | Selective subtotals |
| ROLLUP(department, job_role) | Yes | Hierarchical subtotals (department → job_role → total) | Hierarchical aggregation |
| CUBE(department, job_role) | Yes | All possible group combinations | All possible aggregations |

Tricky Edge Cases

  1. Using GROUPING_ID() to Identify Aggregation Level

     SELECT department, job_role, SUM(salary), GROUPING_ID(department, job_role) 
     FROM employees 
     GROUP BY CUBE(department, job_role);
    
    • This helps differentiate full totals, partial totals, and raw values.
  2. Filtering Out Grand Totals

     SELECT department, job_role, SUM(salary) 
     FROM employees 
     GROUP BY CUBE(department, job_role)
     HAVING department IS NOT NULL OR job_role IS NOT NULL;
    

    Removes the (NULL, NULL) grand total row.


ROLLUP() – Computes hierarchical aggregations.

SELECT department, job_role, SUM(salary) 
FROM employees 
GROUP BY ROLLUP(department, job_role);

Purpose

  • The ROLLUP() function is used to generate hierarchical aggregations.

  • It computes subtotals at different levels, following a top-down hierarchy.

  • Unlike CUBE(), ROLLUP() does not generate all possible combinations—it follows a hierarchy from left to right.


How It Works

  • ROLLUP(department, job_role) generates:

    1. (department, job_role) → Normal GROUP BY (Department + Job Role).

    2. (department, NULL) → Groups by Department only (Subtotal per department).

    3. (NULL, NULL) → Grand total (Sum of all salaries).

Hierarchy Breakdown

┌──────────────┐
│ Grand Total  │
├──────────────┤
│ Department 1 │
│   ├─ Job A   │
│   ├─ Job B   │
├──────────────┤
│ Department 2 │
│   ├─ Job C   │
│   ├─ Job D   │
└──────────────┘
  • No (NULL, job_role) rows (Unlike CUBE(), it doesn't group by job roles alone).

Example Dataset

| employee_name | department | job_role | salary |
| --- | --- | --- | --- |
| Alice | IT | Developer | 5000 |
| Bob | HR | Manager | 7000 |
| Charlie | IT | Developer | 6000 |
| David | IT | Analyst | 8000 |
| Emma | HR | Analyst | 6000 |

Query Output (ROLLUP(department, job_role))

| department | job_role | SUM(salary) |
| --- | --- | --- |
| IT | Developer | 11000 |
| IT | Analyst | 8000 |
| HR | Manager | 7000 |
| HR | Analyst | 6000 |
| IT | NULL | 19000 |
| HR | NULL | 13000 |
| NULL | NULL | 32000 |

Difference Between ROLLUP(), CUBE(), and GROUPING SETS()

| Function | Generates Grand Total? | Generates Subtotals? | Generates All Combinations? | Best Use Case |
| --- | --- | --- | --- | --- |
| GROUPING SETS() | ❌ No (unless manually included) | ✅ Custom groupings | ❌ No | Selective subtotals |
| ROLLUP(department, job_role) | ✅ Yes | ✅ Hierarchical subtotals (department → job_role → total) | ❌ No | Hierarchical aggregation |
| CUBE(department, job_role) | ✅ Yes | ✅ All possible group combinations | ✅ Yes | All possible aggregations |

Tricky Edge Cases

  1. Using GROUPING_ID() to Identify Aggregation Level

     SELECT department, job_role, SUM(salary), GROUPING_ID(department, job_role) 
     FROM employees 
     GROUP BY ROLLUP(department, job_role);
    
    • Helps differentiate subtotals and grand total rows.
  2. Filtering Out Grand Total

     SELECT department, job_role, SUM(salary) 
     FROM employees 
     GROUP BY ROLLUP(department, job_role)
     HAVING department IS NOT NULL;
    
    • Removes the (NULL, NULL) grand total row.
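
Both hierarchies are also available directly on DataFrames via rollup() and cube(); a small sketch assuming a hypothetical df_employees DataFrame with department, job_role, and salary columns:

from pyspark.sql import functions as F

# ROLLUP: department -> job_role -> grand total
df_employees.rollup("department", "job_role") \
    .agg(F.sum("salary").alias("total_salary")) \
    .orderBy("department", "job_role") \
    .show()

# CUBE: every combination of department and job_role, plus the grand total
df_employees.cube("department", "job_role") \
    .agg(F.sum("salary").alias("total_salary")) \
    .show()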

Approximate Aggregations (For Big Data)

APPROX_COUNT_DISTINCT(column) – Faster unique count estimation.

SELECT APPROX_COUNT_DISTINCT(employee_id) FROM employees;

APPROX_PERCENTILE(column, percentile) – Faster percentile computation.

SELECT APPROX_PERCENTILE(salary, 0.9) FROM employees;

Custom Aggregations Using User-Defined Functions (UDAF)

You can define a custom aggregation in PySpark. A plain udf() runs row by row, so a true aggregation needs a grouped-aggregate pandas UDF (Spark 3.x, pyarrow required):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
import pandas as pd

spark = SparkSession.builder.appName("UDAF Example").getOrCreate()

# Grouped-aggregate pandas UDF: receives all salaries of a group as a pandas Series
@pandas_udf(DoubleType())
def custom_avg(salaries: pd.Series) -> float:
    return float(salaries.mean())

df.groupBy("department").agg(custom_avg("salary").alias("custom_avg")).show()

Time-Series Aggregations

DATE_TRUNC(unit, column) – Aggregation based on time intervals.

SELECT DATE_TRUNC('month', order_date), SUM(sales) FROM sales GROUP BY 1;

WINDOW() – Used for time-based aggregations in structured streaming.

SELECT window(time_column, '1 hour'), COUNT(*) FROM events GROUP BY 1;
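
The window() function can also be used from the DataFrame API, on batch or streaming data alike; a sketch assuming a hypothetical df_events DataFrame with a timestamp column named time_column:

from pyspark.sql import functions as F

# Count events per 1-hour tumbling window
df_events.groupBy(F.window("time_column", "1 hour")) \
    .count() \
    .orderBy("window") \
    .show(truncate=False)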

Quantile-Based Aggregations

PERCENT_RANK()

Computes the relative rank of a row.

SELECT name, salary, PERCENT_RANK() OVER (ORDER BY salary DESC) FROM employees;

Explanation of PERCENT_RANK() OVER (ORDER BY salary DESC) in Spark SQL

The query:

SELECT name, salary, PERCENT_RANK() OVER (ORDER BY salary DESC) 
FROM employees;

Purpose

  • PERCENT_RANK() computes the relative rank of each row as a percentage between 0 and 1.

  • It compares a row’s position within the dataset without assigning equal ranks to duplicates.

  • Ordering:

    • ORDER BY salary DESC → Ranks employees based on salary in descending order (highest salary gets rank 0).

    • If multiple employees have the same salary, they receive the same rank, but PERCENT_RANK() continues counting.

  • The PERCENT_RANK() function calculates the relative rank of a row as a percentage between 0 and 1 using the formula:

    PERCENT_RANK = (Rank - 1) / (Total Rows - 1)

    Where:

    • Rank = The rank of the current row (using RANK() function).

    • Total Rows = The total number of rows in the result set.


How It Works

  1. Ranks employees by salary in descending order.

  2. Computes the relative rank as a percentage from 0 to 1.

  3. Formula ensures the highest-ranked employee gets 0, and the lowest-ranked gets 1.


Example Dataset

| Employee | Salary |
| --- | --- |
| Alice | 9000 |
| Bob | 8000 |
| Charlie | 7000 |
| David | 7000 |
| Emma | 6000 |

Query Output (PERCENT_RANK() OVER (ORDER BY salary DESC))

| Employee | Salary | PERCENT_RANK |
| --- | --- | --- |
| Alice | 9000 | 0.00 |
| Bob | 8000 | 0.25 |
| Charlie | 7000 | 0.50 |
| David | 7000 | 0.50 |
| Emma | 6000 | 1.00 |

Understanding the Values

  • Alice (9000) → Rank = 1 → PERCENT_RANK = (1-1)/(5-1) = 0.00

  • Bob (8000) → Rank = 2 → PERCENT_RANK = (2-1)/(5-1) = 0.25

  • Charlie & David (7000) → Tied Rank = 3 → PERCENT_RANK = (3-1)/(5-1) = 0.50

  • Emma (6000) → Rank = 5 → PERCENT_RANK = (5-1)/(5-1) = 1.00


Difference Between PERCENT_RANK(), RANK(), and DENSE_RANK()

| Function | Handles Duplicates? | Skips Ranks? | Output Range |
| --- | --- | --- | --- |
| PERCENT_RANK() | ✅ Yes | ✅ Yes | 0 to 1 (Percentage) |
| RANK() | ✅ Yes | ✅ Yes | 1, 2, 2, 4, ... (Skips ranks) |
| DENSE_RANK() | ✅ Yes | ❌ No | 1, 2, 2, 3, ... (No gaps) |

Alternative Approaches

  1. Ranking Employees with RANK()

     SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS rank
     FROM employees;
    
    • Outputs actual ranking instead of a percentage.
  2. Using PERCENT_RANK() with Partitioning

     SELECT department, name, salary, 
            PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary DESC) 
     FROM employees;
    
    • Ranks employees within each department separately.

CUME_DIST()

Computes cumulative distribution of values.

SELECT name, salary, CUME_DIST() OVER (ORDER BY salary DESC) FROM employees;

Data & Calculation

| name | salary |
| --- | --- |
| Alice | 90,000 |
| Bob | 80,000 |
| Charlie | 80,000 |
| David | 70,000 |
| Emma | 60,000 |

Step-by-Step Calculation of CUME_DIST()

| name | salary | Rank | Rows ≤ current row | Total Rows | CUME_DIST() |
| --- | --- | --- | --- | --- | --- |
| Alice | 90,000 | 1 | 1 | 5 | 1/5 = 0.20 |
| Bob | 80,000 | 2 | 3 | 5 | 3/5 = 0.60 |
| Charlie | 80,000 | 2 | 3 | 5 | 3/5 = 0.60 |
| David | 70,000 | 4 | 4 | 5 | 4/5 = 0.80 |
| Emma | 60,000 | 5 | 5 | 5 | 5/5 = 1.00 |

Interpretation:

  • Alice’s salary is the highest (1 out of 5 people, 0.20 cumulative distribution).

  • Bob and Charlie share the same salary, so they have the same CUME_DIST() value (3/5 = 0.60).

  • The lowest salary has a CUME_DIST() of 1.0 since 100% of salaries are ≤ that value.


MEDIAN() (Custom Implementation)

Spark doesn’t have a built-in median, but it can be calculated as:

SELECT percentile_approx(salary, 0.5) FROM employees;

HISTOGRAM() – Creating Binned Distributions

Computes a histogram (frequency distribution) of numerical data.

  • Example:

      SELECT histogram_numeric(salary, 10) FROM employees;
    
    • This bins salary into 10 equal-sized ranges and returns frequency counts.

Purpose

  • HISTOGRAM_NUMERIC(column, num_bins) approximates the histogram distribution of a numeric column.

  • The second parameter (10 in this case) defines the number of bins (ranges).

  • It returns an array of (bucket, frequency) pairs, showing how salaries are distributed across the bins.


How It Works

  1. Divides the salary values into num_bins equally spaced intervals.

  2. Counts the number of values falling into each bin.

  3. Returns an array of tuples (bin_value, frequency), where:

    • bin_value → Midpoint or representative value of the bin.

    • frequency → Count of salaries falling into that bin.


Example Dataset

| Employee | Salary |
| --- | --- |
| A | 3000 |
| B | 5000 |
| C | 6000 |
| D | 7000 |
| E | 10000 |

Query Output (HISTOGRAM_NUMERIC(salary, 3))

    SELECT HISTOGRAM_NUMERIC(salary, 3) FROM employees;
| Bin (Midpoint) | Frequency (Count) |
| --- | --- |
| 3000.0 | 1 |
| 6000.0 | 3 |
| 10000.0 | 1 |

Explanation:

  • The 3 bins divide the range (3000 - 10000) into 3 groups.

  • The first bin (3000.0) → Contains 1 salary (3000).

  • The second bin (6000.0) → Contains 5000, 6000, 7000 (3 salaries).

  • The third bin (10000.0) → Contains 1 salary (10000).


Using More Bins (HISTOGRAM_NUMERIC(salary, 5))

    SELECT HISTOGRAM_NUMERIC(salary, 5) FROM employees;
| Bin (Midpoint) | Frequency (Count) |
| --- | --- |
| 3000.0 | 1 |
| 5000.0 | 1 |
| 6000.0 | 1 |
| 7000.0 | 1 |
| 10000.0 | 1 |
  • Now each salary falls into its own bin since num_bins = 5 (matching the number of salaries).

Key Observations

HISTOGRAM_NUMERIC() helps visualize data distribution in a structured way.
More bins increase granularity, fewer bins group data together.
Useful for histogram visualizations, anomaly detection, and trends.


Alternative Approaches

  1. Using PERCENTILE_APPROX() for Quartile-Based Distribution

     SELECT PERCENTILE_APPROX(salary, ARRAY(0.25, 0.5, 0.75)) FROM employees;
    
    • Returns quartile distributions instead of a histogram.
  2. Grouping Data Manually

     SELECT 
         CASE 
             WHEN salary < 5000 THEN 'Low'
             WHEN salary BETWEEN 5000 AND 8000 THEN 'Medium'
             ELSE 'High'
         END AS salary_range,
         COUNT(*)
     FROM employees
     GROUP BY salary_range;
    
    • Creates custom salary groups instead of bins.

BIT_OR() / BIT_AND() – Bitwise Aggregations

  • Performs bitwise OR / AND operations across grouped values.

  • Example:

      SELECT department, BIT_OR(permissions) FROM users GROUP BY department;
    

STRING_AGG() – Concatenating Strings in Aggregation

  • Concatenates multiple string values into a single string.

  • Example:

      SELECT department, STRING_AGG(employee_name, ', ') FROM employees GROUP BY department;
    
    • Outputs comma-separated lists of employee names per department.

HYPERLOGLOG_PLUS()

Advanced Cardinality Estimation

  • Similar to APPROX_COUNT_DISTINCT(), but optimized for large-scale unique counting. (Open-source Spark SQL exposes HyperLogLog++ through APPROX_COUNT_DISTINCT(); a standalone HYPERLOGLOG_PLUS() function may not be available in your Spark version.)

  • Example:

      SELECT HYPERLOGLOG_PLUS(user_id) FROM website_visits;
    

ENTROPY() – Information Theory Aggregation

  • Measures data randomness or distribution balance.

  • Example:

      SELECT ENTROPY(sales_category) FROM transactions;
    

MODE() (Built-in in Some Versions)

  • Finds the most frequently occurring value.

  • Example (alternative implementation for older versions):

      SELECT value FROM dataset GROUP BY value ORDER BY COUNT(*) DESC LIMIT 1;
    

GEOMETRIC_MEAN() – Multiplicative Mean

  • Computes the geometric mean (useful in financial applications).

  • Example:

      SELECT EXP(AVG(LOG(salary))) FROM employees;
    

HARMONIC_MEAN() – Reciprocal Mean Calculation

  • Computes the harmonic mean for rates or ratios.

  • Example:

      SELECT COUNT(*) / SUM(1.0 / salary) FROM employees;
    

MEDIAN() (Alternative Approach)

  • Spark SQL does not have a built-in MEDIAN() function, but it can be approximated using:

      SELECT percentile_approx(salary, 0.5) FROM employees;
    
    • Accurate for large datasets.

PRODUCT() – Multiplication Aggregation

  • Returns the product of column values (alternative implementation in Spark SQL).

  • Example:

      SELECT EXP(SUM(LOG(value))) FROM numbers;
    

SKEWNESS() – Skewness of a Distribution

  • Measures the asymmetry of the distribution of values.

  • Example:

      SELECT SKEWNESS(salary) FROM employees;
    

KURTOSIS() – Peakedness of a Distribution

  • Measures whether the dataset has light/heavy tails.

  • Example:

      SELECT KURTOSIS(salary) FROM employees;
    

RATIO_TO_REPORT() – Proportional Contribution

  • Calculates the percentage of each value over the total sum.

  • Example:

      SELECT department, salary, salary / SUM(salary) OVER (PARTITION BY department) AS ratio FROM employees;
    

LEAD() / LAG() – Comparing Aggregates Over Rows

  • Accesses previous (LAG) or next (LEAD) row values.

  • Example:

      SELECT name, salary, LAG(salary, 1) OVER (ORDER BY salary DESC) FROM employees;
    

PIVOT() – Transforming Rows into Columns

  • Used for pivoting categorical data.

  • Example:

      SELECT * FROM ( 
        SELECT department, job_role, salary FROM employees 
      ) PIVOT (
        SUM(salary) FOR job_role IN ('Manager', 'Developer', 'Analyst')
      );
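
The DataFrame API has an equivalent pivot() method; a minimal sketch assuming a hypothetical df_employees DataFrame with department, job_role, and salary columns:

# Rows become departments, job roles become columns, cells hold salary totals
df_employees.groupBy("department") \
    .pivot("job_role", ["Manager", "Developer", "Analyst"]) \
    .sum("salary") \
    .show()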
    

UNPIVOT() – Transforming Columns into Rows

  • Reverse of PIVOT().

  • Example:

      SELECT department, job_role, salary 
      FROM employees 
      UNPIVOT (salary FOR job_role IN (Manager, Developer, Analyst));
    

ANY_VALUE() – Random Value in Group

  • Returns an arbitrary value from the group without enforcing order.

  • Example:

      SELECT department, ANY_VALUE(employee_name) FROM employees GROUP BY department;
    

GROUPING() – Detecting Aggregation Levels

  • Identifies whether a column is aggregated in a GROUPING SETS(), ROLLUP(), or CUBE().

  • Example:

      SELECT department, job_role, SUM(salary), GROUPING(department), GROUPING(job_role) 
      FROM employees 
      GROUP BY ROLLUP(department, job_role);
    

TOP N Records using ROW_NUMBER()

  • Fetch Top N records per group.

  • Example:

      SELECT * FROM (
        SELECT name, department, salary, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rank
        FROM employees
      ) WHERE rank <= 3;
    

    MODE() (Efficient Alternative Using Window Functions)

    • Instead of the earlier workaround for MODE(), we can use DENSE_RANK().

    • Example:

        SELECT value 
        FROM (
          SELECT value, COUNT(*) AS frequency, DENSE_RANK() OVER (ORDER BY COUNT(*) DESC) AS rnk 
          FROM dataset 
          GROUP BY value
        ) WHERE rnk = 1;
      
      • This method works efficiently for datasets where multiple values may have the highest frequency.

GROUP_CONCAT() – Concatenation of Grouped Values

  • Alternative to STRING_AGG(), but not natively available in Spark SQL (can be simulated).

  • Example using COLLECT_LIST():

      SELECT department, CONCAT_WS(', ', COLLECT_LIST(employee_name)) FROM employees GROUP BY department;
    

BIT_XOR() – Bitwise XOR Aggregation

  • Computes the bitwise XOR across grouped values.

  • Example:

      SELECT department, BIT_XOR(permissions) FROM users GROUP BY department;
    

    Understanding BIT_XOR(permissions) in SQL

    The BIT_XOR() function performs a bitwise XOR (Exclusive OR) operation on all values in a group. It is commonly used in cryptographic applications, permission calculations, checksum operations, and error detection mechanisms.


    1. How BIT_XOR() Works

      SELECT department, BIT_XOR(permissions) 
      FROM users 
      GROUP BY department;
    
    • Groups data by department.

    • Performs a bitwise XOR operation across all permissions values.

    • Returns a single XOR'd value per department.


2. What is Bitwise XOR (⊕)?

  • XOR (Exclusive OR) follows these rules:

    • 1 ⊕ 1 = 0

    • 0 ⊕ 0 = 0

    • 1 ⊕ 0 = 1

    • 0 ⊕ 1 = 1

Example Table:

| department | permissions (Binary) | permissions (Decimal) |
| --- | --- | --- |
| HR | 1101 | 13 |
| HR | 1011 | 11 |
| IT | 1111 | 15 |
| IT | 0010 | 2 |

Common Use Case:

  • Read Permission = 1 (0001 in binary)

  • Write Permission = 2 (0010 in binary)

  • Execute Permission = 4 (0100 in binary)

  • Admin Permission = 8 (1000 in binary)

3. Step-by-Step Calculation

For HR Department:

           1101 (13)
        ⊕ 1011 (11)
        ------------
          0110 (6)  --> BIT_XOR(permissions) = 6

For IT Department:

           1111 (15)
        ⊕ 0010 (2)
        ------------
          1101 (13) --> BIT_XOR(permissions) = 13

Final Result:

| department | BIT_XOR(permissions) |
| --- | --- |
| HR | 6 |
| IT | 13 |

Interpretation:

  • HR department's XOR'd permissions result in 6.

  • IT department's XOR'd permissions result in 13.


4. BIT_XOR() vs. Other Bitwise Aggregations

| Function | Description |
| --- | --- |
| BIT_XOR() | Bitwise XOR of all values |
| BIT_AND() | Bitwise AND of all values |
| BIT_OR() | Bitwise OR of all values |

5. BIT_XOR() with NULL Values

If permissions contains NULL, it is ignored.
If all values in a group are NULL, BIT_XOR() returns NULL.

Example:

        SELECT department, BIT_XOR(permissions) FROM users GROUP BY department;
| department | permissions |
| --- | --- |
| HR | 13 |
| HR | NULL |
| IT | 15 |

HR department ignores NULL and computes BIT_XOR(13) = 13.

If all values were NULL, the result would be NULL.


6. BIT_XOR() with HAVING – Filtering Groups

Find departments where XOR'd permissions exceed 5.

        SELECT department, BIT_XOR(permissions) AS xor_permissions
        FROM users
        GROUP BY department
        HAVING BIT_XOR(permissions) > 5;
  • Keeps only departments where the computed XOR result is greater than 5.

7. BIT_XOR() in Joins – Ensuring Correct Aggregation

Joins can introduce duplicate values, affecting BIT_XOR().

Incorrect Query (Duplicates Affect XOR)

        SELECT d.department_name, BIT_XOR(u.permissions)
        FROM departments d
        JOIN users u ON d.department_id = u.department_id
        GROUP BY d.department_name;
  • If a user appears multiple times due to the join, their permission values are XOR'd multiple times.

  • Fix: Use DISTINCT to avoid duplicates:

      SELECT d.department_name, BIT_XOR(DISTINCT u.permissions)
      FROM departments d
      JOIN users u ON d.department_id = u.department_id
      GROUP BY d.department_name;
    

8. BIT_XOR() with CASE – Conditional XOR

Find XOR'd permissions only for Managers.

        SELECT department, 
               BIT_XOR(CASE WHEN role = 'Manager' THEN permissions ELSE 0 END) AS manager_xor
        FROM users
        GROUP BY department;
  • Computes XOR only for employees with the role "Manager".
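
  • Because x ⊕ 0 = x, the ELSE 0 branch leaves the XOR result unchanged. Dropping the ELSE (so non-managers become NULL and are skipped) gives the same result for mixed groups but returns NULL instead of 0 for departments with no managers:

        SELECT department,
               BIT_XOR(CASE WHEN role = 'Manager' THEN permissions END) AS manager_xor
        FROM users
        GROUP BY department;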

9. BIT_XOR() with UNION vs. UNION ALL

UNION removes duplicates, which may affect BIT_XOR() results.
UNION ALL keeps duplicates, potentially changing XOR calculations.

Example:

        SELECT department, BIT_XOR(permissions) FROM (
            SELECT department, permissions FROM users
            UNION
            SELECT department, permissions FROM admins
        ) AS combined_data
        GROUP BY department;
  • Merges users and admins before computing XOR.

  • If using UNION ALL, duplicate values remain and may alter results.


10. BIT_XOR() with Binary Data

XOR operations are commonly used in cryptography, security, and access control. A real-world use case is storing permission bitmasks (e.g., Read = 1, Write = 2, Execute = 4).

Example:

        SELECT department, BIT_XOR(permissions) AS xor_permissions
        FROM users
        GROUP BY department;
| department | XOR Permissions (Binary) | XOR Permissions (Decimal) |
| --- | --- | --- |
| HR | 0110 | 6 |
| IT | 1101 | 13 |

This allows us to quickly calculate access control bitmasks for departments.


11. Summary of Tricky SQL Concepts with BIT_XOR()

| Scenario | Explanation |
| --- | --- |
| BIT_XOR() computes bitwise XOR | XORs all values in a group |
| BIT_XOR() ignores NULL | NULL values are skipped, but all NULLs return NULL |
| BIT_XOR() vs. BIT_AND() vs. BIT_OR() | Different bitwise aggregations |
| BIT_XOR() with HAVING | Filters groups based on XOR results |
| BIT_XOR() in Joins | Duplicate values may alter XOR results |
| BIT_XOR() with CASE | Conditional XOR within groups |
| BIT_XOR() in UNION vs. UNION ALL | UNION removes duplicates, affecting XOR |

LOG_SUM_EXP()

Logarithmic Summation for Stability

  • Useful in machine learning applications (avoiding underflow).

  • Example:

      SELECT LOG(SUM(EXP(value))) FROM dataset;
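
  • Note that the naive form above can still overflow or underflow inside EXP(). A stabler sketch (assuming a single numeric column value in dataset) shifts by the maximum before exponentiating:

      SELECT MAX(mx) + LOG(SUM(EXP(value - mx))) AS log_sum_exp
      FROM (SELECT value, MAX(value) OVER () AS mx FROM dataset) t;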
    

COUNT_IF()

Conditional Counting

  • A cleaner way to count values meeting a condition.

  • Example (alternative using SUM()):

      SELECT SUM(CASE WHEN salary > 50000 THEN 1 ELSE 0 END) FROM employees;
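
  • On Spark 3.0 and later, the dedicated COUNT_IF() aggregate expresses the same condition directly:

      SELECT COUNT_IF(salary > 50000) AS high_earners FROM employees;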
    

APPROX_MEDIAN() – Faster Median Computation for Big Data

  • Approximates the median with PERCENTILE_APPROX(salary, 0.5); the optional third argument (accuracy, 10000 here) trades memory for precision.

  • Example:

      SELECT percentile_approx(salary, 0.5, 10000) FROM employees;
    

RATIO()

Ratio Calculation Within Groups

  • Useful in reporting the percentage of a group within a total.

  • Example:

      SELECT department, COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS percentage 
      FROM employees 
      GROUP BY department;
    

REGR_SLOPE(), REGR_INTERCEPT()

Regression Aggregations

  • Used for linear regression modeling directly in SQL.

  • Example:

      SELECT REGR_SLOPE(salary, experience), REGR_INTERCEPT(salary, experience) FROM employees;
    
    • Finds the best-fit line between salary and experience.
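
  • On Spark versions that lack the REGR_* aggregates, the same slope and intercept can be derived from covariance and variance (a sketch assuming numeric salary and experience columns):

      SELECT COVAR_POP(salary, experience) / VAR_POP(experience) AS slope,
             AVG(salary) - COVAR_POP(salary, experience) / VAR_POP(experience) * AVG(experience) AS intercept
      FROM employees;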

REGR_R2()

Coefficient of Determination (R² Score)

  • Measures how well data fits a regression model.

  • Example:

      SELECT REGR_R2(salary, experience) FROM employees;
    

UNIQUE_COUNT_ESTIMATE()

Faster Approximate Unique Counting

  • A variant of APPROX_COUNT_DISTINCT().

  • Example:

      SELECT UNIQUE_COUNT_ESTIMATE(user_id) FROM website_visits;
    

MAD()

Median Absolute Deviation for Outlier Detection

  • Measures the spread of data around the median.

  • Example:

      SELECT PERCENTILE_APPROX(ABS(e.salary - m.median_salary), 0.5) AS mad
      FROM employees e
      CROSS JOIN (SELECT PERCENTILE_APPROX(salary, 0.5) AS median_salary FROM employees) m;
    

GREATEST() and LEAST()

Row-wise Maximum/Minimum

  • Finds the highest or lowest value among multiple columns.

  • Example:

      SELECT GREATEST(salary_2022, salary_2023, salary_2024) FROM employees;
    
      SELECT LEAST(salary_2022, salary_2023, salary_2024) FROM employees;
    

COUNT_MIN_SKETCH()

Frequency Approximation for Streaming Data

  • Useful for real-time aggregation.

  • Example:

      -- count_min_sketch(col, eps, confidence, seed) returns a binary sketch; HEX() makes it printable
      SELECT HEX(COUNT_MIN_SKETCH(salary, 0.01D, 0.99D, 1)) FROM employees;
    
    • Helps with approximate top-K element tracking.

QUANTILE()

Exact Percentiles for Small Datasets

  • Unlike PERCENTILE_APPROX(), this is for exact quantiles on smaller datasets.

  • Example:

      SELECT QUANTILE(salary, 0.75) FROM employees;
    

GROUPING_ID()

Identifying Aggregation Levels in GROUPING SETS()

  • Helps distinguish which columns are aggregated vs. not aggregated.

  • Example:

      SELECT department, job_role, SUM(salary), GROUPING_ID(department, job_role) 
      FROM employees 
      GROUP BY GROUPING SETS ((department), (job_role), (department, job_role));
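
  • To interpret the result: the leftmost argument of GROUPING_ID() maps to the highest bit, and a bit is 1 when that column is rolled up. Here, (department, job_role) rows get 0, (department)-only rows get 1, and (job_role)-only rows get 2. A sketch that keeps just the fully grouped level:

      SELECT * FROM (
        SELECT department, job_role, SUM(salary) AS total_salary,
               GROUPING_ID(department, job_role) AS gid
        FROM employees
        GROUP BY GROUPING SETS ((department), (job_role), (department, job_role))
      ) WHERE gid = 0;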
    

INVERSE_PERCENTILE()

Reverse Lookup of Percentiles

  • Given a percentile, finds the corresponding value.

  • Example:

      SELECT INVERSE_PERCENTILE(salary, 0.9) FROM employees;
    

EXPONENTIAL_MOVING_AVG()

Exponential Smoothing for Time Series

  • Used for time-series forecasting.

  • Example:

      SELECT EXPONENTIAL_MOVING_AVG(sales, 0.3) OVER (ORDER BY order_date) FROM sales_data;
    

DECILE()

Dividing Data into 10 Equal Parts

  • Similar to quartiles and percentiles.

  • Example:

      SELECT NTILE(10) OVER (ORDER BY salary) FROM employees;
    

MOVING_AVERAGE()

Rolling Window Aggregation

  • Calculates a rolling average for time-series data.

  • Example:

      SELECT order_date, AVG(sales) OVER (ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg 
      FROM sales;
    

STDDEV_SAMP() and STDDEV_POP()

Sample & Population Standard Deviation

  • We previously covered STDDEV() but Spark SQL provides two variants:

    • STDDEV_SAMP() – Sample standard deviation.

    • STDDEV_POP() – Population standard deviation.

  • Example:

      SELECT STDDEV_SAMP(salary), STDDEV_POP(salary) FROM employees;
    

VAR_SAMP() and VAR_POP()

Sample & Population Variance

  • Similar to standard deviation, but returns variance instead.

  • Example:

      SELECT VAR_SAMP(salary), VAR_POP(salary) FROM employees;
    

BLOOM_FILTER_AGG()

Probabilistic Set Membership

  • Efficiently tests whether an element is in a dataset (without full scan).

  • Example:

      SELECT BLOOM_FILTER_AGG(user_id, 100000, 0.01) FROM transactions;
    

TOP_K()

Approximate Most Frequent Elements

  • Helps in identifying top K elements efficiently.

  • Example:

      SELECT TOP_K(product_id, 5) FROM sales_data;
    

EVENT_RATE_ESTIMATE()

Approximate Count of Rare Events

  • Used in fraud detection & anomaly detection.

  • Example:

      SELECT EVENT_RATE_ESTIMATE(transaction_id) FROM fraud_logs;
    

HASH_AGG()

Hash-based Aggregation for Large Datasets

  • Aggregates with a hash table instead of sort-based grouping, which is typically faster when there are many distinct groups.

  • Example:

      SELECT HASH_AGG(salary) FROM employees;
    

STRING_AGG() with ORDER BY (Concatenation in Sorted Order)

  • A sorted version of STRING_AGG().

  • Example:

      SELECT department, STRING_AGG(employee_name, ', ' ORDER BY employee_name) 
      FROM employees 
      GROUP BY department;
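
  • Spark SQL has no native STRING_AGG(); the same sorted concatenation can be simulated with array functions (a sketch; alias names are illustrative):

      SELECT department,
             CONCAT_WS(', ', SORT_ARRAY(COLLECT_LIST(employee_name))) AS employees_sorted
      FROM employees
      GROUP BY department;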
    

CARDINALITY_ESTIMATE() – Fast Approximate Cardinality

  • Similar to APPROX_COUNT_DISTINCT(), but optimized for streaming data.

  • Example:

      SELECT CARDINALITY_ESTIMATE(user_id) FROM website_logs;
    

BIT_COUNT() – Counting Number of 1s in a Bit Representation

  • Useful in low-level analytics, such as binary data processing.

  • Example:

      SELECT BIT_COUNT(binary_column) FROM bitwise_table;
    

UNIQUE_PERCENTILE()

Approximate Unique Distribution

  • Helps in estimating unique distribution across percentile ranges.

  • Example:

      SELECT UNIQUE_PERCENTILE(salary, 0.8) FROM employees;
    

FREQUENT_ITEMS()

Finding Frequent Elements in Large Datasets

  • Similar to TOP_K(), but optimized for streaming & big data.

  • Example:

      SELECT FREQUENT_ITEMS(transaction_category, 10) FROM sales_data;
    

ROLLING_PERCENTILE()

Moving Window Percentile

  • Time-series percentile calculations.

  • Example:

      SELECT order_date, ROLLING_PERCENTILE(sales, 0.9) OVER (ORDER BY order_date ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) 
      FROM sales;
    

BIT_SET_AGG()

Bitwise Aggregation Across Rows

  • Aggregates binary bitmask values.

  • Example:

      SELECT BIT_SET_AGG(permissions) FROM user_roles;
    

MOST_RECENT()

Last Non-Null Value

  • Similar to LAST(), but ensures no NULL values.

  • Example:

      SELECT department, MOST_RECENT(salary) FROM employees GROUP BY department;
    

EARLIEST()

First Non-Null Value

  • Similar to FIRST(), but excludes NULL values.

  • Example:

      SELECT department, EARLIEST(salary) FROM employees GROUP BY department;
    

DEVIATION()

Absolute Deviation from the Mean

  • Computes absolute deviation from mean (useful for outlier detection).

  • Example:

      SELECT ABS(salary - AVG(salary) OVER ()) FROM employees;
    

PERMUTATION_AGG() – Aggregate Across Permutations

  • Used in graph algorithms & combinatorial aggregations.

  • Example:

      SELECT PERMUTATION_AGG(node_id) FROM graph_data;
    

JACCARD_SIMILARITY() – Set-Based Similarity Calculation

  • Computes similarity between sets using Jaccard Index.

  • Example:

      SELECT JACCARD_SIMILARITY(set1, set2) FROM dataset;
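
  • Not a built-in Spark SQL function; assuming set1 and set2 are array columns, the Jaccard index can be computed with array functions:

      SELECT SIZE(ARRAY_INTERSECT(set1, set2)) / SIZE(ARRAY_UNION(set1, set2)) AS jaccard
      FROM dataset;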
    

COSINE_SIMILARITY() – Similarity Between Feature Vectors

  • Machine learning similarity function.

  • Example:

      SELECT COSINE_SIMILARITY(feature_vector_1, feature_vector_2) FROM embeddings;
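
  • Also not a native Spark SQL function; assuming the feature vectors are array<double> columns, cosine similarity (dot product divided by the product of the L2 norms) can be written with higher-order functions (ZIP_WITH requires Spark 3.0+):

      SELECT AGGREGATE(ZIP_WITH(feature_vector_1, feature_vector_2, (x, y) -> x * y), CAST(0.0 AS DOUBLE), (acc, v) -> acc + v)
             / (SQRT(AGGREGATE(feature_vector_1, CAST(0.0 AS DOUBLE), (acc, v) -> acc + v * v))
                * SQRT(AGGREGATE(feature_vector_2, CAST(0.0 AS DOUBLE), (acc, v) -> acc + v * v))) AS cosine_sim
      FROM embeddings;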
    

TOP_N_PER_GROUP()

Top N Elements Within Each Group

  • Similar to TOP_K(), but per group.

  • Example:

      SELECT * FROM (
        SELECT name, department, salary, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rank
        FROM employees
      ) WHERE rank <= 3;
    

L2_NORM()

Euclidean Distance for Feature Engineering

  • Computes L2 norm (square root of squared sum).

  • Example:

      SELECT SQRT(SUM(POW(feature_value, 2))) FROM feature_dataset;
    

HAVERSINE_DISTANCE() – Distance Between Two Coordinates

  • Used in geospatial calculations.

  • Example:

      SELECT HAVERSINE_DISTANCE(lat1, lon1, lat2, lon2) FROM location_data;
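
  • Spark SQL has no built-in HAVERSINE_DISTANCE(); the formula itself (using an Earth radius of roughly 6371 km) can be spelled out with standard math functions:

      SELECT 2 * 6371 * ASIN(SQRT(
               POW(SIN(RADIANS(lat2 - lat1) / 2), 2) +
               COS(RADIANS(lat1)) * COS(RADIANS(lat2)) *
               POW(SIN(RADIANS(lon2 - lon1) / 2), 2)
             )) AS distance_km
      FROM location_data;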
    

Summary Table

| Category | Function |
| --- | --- |
| Basic Aggregations | COUNT(), SUM(), AVG(), MIN(), MAX() |
| Intermediate Aggregations | VARIANCE(), STDDEV(), RANK(), DENSE_RANK(), ROW_NUMBER(), NTILE() |
| Advanced Aggregations | CUBE(), ROLLUP(), GROUPING SETS(), APPROX_COUNT_DISTINCT(), APPROX_PERCENTILE() |
| Statistical Functions | CORR(), COVAR_POP(), COVAR_SAMP() |
| Array-Based Aggregations | ARRAY_AGG(), COLLECT_SET(), COLLECT_LIST() |
| Time-Series Aggregations | DATE_TRUNC(), WINDOW() |
| Quantile-Based Aggregations | PERCENT_RANK(), CUME_DIST(), PERCENTILE_APPROX() |