Introduction

If you've ever worked with hierarchical or tree-structured data, you’ve probably hit the limitations of standard SQL. That’s where recursive SQL queries come in!

In the first article of this three-part series, we explored recursion—what it is, why it’s crucial for processing hierarchical data, and how it can be implemented in SQL. However, knowing the basics of SQL recursion may not be enough when dealing with complex or "dirty" data. Queries may take too long to execute or run indefinitely. In this article, we’ll tackle advanced recursion techniques to optimize and improve the performance of recursive SQL queries.

By the end of this article, you’ll learn:

How to limit recursion depth to prevent infinite loops
How to detect and handle cycles in hierarchical data
How to optimize recursive queries
Alternative methods to recursive CTEs for hierarchical data processing

PREREQUISITES:

Basic SQL knowledge
Basic SQL recursion knowledge, check out Part 1 of the series!

DISCLAIMERS:

All examples in this article were run in a Snowflake environment (but concepts apply to PostgreSQL, SQL Server, etc.). To follow along, you can sign up for a free Snowflake trial account here.
All code and data used in this article are available in the accompanying GitHub repository.

Understanding our dataset

We will work with a dataset containing vehicle parts manufacturing data. The dataset includes the following fields:

Column Name	Description
`Part_ID`	A unique identifier for each part in the dataset
`Part_Name`	A human-readable name describing the part (e.g., "Engine Block #1")
`Part_Category`	The category of the part (e.g., Engine System, Transmission, etc.)
`Part_Cost`	The manufacturing cost of the part in USD
`Part_Manufacturing_Duration`	The time required to manufacture the part (in hours)
`Depends_On`	The `Part_ID` of another part that this part depends on (i.e., its parent in the hierarchy). `NULL` means it's a root component

Here is the available data:

Part ID	Part Name	Part Category	Depends On	Part Cost	Part Manufacturing Duration
1	Brake Pad #1	Braking System	3	45.00	2
2	Brake Caliper #2	Braking System	1	80.00	4
3	Brake Rotor #3	Braking System	2	120.00	6
4	Engine Block #4	Engine System	null	500.00	10
5	Piston #5	Engine System	4	80.00	5
6	Spark Plug #6	Engine System	5	10.00	1
7	Oil Filter #7	Engine System	6	15.00	2
8	Spark Plug #8	Engine System	5	12.00	1.5

flowchart TD
    4("Engine Block #4") --> 5("Piston #5")
    5 --> 6("Spark Plug #6") & 8("Spark Plug #8")
    6 --> 7("Oil Filter #7")
    3("Brake Rotor #3") --> 1("Brake Pad #1")
    1 --> 2("Brake Caliper #2")
    2 --> 3

Diagram: Visual representation of the dataset

We will use this dataset to build a Bill of Materials (BoM) for each part.

A Bill of Materials is a list of all components, raw materials or assemblies required to manufacture a product.

If you'd like to test these concepts further, check out the accompanying repository for additional datasets.

Based on what we’ve learned so far, the query to build the BoM should look something like this:

WITH 
    RECURSIVE PARTS_BOM AS (
        -- Anchor Member: Start with all parts (even those that have dependencies)
        SELECT
            VP.PART_ID,
            VP.PART_NAME,
            VP.PART_CATEGORY,
            VP.PART_ID::VARCHAR AS BOM
        FROM 
            RECURSIVE_CTE_DATA.VEHICLE_PARTS_W_CYCLES_ARTICLE VP

        UNION ALL

        -- Recursive Member: Traverse the parts that depend on others
        SELECT
            VP.PART_ID,
            VP.PART_NAME,
            VP.PART_CATEGORY,
            PB.BOM || '->' || VP.PART_ID::VARCHAR AS BOM
        FROM 
            RECURSIVE_CTE_DATA.VEHICLE_PARTS_W_CYCLES_ARTICLE VP
            INNER JOIN PARTS_BOM PB
                ON VP.DEPENDS_ON = PB.PART_ID
)
-- Final Select: Retrieve the BOM for all parts
SELECT
    PART_ID,
    PART_NAME,
    PART_CATEGORY,
    BOM
FROM PARTS_BOM
ORDER BY PART_ID;

However, if you try to execute this query, you’ll notice that it never ends.

Infinite loops in recursive queries

Infinite loops occur if the recursive query keeps generating new rows indefinitely. It typically happens due to one of the following reasons:

Lack of a termination condition: If the recursive query does not include a stopping condition or the hierarchy does not come to a natural end, the execution will go on indefinitely.
Presence of cycles in hierarchical data: Cycles occur when a node directly or indirectly refers back to itself, creating a loop instead of a proper tree or directed acyclic graph (DAG). This often happens in parent-child relationships where a descendant mistakenly becomes an ancestor, causing infinite recursion in queries.

Preventing inifinite loops in recursive queries

Several methods exist to avoid infinite loops. These can be broadly categorized into:

Limiting Recursion Depth
Detecting and handling cycles in hierarchical data

Recursion Depth

Recursion depth refers to the number of times a recursive query calls itself before stopping. Limiting it ensures that recursion does not continue indefinitely. If recursion depth is too high without proper constraints, it can lead to infinite loops or excessive memory usage.

How to limit recursion depth:

Add a numeric field, RECURSION_DEPTH, to the Anchor Member and initialize it to 0 or 1.
In the Recursive Member, increment RECURSION_DEPTH by 1. Each time the recursive part of the query executes, it adds 1 to the parent's RECURSION_DEPTH, effectively tracking the recursion depth.
Use RECURSION_DEPTH in the WHERE clause of the Recursive Member to define a stopping condition.

Some database systems impose a default recursion depth limit to prevent infinite loops.

Applying this logic to our use case, we can define a new query:

WITH 
    RECURSIVE PARTS_BOM AS (
        -- Anchor Member: Start with all parts (even those that have dependencies)
        SELECT
            VP.PART_ID,
            VP.PART_NAME,
            VP.PART_CATEGORY,
            VP.PART_ID::VARCHAR AS BOM,
            1 AS RECURSION_DEPTH
        FROM 
            RECURSIVE_CTE_DATA.VEHICLE_PARTS_W_CYCLES_ARTICLE VP

        UNION ALL

        -- Recursive Member: Traverse the parts that depend on others
        SELECT
            VP.PART_ID,
            VP.PART_NAME,
            VP.PART_CATEGORY,
            PB.BOM || '->' || VP.PART_ID::VARCHAR AS BOM,
            PB.RECURSION_DEPTH + 1 AS RECURSION_DEPTH
        FROM 
            RECURSIVE_CTE_DATA.VEHICLE_PARTS_W_CYCLES_ARTICLE VP
            INNER JOIN PARTS_BOM PB
                ON VP.DEPENDS_ON = PB.PART_ID
        WHERE
            -- Loop at most 10 times
            PB.RECURSION_DEPTH < 10  
)
-- Final Select: Retrieve the BOM for all parts
SELECT
    PART_ID,
    PART_NAME,
    PART_CATEGORY,
    BOM,
    RECURSION_DEPTH
FROM PARTS_BOM
ORDER BY PART_ID;

Executing this query now terminates successfully and produces 43 rows in the output.

Sample output:

PART_ID	PART_NAME	PART_CATEGORY	BOM	RECURSION_DEPTH
1	Brake Pad #1	Braking System	1	1
1	Brake Pad #1	Braking System	3->1	2
1	Brake Pad #1	Braking System	2->3->1	3
1	Brake Pad #1	Braking System	1->2->3->1	4
…	…	…	…	…
1	Brake Pad #1	Braking System	1->2->3->1->2->3->1->2->3->1	10
4	Engine Block #4	Engine System	4	1
5	Piston #5	Engine System	5	1
5	Piston #5	Engine System	4->5	2
6	Spark Plug #6	Engine System	6	1
6	Spark Plug #6	Engine System	5->6	2
6	Spark Plug #6	Engine System	4->5->6	3
7	Oil Filter #7	Engine System	7	1
7	Oil Filter #7	Engine System	6->7	2
7	Oil Filter #7	Engine System	5->6->7	3
7	Oil Filter #7	Engine System	4->5->6->7	4

We notice that the query still does not output the expected result. We have several rows for every part depending on the various recusion levels and paths (eg. part 7), we also have repeating patterns in our BoM due to cycles in our data (eg. part 1). Let's improve our solution.

Detecting and handling cycles in hierarchical data

Detecting cycles in hierarchical data ensures that recursion does not create infinite loops. Once detected, cycles can be handled to break the loop and prevent revisiting nodes in the same path.

This can be achieved by tracking all the visited nodes in a certain recursive path:

Add a field, VISITED_NODES, to the Anchor Member and initialize it with the ID of the node itself. This field could be a string, an array or any datatype that can store a series of values. In our example, we will use an array datatype.
In the Recursive Member, add the current node's ID to the VISITED_NODES field of the previous iteration, enabling us to efficiently track all nodes that were already visited in a path.
The VISITED_NODES can then be used in the WHERE clause of the Recursive Member to stop the recursion if a node was already visited.

Often, recursion depth limitation and cycle detection are implemented together to ensure the query finishes executing.

Applying this logic to our use case, we can define a new query:

WITH 
    RECURSIVE PARTS_BOM AS (
        -- Anchor Member: Start with all parts (even those that have dependencies)
        SELECT
            VP.PART_ID,
            VP.PART_NAME,
            VP.PART_CATEGORY,
            VP.PART_ID::VARCHAR AS BOM,
            1 AS RECURSION_DEPTH,
            ARRAY_CONSTRUCT(VP.PART_ID) AS VISITED_NODES
        FROM 
            RECURSIVE_CTE_DATA.VEHICLE_PARTS_W_CYCLES_ARTICLE VP

        UNION ALL

        -- Recursive Member: Traverse the parts that depend on others
        SELECT
            VP.PART_ID,
            VP.PART_NAME,
            VP.PART_CATEGORY,
            PB.BOM || '->' || VP.PART_ID::VARCHAR AS BOM,
            PB.RECURSION_DEPTH + 1 AS RECURSION_DEPTH,
            -- Adding the current ID to the list of visited_nodes
            ARRAY_APPEND(PB.VISITED_NODES, VP.PART_ID) AS VISITED_NODES
        FROM 
            RECURSIVE_CTE_DATA.VEHICLE_PARTS_W_CYCLES_ARTICLE VP
            INNER JOIN PARTS_BOM PB
                ON VP.DEPENDS_ON = PB.PART_ID
        WHERE
            PB.RECURSION_DEPTH < 10 
            -- Skip visited nodes
            AND NOT ARRAY_CONTAINS(VP.PART_ID, PB.VISITED_NODES)
)
-- Final Select: Retrieve the BOM for all parts
SELECT
    PART_ID,
    PART_NAME,
    PART_CATEGORY,
    BOM,
    RECURSION_DEPTH,
    VISITED_NODES
FROM PARTS_BOM
ORDER BY PART_ID, RECURSION_DEPTH;

Executing this query produces 22 rows, a significant improvement.

Sample output:

PART_ID	PART_NAME	PART_CATEGORY	BOM	RECURSION_DEPTH	VISITED_NODES
1	Brake Pad #1	Braking System	1	1	[1]
1	Brake Pad #1	Braking System	3->1	2	[3, 1]
1	Brake Pad #1	Braking System	2->3->1	3	[2, 3, 1]
4	Engine Block #4	Engine System	4	1	[4]
5	Piston #5	Engine System	5	1	[5]
5	Piston #5	Engine System	4->5	2	[4, 5]
6	Spark Plug #6	Engine System	6	1	[6]
6	Spark Plug #6	Engine System	5->6	2	[5, 6]
6	Spark Plug #6	Engine System	4->5->6	3	[4, 5, 6]
7	Oil Filter #7	Engine System	7	1	[7]
7	Oil Filter #7	Engine System	6->7	2	[6, 7]
7	Oil Filter #7	Engine System	5->6->7	3	[5, 6, 7]
7	Oil Filter #7	Engine System	4->5->6->7	4	[4, 5, 6, 7]

Although the query successfully handles cyclic data, it does not yet produce the expected final result. Currently, it generates all possible dependency paths for each part. To obtain a single row for each part, listing all its dependencies, we should retain only the row with the maximum recursion depth for each PART_ID, as it contains the full dependency chain.

The updated query becomes:

WITH 
    RECURSIVE PARTS_BOM AS (
        -- Anchor Member: Start with all parts (even those that have dependencies)
        SELECT
            VP.PART_ID,
            VP.PART_NAME,
            VP.PART_CATEGORY,
            VP.PART_ID::VARCHAR AS BOM,
            1 AS RECURSION_DEPTH,
            ARRAY_CONSTRUCT(VP.PART_ID) AS VISITED_NODES
        FROM 
            RECURSIVE_CTE_DATA.VEHICLE_PARTS_W_CYCLES_ARTICLE VP

        UNION ALL

        -- Recursive Member: Traverse the parts that depend on others
        SELECT
            VP.PART_ID,
            VP.PART_NAME,
            VP.PART_CATEGORY,
            PB.BOM || '->' || VP.PART_ID::VARCHAR AS BOM,
            PB.RECURSION_DEPTH + 1 AS RECURSION_DEPTH,
            -- Adding the current ID to the list of visited_nodes
            ARRAY_APPEND(PB.VISITED_NODES, VP.PART_ID) AS VISITED_NODES
        FROM 
            RECURSIVE_CTE_DATA.VEHICLE_PARTS_W_CYCLES_ARTICLE VP
            INNER JOIN PARTS_BOM PB
                ON VP.DEPENDS_ON = PB.PART_ID
        WHERE

            -- Loop at most 10 times
            PB.RECURSION_DEPTH < 10 
            -- Skip visited nodes
            AND NOT ARRAY_CONTAINS(VP.PART_ID, PB.VISITED_NODES)
)
-- Final Select: Retrieve the BOM for all parts
SELECT
    PART_ID,
    PART_NAME,
    PART_CATEGORY,
    BOM,
    RECURSION_DEPTH,
    VISITED_NODES
FROM PARTS_BOM
-- take the row with maximum recursion depth for each part_ID
QUALIFY ROW_NUMBER() OVER (PARTITION BY PART_ID ORDER BY RECURSION_DEPTH DESC) = 1
ORDER BY PART_ID, RECURSION_DEPTH;

Final output:

PART_ID	PART_NAME	PART_CATEGORY	BOM	RECURSION_DEPTH	VISITED_NODES
1	Brake Pad #1	Braking System	2->3->1	3	[2, 3, 1]
2	Brake Caliper #2	Braking System	3->1->2	3	[3, 1, 2]
3	Brake Rotor #3	Braking System	1->2->3	3	[1, 2, 3]
4	Engine Block #4	Engine System	4	1	[4]
5	Piston #5	Engine System	4->5	2	[4, 5]
6	Spark Plug #6	Engine System	4->5->6	3	[4, 5, 6]
7	Oil Filter #7	Engine System	4->5->6->7	4	[4, 5, 6, 7]
8	Spark Plug #8	Engine System	4->5->8	3	[4, 5, 8]

How to optimize recursive queries

Optimizing recursive queries is essential in defining an efficient and reliable solution; especially when dealing with large datasets, as rows can exponentially multiply. Here are a few tips to help you optimize your queries:

Filter data early: If you need to process only a subset of your data, exclude the unnecessary data before processing. Apply WHERE conditions as soon as possible to limit the number of rows processed during recursion. For example, if you want to generate the BOM only for parts that belong to the PART_CATEGORY 'Engine System', add a filter in your Anchor Member before processing. Avoid, as much as possible, excluding rows as a post-processing operation.
Clean loops and cycles (if possible): Loops in hierarchical data often represent anomalies. If this applies to your dataset, consider fixing the underlying data before executing the recursive query to avoid unwanted infinite loops.
Use proper indexing: Ensure indexes are created on key columns involved in the recursive joins to speed up lookups.
Avoid unnecessary columns: Select only the columns you require to reduce memory consumption and improve performance.
Use UNION ALL instead of UNION: UNION performs a distinct check, which can be expensive. Using UNION ALL can improve performance by skipping this check.
Monitor and tune query execution plans: Use execution plans and profiling tools (e.g., EXPLAIN in Snowflake) to analyze performance bottlenecks and optimize queries accordingly.

For more in-depth SQL optimization tips, you can check out the following article!

Alternative methods to recursive CTEs for hierarchical data processing

Some systems can provide additional methods to process hierarchical data. These methods can simplify the query syntax (e.g., providing functions to calculate recursion depth and visited nodes) but may impose some limitations (e.g., constraints to use a single table).

Let's consider, for example, Snowflake's CONNECT BY statement. According to the documentation it has the following syntax:

SELECT <column_list> [ , <level_expression> ]
  FROM <data_source>
    START WITH <predicate>
    CONNECT BY [ PRIOR ] <col1_identifier> = [ PRIOR ] <col2_identifier>
           [ , [ PRIOR ] <col3_identifier> = [ PRIOR ] <col4_identifier> ]
           ...
  ...

Where:

START WITH defines the Anchor Member conditions
CONNECT BY contains the join conditions, stop conditions and filters present in the Recursive Member
<level_expression> are expressions provided by Snowflake used to construct the recusion path, determine the recursion depth etc.

To generate the BoM using Snowflake's CONNECT BY, the query would thus look like this:

WITH
    PARTS_BOM AS (
        SELECT
            VP.PART_ID,
            VP.PART_NAME,
            VP.PART_CATEGORY,
            SYS_CONNECT_BY_PATH(VP.PART_ID, ' -> ') BOM,
            LEVEL AS RECURSION_DEPTH
        FROM 
            RECURSIVE_CTE_DATA.VEHICLE_PARTS_W_CYCLES_ARTICLE VP
            START WITH TRUE -- Anchor Member: Start with all parts (even those that have dependencies)
            CONNECT BY 
                -- Recursive Member: Traverse the parts that depend on others
                VP.DEPENDS_ON = PRIOR VP.PART_ID
                -- Loop at most 10 times
                AND PRIOR RECURSION_DEPTH < 10
                -- Skip visited nodes
                -- You could also define an array field just like in the recursive CTE example
                AND PRIOR BOM NOT LIKE '%' || VP.PART_ID || '%'
)
-- Final Select: Retrieve the BOM for all parts
SELECT
    PART_ID,
    PART_NAME,
    PART_CATEGORY,
    BOM,
    RECURSION_DEPTH
FROM PARTS_BOM
-- take the row with maximum recursion depth for each part_ID
QUALIFY ROW_NUMBER() OVER (PARTITION BY PART_ID ORDER BY RECURSION_DEPTH DESC) = 1
ORDER BY PART_ID, RECURSION_DEPTH

Conclusion

Recursive CTEs are a powerful SQL tool for dealing with hierarchical data structures. However, it is important to optimize these queries to avoid excessive resource consumption, infinite loops, and other performance issues, ensuring the creation of robust solutions.

Key takeaways:

Using recursion depth to limit iterations
Detecting and handling cycles in data
Tips to optimize recursive queries

Stay tuned for Part 3, where we’ll cover essential use cases for SQL recursion.

Want to learn more about this topic? Check out the full series here: datainbites.hashnode.dev/series/mastering-sql-recursion.

Mastering SQL Recursion: Advanced Techniques for Handling Hierarchies, Cycles, and Optimization (Part 2)

Table of contents