Mastering SQL Recursion: Advanced Techniques for Handling Hierarchies, Cycles, and Optimization (Part 2)

Elie FayadElie Fayad
14 min read

Introduction

If you've ever worked with hierarchical or tree-structured data, you’ve probably hit the limitations of standard SQL. That’s where recursive SQL queries come in!

In the first article of this three-part series, we explored recursion—what it is, why it’s crucial for processing hierarchical data, and how it can be implemented in SQL. However, knowing the basics of SQL recursion may not be enough when dealing with complex or "dirty" data. Queries may take too long to execute or run indefinitely. In this article, we’ll tackle advanced recursion techniques to optimize and improve the performance of recursive SQL queries.

By the end of this article, you’ll learn:

  • How to limit recursion depth to prevent infinite loops
  • How to detect and handle cycles in hierarchical data
  • How to optimize recursive queries
  • Alternative methods to recursive CTEs for hierarchical data processing

PREREQUISITES:

DISCLAIMERS:

  • All examples in this article were run in a Snowflake environment (but concepts apply to PostgreSQL, SQL Server, etc.). To follow along, you can sign up for a free Snowflake trial account here.
  • All code and data used in this article are available in the accompanying GitHub repository.

Understanding our dataset

We will work with a dataset containing vehicle parts manufacturing data. The dataset includes the following fields:

Column NameDescription
Part_IDA unique identifier for each part in the dataset
Part_NameA human-readable name describing the part (e.g., "Engine Block #1")
Part_CategoryThe category of the part (e.g., Engine System, Transmission, etc.)
Part_CostThe manufacturing cost of the part in USD
Part_Manufacturing_DurationThe time required to manufacture the part (in hours)
Depends_OnThe Part_ID of another part that this part depends on (i.e., its parent in the hierarchy). NULL means it's a root component

Here is the available data:

Part IDPart NamePart CategoryDepends OnPart CostPart Manufacturing Duration
1Brake Pad #1Braking System345.002
2Brake Caliper #2Braking System180.004
3Brake Rotor #3Braking System2120.006
4Engine Block #4Engine Systemnull500.0010
5Piston #5Engine System480.005
6Spark Plug #6Engine System510.001
7Oil Filter #7Engine System615.002
8Spark Plug #8Engine System512.001.5
flowchart TD
    4("Engine Block #4") --> 5("Piston #5")
    5 --> 6("Spark Plug #6") & 8("Spark Plug #8")
    6 --> 7("Oil Filter #7")
    3("Brake Rotor #3") --> 1("Brake Pad #1")
    1 --> 2("Brake Caliper #2")
    2 --> 3

Diagram: Visual representation of the dataset

We will use this dataset to build a Bill of Materials (BoM) for each part.

A Bill of Materials is a list of all components, raw materials or assemblies required to manufacture a product.

If you'd like to test these concepts further, check out the accompanying repository for additional datasets.

Based on what we’ve learned so far, the query to build the BoM should look something like this:

WITH 
    RECURSIVE PARTS_BOM AS (
        -- Anchor Member: Start with all parts (even those that have dependencies)
        SELECT
            VP.PART_ID,
            VP.PART_NAME,
            VP.PART_CATEGORY,
            VP.PART_ID::VARCHAR AS BOM
        FROM 
            RECURSIVE_CTE_DATA.VEHICLE_PARTS_W_CYCLES_ARTICLE VP

        UNION ALL

        -- Recursive Member: Traverse the parts that depend on others
        SELECT
            VP.PART_ID,
            VP.PART_NAME,
            VP.PART_CATEGORY,
            PB.BOM || '->' || VP.PART_ID::VARCHAR AS BOM
        FROM 
            RECURSIVE_CTE_DATA.VEHICLE_PARTS_W_CYCLES_ARTICLE VP
            INNER JOIN PARTS_BOM PB
                ON VP.DEPENDS_ON = PB.PART_ID
)
-- Final Select: Retrieve the BOM for all parts
SELECT
    PART_ID,
    PART_NAME,
    PART_CATEGORY,
    BOM
FROM PARTS_BOM
ORDER BY PART_ID;

However, if you try to execute this query, you’ll notice that it never ends.

Infinite loops in recursive queries

Infinite loops occur if the recursive query keeps generating new rows indefinitely. It typically happens due to one of the following reasons:

  • Lack of a termination condition: If the recursive query does not include a stopping condition or the hierarchy does not come to a natural end, the execution will go on indefinitely.
  • Presence of cycles in hierarchical data: Cycles occur when a node directly or indirectly refers back to itself, creating a loop instead of a proper tree or directed acyclic graph (DAG). This often happens in parent-child relationships where a descendant mistakenly becomes an ancestor, causing infinite recursion in queries.

Preventing inifinite loops in recursive queries

Several methods exist to avoid infinite loops. These can be broadly categorized into:

  • Limiting Recursion Depth
  • Detecting and handling cycles in hierarchical data

Recursion Depth

Recursion depth refers to the number of times a recursive query calls itself before stopping. Limiting it ensures that recursion does not continue indefinitely. If recursion depth is too high without proper constraints, it can lead to infinite loops or excessive memory usage.

How to limit recursion depth:

  1. Add a numeric field, RECURSION_DEPTH, to the Anchor Member and initialize it to 0 or 1.
  2. In the Recursive Member, increment RECURSION_DEPTH by 1. Each time the recursive part of the query executes, it adds 1 to the parent's RECURSION_DEPTH, effectively tracking the recursion depth.
  3. Use RECURSION_DEPTH in the WHERE clause of the Recursive Member to define a stopping condition.

Some database systems impose a default recursion depth limit to prevent infinite loops.

Applying this logic to our use case, we can define a new query:

WITH 
    RECURSIVE PARTS_BOM AS (
        -- Anchor Member: Start with all parts (even those that have dependencies)
        SELECT
            VP.PART_ID,
            VP.PART_NAME,
            VP.PART_CATEGORY,
            VP.PART_ID::VARCHAR AS BOM,
            1 AS RECURSION_DEPTH
        FROM 
            RECURSIVE_CTE_DATA.VEHICLE_PARTS_W_CYCLES_ARTICLE VP

        UNION ALL

        -- Recursive Member: Traverse the parts that depend on others
        SELECT
            VP.PART_ID,
            VP.PART_NAME,
            VP.PART_CATEGORY,
            PB.BOM || '->' || VP.PART_ID::VARCHAR AS BOM,
            PB.RECURSION_DEPTH + 1 AS RECURSION_DEPTH
        FROM 
            RECURSIVE_CTE_DATA.VEHICLE_PARTS_W_CYCLES_ARTICLE VP
            INNER JOIN PARTS_BOM PB
                ON VP.DEPENDS_ON = PB.PART_ID
        WHERE
            -- Loop at most 10 times
            PB.RECURSION_DEPTH < 10  
)
-- Final Select: Retrieve the BOM for all parts
SELECT
    PART_ID,
    PART_NAME,
    PART_CATEGORY,
    BOM,
    RECURSION_DEPTH
FROM PARTS_BOM
ORDER BY PART_ID;

Executing this query now terminates successfully and produces 43 rows in the output.

Sample output:

PART_IDPART_NAMEPART_CATEGORYBOMRECURSION_DEPTH
1Brake Pad #1Braking System11
1Brake Pad #1Braking System3->12
1Brake Pad #1Braking System2->3->13
1Brake Pad #1Braking System1->2->3->14
1Brake Pad #1Braking System1->2->3->1->2->3->1->2->3->110
4Engine Block #4Engine System41
5Piston #5Engine System51
5Piston #5Engine System4->52
6Spark Plug #6Engine System61
6Spark Plug #6Engine System5->62
6Spark Plug #6Engine System4->5->63
7Oil Filter #7Engine System71
7Oil Filter #7Engine System6->72
7Oil Filter #7Engine System5->6->73
7Oil Filter #7Engine System4->5->6->74

We notice that the query still does not output the expected result. We have several rows for every part depending on the various recusion levels and paths (eg. part 7), we also have repeating patterns in our BoM due to cycles in our data (eg. part 1). Let's improve our solution.


Detecting and handling cycles in hierarchical data

Detecting cycles in hierarchical data ensures that recursion does not create infinite loops. Once detected, cycles can be handled to break the loop and prevent revisiting nodes in the same path.

This can be achieved by tracking all the visited nodes in a certain recursive path:

  1. Add a field, VISITED_NODES, to the Anchor Member and initialize it with the ID of the node itself. This field could be a string, an array or any datatype that can store a series of values. In our example, we will use an array datatype.
  2. In the Recursive Member, add the current node's ID to the VISITED_NODES field of the previous iteration, enabling us to efficiently track all nodes that were already visited in a path.
  3. The VISITED_NODES can then be used in the WHERE clause of the Recursive Member to stop the recursion if a node was already visited.

Often, recursion depth limitation and cycle detection are implemented together to ensure the query finishes executing.

Applying this logic to our use case, we can define a new query:

WITH 
    RECURSIVE PARTS_BOM AS (
        -- Anchor Member: Start with all parts (even those that have dependencies)
        SELECT
            VP.PART_ID,
            VP.PART_NAME,
            VP.PART_CATEGORY,
            VP.PART_ID::VARCHAR AS BOM,
            1 AS RECURSION_DEPTH,
            ARRAY_CONSTRUCT(VP.PART_ID) AS VISITED_NODES
        FROM 
            RECURSIVE_CTE_DATA.VEHICLE_PARTS_W_CYCLES_ARTICLE VP

        UNION ALL

        -- Recursive Member: Traverse the parts that depend on others
        SELECT
            VP.PART_ID,
            VP.PART_NAME,
            VP.PART_CATEGORY,
            PB.BOM || '->' || VP.PART_ID::VARCHAR AS BOM,
            PB.RECURSION_DEPTH + 1 AS RECURSION_DEPTH,
            -- Adding the current ID to the list of visited_nodes
            ARRAY_APPEND(PB.VISITED_NODES, VP.PART_ID) AS VISITED_NODES
        FROM 
            RECURSIVE_CTE_DATA.VEHICLE_PARTS_W_CYCLES_ARTICLE VP
            INNER JOIN PARTS_BOM PB
                ON VP.DEPENDS_ON = PB.PART_ID
        WHERE
            PB.RECURSION_DEPTH < 10 
            -- Skip visited nodes
            AND NOT ARRAY_CONTAINS(VP.PART_ID, PB.VISITED_NODES)
)
-- Final Select: Retrieve the BOM for all parts
SELECT
    PART_ID,
    PART_NAME,
    PART_CATEGORY,
    BOM,
    RECURSION_DEPTH,
    VISITED_NODES
FROM PARTS_BOM
ORDER BY PART_ID, RECURSION_DEPTH;

Executing this query produces 22 rows, a significant improvement.

Sample output:

PART_IDPART_NAMEPART_CATEGORYBOMRECURSION_DEPTHVISITED_NODES
1Brake Pad #1Braking System11[1]
1Brake Pad #1Braking System3->12[3, 1]
1Brake Pad #1Braking System2->3->13[2, 3, 1]
4Engine Block #4Engine System41[4]
5Piston #5Engine System51[5]
5Piston #5Engine System4->52[4, 5]
6Spark Plug #6Engine System61[6]
6Spark Plug #6Engine System5->62[5, 6]
6Spark Plug #6Engine System4->5->63[4, 5, 6]
7Oil Filter #7Engine System71[7]
7Oil Filter #7Engine System6->72[6, 7]
7Oil Filter #7Engine System5->6->73[5, 6, 7]
7Oil Filter #7Engine System4->5->6->74[4, 5, 6, 7]

Although the query successfully handles cyclic data, it does not yet produce the expected final result. Currently, it generates all possible dependency paths for each part. To obtain a single row for each part, listing all its dependencies, we should retain only the row with the maximum recursion depth for each PART_ID, as it contains the full dependency chain.

The updated query becomes:

WITH 
    RECURSIVE PARTS_BOM AS (
        -- Anchor Member: Start with all parts (even those that have dependencies)
        SELECT
            VP.PART_ID,
            VP.PART_NAME,
            VP.PART_CATEGORY,
            VP.PART_ID::VARCHAR AS BOM,
            1 AS RECURSION_DEPTH,
            ARRAY_CONSTRUCT(VP.PART_ID) AS VISITED_NODES
        FROM 
            RECURSIVE_CTE_DATA.VEHICLE_PARTS_W_CYCLES_ARTICLE VP

        UNION ALL

        -- Recursive Member: Traverse the parts that depend on others
        SELECT
            VP.PART_ID,
            VP.PART_NAME,
            VP.PART_CATEGORY,
            PB.BOM || '->' || VP.PART_ID::VARCHAR AS BOM,
            PB.RECURSION_DEPTH + 1 AS RECURSION_DEPTH,
            -- Adding the current ID to the list of visited_nodes
            ARRAY_APPEND(PB.VISITED_NODES, VP.PART_ID) AS VISITED_NODES
        FROM 
            RECURSIVE_CTE_DATA.VEHICLE_PARTS_W_CYCLES_ARTICLE VP
            INNER JOIN PARTS_BOM PB
                ON VP.DEPENDS_ON = PB.PART_ID
        WHERE

            -- Loop at most 10 times
            PB.RECURSION_DEPTH < 10 
            -- Skip visited nodes
            AND NOT ARRAY_CONTAINS(VP.PART_ID, PB.VISITED_NODES)
)
-- Final Select: Retrieve the BOM for all parts
SELECT
    PART_ID,
    PART_NAME,
    PART_CATEGORY,
    BOM,
    RECURSION_DEPTH,
    VISITED_NODES
FROM PARTS_BOM
-- take the row with maximum recursion depth for each part_ID
QUALIFY ROW_NUMBER() OVER (PARTITION BY PART_ID ORDER BY RECURSION_DEPTH DESC) = 1
ORDER BY PART_ID, RECURSION_DEPTH;

Final output:

PART_IDPART_NAMEPART_CATEGORYBOMRECURSION_DEPTHVISITED_NODES
1Brake Pad #1Braking System2->3->13[2, 3, 1]
2Brake Caliper #2Braking System3->1->23[3, 1, 2]
3Brake Rotor #3Braking System1->2->33[1, 2, 3]
4Engine Block #4Engine System41[4]
5Piston #5Engine System4->52[4, 5]
6Spark Plug #6Engine System4->5->63[4, 5, 6]
7Oil Filter #7Engine System4->5->6->74[4, 5, 6, 7]
8Spark Plug #8Engine System4->5->83[4, 5, 8]

How to optimize recursive queries

Optimizing recursive queries is essential in defining an efficient and reliable solution; especially when dealing with large datasets, as rows can exponentially multiply. Here are a few tips to help you optimize your queries:

  1. Filter data early: If you need to process only a subset of your data, exclude the unnecessary data before processing. Apply WHERE conditions as soon as possible to limit the number of rows processed during recursion. For example, if you want to generate the BOM only for parts that belong to the PART_CATEGORY 'Engine System', add a filter in your Anchor Member before processing. Avoid, as much as possible, excluding rows as a post-processing operation.
  2. Clean loops and cycles (if possible): Loops in hierarchical data often represent anomalies. If this applies to your dataset, consider fixing the underlying data before executing the recursive query to avoid unwanted infinite loops.
  3. Use proper indexing: Ensure indexes are created on key columns involved in the recursive joins to speed up lookups.
  4. Avoid unnecessary columns: Select only the columns you require to reduce memory consumption and improve performance.
  5. Use UNION ALL instead of UNION: UNION performs a distinct check, which can be expensive. Using UNION ALL can improve performance by skipping this check.
  6. Monitor and tune query execution plans: Use execution plans and profiling tools (e.g., EXPLAIN in Snowflake) to analyze performance bottlenecks and optimize queries accordingly.

Alternative methods to recursive CTEs for hierarchical data processing

Some systems can provide additional methods to process hierarchical data. These methods can simplify the query syntax (e.g., providing functions to calculate recursion depth and visited nodes) but may impose some limitations (e.g., constraints to use a single table).

Let's consider, for example, Snowflake's CONNECT BY statement. According to the documentation it has the following syntax:

SELECT <column_list> [ , <level_expression> ]
  FROM <data_source>
    START WITH <predicate>
    CONNECT BY [ PRIOR ] <col1_identifier> = [ PRIOR ] <col2_identifier>
           [ , [ PRIOR ] <col3_identifier> = [ PRIOR ] <col4_identifier> ]
           ...
  ...

Where:

  • START WITH defines the Anchor Member conditions
  • CONNECT BY contains the join conditions, stop conditions and filters present in the Recursive Member
  • <level_expression> are expressions provided by Snowflake used to construct the recusion path, determine the recursion depth etc.

To generate the BoM using Snowflake's CONNECT BY, the query would thus look like this:

WITH
    PARTS_BOM AS (
        SELECT
            VP.PART_ID,
            VP.PART_NAME,
            VP.PART_CATEGORY,
            SYS_CONNECT_BY_PATH(VP.PART_ID, ' -> ') BOM,
            LEVEL AS RECURSION_DEPTH
        FROM 
            RECURSIVE_CTE_DATA.VEHICLE_PARTS_W_CYCLES_ARTICLE VP
            START WITH TRUE -- Anchor Member: Start with all parts (even those that have dependencies)
            CONNECT BY 
                -- Recursive Member: Traverse the parts that depend on others
                VP.DEPENDS_ON = PRIOR VP.PART_ID
                -- Loop at most 10 times
                AND PRIOR RECURSION_DEPTH < 10
                -- Skip visited nodes
                -- You could also define an array field just like in the recursive CTE example
                AND PRIOR BOM NOT LIKE '%' || VP.PART_ID || '%'
)
-- Final Select: Retrieve the BOM for all parts
SELECT
    PART_ID,
    PART_NAME,
    PART_CATEGORY,
    BOM,
    RECURSION_DEPTH
FROM PARTS_BOM
-- take the row with maximum recursion depth for each part_ID
QUALIFY ROW_NUMBER() OVER (PARTITION BY PART_ID ORDER BY RECURSION_DEPTH DESC) = 1
ORDER BY PART_ID, RECURSION_DEPTH

Conclusion

Recursive CTEs are a powerful SQL tool for dealing with hierarchical data structures. However, it is important to optimize these queries to avoid excessive resource consumption, infinite loops, and other performance issues, ensuring the creation of robust solutions.

Key takeaways:

  • Using recursion depth to limit iterations
  • Detecting and handling cycles in data
  • Tips to optimize recursive queries

Stay tuned for Part 3, where we’ll cover essential use cases for SQL recursion.


Want to learn more about this topic? Check out the full series here: datainbites.hashnode.dev/series/mastering-sql-recursion.

0
Subscribe to my newsletter

Read articles from Elie Fayad directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Elie Fayad
Elie Fayad

I'm a data professional specializing in SQL and Snowflake, with a strong background in cloud migrations, data platform configuration, ETL/ELT pipeline development, data modeling, and workflow orchestration. I'm proactive, eager to learn, and passionate about tackling new challenges. I enjoy exploring emerging technologies and sharing knowledge with the community!