Introduction

In computer science, the choice of data structure plays a crucial role in determining how efficiently a problem's query is handled. Selecting the right data structure can drastically reduce execution time, which is essential when solving real-world problems. One such highly efficient data structure is the Disjoint Set Union (DSU), commonly known as Union Find due to the core functionalities it offers.

The disjoint-set data structure is a powerful tool for managing collections of non-overlapping sets. Think of it as a way to organize elements into groups, where each element belongs to exactly one group. It is similar to how you might organize balls into different slots, ensuring that no ball is placed in more than one slot at a time. This data structure keeps track of which elements are in which set and helps efficiently manage group membership.

In this blog, we’ll dive deep into the Disjoint Set Union (DSU), often called the Union Find algorithm. We will build it from scratch to provide a clear and detailed understanding of how it works and why it is so useful for various computational problems.

Disjoint Set Data Structure

The Union-Find algorithm is a powerful data structure that maintains a collection of disjoint sets. It supports two fundamental operations: Union, which merges two sets into one, and Find, which identifies the set to which a specific element belongs. This structure plays a crucial role in graph algorithms, particularly for efficiently managing connectivity and partitioning challenges.

To understand the concept of disjoint sets, let’s break down the term: disjoint and sets.

What does disjoint mean? Simply put, it refers to something that is not connected or joined. Now, what do we think of when we hear the word sets? Most of us probably imagine something like this: S = {1, 2, 3, 4, 5}.

To make this clearer, consider a game of polo. In a polo match, each team consists of four players. Each player is assigned a jersey number according to their position on the field. Let’s visualize this as a graph, where each player is represented as a vertex. For simplicity, imagine that the players on the blue team are numbered from 5 to 8, while the players on the green team are numbered from 1 to 4.

These teams can also be represented by sets. A player from team green will never be on team blue, meaning there is no overlap between the two sets of players. In mathematical terms, P ∩ O = Φ. This indicates that the two sets are disjoint, meaning they share no common elements.

In conclusion, disjoint sets are simply sets that have no elements in common, and this forms the foundation of the Union-Find data structure.

Example of Disjoint Set

#include <bits/stdc++.h>
using namespace std;
class DisjointSet
{
private:
    vector<int> parent;

public:
    DisjointSet(int n) : parent(n)
    {
        for (int i = 0; i < n; ++i)
            parent[i] = i; // Initialize each element as its own set
    }

    int find(int x)
    {
        if (parent[x] == x)
            return x;
        return parent[x] = find(parent[x]); // Path compression
    }

    void unionSets(int x, int y)
    {
        int rootX = find(x);
        int rootY = find(y);
        if (rootX != rootY)
            parent[rootY] = rootX; // Union by rank
    }
};

int main()
{
    DisjointSet ds(5);
    ds.unionSets(0, 1);
    ds.unionSets(2, 3);
    ds.unionSets(0, 3);
    cout << "Parent array after union operations: ";
    for (int i = 0; i < 5; ++i)
        cout << ds.find(i) << " ";
    cout << endl;
    return 0;
}

Output:

Parent array after union operations: 0 0 0 0 4

Explanation:
Initially, each element is its own set, so the parent array starts as [0, 1, 2, 3, 4].

After performing the union operation ds.unionSets(0, 1), the sets containing elements 0 and 1 are combined, updating the parent array to [0, 0, 2, 3, 4].
After the union operation ds.unionSets(2, 3), the sets containing elements 2 and 3 merge, resulting in [0, 0, 2, 2, 4].
When the union operation ds.unionSets(0, 3) is performed, the sets with elements 0 and 3 merge, and 0 becomes the representative element, making the parent array [0, 0, 2, 0, 4].
The final parent array shows the sets after all union operations, indicating the combined sets and their representatives.

DSU Operations

Two key operations that can be performed on Disjoint Sets are "find" and "union." Let's break down each one individually for a clearer understanding.

Find

The find operation in a disjoint set data structure is essential for identifying the representative element or root of the set that contains a specific element. This operation helps determine the set membership of an element. To improve efficiency, the find operation implements a technique called path compression. Path compression works by adjusting the structure during find operations, making each element directly point to the root. This optimization ensures that future find operations are faster, as it reduces the height of the tree-like structure, leading to near-constant time complexity for each query.

int find(int x) {
    if (parent[x] == x)
        return x;
    return parent[x] = find(parent[x]); // Path compression
}

Union

In a disjoint set data structure, the union operation combines two distinct sets by designating one of the representative elements as the parent of the other. This process is optimized using a technique known as union by rank, which helps maintain the overall balance between the merged sets. Union by rank assigns a "rank" to each set, usually based on the height of the tree representing the set. When merging two sets, the representative with the lower rank becomes the child of the representative with the higher rank. This strategy keeps the structure balanced, ensuring that future find operations remain efficient.

void unionSets(int x, int y) {
    int rootX = find(x);
    int rootY = find(y);
    if (rootX != rootY)
        parent[rootY] = rootX; // Union by rank
}

These operations are essential for efficiently managing disjoint sets, facilitating tasks like connectivity queries and cycle detection in graphs.

Naive Implementation

A naive implementation of the disjoint set data structure focuses on directly coding the basic operations without any advanced optimizations, such as path compression or union by rank. In this basic approach, the find operation simply follows parent pointers until it locates the root element, without compressing the path to make future lookups more efficient. Likewise, the union operation merges two sets without taking their sizes or tree depths into account, which can result in unbalanced trees. This unoptimized method may lead to slower operations as the trees grow deeper and less structured over time.

Naive Find Implementation:

int find(int x) {
    while (x != parent[x]) {
        x = parent[x];
    }
    return x;
}

Naive Union Implementation:

void unionSets(int x, int y) {
    int rootX = find(x);
    int rootY = find(y);
    if (rootX != rootY) {
        parent[rootY] = rootX;
    }
}

The basic method described here is simple to implement but can lead to inefficiencies, particularly in scenarios involving large disjoint sets or complex, highly connected data structures. As the size of the data grows, the performance of find operations can degrade significantly. In contrast, more advanced approaches like path compression and union by rank offer improved efficiency. These optimized techniques reduce the time complexity of find and union operations, making them more suitable for large-scale applications where performance is critical. Therefore, the basic method is typically less efficient than these enhanced implementations.

Optimizations

Disjoint set data structures, also known as Union-Find, are optimized to improve the efficiency of common operations like finding the representative element of a set and merging two sets. These optimizations are critical for reducing the time complexity, making the data structure much more efficient in practice. Techniques such as path compression and union by rank are widely used to achieve these improvements. Path compression ensures that elements within a set point directly to the root, minimizing the depth of trees, while union by rank helps keep the structure balanced during the merging process. These optimizations are especially beneficial when working with large datasets or performing operations frequently, as they help maintain near-constant time complexity for both the "find" and "union" operations. Such efficiency is crucial in practical applications, including network connectivity, image processing, and dynamic connectivity problems, where operations need to be performed at scale.

There are two ways to improve it

Path Compression
Union by Rank

Path Compression

Path compression is an optimization technique used to streamline the tree structure in the find_set(x) operation, often employed in Disjoint Set Union (DSU) or Union-Find algorithms. Its primary purpose is to accelerate the process of determining the representative of a set.

When find_set(x) is called for a given vertex x, the algorithm traces the path from x to its representative, denoted as p. During this traversal, it visits multiple vertices, forming a path from x to p. The key idea behind path compression is to shorten these paths for all the vertices involved, making future operations more efficient.

After finding the representative p, the method updates the parent of every visited vertex along the path, directly pointing them to p. This significantly flattens the tree, reducing the time complexity of subsequent find_set operations. The result is a much faster process not only for the current elements but also for those connected to them, as the tree structure becomes more compact and easier to navigate in future queries.

The left side shows a tree whereas the right side shows a compressed tree for visited nodes 2,3,5,7.

Implementation

The new implementation of find_set(x) is as follows:

int find_set(int v) {
    if (v == parent[v])
        return v;
    return parent[v] = find_set(parent[v]);
}

Complexity Analysis

Time Complexity: O(log N), where n is the number of nodes.

Space Complexity: O(n), where n is the number of nodes.

Union by Rank

In the context of disjoint-set data structures, when we call find_set(x) or find_set(y), the function identifies the roots of the trees containing x and y. If the roots are different, meaning the trees are separate, they are merged by connecting the root of one tree to the root of the other. However, if this merging is done without optimization, such as always making x a child of y, the resulting tree height can grow up to O(n), leading to inefficient operations over time.

To improve this, we can implement a technique known as union by rank. This method optimizes tree merging by minimizing the height of the trees. There are two popular strategies for applying union by rank: based on size and based on depth. The core idea behind both strategies is to attach the tree with the smaller rank to the tree with the larger rank, ensuring that the trees stay balanced, which in turn keeps the operations efficient.

Union by Rank Based on Size: In this approach, the rank of a tree is determined by the number of elements (or size) in it. When two trees are merged, the tree with the smaller size is attached to the root of the tree with the larger size. This ensures that the smaller tree doesn’t increase the overall height of the merged tree significantly.
Union by Rank Based on Depth: In this version, the rank of a tree is defined by its depth. The tree with the lesser depth is attached to the root of the tree with greater depth, ensuring that the overall height of the merged tree is minimized.

Both strategies share the same optimization principle: always attach the tree with the lower rank (whether by size or depth) to the tree with the higher rank. This keeps the height of the resulting trees as low as possible, preventing inefficient operations over time and ensuring the data structure remains balanced.

Union by rank based on size

In our previous method, we tracked the parent of each node in the array, with the root element being represented by its index iii. For all other elements, we recorded the parent index of iii. However, in this new approach, we modify our storage strategy by maintaining a negative value that reflects the size of the subtree instead. For instance, if the size of a subtree is 3, we store −3-3−3 in the parent array for the root of that subtree.

In cases where a single element constitutes a set, we would store −1-1−1 in the parent array. This adjustment allows us to retain the necessary structural information about the tree while simplifying the process of determining the size of each subtree. Other than this modification, no further changes have been made to our initial approach. This new method enhances efficiency and clarity when managing tree data structures.

Implementation

void make_set(int v) {
    parent[v] = v;
    size[v] = -1;
}

void union_sets(int a, int b) {
    a = find_set(a);
    b = find_set(b);
    if (a != b) {
        if (abs(size[a]) < abs(size[b]))
            swap(a, b);
        parent[b] = a;
        size[a] += size[b];
    }
}

Union by rank based on depths

In contrast to the UNION operation based on tree height, this method utilizes the concept of depth for managing elements in the data structure. This approach is beneficial because path compression can alter the heights of trees as they are modified over time.

Each element within this system is assigned a rank, which starts at zero. At the outset, each set consists of a single element, and its rank reflects this initial state. By focusing on depth rather than height, the structure maintains efficiency even as the elements undergo various union operations. This strategy helps keep the trees balanced and optimizes the performance of future operations.

Implementation

void make_set(int v) {
    parent[v] = v;
    rank[v] = 0;
}

void union_sets(int a, int b) {
    a = find_set(a);
    b = find_set(b);
    if (a != b) {
        if (rank[a] < rank[b])
            swap(a, b);
        parent[b] = a;
        if (rank[a] == rank[b])
            rank[a]++;
    }
}

Note: For the find_set(x) operation there is no change in the implementation.

Complexity Analysis

Both optimizations are equivalent in terms of time and space complexity.

Time Complexity: O(log N), where n is the number of nodes.

Space Complexity: O(n), where n is the number of nodes.

Implementation of Disjoint Set

#include <iostream>
using namespace std;
class DisjointSet
{
    int *rank, *parent, n;

public:
    // Constructor to create and
    // initialize sets of n items
    DisjointSet(int n)
    {
        rank = new int[n];
        parent = new int[n];
        this->n = n;
        make_set();
    }

    // Creates n single item sets
    void make_set()
    {
        for (int i = 0; i < n; i++)
        {
            parent[i] = i;
        }
    }

    // Finds set of given item x
    int find_set(int x)
    {
        // Finds the representative of the set
        // that x is an element of
        if (parent[x] != x)
        {

            // if x is not the parent of itself
            // Then x is not the representative of
            // his set,
            parent[x] = find_set(parent[x]);

            // so we recursively call Find on its parent
            // and move i's node directly under the
            // representative of this set
        }

        return parent[x];
    }

    // Do union of two sets represented
    // by x and y.
    void Union(int x, int y)
    {
        // Find current sets of x and y
        int xset = find_set(x);
        int yset = find_set(y);

        // If they are already in same set
        if (xset == yset)
            return;

        // Put smaller ranked item under
        // bigger ranked item if ranks are
        // different
        if (rank[xset] < rank[yset])
        {
            parent[xset] = yset;
        }
        else if (rank[xset] > rank[yset])
        {
            parent[yset] = xset;
        }

        // If ranks are same, then increment
        // rank.
        else
        {
            parent[yset] = xset;
            rank[xset] = rank[xset] + 1;
        }
    }
};

int main()
{
    DisjointSet obj(5);
    obj.Union(0, 2);
    obj.Union(4, 2);
    obj.Union(3, 1);
    if (obj.find_set(4) == obj.find_set(0))
    {
        cout << "Yes\n";
    }
    else
    {
        cout << "No\n";
    }
    if (obj.find_set(1) == obj.find_set(0))
    {
        cout << "Yes\n";
    }
    else
    {
        cout << "No\n";
    }
    return 0;
}

Output:

Yes

Applications Of Disjoint Sets

Disjoint set data structures, also known as union-find structures, play a crucial role in various domains that require efficient management of disjoint sets and connectivity queries. Here are some prominent applications of these structures:

Graph Algorithms:
One of the primary applications of disjoint sets is in graph algorithms. They are integral to Kruskal's algorithm and Borůvka's algorithm, both of which are designed for finding the minimum spanning tree of a graph. By enabling efficient connectivity checks and cycle detection, disjoint sets help streamline these processes, ensuring that edges are added only if they do not form a cycle, thereby optimizing the spanning tree creation.
Dynamic Connectivity:
Disjoint sets are essential in solving dynamic connectivity problems where elements are constantly added or removed, necessitating rapid responses to connectivity queries. This is particularly relevant in network connectivity scenarios, such as maintaining the status of a network as nodes are added or removed, and in social network analysis, where relationships and connections between users are continuously evolving.
Image Processing:
In the realm of image processing, disjoint sets are employed for tasks such as image segmentation. Here, pixels with similar characteristics are grouped into segments or regions, allowing for efficient management of these segments. Disjoint set data structures facilitate operations like merging or splitting segments based on specific criteria, enhancing the accuracy and efficiency of image analysis.
Union-Find Applications:
The union-find data structure is widely used across various applications, particularly in data compression algorithms like Huffman coding. These algorithms rely on efficient union and find operations to merge and query disjoint sets of elements, making them essential for optimizing data representation and reducing storage requirements.
Dynamic Graph Operations:
In scenarios involving dynamic graphs, where vertices and edges are added or removed, disjoint set structures help maintain the connectivity status of the graph efficiently. This is critical in applications like network routing algorithms, which require up-to-date information on network connectivity, and in database management systems that handle relationships between data entries.
Optimization Problems:
Disjoint sets also find utility in optimization problems, such as clustering and partitioning. In these cases, elements must be grouped into disjoint sets based on specific criteria to optimize various objectives, including resource allocation and load balancing. The efficient operations provided by disjoint sets facilitate the rapid grouping of elements, leading to more effective optimization solutions.

In conclusion, disjoint set data structures are invaluable across multiple fields, offering efficient solutions for managing and querying disjoint sets, thereby enhancing performance in diverse applications from graph algorithms to image processing and optimization tasks.

I hope you found my blog insightful! I would greatly appreciate your thoughts and feedback, so please don’t hesitate to share your comments. Also, remember to subscribe to the Hashnode newsletter to receive notifications whenever I release new content.

Wishing you a fantastic day ahead! 🙌

Disjoint Set Union Algorithm (Union Find Algorithm)

Table of contents

Introduction

Disjoint Set Data Structure

Example of Disjoint Set

DSU Operations

Find

Union

Naive Implementation

Naive Find Implementation:

Naive Union Implementation:

Optimizations

Path Compression

Implementation

Union by Rank

Union by rank based on size

Implementation

Union by rank based on depths

Implementation

Implementation of Disjoint Set

Applications Of Disjoint Sets

Subscribe to my newsletter

Aritra Pal

Aritra Pal