Distributed Query Processing Explained

This is Part 4 of the "Distributed DBs: A Clear Guide" series. In this article, we take a comprehensive look at how distributed databases process queries. We’ll walk through every stage of the process, explain real-world examples, and include a detailed cost analysis to make this topic accessible even if you’re new to databases.

1. What is Distributed Query Processing?

Distributed query processing refers to how a system answers a user’s query when the data is distributed across multiple physical locations or nodes. Despite this physical separation, the system must appear unified to the user.

In essence, it’s about executing one logical query across many machines and combining the results efficiently and accurately.

2. Why is Distributed Query Processing Complex?

Compared to a centralized system, a distributed setup introduces new complications:

Data is not local. It lives across multiple servers or locations.
Communication between nodes is expensive. Transferring large amounts of data can quickly become the bottleneck.
Failure is more likely. Each additional site increases the chances of failure.
Optimization is harder. The system must choose where to run different parts of the query and minimize total cost.

3. Stages of Distributed Query Processing

3.1 Query Parsing and Analysis

This is the first step, identical to any SQL-based system. The database parses your query into a tree-like structure representing logical operations.

Example query:

SELECT name, price FROM Products WHERE category = 'Books';

This is turned into a query tree:

    SELECT name, price
          |
       WHERE category = 'Books'
          |
       FROM Products

3.2 Query Localization

Next, the system identifies which fragments or sites contain the needed data.

If Products is horizontally fragmented:

Site 1 → Products_EU
Site 2 → Products_US
Site 3 → Products_ASIA

Then the query is rewritten like this:

SELECT name, price FROM Products_EU WHERE category = 'Books'
UNION
SELECT name, price FROM Products_US WHERE category = 'Books'
UNION
SELECT name, price FROM Products_ASIA WHERE category = 'Books';

3.3 Global Query Optimization

The optimizer evaluates execution plans based on estimated cost:

Data size
Join selectivity
Network transfer cost
Result location

It chooses the most cost-effective plan, possibly using semi-joins or filters.

3.4 Local Query Optimization

Once each node receives its localized query, it optimizes the execution using local stats, indexes, and strategies.

4. Join Strategies Across Sites

Joining data across distributed locations is one of the most expensive operations. Let’s examine the strategies in detail using a concrete example:

Example:

We want to execute the following query:

πDname, Fname, Lname (Departments ⋈Mgr_ssn=ssn Employees)

Employees (R): 10,000 records, 100 bytes each → 1,000,000 bytes
Departments (S): 100 records, 80 bytes each → 8,000 bytes
Assume: Fname, Lname = 20 bytes each; Dname = 10 bytes; ssn & Mgr_ssn = 6 bytes

Result expected at Site 3.

Strategy 1: Send both relations to Site 3 and join

Transfer R = 1,000,000 bytes
Transfer S = 8,000 bytes
Total = 1,008,000 bytes

Strategy 2: Send R to Site 2, join there, send result to Site 3

Transfer R = 1,000,000 bytes
Assume 100 matches (join result size) × (20 + 20 + 10) = 5,000 bytes
Total = 1,005,000 bytes

Strategy 3: Send S to Site 1, join there, send result to Site 3

Transfer S = 8,000 bytes
Assume result = 5,000 bytes
Total = 13,000 bytes

Strategy 4: Semi-join (optimized)

Step 1: Send πMgr_ssn(Departments) = 100 × 6 = 600 bytes
Step 2: Semi-join with Employees (filter)
Step 3: Send πssn, Fname, Lname(Employees′) = 100 × 46 = 4,600 bytes
Step 4: Final join at Site 2: 5,000 bytes
Total ≈ 10,200 bytes

✅ Semi-join strategy is optimal in this case.

5. Factors Influencing Query Cost

Main Cost Components:

Communication Cost: Dominates in distributed settings
Disk I/O: Reading local tables or indexes
CPU Cost: Sorting, joining, filtering data

Cost Equation:

Total_Cost = Communication_Cost + Disk_IO + CPU_Cost

A good optimizer aims to reduce network usage above all.

6. Transparency in Query Execution

A well-designed DDBMS hides the complexity of distribution:

Location Transparency: User doesn’t know where data is stored
Fragmentation Transparency: System hides data partitioning
Replication Transparency: User sees one copy even if multiple exist

These properties make it easier to work with distributed systems.

7. What Happens If Something Fails?

Distributed systems are more vulnerable to:

Node failures
Message loss
Network partitions

Recovery Measures:

Retry and timeout mechanisms
Use of replicas to continue query processing
Logging for partial results and checkpoints

Systems like Google Spanner and Amazon Aurora build these features in.

8. Full Walkthrough – Customer and Orders Example

Let’s say:

Customers is at Site A
Orders is at Site B
Result should go to Site C

Query:

SELECT C.name, O.total FROM Customers C JOIN Orders O ON C.id = O.customer_id WHERE O.total > 500;

Semi-Join Plan:

Site A sends πid(Customers) to Site B
Site B filters Orders with matching IDs and amount > 500
Send filtered Orders to Site C
Site C joins with Customers

This approach transfers less data and avoids overloading any one site.

9. Real-World Analogy

Imagine you place a multi-vendor online order:

Your phone ships from Berlin
Your charger ships from Paris
Your case comes from Milan

Each item is fetched from a different warehouse and combined in the final delivery. You placed one order; the system handled all the coordination.

That’s distributed query processing.

10. Summary

Distributed Query Processing is the engine behind scalable, efficient, and reliable distributed databases. When done right, it ensures that:

Queries run quickly and correctly
Communication cost is minimized
Systems stay resilient despite failure

With a deeper understanding of strategies and cost estimation, you’re now better prepared to design or optimize real-world distributed database systems.

This concludes the "Distributed DBs: A Clear Guide" series.

If you missed earlier posts:

Introduction to Distributed Databases
Fragmentation, Allocation, and Replication
Distributed Transaction Management

Thank you for reading. If you have questions or feedback, feel free to share.

Distributed Query Processing – Making Queries Work Across Systems