Distributed Query Processing – Making Queries Work Across Systems

This is Part 4 of the "Distributed DBs: A Clear Guide" series. In this article, we take a comprehensive look at how distributed databases process queries. We’ll walk through every stage of the process, explain real-world examples, and include a detailed cost analysis to make this topic accessible even if you’re new to databases.


1. What is Distributed Query Processing?

Distributed query processing refers to how a system answers a user’s query when the data is distributed across multiple physical locations or nodes. Despite this physical separation, the system must appear unified to the user.

In essence, it’s about executing one logical query across many machines and combining the results efficiently and accurately.


2. Why is Distributed Query Processing Complex?

Compared to a centralized system, a distributed setup introduces new complications:

  • Data is not local. It lives across multiple servers or locations.

  • Communication between nodes is expensive. Transferring large amounts of data can quickly become the bottleneck.

  • Failure is more likely. Each additional site increases the chances of failure.

  • Optimization is harder. The system must choose where to run different parts of the query and minimize total cost.


3. Stages of Distributed Query Processing

3.1 Query Parsing and Analysis

This is the first step, identical to any SQL-based system. The database parses your query into a tree-like structure representing logical operations.

Example query:

SELECT name, price FROM Products WHERE category = 'Books';

This is turned into a query tree:

    SELECT name, price
          |
       WHERE category = 'Books'
          |
       FROM Products

3.2 Query Localization

Next, the system identifies which fragments or sites contain the needed data.

If Products is horizontally fragmented:

  • Site 1 → Products_EU

  • Site 2 → Products_US

  • Site 3 → Products_ASIA

Then the query is rewritten like this:

SELECT name, price FROM Products_EU WHERE category = 'Books'
UNION
SELECT name, price FROM Products_US WHERE category = 'Books'
UNION
SELECT name, price FROM Products_ASIA WHERE category = 'Books';

3.3 Global Query Optimization

The optimizer evaluates execution plans based on estimated cost:

  • Data size

  • Join selectivity

  • Network transfer cost

  • Result location

It chooses the most cost-effective plan, possibly using semi-joins or filters.

3.4 Local Query Optimization

Once each node receives its localized query, it optimizes the execution using local stats, indexes, and strategies.


4. Join Strategies Across Sites

Joining data across distributed locations is one of the most expensive operations. Let’s examine the strategies in detail using a concrete example:

Example:

We want to execute the following query:

πDname, Fname, Lname (Departments ⋈Mgr_ssn=ssn Employees)
  • Employees (R): 10,000 records, 100 bytes each → 1,000,000 bytes

  • Departments (S): 100 records, 80 bytes each → 8,000 bytes

  • Assume: Fname, Lname = 20 bytes each; Dname = 10 bytes; ssn & Mgr_ssn = 6 bytes

Result expected at Site 3.

Strategy 1: Send both relations to Site 3 and join

  • Transfer R = 1,000,000 bytes

  • Transfer S = 8,000 bytes

  • Total = 1,008,000 bytes

Strategy 2: Send R to Site 2, join there, send result to Site 3

  • Transfer R = 1,000,000 bytes

  • Assume 100 matches (join result size) × (20 + 20 + 10) = 5,000 bytes

  • Total = 1,005,000 bytes

Strategy 3: Send S to Site 1, join there, send result to Site 3

  • Transfer S = 8,000 bytes

  • Assume result = 5,000 bytes

  • Total = 13,000 bytes

Strategy 4: Semi-join (optimized)

  • Step 1: Send πMgr_ssn(Departments) = 100 × 6 = 600 bytes

  • Step 2: Semi-join with Employees (filter)

  • Step 3: Send πssn, Fname, Lname(Employees′) = 100 × 46 = 4,600 bytes

  • Step 4: Final join at Site 2: 5,000 bytes

  • Total ≈ 10,200 bytes

Semi-join strategy is optimal in this case.


5. Factors Influencing Query Cost

Main Cost Components:

  • Communication Cost: Dominates in distributed settings

  • Disk I/O: Reading local tables or indexes

  • CPU Cost: Sorting, joining, filtering data

Cost Equation:

Total_Cost = Communication_Cost + Disk_IO + CPU_Cost

A good optimizer aims to reduce network usage above all.


6. Transparency in Query Execution

A well-designed DDBMS hides the complexity of distribution:

  • Location Transparency: User doesn’t know where data is stored

  • Fragmentation Transparency: System hides data partitioning

  • Replication Transparency: User sees one copy even if multiple exist

These properties make it easier to work with distributed systems.


7. What Happens If Something Fails?

Distributed systems are more vulnerable to:

  • Node failures

  • Message loss

  • Network partitions

Recovery Measures:

  • Retry and timeout mechanisms

  • Use of replicas to continue query processing

  • Logging for partial results and checkpoints

Systems like Google Spanner and Amazon Aurora build these features in.


8. Full Walkthrough – Customer and Orders Example

Let’s say:

  • Customers is at Site A

  • Orders is at Site B

  • Result should go to Site C

Query:

SELECT C.name, O.total FROM Customers C JOIN Orders O ON C.id = O.customer_id WHERE O.total > 500;

Semi-Join Plan:

  1. Site A sends πid(Customers) to Site B

  2. Site B filters Orders with matching IDs and amount > 500

  3. Send filtered Orders to Site C

  4. Site C joins with Customers

This approach transfers less data and avoids overloading any one site.


9. Real-World Analogy

Imagine you place a multi-vendor online order:

  • Your phone ships from Berlin

  • Your charger ships from Paris

  • Your case comes from Milan

Each item is fetched from a different warehouse and combined in the final delivery. You placed one order; the system handled all the coordination.

That’s distributed query processing.


10. Summary

Distributed Query Processing is the engine behind scalable, efficient, and reliable distributed databases. When done right, it ensures that:

  • Queries run quickly and correctly

  • Communication cost is minimized

  • Systems stay resilient despite failure

With a deeper understanding of strategies and cost estimation, you’re now better prepared to design or optimize real-world distributed database systems.


This concludes the "Distributed DBs: A Clear Guide" series.

If you missed earlier posts:

  1. Introduction to Distributed Databases

  2. Fragmentation, Allocation, and Replication

  3. Distributed Transaction Management

Thank you for reading. If you have questions or feedback, feel free to share.

0
Subscribe to my newsletter

Read articles from Muhammad Sajid Bashir directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Muhammad Sajid Bashir
Muhammad Sajid Bashir

I'm a versatile tech professional working at the intersection of Machine Learning, Data Engineering, and Full Stack Development. With hands-on experience in distributed systems, pipelines, and scalable applications, I translate complex data into real-world impact.