Distributed Query Processing – Making Queries Work Across Systems


This is Part 4 of the "Distributed DBs: A Clear Guide" series. In this article, we take a comprehensive look at how distributed databases process queries. We’ll walk through every stage of the process, explain real-world examples, and include a detailed cost analysis to make this topic accessible even if you’re new to databases.
1. What is Distributed Query Processing?
Distributed query processing refers to how a system answers a user’s query when the data is distributed across multiple physical locations or nodes. Despite this physical separation, the system must appear unified to the user.
In essence, it’s about executing one logical query across many machines and combining the results efficiently and accurately.
2. Why is Distributed Query Processing Complex?
Compared to a centralized system, a distributed setup introduces new complications:
Data is not local. It lives across multiple servers or locations.
Communication between nodes is expensive. Transferring large amounts of data can quickly become the bottleneck.
Failure is more likely. Each additional site increases the chances of failure.
Optimization is harder. The system must choose where to run different parts of the query and minimize total cost.
3. Stages of Distributed Query Processing
3.1 Query Parsing and Analysis
This is the first step, identical to any SQL-based system. The database parses your query into a tree-like structure representing logical operations.
Example query:
SELECT name, price FROM Products WHERE category = 'Books';
This is turned into a query tree:
SELECT name, price
|
WHERE category = 'Books'
|
FROM Products
3.2 Query Localization
Next, the system identifies which fragments or sites contain the needed data.
If Products
is horizontally fragmented:
Site 1 → Products_EU
Site 2 → Products_US
Site 3 → Products_ASIA
Then the query is rewritten like this:
SELECT name, price FROM Products_EU WHERE category = 'Books'
UNION
SELECT name, price FROM Products_US WHERE category = 'Books'
UNION
SELECT name, price FROM Products_ASIA WHERE category = 'Books';
3.3 Global Query Optimization
The optimizer evaluates execution plans based on estimated cost:
Data size
Join selectivity
Network transfer cost
Result location
It chooses the most cost-effective plan, possibly using semi-joins or filters.
3.4 Local Query Optimization
Once each node receives its localized query, it optimizes the execution using local stats, indexes, and strategies.
4. Join Strategies Across Sites
Joining data across distributed locations is one of the most expensive operations. Let’s examine the strategies in detail using a concrete example:
Example:
We want to execute the following query:
πDname, Fname, Lname (Departments ⋈Mgr_ssn=ssn Employees)
Employees (R): 10,000 records, 100 bytes each → 1,000,000 bytes
Departments (S): 100 records, 80 bytes each → 8,000 bytes
Assume: Fname, Lname = 20 bytes each; Dname = 10 bytes; ssn & Mgr_ssn = 6 bytes
Result expected at Site 3.
Strategy 1: Send both relations to Site 3 and join
Transfer R = 1,000,000 bytes
Transfer S = 8,000 bytes
Total = 1,008,000 bytes
Strategy 2: Send R to Site 2, join there, send result to Site 3
Transfer R = 1,000,000 bytes
Assume 100 matches (join result size) × (20 + 20 + 10) = 5,000 bytes
Total = 1,005,000 bytes
Strategy 3: Send S to Site 1, join there, send result to Site 3
Transfer S = 8,000 bytes
Assume result = 5,000 bytes
Total = 13,000 bytes
Strategy 4: Semi-join (optimized)
Step 1: Send πMgr_ssn(Departments) = 100 × 6 = 600 bytes
Step 2: Semi-join with Employees (filter)
Step 3: Send πssn, Fname, Lname(Employees′) = 100 × 46 = 4,600 bytes
Step 4: Final join at Site 2: 5,000 bytes
Total ≈ 10,200 bytes
✅ Semi-join strategy is optimal in this case.
5. Factors Influencing Query Cost
Main Cost Components:
Communication Cost: Dominates in distributed settings
Disk I/O: Reading local tables or indexes
CPU Cost: Sorting, joining, filtering data
Cost Equation:
Total_Cost = Communication_Cost + Disk_IO + CPU_Cost
A good optimizer aims to reduce network usage above all.
6. Transparency in Query Execution
A well-designed DDBMS hides the complexity of distribution:
Location Transparency: User doesn’t know where data is stored
Fragmentation Transparency: System hides data partitioning
Replication Transparency: User sees one copy even if multiple exist
These properties make it easier to work with distributed systems.
7. What Happens If Something Fails?
Distributed systems are more vulnerable to:
Node failures
Message loss
Network partitions
Recovery Measures:
Retry and timeout mechanisms
Use of replicas to continue query processing
Logging for partial results and checkpoints
Systems like Google Spanner and Amazon Aurora build these features in.
8. Full Walkthrough – Customer and Orders Example
Let’s say:
Customers
is at Site AOrders
is at Site BResult should go to Site C
Query:
SELECT C.name, O.total FROM Customers C JOIN Orders O ON C.id = O.customer_id WHERE O.total > 500;
Semi-Join Plan:
Site A sends πid(Customers) to Site B
Site B filters Orders with matching IDs and amount > 500
Send filtered Orders to Site C
Site C joins with Customers
This approach transfers less data and avoids overloading any one site.
9. Real-World Analogy
Imagine you place a multi-vendor online order:
Your phone ships from Berlin
Your charger ships from Paris
Your case comes from Milan
Each item is fetched from a different warehouse and combined in the final delivery. You placed one order; the system handled all the coordination.
That’s distributed query processing.
10. Summary
Distributed Query Processing is the engine behind scalable, efficient, and reliable distributed databases. When done right, it ensures that:
Queries run quickly and correctly
Communication cost is minimized
Systems stay resilient despite failure
With a deeper understanding of strategies and cost estimation, you’re now better prepared to design or optimize real-world distributed database systems.
This concludes the "Distributed DBs: A Clear Guide" series.
If you missed earlier posts:
Introduction to Distributed Databases
Fragmentation, Allocation, and Replication
Distributed Transaction Management
Thank you for reading. If you have questions or feedback, feel free to share.
Subscribe to my newsletter
Read articles from Muhammad Sajid Bashir directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by

Muhammad Sajid Bashir
Muhammad Sajid Bashir
I'm a versatile tech professional working at the intersection of Machine Learning, Data Engineering, and Full Stack Development. With hands-on experience in distributed systems, pipelines, and scalable applications, I translate complex data into real-world impact.