Distributed Transaction Management Explained

This is Part 3 of the "Distributed DBs: A Clear Guide" series. In this article, we break down how distributed systems manage transactions across multiple nodes while ensuring accuracy, atomicity, and consistency — even when things go wrong.

1. What Is a Transaction?

A transaction is a sequence of operations performed as a single logical unit of work. A good transaction follows the ACID properties:

Atomicity – all or nothing
Consistency – maintains data integrity
Isolation – no interference from other transactions
Durability – once committed, data stays saved

2. The Challenge in Distributed Systems

In centralized databases, transactions are straightforward. But in distributed systems, the data involved in a transaction may span multiple nodes.

Real-world example:

Imagine booking a hotel and a flight in one transaction. The hotel service and the flight service are on different servers. Both must succeed — or both must roll back. That’s a distributed transaction.

3. Distributed Transaction Management Goals

Atomicity across sites
Coordination of commit/abort decisions
Failure recovery
Concurrency control across nodes

4. Concurrency Control in Distributed Databases

Concurrency control ensures multiple transactions don’t interfere with one another.

4.1 Two-Phase Locking (2PL)

A transaction:

Acquires all locks it needs (growing phase)
Releases them only after commit (shrinking phase)

This prevents conflicts but can cause deadlocks.

5. Deadlocks in Distributed Systems

A deadlock occurs when two transactions wait for each other to release resources. In distributed systems, this is harder to detect.

Techniques to detect and resolve deadlocks:

Centralized detection
Timeout-based detection
Hierarchical detection

Phantom deadlocks can happen if messages are delayed, showing a false wait cycle.

6. Two-Phase Commit Protocol (2PC)

6.1 Why 2PC?

To ensure that a transaction either commits on all nodes or none at all.

6.2 The Two Phases:

Phase 1: Prepare

Coordinator asks all participants if they’re ready to commit
Participants write logs and reply YES/NO

Phase 2: Commit or Abort

If all said YES, coordinator tells everyone to COMMIT
If any said NO, coordinator tells all to ABORT

       Coordinator
           ↓
   +---------------+
   |   PREPARE?    |
   +---------------+
  /       |       \
Site A  Site B  Site C
 ↓        ↓        ↓
YES      YES     YES
           ↓
       COMMIT!

7. Limitations of 2PC

Blocking: If the coordinator crashes after Phase 1, the system can’t proceed
Synchronous: All nodes must reply — slows things down

7.1 3PC – Three-Phase Commit

Introduces a "pre-commit" step to avoid blocking.
Used less often due to overhead.

8. Failure Handling

8.1 Types of Failures:

Site failure: node goes down
Link failure: network break
Message failure: messages delayed or lost

8.2 Recovery Techniques:

Write-ahead logging (WAL)
Redo and undo logs
Participant state records

9. Global Serializability

Even with transactions across sites, we must ensure the global outcome is serializable — behaves like if transactions ran one after the other.

This is harder in distributed settings but essential for correctness.

10. Visual Summary: 2PC Protocol

Client      Coordinator     Site A     Site B
   |             |             |         |
Begin TX         |             |         |
   |----------→ Begin --------→         |
   |             |---------→ Begin ----→|
   |             |             |         |
   |             |<- PREPARE --|         |
   |             |<----------- |         |
   |             | -- COMMIT ->|         |
   |             |------------>|         |
   |             |             |         |

11. Real-World Analogy: Banking System

Imagine transferring money from your checking to savings account. If the system withdraws from checking but fails before depositing into savings, your money is lost. Distributed transaction management prevents this by committing only if all steps succeed.

12. Wrapping Up

Distributed transaction management is the glue that ensures integrity in a fragmented system. It handles coordination, failure, and concurrency so that developers can focus on building robust applications.

In the next part of this series, we’ll dive into Distributed Query Processing — how to run efficient queries across the distributed setup.

Distributed Transaction Management – Keeping Data Consistent Across the System