Distributed Transaction Management – Keeping Data Consistent Across the System

This is Part 3 of the "Distributed DBs: A Clear Guide" series. In this article, we break down how distributed systems manage transactions across multiple nodes while ensuring accuracy, atomicity, and consistency — even when things go wrong.


1. What Is a Transaction?

A transaction is a sequence of operations performed as a single logical unit of work. A good transaction follows the ACID properties:

  • Atomicity – all or nothing

  • Consistency – maintains data integrity

  • Isolation – no interference from other transactions

  • Durability – once committed, data stays saved


2. The Challenge in Distributed Systems

In centralized databases, transactions are straightforward. But in distributed systems, the data involved in a transaction may span multiple nodes.

Real-world example:

Imagine booking a hotel and a flight in one transaction. The hotel service and the flight service are on different servers. Both must succeed — or both must roll back. That’s a distributed transaction.


3. Distributed Transaction Management Goals

  • Atomicity across sites

  • Coordination of commit/abort decisions

  • Failure recovery

  • Concurrency control across nodes


4. Concurrency Control in Distributed Databases

Concurrency control ensures multiple transactions don’t interfere with one another.

4.1 Two-Phase Locking (2PL)

A transaction:

  1. Acquires all locks it needs (growing phase)

  2. Releases them only after commit (shrinking phase)

This prevents conflicts but can cause deadlocks.


5. Deadlocks in Distributed Systems

A deadlock occurs when two transactions wait for each other to release resources. In distributed systems, this is harder to detect.

Techniques to detect and resolve deadlocks:

  • Centralized detection

  • Timeout-based detection

  • Hierarchical detection

Phantom deadlocks can happen if messages are delayed, showing a false wait cycle.


6. Two-Phase Commit Protocol (2PC)

6.1 Why 2PC?

To ensure that a transaction either commits on all nodes or none at all.

6.2 The Two Phases:

Phase 1: Prepare

  • Coordinator asks all participants if they’re ready to commit

  • Participants write logs and reply YES/NO

Phase 2: Commit or Abort

  • If all said YES, coordinator tells everyone to COMMIT

  • If any said NO, coordinator tells all to ABORT

       Coordinator
           ↓
   +---------------+
   |   PREPARE?    |
   +---------------+
  /       |       \
Site A  Site B  Site C
 ↓        ↓        ↓
YES      YES     YES
           ↓
       COMMIT!

7. Limitations of 2PC

  • Blocking: If the coordinator crashes after Phase 1, the system can’t proceed

  • Synchronous: All nodes must reply — slows things down

7.1 3PC – Three-Phase Commit

Introduces a "pre-commit" step to avoid blocking.
Used less often due to overhead.


8. Failure Handling

8.1 Types of Failures:

  • Site failure: node goes down

  • Link failure: network break

  • Message failure: messages delayed or lost

8.2 Recovery Techniques:

  • Write-ahead logging (WAL)

  • Redo and undo logs

  • Participant state records


9. Global Serializability

Even with transactions across sites, we must ensure the global outcome is serializable — behaves like if transactions ran one after the other.

This is harder in distributed settings but essential for correctness.


10. Visual Summary: 2PC Protocol

Client      Coordinator     Site A     Site B
   |             |             |         |
Begin TX         |             |         |
   |----------→ Begin --------→         |
   |             |---------→ Begin ----→|
   |             |             |         |
   |             |<- PREPARE --|         |
   |             |<----------- |         |
   |             | -- COMMIT ->|         |
   |             |------------>|         |
   |             |             |         |

11. Real-World Analogy: Banking System

Imagine transferring money from your checking to savings account. If the system withdraws from checking but fails before depositing into savings, your money is lost. Distributed transaction management prevents this by committing only if all steps succeed.


12. Wrapping Up

Distributed transaction management is the glue that ensures integrity in a fragmented system. It handles coordination, failure, and concurrency so that developers can focus on building robust applications.

In the next part of this series, we’ll dive into Distributed Query Processing — how to run efficient queries across the distributed setup.

0
Subscribe to my newsletter

Read articles from Muhammad Sajid Bashir directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Muhammad Sajid Bashir
Muhammad Sajid Bashir

I'm a versatile tech professional working at the intersection of Machine Learning, Data Engineering, and Full Stack Development. With hands-on experience in distributed systems, pipelines, and scalable applications, I translate complex data into real-world impact.