Understanding the CAP Theorem


What is the CAP Theorem?
In the world of distributed systems, the CAP theorem is a fundamental concept that guides the design and architecture of these systems. Proposed by Eric Brewer in 2000, the CAP theorem states that it is impossible for a distributed system to simultaneously guarantee all three of the following properties:
Consistency
Availability
Partition Tolerance
Consistency
Consistency ensures that every read request reflects the most recent write.
In other words, all nodes have the same view of the data/state at any given time. When a client queries the system, it always retrieves the latest data.
Availability
Availability ensures that every request (read or write) receives a non-error response, even if some nodes fail or the network partitions. In other words, the system remains operational and responds to every query, though the response is not guaranteed to reflect the most recent write.
Partition Tolerance
Partition tolerance guarantees that the system continues to operate despite any number of communication breakdowns (network partitions) between nodes. In a distributed environment, network partitions are inevitable due to hardware failures, network congestion, or other issues.
Deep Dive into CAP Theorem
A distributed system always needs to be partition tolerant; we shouldn't build a system in which a network partition brings down the whole thing. So a distributed system is always built to be partition tolerant.
In simple words, then, the CAP theorem means that when a network partition occurs and you want your system to keep functioning, you can provide either availability or consistency, not both.
How Does a Distributed System Break Consistency or Availability?
Scenario 1: A multi-node system where multiple nodes handle reads and writes, and a node fails to propagate an update to the other nodes.
Consider a cluster with two nodes, N1 and N2, both capable of handling read and write requests.
In the diagram above, N1 receives an update request for id=2, modifying the salary from 800 to 1000. However, due to a network partition, N1 cannot propagate this update to N2.
When a read request is directed to N2, the node has two possible responses:
Respond with its current data (salary = 800) and update it later, once the network partition is resolved. This approach keeps the system available but not consistent.
Return an error, indicating it may not have the latest data. This ensures consistency by avoiding stale reads but compromises availability.
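To make the two options concrete, here is a minimal Python sketch of how a partitioned node like N2 could behave in each case. The Node class, the "AP"/"CP" mode flag, and PartitionError are illustrative names for this sketch, not parts of any real system:

```python
class PartitionError(Exception):
    """Raised when a node refuses to answer rather than serve stale data."""

class Node:
    def __init__(self, name, mode):
        self.name = name
        self.mode = mode          # "AP": stay available, "CP": stay consistent
        self.data = {2: 800}      # id -> salary; N2 never received the update to 1000
        self.partitioned = False  # True once this node is cut off from its peers

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            # Option 2: refuse to answer rather than risk a stale read.
            raise PartitionError(f"{self.name}: cannot confirm latest value")
        # Option 1: answer with local (possibly stale) data.
        return self.data[key]

n2 = Node("N2", mode="AP")
n2.partitioned = True
print(n2.read(2))   # 800 -- stale, but the system stays available

n2.mode = "CP"
# n2.read(2) now raises PartitionError: consistent, but unavailable
```

Note that the choice only exists while the partition lasts; once N1 can reach N2 again, the cluster can be both consistent and available.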
Scenario 2: Single-leader system for read and write operations
In a single-leader system, all read and write operations come to the leader, while other nodes remain synchronized with the leader and act as standby nodes in case the leader fails.
The challenge arises if the leader becomes disconnected from the cluster or clients cannot connect to it due to a network partition. In such cases, the system cannot process write requests until a new leader is elected, making the system consistent but not available during the transition.
But if the system allows reads from the read replicas, it can still respond even when the leader is down, which makes it highly available for reads but not consistent.
A single-leader system that serves both reads and writes from the leader should therefore not be classified as highly available.
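The same trade-off can be sketched for the single-leader case. The Cluster class below is hypothetical, standing in for a leader plus one lagging read replica:

```python
class Cluster:
    """Hypothetical single-leader cluster with one lagging read replica."""

    def __init__(self):
        self.leader_up = True
        self.leader_data = {2: 1000}   # latest committed value on the leader
        self.replica_data = {2: 800}   # the replica has not caught up yet

    def write(self, key, value):
        if not self.leader_up:
            # No reachable leader: writes fail until a new leader is elected,
            # so the system is consistent but not available for writes.
            raise RuntimeError("no leader; write rejected")
        self.leader_data[key] = value

    def read(self, key, allow_replica=False):
        if self.leader_up:
            return self.leader_data[key]
        if allow_replica:
            # Serving a lagging replica keeps reads available but not consistent.
            return self.replica_data[key]
        raise RuntimeError("no leader; read rejected")

cluster = Cluster()
cluster.leader_up = False
print(cluster.read(2, allow_replica=True))  # 800: available but stale
# cluster.write(2, 1200) would raise RuntimeError: unavailable for writes
```

Setting allow_replica=True is exactly the "highly available for reads, but not consistent" choice described above.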
RDBMS (MySQL, Oracle, MS SQL Server, etc.)
It's a no-brainer that all RDBMSs are consistent, as all reads and writes go to a single node/server.
How about availability? You might say it is one single server and hence a single point of failure, so how can it be categorized under availability?
As I said earlier, CAP availability is not the same as the day-to-day availability/downtime we usually talk about. In a single-node system there can be no network partition, so as long as the node is up it will always return success for any read or write operation, and is hence available.
Thus, an RDBMS can be highly available and consistent.
Trade-offs in CAP Theorem
The CAP theorem highlights three trade-off scenarios in distributed systems:
Consistency and Availability (CA):
Ensures identical data across all nodes and responsiveness to every request. Such a system effectively assumes partitions do not occur, so it fails when one does.
Consistency and Partition Tolerance (CP):
Prioritizes data consistency across nodes despite network partitions. The system may become temporarily unavailable to preserve data integrity.
Availability and Partition Tolerance (AP):
Focuses on staying operational during network disruptions. Sacrifices strict consistency, accepting temporary data inconsistencies to ensure accessibility.
Practical Implications
In real-world applications, the choice between consistency, availability, and partition tolerance depends on the specific use case:
Financial Systems: Strong consistency is critical to ensure accurate transactions.
Social Media Platforms: Prioritize availability, allowing users to interact with slightly stale data.
Global Systems: Partition tolerance is essential to maintain operations across distributed regions.
Understanding the CAP theorem and its trade-offs helps engineers design systems that align with the unique requirements of their applications, ensuring reliability and performance in distributed environments.
Probing the CAP Theorem
Can you only have 2 out of 3 CAP properties?
No, CAP means you must choose between Consistency and Availability during a partition, not abandon one entirely.
Does partition tolerance eliminate partition challenges?
No, it ensures operation during partitions but doesn't resolve consistency or availability issues.
Example of a non-partition-tolerant system:
A centralized database, or a multi-node system with synchronous replication, halts during partitions due to its dependency on full communication.
How to make systems partition-tolerant?
Use eventual consistency to allow independent node decisions and reconcile later.
Adopt asynchronous replication to accept writes without waiting for acknowledgment.
Employ quorum-based systems for majority agreement (a quick sanity check of the quorum rule is sketched below).
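As a rough illustration: with N replicas, any write quorum of size W and any read quorum of size R must overlap in at least one replica whenever R + W > N, so every read is guaranteed to see at least one copy of the latest write. The helper below is hypothetical, just to make the arithmetic concrete:

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when any read quorum must intersect any write quorum (R + W > N)."""
    return r + w > n

print(quorums_overlap(n=3, w=2, r=2))  # True: every read overlaps the last write
print(quorums_overlap(n=3, w=1, r=1))  # False: a read may miss the latest write
```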
Is partition tolerance optional?
No, distributed systems must handle partitions; the trade-off is between consistency and availability.
What are CA systems?
CA systems prioritize consistency and availability but fail during partitions, making them non-partition-tolerant.
Does 99.999% uptime mean high availability?
Not in CAP terms. Availability requires every request to a non-failing node to receive a valid response, even during partitions.
Do timeout errors count as availability?
No, errors or timeouts compromise availability in CAP's definition.
Does eventual consistency meet CAP's consistency?
No, CAP's consistency refers to strong consistency, which eventual consistency does not satisfy.
Does relaxing consistency always lead to eventual consistency?
Not always; it might result in unresolved inconsistencies without conflict resolution mechanisms.
Can strong consistency be achieved with a majority quorum?
Yes, but it sacrifices availability, adhering to CAP's trade-offs.
Does CAP apply to microservices?
Yes, CAP principles are relevant to microservices as well as distributed databases.
What if partition tolerance is ignored?
Ignoring partition tolerance works in systems with reliable networks but risks failure during real-world partitions.
When can partition tolerance be ignored?
In tightly controlled environments (e.g., single-node systems or highly reliable networks), partitions are negligible. Examples: MySQL on a single server or Google Spanner with controlled infrastructure.