Leaderless Replication


Single-leader and multi-leader replication are based on the principle that a client sends a write request to one node (the leader), and the database system copies that write to the other replicas. The leader determines the order in which writes should be processed, and followers apply the leader's writes in the same order.
Some data storage systems take a different approach: they abandon the concept of a leader and allow any replica to accept writes directly from clients. Some of the earliest replicated data systems were leaderless, but the idea was mostly forgotten during the era of relational-database dominance.
Leaderless replication regained popularity after Amazon used it for its in-house Dynamo system.
Riak, Cassandra, and Voldemort are open-source data stores with leaderless replication models inspired by Dynamo, so this kind of database is also known as Dynamo-style.
In some leaderless implementations, the client directly sends its writes to several replicas, while in others, a coordinator node does this on behalf of the client.
Writing to the Database when a node is down
Imagine you have a database with three replicas, and one of them goes down.
In leader-based replication, if you want to continue writing, you may need to perform a failover.
In a leaderless configuration, failover does not exist. The client sends the write request to all three replicas in parallel, and the two available nodes accept it.
It is sufficient for two out of three replicas to acknowledge the write for it to be considered successful. The client simply ignores the fact that one of the replicas missed the write.
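The write path above can be sketched in a few lines. This is a minimal illustration, not a real client library: the replica names, the in-memory stores, and the `DOWN` set are all assumptions made for the example.

```python
# Hypothetical sketch of a leaderless write: the client sends the write to
# every replica and treats it as successful once at least `w` acknowledge.
REPLICAS = {"node1": {}, "node2": {}, "node3": {}}
DOWN = {"node3"}  # simulate one unavailable replica

def write(key, value, w=2):
    """Send the write to all replicas; succeed if at least w acks arrive."""
    acks = 0
    for name, store in REPLICAS.items():
        if name in DOWN:
            continue  # unreachable replica: no acknowledgment
        store[key] = value
        acks += 1
    return acks >= w

ok = write("cart:42", "3 items")
print(ok)                 # True: 2 of 3 replicas acknowledged
print(REPLICAS["node3"])  # {}: the down replica missed the write
```

The write succeeds even though `node3` never saw it, which is exactly the state that makes stale reads possible when that node comes back.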
Now imagine that the unavailable node comes back online and clients start reading from it. Any writes that happened while the node was down are missing from it. Thus, if you read from that node, you may get stale (outdated) values as responses.
To solve that problem, when a client reads from the database, it doesn't just send its request to one replica: read requests are also sent to several nodes in parallel. The client may get different responses from different nodes.
Read Repair and anti-entropy
The replication scheme should ensure that eventually all the data is copied to every replica.
Two mechanisms are often used in Dynamo-style datastores:
Read Repair
When a client reads from several nodes in parallel, it can detect any stale responses. The client sees that the recovered replica has a stale value and writes the newer value back to that replica.
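A sketch of that mechanism, under the assumption that each replica stores a `(version, value)` pair (a simple counter standing in for the vector clocks Dynamo-style stores actually use):

```python
# Illustrative read repair: read from several replicas, pick the
# highest-versioned value, and push it back to any stale replica.
replicas = [
    {"k": (7, "new")},  # up to date
    {"k": (7, "new")},  # up to date
    {"k": (6, "old")},  # was down, missed the latest write
]

def read_with_repair(key):
    responses = [(i, r[key]) for i, r in enumerate(replicas)]
    # The freshest value is the one with the highest version number.
    latest_version, latest_value = max(vv for _, vv in responses)
    # Read repair: write the latest value back to stale replicas.
    for i, (version, _) in responses:
        if version < latest_version:
            replicas[i][key] = (latest_version, latest_value)
    return latest_value

print(read_with_repair("k"))  # "new"
print(replicas[2]["k"])       # (7, "new"): the stale replica was repaired
```

Note that the repair happens as a side effect of an ordinary read, which is why rarely read keys don't benefit from it.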
Anti-entropy Process
In addition, some data stores have a background process that constantly looks for differences in the data between replicas and copies any missing data from one replica to another.
Not all systems implement both of these; for example, Voldemort currently does not have an anti-entropy process. Note that without an anti-entropy process, values that are rarely read may be missing from some replicas and thus have reduced durability, because read repair is only performed when a value is read by the application.
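The idea of an anti-entropy pass can be shown with a toy version. Real Dynamo-style stores compare replicas efficiently using Merkle trees and reconcile conflicting versions; this sketch only copies keys that one replica is missing entirely, which is an assumption made to keep the example small.

```python
# Toy anti-entropy pass: compare two replicas and copy over any keys
# the other one is missing. (Real systems diff via Merkle trees and
# must also reconcile keys present on both sides with different versions.)
def anti_entropy(replica_a, replica_b):
    for key, value in list(replica_a.items()):
        replica_b.setdefault(key, value)
    for key, value in list(replica_b.items()):
        replica_a.setdefault(key, value)

a = {"x": 1, "y": 2}
b = {"y": 2, "z": 3}
anti_entropy(a, b)
print(a == b)  # True: both replicas now hold x, y, and z
```

Unlike read repair, this runs in the background regardless of read traffic, so it also converges values that are never read.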
Quorums for reading and writing
Every successful write is guaranteed to be present on at least two out of the three replicas, which means at most one replica can be stale.
Thus, if we read from at least two replicas, we can be sure that at least one of the two is up to date.
Even if the third replica is down or stale, reads will continue to return up-to-date values.
More generally: if there are n replicas, every write must be confirmed by at least w nodes to be considered successful, and every read must query at least r nodes. As long as w + r > n, we expect to get an up-to-date value when reading. Reads and writes that obey these r and w values are called quorum reads and writes.
The quorum condition, w + r > n, allows the system to tolerate unavailable nodes as follows:
If w < n, we can still process writes if a node is unavailable.
If r < n, we can still process reads if a node is unavailable.
With n = 3, w = 2, r = 2, we can tolerate one unavailable node.
With n = 5, w = 3, r = 3, we can tolerate two unavailable nodes.
Limitations of Quorum Consistency
If you have n replicas and you choose w and r such that w + r > n, the set of nodes you have written to and the set of nodes you have read from must overlap in at least one node. This overlap is what the up-to-date-read guarantee rests on.
Often, r and w are chosen to be a majority (more than n/2) of the nodes, because that ensures w + r > n while still tolerating up to n/2 node failures.
Other quorum assignments are also possible, which allows some flexibility in the design of distributed algorithms.
You may also set w + r ≤ n. Reads and writes are still sent to n nodes, but a smaller number of successful responses is required for an operation to succeed. With smaller values of w and r, you are more likely to read stale values, because it is more likely that your read didn't include the node with the latest value.
However, even with w + r > n, there are edge cases where stale values may be returned:
If a sloppy quorum is used, the w writes may end up on different nodes than the r reads, so there is no longer a guaranteed overlap between the r nodes and the w nodes.
If two writes occur concurrently, it is not clear which one happened first. If the winner is picked based on a timestamp, writes can be lost due to clock skew.
If a write happens concurrently with a read, the write may be reflected on only some of the replicas. It is undetermined whether the read returns the old or the new value.
If a write succeeded on some replicas but failed on others, and overall succeeded on fewer than w replicas, it is not rolled back on the replicas where it succeeded.
If a node carrying a new value fails, and its data is restored from a replica carrying an old value, the number of replicas storing the new value may fall below w, breaking the quorum condition.
That's it from my side. There are plenty of topics to discuss, such as:
Sloppy Quorums and Hinted Handoff.
Detecting Concurrent Writes.
Merging Concurrently Written Values.
Written by Arpan Das