Understanding etcd Consistency and Why an Odd Number of Instances Is Important

What is etcd?

  • etcd is a distributed key-value store used in Kubernetes to store all of its data (like cluster configuration, state, etc.). Since etcd is a critical part of the system, it needs to be highly available and consistent across multiple instances (nodes).

Why Multiple Instances?

  • To ensure high availability and reliability, etcd is typically run in a cluster with multiple instances. If one instance fails, others can continue to serve requests. But running multiple instances introduces a challenge: keeping all instances consistent with each other.

How Does etcd Maintain Consistency?

  • etcd uses the Raft consensus algorithm to ensure that all instances (nodes) in the cluster agree on the current state.

  • Consensus means that a majority of the nodes in the cluster agree on what the current state is. The majority needed for that agreement is called a quorum.

What is a Quorum?

  • A quorum is more than half of the nodes in the cluster. For example:

    • In a cluster of 3 nodes, a quorum is 2 nodes.

    • In a cluster of 5 nodes, a quorum is 3 nodes.

  • To make any changes to the state, a quorum must agree. This prevents the cluster from having conflicting states.
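The majority rule above can be written down in a few lines. This is a minimal Python sketch for illustration, not etcd code; the function names `quorum_size` and `can_commit` are made up here.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that is strictly more than half the cluster."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def can_commit(acknowledgements: int, cluster_size: int) -> bool:
    """A change is accepted only when a quorum of nodes has agreed to it."""
    return acknowledgements >= quorum_size(cluster_size)


if __name__ == "__main__":
    for n in (3, 5):
        print(f"{n} nodes -> quorum of {quorum_size(n)}")
```

Because any two majorities of the same cluster share at least one node, two conflicting changes can never both gather a quorum.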

Handling Network Splits (Split-Brain Scenario)

  • Imagine the network between nodes is broken, splitting a 5-node cluster into two groups (a network partition, which could lead to "split-brain" if both groups kept accepting changes):

    • Group A: 3 nodes.

    • Group B: 2 nodes.

  • Group A has a quorum (3 out of 5), so it can continue making changes.

  • Group B does not have a quorum (only 2 out of 5), so it cannot make changes.

  • This ensures that changes are only made by the majority group, keeping the state consistent.

  • When the network is restored, Group B can catch up with the latest state from Group A.
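The partition example can be checked with a small sketch. It assumes the same majority rule described earlier; `writable_groups` is a hypothetical helper written for this article, not part of etcd.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that is strictly more than half the cluster."""
    return cluster_size // 2 + 1


def writable_groups(groups: dict) -> dict:
    """Map each partition name to whether it still holds a quorum.

    The quorum is computed from the full cluster size, not from the
    size of the partition, so at most one group can ever qualify.
    """
    cluster_size = sum(groups.values())
    needed = quorum_size(cluster_size)
    return {name: size >= needed for name, size in groups.items()}


if __name__ == "__main__":
    print(writable_groups({"A": 3, "B": 2}))  # {'A': True, 'B': False}
```

With a 4-node cluster split 2 and 2, the same function reports that neither group can make changes, which is one reason even-sized clusters are less attractive.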

Why an Odd Number of Instances?

Why Not an Even Number?

  • With an even number of nodes (like 2 or 4):

    • If you have 2 nodes, both must be available to achieve a quorum (1 is not enough). If one node fails, the cluster can't continue, so 2 nodes tolerate no failures, the same as a single node.

    • If you have 4 nodes, you need 3 nodes to form a quorum (2 is not enough). If two nodes fail, the cluster can't continue, so 4 nodes tolerate only 1 failure, the same as 3 nodes.

Why an Odd Number Makes Sense:

  • With an odd number of nodes (like 3 or 5):

    • If you have 3 nodes, you only need 2 nodes for a quorum. This means the cluster can tolerate 1 failure and still continue.

    • If you have 5 nodes, you need 3 nodes for a quorum. This means the cluster can tolerate up to 2 failures.
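To compare odd and even cluster sizes side by side, the arithmetic can be tabulated. This is a rough sketch under the majority rule described earlier; `failures_tolerated` is an illustrative name, not an etcd API.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that is strictly more than half the cluster."""
    return cluster_size // 2 + 1


def failures_tolerated(cluster_size: int) -> int:
    """Nodes that can fail while a quorum is still reachable."""
    return cluster_size - quorum_size(cluster_size)


if __name__ == "__main__":
    print("nodes  quorum  failures tolerated")
    for n in range(1, 8):
        print(f"{n:>5}  {quorum_size(n):>6}  {failures_tolerated(n):>18}")
```

Under these definitions, 3 and 4 nodes both tolerate one failure, and 5 and 6 nodes both tolerate two, so the extra even node adds cost without adding fault tolerance.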

Benefits of Odd Numbers:

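  • Higher Fault Tolerance per Node: Going from an odd cluster size to the next even size (3 to 4, or 5 to 6) raises the quorum by one without increasing the number of failures the cluster can tolerate. An odd size gives the most fault tolerance for the number of nodes you run.

  • Lower Risk of Complete Failure: An even-sized cluster has one more node that can fail but tolerates the same number of failures, so it is slightly more likely to lose its quorum than the odd-sized cluster one node smaller.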

Conclusion

  • Odd Number of Instances: Gives the best fault tolerance for the number of nodes in the cluster.

  • Why Not Even?: An even number of nodes raises the quorum size without letting the cluster tolerate any more failures than the odd number just below it.

  • The Goal: To keep the etcd cluster running and consistent even if some nodes fail, which is why an odd number of instances is preferred.

Written by

Sundaram Kumar Jha

I like building cloud native stuff: microservices, backends, distributed systems, and cloud native tools using Golang.