Why Is an Odd Number of etcd Nodes Preferred in Production for High Availability?
Let's Start
When it comes to building resilient and highly available Kubernetes clusters, the etcd database plays a crucial role. etcd is the backbone of Kubernetes, responsible for storing the entire cluster state, making it vital for the smooth operation of your system. But one often-overlooked aspect of etcd configuration is the number of nodes in the cluster. Specifically, why is it recommended to have an odd number of etcd nodes?
In this blog, we'll explore the reasons behind this best practice, including the role of the Raft consensus algorithm, and how it helps in maintaining high availability in production environments. We’ll also dive into a practical example to illustrate the concept.
Understanding etcd and High Availability
etcd is a distributed key-value store that is used by Kubernetes for storing all cluster data, including the configuration, secrets, and service discovery information. For a Kubernetes cluster to be resilient, etcd itself must be highly available. This is where the concept of a "quorum" comes into play.
The Role of the Raft Consensus Algorithm
etcd uses the Raft consensus algorithm to ensure that data is consistently replicated across all nodes in the cluster. Raft is a protocol for reaching consensus in a distributed system: it ensures that multiple servers (nodes) agree on the same data state.
In a Raft-based system, one node acts as the leader, and the rest are followers. The leader is responsible for managing log replication and ensuring that the followers have the same data. If the leader fails, the remaining nodes must elect a new leader, which requires a majority (quorum) of the nodes to agree on the new leader.
This is where the importance of an odd number of nodes comes into play. With an odd number of nodes, any vote or two-way network partition always leaves a clear majority on one side, so the cluster can keep reaching consensus and avoids the split-brain scenario where no side has enough votes to elect a leader.
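To see how the majority rule plays out, here is a minimal Python sketch. It is purely illustrative, not etcd's actual election code, and the function names are invented for this post:

```python
# Illustrative sketch of Raft-style leader election vote counting.
# This is NOT etcd's real implementation, just the majority rule it relies on.

def majority(cluster_size: int) -> int:
    """Smallest number of votes that forms a strict majority of the cluster."""
    return cluster_size // 2 + 1

def wins_election(votes_received: int, cluster_size: int) -> bool:
    """A candidate becomes leader only with votes from a strict majority."""
    return votes_received >= majority(cluster_size)

# 3 nodes: 2 votes already form a majority, so two candidates can never tie.
print(wins_election(2, 3))   # True
# 4 nodes: a 2-2 split means no one wins this round and the election is retried.
print(wins_election(2, 4))   # False
```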
What is a Quorum?
In a distributed system like etcd that uses Raft, a quorum is the minimum number of members (nodes) required to agree on a transaction before it is committed. This is essential for ensuring data consistency across the cluster. The quorum size is calculated as:
Quorum = floor(N/2) + 1
where N is the total number of nodes in the etcd cluster. The quorum ensures that the cluster can tolerate a certain number of node failures while still maintaining the ability to perform write operations.
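To make the arithmetic concrete, here is a small Python helper (an illustration, not part of etcd; the function names are chosen for this post) that computes the quorum and the number of failures a cluster of N members can survive:

```python
def quorum(n: int) -> int:
    """Minimum number of members that must agree: floor(n/2) + 1."""
    return n // 2 + 1

def fault_tolerance(n: int) -> int:
    """Number of members that can fail while a quorum still remains."""
    return n - quorum(n)

for n in range(1, 8):
    print(f"{n} nodes -> quorum {quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")

# Output:
# 1 nodes -> quorum 1, tolerates 0 failure(s)
# 2 nodes -> quorum 2, tolerates 0 failure(s)
# 3 nodes -> quorum 2, tolerates 1 failure(s)
# 4 nodes -> quorum 3, tolerates 1 failure(s)
# 5 nodes -> quorum 3, tolerates 2 failure(s)
# 6 nodes -> quorum 4, tolerates 2 failure(s)
# 7 nodes -> quorum 4, tolerates 3 failure(s)
```

Notice that moving from an odd size to the next even size (3 to 4, or 5 to 6) raises the quorum but not the fault tolerance, which is exactly the argument for odd-sized clusters.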
Why Odd Number of Nodes?
Having an odd number of etcd nodes is a simple yet effective strategy for achieving high availability. Here’s why:
Optimal Quorum Calculation: With an odd number of nodes, you get the best fault tolerance for the number of members you run. For instance, in a 3-node cluster the quorum is 2, so the cluster can tolerate 1 node failure. With an even number of nodes, say 4, the quorum rises to 3, and the cluster can still only tolerate 1 node failure, rendering the additional node redundant.
Leader Election and Split-Brain Prevention: The Raft algorithm requires a majority of nodes to agree on a leader. With an odd number of nodes, any two-way partition leaves a clear majority on one side, so that side can elect a leader and keep serving requests while the minority cannot, preventing a split-brain with two active leaders. With an even number of nodes, tied votes and evenly split partitions become possible, in which case no side has a majority and the whole cluster stalls.
Cost-Effectiveness: Running an odd number of nodes is more cost-effective because you are not paying for nodes that don’t significantly improve fault tolerance. For example, moving from 3 to 4 nodes doesn’t increase your tolerance to failures, but moving from 3 to 5 does, allowing you to tolerate up to 2 failures.
A Practical Example: 3-Node vs. 4-Node etcd Cluster
Let’s walk through an example to better understand why an odd number of nodes is recommended.
Scenario 1: 3-Node etcd Cluster
Consider a 3-node etcd cluster. The quorum in this case is calculated as:
Quorum = floor(3/2) + 1 = 1 + 1 = 2
This means that 2 out of the 3 nodes must agree on any operation before it’s committed. Here’s how this setup plays out:
Fault Tolerance: The cluster can tolerate 1 node failure. With 2 nodes still operational, the quorum of 2 is met, allowing the cluster to continue functioning normally.
High Availability: The system remains available even if 1 node goes down. Write operations can continue, and the cluster remains consistent.
If 1 node fails, the quorum of 2 is still maintained, so the cluster continues to operate. However, if 2 nodes fail, only 1 node remains and the quorum is lost; the cluster then refuses write operations rather than risk serving inconsistent data.
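The same failure behaviour can be expressed as a one-line quorum check. This is again just an illustrative Python sketch (the can_serve_writes helper is invented for this post), applied to the 3-node scenario above:

```python
def can_serve_writes(alive: int, total: int) -> bool:
    """The cluster keeps committing writes only while a quorum of members is alive."""
    return alive >= total // 2 + 1

# 3-node cluster from Scenario 1:
print(can_serve_writes(3, 3))  # True  - all members healthy
print(can_serve_writes(2, 3))  # True  - 1 failure, quorum of 2 still met
print(can_serve_writes(1, 3))  # False - 2 failures, writes refused to stay consistent
```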
Scenario 2: 4-Node etcd Cluster
Now, let’s consider a 4-node etcd cluster. The quorum is:
Quorum = floor(4/2) + 1 = 2 + 1 = 3
In this setup, 3 out of the 4 nodes must agree on any operation before it’s committed. But here’s the catch:
Fault Tolerance: The cluster can still only tolerate 1 node failure, just like in the 3-node setup. If 1 node fails, you’re left with 3 nodes, and the quorum of 3 is met.
High Availability: The system remains available with 1 node failure, but adding the 4th node didn’t increase the fault tolerance compared to the 3-node setup.
So, despite having an extra node, you gain no additional fault tolerance. The 4th node essentially becomes redundant in terms of availability.
Why Is 3 Better than 4?
A 3-node cluster survives exactly as many failures as a 4-node cluster (one), but it needs one less machine and its elections can never end in a tied vote, so the fourth member buys you nothing.
When to Consider More Nodes: The 5-Node Cluster
If you truly need higher availability, the next logical step is to move to a 5-node cluster. Here’s how the math works out:
Quorum = floor(5/2) + 1 = 2 + 1 = 3
Fault Tolerance: The 5-node cluster can tolerate up to 2 node failures. With 3 nodes remaining, the quorum of 3 is still met, and the cluster remains operational.
High Availability: This setup offers better fault tolerance than both the 3-node and 4-node clusters, justifying the additional resources.
Conclusion
The Raft consensus algorithm, which underpins etcd, relies on a majority of nodes to agree on data consistency and leadership decisions. Deploying an odd number of etcd nodes in production ensures that you achieve optimal fault tolerance and high availability without wasting resources. A 3-node cluster is sufficient for most scenarios, but if you need to tolerate more than one failure, a 5-node cluster is the way to go.
By choosing an odd number of nodes, you effectively balance performance, cost, and resilience, making your Kubernetes cluster more reliable and robust.
Written by Anish Bista
Anish Bista is a passionate student deeply involved in cloud-native, DevOps, and open-source software, with a particular focus on Kubernetes ☸. He actively contributes to CNCF projects, including KubeVirt and Kanisterio, is engaged in community work as a member of DevOpsCloudJunction, and serves as a co-organizer at CNCF Kathmandu. His interests and expertise lie in Kubernetes and Golang.