Docker Series – Part 19: High Availability in Docker Swarm – Multiple Managers, Fault Tolerance & Node Management

Nitin Dhiman

When we talk about container orchestration in production, high availability is not a “nice to have” — it’s a necessity.

Imagine you have a single manager node in your Docker Swarm cluster, and it goes down. Containers that are already running on the workers keep running, but without a manager the cluster can no longer schedule, scale, or recover tasks. That’s where multiple manager nodes come into play.

In this article, we’ll explore:

  • Why high availability matters in Docker Swarm

  • Adding multiple managers to a cluster

  • Leader election and fault tolerance

  • Node promotion/demotion

  • Maintenance strategies with drain mode


Understanding Manager & Worker Nodes

In a Docker Swarm cluster:

  • Manager Nodes: Handle orchestration — scheduling containers, scaling services, maintaining cluster state.

  • Worker Nodes: Run the actual application containers.

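All management actions (like scaling) must be performed on a manager node. Clients, on the other hand, can reach a published service port on any node in the swarm, manager or worker: with the default ingress routing mesh, the node that receives the request forwards it to a node running the service.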

The risk? Single Point of Failure (SPOF).
If the only manager node goes down, the cluster loses its orchestration ability.


High Availability & Fault Tolerance

To avoid SPOF, we add more manager nodes.
Example:

  • With 3 managers, if 1 fails, the remaining 2 still form a majority (quorum), so the cluster keeps functioning.

  • The number of manager failures a cluster can survive is its Fault Tolerance.

The formula:

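Fault Tolerance = floor((N - 1) / 2)

Where N is the number of managers and the result is rounded down. For example, 3 managers tolerate 1 failure and 5 tolerate 2, but 4 still tolerate only 1, which is why an odd number of managers is recommended.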
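To make the formula concrete, here is a small shell sketch that tabulates the quorum size and fault tolerance for a few manager counts. The function names quorum and fault_tolerance are illustrative helpers, not Docker commands; shell integer division provides the rounding down.

```shell
# Hypothetical helpers (not part of the Docker CLI).
# Quorum: the majority of managers that must stay reachable.
quorum() { echo $(( $1 / 2 + 1 )); }

# Fault tolerance: manager failures the swarm can survive,
# that is (N - 1) / 2 rounded down.
fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

for n in 1 2 3 4 5 7; do
  printf 'managers=%s quorum=%s tolerated_failures=%s\n' \
    "$n" "$(quorum "$n")" "$(fault_tolerance "$n")"
done
```

Going from 3 to 4 managers raises the quorum from 2 to 3 without tolerating any additional failure, so the fourth manager adds cost but no resilience.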

Managers use the Raft Consensus Algorithm to elect a leader and to replicate the cluster state among themselves. At any time one manager is the Leader, and the other healthy managers show as Reachable, ready to take over if the leader fails.


Setting up Multiple Managers on AWS EC2

Step 1: Launch New Instances

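We launch 2 additional Amazon Linux EC2 instances to join as managers. Their security group must allow TCP port 2377 from the other nodes, since that is the port used to join and manage the swarm (node-to-node communication also uses 7946 TCP/UDP, and overlay network traffic uses 4789 UDP).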

Step 2: Install Docker on New Nodes

yum install docker -y
systemctl enable docker --now

Step 3: Generate Manager Join Token

On the leader node:

docker swarm join-token manager

Step 4: Join New Managers

Run the join command printed by the previous step on each new instance:

docker swarm join --token <manager-token> <leader-ip>:2377
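When provisioning several managers, it can help to build the join command in a script instead of copying it by hand. The sketch below assumes it runs on an existing manager; manager_join_command is a hypothetical helper name, and the address argument is the same placeholder as in the step above. It relies on the -q flag of docker swarm join-token, which prints only the token.

```shell
# Hypothetical helper: print the join command for a new manager.
# $1 is the address of an existing manager (placeholder, as above).
manager_join_command() {
  token="$(docker swarm join-token -q manager)" || return 1
  printf 'docker swarm join --token %s %s:2377\n' "$token" "$1"
}

# Example (run on an existing manager):
#   manager_join_command <leader-ip>
```

The manager token grants full control of the cluster, so treat the printed command like a credential.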

Now, listing nodes:

docker node ls

The MANAGER STATUS column now shows one manager as Leader and the others as Reachable.
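For a scripted check of the same information, the following sketch filters the node list down to managers and counts the healthy ones. list_managers and healthy_manager_count are hypothetical helper names; the --filter role=manager option and the .Hostname and .ManagerStatus format fields belong to docker node ls.

```shell
# Hypothetical helpers built on "docker node ls"; run them on a manager.
# Print "<hostname> <manager status>" for every manager node.
list_managers() {
  docker node ls --filter role=manager \
    --format '{{.Hostname}} {{.ManagerStatus}}'
}

# Count managers that currently report Leader or Reachable.
healthy_manager_count() {
  list_managers | grep -c -E ' (Leader|Reachable)$'
}
```

If the count drops to the quorum size, one more manager failure would leave the swarm unable to make orchestration decisions.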


Leader Election in Action

When the leader node is stopped:

systemctl stop docker

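Another reachable manager becomes the leader automatically, as long as a majority of managers is still up. The stopped node shows as Unreachable in docker node ls until Docker is started on it again.
This is how Docker Swarm keeps orchestration running when a manager fails.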
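To watch the election happen, you can print the leader's hostname before and after stopping Docker. This is a minimal sketch: current_leader is a hypothetical helper, and it has to run on a manager that is still up, because a stopped node cannot answer docker node ls.

```shell
# Hypothetical helper: print the hostname of the current swarm leader.
# Worker rows have an empty manager status, so they never match.
current_leader() {
  docker node ls --format '{{.Hostname}} {{.ManagerStatus}}' |
    awk '$2 == "Leader" { print $1 }'
}

# Example: run on a surviving manager before and after the stop.
#   current_leader
```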


Scaling Services

We can scale services either:

  • One at a time:

docker service update --replicas=5 <service-name>

  • Multiple at once:

docker service scale service1=3 service2=5
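After scaling, it is worth confirming that the running replica count has caught up with the requested one. The sketch below assumes the running/desired form (for example 3/5) that docker service ls reports in its REPLICAS column; service_converged is a hypothetical helper name.

```shell
# Hypothetical helper: succeed when a service runs as many replicas
# as requested. The name filter matches by prefix, so awk checks the
# exact service name as well.
service_converged() {
  docker service ls --filter "name=$1" --format '{{.Name}} {{.Replicas}}' |
    awk -v svc="$1" '
      $1 == svc { split($2, r, "/"); ok = (r[1] == r[2]) }
      END { exit ok ? 0 : 1 }'
}

# Example:
#   until service_converged <service-name>; do sleep 2; done
```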

Promoting & Demoting Nodes

You can promote a worker to a manager:

docker node promote <worker-hostname>

Or demote a manager to a worker:

docker node demote <manager-hostname>
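Demoting a manager lowers N in the fault tolerance formula, so a guard can be useful in scripts. The sketch below is one possible policy, not a Docker feature: demote_if_safe is a hypothetical helper that refuses to demote when fewer than 3 managers would remain, the smallest count that still tolerates one failure.

```shell
# Hypothetical helper: demote a manager only if at least 3 managers
# remain afterwards (an assumed policy, adjust to your needs).
demote_if_safe() {
  managers=$(( $(docker node ls --filter role=manager -q | wc -l) ))
  if [ $(( managers - 1 )) -lt 3 ]; then
    echo "refusing: only $(( managers - 1 )) manager(s) would remain" >&2
    return 1
  fi
  docker node demote "$1"
}

# Example:
#   demote_if_safe <manager-hostname>
```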

Maintenance with drain Mode

If you need to perform maintenance on a node, manager or worker:

docker node update --availability drain <hostname>

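This stops the service tasks running on that node and reschedules them on other nodes, so replicated services stay available during the maintenance. A drained manager still takes part in Raft. When the work is done, return the node to service with docker node update --availability active <hostname>.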
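A full maintenance window can be sketched as drain, wait, maintain, reactivate. The helper names below are hypothetical; the --availability values and the desired-state filter of docker node ps are standard Docker CLI options.

```shell
# Hypothetical helpers for a maintenance window; run them on a manager.
drain_node()    { docker node update --availability drain "$1"; }
activate_node() { docker node update --availability active "$1"; }

# Succeed once no task on the node is still meant to be running.
node_is_drained() {
  [ -z "$(docker node ps --filter desired-state=running -q "$1")" ]
}

# Example flow:
#   drain_node <hostname>
#   until node_is_drained <hostname>; do sleep 2; done
#   ... perform maintenance ...
#   activate_node <hostname>
```

Setting a node back to active does not move existing tasks back onto it; it only makes the node eligible for new tasks.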


Key Takeaways

  • Multiple managers = fault tolerance + high availability.

  • Leader election ensures cluster orchestration continuity.

  • Raft protocol maintains synchronized cluster state.

  • Use drain mode for safe node maintenance.

  • Promote/demote nodes as needed for flexible scaling.


Next in the series: We’ll explore Overlay Networks & Stacks in Swarm to enable multi-host container communication and secure deployments.


Written by

Nitin Dhiman

Self-taught DevOps enthusiast on a journey from beginner to pro. Passionate about demystifying complex tools like Docker, AWS, CI/CD & Kubernetes into clear, actionable insights. Fueled by curiosity, driven by hands-on learning, and committed to sharing the journey. Always building, always growing 🚀