Docker Series – Part 19: High Availability in Docker Swarm – Multiple Managers, Fault Tolerance & Node Management


When we talk about container orchestration in production, high availability is not a “nice to have” — it’s a necessity.
Imagine you have a single manager node in your Docker Swarm cluster, and it goes down. Without redundancy, your orchestration layer fails, applications lose coordination, and scaling decisions halt. That’s where multiple manager nodes come into play.
In this article, we’ll explore:
Why high availability matters in Docker Swarm
Adding multiple managers to a cluster
Leader election and fault tolerance
Node promotion/demotion
Maintenance strategies with drain mode
Understanding Manager & Worker Nodes
In a Docker Swarm cluster:
Manager Nodes: Handle orchestration — scheduling containers, scaling services, maintaining cluster state.
Worker Nodes: Run the actual application containers.
By default, all management actions (like scaling) must be performed on a manager node. Thanks to Swarm's routing mesh, clients can reach a published service by connecting to any node in the cluster.
The risk? Single Point of Failure (SPOF).
If the only manager node goes down, the cluster loses its orchestration ability.
High Availability & Fault Tolerance
To avoid SPOF, we add more manager nodes.
Example:
With 3 managers, the cluster can still function if 1 fails.
This resilience is called Fault Tolerance.
The formula:
Fault Tolerance = floor((N - 1) / 2)
Where N is the number of managers. Because the cluster needs a majority (a quorum) of managers to keep operating, an odd number of managers is recommended.
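The formula is easy to verify with shell arithmetic (integer division acts as the floor; the node counts below are just examples):

```shell
# Fault tolerance = floor((N - 1) / 2) for a few manager counts
for N in 1 3 5 7; do
  echo "Managers: $N -> can lose $(( (N - 1) / 2 ))"
done
```

Notice that going from 3 to 4 managers does not improve fault tolerance (both tolerate 1 failure), which is why odd manager counts are the usual recommendation.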
Adding more managers also enables leader election via the Raft Consensus Algorithm, ensuring that at any time exactly one manager is the Leader while the others remain Reachable.
Setting up Multiple Managers on AWS EC2
Step 1: Launch New Instances
We launch 2 additional Amazon Linux EC2 instances to join as managers.
Step 2: Install Docker on New Nodes
sudo yum install -y docker
sudo systemctl enable docker --now
Step 3: Generate Manager Join Token
On the leader node:
docker swarm join-token manager
Step 4: Join New Managers
Run the token command on the new instances:
docker swarm join --token <manager-token> <leader-ip>:2377
Now, listing the nodes:
docker node ls
shows multiple managers, with one marked as Leader and the others as Reachable.
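On a three-manager cluster, the output looks roughly like this (IDs and hostnames are illustrative and will differ in your cluster):

```shell
docker node ls
# ID            HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
# x1abc... *    manager1   Ready    Active         Leader
# y2def...      manager2   Ready    Active         Reachable
# z3ghi...      manager3   Ready    Active         Reachable
```

The `*` marks the node you are currently connected to, and the MANAGER STATUS column is empty for worker nodes.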
Leader Election in Action
When the leader node is stopped:
sudo systemctl stop docker
Another reachable manager becomes the leader automatically.
This is how Docker Swarm ensures continuity without downtime.
Scaling Services
We can scale services either:
- One at a time:
docker service update --replicas=5 <service-name>
- Multiple at once:
docker service scale service1=3 service2=5
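After scaling, you can confirm the new replica counts; the service names here (`web`, `api`) are placeholders for your own services:

```shell
docker service scale web=3 api=5   # scale two services in one command
docker service ls                  # REPLICAS column should converge to 3/3 and 5/5
docker service ps web              # lists each task and the node it was scheduled on
```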
Promoting & Demoting Nodes
You can promote a worker to a manager:
docker node promote <worker-hostname>
Or demote a manager to a worker:
docker node demote <manager-hostname>
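Promotion and demotion can be combined to rotate manager duties; `node3` below is a hypothetical hostname:

```shell
docker node promote node3   # worker -> manager (MANAGER STATUS becomes Reachable)
docker node ls              # verify the change
docker node demote node3    # manager -> worker; keep an odd manager count for quorum
```

Always demote a manager before removing it from the cluster, so the remaining managers keep their Raft quorum.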
Maintenance with drain Mode
If you need to perform maintenance on a node (manager or worker):
docker node update --availability drain <hostname>
This gracefully stops the service tasks running on that node and reschedules them onto other active nodes — keeping your services up without interruption.
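A typical maintenance cycle looks like this (`node2` is a placeholder hostname); note the final step that returns the node to the scheduler:

```shell
docker node update --availability drain node2    # stop scheduling; move tasks away
docker node ls                                   # AVAILABILITY now shows Drain for node2
# ... perform maintenance (patching, reboot, Docker upgrade) ...
docker node update --availability active node2   # make the node schedulable again
```

Existing tasks do not automatically move back after reactivation; the node simply becomes eligible for new and rescheduled tasks.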
Key Takeaways
Multiple managers = fault tolerance + high availability.
Leader election ensures cluster orchestration continuity.
Raft protocol maintains synchronized cluster state.
Use drain mode for safe node maintenance.
Promote/demote nodes as needed for flexible scaling.
Next in the series: We’ll explore Overlay Networks & Stacks in Swarm to enable multi-host container communication and secure deployments.
Written by

Nitin Dhiman
Self-taught DevOps enthusiast on a journey from beginner to pro. Passionate about demystifying complex tools like Docker, AWS, CI/CD & Kubernetes into clear, actionable insights. Fueled by curiosity, driven by hands-on learning, and committed to sharing the journey. Always building, always growing 🚀