Kubernetes Outage Postmortem: Nodes Stuck in NotReady Due to CNI Failure


Recently, we encountered a critical production outage in our Kubernetes cluster. New nodes provisioned during autoscaling remained in a NotReady
state, leading to service disruptions and failed health checks across workloads.
In this post, I’ll walk you through:
- What caused the issue
- How we identified and resolved it
- Best practices to prevent similar failures in your clusters
🔥 What Went Wrong
During a surge in traffic, our cluster autoscaler kicked in and added new nodes. However, these nodes failed to become Ready, resulting in:
❌ Workloads not scheduled
❌ Services unreachable
❌ Health checks failing, pods crashing
A quick check with:

```bash
kubectl get nodes
```

revealed multiple entries like:

```
ip-node-ip.eu-west-1.compute.internal   NotReady
```
To dig deeper, we inspected the kubelet logs on the affected nodes and found this error:

```
container runtime network not ready: NetworkReady=false
NetworkPluginNotReady: docker: network plugin is not ready: cni config uninitialized
```
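If you hit the same symptom, the node’s status conditions usually carry the same message. As a quick sketch (the node name is the placeholder from above):

```bash
# The Ready condition on the node embeds the CNI error message
kubectl describe node ip-node-ip.eu-west-1.compute.internal | grep -A 8 "Conditions:"

# The same error also appears in the kubelet logs on the node itself
journalctl -u kubelet --since "30 min ago" | grep -i "network plugin"
```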
⚠️ Root Cause: CNI Plugin Failure (Calico)
These errors indicated a CNI (Container Network Interface) misconfiguration. Our cluster was using Calico as the CNI, and it wasn’t initializing properly.
The Calico pods responsible for managing the network stack were either stuck or failing to start because their configuration was missing.
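A quick way to confirm this is to check the Calico DaemonSet pods directly. A hedged sketch (the `k8s-app=calico-node` label matches the standard Calico manifest; the pod name is a placeholder):

```bash
# List the calico-node pods and the nodes they are scheduled on
kubectl get pods -n kube-system -l k8s-app=calico-node -o wide

# For a pod stuck in Init or CrashLoopBackOff, the events usually name what is missing
kubectl describe pod -n kube-system <calico-node-pod-name>
```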
🛠️ How We Fixed It
1. Delete and Recreate Calico Pods
Force Kubernetes to restart Calico:

```bash
kubectl delete pod -n kube-system -l k8s-app=calico-node
```

Deleting the pods prompts the DaemonSet to recreate them with the current (and correct) configuration.
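To confirm the restart took effect, you can watch the DaemonSet bring the pods back up:

```bash
# Watch the recreated calico-node pods until they report Running and Ready (Ctrl-C to stop)
kubectl get pods -n kube-system -l k8s-app=calico-node -w
```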
2. Reapply Calico Configuration
On some nodes, the CNI config was missing or corrupted:

```
/etc/cni/net.d/10-calico.conflist
```

We reinstalled Calico using the official manifest:

```bash
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
```
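After reapplying the manifest, it’s worth confirming on an affected node that calico-node regenerated the config file (paths below are the Calico defaults):

```bash
# Run on the affected node: the conflist should reappear once calico-node initializes
ls -l /etc/cni/net.d/
cat /etc/cni/net.d/10-calico.conflist
```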
3. Restart the Kubelet on Affected Nodes
Restarting the kubelet reinitialized the CNI network stack:

```bash
sudo systemctl restart kubelet
```
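You can then check that the kubelet came back cleanly and stopped logging CNI errors:

```bash
# Verify the kubelet is active, then follow its logs for CNI messages (Ctrl-C to stop)
sudo systemctl status kubelet --no-pager
sudo journalctl -u kubelet -f | grep -i cni
```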
✅ Verification
After applying the fixes, we verified node status:

```bash
kubectl get nodes
```

The output now showed:

```
ip-node-ip.eu-west-1.compute.internal   Ready
```
Services started recovering, and workloads were rescheduled.
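Rather than polling by hand, a one-liner that blocks until every node reports Ready (with a timeout as a safety valve) is useful at this stage:

```bash
# Block until all nodes report the Ready condition, or fail after 5 minutes
kubectl wait --for=condition=Ready node --all --timeout=300s
```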
🧠 Lessons Learned: How to Prevent This in the Future
🔁 1. Backup CNI Configurations
Always back up CNI configs, especially:

```
/etc/cni/net.d/10-calico.conflist
```

This helps with disaster recovery and rapid bootstrapping.
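As a minimal sketch, a small cron-able script that snapshots the CNI config directory (the backup location is an assumption; adjust it to your environment):

```bash
#!/usr/bin/env bash
# Snapshot /etc/cni/net.d with a timestamp so a known-good copy is
# available for disaster recovery.
set -euo pipefail

BACKUP_DIR="/var/backups/cni"        # assumption: pick a path that suits your setup
TIMESTAMP="$(date +%Y%m%d-%H%M%S)"

mkdir -p "${BACKUP_DIR}"
tar -czf "${BACKUP_DIR}/cni-net.d-${TIMESTAMP}.tar.gz" -C /etc/cni net.d
```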
📈 2. Monitor Calico Health and Node Status
Set up monitoring and alerting for:

```bash
kubectl get pods -n kube-system -l k8s-app=calico-node
kubectl get nodes
```

Alert if:
- Calico pods crash or restart frequently
- Nodes enter or stay in NotReady
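In production you would typically drive these alerts from Prometheus and kube-state-metrics, but as a minimal sketch, a script like the one below covers both checks (the restart threshold and the `alert()` hook are assumptions, not part of our actual setup):

```bash
#!/usr/bin/env bash
# Flag NotReady nodes and calico-node pods that are not Running or restart
# frequently. alert() is a stand-in for a real notifier (Slack, PagerDuty, ...).
set -euo pipefail

alert() { echo "ALERT: $*" >&2; }    # assumption: replace with your notification hook

# Any node whose STATUS column is not exactly "Ready" needs attention
kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1, $2}' | while read -r node status; do
  alert "node ${node} is ${status}"
done

# Any calico-node pod that is not Running, or has restarted more than 5 times
kubectl get pods -n kube-system -l k8s-app=calico-node --no-headers |
  awk '$3 != "Running" || $4+0 > 5 {print $1, $3, "restarts=" $4}' | while read -r line; do
    alert "calico-node unhealthy: ${line}"
  done
```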
⚠️ 3. Avoid Mixing CNIs
Do not mix different CNI plugins (e.g., Calico and the AWS VPC CNI) unless you are explicitly building a hybrid setup; mixing them introduces instability and unexpected behavior.
In our case, we’ve since migrated to the AWS VPC CNI, which aligns better with EKS and provides native integration with VPC IP address management.
📌 Conclusion
Networking is the backbone of Kubernetes, and when the CNI fails, everything breaks.
This incident was a sharp reminder of the importance of:
- Validating CNI configurations
- Monitoring node readiness
- Keeping your control plane and worker nodes in sync
By following the steps outlined above and applying proactive monitoring, you can prevent CNI-related outages and ensure high availability for your workloads.