Kubernetes Outage Postmortem: Nodes Stuck in NotReady Due to CNI Failure

DevOpsofworld

Recently, we encountered a critical production outage in our Kubernetes cluster. New nodes provisioned during autoscaling remained in a NotReady state, leading to service disruptions and failed health checks across workloads.

In this post, I’ll walk you through:

  • What caused the issue

  • How we identified and resolved it

  • Best practices to prevent similar failures in your clusters


🔥 What Went Wrong

During a surge in traffic, our cluster autoscaler kicked in and added new nodes. However, these nodes failed to become Ready, resulting in:

  • ❌ Workloads not scheduled

  • ❌ Services unreachable

  • ❌ Health checks failing, pods crashing

A quick check with:

kubectl get nodes

revealed multiple entries like:

ip-node-ip.eu-west-1.compute.internal   NotReady
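
If you're dealing with many nodes, a quick filter keeps only the unhealthy ones (an illustrative one-liner; any grep/awk equivalent works):

kubectl get nodes --no-headers | awk '$2 != "Ready"'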

To dig deeper, we inspected the system logs on the affected nodes and found this kubelet error:

container runtime network not ready: NetworkReady=false
NetworkPluginNotReady: docker: network plugin is not ready: cni config uninitialized
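
The same condition can be spotted without trawling raw logs: kubectl describe node surfaces it under Conditions, and on a systemd-based node the kubelet journal carries the full message (the node name is a placeholder):

kubectl describe node <node-name>
sudo journalctl -u kubelet --no-pager | grep -i "network plugin"   # run on the node itself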

⚠️ Root Cause: CNI Plugin Failure (Calico)

These errors indicated a CNI (Container Network Interface) misconfiguration. Our cluster was using Calico as the CNI, and it wasn’t initializing properly.

The calico-node pods responsible for setting up pod networking on each node were either stuck or failing to start because their configuration files were missing.
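
A quick way to see where the calico-node pods are stuck is to check their status, recent logs, and events; these commands use the same k8s-app=calico-node label referenced in the fix below:

kubectl get pods -n kube-system -l k8s-app=calico-node -o wide
kubectl logs -n kube-system -l k8s-app=calico-node --tail=50
kubectl describe pod -n kube-system -l k8s-app=calico-node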


🛠️ How We Fixed It

1. Delete and Recreate Calico Pods

Deleting the pods forces their DaemonSet to recreate them:

kubectl delete pod -n kube-system -l k8s-app=calico-node

The DaemonSet recreated the calico-node pods, which came back up with the current (and correct) configuration.
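
Before moving on, it's worth watching the replacement pods come up (Ctrl+C once they're all Running):

kubectl get pods -n kube-system -l k8s-app=calico-node -w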


2. Reapply Calico Configuration

On some nodes, the CNI config was missing or corrupted:

/etc/cni/net.d/10-calico.conflist
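
A quick sanity check on an affected node shows whether the file exists and is still valid JSON (python3 is assumed to be available on the node; any JSON validator works):

ls -l /etc/cni/net.d/
sudo python3 -m json.tool /etc/cni/net.d/10-calico.conflist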

We reinstalled Calico using the official manifest:

kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
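
After applying the manifest, wait for the calico-node DaemonSet to finish rolling out (the DaemonSet name assumes the standard Calico manifest):

kubectl rollout status daemonset/calico-node -n kube-system
kubectl get pods -n kube-system -l k8s-app=calico-node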

3. Restart the Kubelet on Affected Nodes

Restarting the kubelet reinitialized the CNI network stack:

sudo systemctl restart kubelet
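
Afterwards, confirm the kubelet is healthy and that the CNI error has stopped appearing (systemd-based node assumed; empty grep output is what you want):

sudo systemctl status kubelet --no-pager
sudo journalctl -u kubelet --since "5 minutes ago" --no-pager | grep -i "network plugin"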

✅ Verification

After applying the fixes, we verified node status:

kubectl get nodes

which now showed:

ip-node-ip.eu-west-1.compute.internal   Ready

Services started recovering, and workloads were rescheduled.
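
For larger node groups, kubectl wait confirms that every node converged instead of scanning the list by eye (the timeout value is arbitrary):

kubectl wait --for=condition=Ready node --all --timeout=300s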


🧠 Lessons Learned: How to Prevent This in the Future

🔁 1. Backup CNI Configurations

Always back up CNI configs, especially:

/etc/cni/net.d/10-calico.conflist

This helps with disaster recovery and rapid bootstrapping.
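
A minimal sketch of such a backup on a node; the /var/backups/cni path is an assumption, and in practice this belongs in your configuration-management or node-bootstrap tooling:

sudo mkdir -p /var/backups/cni/"$(date +%F)"
sudo cp -a /etc/cni/net.d/. /var/backups/cni/"$(date +%F)"/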


📈 2. Monitor Calico Health and Node Status

Set up monitoring and alerting for:

kubectl get pods -n kube-system -l k8s-app=calico-node
kubectl get nodes

Alert if either of the following occurs (a minimal scripted check follows this list):

  • Calico pods crash or restart frequently

  • Nodes enter or stay in NotReady
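
If you don't yet have metrics-based alerting in place, a lightweight scripted check like this can run from a cron job or pipeline as a stopgap (illustrative only; the thresholds and alert wiring are up to you):

NOT_READY_NODES=$(kubectl get nodes --no-headers | awk '$2 != "Ready"' | wc -l)
BAD_CALICO_PODS=$(kubectl get pods -n kube-system -l k8s-app=calico-node --no-headers | awk '$3 != "Running"' | wc -l)
if [ "$NOT_READY_NODES" -gt 0 ] || [ "$BAD_CALICO_PODS" -gt 0 ]; then
  echo "ALERT: $NOT_READY_NODES NotReady node(s), $BAD_CALICO_PODS unhealthy calico-node pod(s)"
  exit 1
fi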


⚠️ 3. Avoid Mixing CNIs

Do not mix different CNI plugins (e.g., Calico and AWS VPC CNI) unless you're explicitly building a hybrid setup. Mixing them introduces instability and unexpected behavior.

In our case, we’ve since migrated to the AWS VPC CNI, which aligns better with EKS and provides native integration with VPC IP address management.
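
If you make a similar switch, it's worth double-checking which CNI is actually active on the cluster and on each node (the DaemonSet and file names below are what the standard add-ons install; verify against your own setup):

kubectl get daemonset -n kube-system        # expect aws-node (VPC CNI) and no calico-node
ls /etc/cni/net.d/                          # on a node: the VPC CNI conflist should be present, 10-calico.conflist should not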


📌 Conclusion

Networking is the backbone of Kubernetes, and when the CNI fails, everything breaks.

This incident was a sharp reminder of the importance of:

  • Validating CNI configurations

  • Monitoring node readiness

  • Keeping your control plane and worker nodes in sync

By following the steps outlined above and applying proactive monitoring, you can prevent CNI-related outages and ensure high availability for your workloads.
