Building a Self-Healing Kubernetes Cluster on AWS EKS

Introduction
Kubernetes on AWS (EKS) is powerful but complex. Without proper automation, clusters can suffer from:
❌ Unplanned downtime (nodes/pods crashing)
❌ Wasted spend (over-provisioned resources)
❌ Configuration drift (manual changes causing outages)
In this tutorial, you’ll learn how to set up a self-healing EKS cluster using AWS-native tools only, focusing on:
Karpenter for automatic, cost-efficient scaling
eksctl for declarative, repeatable cluster provisioning
CloudWatch Container Insights for monitoring
No third-party tools required.
Step 1: Set Up the EKS Cluster
Prerequisites
- AWS CLI configured
- eksctl installed
- IAM permissions for EKS, EC2, and CloudWatch
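Before creating anything, it's worth confirming the tooling is in place. A quick sanity check, assuming aws, eksctl, and kubectl are on your PATH:
# Verify CLI tooling and credentials before starting
aws --version
aws sts get-caller-identity   # confirms your AWS credentials are configured
eksctl version
kubectl version --client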
Deploy EKS with eksctl
eksctl create cluster \
  --name self-healing-cluster \
  --version 1.28 \
  --region us-east-1 \
  --nodegroup-name ng-default \
  --nodes 3 \
  --nodes-min 1 \
  --nodes-max 5 \
  --managed
Key Flags:
--managed: Uses EKS-managed nodes (simpler than self-managed)
--nodes-min 1: Ensures at least 1 node is always running
--nodes-max 5: Prevents runaway scaling
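Cluster creation takes several minutes. eksctl updates your kubeconfig automatically, so once it finishes you can confirm the node group is healthy:
# Confirm the three managed nodes registered and are Ready
kubectl get nodes
# System pods (CoreDNS, kube-proxy, VPC CNI) should all be Running
kubectl get pods -n kube-system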
Step 2: Install Karpenter for Intelligent Scaling
Why Karpenter?
Faster scaling: Launches nodes in seconds (vs. minutes with Cluster Autoscaler)
Cost savings: Automatically uses Spot Instances
Simpler config: No node groups required
Install Karpenter
Note: Karpenter also needs a controller IAM role (via IRSA), a node IAM role, and karpenter.sh/discovery tags on your subnets and security groups; the Karpenter getting-started guide walks through creating these. Charts from v0.17.0 onward are published to the public ECR OCI registry rather than charts.karpenter.sh, so no helm repo add is needed.
# Install Karpenter (the role ARN is a placeholder for the controller
# IAM role you created via IRSA)
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --create-namespace \
  --version v0.32.1 \
  --set settings.clusterName=self-healing-cluster \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=<KARPENTER_CONTROLLER_ROLE_ARN>
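Before configuring scaling, confirm the controller came up cleanly (the label selector below is the one the Helm chart applies):
# The Karpenter controller pods should report Running
kubectl get pods -n karpenter
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=20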
Configure a NodePool
Create karpenter-nodepool.yaml:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default  # references the EC2NodeClass defined below
      requirements:
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"]  # Prefer Spot, fall back to On-Demand
        - key: "kubernetes.io/arch"
          operator: In
          values: ["arm64"]  # Graviton: typically ~20% cheaper than comparable x86
  limits:
    cpu: 1000  # Max cores Karpenter may provision for this pool
  disruption:
    consolidationPolicy: WhenUnderutilized  # Automatically remove idle nodes
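The nodeClassRef above points to an EC2NodeClass, which tells Karpenter how to launch instances: the AMI family, the node IAM role, and which subnets and security groups to use. Below is a minimal sketch, assuming a node role named KarpenterNodeRole-self-healing-cluster exists and your subnets and security groups carry the karpenter.sh/discovery: self-healing-cluster tag (both are conventions from Karpenter's getting-started guide, not values created earlier in this tutorial). Save it as karpenter-nodeclass.yaml:
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2  # Amazon Linux 2 AMIs
  role: KarpenterNodeRole-self-healing-cluster  # node IAM role (assumed name)
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: self-healing-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: self-healing-cluster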
Apply both:
kubectl apply -f karpenter-nodeclass.yaml -f karpenter-nodepool.yaml
Key Features:
✅ Spot Instances: Saves up to 90%
✅ Consolidation: Removes underutilized nodes
✅ Multi-arch: Uses ARM64 (Graviton) for cost efficiency
Step 3: Deploy a Sample App (Test Scaling)
Deploy a Stress-Test App
# stress-test.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress-test
spec:
  replicas: 10
  selector:
    matchLabels:
      app: stress-test
  template:
    metadata:
      labels:
        app: stress-test
    spec:
      containers:
        - name: stress-container
          image: busybox
          command: ["sh", "-c", "while true; do echo 'Simulating load'; sleep 1; done"]
          resources:
            requests:
              cpu: 500m
              memory: 256Mi
Apply it:
kubectl apply -f stress-test.yaml
Watch Karpenter Scale Up
kubectl get nodes -w
You should see new nodes launch within about a minute, far faster than the several minutes Cluster Autoscaler typically needs.
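To see the decision-making behind the scale-up, watch the pending pods and the controller logs (run each in its own terminal):
# Pending pods are the signal Karpenter acts on
kubectl get pods -l app=stress-test -w
# Provisioning decisions appear in the controller logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -f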
Step 4: Monitor with CloudWatch Container Insights
Enable Container Insights
# Requires the CloudWatchAgentServerPolicy on the worker-node IAM role
aws eks create-addon \
  --cluster-name self-healing-cluster \
  --addon-name amazon-cloudwatch-observability \
  --region us-east-1 \
  --configuration-values '{"resources":{"limits":{"cpu":"200m","memory":"200Mi"}}}'
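Confirm the add-on reached ACTIVE before relying on its metrics:
aws eks describe-addon \
  --cluster-name self-healing-cluster \
  --addon-name amazon-cloudwatch-observability \
  --region us-east-1 \
  --query 'addon.status'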
Key Metrics to Track
- CPU/Memory Usage (by ClusterName, NodeName)
- Pending Pods (indicates scaling delays)
- Node Health (Spot Instance interruptions)
View in AWS Console:
- Navigate to CloudWatch → Container Insights → Performance Monitoring
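To act on these metrics rather than just view them, you can wire one into a CloudWatch alarm. A sketch that fires when any node in the cluster reports as failed, using the ContainerInsights metric namespace; the SNS topic ARN is a placeholder you'd create separately:
# Alarm when the cluster reports any failed node for 3 consecutive minutes
aws cloudwatch put-metric-alarm \
  --alarm-name eks-failed-nodes \
  --namespace ContainerInsights \
  --metric-name cluster_failed_node_count \
  --dimensions Name=ClusterName,Value=self-healing-cluster \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions <SNS_TOPIC_ARN>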
Step 5: Test Self-Healing
Simulate a Node Failure
# Drain the first worker node to simulate a failure
NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl drain "$NODE_NAME" --ignore-daemonsets --delete-emptydir-data
What Happens?
Karpenter detects the now-unschedulable pods
Launches a replacement node, typically in under a minute
Pods reschedule automatically
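You can watch the recovery from a second terminal:
# A replacement node appears, then the stress-test pods land on it
kubectl get nodes -w
kubectl get pods -l app=stress-test -o wide -w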
Step 6: Clean Up (Optional)
# Delete the cluster
eksctl delete cluster --name self-healing-cluster --region us-east-1
Summary
You’ve built a self-healing EKS cluster using only AWS tools:
✅ Karpenter for auto-scaling (with Spot savings)
✅ eksctl for repeatable, code-driven cluster provisioning
✅ CloudWatch for monitoring
Next Steps:
Try vertical scaling (adjusting pod CPU/memory requests)
Explore Fargate for serverless Kubernetes
Set up alerts for scaling events
Key Takeaways
Karpenter is faster than Cluster Autoscaler (seconds vs. minutes)
Spot + ARM64 = Massive savings (up to 90% cost reduction)
CloudWatch provides built-in observability (no third-party tools needed)