Building a Self-Healing Kubernetes Cluster on AWS EKS

Introduction
Kubernetes on AWS (EKS) is powerful but complex. Without proper automation, clusters can suffer from:
❌ Unplanned downtime (nodes/pods crashing)
❌ Wasted spend (over-provisioned resources)
❌ Configuration drift (manual changes causing outages)
In this tutorial, you’ll learn how to set up a self-healing EKS cluster using AWS-native tools only, focusing on:
Karpenter for automatic, cost-efficient scaling
eksctl for declarative, repeatable cluster provisioning
CloudWatch Container Insights for monitoring
No third-party tools required.
Step 1: Set Up the EKS Cluster
Prerequisites
- AWS CLI configured
- eksctl installed
- IAM permissions for EKS, EC2, and CloudWatch
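Before creating anything, it's worth confirming the tooling is in place. A quick sanity check, assuming aws, eksctl, and kubectl are on your PATH:
# Verify CLI tooling and credentials before starting
aws --version
aws sts get-caller-identity   # confirms your AWS credentials are configured
eksctl version
kubectl version --client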
Deploy EKS with eksctl
eksctl create cluster \
  --name self-healing-cluster \
  --version 1.28 \
  --region us-east-1 \
  --nodegroup-name ng-default \
  --nodes 3 \
  --nodes-min 1 \
  --nodes-max 5 \
  --managed
Key Flags:
--managed: Uses EKS-managed nodes (simpler than self-managed)
--nodes-min 1: Ensures at least 1 node is always running
--nodes-max 5: Prevents runaway scaling
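Cluster creation takes several minutes. eksctl updates your kubeconfig automatically, so once it finishes you can confirm the node group is healthy:
# Confirm the three managed nodes registered and are Ready
kubectl get nodes
# System pods (CoreDNS, kube-proxy, VPC CNI) should all be Running
kubectl get pods -n kube-system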
Step 2: Install Karpenter for Intelligent Scaling
Why Karpenter?
Faster scaling: Launches nodes in seconds (vs. minutes with Cluster Autoscaler)
Cost savings: Automatically uses Spot Instances
Simpler config: No node groups required
Install Karpenter
Note: Karpenter also needs a controller IAM role (via IRSA), a node IAM role, and karpenter.sh/discovery tags on your subnets and security groups; the Karpenter getting-started guide walks through creating these. Charts from v0.17.0 onward are published to the public ECR OCI registry rather than charts.karpenter.sh, so no helm repo add is needed.
# Install Karpenter (the role ARN is a placeholder for the controller
# IAM role you created via IRSA)
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --create-namespace \
  --version v0.32.1 \
  --set settings.clusterName=self-healing-cluster \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=<KARPENTER_CONTROLLER_ROLE_ARN>
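Before configuring scaling, confirm the controller came up cleanly (the label selector below is the one the Helm chart applies):
# The Karpenter controller pods should report Running
kubectl get pods -n karpenter
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=20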
Configure a NodePool
Create karpenter-nodepool.yaml:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default  # references the EC2NodeClass defined below
      requirements:
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"]  # Prefer Spot, fall back to On-Demand
        - key: "kubernetes.io/arch"
          operator: In
          values: ["arm64"]  # Graviton: typically ~20% cheaper than comparable x86
  limits:
    cpu: 1000  # Max cores Karpenter may provision for this pool
  disruption:
    consolidationPolicy: WhenUnderutilized  # Automatically remove idle nodes
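The nodeClassRef above points to an EC2NodeClass, which tells Karpenter how to launch instances: the AMI family, the node IAM role, and which subnets and security groups to use. Below is a minimal sketch, assuming a node role named KarpenterNodeRole-self-healing-cluster exists and your subnets and security groups carry the karpenter.sh/discovery: self-healing-cluster tag (both are conventions from Karpenter's getting-started guide, not values created earlier in this tutorial). Save it as karpenter-nodeclass.yaml:
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2  # Amazon Linux 2 AMIs
  role: KarpenterNodeRole-self-healing-cluster  # node IAM role (assumed name)
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: self-healing-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: self-healing-cluster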
Apply both:
kubectl apply -f karpenter-nodeclass.yaml -f karpenter-nodepool.yaml
Key Features:
✅ Spot Instances: Saves up to 90%
✅ Consolidation: Removes underutilized nodes
✅ Multi-arch: Uses ARM64 (Graviton) for cost efficiency
Step 3: Deploy a Sample App (Test Scaling)
Deploy a Stress-Test App
# stress-test.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress-test
spec:
  replicas: 10
  selector:
    matchLabels:
      app: stress-test
  template:
    metadata:
      labels:
        app: stress-test
    spec:
      containers:
        - name: stress-container
          image: busybox
          command: ["sh", "-c", "while true; do echo 'Simulating load'; sleep 1; done"]
          resources:
            requests:
              cpu: 500m
              memory: 256Mi
Apply it:
kubectl apply -f stress-test.yaml
Watch Karpenter Scale Up
kubectl get nodes -w
You should see new nodes launch within about a minute, far faster than the several minutes Cluster Autoscaler typically needs.
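To see the decision-making behind the scale-up, watch the pending pods and the controller logs (run each in its own terminal):
# Pending pods are the signal Karpenter acts on
kubectl get pods -l app=stress-test -w
# Provisioning decisions appear in the controller logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -f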
Step 4: Monitor with CloudWatch Container Insights
Enable Container Insights
# Requires the CloudWatchAgentServerPolicy on the worker-node IAM role
aws eks create-addon \
  --cluster-name self-healing-cluster \
  --addon-name amazon-cloudwatch-observability \
  --region us-east-1 \
  --configuration-values '{"resources":{"limits":{"cpu":"200m","memory":"200Mi"}}}'
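Confirm the add-on reached ACTIVE before relying on its metrics:
aws eks describe-addon \
  --cluster-name self-healing-cluster \
  --addon-name amazon-cloudwatch-observability \
  --region us-east-1 \
  --query 'addon.status'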
Key Metrics to Track
- CPU/Memory Usage (by ClusterName, NodeName)
- Pending Pods (indicates scaling delays)
- Node Health (Spot Instance interruptions)
View in AWS Console:
- Navigate to CloudWatch → Container Insights → Performance Monitoring
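To act on these metrics rather than just view them, you can wire one into a CloudWatch alarm. A sketch that fires when any node in the cluster reports as failed, using the ContainerInsights metric namespace; the SNS topic ARN is a placeholder you'd create separately:
# Alarm when the cluster reports any failed node for 3 consecutive minutes
aws cloudwatch put-metric-alarm \
  --alarm-name eks-failed-nodes \
  --namespace ContainerInsights \
  --metric-name cluster_failed_node_count \
  --dimensions Name=ClusterName,Value=self-healing-cluster \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions <SNS_TOPIC_ARN>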
Step 5: Test Self-Healing
Simulate a Node Failure
# Drain the first worker node to simulate a failure
NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl drain "$NODE_NAME" --ignore-daemonsets --delete-emptydir-data
What Happens?
Karpenter detects the now-unschedulable pods
Launches a replacement node, typically in under a minute
Pods reschedule automatically
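You can watch the recovery from a second terminal:
# A replacement node appears, then the stress-test pods land on it
kubectl get nodes -w
kubectl get pods -l app=stress-test -o wide -w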
Step 6: Clean Up (Optional)
# Delete the cluster
eksctl delete cluster --name self-healing-cluster --region us-east-1
Summary
You’ve built a self-healing EKS cluster using only AWS tools:
✅ Karpenter for auto-scaling (with Spot savings)
✅ eksctl for repeatable, code-driven cluster provisioning
✅ CloudWatch for monitoring
Next Steps:
Try vertical scaling (adjusting pod CPU/memory requests)
Explore Fargate for serverless Kubernetes
Set up alerts for scaling events
Key Takeaways
Karpenter is faster than Cluster Autoscaler (seconds vs. minutes)
Spot + ARM64 = Massive savings (up to 90% cost reduction)
CloudWatch provides built-in observability (no third-party tools needed)