Building a Self-Healing Kubernetes Cluster on AWS EKS

Samuel Aniekeme

Introduction

Kubernetes on Amazon EKS is powerful but complex. Without proper automation, clusters can suffer from:

  • Unplanned downtime (nodes/pods crashing)

  • Wasted spend (over-provisioned resources)

  • Configuration drift (manual changes causing outages)

In this tutorial, you’ll learn how to set up a self-healing EKS cluster using AWS-native tools only, focusing on:

  • Karpenter for automatic, cost-efficient node scaling

  • eksctl for repeatable cluster provisioning

  • CloudWatch Container Insights for monitoring

No third-party tools required.


Step 1: Set Up the EKS Cluster

Prerequisites

  • AWS CLI installed and configured with credentials

  • eksctl installed

  • IAM permissions for EKS, EC2, and CloudWatch

Deploy EKS with eksctl

eksctl create cluster \
  --name self-healing-cluster \
  --version 1.28 \
  --region us-east-1 \
  --nodegroup-name ng-default \
  --nodes 3 \
  --nodes-min 1 \
  --nodes-max 5 \
  --managed

Key Flags:

  • --managed: Uses EKS-managed nodes (simpler than self-managed)

  • --nodes-min 1: Ensures at least 1 node is always running

  • --nodes-max 5: Prevents runaway scaling
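Cluster creation typically takes 15–20 minutes. Before moving on, it's worth a quick sanity check that the control plane is active and all three nodes registered (a sketch, assuming the same cluster name and region as above):

```shell
# Confirm the control plane reports ACTIVE
aws eks describe-cluster \
  --name self-healing-cluster \
  --region us-east-1 \
  --query 'cluster.status' \
  --output text

# Confirm the managed nodes joined and are Ready
kubectl get nodes
```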


Step 2: Install Karpenter for Intelligent Scaling

Why Karpenter?

  • Faster scaling: Launches nodes in seconds (vs. minutes with Cluster Autoscaler)

  • Cost savings: Automatically uses Spot Instances

  • Simpler config: No node groups required

Install Karpenter

# Karpenter v0.17+ is distributed as an OCI chart on public ECR;
# the old charts.karpenter.sh Helm repo is deprecated and does not host v0.32.

# Install Karpenter (replace the role ARN with the IRSA role you created for the controller)
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --create-namespace \
  --version v0.32.1 \
  --set settings.clusterName=self-healing-cluster \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=<karpenter-controller-role-arn>

Note: in v0.32, node-level settings such as the instance profile moved off the Helm values and onto the EC2NodeClass resource, which the NodePool below references.

Configure a NodePool

Create karpenter-nodepool.yaml:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"] # Prefer Spot, fall back to On-Demand
        - key: "kubernetes.io/arch"
          operator: In
          values: ["arm64"] # Graviton instances are typically cheaper
      nodeClassRef: # Required: links this pool to an EC2NodeClass with AMI/subnet/IAM settings
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
  limits:
    cpu: 1000 # Max total cores Karpenter may provision
  disruption:
    consolidationPolicy: WhenUnderutilized # Automatically remove idle nodes

Apply it:

kubectl apply -f karpenter-nodepool.yaml
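The NodePool's nodeClassRef points at an EC2NodeClass, which tells Karpenter which AMI family, subnets, security groups, and IAM role to use for the nodes it launches. A minimal sketch (the role name and discovery tags here are assumptions; substitute the values from your own Karpenter setup):

```yaml
# karpenter-ec2nodeclass.yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2 # Amazon Linux 2 AMIs
  role: "KarpenterNodeRole-self-healing-cluster" # hypothetical node IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: self-healing-cluster # tag your subnets for discovery
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: self-healing-cluster # tag your security groups too
```

Apply it with kubectl apply -f karpenter-ec2nodeclass.yaml before (or alongside) the NodePool.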

Key Features:

  • Spot Instances: up to 90% cheaper than On-Demand

  • Consolidation: removes underutilized nodes automatically

  • Multi-arch: uses ARM64 (Graviton) for cost efficiency


Step 3: Deploy a Sample App (Test Scaling)

Deploy a Stress-Test App

# stress-test.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress-test
spec:
  replicas: 10
  selector:
    matchLabels:
      app: stress-test
  template:
    metadata:
      labels:
        app: stress-test
    spec:
      containers:
      - name: stress-container
        image: busybox
        command: ["sh", "-c", "while true; do echo 'Simulating load'; sleep 1; done"]
        resources:
          requests:
            cpu: 500m
            memory: 256Mi

Apply it:

kubectl apply -f stress-test.yaml

Watch Karpenter Scale Up

kubectl get nodes -w

You’ll see new nodes register, typically well under a minute, far faster than the Cluster Autoscaler's usual multi-minute turnaround.
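While waiting, two views show what is happening under the hood: the pods that are still unschedulable, and Karpenter's own provisioning decisions. A sketch, assuming the default chart labels on the Karpenter controller pods:

```shell
# Pods that requested more capacity than exists show up as Pending first
kubectl get pods -l app=stress-test --field-selector=status.phase=Pending

# The controller logs show which instance types Karpenter chose and why
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=20
```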


Step 4: Monitor with CloudWatch Container Insights

Enable Container Insights

aws eks create-addon \
  --cluster-name self-healing-cluster \
  --addon-name amazon-cloudwatch-observability \
  --region us-east-1 \
  --configuration-values '{"resources":{"limits":{"cpu":"200m","memory":"200Mi"}}}'

(Use create-addon for first-time installation; update-addon only works on an add-on that is already installed.)

Key Metrics to Track

  1. CPU/Memory Usage (ClusterName, NodeName)

  2. Pending Pods (indicates scaling delays)

  3. Node Health (Spot Instance interruptions)

View in AWS Console:

  • Navigate to CloudWatch → Container Insights → Performance Monitoring
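Dashboards are only useful if someone is looking at them; alarms close the loop. As a sketch, the CLI can wire a Container Insights metric into an alarm. The cluster_failed_node_count metric is published by Container Insights; the SNS topic ARN below is a placeholder to replace with your own:

```shell
# Alert when any node in the cluster is reported failed for 3 consecutive minutes
aws cloudwatch put-metric-alarm \
  --alarm-name eks-failed-nodes \
  --namespace ContainerInsights \
  --metric-name cluster_failed_node_count \
  --dimensions Name=ClusterName,Value=self-healing-cluster \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:eks-alerts
```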

Step 5: Test Self-Healing

Simulate a Node Failure

# Cordon and drain the first node to simulate losing its capacity
NODE_NAME=$(kubectl get nodes -o json | jq -r '.items[0].metadata.name')
kubectl drain "$NODE_NAME" --ignore-daemonsets --delete-emptydir-data

What Happens?

  1. The evicted pods become Pending, creating unschedulable demand

  2. Karpenter detects the pending pods and launches replacement capacity, typically in under a minute

  3. Pods reschedule onto the new node automatically
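Draining only evicts pods; the node's EC2 instance keeps running. For a harsher failure test, you can terminate the backing instance directly (a sketch, relying on the fact that on EKS a node's providerID ends in its EC2 instance ID, e.g. aws:///us-east-1a/i-0abc...):

```shell
# Extract the EC2 instance ID from the node's provider ID
NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
INSTANCE_ID=$(kubectl get node "$NODE_NAME" \
  -o jsonpath='{.spec.providerID}' | awk -F/ '{print $NF}')

# Terminate it and watch Karpenter replace the lost capacity
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID" --region us-east-1
kubectl get nodes -w
```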


Step 6: Clean Up (Optional)

# Delete the cluster
eksctl delete cluster --name self-healing-cluster --region us-east-1

Summary

You’ve built a self-healing EKS cluster using only AWS tools:

  • Karpenter for auto-scaling (with Spot savings)

  • eksctl for repeatable cluster provisioning

  • CloudWatch Container Insights for monitoring

Next Steps:

  • Try vertical scaling (adjusting pod CPU/memory requests)

  • Explore Fargate for serverless Kubernetes

  • Set up alerts for scaling events


Key Takeaways

  1. Karpenter is faster than Cluster Autoscaler (seconds vs. minutes)

  2. Spot + ARM64 = Massive savings (up to 90% cost reduction)

  3. CloudWatch provides built-in observability (no third-party tools needed)
