Scaling the Heights: Building an EKS Cluster with StatefulSets, Real Monitoring, and Logging

Table of contents
- The Reality Check That Started Everything
- AWS EKS: The Foundation That Actually Matters
- PostgreSQL with CloudNativePG: StatefulSets Done Right
- Database Performance and Monitoring Integration
- Logging Infrastructure: EFK Stack That Actually Helps Debug Issues
- Advanced Features and Future Enhancements
- Closing Thoughts: Building Infrastructure That Actually Works

The Reality Check That Started Everything
Picture this: you confidently tell yourself, "Yeah, I'll just spin up a quick EKS cluster for database workloads. Should take maybe an hour, two tops."
Two days later, you're debugging StatefulSet storage classes, wrestling with PostgreSQL operator configurations, and somehow you've built what might be the most robust cloud infrastructure you've ever touched.
That's exactly what happened to me. What started as a "simple database deployment" turned into an adventure through the deepest corners of Kubernetes StatefulSets, high-availability PostgreSQL clusters, and observability stacks.
The end result? A platform that handles:
- PostgreSQL clusters that survive node failures without breaking a sweat
- Monitoring that actually catches problems before your users do
- Logging that makes debugging feel less like detective work
- Infrastructure that scales gracefully under real load
AWS EKS: The Foundation That Actually Matters
Setting up EKS sounds straightforward until you realize that "managed Kubernetes" still requires you to understand VPCs, security groups, IAM roles, and about fifty other AWS services that all need to play nicely together.
EKS clusters need specific subnet tags, proper route table configurations, and security groups that allow the right traffic while blocking everything else. The managed node groups need different permissions than the cluster itself.
The managed node groups are genuinely fantastic once configured properly. They handle node upgrades, security patches, and scaling automatically. But getting the initial configuration right - the instance types, scaling policies, and subnet placement - requires understanding how your workloads will actually behave under load.
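To make that concrete, here is roughly what the subnet tagging and node group scaling look like when expressed with the AWS CLI. This is purely illustrative: the real provisioning lives in the Terraform linked below, and the subnet IDs, cluster name, and node group name here are placeholders.
# Private subnets need these tags so EKS and its load balancer controller can discover them
aws ec2 create-tags --resources subnet-0123456789abcdef0 \
  --tags Key=kubernetes.io/role/internal-elb,Value=1 Key=kubernetes.io/cluster/my-eks-cluster,Value=shared
# Public subnets get the elb role tag instead, for internet-facing load balancers
aws ec2 create-tags --resources subnet-0fedcba9876543210 \
  --tags Key=kubernetes.io/role/elb,Value=1 Key=kubernetes.io/cluster/my-eks-cluster,Value=shared
# Managed node group scaling is configured separately from the cluster itself
aws eks update-nodegroup-config --cluster-name my-eks-cluster \
  --nodegroup-name default-nodegroup \
  --scaling-config minSize=2,maxSize=5,desiredSize=3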
After resource provisioning with Terraform, watching the cluster come alive is genuinely satisfying. Three worker nodes spread across availability zones, each running the full Kubernetes stack, all managed by AWS but under your complete control for workload scheduling.
Link to the Terraform for creating the cluster: GitHub
PostgreSQL with CloudNativePG: StatefulSets Done Right
This is where things got really interesting. My journey began the traditional way: meticulously writing YAML for StatefulSets, PersistentVolumeClaims, and PersistentVolumes. I was deep in the weeds of configuration, replication settings, and storage classes, and frankly, things weren't going well. It felt fragile, complex, and not something I'd trust in production. This is the classic PostgreSQL-on-Kubernetes experience, historically somewhere between "challenging" and "why would you do this to yourself."
That's when I discovered CloudNativePG, and it changed the entire equation. It's a PostgreSQL operator that understands how databases should actually work in Kubernetes environments.
Connecting to the created EKS Cluster
aws configure
# prompts for the access key, secret key, and default region
aws eks update-kubeconfig --region <aws region> --name <eks cluster name>
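If the kubeconfig update worked, a quick sanity check should show the cluster and its worker nodes:
kubectl cluster-info   # confirms the API server endpoint is reachable
kubectl get nodes      # the three worker nodes should all report Ready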
Installing the CloudNativePG Operator
First, you need to install the operator itself. This creates the custom resource definitions and controllers that will manage PostgreSQL clusters:
helm repo add cnpg https://cloudnative-pg.github.io/charts
helm repo update
kubectl create ns database
helm install cnpg --namespace database cnpg/cloudnative-pg
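Before creating any clusters, it is worth confirming that the operator pod is running and its custom resource definitions are registered:
# The operator runs as a deployment in the database namespace
kubectl get pods -n database
# CloudNativePG registers its CRDs under the cnpg.io API group
kubectl get crds | grep cnpg.io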
PostgreSQL Cluster Configuration
postgres-cluster.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgresql-cluster
  namespace: database
spec:
  instances: 3
  storage:
    size: 1Gi
  managed:
    roles:
      # Make the app user a superuser
      - name: app
        ensure: present
        comment: "Application user with superuser privileges"
        login: true
        superuser: true # Grant full superuser privileges
        inherit: true
        connectionLimit: -1
Then apply it: kubectl apply -f postgres-cluster.yaml
StatefulSet Behavior: CloudNativePG doesn't literally create a StatefulSet behind the scenes (the operator manages pods and PersistentVolumeClaims directly), but it gives you the same guarantees while handling all the complexity. Each PostgreSQL instance gets:
- Stable network identity (postgresql-cluster-1, postgresql-cluster-2, etc.)
- Persistent storage that survives pod restarts
- Ordered deployment and scaling
- Automatic DNS entries for service discovery
High Availability: With three instances, you get:
- One primary (read/write)
- Two replicas (read-only, automatic failover targets)
- Streaming replication configured automatically
- Automatic failover if the primary becomes unavailable
Each pod goes through the complete PostgreSQL initialization, replication setup, and health checks automatically.
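You can watch all of that happen from the command line. The commands below assume the cluster name from the manifest above; the last one needs the optional cnpg kubectl plugin installed:
# Pods come up in order: instance 1 initializes first, then 2 and 3 join as replicas
kubectl get pods -n database -w
# The Cluster resource reports readiness, the current primary, and instance status
kubectl get clusters.postgresql.cnpg.io -n database
# Optional: richer output (replication, certificates, backups) via the cnpg plugin
kubectl cnpg status postgresql-cluster -n database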
Secret Management and Access Credentials
CloudNativePG automatically generates secure credentials and stores them in Kubernetes secrets. This is infinitely better than hardcoded passwords or environment variables.
Getting the application credentials is straightforward once you know the secret names:
# Database password
kubectl get secret postgresql-cluster-app -n database -o jsonpath='{.data.password}' | base64 --decode
# Username
kubectl get secret postgresql-cluster-app -n database -o jsonpath='{.data.username}' | base64 --decode
# Database name
kubectl get secret postgresql-cluster-app -n database -o jsonpath='{.data.dbname}' | base64 --decode
Service Discovery and Connection Patterns
CloudNativePG creates several services automatically:
- postgresql-cluster-rw: Read/write service (connects to the primary)
- postgresql-cluster-ro: Read-only service (connects to replicas)
- postgresql-cluster-r: Read service (connects to any instance)
For testing purposes, I exposed the read/write service externally:
kubectl patch svc postgresql-cluster-rw -n database -p '{"spec": {"type": "LoadBalancer"}}'
Important note: This creates an internet-facing load balancer. In production, you'd use ClusterIP services and connect through private networking, service mesh, or VPN. But for development and testing, this approach lets you connect with any PostgreSQL client.
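As an example, here is one way to connect with a local psql client once the load balancer is up. The psql invocation is just an illustration (any client works), and it assumes the default application database and user, both named app:
# Grab the load balancer hostname and the app user's password
DB_HOST=$(kubectl get svc postgresql-cluster-rw -n database -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
DB_PASS=$(kubectl get secret postgresql-cluster-app -n database -o jsonpath='{.data.password}' | base64 --decode)
# Connect and run a quick query to confirm the cluster is serving traffic
PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -U app -d app -c 'SELECT version();'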
Connecting to the database using the VS Code extension
Database Performance and Monitoring Integration
The PostgreSQL configuration includes performance optimizations appropriate for containerized environments:
- Connection limits that work well with connection pooling
- Memory settings optimized for container resource limits
- WAL settings that balance performance and durability
- Statistics collection for query optimization
CloudNativePG also exposes PostgreSQL metrics automatically, which integrate beautifully with Prometheus.
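Both of those concerns live in the Cluster spec. The fragment below is a sketch rather than my exact configuration: the parameter values are placeholders you would tune to your node sizes, and enablePodMonitor only pays off once the Prometheus Operator from the next section is installed.
spec:
  postgresql:
    parameters:
      max_connections: "200"      # placeholder values, tune to your workload
      shared_buffers: "256MB"
      effective_cache_size: "768MB"
      wal_compression: "on"
  monitoring:
    enablePodMonitor: true        # lets Prometheus discover the built-in metrics exporter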
Monitoring Infrastructure: Prometheus and Grafana That Actually Work
Here's where the magic really happens. Good monitoring transforms a collection of services into a platform you can actually operate with confidence.
The kube-prometheus-stack Installation
The Prometheus community's kube-prometheus-stack Helm chart is genuinely impressive. It's everything you need for comprehensive Kubernetes monitoring in a single deployment:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=10Gi \
--set grafana.persistence.enabled=true \
--set grafana.persistence.size=5Gi \
--set grafana.adminPassword=admin123 \
--set grafana.service.type=LoadBalancer \
--set alertmanager.persistentVolume.size=5Gi \
--set prometheus.prometheusSpec.retention=15d \
--set prometheus.prometheusSpec.retentionSize=8GB
This single command deploys:
- Prometheus server with 10GB persistent storage and 15-day retention
- Grafana with persistent dashboards and LoadBalancer access
- AlertManager for handling alerts and notifications
- Node Exporter on every worker node for system metrics
- kube-state-metrics for Kubernetes API metrics
- Prometheus Operator for managing monitoring configurations
- Pre-configured ServiceMonitors for automatic metric discovery
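Once the release settles, a couple of commands confirm everything is healthy and reveal where Grafana is listening. Exact service names depend on the release name, so just read them off the service listing:
# All monitoring components should reach Running and Ready
kubectl get pods -n monitoring
# The Grafana service was set to LoadBalancer above; its EXTERNAL-IP column is the URL to open
kubectl get svc -n monitoring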
What Gets Monitored Automatically
The beauty of kube-prometheus-stack is the comprehensive monitoring it provides out of the box:
Cluster-level metrics:
- Node CPU, memory, disk, and network utilization
- Kubernetes API server performance
- etcd health and performance
- Container resource usage and limits
Application metrics:
- Pod CPU and memory consumption
- Container restart counts and reasons
- Service endpoint availability
- Ingress traffic and response times
Grafana Dashboards That Don't Suck
Accessing Grafana through the LoadBalancer reveals something genuinely impressive - dozens of pre-built dashboards that are actually useful:
Kubernetes Cluster Overview:
- Real-time cluster resource utilization
- Node health and capacity planning
- Pod distribution and scheduling efficiency
- Network traffic patterns across the cluster
Node-level Monitoring:
- Per-node CPU, memory, and disk metrics
- System load and process information
- Network interface statistics
- Hardware health indicators
Application Performance:
- Pod resource consumption over time
- Container restarts and error rates
Logging Infrastructure: EFK Stack That Actually Helps Debug Issues
Centralized logging in Kubernetes isn't optional - it's essential. When something breaks, you need to quickly find relevant log entries across dozens of pods.
EFK Stack Architecture
The EFK (Elasticsearch, Fluentd, Kibana) stack creates a complete log aggregation pipeline:
Fluentd: Runs as a DaemonSet on every node, collecting logs from:
- All container stdout/stderr
- Kubernetes audit logs
- Node system logs
- Application-specific log files
Elasticsearch: Stores, indexes, and makes logs searchable:
- Automatic index rotation and retention
- Full-text search across all log entries
- Time-based querying and filtering
- Aggregation and analysis capabilities
Kibana: Provides the interface for log exploration:
- Real-time log streaming
- Powerful query language
- Custom dashboards for common patterns
- Alert configuration for log-based triggers
EFK Deployment Strategy
The manifests I used are pushed to GitHub.
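At a high level, the deployment boils down to a handful of manifests applied into a dedicated namespace. The filenames below are illustrative placeholders; the real ones are in the repo:
kubectl create namespace logging
kubectl apply -n logging -f elasticsearch-statefulset.yaml   # Elasticsearch StatefulSet with persistent storage, plus its Service
kubectl apply -n logging -f kibana-deployment.yaml           # Kibana Deployment and Service
kubectl apply -n logging -f fluentd-daemonset.yaml           # Fluentd DaemonSet tailing /var/log/containers on every node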
Advanced Features and Future Enhancements
GitOps Implementation
ArgoCD integration enables:
- Automated deployments from Git repositories
- Configuration drift detection
- Rollback capabilities
- Multi-environment management
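A minimal sketch of what an ArgoCD Application for this cluster could look like; the app name, repository URL, and path are placeholders rather than my actual setup:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-manifests        # placeholder name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/infra-manifests.git   # placeholder repository
    targetRevision: main
    path: k8s
  destination:
    server: https://kubernetes.default.svc
    namespace: database
  syncPolicy:
    automated:
      prune: true      # remove resources that disappear from Git (drift correction)
      selfHeal: true   # revert manual changes back to the Git state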
Advanced Monitoring
Additional monitoring capabilities:
- Distributed tracing with Jaeger
- Log aggregation with custom parsers
- Custom metrics and alerting rules
- Capacity planning automation
Disaster Recovery
Implement cross-region capabilities:
- Database replication to secondary regions
- Configuration backup and restore procedures
- Automated failover processes
- Recovery time objective planning
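On the database side, CloudNativePG already ships the building blocks for this. A sketch of continuous WAL archiving and base backups to S3 (the bucket name and credentials secret are placeholders) would sit in the Cluster spec like this:
spec:
  backup:
    barmanObjectStore:
      destinationPath: s3://my-backup-bucket/postgresql-cluster   # placeholder bucket
      s3Credentials:
        accessKeyId:
          name: aws-backup-creds      # placeholder secret holding the AWS keys
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-backup-creds
          key: ACCESS_SECRET_KEY
      wal:
        compression: gzip
    retentionPolicy: "30d"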
Closing Thoughts: Building Infrastructure That Actually Works
This infrastructure project taught me that modern cloud-native tools, when properly configured and understood, can create incredibly robust and self-managing systems. The key insights:
Embrace the Platform: Don't fight Kubernetes patterns. Use operators, StatefulSets, and native features instead of trying to recreate traditional deployment patterns.
Observability is Essential: You can't operate what you can't see. Comprehensive monitoring and logging aren't nice-to-have features - they're fundamental requirements.
Automation Prevents Problems: Operators like CloudNativePG handle complex operational tasks better than manual processes. Let them do the heavy lifting.
Security and Performance Go Together: Proper resource limits, network policies, and access controls improve both security and performance.
The final result is infrastructure that I'd genuinely feel comfortable running production workloads on. It scales gracefully, survives failures elegantly, and provides the observability needed to operate it confidently.