Scaling the Heights: Building an EKS Cluster with StatefulSets, Real Monitoring, and Logging

Table of contents
- The Reality Check That Started Everything
- AWS EKS: The Foundation That Actually Matters
- PostgreSQL with CloudNativePG: StatefulSets Done Right
- Database Performance and Monitoring Integration
- Logging Infrastructure: EFK Stack That Actually Helps Debug Issues
- Advanced Features and Future Enhancements
- Closing Thoughts: Building Infrastructure That Actually Works

The Reality Check That Started Everything
Picture this: you confidently tell yourself, "Yeah, I'll just spin up a quick EKS cluster for database workloads. Should take maybe an hour, two tops."
Two days later, you're debugging StatefulSet storage classes, wrestling with PostgreSQL operator configurations, and somehow you've built what might be the most robust cloud infrastructure you've ever touched.
That's exactly what happened to me. What started as a "simple database deployment" turned into an adventure through the deepest corners of Kubernetes StatefulSets, high-availability PostgreSQL clusters, and observability stacks.
The end result? A platform that handles:
- PostgreSQL clusters that survive node failures without breaking a sweat
- Monitoring that actually catches problems before your users do
- Logging that makes debugging feel less like detective work
- Infrastructure that scales gracefully under real load
AWS EKS: The Foundation That Actually Matters
Setting up EKS sounds straightforward until you realize that "managed Kubernetes" still requires you to understand VPCs, security groups, IAM roles, and about fifty other AWS services that all need to play nicely together.
EKS clusters need specific subnet tags, proper route table configurations, and security groups that allow the right traffic while blocking everything else. The managed node groups need different permissions than the cluster itself.
The managed node groups are genuinely fantastic once configured properly. They handle node upgrades, security patches, and scaling automatically. But getting the initial configuration right - the instance types, scaling policies, and subnet placement - requires understanding how your workloads will actually behave under load.
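To make that concrete, here is roughly what the subnet tagging and node group scaling look like when expressed with the AWS CLI. This is purely illustrative: the real provisioning lives in the Terraform linked below, and the subnet IDs, cluster name, and node group name here are placeholders.
# Private subnets need these tags so EKS and its load balancer controller can discover them
aws ec2 create-tags --resources subnet-0123456789abcdef0 \
  --tags Key=kubernetes.io/role/internal-elb,Value=1 Key=kubernetes.io/cluster/my-eks-cluster,Value=shared
# Public subnets get the elb role tag instead, for internet-facing load balancers
aws ec2 create-tags --resources subnet-0fedcba9876543210 \
  --tags Key=kubernetes.io/role/elb,Value=1 Key=kubernetes.io/cluster/my-eks-cluster,Value=shared
# Managed node group scaling is configured separately from the cluster itself
aws eks update-nodegroup-config --cluster-name my-eks-cluster \
  --nodegroup-name default-nodegroup \
  --scaling-config minSize=2,maxSize=5,desiredSize=3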
After resource provisioning with Terraform, watching the cluster come alive is genuinely satisfying. Three worker nodes spread across availability zones, each running the full Kubernetes stack, all managed by AWS but under your complete control for workload scheduling.
Link to the Terraform for creating the cluster: GitHub
PostgreSQL with CloudNativePG: StatefulSets Done Right
This is where things got really interesting. My journey began the traditional way: meticulously writing YAML for StatefulSets, PersistentVolumeClaims, and PersistentVolumes. I was deep in the weeds of configuration, replication settings, and storage classes, and frankly, things weren't going well. It felt fragile, complex, and not something I'd trust in production. This is the classic PostgreSQL-on-Kubernetes experience, historically somewhere between "challenging" and "why would you do this to yourself."
That's when I discovered CloudNativePG, and it changed the entire equation. It's a PostgreSQL operator that understands how databases should actually work in Kubernetes environments.
Connecting to the created EKS Cluster
aws configure
# prompts for the access key, secret key, and default region
aws eks update-kubeconfig --region <aws region> --name <eks cluster name>
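If the kubeconfig update worked, a quick sanity check should show the cluster and its worker nodes:
kubectl cluster-info   # confirms the API server endpoint is reachable
kubectl get nodes      # the three worker nodes should all report Ready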
Installing the CloudNativePG Operator
First, you need to install the operator itself. This creates the custom resource definitions and controllers that will manage PostgreSQL clusters:
helm repo add cnpg https://cloudnative-pg.github.io/charts
helm repo update
kubectl create ns database
helm install cnpg --namespace database cnpg/cloudnative-pg
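Before creating any clusters, it is worth confirming that the operator pod is running and its custom resource definitions are registered:
# The operator runs as a deployment in the database namespace
kubectl get pods -n database
# CloudNativePG registers its CRDs under the cnpg.io API group
kubectl get crds | grep cnpg.io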
PostgreSQL Cluster Configuration
postgres-cluster.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgresql-cluster
  namespace: database
spec:
  instances: 3
  storage:
    size: 1Gi
  managed:
    roles:
      # Make the app user a superuser
      - name: app
        ensure: present
        comment: "Application user with superuser privileges"
        login: true
        superuser: true # Grant full superuser privileges
        inherit: true
        connectionLimit: -1
Then apply it: kubectl apply -f postgres-cluster.yaml
StatefulSet Behavior: CloudNativePG doesn't literally create a StatefulSet behind the scenes (the operator manages pods and PersistentVolumeClaims directly), but it gives you the same guarantees while handling all the complexity. Each PostgreSQL instance gets:
- Stable network identity (postgresql-cluster-1, postgresql-cluster-2, etc.)
- Persistent storage that survives pod restarts
- Ordered deployment and scaling
- Automatic DNS entries for service discovery
High Availability: With three instances, you get:
- One primary (read/write)
- Two replicas (read-only, automatic failover targets)
- Streaming replication configured automatically
- Automatic failover if the primary becomes unavailable
Each pod goes through the complete PostgreSQL initialization, replication setup, and health checks automatically.
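You can watch all of that happen from the command line. The commands below assume the cluster name from the manifest above; the last one needs the optional cnpg kubectl plugin installed:
# Pods come up in order: instance 1 initializes first, then 2 and 3 join as replicas
kubectl get pods -n database -w
# The Cluster resource reports readiness, the current primary, and instance status
kubectl get clusters.postgresql.cnpg.io -n database
# Optional: richer output (replication, certificates, backups) via the cnpg plugin
kubectl cnpg status postgresql-cluster -n database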
Secret Management and Access Credentials
CloudNativePG automatically generates secure credentials and stores them in Kubernetes secrets. This is infinitely better than hardcoded passwords or environment variables.
Getting the application credentials is straightforward once you know the secret names:
# Database password
kubectl get secret postgresql-cluster-app -n database -o jsonpath='{.data.password}' | base64 --decode
# Username
kubectl get secret postgresql-cluster-app -n database -o jsonpath='{.data.username}' | base64 --decode
# Database name
kubectl get secret postgresql-cluster-app -n database -o jsonpath='{.data.dbname}' | base64 --decode
Service Discovery and Connection Patterns
CloudNativePG creates several services automatically:
- postgresql-cluster-rw: Read/write service (connects to the primary)
- postgresql-cluster-ro: Read-only service (connects to replicas)
- postgresql-cluster-r: Read service (connects to any instance)
For testing purposes, I exposed the read/write service externally:
kubectl patch svc postgresql-cluster-rw -n database -p '{"spec": {"type": "LoadBalancer"}}'
Important note: This creates an internet-facing load balancer. In production, you'd use ClusterIP services and connect through private networking, service mesh, or VPN. But for development and testing, this approach lets you connect with any PostgreSQL client.
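As an example, here is one way to connect with a local psql client once the load balancer is up. The psql invocation is just an illustration (any client works), and it assumes the default application database and user, both named app:
# Grab the load balancer hostname and the app user's password
DB_HOST=$(kubectl get svc postgresql-cluster-rw -n database -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
DB_PASS=$(kubectl get secret postgresql-cluster-app -n database -o jsonpath='{.data.password}' | base64 --decode)
# Connect and run a quick query to confirm the cluster is serving traffic
PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -U app -d app -c 'SELECT version();'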
Connecting to the database using the VS Code extension
Database Performance and Monitoring Integration
The PostgreSQL configuration includes performance optimizations appropriate for containerized environments:
- Connection limits that work well with connection pooling
- Memory settings optimized for container resource limits
- WAL settings that balance performance and durability
- Statistics collection for query optimization
CloudNativePG also exposes PostgreSQL metrics automatically, which integrate beautifully with Prometheus.
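Both of those concerns live in the Cluster spec. The fragment below is a sketch rather than my exact configuration: the parameter values are placeholders you would tune to your node sizes, and enablePodMonitor only pays off once the Prometheus Operator from the next section is installed.
spec:
  postgresql:
    parameters:
      max_connections: "200"      # placeholder values, tune to your workload
      shared_buffers: "256MB"
      effective_cache_size: "768MB"
      wal_compression: "on"
  monitoring:
    enablePodMonitor: true        # lets Prometheus discover the built-in metrics exporter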
Monitoring Infrastructure: Prometheus and Grafana That Actually Work
Here's where the magic really happens. Good monitoring transforms a collection of services into a platform you can actually operate with confidence.
The kube-prometheus-stack Installation
The Prometheus community's kube-prometheus-stack Helm chart is genuinely impressive. It's everything you need for comprehensive Kubernetes monitoring in a single deployment:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=10Gi \
--set grafana.persistence.enabled=true \
--set grafana.persistence.size=5Gi \
--set grafana.adminPassword=admin123 \
--set grafana.service.type=LoadBalancer \
--set alertmanager.persistentVolume.size=5Gi \
--set prometheus.prometheusSpec.retention=15d \
--set prometheus.prometheusSpec.retentionSize=8GB
This single command deploys:
- Prometheus server with 10GB persistent storage and 15-day retention
- Grafana with persistent dashboards and LoadBalancer access
- AlertManager for handling alerts and notifications
- Node Exporter on every worker node for system metrics
- kube-state-metrics for Kubernetes API metrics
- Prometheus Operator for managing monitoring configurations
- Pre-configured ServiceMonitors for automatic metric discovery
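Once the release settles, a couple of commands confirm everything is healthy and reveal where Grafana is listening. Exact service names depend on the release name, so just read them off the service listing:
# All monitoring components should reach Running and Ready
kubectl get pods -n monitoring
# The Grafana service was set to LoadBalancer above; its EXTERNAL-IP column is the URL to open
kubectl get svc -n monitoring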
What Gets Monitored Automatically
The beauty of kube-prometheus-stack is the comprehensive monitoring it provides out of the box:
Cluster-level metrics:
- Node CPU, memory, disk, and network utilization
- Kubernetes API server performance
- etcd health and performance
- Container resource usage and limits
Application metrics:
- Pod CPU and memory consumption
- Container restart counts and reasons
- Service endpoint availability
- Ingress traffic and response times
Grafana Dashboards That Don't Suck
Accessing Grafana through the LoadBalancer reveals something genuinely impressive - dozens of pre-built dashboards that are actually useful:
Kubernetes Cluster Overview:
- Real-time cluster resource utilization
- Node health and capacity planning
- Pod distribution and scheduling efficiency
- Network traffic patterns across the cluster
Node-level Monitoring:
- Per-node CPU, memory, and disk metrics
- System load and process information
- Network interface statistics
- Hardware health indicators
Application Performance:
- Pod resource consumption over time
- Container restarts and error rates
Logging Infrastructure: EFK Stack That Actually Helps Debug Issues
Centralized logging in Kubernetes isn't optional - it's essential. When something breaks, you need to quickly find relevant log entries across dozens of pods.
EFK Stack Architecture
The EFK (Elasticsearch, Fluentd, Kibana) stack creates a complete log aggregation pipeline:
Fluentd: Runs as a DaemonSet on every node, collecting logs from:
- All container stdout/stderr
- Kubernetes audit logs
- Node system logs
- Application-specific log files
Elasticsearch: Stores, indexes, and makes logs searchable:
- Automatic index rotation and retention
- Full-text search across all log entries
- Time-based querying and filtering
- Aggregation and analysis capabilities
Kibana: Provides the interface for log exploration:
- Real-time log streaming
- Powerful query language
- Custom dashboards for common patterns
- Alert configuration for log-based triggers
EFK Deployment Strategy
The manifests I used are pushed to GitHub.
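At a high level, the deployment boils down to a handful of manifests applied into a dedicated namespace. The filenames below are illustrative placeholders; the real ones are in the repo:
kubectl create namespace logging
kubectl apply -n logging -f elasticsearch-statefulset.yaml   # Elasticsearch StatefulSet with persistent storage, plus its Service
kubectl apply -n logging -f kibana-deployment.yaml           # Kibana Deployment and Service
kubectl apply -n logging -f fluentd-daemonset.yaml           # Fluentd DaemonSet tailing /var/log/containers on every node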
Advanced Features and Future Enhancements
GitOps Implementation
ArgoCD integration enables:
- Automated deployments from Git repositories
- Configuration drift detection
- Rollback capabilities
- Multi-environment management
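A minimal sketch of what an ArgoCD Application for this cluster could look like; the app name, repository URL, and path are placeholders rather than my actual setup:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-manifests        # placeholder name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/infra-manifests.git   # placeholder repository
    targetRevision: main
    path: k8s
  destination:
    server: https://kubernetes.default.svc
    namespace: database
  syncPolicy:
    automated:
      prune: true      # remove resources that disappear from Git (drift correction)
      selfHeal: true   # revert manual changes back to the Git state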
Advanced Monitoring
Additional monitoring capabilities:
- Distributed tracing with Jaeger
- Log aggregation with custom parsers
- Custom metrics and alerting rules
- Capacity planning automation
Disaster Recovery
Implement cross-region capabilities:
- Database replication to secondary regions
- Configuration backup and restore procedures
- Automated failover processes
- Recovery time objective planning
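On the database side, CloudNativePG already ships the building blocks for this. A sketch of continuous WAL archiving and base backups to S3 (the bucket name and credentials secret are placeholders) would sit in the Cluster spec like this:
spec:
  backup:
    barmanObjectStore:
      destinationPath: s3://my-backup-bucket/postgresql-cluster   # placeholder bucket
      s3Credentials:
        accessKeyId:
          name: aws-backup-creds      # placeholder secret holding the AWS keys
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-backup-creds
          key: ACCESS_SECRET_KEY
      wal:
        compression: gzip
    retentionPolicy: "30d"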
Closing Thoughts: Building Infrastructure That Actually Works
This infrastructure project taught me that modern cloud-native tools, when properly configured and understood, can create incredibly robust and self-managing systems. The key insights:
Embrace the Platform: Don't fight Kubernetes patterns. Use operators, StatefulSets, and native features instead of trying to recreate traditional deployment patterns.
Observability is Essential: You can't operate what you can't see. Comprehensive monitoring and logging aren't nice-to-have features - they're fundamental requirements.
Automation Prevents Problems: Operators like CloudNativePG handle complex operational tasks better than manual processes. Let them do the heavy lifting.
Security and Performance Go Together: Proper resource limits, network policies, and access controls improve both security and performance.
The final result is infrastructure that I'd genuinely feel comfortable running production workloads on. It scales gracefully, survives failures elegantly, and provides the observability needed to operate it confidently.