My Journey Building a Production-Grade Kubernetes Home Lab

Table of contents
- "How Hard Could It Be?"
- My Current Setup: What I Built
- Why I Chose Each Component
- The GitOps Journey
- Monitoring Everything
- No More Port Forwarding Nightmares and Dynamic IP addresses
- The Backup Strategy That Saved Me
- My Deployment Workflow Now
- When Disaster Struck (And Recovery)
- What I Learned Along the Way
- What's Next for My Lab
- Closing Thoughts

"How Hard Could It Be?"
It started innocently enough. I wanted to learn Kubernetes properly. Fast forward, and I've somehow built something I describe as a "production-ready cluster with backups."
Fair warning: This got way more complex than I initially planned. But that's half the fun, right?
My Current Setup: What I Built
The Hardware/OS: Just a single Arch Linux server. Nothing fancy - 16GB RAM, decent CPU, and enough storage to not worry about it.
The Stack: K3s running everything from my personal projects to monitoring tools that would make SREs jealous.
External Access
Instead of dealing with port forwarding and dynamic IP headaches, everything flows through Cloudflare tunnels. Users hit my domain, Cloudflare routes it through an encrypted tunnel to my server. Zero open ports. Zero stress.
GitOps Core (Home Labs Need GitOps)
ArgoCD watches my GitHub repos and automatically deploys changes. I push to git, CI builds the image and updates the Helm chart, ArgoCD notices, and the deployment happens. It's like having a CI/CD pipeline that actually works.
Monitoring Stack
Prometheus scrapes metrics from everything, Grafana makes them pretty, and Uptime Kuma tells me when things break (usually at 3 AM, naturally).
The Backup Safety Net
Velero backs up everything to S3-compatible storage daily. I learned this was important the hard way, after I made small changes to the server and everything went down 🤦‍♂️
Why I Chose Each Component
K3s Over Full Kubernetes
K3s is Kubernetes without the operational nightmares:
- 60% smaller memory footprint than standard K8s
- Single binary installation (no more etcd headaches)
- Batteries included:
  - containerd instead of Docker
  - Traefik ingress controller
  - Local storage provider
ArgoCD for GitOps
I wanted to deploy things properly, not with kubectl apply
commands I'd repeat five minutes later. ArgoCD turned my messy deployment process into something resembling professional DevOps.
Cloudflare Tunnels
This was the biggest quality-of-life win. No more fighting with router configurations, no more worrying about exposing services to the internet. Cloudflare handles the heavy lifting.
Prometheus + Grafana
Started with "I should monitor this one service" and ended up with dashboards for everything. Now I know exactly when my server is having a bad day.
The GitOps Journey
The transformation from manual deployments to GitOps was... enlightening.
Before GitOps (the dark times):
# Me, every deployment:
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
# Wait, did I apply the right version?
# Quick, check what's running...
After GitOps:
# Just commit to git, ArgoCD handles the rest
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-awesome-app
# ... rest of the config lives in git
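That "rest of the config" is mostly a source, a destination, and a sync policy. A sketch of what the full Application tends to look like, with placeholder repo details:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-awesome-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<me>/app-manifest  # placeholder, not my real repo URL
    targetRevision: main
    path: .  # or the chart directory, depending on repo layout
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true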
Now my deployment process is:
1. Push code to GitHub
2. ArgoCD notices the change
3. Application updates automatically
4. I sleep peacefully
The config structure in my repo looks something like:
app-manifest/
├── templates/
│   ├── configmap.yaml
│   ├── deployment.yaml
│   ├── ingress.yaml
│   ├── service.yaml
│   ├── hpa.yaml
│   └── serviceaccount.yaml
├── Chart.yaml
└── values.yaml
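The values.yaml is where new image tags land when CI pushes a build, which is what ArgoCD then picks up. A minimal sketch of the layout (field names are illustrative, not my exact file):
# values.yaml (illustrative)
replicaCount: 2
image:
  repository: my-registry/app
  tag: v1.2.3  # CI bumps this on every build
  pullPolicy: IfNotPresent
service:
  port: 3000
ingress:
  enabled: true
  host: my-app.mydomain.com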
Monitoring Everything
The monitoring setup started simple and grew into something beautiful:
What Gets Monitored
Cluster health: Node resources, pod status, the usual suspects
Application metrics: Response times, error rates, business metrics
Infrastructure: Storage usage, network throughput, backup success
External: Website uptime
The Dashboard Addiction
I may have gone overboard with Grafana dashboards. There's something satisfying about seeing everything in neat graphs and knowing exactly what's happening.
The Prometheus configuration scrapes everything that moves.
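Roughly, it boils down to standard Kubernetes service discovery; a sketch of the idea (job names and relabeling rules are illustrative, not my exact file):
# prometheus.yml (illustrative excerpt)
scrape_configs:
  - job_name: kubernetes-nodes
    kubernetes_sd_configs:
      - role: node
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods that opt in via the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"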
No More Port Forwarding Nightmares and Dynamic IP addresses
The Cloudflare tunnel setup was a revelation. No more:
Fighting with router configurations
Worrying about exposing services to the internet
Dynamic IP address headaches
SSL certificate management
The tunnel configuration maps services to subdomains:
# Simplified tunnel config
ingress:
  - hostname: grafana.mydomain.com
    service: http://grafana.monitoring.svc:80
  - hostname: argocd.mydomain.com
    service: https://argocd-server.argocd.svc:443
  # ... more services
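For completeness, the connector itself is just cloudflared running on the server; one way to run it in-cluster looks roughly like this (namespace, secret name, and image tag are assumptions, not my exact manifest):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudflared
  namespace: networking
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cloudflared
  template:
    metadata:
      labels:
        app: cloudflared
    spec:
      containers:
        - name: cloudflared
          image: cloudflare/cloudflared:latest
          args: ["tunnel", "--no-autoupdate", "run"]
          env:
            # Tunnel token from the Cloudflare dashboard, stored as a Secret
            - name: TUNNEL_TOKEN
              valueFrom:
                secretKeyRef:
                  name: cloudflared-token
                  key: token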
Setting up a new service is now:
1. Deploy to Kubernetes
2. Add hostname to tunnel config
3. Update DNS record
4. Done!
The Backup Strategy That Saved Me
I learned about backup importance the hard way.
Velero: The Lifesaver
Velero backs up both Kubernetes resources AND persistent volume data. The daily schedule runs automatically:
# This runs every 24 hours
velero schedule create daily-backup \
  --schedule="@every 24h" \
  --include-namespaces='*'
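Because everything else lives in git, the same schedule can also be written declaratively as a Velero Schedule resource; a sketch, assuming the default velero namespace and Restic for volume data:
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "@every 24h"
  template:
    includedNamespaces:
      - "*"
    # Back up pod volumes with Restic, not just the manifests
    defaultVolumesToRestic: true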
What Gets Backed Up
All Kubernetes manifests (deployments, services, secrets)
Persistent volume data (using Restic)
Custom resources and configurations
The Recovery That Worked
When disaster struck, recovery was surprisingly smooth:
# List available backups
velero backup get
# Restore everything
velero restore create --from-backup daily-backup-20241205
My Deployment Workflow Now
1. Code and Configuration
# deployment.yaml (the important bits)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-new-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-new-app
  template:
    metadata:
      labels:
        app: my-new-app
    spec:
      containers:
        - name: app
          image: my-registry/app:v1.2.3
          ports:
            - containerPort: 3000
          # Health checks, resource limits, etc.
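The elided bit matters more than it looks. The kind of thing that sits under that container entry, with illustrative probe paths and numbers:
          # Continues the container spec above (path and limits are examples)
          readinessProbe:
            httpGet:
              path: /healthz
              port: 3000
            initialDelaySeconds: 5
          livenessProbe:
            httpGet:
              path: /healthz
              port: 3000
            periodSeconds: 10
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 256Mi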
2. External Access Setup
Add the service to my tunnel configuration:
- hostname: new-app.mydomain.com
  service: http://my-new-app-service.default.svc:3000
3. Let GitOps Handle It
git add .
git commit -m "Deploy new application v1.2.3"
git push origin main
# ArgoCD takes it from here
4. Monitor the Deployment
Watch it roll out in ArgoCD's UI, check the Grafana dashboards, and verify everything's healthy.
The whole process takes minutes instead of the error-prone manual steps from before.
When Disaster Struck (And Recovery)
Every home lab has its disasters. Mine came in the form of a failed SSD and a misconfigured update that took out half my cluster.
The Problem
Primary storage died (taking some persistent volumes with it)
A Kubernetes update went wrong
Several services were completely unavailable
I had about 6 hours to fix everything
The Recovery
1. Rebuilt the server with a fresh K3s installation
2. Reinstalled Velero with the same S3 credentials
3. Listed available backups (velero backup get)
4. Restored from the latest backup (velero restore create ...)
5. Waited 20 minutes while everything came back online
What I Learned
Automated backups are worth their weight in gold
Testing recovery procedures before you need them is smart
Having good monitoring means you know exactly what's going to break
GitOps makes rebuilding environments predictable
What I Learned Along the Way
Technical Lessons
Start simple, grow complexity gradually - I didn't build this overnight
Automation saves more time than you think - GitOps eliminates so many manual steps
Monitoring is addictive - Once you start, you want to monitor everything
Backups are boring until you need them - Test your recovery procedures
Operational Insights
Documentation matters - Future-me appreciates notes from past-me
Observability reduces stress - Knowing what's happening beats guessing
Infrastructure as code works - Being able to recreate everything from git is powerful
Security doesn't have to be complicated - Cloudflare tunnels eliminated so many attack vectors
Personal Growth
Building this has taught me a great deal about Kubernetes, networking, and operations. There's something special about running your own infrastructure.
What's Next for My Lab
The lab keeps evolving. Here's what's on my roadmap:
Short Term
Service mesh exploration - Istio or Linkerd for advanced traffic management
Better secret management - Moving beyond Kubernetes secrets
Medium Term
Multi-node cluster - Adding more hardware for true high availability
Infrastructure automation - Terraform for the underlying infrastructure
The Big Dreams
Machine learning workloads - GPU support for ML experiments
Advanced networking - Multi-cluster service mesh
Chaos engineering - Breaking things on purpose to improve resilience
Closing Thoughts
What started as "I want to learn Kubernetes" became a journey into modern infrastructure practices. I now have a home lab that:
Deploys applications like a proper DevOps environment
Monitors everything worth monitoring
Recovers from disasters quickly and predictably
Scales applications based on demand
Maintains security without complexity
The best part? It all runs on a single server in my home, yet follows enterprise-grade practices.
Tech Stack Summary:
Platform: K3s on Arch Linux
GitOps: ArgoCD with GitHub integration
Monitoring: Prometheus + Grafana + Loki + Uptime Kuma
Networking: Cloudflare Tunnels for secure access
Backup: Velero with S3-compatible storage
Applications: Various microservices and tools
The journey continues...