My Journey Building a Production-Grade Kubernetes Home Lab

Muthuri KE

"How Hard Could It Be?"

It started innocently enough. I wanted to learn Kubernetes properly. Fast forward, and I've somehow built something I can only describe as a "production-ready cluster with backups."

Fair warning: This got way more complex than I initially planned. But that's half the fun, right?


My Current Setup: What I Built


The Hardware/OS: Just a single Arch Linux server. Nothing fancy - 16GB RAM, decent CPU, and enough storage to not worry about it.

The Stack: K3s running everything from my personal projects to monitoring tools that would make SREs jealous.


External Access

Instead of dealing with port forwarding and dynamic IP headaches, everything flows through Cloudflare tunnels. Users hit my domain, Cloudflare routes it through an encrypted tunnel to my server. Zero open ports. Zero stress.


GitOps Core (Home Labs Need GitOps)

ArgoCD watches my GitHub repos and automatically deploys changes. I push to git, CI builds the image and updates the Helm chart, ArgoCD notices, and the deployment happens. It's like having a CI/CD pipeline that actually works.


Monitoring Stack

Prometheus scrapes metrics from everything, Grafana makes them pretty, and Uptime Kuma tells me when things break (usually at 3 AM, naturally).


The Backup Safety Net

Velero backs up everything to S3-compatible storage daily. I learned this was important the hard way, after I made small changes to the server and everything went down 🤦‍♂️.


Why I Chose Each Component

K3s Over Full Kubernetes

K3s is Kubernetes without the operational nightmares (install sketch after the list):

  • 60% smaller memory footprint than standard K8s

  • Single binary installation (no more etcd headaches)

  • Batteries included with:

    • Containerd instead of Docker

    • Traefik ingress controller

    • Local storage provider
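
Getting it running really is a one-liner. This is roughly what I ran; the extra flag just makes the kubeconfig readable without sudo, and your flags may differ:

# Install K3s as a single-node cluster (server + agent in one binary)
curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644

# Verify the node came up (kubectl is bundled; k3s kubectl works too)
kubectl get nodes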


ArgoCD for GitOps

I wanted to deploy things properly, not with kubectl apply commands I'd repeat five minutes later. ArgoCD turned my messy deployment process into something resembling professional DevOps.


Cloudflare Tunnels

This was the biggest quality-of-life win. No more fighting with router configurations, no more worrying about exposing services to the internet. Cloudflare handles the heavy lifting.


Prometheus + Grafana

Started with "I should monitor this one service" and ended up with dashboards for everything. Now I know exactly when my server is having a bad day.


The GitOps Journey

The transformation from manual deployments to GitOps was... enlightening.

Before GitOps (the dark times):

# Me, every deployment:
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
# Wait, did I apply the right version?
# Quick, check what's running...

After GitOps:

# Just commit to git, ArgoCD handles the rest
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-awesome-app
# ... rest of the config lives in git

Now my deployment process is (a rough sketch of the CI piece follows the list):

  1. Push code to GitHub

  2. ArgoCD notices the change

  3. Application updates automatically

  4. I sleep peacefully
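
The "CI builds the image and updates the Helm chart" part is just a small workflow in the repo. Mine is tied to my registry, but the shape is roughly this (the workflow name, registry, and values path are placeholders; registry login and push permissions are left out for brevity):

# .github/workflows/build.yaml (simplified sketch)
name: build-and-bump
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Build and push an image tagged with the commit SHA
      - run: |
          docker build -t my-registry/app:${{ github.sha }} .
          docker push my-registry/app:${{ github.sha }}
      # Bump the tag in the Helm values so ArgoCD sees a change to sync
      - run: |
          sed -i "s|tag:.*|tag: ${{ github.sha }}|" app-manifest/values.yaml
          git config user.name "ci-bot"
          git config user.email "ci-bot@users.noreply.github.com"
          git commit -am "ci: bump image tag to ${{ github.sha }}"
          git push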

The config structure in my repo looks something like:

app-manifest/
├── templates/
│   ├── configmap.yaml
│   ├── deployment.yaml
│   ├── ingress.yaml
│   ├── service.yaml
│   ├── hpa.yaml
│   └── serviceaccount.yaml
├── Chart.yaml
└── values.yaml

Monitoring Everything

The monitoring setup started simple and grew into something beautiful:

What Gets Monitored

  • Cluster health: Node resources, pod status, the usual suspects

  • Application metrics: Response times, error rates, business metrics

  • Infrastructure: Storage usage, network throughput, backup success

  • External: Website uptime


The Dashboard Addiction

I may have gone overboard with Grafana dashboards. There's something satisfying about seeing everything in neat graphs and knowing exactly what's happening.

The Prometheus configuration scrapes everything that moves.
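
I won't paste the whole file, but a trimmed-down, hand-rolled equivalent looks something like this (job names are illustrative, the kubelet job also needs TLS/bearer-token settings that are omitted here, and much of this comes for free with the kube-prometheus-stack chart):

# prometheus.yml (trimmed sketch)
global:
  scrape_interval: 30s
scrape_configs:
  # Node-level metrics from the kubelet/cAdvisor
  - job_name: kubernetes-nodes
    kubernetes_sd_configs:
      - role: node
  # Any pod that opts in via the prometheus.io/scrape annotation
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"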


No More Port Forwarding Nightmares or Dynamic IP Headaches

The Cloudflare tunnel setup was a revelation. No more:

  • Fighting with router configurations

  • Worrying about exposing services to the internet

  • Dynamic IP address headaches

  • SSL certificate management

The tunnel configuration maps services to subdomains:

# Simplified tunnel config
ingress:
  - hostname: grafana.mydomain.com
    service: http://grafana.monitoring.svc:80
  - hostname: argocd.mydomain.com
    service: https://argocd-server.argocd.svc:443
  # ... more services

Setting up a new service is now (commands sketched after the list):

  1. Deploy to Kubernetes

  2. Add hostname to tunnel config

  3. Update DNS record

  4. Done!
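
Step 3 is a single cloudflared command, and step 2 is just another entry in the ingress list shown above (the tunnel name "homelab" is an example):

# Point the new hostname at the tunnel (creates the CNAME in Cloudflare DNS)
cloudflared tunnel route dns homelab new-app.mydomain.com

# Then restart or reload cloudflared so it picks up the new ingress rule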


The Backup Strategy That Saved Me

I learned the importance of backups the hard way.

Velero: The Lifesaver

Velero backs up both Kubernetes resources AND persistent volume data. The daily schedule runs automatically:

# This runs every 24 hours
velero schedule create daily-backup \
  --schedule="@every 24h" \
  --include-namespaces='*'
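
For reference, the one-time install just points Velero at the bucket. Something like this, where the bucket, region, and s3Url are placeholders, and newer Velero releases use --use-node-agent instead of --use-restic:

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket my-homelab-backups \
  --secret-file ./credentials-velero \
  --use-restic \
  --backup-location-config region=auto,s3ForcePathStyle=true,s3Url=https://<s3-endpoint>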

What Gets Backed Up

  • All Kubernetes manifests (deployments, services, secrets)

  • Persistent volume data (using Restic)

  • Custom resources and configurations

The Recovery That Worked

When disaster struck, recovery was surprisingly smooth:

# List available backups
velero backup get

# Restore everything
velero restore create --from-backup daily-backup-20241205
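
The create command prints the generated restore name; I just watched it until it reported Completed (standard Velero commands, nothing custom):

# Check restore progress, warnings, and errors
velero restore describe <restore-name> --details
velero restore logs <restore-name>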

My Deployment Workflow Now

1. Code and Configuration

# deployment.yaml (the important bits)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-new-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-new-app
  template:
    metadata:
      labels:
        app: my-new-app
    spec:
      containers:
        - name: app
          image: my-registry/app:v1.2.3
          ports:
            - containerPort: 3000
          # Health checks, resource limits, etc.
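
The elided "health checks, resource limits, etc." bit is boilerplate I reuse everywhere; it usually looks something like this (the /healthz path and the numbers are illustrative):

          # (continues the container spec above)
          readinessProbe:
            httpGet:
              path: /healthz
              port: 3000
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 256Mi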

2. External Access Setup

Add the service to my tunnel configuration:

- hostname: new-app.mydomain.com
  service: http://my-new-app-service.default.svc:3000

3. Let GitOps Handle It

git add .
git commit -m "Deploy new application v1.2.3"
git push origin main
# ArgoCD takes it from here

4. Monitor the Deployment

Watch it roll out in ArgoCD's UI, check the Grafana dashboards, and verify everything's healthy.
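
Concretely, "verify everything's healthy" is usually just a couple of commands (argocd here is the ArgoCD CLI, and the app name matches the deployment above):

# Sync status and health from ArgoCD's point of view
argocd app get my-new-app

# Rollout status straight from the cluster
kubectl rollout status deployment/my-new-app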

The whole process takes minutes instead of the error-prone manual steps from before.


When Disaster Struck (And Recovery)

Every home lab has its disasters. Mine came in the form of a failed SSD and a misconfigured update that took out half my cluster.

The Problem

  • Primary storage died (taking some persistent volumes with it)

  • A Kubernetes update went wrong

  • Several services were completely unavailable

  • I had about 6 hours to fix everything

The Recovery

  1. Rebuilt the server with a fresh K3s installation

  2. Reinstalled Velero with the same S3 credentials

  3. Listed available backups (velero backup get)

  4. Restored from the latest backup (velero restore create...)

  5. Waited 20 minutes while everything came back online

What I Learned

  • Automated backups are worth their weight in gold

  • Testing recovery procedures before you need them is smart

  • Having good monitoring means you know exactly what's going to break

  • GitOps makes rebuilding environments predictable


What I Learned Along the Way

Technical Lessons

  • Start simple, grow complexity gradually - I didn't build this overnight

  • Automation saves more time than you think - GitOps eliminates so many manual steps

  • Monitoring is addictive - Once you start, you want to monitor everything

  • Backups are boring until you need them - Test your recovery procedures

Operational Insights

  • Documentation matters - Future-me appreciates notes from past-me

  • Observability reduces stress - Knowing what's happening beats guessing

  • Infrastructure as code works - Being able to recreate everything from git is powerful

  • Security doesn't have to be complicated - Cloudflare tunnels eliminated so many attack vectors

Personal Growth

Building this taught me more about Kubernetes, networking, and operations than I expected. There's something special about running your own infrastructure.


What's Next for My Lab

The lab keeps evolving. Here's what's on my roadmap:

Short Term

  • Service mesh exploration - Istio or Linkerd for advanced traffic management

  • Better secret management - Moving beyond Kubernetes secrets

Medium Term

  • Multi-node cluster - Adding more hardware for true high availability

  • Infrastructure automation - Terraform for the underlying infrastructure

The Big Dreams

  • Machine learning workloads - GPU support for ML experiments

  • Advanced networking - Multi-cluster service mesh

  • Chaos engineering - Breaking things on purpose to improve resilience


Closing Thoughts

What started as "I want to learn Kubernetes" became a journey into modern infrastructure practices. I now have a home lab that:

  • Deploys applications like a proper DevOps environment

  • Monitors everything worth monitoring

  • Recovers from disasters thanks to daily, automated backups

  • Scales applications based on demand

  • Maintains security without complexity

The best part? It all runs on a single server in my home, yet follows enterprise-grade practices.


Tech Stack Summary:

  • Platform: K3s on Arch Linux

  • GitOps: ArgoCD with GitHub integration

  • Monitoring: Prometheus + Grafana + Loki + Uptime Kuma

  • Networking: Cloudflare Tunnels for secure access

  • Backup: Velero with S3-compatible storage

  • Applications: Various microservices and tools

The journey continues...
