My Journey Building a Production-Grade Kubernetes Home Lab

Muthuri KE

"How Hard Could It Be?"

It started innocently enough. I wanted to learn Kubernetes properly. Fast forward, and I've somehow built something I can only describe as a "production-ready cluster with backups."

Fair warning: This got way more complex than I initially planned. But that's half the fun, right?


My Current Setup: What I Built


The Hardware/OS: Just a single Arch Linux server. Nothing fancy - 16GB RAM, decent CPU, and enough storage to not worry about it.

The Stack: K3s running everything from my personal projects to monitoring tools that would make SREs jealous.


External Access

Instead of dealing with port forwarding and dynamic IP headaches, everything flows through Cloudflare tunnels. Users hit my domain, Cloudflare routes it through an encrypted tunnel to my server. Zero open ports. Zero stress.


GitOps Core (Home Labs Need GitOps)

ArgoCD watches my GitHub repos and automatically deploys changes. I push to git, CI builds the image and updates the Helm chart, ArgoCD notices, and the deployment happens. It's like having a CI/CD pipeline that actually works.


Monitoring Stack

Prometheus scrapes metrics from everything, Grafana makes them pretty, and Uptime Kuma tells me when things break (usually at 3 AM, naturally).


The Backup Safety Net

Velero backs up everything to S3-compatible storage daily. I learned this was important the hard way, after I made small changes to the server and everything went down 🤦‍♂️.


Why I Chose Each Component

K3s Over Full Kubernetes

K3s is Kubernetes without the operational nightmares (install sketch after the list):

  • 60% smaller memory footprint than standard K8s

  • Single binary installation (no more etcd headaches)

  • Batteries included with:

    • Containerd instead of Docker

    • Traefik ingress controller

    • Local storage provider
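
Getting it running really is a one-liner. This is roughly what I ran; the extra flag just makes the kubeconfig readable without sudo, and your flags may differ:

# Install K3s as a single-node cluster (server + agent in one binary)
curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644

# Verify the node came up (kubectl is bundled; k3s kubectl works too)
kubectl get nodes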


ArgoCD for GitOps

I wanted to deploy things properly, not with kubectl apply commands I'd repeat five minutes later. ArgoCD turned my messy deployment process into something resembling professional DevOps.


Cloudflare Tunnels

This was the biggest quality-of-life win. No more fighting with router configurations, no more worrying about exposing services to the internet. Cloudflare handles the heavy lifting.


Prometheus + Grafana

Started with "I should monitor this one service" and ended up with dashboards for everything. Now I know exactly when my server is having a bad day.


The GitOps Journey

The transformation from manual deployments to GitOps was... enlightening.

Before GitOps (the dark times):

# Me, every deployment:
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
# Wait, did I apply the right version?
# Quick, check what's running...

After GitOps:

# Just commit to git, ArgoCD handles the rest
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-awesome-app
# ... rest of the config lives in git

Now my deployment process is (a rough sketch of the CI piece follows the list):

  1. Push code to GitHub

  2. ArgoCD notices the change

  3. Application updates automatically

  4. I sleep peacefully
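
The "CI builds the image and updates the Helm chart" part is just a small workflow in the repo. Mine is tied to my registry, but the shape is roughly this (the workflow name, registry, and values path are placeholders; registry login and push permissions are left out for brevity):

# .github/workflows/build.yaml (simplified sketch)
name: build-and-bump
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Build and push an image tagged with the commit SHA
      - run: |
          docker build -t my-registry/app:${{ github.sha }} .
          docker push my-registry/app:${{ github.sha }}
      # Bump the tag in the Helm values so ArgoCD sees a change to sync
      - run: |
          sed -i "s|tag:.*|tag: ${{ github.sha }}|" app-manifest/values.yaml
          git config user.name "ci-bot"
          git config user.email "ci-bot@users.noreply.github.com"
          git commit -am "ci: bump image tag to ${{ github.sha }}"
          git push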

The config structure in my repo looks something like:

app-manifest/
├── templates/
│   ├── configmap.yaml
│   ├── deployment.yaml
│   ├── ingress.yaml
│   ├── service.yaml
│   ├── hpa.yaml
│   └── serviceaccount.yaml
├── Chart.yaml
└── values.yaml

Monitoring Everything

The monitoring setup started simple and grew into something beautiful:

What Gets Monitored

  • Cluster health: Node resources, pod status, the usual suspects

  • Application metrics: Response times, error rates, business metrics

  • Infrastructure: Storage usage, network throughput, backup success

  • External: Website uptime


The Dashboard Addiction

I may have gone overboard with Grafana dashboards. There's something satisfying about seeing everything in neat graphs and knowing exactly what's happening.

The Prometheus configuration scrapes everything that moves.
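
I won't paste the whole file, but a trimmed-down, hand-rolled equivalent looks something like this (job names are illustrative, the kubelet job also needs TLS/bearer-token settings that are omitted here, and much of this comes for free with the kube-prometheus-stack chart):

# prometheus.yml (trimmed sketch)
global:
  scrape_interval: 30s
scrape_configs:
  # Node-level metrics from the kubelet/cAdvisor
  - job_name: kubernetes-nodes
    kubernetes_sd_configs:
      - role: node
  # Any pod that opts in via the prometheus.io/scrape annotation
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"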


No More Port Forwarding Nightmares or Dynamic IP Headaches

The Cloudflare tunnel setup was a revelation. No more:

  • Fighting with router configurations

  • Worrying about exposing services to the internet

  • Dynamic IP address headaches

  • SSL certificate management

The tunnel configuration maps services to subdomains:

# Simplified tunnel config
ingress:
  - hostname: grafana.mydomain.com
    service: http://grafana.monitoring.svc:80
  - hostname: argocd.mydomain.com
    service: https://argocd-server.argocd.svc:443
  # ... more services

Setting up a new service is now (commands sketched after the list):

  1. Deploy to Kubernetes

  2. Add hostname to tunnel config

  3. Update DNS record

  4. Done!
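
Step 3 is a single cloudflared command, and step 2 is just another entry in the ingress list shown above (the tunnel name "homelab" is an example):

# Point the new hostname at the tunnel (creates the CNAME in Cloudflare DNS)
cloudflared tunnel route dns homelab new-app.mydomain.com

# Then restart or reload cloudflared so it picks up the new ingress rule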


The Backup Strategy That Saved Me

I learned the importance of backups the hard way.

Velero: The Lifesaver

Velero backs up both Kubernetes resources AND persistent volume data. The daily schedule runs automatically:

# This runs every 24 hours
velero schedule create daily-backup \
  --schedule="@every 24h" \
  --include-namespaces='*'
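
For reference, the one-time install just points Velero at the bucket. Something like this, where the bucket, region, and s3Url are placeholders, and newer Velero releases use --use-node-agent instead of --use-restic:

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket my-homelab-backups \
  --secret-file ./credentials-velero \
  --use-restic \
  --backup-location-config region=auto,s3ForcePathStyle=true,s3Url=https://<s3-endpoint>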

What Gets Backed Up

  • All Kubernetes manifests (deployments, services, secrets)

  • Persistent volume data (using Restic)

  • Custom resources and configurations

The Recovery That Worked

When disaster struck, recovery was surprisingly smooth:

# List available backups
velero backup get

# Restore everything
velero restore create --from-backup daily-backup-20241205
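
The create command prints the generated restore name; I just watched it until it reported Completed (standard Velero commands, nothing custom):

# Check restore progress, warnings, and errors
velero restore describe <restore-name> --details
velero restore logs <restore-name>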

My Deployment Workflow Now

1. Code and Configuration

# deployment.yaml (the important bits)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-new-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-new-app
  template:
    metadata:
      labels:
        app: my-new-app
    spec:
      containers:
        - name: app
          image: my-registry/app:v1.2.3
          ports:
            - containerPort: 3000
          # Health checks, resource limits, etc.
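
The elided "health checks, resource limits, etc." bit is boilerplate I reuse everywhere; it usually looks something like this (the /healthz path and the numbers are illustrative):

          # (continues the container spec above)
          readinessProbe:
            httpGet:
              path: /healthz
              port: 3000
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 256Mi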

2. External Access Setup

Add the service to my tunnel configuration:

- hostname: new-app.mydomain.com
  service: http://my-new-app-service.default.svc:3000

3. Let GitOps Handle It

git add .
git commit -m "Deploy new application v1.2.3"
git push origin main
# ArgoCD takes it from here

4. Monitor the Deployment

Watch it roll out in ArgoCD's UI, check the Grafana dashboards, and verify everything's healthy.
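
Concretely, "verify everything's healthy" is usually just a couple of commands (argocd here is the ArgoCD CLI, and the app name matches the deployment above):

# Sync status and health from ArgoCD's point of view
argocd app get my-new-app

# Rollout status straight from the cluster
kubectl rollout status deployment/my-new-app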

The whole process takes minutes instead of the error-prone manual steps from before.


When Disaster Struck (And Recovery)

Every home lab has its disasters. Mine came in the form of a failed SSD and a misconfigured update that took out half my cluster.

The Problem

  • Primary storage died (taking some persistent volumes with it)

  • A Kubernetes update went wrong

  • Several services were completely unavailable

  • I had about 6 hours to fix everything

The Recovery

  1. Rebuilt the server with a fresh K3s installation

  2. Reinstalled Velero with the same S3 credentials

  3. Listed available backups (velero backup get)

  4. Restored from the latest backup (velero restore create...)

  5. Waited 20 minutes while everything came back online

What I Learned

  • Automated backups are worth their weight in gold

  • Testing recovery procedures before you need them is smart

  • Having good monitoring means you know exactly what's going to break

  • GitOps makes rebuilding environments predictable


What I Learned Along the Way

Technical Lessons

  • Start simple, grow complexity gradually - I didn't build this overnight

  • Automation saves more time than you think - GitOps eliminates so many manual steps

  • Monitoring is addictive - Once you start, you want to monitor everything

  • Backups are boring until you need them - Test your recovery procedures

Operational Insights

  • Documentation matters - Future-me appreciates notes from past-me

  • Observability reduces stress - Knowing what's happening beats guessing

  • Infrastructure as code works - Being able to recreate everything from git is powerful

  • Security doesn't have to be complicated - Cloudflare tunnels eliminated so many attack vectors

Personal Growth

Building this taught me more about Kubernetes, networking, and operations than I expected. There's something special about running your own infrastructure.


What's Next for My Lab

The lab keeps evolving. Here's what's on my roadmap:

Short Term

  • Service mesh exploration - Istio or Linkerd for advanced traffic management

  • Better secret management - Moving beyond Kubernetes secrets

Medium Term

  • Multi-node cluster - Adding more hardware for true high availability

  • Infrastructure automation - Terraform for the underlying infrastructure

The Big Dreams

  • Machine learning workloads - GPU support for ML experiments

  • Advanced networking - Multi-cluster service mesh

  • Chaos engineering - Breaking things on purpose to improve resilience


Closing Thoughts

What started as "I want to learn Kubernetes" became a journey into modern infrastructure practices. I now have a home lab that:

  • Deploys applications like a proper DevOps environment

  • Monitors everything worth monitoring

  • Recovers from disasters thanks to daily, automated backups

  • Scales applications based on demand

  • Maintains security without complexity

The best part? It all runs on a single server in my home, yet follows enterprise-grade practices.


Tech Stack Summary:

  • Platform: K3s on Arch Linux

  • GitOps: ArgoCD with GitHub integration

  • Monitoring: Prometheus + Grafana + Loki + Uptime Kuma

  • Networking: Cloudflare Tunnels for secure access

  • Backup: Velero with S3-compatible storage

  • Applications: Various microservices and tools

The journey continues...
