Bridging the Gap: GitOps for Network Engineers - Part 2

Jeffrey Lyon
33 min read

ArgoCD Is Amazing—But Let’s Make It Do Something!

Intro

In Part 1, we laid the foundation by installing ArgoCD and setting up the basic structure for a GitOps-driven platform. If you've followed along, you should now have a working Kubernetes cluster, ArgoCD deployed and accessible, and your first project created in the UI.

Now it's time to turn that foundation into something usable.

In Part 2, we'll start deploying the critical infrastructure pieces that power everything else. That includes MetalLB for external load balancing, Traefik for ingress, persistent storage using Rook + Ceph, and secrets management with External Secrets and HashiCorp Vault. All of these will be deployed through ArgoCD, GitOps-style.

We’ll kick things off with MetalLB, which enables us to expose services outside the cluster, an essential first step in making your platform actually accessible. Let’s get into it.

MetalLB: Load Balancing for Bare Metal and Home Labs

If you're running Kubernetes in a cloud environment, you typically get a load balancer as part of the package, something like an AWS ELB or an Azure Load Balancer that magically routes traffic to your services. But when you're running on bare metal, in a lab, or on-prem (which, let’s be real, a lot of network engineers are), you're on your own. That's where MetalLB comes in.

What is MetalLB?

MetalLB is a load balancer implementation for Kubernetes clusters that don’t have access to cloud-native load balancer resources. It allows you to assign external IP addresses to your Kubernetes services so that they can be accessed from outside the cluster, exactly what you'd expect from a "real" load balancer, just built for the DIY crowd.

Why You Need It

In any Kubernetes-based GitOps platform, exposing services to the outside world is non-negotiable. Whether it’s ArgoCD, Traefik, Vault, or any of your network automation tools, they all need to be reachable by users, APIs, or other systems. While NodePorts can get the job done in a lab, they’re clunky, inconsistent, and definitely not production-grade.

MetalLB solves this by handling Service type: LoadBalancer in environments where a cloud load balancer doesn’t exist, like bare metal or your home lab. You define a pool of IP addresses from your local network, and MetalLB assigns those IPs to services that request them.

Here’s where the networking magic comes in: MetalLB (when running in Layer 2 mode) announces those external IPs using ARP. If a device outside the cluster ARPs for an exposed service IP, MetalLB replies with the MAC address of the node currently announcing that IP, and Kubernetes forwards the traffic on to the right pods from there. It’s simple, reliable, and doesn’t require BGP or complex router configs.

So when a LoadBalancer service is created, for example, to expose ArgoCD or Traefik, MetalLB makes that service’s external IP reachable from anywhere on your local network, just like a real load balancer would in a cloud environment.
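If you haven’t worked with LoadBalancer services before, here’s a minimal, purely illustrative example (the name, selector, and ports are made up). With MetalLB running and an address pool defined, a Service like this automatically picks up an external IP from that pool:

apiVersion: v1
kind: Service
metadata:
  name: example-web
spec:
  type: LoadBalancer      # MetalLB assigns the external IP for this service
  selector:
    app: example-web
  ports:
    - port: 80
      targetPort: 8080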

How It Powers the Platform

MetalLB becomes one of the core enablers of our GitOps stack. It allows you to:

  • Expose ArgoCD with a proper external IP

  • Route external traffic to Traefik, our ingress controller

  • Provide consistent access to internal services that need to be reachable from your network

  • Maintain a production-like networking experience, even in a lab or homelab environment

Without MetalLB, you’d either be stuck manually forwarding ports, messing with iptables, or leaning on NodePorts. With it, your platform starts acting like it belongs in a real, routable network, and that’s exactly what we want.

Now that we understand what MetalLB does and how it fits into the big picture, let’s deploy it the GitOps way, starting with adding the Helm chart repository to our config.

Quick Review: Helm Charts and How They Fit into ArgoCD

Before we deploy MetalLB, let’s quickly go over how Helm works, especially how it integrates with ArgoCD.

Helm is a package manager for Kubernetes. Instead of manually writing and applying a bunch of YAML files, Helm lets you deploy versioned, configurable "charts": pre-packaged bundles of Kubernetes manifests that define an application. These charts live in remote Helm repositories, similar to how apt or yum fetch packages on a Linux system.

In a GitOps workflow, Helm charts are referenced as part of an ArgoCD Application manifest, specifically as a source. ArgoCD uses this source definition to pull the chart directly from the repo, apply any custom values.yaml overrides you’ve stored in Git, and deploy everything into your cluster automatically.

Using the MetalLB Helm Chart with ArgoCD

The official MetalLB Helm chart is hosted at:

https://metallb.github.io/metallb

When creating your ArgoCD Application, one of your sources will look like this:
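Roughly like this (the chart version shown matches the full manifest later in this post; pin whatever version you actually intend to run):

source:
  repoURL: https://metallb.github.io/metallb
  chart: metallb
  targetRevision: 0.14.9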

ArgoCD will then treat this Helm chart as part of the desired state. It will sync the chart, merge in your values (if you’re overriding anything), and deploy MetalLB as part of your platform, all driven from Git.

MetalLB Installation

These initial steps, adding the Helm repo or other base sources, creating the app in ArgoCD, and wiring up the basic Helm configuration, are mostly the same for every application we’ll deploy. Because of that, I’ll only walk through this process in detail once (here), and only call out major differences for other apps later in the post. Screenshots are included below where it helps, but once you’ve done it once, you’ll be able to rinse and repeat for everything else.

Step 1: Add the Helm Repo

ArgoCD needs to know where to fetch the Helm chart from. For MetalLB, we’ll be using the Github-hosted chart:

  • Helm Repo URL:
    https://metallb.github.io/metallb

In the ArgoCD UI:

  • Go to Settings → Repositories

  • Click + CONNECT REPO

  • Enter the Helm repo URL

  • Choose Helm as the type

  • Give the repo a name (Optional)

  • Choose the project you created earlier to associate this repo with (mine was ‘prod-home’)

  • No authentication is needed for this public repo

  • Once done, click CONNECT

Once added, ArgoCD can now pull charts from this source.

Note: You’ll also need to add the GitHub repo that contains your custom configuration files, like Helm values.yml files and Kustomize overlays.

  • If you're using my example repo, add https://github.com/leothelyon17/kubernetes-gitops-playground.git as another source, of type Git.

  • If you're using your own repo, just make sure it's added in the same way so ArgoCD can pull your values and overlays when syncing.

Step 2: Create the ArgoCD Application

Head to the Applications tab and click + NEW APP to start the deployment.

Here’s how to fill it out:

  • Application Name: metallb

  • Project: Select your project (e.g., lab-home)

  • Sync Policy: Manual for now (we’ll automate later)

  • Repository URL: Select the Helm repo you just added

  • Chart Name: metallb

  • Target Revision: Use the latest or specify a version (recommended once things are stable)

  • Cluster URL: Use https://kubernetes.default.svc if deploying to the same cluster (mine may differ from the default; don’t worry)

  • Namespace: metallb-system (check to create it if it doesn’t exist)

Click CREATE when finished.

If everything is in order, you should see the app created like the screenshot below, though yours will show an all-yellow ‘OutOfSync’ status -

Click into the app and you’ll see that ArgoCD has pulled in all the Kubernetes objects defined by the Helm chart. Everything will show as OutOfSync for now; ArgoCD knows what needs to be deployed, but we’re not quite ready to hit sync just yet. You’re doing great; let’s move on to the next step.

Step 3: Add the Kustomize Configuration Layer

For MetalLB, we’re keeping things straightforward (kind of): the Helm chart gets deployed using its default values, no need to touch values.yml here. But MetalLB still needs to be told how to operate: what IP ranges it can assign, and how it should advertise them. We handle that using a second source: a Kustomize overlay.

Here’s what to do next:

  1. In the ArgoCD UI, go to the Application you just created for MetalLB.

  2. Click the App details (🖉 edit) icon in the top right to open the manifest editor.

  3. Scroll down to the source section.

  4. You’ll now be editing this app to include a second source.

Update the manifest to include a second source for the Kustomize overlay containing your MetalLB custom resources. The result will look something like this:

project: prod-home
destination:
  server: https://prod-kube-vip.jjland.local:6443
  namespace: metallb-prod
syncPolicy:
  syncOptions:
    - CreateNamespace=true
sources:
  - repoURL: https://metallb.github.io/metallb
    targetRevision: 0.14.9
    chart: metallb
  - repoURL: https://github.com/leothelyon17/kubernetes-gitops-playground.git
    path: apps/metallb/overlays/lab
    targetRevision: HEAD

NOTE: ‘source’ needs to be changed to ‘sources’, as there is now more than one source.

This tells ArgoCD to deploy not just the Helm chart, but also the additional Kubernetes objects (like IPAddressPool and L2Advertisement) defined in your overlay. These are located in your apps/metallb directory and should include a kustomization.yml that pulls them together.

Once saved, ArgoCD will treat both the Helm install and the Kustomize overlay as part of the same application, and sync them together.

Step 4: Sync the App

Once everything looks good, hit Sync. ArgoCD will pull the chart, merge/build your kustomize files, and deploy MetalLB into the cluster.

You can click into the app to watch MetalLB’s resources come online; Deployments, ConfigMaps, the speaker DaemonSet, and more. If the sync fails on the first try, don’t panic, just retry it. This can happen if the chart includes CRDs (Custom Resource Definitions), which sometimes cause the sync to complete out of order while the CRDs are still registering.

Once things settle, you should see the application status show “Healthy” and “Synced”. You’ll also see multiple healthy MetalLB pods running in your cluster, just like the screenshot above.

Congrats! MetalLB is now deployed and ready to hand out external IPs like a proper load balancer.

MetalLB Custom Configuration

I wanted to provide a breakdown of the custom MetalLB files I’m using and why. This directory contains a Kustomize overlay used to deploy MetalLB's custom configuration in a lab environment. It layers environment-specific resources, like IP pools and advertisements, on top of the base Helm chart deployment, following GitOps best practices.

File Breakdown

ip-address-pool.yml

Defines an IPAddressPool custom resource (a minimal example follows this list):

  • Specifies a range of IP addresses MetalLB can assign to LoadBalancer services

  • Ensures services are reachable from the local network

  • Helps avoid IP conflicts in your lab environment
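A minimal sketch of that resource, with placeholder names and an example range (use the namespace you deployed MetalLB into and a free range on your own network):

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: prod-pool
  namespace: metallb-prod
spec:
  addresses:
    - 172.16.99.30-172.16.99.40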

l2-advertisement.yml

Defines an L2Advertisement custom resource:

  • Tells MetalLB to advertise the IPs via Layer 2 (e.g., ARP)

  • Perfect for home labs and bare metal where BGP isn’t in use

  • Allows MetalLB to function like a basic network-aware load balancer
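And a matching sketch of the L2Advertisement (again, names are placeholders). If ipAddressPools is omitted, MetalLB advertises every pool, but referencing the pool explicitly keeps the intent obvious:

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2-advertisement
  namespace: metallb-prod
spec:
  ipAddressPools:
    - prod-pool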

kustomization.yml

Kustomize overlay file:

  • Combines and applies the above resources

  • Enables clean separation between base and environment-specific config

  • Keeps your repo organized and scalable
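Assuming the file names above, the overlay’s kustomization.yml is just a short list:

resources:
  - ip-address-pool.yml
  - l2-advertisement.yml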

Why It Matters

This overlay is what makes MetalLB actually work in your lab. While the Helm chart installs the MetalLB controller and speaker pods, these custom resources tell MetalLB what IPs to use and how to announce them to your network.

By keeping these files in Git and applying them via ArgoCD, you’re not just deploying MetalLB, you’re making your configuration declarative, version-controlled, and repeatable across environments.

Moving on…

Traefik: Ingress Routing Built for GitOps

Once MetalLB is in place and capable of handing out external IPs, we need something that can route incoming HTTP and HTTPS traffic to the right service inside the cluster. That’s where an ingress controller comes in, and for our GitOps setup, Traefik is a perfect fit.

What is Traefik?

Traefik is a modern, Kubernetes-native ingress controller that handles routing external traffic into your cluster based on rules you define in Kubernetes. It supports things like:

  • Routing traffic based on hostname or path

  • TLS termination (including Let’s Encrypt support)

  • Load balancing between multiple pods

  • Middleware support for things like authentication, redirects, rate limiting, etc.

Traefik is also highly compatible with GitOps workflows. It uses Kubernetes Custom Resource Definitions (CRDs) like IngressRoute and Middleware, which makes it easy to manage all of your ingress behavior declaratively, right from your Git repo.

Why You Need It

Without an ingress controller, every service you want to expose needs its own LoadBalancer service (i.e., a dedicated external IP). That scales poorly, especially in a lab environment with limited IP space.

Traefik solves that problem by letting you expose multiple services through a single external IP, usually on ports 80 and 443, by routing requests based on hostnames or paths. This means:

  • You can access services like argocd.yourdomain.local and vault.yourdomain.local through the same IP.

  • You get clean, centralized HTTPS management with built-in TLS support.

  • You dramatically reduce the number of open ports and public IPs you need.

Paired with MetalLB, Traefik becomes the front door to your entire GitOps platform.

How It Powers the Platform

Traefik is the gateway that makes all the services behind it easily and securely accessible. It enables you to:

  • Route HTTP/HTTPS traffic to services like ArgoCD, Vault, and your internal tools

  • Handle TLS (with optional Let’s Encrypt integration)

  • Define ingress behavior declaratively via CRDs

  • Share a single external IP across multiple services, using hostnames or paths

All of this is deployed using ArgoCD, meaning every route, certificate, and service exposure is version-controlled and reproducible.

Traefik Installation

As we covered during the MetalLB install, adding Helm repositories, creating the app in ArgoCD, and configuring the basic Helm parameters is mostly the same for each app we deploy. Because we've already gone through that in detail with MetalLB, I'll just briefly outline the steps again here. No detailed screenshots needed unless there’s a significant difference.

Step 1: Add the Traefik Helm Repo

ArgoCD needs to know where to pull the Traefik Helm chart from. For Traefik, we’ll use the official Traefik Helm repository:

Helm Repo URL:

https://helm.traefik.io/traefik

In the ArgoCD UI:

  • Navigate to Settings → Repositories

  • Click + CONNECT REPO

  • Enter the Traefik Helm repo URL listed above

  • Select Helm as the repository type

  • Provide a name (optional, something like traefik-charts)

  • Associate the repo with the appropriate ArgoCD project (mine was lab-home)

  • No authentication is required since this repo is publicly accessible

  • Click CONNECT to finish

Once connected, ArgoCD is ready to deploy the Traefik Helm chart into your cluster.

Step 2: Create the ArgoCD Application (Traefik)

Head to the Applications tab in ArgoCD, and click + NEW APP to start deploying Traefik.

Here's how you'll fill it out:

  • Application Name: traefik

  • Project: Select your ArgoCD project

  • Sync Policy: Manual (for now)

  • Repository URL: Select the Traefik Helm repo you just connected

  • Chart Name: traefik

  • Target Revision: Use latest, or specify a stable version once you've tested and confirmed compatibility

  • Cluster URL: Typically https://kubernetes.default.svc for an in-cluster deploy (if yours differs, just use the appropriate URL)

  • Namespace: Use kube-system (check the option to create it if it doesn’t exist yet)

Why kube-system namespace?
Deploying Traefik to the kube-system namespace makes sense because Traefik is essentially a core infrastructure service. Placing it here aligns with Kubernetes best practices: core infrastructure and networking-related services belong in this namespace, clearly separated from user or application workloads.

When finished, click CREATE to finalize the setup.

Step 3: Add Custom Helm Values for Traefik

Unlike MetalLB, our Traefik deployment uses custom Helm values directly from our Git repository, without Kustomize. We'll define these custom values as a second source within our ArgoCD Application manifest.

Here's how you'll set this up in the ArgoCD UI:

  1. Navigate to the Traefik Application you created earlier.

  2. Click the App details (🖉 edit) icon in the top-right corner to open the manifest editor.

  3. Scroll down to the manifest, and ensure you're using sources: (plural), since we're adding an additional source.

  4. Modify your ArgoCD Application manifest to look similar to this:

project: home-lab
destination:
  server: https://172.16.99.25:6443
  namespace: kube-system
syncPolicy:
  syncOptions:
    - CreateNamespace=true
sources:
  - repoURL: https://helm.traefik.io/traefik
    targetRevision: 35.0.1
    helm:
      valueFiles:
        - $values/apps/traefik/values-lab.yml
    chart: traefik
  - repoURL: https://github.com/leothelyon17/kubernetes-gitops-playground.git
    targetRevision: HEAD
    ref: values

Explanation:

  • The first source references the official Traefik Helm repository, specifying the chart version.

  • The second source references my GitHub repo (or your own), where your custom Helm values (values-lab.yml) are stored.

  • ArgoCD merges these values when syncing Traefik, allowing environment-specific customizations, such as ingress rules, TLS settings, dashboard exposure, middleware options, and other important configurations.

Once you've updated and saved this manifest, ArgoCD will apply the changes, and Traefik will deploy using your customized configuration, all neatly managed by GitOps.

Step 4: Sync the Traefik Application

Once everything looks good, click Sync in ArgoCD. It will pull the Traefik Helm chart, merge your custom Helm values (values-lab.yml), and deploy Traefik into your cluster.

You can click into the application details to watch Traefik’s resources spin up; Deployments, Services, IngressRoutes, and more. If the sync fails initially, don't worry, just retry it.

After a short period, you should see Traefik showing a status of “Healthy” and “Synced”. Verify that Traefik pods are running successfully in your cluster (similar to MetalLB earlier).

Congratulations! Traefik is now up and running as your ingress controller, ready to handle external HTTP(S) traffic into your cluster.

Traefik Custom Helm Values

Let’s take a look at the custom Helm values we’re using for Traefik, pulled from apps/traefik/values-lab.yml. These provide a simple but functional starting point for ingress, dashboard access, and authentication in a lab environment.

Key Configuration Highlights

IngressRoute for the Traefik Dashboard

ingressRoute:
  dashboard:
    enabled: true
    matchRule: Host(`YOUR-URL`)
    entryPoints: ["web", "websecure"]
    middlewares:
      - name: traefik-dashboard-auth

  • Enables the Traefik dashboard and exposes it via both HTTP and HTTPS.

  • Routes traffic based on hostname, e.g. traefik-dashboard-lab.jjland.local.

  • Adds a middleware for basic authentication to protect access.

Basic Authentication Middleware

extraObjects:
  - apiVersion: v1
    kind: Secret
    metadata:
      name: traefik-dashboard-auth-secret   # referenced by the Middleware below
    type: kubernetes.io/basic-auth
    stringData:
      username: admin
      password: changeme
  - apiVersion: traefik.io/v1alpha1
    kind: Middleware
    metadata:
      name: traefik-dashboard-auth          # referenced by the dashboard IngressRoute above
    spec:
      basicAuth:
        secret: traefik-dashboard-auth-secret

  • Creates a Kubernetes Secret with hardcoded credentials (admin / changeme).

  • Defines a Traefik Middleware that references the secret and applies HTTP basic auth to protected routes.

NOTE: These credentials are hardcoded and intended only for lab/demo use. You should absolutely replace "changeme" with a strong, securely managed password, or better yet, use a more robust authentication mechanism in production.

Static LoadBalancer IP Assignment

service:
  spec:
    loadBalancerIP: <YOUR IP SET ASIDE BY METALLB>

  • This assigns a specific external IP to Traefik’s LoadBalancer service, ensuring stable access through MetalLB.

Accessing the Dashboard

Once deployed and synced in ArgoCD, you can access the Traefik dashboard by visiting the URL set in the custom values file.

To make this work:

  • Add a DNS record (or local /etc/hosts entry) pointing to your Traefik service IP (in my case, 172.16.99.30).

  • Use the credentials you set in the values file (admin / changeme) to log in via the basic auth prompt.
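For example, a hosts entry for my lab would look like this (swap in your own MetalLB-assigned IP and hostname):

172.16.99.30   traefik-dashboard-lab.jjland.local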

Why It Matters

This configuration gives you:

  • A working Traefik dashboard protected by basic auth

  • A predictable IP address exposed by MetalLB

  • A GitOps-managed ingress setup, all stored in Git and synced automatically via ArgoCD

These are just starter settings. They work great in a lab, but you’ll want to harden and expand them for production use. Still, even at this basic level, you’re getting all the core benefits: visibility, consistency, and version-controlled configuration.

Let’s move on to the next part of the platform.

Rook + Ceph: Persistent Storage for Stateful Applications

So far, we’ve deployed the pieces that make your platform accessible, MetalLB for external IPs, and Traefik for routing traffic. But modern platforms don’t just serve traffic, they store data. If you’re planning to run apps like Nautobot, NetBox, or Postgres, you’ll need reliable, persistent storage to keep data alive across restarts and node failures.

That’s where Rook + Ceph comes in.

What is Rook + Ceph?

Ceph is a distributed storage system that provides block, object, and file storage, all highly available and scalable. It’s used in enterprise environments for cloud-native storage, and it’s rock solid.

Rook is the Kubernetes operator that makes deploying and managing Ceph clusters easier and more native to the Kubernetes ecosystem. Together, they turn a set of disks across your nodes into a resilient, self-healing storage platform.

Why You Need It

Kubernetes doesn’t come with a built-in storage backend. While it allows you to declare PersistentVolumeClaims, it’s up to you to provide the actual storage behind them. In cloud environments, that’s easy, just hook into EBS, Azure Disks, or whatever your platform provides. But in a lab or on-prem cluster? You’re on your own.

Rook + Ceph fills that gap. Once deployed, it becomes your cluster’s dynamic, self-healing storage layer. You can provision persistent volumes for any stateful workload—databases, internal tooling, monitoring stacks, and more, without having to manually manage local disks or worry about data loss.

How It Powers the Platform

Rook + Ceph is the backbone of persistent infrastructure in this setup. It enables you to:

  • Create PersistentVolumes dynamically, on demand, using StorageClass definitions

  • Run stateful apps like NetBox, Nautobot, PostgreSQL, and Prometheus with confidence

  • Survive pod restarts and node reboots, your data stays intact and available

  • Manage it all declaratively, deployed and version-controlled with ArgoCD, just like everything else

What This Looks Like When Deployed

Once your Rook + Ceph configuration is applied and the cluster becomes active, you’ll effectively have a resilient, distributed storage system spanning all your nodes. In this setup:

  • Ceph stores data redundantly across all three nodes, similar in concept to a 3-node RAID-1 (mirrored) configuration.

  • When one node goes offline or a disk fails, your data is still accessible and safe.

  • The Ceph monitor daemons ensure quorum and cluster health, while OSDs (Object Storage Daemons) replicate data across your available storage devices (e.g., /dev/vdb on each node).

This redundancy is built-in and automatically managed by the Ceph cluster itself, no manual RAID configuration needed. It’s a core reason why Ceph is trusted in both enterprise and lab-scale deployments.

What We’re Deploying: The Operator + StorageCluster

As with many Kubernetes-native tools, Rook uses the Operator pattern to manage Ceph. We’ll be deploying two key components:

  • The Rook-Ceph Operator – Acts as a controller that manages Ceph-specific resources and keeps everything in the desired state.

  • A CephCluster resource – Defines how the storage backend should be built using the disks available across your nodes.

What’s an Operator?
A Kubernetes Operator is a purpose-built controller that manages complex stateful applications by watching for custom resources (like CephCluster) and continuously reconciling their desired state—creating, healing, scaling, and configuring everything automatically.

By deploying both the operator and the cluster config together, we get a hands-off, fully declarative storage setup. Everything is defined in Git, synced by ArgoCD, and managed by the operator—including provisioning, recovery, and upgrades.

Step 1: Add the Rook-Ceph Helm Repo

ArgoCD needs to know where to pull the Rook-Ceph Helm chart from. For this, we’ll use the official Rook Helm repository:

Helm Repo URL:

https://charts.rook.io/release

In the ArgoCD UI:

  • Navigate to Settings → Repositories

  • Click + CONNECT REPO

  • Enter the Helm repo URL listed above

  • Select Helm as the repository type

  • Optionally give it a name (e.g., rook-ceph-charts)

  • Associate the repo with your ArgoCD project (mine was lab-home)

  • No authentication is required since it’s publicly accessible

  • Click CONNECT to finish

Once connected, ArgoCD will be able to deploy both the Rook-Ceph operator and storage cluster using this chart.

Step 2: Create the ArgoCD Application (Rook-Ceph)

Now that the repo is connected, head to the Applications tab in ArgoCD and click + NEW APP to start the deployment.

Here’s how to fill it out:

  • Application Name: rook-ceph

  • Project: Select your ArgoCD project (e.g., lab-home)

  • Sync Policy: Manual (for now)

  • Repository URL: Select the Rook Helm repo you just connected

  • Chart Name: rook-ceph

  • Target Revision: Use latest, or pin to a stable version you’ve tested

  • Cluster URL: Typically https://kubernetes.default.svc if deploying in-cluster

  • Namespace: rook-ceph (check the box to create it if it doesn’t exist)

Why the rook-ceph Namespace?

Rook and Ceph manage a lot of moving parts—monitors, OSDs, managers, etc.—and isolating those components into their own namespace (rook-ceph) helps keep your cluster clean and easier to troubleshoot. It also aligns with common community best practices and makes upgrades and deletions much safer.

Once you’ve filled everything out, click CREATE to finish provisioning the application.

Step 3: Add Custom Helm Values + Kustomize Overlay for Rook-Ceph

Rook-Ceph is one of the more complex components in our GitOps platform. It’s not just a single deployment, it involves multiple controllers, CRDs, and cluster-level storage logic. Because of that, we’ll be using both a Helm chart (with custom values) and a Kustomize overlay to deploy it cleanly and maintainably.

This dual-source approach lets us:

  • Use the Helm chart to install the Rook-Ceph operator and core components

  • Apply custom values to tailor behavior for our environment (resource tuning, monitor placement, dashboard settings, etc.)

  • Layer in Kustomize-based manifests for complex resources like CephCluster, StorageClass, CephFilesystem, resources that often require more precise control

ArgoCD Application Sources

When editing your ArgoCD Application manifest, your sources block will look similar to this:

sources:
  - repoURL: https://charts.rook.io/release
    targetRevision: v1.17.0
    helm:
      valueFiles:
        - $values/apps/rook-ceph/values-lab.yml
    chart: rook-ceph
  - repoURL: https://github.com/leothelyon17/kubernetes-gitops-playground.git
    path: apps/rook-ceph/overlays/lab
    targetRevision: HEAD
    ref: values

Why Both Sources?

  • The Helm chart deploys the operator and all required CRDs in the correct order.

  • The Kustomize overlay (from your Git repo) contains environment-specific resources like:

    • CephCluster – the main storage cluster definition

    • StorageClass – so other apps can request storage using PersistentVolumeClaims

    • CephFileSystem – enables shared POSIX-compliant volumes for apps needing ReadWriteMany access

    • Optional extras like CephBlockPool or a toolbox deployment for CLI-based Ceph management

You can find these manifests in the repo under:
apps/rook-ceph/overlays/lab/

Once saved, ArgoCD will treat both sources as part of the same application and sync them together, ensuring everything is deployed in the right order and stays in sync with Git.

Understanding the Rook-Ceph Overlay: Managing Complexity with GitOps

I wanted to cover this now before we try and sync. Setting up Rook-Ceph in a GitOps workflow involves more than just deploying a Helm chart. You’re orchestrating a sophisticated storage platform made up of tightly coupled components: an operator, CRDs, a distributed Ceph cluster, storage classes, ingress routes, and more. Each piece needs to be configured correctly and deployed in the proper order.

To keep all of this manageable and repeatable, we separate concerns using a combination of custom Helm values and a Kustomize overlay. The overlay found in apps/rook-ceph/overlays/lab brings together the critical resources required for a working Ceph deployment—a shared CephFS filesystem, storage classes, and even a dashboard ingress.

The sections below break down each of these files so you can understand what’s happening, why it’s needed, and how it fits into the larger GitOps puzzle.

apps/rook-ceph/values-lab.yml

csi:
  enableRbdDriver: false

  • Purpose: Disables the RBD (block-device) CSI driver in this lab setup, since we’re only using CephFS here.

  • Why it matters: Keeps the cluster lean by not installing unused CSI components.

apps/rook-ceph/overlays/lab/

ceph-cluster.yml

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v19.2.1
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
  dashboard:
    enabled: true
  storage:
    useAllNodes: true
    useAllDevices: false
    deviceFilter: vdb

  • Defines the core CephCluster resource.

  • Key settings:

    • Runs 3 monitors for quorum.

    • Uses each node’s vdb device for OSDs (fits my lab VM disk layout; see the note below).

    • Enables the Ceph dashboard for visual health checks.

⚠️ NOTE: These settings are specific to my 3-node lab cluster, where each node has:

  • One OS disk (vda)

  • One dedicated Ceph data disk (vdb)

Example disk layout (lsblk output from one node):

[jeff@rocky9-lab-node1 ~]$ lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sr0          11:0    1  1.7G  0 rom  
vda         252:0    0   50G  0 disk 
├─vda1      252:1    0    1G  0 part /boot
└─vda2      252:2    0   49G  0 part 
  ├─rl-root 253:0    0   44G  0 lvm  /
  └─rl-swap 253:1    0    5G  0 lvm  
vdb         252:16   0  250G  0 disk

Your disk layout will likely be different. I’ve configured Ceph to use only the vdb disk via the deviceFilter setting to avoid accidentally wiping the OS disk.

⚠️ Be careful: If you don’t tailor these values to your hardware, you could unintentionally destroy existing data. Always verify your node’s disk setup and adjust your configuration accordingly.

ceph-filesystem.yml

apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: k8s-ceph-fs
  namespace: rook-ceph
spec:
  metadataPool:
    failureDomain: host
    replicated:
      size: 3
  dataPools:
    - name: replicated
      failureDomain: host
      replicated:
        size: 3
  preserveFilesystemOnDelete: true
  metadataServer:
    activeCount: 1
    activeStandby: true

  • Creates a CephFilesystem (CephFS) for shared, POSIX-style volumes.

  • Why CephFS? Enables ReadWriteMany storage, which block pools alone can’t provide.

ceph-storageclass-delete.yml & ceph-storageclass-retain.yml

Both define Kubernetes StorageClass objects that front the CephFS CSI driver:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs-delete      # or rook-cephfs-retain
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: k8s-ceph-fs
  pool: k8s-ceph-fs-replicated
  # The CSI secret name and namespace parameters must be set together
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Delete       # or Retain
allowVolumeExpansion: true

  • Difference:

    • rook-cephfs-delete will delete PV data when PVCs are removed.

    • rook-cephfs-retain will retain data for manual cleanup or backup.

  • Why two classes? Gives you flexibility for different workloads (ephemeral test vs. persistent data).
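To show how other workloads consume these classes, here’s a hypothetical PersistentVolumeClaim (the name, namespace, and size are made up) requesting shared CephFS storage from the retain class:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data
  namespace: default
spec:
  accessModes:
    - ReadWriteMany          # possible because CephFS backs the class
  storageClassName: rook-cephfs-retain
  resources:
    requests:
      storage: 5Gi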

ingress-route-gui.yml

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: ceph-ingressroute-gui
  namespace: rook-ceph
spec:
  entryPoints:
    - web
    - websecure
  routes:
    - match: Host(`ceph-dashboard-lab.jjland.local`) # EXAMPLE
      kind: Rule
      services:
        - name: rook-ceph-mgr-dashboard
          port: 7000

  • Exposes the Ceph dashboard through Traefik on your chosen host.

  • Why: Lets you reach the Ceph UI (after DNS/hosts setup) without manually port-forwarding.

kustomization.yml

resources:
  - ceph-cluster.yml
  - ingress-route-gui.yml
  - ceph-filesystem.yml
  - ceph-storageclass-delete.yml
  - ceph-storageclass-retain.yml

  • Aggregates all the above files into a single overlay that ArgoCD can sync.

  • Why Kustomize? Keeps base Helm installs separate from environment-specific definitions, making updates cleaner and more maintainable.

Step 4: Sync the Rook-Ceph Application

Ready? Go ahead and click Sync in ArgoCD for the rook-ceph application.

This one’s going to take a little more time, and for good reason. There’s a lot happening under the hood.

When you sync, ArgoCD will:

  • Deploy the Rook-Ceph Operator, which is responsible for watching and managing Ceph resources in your cluster

  • Install CephFS CSI drivers, RBAC roles, and CRDs needed to support persistent volumes

  • Apply your CephCluster, CephFilesystem, and StorageClass definitions via the Kustomize overlay

But the real magic starts after the operator is running.

Once the operator is up, it will immediately start watching for additional Ceph custom resources in the rook-ceph namespace. When it discovers the CephCluster definition, it will:

  • Initialize the monitors (MONs) for quorum

  • Deploy the manager (MGR) for handling cluster state and dashboard

  • Start spinning up the OSDs (Object Storage Daemons) using the storage devices you specified (in this case, vdb on each node)

This process can take several minutes depending on your hardware, node performance, and the size of your disks.

How do you know it worked?
The cluster is healthy when you see:

  • 3 running OSD pods, one for each disk across your 3 nodes

  • The rook-ceph application status in ArgoCD shows “Healthy” and “Synced”

  • Optionally: access the Ceph dashboard and verify health checks (covered earlier)
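If you prefer the CLI, a couple of quick checks work too (the second assumes you deployed the optional toolbox mentioned earlier, typically named rook-ceph-tools):

kubectl -n rook-ceph get pods
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status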

Troubleshooting Tips

Rook-Ceph is powerful, but complex. And with that complexity comes the potential for a lot of things to go sideways. I won’t dive into every failure mode here, but I’ll leave you with a few quick tips that can help when something’s not working as expected:

  • Use the ArgoCD UI to inspect pod logs.
    Click into the rook-ceph application, navigate to the "PODS" tab, and use the logs view to get real-time output from key components like the operator, mons, OSDs, and mgr. Most issues will reveal themselves here.

  • Resync the operator app to restart it.
    If the cluster gets stuck or fails to initialize certain pieces, manually syncing the operator application in ArgoCD will redeploy the pod. This is often enough to force a retry or pull in updated CRDs.

  • Disk issues?
    If Ceph is skipping disks or refusing to reuse them, it’s usually leftover metadata. Try running a full zap with ceph-volume, or fall back to wipefs, sgdisk, and dd to fully clean the disk (see the sketch below).
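    A typical cleanup sequence looks something like this; run it on the affected node and only against the dedicated Ceph disk (vdb in my lab), because it is destructive:

    sudo wipefs --all /dev/vdb
    sudo sgdisk --zap-all /dev/vdb
    sudo dd if=/dev/zero of=/dev/vdb bs=1M count=100 oflag=direct,dsync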

Congratulations! Once everything is green, you now have a fully functional Ceph storage backend—redundant, self-healing, and fully managed through GitOps.

Secrets Management: External Secrets + HashiCorp Vault

In any production platform, secrets management isn’t optional, it’s foundational. We're talking about things like API tokens, database passwords, SSH keys, and TLS certs. Storing these directly in your Git repo? Not an option. Hardcoding them into manifests? Definitely not.

That’s where External Secrets and HashiCorp Vault come in, and together, they solve this problem the right way.

What is HashiCorp Vault?

Vault is a centralized secrets manager that securely stores, encrypts, and dynamically serves secrets to applications and users. It supports access control, auditing, and integration with identity systems and cloud providers. In this stack, Vault acts as the secure system of record for all sensitive data.

What is External Secrets?

External Secrets is a Kubernetes operator that bridges external secret stores (like Vault) with native Kubernetes Secret objects. It watches for custom resources like ExternalSecret and automatically pulls values from Vault into the cluster, keeping them updated and consistent without manual intervention.

Why Network Automation Needs This

Network automation platforms—like NetBox, Nautobot, and custom Python tooling—frequently need access to sensitive data:

  • Device credentials for SSH or API-based provisioning

  • Authentication tokens for systems like GitHub, Slack, or ServiceNow

  • Vaulted credentials for orchestrating changes via Ansible or Nornir

You don’t want these values floating around in plaintext in Git. But you still want to declare your intent (what secrets are needed and where) in version control. This is especially critical when you're deploying infrastructure with GitOps and need environments to be reproducible and secure.

With Vault + External Secrets, you can:

  • Keep the actual secret values outside of Git

  • Still declare your ExternalSecret manifests in Git as part of your ArgoCD-managed platform

  • Let Kubernetes handle syncing and refreshing secrets automatically

This pattern ensures your network automation stack is secure, scalable, and compliant, without losing any GitOps benefits.
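To make that concrete, here’s a rough sketch of what an ExternalSecret might look like once a Vault-backed store exists. We won’t wire up the store until later, so the ClusterSecretStore name, namespace, and Vault path below are purely hypothetical:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: device-credentials
  namespace: network-automation
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend
  target:
    name: device-credentials   # the Kubernetes Secret that gets created and refreshed
  data:
    - secretKey: password
      remoteRef:
        key: network/devices    # path in Vault
        property: password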

Installing External Secrets Operator

Setting up External Secrets is straightforward and follows the same pattern we’ve used throughout this platform. In this section, we’ll deploy the External Secrets Operator using its official Helm chart with default values; no custom overlays or secret stores just yet.

Step 1: Add the Helm Repo

First, add the External Secrets Helm repository to ArgoCD:

  1. In the ArgoCD UI, go to Settings → Repositories

  2. Click + CONNECT REPO

  3. Fill in the following:

    • Type: Helm

    • URL: https://charts.external-secrets.io

    • Name (optional): external-secrets

    • Project: Choose your ArgoCD project (e.g., lab-home)

    • Authentication: Leave empty (this is a public repo)

  4. Click CONNECT to save

Step 2: Create the ArgoCD Application

Navigate to Applications → + NEW APP, and fill out the form like this:

  • Application Name: external-secrets

  • Project: lab-home (or your equivalent)

  • Sync Policy: Manual

  • Repository URL: Select the Helm repo you just added

  • Chart: external-secrets

  • Target Revision: latest (or a specific version like 0.16.1)

  • Cluster URL: https://kubernetes.default.svc

  • Namespace: external-secrets
    (Check the box to create the namespace if it doesn’t exist)

Click CREATE to finish.

Step 3: Sync the Application

Once the app is created, hit SYNC in the ArgoCD UI. This will:

  • Deploy the External Secrets Operator into your cluster

  • Create the necessary CRDs and controller components

  • Make the ExternalSecret, SecretStore, and ClusterSecretStore resource types available

You should see the app enter a Synced and Healthy state once everything is up and running. No custom values or overlays are needed at this stage.

Installing HashiCorp Vault

Vault is our centralized secrets store, and in this setup we’re deploying it with two main goals in mind:

  • Enable its built-in GUI for easy inspection and management

  • Ensure secret data is persisted using our Rook-Ceph-backed StorageClass

To accomplish this, we’ll combine a Helm-based deployment with a Kustomize overlay that adds a Traefik IngressRoute for secure browser access.

Step 1: Add the Helm Repo

Add the official HashiCorp Helm chart repo to ArgoCD:

  1. In the ArgoCD UI, go to Settings → Repositories

  2. Click + CONNECT REPO

  3. Fill in:

    • Type: Helm

    • URL: https://helm.releases.hashicorp.com

    • Project: lab-home (or whatever you're using)

    • Authentication: Leave blank (public repo)

  4. Click CONNECT to save

Step 2: Prepare Your Vault Application

Vault is more stateful and config-heavy than most apps, so we’re using two sources in our ArgoCD Application:

  • A Helm chart to install Vault and enable persistent storage

  • A Kustomize overlay that exposes the Vault UI through Traefik

Here’s an example Application manifest (adjust values as needed for your setup):

project: lab-home
destination:
  server: https://kubernetes.default.svc
  namespace: vault
syncPolicy:
  syncOptions:
    - CreateNamespace=true
sources:
  - repoURL: https://helm.releases.hashicorp.com
    chart: vault
    targetRevision: 0.30.0 # or latest stable
    helm:
      valueFiles:
        - $values/apps/hashicorp-vault/values-lab.yml
  - repoURL: https://github.com/leothelyon17/kubernetes-gitops-playground.git
    targetRevision: HEAD
    path: apps/hashicorp-vault/overlays/lab
    ref: values

Note: The Git repo and folder structure here are based on my kubernetes-gitops-playground. If you’re using your own repo, be sure to adjust the repoURL, path, and valueFiles references accordingly.

Step 3: Custom Helm Values

In your Git repo, the file at apps/hashicorp-vault/values-lab.yml should enable:

  • The Vault UI (ui.enabled: true)

  • Persistent storage via the Rook-Ceph-backed StorageClass you created earlier

Example configuration:

server:

  dataStorage:
    enabled: true
    # Size of the PVC created
    size: 1Gi
    # Location where the PVC will be mounted.
    mountPath: "/vault/data"
    # Name of the storage class to use.  If null it will use the
    # configured default Storage Class.
    storageClass: rook-cephfs-retain
    # Access Mode of the storage device being used for the PVC
    accessMode: ReadWriteOnce

# Vault UI
ui:
  enabled: true

Step 4: Expose Vault Securely with Traefik

In your apps/hashicorp-vault/overlays/lab directory, define a Kustomize overlay to expose the UI via Traefik.

Example: kustomization.yml

resources:
  - ingress-route-gui.yml

And in ingress-route-gui.yml:

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: vault-dashboard
  namespace: vault
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`vault-lab.jjland.local`) # EXAMPLE
      kind: Rule
      services:
        - name: vault
          port: 8200

Note: vault-lab.jjland.local is an example hostname used in my lab.
If you're following along exactly, feel free to use it, just be sure to add a local DNS or /etc/hosts entry that maps this to your cluster’s ingress IP.
Otherwise, replace this hostname with one appropriate for your environment.

Step 5: Sync the Application

Once your Helm values and Kustomize overlay are in place and committed to Git, go ahead and sync the Vault application from ArgoCD.

ArgoCD will deploy all Vault components into the vault namespace, including:

  • The StatefulSet for the Vault server

  • The service account, RBAC roles, and services

  • A PersistentVolumeClaim (PVC) for storing Vault data

  • Your custom IngressRoute for exposing the GUI

After syncing, head to the Vault app in ArgoCD to verify the following:

  • The app status should be Synced

  • The PVC should be Bound and Healthy

  • The main Vault pod will likely remain in a Progressing state; this is expected

That “Progressing” status is normal because Vault isn’t fully initialized yet. It won’t report itself as ready until it has been manually initialized and unsealed for the first time.

Before moving forward, it’s a good idea to:

  • Inspect the pod logs in the ArgoCD UI if anything seems stuck

  • Check kubectl get pvc -n vault to confirm the PVC is attached and healthy

  • Use kubectl describe pod or describe pvc to troubleshoot issues

If all looks good, navigate to the Vault UI in your browser:

https://vault-lab.jjland.local # EXAMPLE

If you’re using a different hostname, be sure you’ve created the appropriate DNS or /etc/hosts entry.

From the web UI, you can initialize Vault, generate unseal keys, and perform the first unseal operation, all interactively.

Initializing Vault Through the GUI

Once the Vault UI is accessible, it’s time to initialize the system. Vault doesn’t become “ready” until this step is completed, and it only needs to be done once per cluster.

Step 1: Open the Vault UI

Navigate to the Vault dashboard in your browser:

https://vault-lab.jjland.local

(Or your custom hostname if you’re using a different setup.)

You’ll be presented with a message that Vault has not yet been initialized. Click the “Initialize” button to begin the process.

Step 2: Generate Unseal Keys

The GUI will prompt you to configure key shares and key threshold. Leave these at the defaults unless you have a specific security model in mind:

  • Key Shares: 5

  • Key Threshold: 3

This means Vault will generate 5 unseal keys, and any 3 of them will be required to unseal the Vault.

Click "Initialize" to proceed. Vault will generate a JSON file containing:

  • The root token (used to log in as admin)

  • All 5 unseal keys

Download this file immediately and store it in a secure location. These keys cannot be recovered later.

⚠️ Do not skip this download. If you lose these keys before unsealing, you’ll have to wipe and redeploy Vault from scratch.

Step 3: Unseal the Vault

After downloading the key file, Vault will prompt you to enter the unseal keys one by one.

  • Copy a single unseal key from the JSON file

  • Paste it into the field and click “Unseal”

  • Repeat with two more keys (for a total of 3)

Once the required threshold is met, Vault will unlock and become active.
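For reference, the same init and unseal flow can also be done from the CLI, assuming the chart created a pod named vault-0 in the vault namespace as in the manifest above (run the unseal command three times, each with a different key):

kubectl -n vault exec -it vault-0 -- vault operator init
kubectl -n vault exec -it vault-0 -- vault operator unseal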

Step 4: Log In with the Root Token

After unsealing, return to the login screen and paste in the root token from your downloaded JSON file.

Once logged in, you’ll have full admin access to Vault.

Step 5: Verify in ArgoCD

Flip back to the ArgoCD UI and check the status of the Vault application. At this point, the main pod should switch from Progressing to Healthy, and your application should show as fully operational.

You're now ready to configure Vault as a backend for External Secrets, so your GitOps-managed workloads can securely retrieve credentials, tokens, and other sensitive data on demand.

This completes Part 2 of this series.

Summary & What’s Next

In Part 2, we took our GitOps foundation and turned it into a functional, production-capable platform. We integrated critical infrastructure components like MetalLB for external access, Traefik for routing, Rook-Ceph for persistent storage, and a full-fledged secrets management stack using External Secrets and HashiCorp Vault, all deployed declaratively using ArgoCD.

At this point, you have a GitOps-powered Kubernetes environment that’s capable of:

  • Exposing services securely with external IPs and ingress rules

  • Persisting data across workloads using Ceph-backed volumes

  • Managing secrets securely without embedding them in Git

  • Deploying and managing infrastructure the same way you'll deploy apps: as code

This platform is now ready to host real-world applications, whether it’s NetBox, Nautobot, or custom tooling built for your network automation workflows.

In Part 3, we’ll finally do just that: deploy a real application on top of everything we’ve built. I haven’t finalized which app we’ll use yet, but it’ll be something practical and network-engineer focused. Stay tuned and thank you for reading!
