Running the Llama 3.1 8B Large Language Model Cheaply on Google Cloud Kubernetes

Simon Crowe

This is another short post about my pet project, Shortlist. The project's main aims were to help me pass my CKAD exam (which I luckily did) and give me some exposure to operating LLMs in the cloud. I wanted to do this with as small an impact on my bank balance as possible.

Cost-Cutting Strategies

What follows is by no means production-grade ML/FinOps advice, but it might help someone who wants to run a similar hobby project.

Use Single-Zone Spot VMs

Google Cloud’s Kubernetes offering is geared towards highly available cloud-native services, and as such, a typical node pool will be spread across three (availability) zones, meaning I’d need to run three expensive GPU VMs.

Here’s my Terraform/OpenTofu config for the node pool, in which I’ve avoided multiple zones:

resource "google_container_node_pool" "primary_gpu" {
  name       = "shortlist-gpu"
  location   = "europe-west1-b"
  cluster    = google_container_cluster.primary.name
  node_count = 1

  node_config {
    machine_type = "g2-standard-4"
    spot         = true

    service_account = google_service_account.kubernetes.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }
}

location = "europe-west1-b" tells Google Cloud to only provision nodes in one zone (hence the -b) within their Belgium region. This is roughly equivalent to “one data centre”. This ensures that only one node is deployed at any one time. This would be a bad idea if I were running production HTTP workers, but in this case, I don’t care if the data centre my machine is running in is hit by a natural disaster and it goes down completely.

The next piece of config is simple: spot = true. Spot VMs are cheaper but preemptible: Google can reclaim them at any time. Again, while I’d have to take special care with this in a production setting, in this case, it’s perfect: the worst case is a batch job getting cut short and retried.

The g2-standard-4 is one of the cheapest GPU VMs. It has an L4 GPU with 24GB of VRAM, which is more than enough for a smaller LLM. Llama 3.1 8B took up just under 16GB of VRAM, so I might have managed a slightly larger model, or even Gemma 2 27B with some quantisation.
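As a back-of-the-envelope check, 16-bit weights take roughly two bytes per parameter, so an 8B model needs about 16GB before accounting for activations and the KV cache, while a 27B model would need around 54GB at 16-bit, or roughly 14GB at 4-bit, which is why it would only fit into the L4’s 24GB with quantisation.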

Google’s Limits are Helpful Here

It turns out that, in its GPU stinginess, Google had my back too: there’s a per-project global limit on the number of GPUs, so a misconfigured node pool couldn’t have quietly racked up a bill for a fleet of GPU VMs.

Caching Model Weights

In my ignorance, I first attempted to run the 70B version of Llama, whose weights come to over 100GB. This overwhelmed the node’s ephemeral storage. So began my journey from knowing nothing about running LLMs in the cloud to knowing a bit.

I spent quite a while playing with different combinations of:

  • Manually provisioned storage volumes

  • Storage provisioned through Google Cloud’s default storage classes

  • PersistentVolumes

  • PersistentVolumeClaims, with and without manually defined PersistentVolumes

In the end, I concluded that there wasn’t an easy way to recycle a volume from one pod to another; I would probably have had to make quite a few direct K8s API calls to keep reusing the same volume.

Before I went that far, I came across something very useful: Cloud Storage FUSE. It lets me mount a Cloud Storage bucket as though it were a filesystem. This turned out to be an ideal and simple solution. I could even store the cached weights for multiple models in the same bucket, using different prefixes.
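In practice, an object like llama-3.1-8b/model.safetensors just shows up as a file under the mount path, so each model’s cache can live under its own directory-like prefix (that file name is purely illustrative).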

All that was required at the Terraform/Tofu level was this:

resource "google_container_cluster" "primary" {
  name     = "shortlist-dev"
  location = "europe-west1-b"

  ...

  addons_config {
    gcs_fuse_csi_driver_config {
      enabled = true
    }
  }
}
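
One prerequisite worth flagging: the Cloud Storage FUSE CSI driver authenticates through Workload Identity, so that needs to be enabled on the cluster as well; the IAM side of it is covered further down.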

What was slightly trickier was configuring the workload, as I was imperatively creating the job and ConfigMap using Go code.

I needed to add an annotation to the pod so that Kubernetes injects a sidecar container that handles presenting the object storage bucket as a filesystem.

    // jobTemplate wraps the PodSpec built earlier (pod); the gke-gcsfuse/volumes
    // annotation tells GKE to inject the Cloud Storage FUSE sidecar container.
    jobTemplate := corev1.PodTemplateSpec{
        Spec: pod,
        ObjectMeta: metav1.ObjectMeta{
            Annotations: map[string]string{"gke-gcsfuse/volumes": "true"},
        },
    }

Plus, the volume and volume mount:

    cacheDir := "/var/cache/shortlist-assessor"
    cacheVol := corev1.Volume{
        Name: "cache",
        VolumeSource: corev1.VolumeSource{
            CSI: &corev1.CSIVolumeSource{
                Driver: "gcsfuse.csi.storage.gke.io",
                VolumeAttributes: map[string]string{
                    "bucketName": os.Getenv("ASSESSOR_CACHE_BUCKET_NAME"),
                },
            },
        },
    }
    cacheVolMnt := corev1.VolumeMount{
        Name:      "cache",
        MountPath: cacheDir,
    }
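
For context, the snippets above don’t show how these pieces get attached to the pod template, so the sketch below is an assumption based on the surrounding code rather than a copy of the real thing: the volume goes on the pod spec, the mount and a GPU request go on the model-serving container (assumed here to be the first container), and the template is wrapped in a Job with a small retry budget so that a spot-VM preemption just means the pod is rescheduled onto a fresh node.

    // batchv1 is k8s.io/api/batch/v1; resource is k8s.io/apimachinery/pkg/api/resource.
    // Attach the bucket-backed volume and its mount, and request the L4 via the
    // nvidia.com/gpu extended resource so the container is actually given the GPU.
    jobTemplate.Spec.Volumes = append(jobTemplate.Spec.Volumes, cacheVol)
    jobTemplate.Spec.Containers[0].VolumeMounts = append(
        jobTemplate.Spec.Containers[0].VolumeMounts, cacheVolMnt)
    jobTemplate.Spec.Containers[0].Resources.Limits = corev1.ResourceList{
        corev1.ResourceName("nvidia.com/gpu"): resource.MustParse("1"),
    }

    // Job pods must use a restart policy of Never or OnFailure; the name and
    // backoff limit below are illustrative, not Shortlist's real values.
    jobTemplate.Spec.RestartPolicy = corev1.RestartPolicyNever
    backoff := int32(3)
    job := batchv1.Job{
        ObjectMeta: metav1.ObjectMeta{Name: "assessor"},
        Spec: batchv1.JobSpec{
            BackoffLimit: &backoff,
            Template:     jobTemplate,
        },
    }
    // job would then be created through the clientset's BatchV1().Jobs(...) API.
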

Here’s the full code if you’re interested.

At the infrastructure level, I created the bucket manually so it persists when I tear down the dev cluster.

The two resources below set up the workload identity that allows my container to access the storage bucket. The google_storage_bucket_iam_binding grants my Google service account access to the bucket, while the google_service_account_iam_member resource references the Kubernetes ServiceAccount within the cluster, tying the two kinds of service account together.

resource "google_storage_bucket_iam_binding" "binding" {
  bucket = var.assessor_cache_bucket_name
  role   = "roles/storage.admin"
  members = [
    "serviceAccount:${google_service_account.kubernetes.email}"
  ]
}

resource "google_service_account_iam_member" "assessor_workload_identity_binding" {
  service_account_id = google_service_account.kubernetes.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${data.google_project.project.project_id}.svc.id.goog[shortlist/run-dev-shortlist-runner-assessor]"
}
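
The other half of that link lives inside the cluster: under the classic Workload Identity setup, the ServiceAccount named in the member string (run-dev-shortlist-runner-assessor in the shortlist namespace) also needs an iam.gke.io/gcp-service-account annotation pointing at the Google service account’s email. A minimal client-go sketch of that ServiceAccount, with a placeholder email, might look like this:

    // The annotation value must be the email of the Google service account granted
    // bucket access above; the address here is just a placeholder.
    ksa := corev1.ServiceAccount{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "run-dev-shortlist-runner-assessor",
            Namespace: "shortlist",
            Annotations: map[string]string{
                "iam.gke.io/gcp-service-account": "shortlist@my-project.iam.gserviceaccount.com",
            },
        },
    }
    // ksa would then be applied to the cluster (or templated by a Helm chart).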

Summing Up

This post detailed my journey of trying to run AI inference on the cheap while learning more about Google Cloud Kubernetes. I wanted to use the tools I’d be using in an actual job to maximise what I learned. If budget were my only constraint, I’d have tried to find a suitable free LLM API and could even have run Kubernetes on my own machine using something like Kind.

So far, running my development cluster for a day or so a month, my monthly Google Cloud bill hasn’t exceeded £40. Last month’s bill was £33.68, mostly for Compute Engine. Just over £7 of that was for “Nvidia L4 GPU attached to Spot Preemptible VMs running in Belgium”; the GPU itself only cost me 19p an hour. The other big charges seem to stem from development missteps. My second-largest charge was for persistent disks, probably from when I was loading the 70B Llama weights onto persistent volumes. Then there’s £5 worth of network egress from EMEA to the Americas, most likely from when I accidentally put my Google Storage bucket in a US region.

I’ll write another post after I’ve seen what the running cost of this project is. If it’s high, I might switch to running it on my laptop and talking to an LLM inference API. Most of the code, including the Terraform/Tofu helm release stuff, should be reusable.

