NVIDIA Tesla GPU Scheduling: From HPC to Kubernetes with Volcano MLOps

Akash Pawar

When You Need Gang Scheduling

Applications that require coordinated multi-pod execution:

  • Distributed ML Training: Multi-GPU model training (PyTorch DDP, TensorFlow Distributed)

  • High-Performance Computing: Weather simulation, molecular dynamics

  • Parallel Data Processing: Large-scale ETL with coordinated workers

  • Multi-node Databases: Distributed database initialization

The Pain Point: Traditional Kubernetes schedules pods individually. For a 4-GPU training job:

  • Pod 1 starts → claims 1 GPU

  • Pods 2-4 wait indefinitely for resources

  • Result: ~$8/hour burned while GPUs sit idle

Our Demo: Simulates distributed TensorFlow training requiring 2 GPUs across 2 nodes.

Repository: nvidia-gpu-volcano-k8s


The Problem

Traditional Kubernetes scheduling with expensive GPU workloads:

Job needs: 2 GPUs total
Available: 2 GPUs across 2 nodes

❌ Standard Scheduler:
Pod 1: Scheduled immediately (claims 1 GPU)
Pod 2: Waits for "sufficient resources" 
Result: 1 GPU idle, job fails

✅ Volcano Gang Scheduler: 
Both pods: Wait until 2 GPUs available
Then: Start simultaneously
Result: Efficient resource utilization

Implementation

1. GPU Node Setup

# GPU-enabled AMI with pre-installed NVIDIA drivers
# (replace <your-cluster> with your EKS cluster name)
eksctl create nodegroup \
  --cluster <your-cluster> \
  --node-type g4dn.xlarge \
  --nodes 2 --spot \
  --node-ami AL2_x86_64_GPU
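
Before moving on, it's worth confirming that both GPU nodes actually joined the cluster. This is a generic check, not a script from the repo:

# Both g4dn.xlarge nodes should show up as Ready
kubectl get nodes -L node.kubernetes.io/instance-type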

2. GPU Device Plugin and Taints

# Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

# Setup GPU node taints
chmod +x gpu_node_setup.sh
./gpu_node_setup.sh

From gpu_node_setup.sh:

#!/bin/bash
echo "Setting up GPU nodes for Volcano demo..."

# Get GPU node names
GPU_NODES=$(kubectl get nodes --no-headers | grep g4dn | awk '{print $1}')

# Apply standard NVIDIA GPU taints
echo "Applying NVIDIA GPU taints..."
for node in $GPU_NODES; do
    kubectl taint nodes $node nvidia.com/gpu=present:NoSchedule --overwrite
    echo "Tainted node: $node"
done
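
With the device plugin running and the taints applied, each node should advertise its GPU as an allocatable resource. A quick sanity check (again, not part of the repo scripts):

# Each g4dn.xlarge should report 1 allocatable GPU
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

# Confirm the NoSchedule taint landed on the GPU nodes
kubectl describe nodes | grep "Taints:"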

3. Install Volcano

helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace
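
Once the install finishes, kubectl get pods -n volcano-system should show the scheduler, controllers, and admission webhook running. The job in the next step is submitted to a queue called ml-queue, created from manifests/volcano-queue.yaml in the repo. A minimal Queue definition looks roughly like this (the repo's actual file may differ; weight and reclaimable here are assumed values):

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-queue
spec:
  weight: 1          # relative share of cluster resources
  reclaimable: true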

4. Gang Scheduled Job

From manifests/tensorflow-job.yaml:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-tf-job
spec:
  minAvailable: 2          # Critical: Both pods or none
  schedulerName: volcano   # Use Volcano instead of default
  queue: ml-queue
  tasks:
    - replicas: 2
      name: trainer
      template:
        spec:
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: "node.kubernetes.io/instance-type"
                    operator: In
                    values: ["g4dn.xlarge", "g4dn.2xlarge", "g5.xlarge"]
          containers:
            - name: trainer
              image: akash202k/tf-volcano-demo:v1
              resources:
                requests:
                  cpu: "2"
                  memory: "4Gi"
                  nvidia.com/gpu: 1
                limits:
                  cpu: "2"
                  memory: "4Gi"
                  nvidia.com/gpu: 1
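
Under the hood, Volcano turns this Job into a PodGroup whose minMember equals minAvailable, and that PodGroup is what the gang decision is made on. A generic way to inspect it (the PodGroup name is derived from the job name and may carry a suffix depending on the Volcano version):

# The Volcano Job and the PodGroup derived from it
kubectl get vcjob gang-tf-job
kubectl get podgroup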

Demo: Pain Point Simulation

Normal Gang Scheduling

kubectl apply -f manifests/volcano-queue.yaml
kubectl apply -f manifests/tensorflow-job.yaml
kubectl get pods -l volcano.sh/job-name=gang-tf-job -w

Result: Both pods start simultaneously ✅

Resource Contention (The Pain Point)

From manifests/gpu-blocker-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: blocker
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: stress
      image: nvidia/cuda:11.0-base
      command: ["/bin/sh"]
      args: ["-c", "sleep 60"]
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1

# Simulate real-world contention - block 1 GPU
kubectl apply -f manifests/gpu-blocker-pod.yaml

# Deploy gang job (needs 2 GPUs, only 1 available)
kubectl apply -f manifests/tensorflow-job.yaml

Traditional scheduler: Would start 1 pod and waste resources
Volcano gang scheduler: Keeps both pods Pending until sufficient resources are available ✅
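
While the blocker holds one of the two GPUs, both trainer pods sit in Pending and the PodGroup events explain why the gang can't be scheduled yet. A quick way to watch this (generic commands, not repo scripts):

# No partial start - both trainer pods stay Pending
kubectl get pods -l volcano.sh/job-name=gang-tf-job

# Gang scheduling events on the PodGroup
kubectl describe podgroup | grep -A10 Events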

Resource Release

kubectl delete pod blocker

Result: Both pods start together immediately ✅


Results Achieved

Gang Scheduling Success

NAME                    READY   STATUS      RESTARTS   AGE
gang-tf-job-trainer-0   0/1     Completed   0          3m46s
gang-tf-job-trainer-1   0/1     Completed   0          3m46s
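
The JSON summaries below are stored under logs/ in the repo; when running the demo yourself, the per-pod output can be checked against the completed trainer pods, for example:

# Inspect each trainer's output after completion
kubectl logs gang-tf-job-trainer-0
kubectl logs gang-tf-job-trainer-1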

GPU Training Confirmed

From our actual training logs in logs/volcano_training_success_20250726_045527_d4ead352.json:

Pod 1 (ap-southeast-1c):

{
  "aws_metadata": {
    "instance_id": "i-0bd485d90b62fcc01",
    "instance_type": "g4dn.xlarge",
    "availability_zone": "ap-southeast-1c"
  },
  "gpu_info": {
    "gpu_details": [{
      "details": {
        "compute_capability": [7, 5],
        "device_name": "Tesla T4"
      }
    }],
    "nvidia_smi": [{
      "name": "Tesla T4",
      "memory_total_mb": "15360",
      "memory_used_mb": "103",
      "utilization_percent": "0",
      "temperature_c": "26"
    }]
  },
  "training_results": {
    "device_used": "/GPU:0",
    "final_loss": 0.08402693271636963,
    "status": "success"
  }
}

Pod 2 (ap-southeast-1b) from logs/volcano_training_success_20250726_045528_65078d2b.json:

{
  "aws_metadata": {
    "instance_id": "i-0edff736d5e7d3a59",
    "instance_type": "g4dn.xlarge", 
    "availability_zone": "ap-southeast-1b"
  },
  "training_results": {
    "device_used": "/GPU:0",
    "final_loss": 0.08474162966012955,
    "status": "success"
  }
}

Perfect coordination: Both pods trained on separate Tesla T4 GPUs simultaneously across different AZs.


Pain Point Solved

Why Distributed Training Needs Gang Scheduling

Single Machine Limitations:

  • Large models can't fit in single GPU memory (e.g., LLaMA, GPT models)

  • Training time becomes prohibitive (weeks vs days)

  • Memory constraints limit batch sizes and model complexity

Distributed Training Requirements:

  • All workers must start together for synchronized gradient updates

  • Coordinated parameter sharing across nodes

  • Consistent training state - partial deployments corrupt the training process

Before Gang Scheduling

  • Resource waste: Partial deployments burn GPU time while waiting for missing workers

  • Training failures: Incomplete worker allocation breaks distributed algorithms

  • Cost inefficiency: $2-3/hour per idle GPU waiting for coordination

  • Model corruption: Partial worker sets produce invalid gradients

After Gang Scheduling

  • Resource efficiency: 100% GPU utilization across all workers simultaneously

  • Training reliability: All workers start together, ensuring proper distributed training

  • Cost control: No GPU time wasted on incomplete worker deployments

  • Model integrity: Consistent distributed training with complete worker sets

Demo Economics: $0.16 for complete coordinated training vs. potential hours of idle GPU costs waiting for worker coordination.


Conclusion

For distributed GPU workloads requiring multiple nodes, gang scheduling isn't optional—it's mandatory for functional training.

Single machine can't handle: Modern ML workloads requiring distributed computation
Volcano delivers: Coordinated multi-node scheduling that ensures training actually works

Repository: nvidia-gpu-volcano-k8s

Written by

Akash Pawar

DevOps | 3x AWS Certified | CKA