NVIDIA Tesla GPU Scheduling: From HPC to Kubernetes with Volcano MLOps


When You Need Gang Scheduling
Applications that require coordinated multi-pod execution:
Distributed ML Training: Multi-GPU model training (PyTorch DDP, TensorFlow Distributed)
High-Performance Computing: Weather simulation, molecular dynamics
Parallel Data Processing: Large-scale ETL with coordinated workers
Multi-node Databases: Distributed database initialization
The Pain Point: Traditional Kubernetes schedules pods individually. For a 4-GPU training job:
Pod 1 starts → claims 1 GPU
Pods 2-4 wait indefinitely for resources
Result: roughly $8/hour burns while the already-claimed GPU sits idle
Our Demo: Simulates distributed TensorFlow training requiring 2 GPUs across 2 nodes.
Repository: nvidia-gpu-volcano-k8s ⭐
The Problem
Traditional Kubernetes scheduling with expensive GPU workloads:
Job needs: 2 GPUs total
Available: 2 GPUs across 2 nodes
❌ Standard Scheduler:
Pod 1: Scheduled immediately (claims 1 GPU)
Pod 2: Left Pending whenever anything else grabs the second GPU first
Result: 1 GPU sits idle, the job fails
✅ Volcano Gang Scheduler:
Both pods: Wait until 2 GPUs available
Then: Start simultaneously
Result: Efficient resource utilization
Implementation
1. GPU Node Setup
# GPU-enabled AMI with pre-installed NVIDIA drivers
eksctl create nodegroup \
  --cluster <cluster-name> \
  --node-type g4dn.xlarge \
  --nodes 2 --spot \
  --node-ami AL2_x86_64_GPU
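Once the nodegroup is up, a quick check (my own addition, not from the repo) confirms the GPU nodes joined the cluster:
# List the GPU nodes by the well-known instance-type label
kubectl get nodes -l node.kubernetes.io/instance-type=g4dn.xlarge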
2. GPU Support
# Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
# Setup GPU node taints
chmod +x gpu_node_setup.sh
./gpu_node_setup.sh
From gpu_node_setup.sh:
#!/bin/bash
echo "Setting up GPU nodes for Volcano demo..."

# Get GPU node names
GPU_NODES=$(kubectl get nodes --no-headers | grep g4dn | awk '{print $1}')

# Apply standard NVIDIA GPU taints
echo "Applying NVIDIA GPU taints..."
for node in $GPU_NODES; do
  kubectl taint nodes "$node" nvidia.com/gpu=present:NoSchedule --overwrite
  echo "Tainted node: $node"
done
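With the taints in place and the device plugin running, each GPU node should advertise an nvidia.com/gpu resource. A quick verification (my own check, not part of gpu_node_setup.sh):
# Confirm every g4dn node reports nvidia.com/gpu in its capacity/allocatable
for node in $(kubectl get nodes --no-headers | grep g4dn | awk '{print $1}'); do
  echo "--- $node ---"
  kubectl describe node "$node" | grep -i "nvidia.com/gpu"
done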
3. Install Volcano
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace
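Before submitting jobs, it is worth confirming the Volcano components and CRDs are in place (standard checks, not from the repo):
# Scheduler, controllers, and admission webhook should all be Running
kubectl get pods -n volcano-system
# CRDs installed by the chart, e.g. jobs.batch.volcano.sh and podgroups.scheduling.volcano.sh
kubectl get crd | grep volcano.sh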
4. Gang Scheduled Job
From manifests/tensorflow-job.yaml:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-tf-job
spec:
  minAvailable: 2         # Critical: both pods or none
  schedulerName: volcano  # Use Volcano instead of the default scheduler
  queue: ml-queue
  tasks:
    - replicas: 2
      name: trainer
      template:
        spec:
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: "node.kubernetes.io/instance-type"
                        operator: In
                        values: ["g4dn.xlarge", "g4dn.2xlarge", "g5.xlarge"]
          containers:
            - name: trainer
              image: akash202k/tf-volcano-demo:v1
              resources:
                requests:
                  cpu: "2"
                  memory: "4Gi"
                  nvidia.com/gpu: 1
                limits:
                  cpu: "2"
                  memory: "4Gi"
                  nvidia.com/gpu: 1
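The job submits into queue: ml-queue, which manifests/volcano-queue.yaml creates before the job is applied. That file isn't reproduced in this post, but a minimal queue for the demo would look roughly like the sketch below (the weight and reclaimable values are assumptions, not taken from the repo):
# Hypothetical minimal Volcano queue; the repo's manifests/volcano-queue.yaml may differ
cat <<'EOF' | kubectl apply -f -
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-queue
spec:
  weight: 1          # relative share when queues compete for cluster capacity
  reclaimable: true  # let other queues reclaim resources this queue borrowed
EOF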
Demo: Pain Point Simulation
Normal Gang Scheduling
kubectl apply -f manifests/volcano-queue.yaml
kubectl apply -f manifests/tensorflow-job.yaml
kubectl get pods -l volcano.sh/job-name=gang-tf-job -w
Result: Both pods start simultaneously ✅
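Under the hood, Volcano tracks the gang as a PodGroup with a minimum member count; two quick looks (assuming the default namespace) show the gang being honored:
# The PodGroup created for the job reflects whether the gang can run
kubectl get podgroups
# -o wide confirms the two trainers landed on separate GPU nodes
kubectl get pods -l volcano.sh/job-name=gang-tf-job -o wide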
Resource Contention (The Pain Point)
From manifests/gpu-blocker-pod.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: blocker
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: stress
      image: nvidia/cuda:11.0-base
      command: ["/bin/sh"]
      args: ["-c", "sleep 60"]
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1  # extended resources require limits matching requests
# Simulate real-world contention - block 1 GPU
kubectl apply -f manifests/gpu-blocker-pod.yaml
# Deploy gang job (needs 2 GPUs, only 1 available)
kubectl apply -f manifests/tensorflow-job.yaml
Traditional scheduler would: Start 1 pod, waste resources
Volcano gang scheduler: Keeps both pods Pending until sufficient resources are available ✅
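While the blocker holds one of the two GPUs, both trainer pods sit in Pending together; the pod list and scheduling events make the gang decision visible (a quick inspection, not from the repo):
# Neither trainer starts alone
kubectl get pods -l volcano.sh/job-name=gang-tf-job
# The events on a pending trainer explain why Volcano is holding the gang back
kubectl describe pod gang-tf-job-trainer-0 | tail -n 20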
Resource Release
kubectl delete pod blocker
Result: Both pods start together immediately ✅
Results Achieved
Gang Scheduling Success
NAME                    READY   STATUS      RESTARTS   AGE
gang-tf-job-trainer-0   0/1     Completed   0          3m46s
gang-tf-job-trainer-1   0/1     Completed   0          3m46s
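The Volcano job object reports the same outcome; vcjob is the short name registered for jobs.batch.volcano.sh, and the pod logs hold the training output:
# Job-level view of the gang
kubectl get vcjob gang-tf-job
# Training output from one of the trainers
kubectl logs gang-tf-job-trainer-0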
GPU Training Confirmed
From our actual training logs in logs/volcano_training_success_20250726_045527_d4ead352.json:
Pod 1 (ap-southeast-1c):
{
  "aws_metadata": {
    "instance_id": "i-0bd485d90b62fcc01",
    "instance_type": "g4dn.xlarge",
    "availability_zone": "ap-southeast-1c"
  },
  "gpu_info": {
    "gpu_details": [{
      "details": {
        "compute_capability": [7, 5],
        "device_name": "Tesla T4"
      }
    }],
    "nvidia_smi": [{
      "name": "Tesla T4",
      "memory_total_mb": "15360",
      "memory_used_mb": "103",
      "utilization_percent": "0",
      "temperature_c": "26"
    }]
  },
  "training_results": {
    "device_used": "/GPU:0",
    "final_loss": 0.08402693271636963,
    "status": "success"
  }
}
Pod 2 (ap-southeast-1b), from logs/volcano_training_success_20250726_045528_65078d2b.json:
{
  "aws_metadata": {
    "instance_id": "i-0edff736d5e7d3a59",
    "instance_type": "g4dn.xlarge",
    "availability_zone": "ap-southeast-1b"
  },
  "training_results": {
    "device_used": "/GPU:0",
    "final_loss": 0.08474162966012955,
    "status": "success"
  }
}
Perfect coordination: Both pods trained on separate Tesla T4 GPUs simultaneously across different AZs.
Pain Point Solved
Why Distributed Training Needs Gang Scheduling
Single Machine Limitations:
Large models can't fit in a single GPU's memory (e.g., LLaMA- and GPT-class models)
Training time becomes prohibitive (weeks instead of days)
Memory constraints limit batch sizes and model complexity
Distributed Training Requirements:
All workers must start together for synchronized gradient updates (see the sketch after this list)
Coordinated parameter sharing across nodes
Consistent training state - partial deployments corrupt the training process
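To make the coordination requirement concrete: TensorFlow's MultiWorkerMirroredStrategy reads a TF_CONFIG environment variable that lists every worker, and its collective operations block until all listed workers are reachable, so a partially scheduled gang simply stalls. A hypothetical TF_CONFIG for a two-worker job like this one (the hostnames are illustrative, not taken from the demo image):
# Worker 0's view of a two-worker cluster; worker 1 would set "index": 1
export TF_CONFIG='{
  "cluster": {
    "worker": ["gang-tf-job-trainer-0:2222", "gang-tf-job-trainer-1:2222"]
  },
  "task": {"type": "worker", "index": 0}
}'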
Before Gang Scheduling
Resource waste: Partial deployments burn GPU time while waiting for missing workers
Training failures: Incomplete worker allocation breaks distributed algorithms
Cost inefficiency: $2-3/hour per idle GPU waiting for coordination
Model corruption: Partial worker sets produce invalid gradients
After Gang Scheduling
Resource efficiency: 100% GPU utilization across all workers simultaneously
Training reliability: All workers start together, ensuring proper distributed training
Cost control: No GPU time wasted on incomplete worker deployments
Model integrity: Consistent distributed training with complete worker sets
Demo Economics: $0.16 for complete coordinated training vs. potential hours of idle GPU costs waiting for worker coordination.
Conclusion
For distributed GPU workloads requiring multiple nodes, gang scheduling isn't optional—it's mandatory for functional training.
A single machine can't handle: modern ML workloads that require distributed computation
Volcano delivers: Coordinated multi-node scheduling that ensures training actually works
Repository: nvidia-gpu-volcano-k8s ⭐
Written by Akash Pawar | DevOps | 3x AWS Certified | CKA