Dynamic Resource Allocation for better device usage efficiency


Introduction
Dynamic Resource Allocation (DRA) is a new Kubernetes feature that addresses the pain point of managing the assignment of hardware devices, such as GPUs, to workloads. It can be understood as a higher-level generalization of the storage allocation pattern in Kubernetes built on APIs like StorageClass, PV and PVC. Analogous to those, you have DeviceClass, ResourceClaimTemplate and ResourceClaim. I tried this feature with the help of the dra-example-driver created by kubernetes-sigs (https://github.com/kubernetes-sigs/dra-example-driver).
The Need for DRA
Since the AI boom, there has been an increased focus in the Kubernetes community on addressing the pain points of AI/ML workloads. One such problem is Kubernetes' ability to allocate GPUs, schedule pods onto the nodes that have them, and share those devices between pods. The feature is not limited to GPUs either: it also applies to other hardware, such as network interfaces, for which one currently has to rely on solutions like Multus.
DRA is a step in the right direction towards closing this gap between the hardware and the pods that use it.
How does it work?
Here is how the Kubernetes documentation defines the APIs:
ResourceClaim
Describes a request for access to resources in the cluster, for use by workloads. For example, if a workload needs an accelerator device with specific properties, this is how that request is expressed. The status stanza tracks whether this claim has been satisfied and what specific resources have been allocated.
ResourceClaimTemplate
Defines the spec and some metadata for creating ResourceClaims. Created by a user when deploying a workload. The per-Pod ResourceClaims are then created and removed by Kubernetes automatically.
DeviceClass
Contains pre-defined selection criteria for certain devices and configuration for them. DeviceClasses are created by a cluster administrator when installing a resource driver. Each request to allocate a device in a ResourceClaim must reference exactly one DeviceClass.
ResourceSlice
Used by DRA drivers to publish information about resources (typically devices) that are available in the cluster.
DeviceTaintRule
Used by admins or control plane components to add device taints to the devices described in ResourceSlices.
So, putting it together: when a device driver is installed, it ships with its own DeviceClass (analogous to a StorageClass) and publishes ResourceSlices that track which devices of that class are available. Pods then reference either a ResourceClaim or a ResourceClaimTemplate.
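For example, here is a minimal sketch of what a driver's DeviceClass might look like, using a CEL selector to match every device published by that driver. The name and driver domain are borrowed from the example driver used below; the class shipped by a real driver may look different.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  # Match every device published by this driver
  selectors:
  - cel:
      expression: device.driver == "gpu.example.com"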
Testing DRA
I created a kind cluster using the latest version, 1.33. We need to explicitly enable the DRA feature gate and the resource.k8s.io API group. Here is the kind config:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  DynamicResourceAllocation: true
containerdConfigPatches:
# Enable CDI as described in
# https://tags.cncf.io/container-device-interface#containerd-configuration
- |-
  [plugins."io.containerd.grpc.v1.cri"]
    enable_cdi = true
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    apiServer:
      extraArgs:
        runtime-config: "resource.k8s.io/v1beta1=true"
    scheduler:
      extraArgs:
        v: "1"
    controllerManager:
      extraArgs:
        v: "1"
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        v: "1"
- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        v: "1"
Once the cluster is created, we need to install the dra-example-driver Helm chart.
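A typical invocation, assuming a local checkout of the repo and that the chart still lives under deployments/helm/ (check the repo's README for the current path), looks like this:
helm upgrade -i \
  --create-namespace \
  --namespace dra-example-driver \
  dra-example-driver \
  deployments/helm/dra-example-driver
This creates the kubelet plugin daemon-set: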
➜ kind k get daemonsets.apps -n dra-example-driver
NAME                               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
dra-example-driver-kubeletplugin   1         1         1       1            1           <none>          8h
You can now see the device class with the kubectl get deviceclasses command, and you can also list the ResourceSlices published for that device class:
➜ dra-example-driver git:(main) kubectl get resourceslice -o yaml
apiVersion: v1
items:
- apiVersion: resource.k8s.io/v1beta1
  kind: ResourceSlice
  metadata:
    creationTimestamp: "2025-05-25T20:02:27Z"
    generateName: kind-worker-gpu.example.com-
    generation: 1
    name: kind-worker-gpu.example.com-tkbqm
    ownerReferences:
    - apiVersion: v1
      controller: true
      kind: Node
      name: kind-worker
      uid: 3c2af9d8-c6e1-449d-8869-85bc68131687
    resourceVersion: "1010"
    uid: ffeeda3f-b43b-4be9-bb8f-e6937284907b
  spec:
    devices:
    - basic:
        attributes:
          driverVersion:
            version: 1.0.0
          index:
            int: 6
          model:
            string: LATEST-GPU-MODEL
          uuid:
            string: gpu-121b8219-b8d6-015c-b2eb-1e320ee07510
        capacity:
          memory:
            value: 80Gi
      name: gpu-6
    - basic:
        attributes:
          driverVersion:
            version: 1.0.0
          index:
            int: 7
          model:
            string: LATEST-GPU-MODEL
          uuid:
            string: gpu-f270be4e-7cd6-da75-39e7-b707122f9b70
        capacity:
          memory:
            value: 80Gi
      name: gpu-7
    - basic:
        attributes:
          driverVersion:
            version: 1.0.0
          index:
            int: 0
          model:
            string: LATEST-GPU-MODEL
          uuid:
            string: gpu-4cbf87f3-433e-6717-5588-c33e6886832f
        capacity:
          memory:
            value: 80Gi
      name: gpu-0
    - basic:
        attributes:
          driverVersion:
            version: 1.0.0
          index:
            int: 1
          model:
            string: LATEST-GPU-MODEL
          uuid:
            string: gpu-58bd415e-dee8-f0a5-ca03-02d000554b1a
        capacity:
          memory:
            value: 80Gi
      name: gpu-1
    - basic:
        attributes:
          driverVersion:
            version: 1.0.0
          index:
            int: 2
          model:
            string: LATEST-GPU-MODEL
          uuid:
            string: gpu-6ab67185-8eff-3a23-32fd-75bfbe37b488
        capacity:
          memory:
            value: 80Gi
      name: gpu-2
    - basic:
        attributes:
          driverVersion:
            version: 1.0.0
          index:
            int: 3
          model:
            string: LATEST-GPU-MODEL
          uuid:
            string: gpu-6b77fb80-2d68-809d-4bf1-285e5f47dcc5
        capacity:
          memory:
            value: 80Gi
      name: gpu-3
    - basic:
        attributes:
          driverVersion:
            version: 1.0.0
          index:
            int: 4
          model:
            string: LATEST-GPU-MODEL
          uuid:
            string: gpu-417d66cd-4546-0786-59a3-ef7eb54c564d
        capacity:
          memory:
            value: 80Gi
      name: gpu-4
    - basic:
        attributes:
          driverVersion:
            version: 1.0.0
          index:
            int: 5
          model:
            string: LATEST-GPU-MODEL
          uuid:
            string: gpu-f0fdf728-dccb-f484-bbf5-33f63a90b820
        capacity:
          memory:
            value: 80Gi
      name: gpu-5
    driver: gpu.example.com
    nodeName: kind-worker
    pool:
      generation: 1
      name: kind-worker
      resourceSliceCount: 1
kind: List
metadata:
  resourceVersion: ""
It lists all the GPUs available in the cluster (the example driver simulates eight GPUs on the worker node). Next, I ran the example pods from the repo and observed their behaviour.
For example, gpu-test-1 consists of:
Two pods, one container each
Each container asking for 1 distinct GPU
Roughly, the manifest for this looks like the sketch below.
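This is only an approximation based on the resource.k8s.io/v1beta1 API, not the repo's exact gpu-test-1.yaml: a shared ResourceClaimTemplate, plus each pod referencing it so that Kubernetes generates a dedicated ResourceClaim per pod (pod1 is identical to pod0 apart from its name).
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com
---
apiVersion: v1
kind: Pod
metadata:
  name: pod0
spec:
  containers:
  - name: ctr
    image: ubuntu:24.04
    command: ["bash", "-c", "export; sleep 9999"]
    resources:
      claims:
      - name: gpu   # refers to the pod-level claim below
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu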
Once I applied the manifests, here are the ResourceClaims (RCs) and ResourceClaimTemplates (RCTs) that I could see:
➜ demo git:(main) k get resourceclaim
NAME             STATE                AGE
pod0-gpu-724sp   allocated,reserved   106s
pod1-gpu-kvs27   allocated,reserved   106s
➜ demo git:(main) k get resourceclaimtemplates.resource.k8s.io
NAME         AGE
single-gpu   2m4s
Here are the logs from one of the pods, showing the GPU it was allocated; the example driver surfaces the allocation as environment variables:
➜ demo git:(main) k logs pod0
declare -x DRA_RESOURCE_DRIVER_NAME="gpu.example.com"
declare -x GPU_DEVICE_6="gpu-6"
declare -x GPU_DEVICE_6_RESOURCE_CLAIM="0b66bdc4-7112-4eb7-a371-6df9e0a08167"
declare -x GPU_DEVICE_6_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_6_TIMESLICE_INTERVAL="Default"
declare -x HOME="/root"
declare -x HOSTNAME="pod0"
declare -x KUBERNETES_NODE_NAME="kind-worker"
declare -x KUBERNETES_PORT="tcp://10.96.0.1:443"
declare -x KUBERNETES_PORT_443_TCP="tcp://10.96.0.1:443"
declare -x KUBERNETES_PORT_443_TCP_ADDR="10.96.0.1"
declare -x KUBERNETES_PORT_443_TCP_PORT="443"
declare -x KUBERNETES_PORT_443_TCP_PROTO="tcp"
declare -x KUBERNETES_SERVICE_HOST="10.96.0.1"
declare -x KUBERNETES_SERVICE_PORT="443"
declare -x KUBERNETES_SERVICE_PORT_HTTPS="443"
declare -x OLDPWD
declare -x PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
declare -x PWD="/"
declare -x SHLVL="1"
I tried the other example definition files as well, and everything worked as expected.
Conclusion
Though this was just a test with an example driver, I believe DRA will behave the same way in real-world clusters. DRA is still in its early stages, so a lot of new features can still be added, and I expect cloud providers and device vendors to come up with their own drivers and device classes to make hardware available to pods.