NVIDIA DRA Driver for GPUs#
Dynamic Resource Allocation (DRA) is a Kubernetes concept for flexibly requesting, configuring, and sharing specialized devices like GPUs. This page describes how to install and upgrade the DRA Driver for NVIDIA GPUs v0.4.0 with the NVIDIA GPU Operator.
Before using the DRA Driver for NVIDIA GPUs, it is recommended that you be familiar with the following:
Overview#
With the DRA Driver for NVIDIA GPUs, your Kubernetes workloads can allocate and consume the following two types of resources:
GPU allocation: for controlled sharing and dynamic reconfiguration of GPUs. This functionality replaces the traditional GPU allocation method used by the NVIDIA Kubernetes Device Plugin.
ComputeDomains: an abstraction for robust and secure Multi-Node NVLink (MNNVL) for NVIDIA GB200 and similar systems.
You can use these features independently or together in the same cluster.
Known Issues#
This section covers known issues for the DRA Driver when used with the NVIDIA GPU Operator. For known issues specific to the DRA Driver itself, refer to the DRA Driver v0.4.0 release notes.
There is a known issue where the NVIDIA Driver Manager is not aware of the DRA driver kubelet plugin and will not correctly evict it on pod restarts. You must label the nodes you plan to use with DRA GPU allocation and pass the node label to the GPU Operator Helm command in the `driver.manager.env` flag. This enables the NVIDIA Driver Manager to evict the GPU kubelet plugin correctly on driver container upgrades.
Prerequisites#
Installing GPU Operator v26.3.1 (in the next step) configures the following DRA Driver for NVIDIA GPUs prerequisites for you:
Container Device Interface (CDI) to be enabled in the underlying container runtime (such as containerd or CRI-O)
NVIDIA Driver version 580 or later.
Deploying Node Feature Discovery (NFD) and GPU Feature Discovery (GFD).
Refer to the DRA Driver for NVIDIA GPUs prerequisites documentation for more information.
Make sure your GPUs and cluster align with the GPU Operator support matrix. The DRA Driver also requires the following:
Kubernetes v1.34.2 or later.
Note
If you plan to use traditional extended resource requests such as `nvidia.com/gpu` alongside the DRA driver, you must enable the DRAExtendedResource feature gate. This allows the scheduler to translate extended resource requests into ResourceClaims for the DRA driver.

For GPU allocation, label nodes that will support GPU allocation with `nvidia.com/dra-kubelet-plugin=true` and use this label as a node selector in the DRA driver Helm chart. This is required to avoid the known issue when using the GPU Operator with the DRA Driver. Steps for labeling nodes are provided in the Install section. The label is also passed to the GPU Operator Helm command via the `driver.manager.env` flag.
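As a minimal sketch of the feature-gate prerequisite, on a kubeadm-managed cluster (an assumption; other distributions expose feature gates through their own configuration mechanisms) the DRAExtendedResource gate can be enabled on the API server and scheduler:

```yaml
# Sketch for a kubeadm-managed cluster (assumption); the extraArgs layout
# follows the v1beta4 kubeadm API.
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
apiServer:
  extraArgs:
  - name: feature-gates
    value: DRAExtendedResource=true
scheduler:
  extraArgs:
  - name: feature-gates
    value: DRAExtendedResource=true
```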
Additional prerequisites for ComputeDomain:
NVIDIA Grace Blackwell GPUs with Multi-Node NVLink (MNNVL) available on your cluster. For example, NVIDIA HGX GB200 NVL72 or NVIDIA HGX GB300 NVL72. Refer to the NVIDIA Multi-Node NVLink Systems documentation for details on Multi-Node NVLink systems.
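Once these prerequisites are met, workloads opt into MNNVL by referencing a ComputeDomain resource. A minimal sketch follows; the resource names are illustrative, and the API group and field names follow the upstream examples, so confirm the exact schema against the v0.4.0 reference:

```yaml
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: example-compute-domain   # illustrative name
spec:
  numNodes: 2                    # number of nodes expected to join the domain
  channel:
    resourceClaimTemplate:
      # The driver generates this ResourceClaimTemplate; pods reference it
      # to receive a communication channel into the compute domain.
      name: example-compute-domain-channel
```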
Install#
This section covers fresh installs of the GPU Operator and DRA Driver for NVIDIA GPUs. If you are upgrading from an earlier version of the DRA Driver for NVIDIA GPUs, refer to the Upgrade section.
Note
The `nvidiaDriverRoot` flag sets the root directory for the NVIDIA GPU driver.
The default value is `/`, which is typical for drivers installed directly on the host.
With GPU Operator–managed drivers (the default), drivers are installed to `/run/nvidia/driver`.
If you are using pre-installed drivers, remove the `nvidiaDriverRoot` flag or set it to `/`.
Label every node that will support GPU allocation through DRA:
```shell
kubectl label node $HOSTNAME nvidia.com/dra-kubelet-plugin=true
```

Add the NVIDIA Helm repository:

```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update
```
Install the GPU Operator with the NVIDIA Kubernetes Device Plugin disabled:
```shell
helm upgrade --install gpu-operator nvidia/gpu-operator \
    --version=v26.3.1 \
    --create-namespace \
    --namespace gpu-operator \
    --set devicePlugin.enabled=false \
    --set driver.manager.env[0].name=NODE_LABEL_FOR_GPU_POD_EVICTION \
    --set driver.manager.env[0].value="nvidia.com/dra-kubelet-plugin"
```
Make sure the value of `driver.manager.env` matches the node label applied in step 1. Make sure the `devicePlugin.enabled` flag is set to `false` to disable the NVIDIA Kubernetes Device Plugin; the DRA Driver for NVIDIA GPUs will be used to allocate GPUs instead.

Refer to the GPU Operator installation guide for additional configuration options. If you plan to use MIG devices, refer to the GPU Operator MIG documentation to configure your cluster for MIG support.
Create a `values.yaml` file for the DRA driver:

```yaml
image:
  pullPolicy: IfNotPresent
kubeletPlugin:
  nodeSelector:
    nvidia.com/dra-kubelet-plugin: "true"
```
If you are using Google Kubernetes Engine (GKE), the DRA driver requires additional overrides for the driver root, controller affinity, and tolerations:
```yaml
# GKE helm values example
# "/home/kubernetes/bin/nvidia" is the default driver root on GKE.
nvidiaDriverRoot: "/home/kubernetes/bin/nvidia"
controller:
  priorityClassName: ""
  affinity: null
image:
  pullPolicy: IfNotPresent
kubeletPlugin:
  priorityClassName: ""
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  nodeSelector:
    nvidia.com/dra-kubelet-plugin: "true"
```
Install the DRA driver:
```shell
helm upgrade -i dra-driver-nvidia-gpu nvidia/dra-driver-nvidia-gpu \
    --version=0.4.0 \
    --namespace nvidia-dra-driver-gpu \
    --create-namespace \
    --set nvidiaDriverRoot=/run/nvidia/driver \
    --set gpuResourcesEnabledOverride=true \
    -f values.yaml
```
For GKE, omit `--set nvidiaDriverRoot=/run/nvidia/driver`; the value comes from the GKE `values.yaml` file.
Add the NVIDIA Helm repository:

```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update
```

Install the GPU Operator:

```shell
helm upgrade --install gpu-operator nvidia/gpu-operator \
    --version=v26.3.1 \
    --create-namespace \
    --namespace gpu-operator
```
Refer to the GPU Operator installation guide for additional configuration options.
Install the DRA driver.
Example for an Operator-provided GPU driver:
```shell
helm upgrade -i dra-driver-nvidia-gpu nvidia/dra-driver-nvidia-gpu \
    --version=0.4.0 \
    --namespace nvidia-dra-driver-gpu \
    --create-namespace \
    --set resources.gpus.enabled=false \
    --set nvidiaDriverRoot=/run/nvidia/driver
```
Example for a host-provided GPU driver:
```shell
helm upgrade -i dra-driver-nvidia-gpu nvidia/dra-driver-nvidia-gpu \
    --version=0.4.0 \
    --namespace nvidia-dra-driver-gpu \
    --create-namespace \
    --set resources.gpus.enabled=false
```
Validate Installation#
Confirm that the DRA driver components are running:
```shell
kubectl get pods -n nvidia-dra-driver-gpu
```

Example Output

```
NAME                                                READY   STATUS    RESTARTS   AGE
dra-driver-nvidia-gpu-controller-67cb99d84b-5q7kj   1/1     Running   0          7m26s
dra-driver-nvidia-gpu-kubelet-plugin-h5xsn          1/1     Running   0          7m27s
```
The controller pod runs the ComputeDomain controller (1 container). The kubelet-plugin pod runs two containers, one for GPU resources (gpus) and one for ComputeDomain resources (compute-domains), so it shows 2/2 when both are enabled. One kubelet-plugin pod appears per GPU node.
If you installed with `--set resources.computeDomains.enabled=false`, the controller pod will not be present and the kubelet-plugin pod will show 1/1. The same is true if you disabled GPU allocation during install.
Note
If you upgraded an existing v25.x installation, the pod names retain the `nvidia-dra-driver-gpu-` prefix (for example, `nvidia-dra-driver-gpu-controller-*`) because the upgrade preserves the original resource names through the `nameOverride` flag.

Verify that the DeviceClasses are available:
```shell
kubectl get deviceclass
```

Example Output

```
NAME                                        AGE
compute-domain-daemon.nvidia.com            55s
compute-domain-default-channel.nvidia.com   55s
gpu.nvidia.com                              55s
mig.nvidia.com                              55s
```
The `compute-domain-daemon.nvidia.com` and `compute-domain-default-channel.nvidia.com` DeviceClasses are installed when ComputeDomain support is enabled.
The `gpu.nvidia.com` and `mig.nvidia.com` DeviceClasses are installed when GPU allocation support is enabled.
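To exercise GPU allocation end to end, the following sketch creates a claim template against the `gpu.nvidia.com` DeviceClass and a pod that consumes it. The resource names and container image are illustrative, and the `resource.k8s.io/v1` API is GA as of Kubernetes v1.34:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu               # illustrative name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                 # illustrative name
spec:
  restartPolicy: Never
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
  containers:
  - name: ctr
    image: nvcr.io/nvidia/cuda:12.8.0-base-ubuntu24.04   # any CUDA-capable image
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: gpu                # binds the container to the claim above
```

If allocation succeeds, the pod logs list the granted GPU, and the generated ResourceClaim appears as `allocated,reserved` in `kubectl get resourceclaims`.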
Additional validation steps are available in the upstream DRA Driver documentation.
Upgrade#
Starting with v0.4.0, the DRA Driver for NVIDIA GPUs moved to kubernetes-sigs/dra-driver-nvidia-gpu and adopted semantic versioning.
The Helm chart was renamed from nvidia-dra-driver-gpu to dra-driver-nvidia-gpu and is published to new NGC Helm and container registries.
When upgrading from v25.x you must explicitly set nameOverride and --version to avoid creating duplicate Kubernetes manifests under different names.
Without --set nameOverride=nvidia-dra-driver-gpu, the upgrade creates new daemonsets and deployments under the new chart name instead of upgrading the existing resources in place.
Important
After upgrading to v0.4.0, downgrading to v25.x is not supported.
Upgrade from v25.x to v0.4.0#
Apply the v0.4.0 CRDs for ComputeDomains and ComputeDomainsCliques before upgrading the Helm chart. Refer to the v0.4.0 release page for the CRD manifests.
Run `helm upgrade` with `nameOverride` and `--version`, preserving any original install flags such as `gpuResourcesEnabledOverride` and `nvidiaDriverRoot`:

```shell
helm upgrade -i nvidia-dra-driver-gpu nvidia/dra-driver-nvidia-gpu \
    --version=0.4.0 \
    --namespace nvidia-dra-driver-gpu \
    --set nameOverride=nvidia-dra-driver-gpu \
    --set gpuResourcesEnabledOverride=true \
    --set nvidiaDriverRoot=/run/nvidia/driver
```
Verify the upgrade:
```shell
kubectl get pods -n nvidia-dra-driver-gpu
```

All controller and kubelet-plugin pods should reach `Running` status, and existing ResourceClaims should remain in the `allocated,reserved` state.
Refer to the upstream upgrade guide for additional detail.
Additional Documentation#
Refer to the DRA Driver for NVIDIA GPUs repository for additional documentation.
Release notes for the DRA Driver are available on the v0.4.0 releases page.