NVIDIA DRA Driver for GPUs#

Dynamic Resource Allocation (DRA) is a Kubernetes concept for flexibly requesting, configuring, and sharing specialized devices like GPUs. This page describes how to install and upgrade the DRA Driver for NVIDIA GPUs v0.4.0 with the NVIDIA GPU Operator.

Before using the DRA Driver for NVIDIA GPUs, you should be familiar with the following:

Overview#

With the DRA Driver for NVIDIA GPUs, your Kubernetes workloads can allocate and consume the following two types of resources:

  • GPU allocation: for controlled sharing and dynamic reconfiguration of GPUs. This functionality replaces the traditional GPU allocation method used by the NVIDIA Kubernetes Device Plugin.

  • ComputeDomains: an abstraction for robust and secure Multi-Node NVLink (MNNVL) for NVIDIA GB200 and similar systems.

You can use these features independently or together in the same cluster.
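
For example, with GPU allocation enabled, a workload can request a GPU through a ResourceClaimTemplate that references the gpu.nvidia.com DeviceClass. The following is a minimal sketch assuming the resource.k8s.io/v1 API (Kubernetes v1.34 or later); the object and request names are placeholders:

    apiVersion: resource.k8s.io/v1
    kind: ResourceClaimTemplate
    metadata:
      name: single-gpu            # placeholder name
    spec:
      spec:
        devices:
          requests:
          - name: gpu
            exactly:
              deviceClassName: gpu.nvidia.com

A pod consumes the claim by listing it under spec.resourceClaims and referencing it from a container's resources.claims.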

Known Issues#

This section covers known issues for the DRA Driver when used with the NVIDIA GPU Operator. For known issues specific to the DRA Driver itself, refer to the DRA Driver v0.4.0 release notes.

  • There is a known issue where the NVIDIA Driver Manager is not aware of the DRA driver kubelet plugin and does not evict it correctly on pod restarts. To work around this, label the nodes you plan to use for DRA GPU allocation and pass that node label to the GPU Operator Helm command through the driver.manager.env flag. This allows the NVIDIA Driver Manager to evict the GPU kubelet plugin correctly during driver container upgrades.

Prerequisites#

Installing GPU Operator v26.3.1 (in the next section) configures the following DRA Driver for NVIDIA GPUs prerequisites for you:

  • Enabling the Container Device Interface (CDI) in the underlying container runtime (such as containerd or CRI-O).

  • Installing NVIDIA Driver version 580 or later.

  • Deploying Node Feature Discovery (NFD) and GPU Feature Discovery (GFD).

Refer to the DRA Driver for NVIDIA GPUs prerequisites documentation for more information.

Make sure your GPUs and cluster align with the GPU Operator support matrix. The DRA Driver also requires the following:

  • Kubernetes v1.34.2 or later.

    Note

    If you plan to use traditional extended resource requests such as nvidia.com/gpu alongside the DRA driver, you must enable the DRAExtendedResource feature gate. This allows the scheduler to translate extended resource requests into ResourceClaims for the DRA driver.

  • For GPU allocation, label nodes that will support GPU allocation with nvidia.com/dra-kubelet-plugin=true and use this label as a node selector in the DRA driver Helm chart. This is required to avoid the known issue when using the GPU Operator with the DRA Driver. Steps for labeling nodes are provided in the install section. The label is also passed to the GPU Operator Helm command via the driver.manager.env flag.
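
The DRAExtendedResource feature gate mentioned above must be enabled on both the kube-apiserver and the kube-scheduler. The following is a sketch for a kubeadm-managed cluster; if your cluster is managed differently, pass the feature gate through your distribution's equivalent mechanism:

    # kubeadm ClusterConfiguration snippet (assumes a kubeadm-managed cluster)
    apiVersion: kubeadm.k8s.io/v1beta4
    kind: ClusterConfiguration
    apiServer:
      extraArgs:
      - name: feature-gates
        value: DRAExtendedResource=true
    scheduler:
      extraArgs:
      - name: feature-gates
        value: DRAExtendedResource=true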

Additional prerequisites for ComputeDomain:

  • NVIDIA Grace Blackwell GPUs with Multi-Node NVLink (MNNVL) available on your cluster. For example, NVIDIA HGX GB200 NVL72 or NVIDIA HGX GB300 NVL72. Refer to the NVIDIA Multi-Node NVLink Systems documentation for details on Multi-Node NVLink systems.

Install#

This section covers fresh installs of the GPU Operator and the DRA Driver for NVIDIA GPUs. If you are upgrading from an earlier version of the DRA Driver for NVIDIA GPUs, refer to the Upgrade section.

Note

The nvidiaDriverRoot flag sets the root directory for the NVIDIA GPU driver. The default value is /, which is typical for drivers installed directly on the host. With GPU Operator–managed drivers (default), drivers are installed to /run/nvidia/driver. If you are using pre-installed drivers, remove the nvidiaDriverRoot flag or set it to /.

  1. Label every node that will support GPU allocation through DRA:

    kubectl label node $HOSTNAME nvidia.com/dra-kubelet-plugin=true
    
  2. Add the NVIDIA Helm repository:

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
      && helm repo update
    
  3. Install the GPU Operator with the NVIDIA Kubernetes Device Plugin disabled:

    helm upgrade --install gpu-operator nvidia/gpu-operator \
      --version=v26.3.1 \
      --create-namespace \
      --namespace gpu-operator \
      --set devicePlugin.enabled=false \
      --set driver.manager.env[0].name=NODE_LABEL_FOR_GPU_POD_EVICTION \
      --set driver.manager.env[0].value="nvidia.com/dra-kubelet-plugin"
    

    Make sure the value of driver.manager.env matches the node label applied in step 1.

    Make sure the devicePlugin.enabled flag is set to false to disable the NVIDIA Kubernetes Device Plugin. The DRA Driver for NVIDIA GPUs will be used to allocate GPUs.

    Refer to the GPU Operator installation guide for additional configuration options. If you plan to use MIG devices, refer to the GPU Operator MIG documentation to configure your cluster for MIG support.

  4. Create a values.yaml file for the DRA driver:

    image:
      pullPolicy: IfNotPresent
    kubeletPlugin:
      nodeSelector:
        nvidia.com/dra-kubelet-plugin: "true"
    

    If you are using Google Kubernetes Engine (GKE), the DRA driver requires additional overrides for the driver root, controller affinity, and tolerations:

    # GKE helm values example
    # "/home/kubernetes/bin/nvidia" is the default driver root on GKE.
    nvidiaDriverRoot: "/home/kubernetes/bin/nvidia"
    
    controller:
      priorityClassName: ""
      affinity: null
    image:
      pullPolicy: IfNotPresent
    kubeletPlugin:
      priorityClassName: ""
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
      nodeSelector:
        nvidia.com/dra-kubelet-plugin: "true"
    
  5. Install the DRA driver:

    helm upgrade -i dra-driver-nvidia-gpu nvidia/dra-driver-nvidia-gpu \
      --version=0.4.0 \
      --namespace nvidia-dra-driver-gpu \
      --create-namespace \
      --set nvidiaDriverRoot=/run/nvidia/driver \
      --set gpuResourcesEnabledOverride=true \
      -f values.yaml
    

    For GKE, omit --set nvidiaDriverRoot=/run/nvidia/driver; the value comes from the GKE values.yaml file.

To install the DRA driver with only ComputeDomain support (GPU allocation disabled), use the following steps instead:

  1. Add the NVIDIA Helm repository:

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
    
  2. Install the GPU Operator:

    helm upgrade --install gpu-operator nvidia/gpu-operator \
      --version=v26.3.1 \
      --create-namespace \
      --namespace gpu-operator
    

    Refer to the GPU Operator installation guide for additional configuration options.

  3. Install the DRA driver.

    Example for an Operator-provided GPU driver:

    helm upgrade -i dra-driver-nvidia-gpu nvidia/dra-driver-nvidia-gpu \
      --version=0.4.0 \
      --namespace nvidia-dra-driver-gpu \
      --create-namespace \
      --set resources.gpus.enabled=false \
      --set nvidiaDriverRoot=/run/nvidia/driver
    

    Example for a host-provided GPU driver:

    helm upgrade -i dra-driver-nvidia-gpu nvidia/dra-driver-nvidia-gpu \
      --version=0.4.0 \
      --namespace nvidia-dra-driver-gpu \
      --create-namespace \
      --set resources.gpus.enabled=false
    
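
With ComputeDomain support installed, a workload requests a domain by creating a ComputeDomain custom resource. The following is a minimal sketch; the API version and field names should be checked against the CRDs installed by your driver version, and the object names are placeholders:

    apiVersion: resource.nvidia.com/v1beta1
    kind: ComputeDomain
    metadata:
      name: my-compute-domain               # placeholder name
    spec:
      numNodes: 0
      channel:
        resourceClaimTemplate:
          name: my-compute-domain-channel   # placeholder; referenced by workload pods

Workload pods then reference the generated ResourceClaimTemplate in spec.resourceClaims to join the domain.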

Validate Installation#

  1. Confirm that the DRA driver components are running:

    kubectl get pods -n nvidia-dra-driver-gpu
    

    Example Output

    NAME                                                READY   STATUS    RESTARTS   AGE
    dra-driver-nvidia-gpu-controller-67cb99d84b-5q7kj   1/1     Running   0          7m26s
    dra-driver-nvidia-gpu-kubelet-plugin-h5xsn          2/2     Running   0          7m27s
    

    The controller pod runs the ComputeDomain controller (1 container). The kubelet-plugin pod runs two containers, one for GPU resources (gpus) and one for ComputeDomain resources (compute-domains), so it shows 2/2 when both are enabled. One kubelet-plugin pod appears per GPU node.

    If you installed with --set resources.computeDomains.enabled=false, the controller pod is not present and the kubelet-plugin pod shows 1/1. If you disabled GPU allocation during install instead, the kubelet-plugin pod also shows 1/1, but the ComputeDomain controller pod still runs.

    Note

    If you upgraded an existing v25.x installation, the pod names retain the nvidia-dra-driver-gpu- prefix (for example, nvidia-dra-driver-gpu-controller-*) because the upgrade preserves the original resource names through the nameOverride flag.

  2. Verify that GPU DeviceClasses are available:

    kubectl get deviceclass
    

    Example Output

    NAME                                        AGE
    compute-domain-daemon.nvidia.com            55s
    compute-domain-default-channel.nvidia.com   55s
    gpu.nvidia.com                              55s
    mig.nvidia.com                              55s
    

The compute-domain-daemon.nvidia.com and compute-domain-default-channel.nvidia.com DeviceClasses are installed when ComputeDomain support is enabled. The gpu.nvidia.com and mig.nvidia.com DeviceClasses are installed when GPU allocation support is enabled.
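
To validate GPU allocation end to end, you can run a short-lived pod that consumes a claim against the gpu.nvidia.com DeviceClass. The following is a minimal sketch that assumes a ResourceClaimTemplate named single-gpu (a placeholder) already exists in the namespace; the container image is also a placeholder:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-claim-test
    spec:
      restartPolicy: Never
      containers:
      - name: ctr
        image: nvcr.io/nvidia/cuda:12.8.0-base-ubuntu24.04   # placeholder image
        command: ["nvidia-smi", "-L"]
        resources:
          claims:
          - name: gpu
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: single-gpu

If allocation succeeds, the pod logs list the allocated GPU and the associated ResourceClaim reaches the allocated,reserved state.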

Additional validation steps are available in the upstream DRA Driver documentation.

Upgrade#

Starting with v0.4.0, the DRA Driver for NVIDIA GPUs moved to kubernetes-sigs/dra-driver-nvidia-gpu and adopted semantic versioning. The Helm chart was renamed from nvidia-dra-driver-gpu to dra-driver-nvidia-gpu and is published to new NGC Helm and container registries.

When upgrading from v25.x you must explicitly set nameOverride and --version to avoid creating duplicate Kubernetes manifests under different names. Without --set nameOverride=nvidia-dra-driver-gpu, the upgrade creates new daemonsets and deployments under the new chart name instead of upgrading the existing resources in place.

Important

After upgrading to v0.4.0, downgrading to v25.x is not supported.

Upgrade from v25.x to v0.4.0#

  1. Apply the v0.4.0 CRDs for ComputeDomains and ComputeDomainsCliques before upgrading the Helm chart. Refer to the v0.4.0 release page for the CRD manifests.

  2. Run helm upgrade with nameOverride and --version, preserving any original install flags such as gpuResourcesEnabledOverride and nvidiaDriverRoot:

    helm upgrade -i nvidia-dra-driver-gpu nvidia/dra-driver-nvidia-gpu \
      --version=0.4.0 \
      --namespace nvidia-dra-driver-gpu \
      --set nameOverride=nvidia-dra-driver-gpu \
      --set gpuResourcesEnabledOverride=true \
      --set nvidiaDriverRoot=/run/nvidia/driver
    
  3. Verify the upgrade:

    kubectl get pods -n nvidia-dra-driver-gpu
    

    All controller and kubelet-plugin pods should reach Running status, and existing ResourceClaims should remain in the allocated,reserved state.

Refer to the upstream upgrade guide for additional detail.

Additional Documentation#

Refer to the DRA Driver for NVIDIA GPUs repository for additional documentation. Release notes for the DRA Driver are available on the v0.4.0 releases page.