NVIDIA DRA Driver for GPUs#

This page provides an overview of the NVIDIA DRA Driver for GPUs, its supported functionality, instructions for how to install it alongside the GPU Operator, and usage examples.

Before continuing, you should be familiar with the concept of Dynamic Resource Allocation (DRA) in Kubernetes (docs).

Supported Use Cases#

The NVIDIA DRA Driver for GPUs v25.3.0 enables allocating:

  • so-called ComputeDomains for enabling Multi-Node NVLink (MNNVL) workloads on NVIDIA GB200 systems (full support).

  • GPUs, as an alternative to the traditional device plugin method (Technology Preview support).

Future versions of the GPU Operator (newer than 25.3.x) will include the NVIDIA DRA Driver for GPUs. For now, it needs to be installed manually (via its Helm chart, see below).

GPU resource allocation (Technology Preview)#

Note

The GPU allocation features are not supported in production environments and are not functionally complete. Technology Preview features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. These releases may not have any documentation, and testing is limited.

Full support for defining and allocating GPU resources using DRA is planned for a future release.
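
Even in Technology Preview, GPU allocation is requested through standard DRA objects rather than extended resources. The following is a minimal sketch of a ResourceClaimTemplate that requests one GPU; it assumes the gpu.nvidia.com DeviceClass published by the driver, uses illustrative names, and requires resources.gpus.enabled to remain at its default of true when installing the chart (the installation below disables it for MNNVL use):

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu   # illustrative name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        # Assumes the DeviceClass installed by the NVIDIA DRA Driver for GPUs
        deviceClassName: gpu.nvidia.com

A pod then references the template via resourceClaims, following the same pattern shown for ComputeDomain channels later on this page.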

Installation#

The NVIDIA DRA Driver for GPUs is an additional component that can be installed alongside the GPU Operator on your Kubernetes cluster.

Prerequisites#

  • A Multi-Node NVIDIA GB200 system.

  • A Kubernetes cluster (v1.32 or newer) with the DynamicResourceAllocation feature gate enabled and the resource.k8s.io API group enabled.

    The following is a sample for enabling the required feature gates and API groups. Refer to the Kubernetes documentation for full details on enabling DRA on your cluster.

    Sample Kubeadm Init Config with DRA Feature Gates Enabled#
    apiVersion: kubeadm.k8s.io/v1beta4
    kind: ClusterConfiguration
    apiServer:
      extraArgs:
      - name: "feature-gates"
        value: "DynamicResourceAllocation=true"
      - name: "runtime-config"
        value: "resource.k8s.io/v1beta1=true"
    controllerManager:
      extraArgs:
      - name: "feature-gates"
        value: "DynamicResourceAllocation=true"
    scheduler:
      extraArgs:
      - name: "feature-gates"
        value: "DynamicResourceAllocation=true"
    ---
    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    featureGates:
      DynamicResourceAllocation: true
    
  • The NVIDIA GPU Operator v25.3.0 or later installed with CDI enabled on all nodes and NVIDIA GPU Driver 565 or later.

    The sample Helm install command below enables CDI with cdi.enabled=true. Refer to the install documentation for details on enabling CDI.

    $ helm install --wait --generate-name \
            -n gpu-operator --create-namespace \
            nvidia/gpu-operator \
            --version=v25.3.1 \
            --set cdi.enabled=true
    

    If you want to install the DRA Driver for GPUs using pre-installed drivers, you must install NVIDIA GPU Driver 565 or later and the corresponding IMEX packages on GPU nodes, and disable the IMEX systemd service before installing the GPU Operator. Refer to the documentation on installing the GPU Operator with pre-installed drivers for more details.
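
After the control plane restarts with these settings, a quick sanity check (not part of the official procedure) is to confirm that the resource.k8s.io API group is being served:

$ kubectl api-resources --api-group=resource.k8s.io

If the DRA resource types (for example, resourceclaims and deviceclasses) are not listed, revisit the feature-gate and runtime-config settings above.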

Install the NVIDIA DRA Driver for GPUs with Helm#

  1. Add the NVIDIA Helm repository:

    $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
        && helm repo update
    
  2. Install the NVIDIA DRA Driver for GPUs:

    $ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
          --version="25.3.0" \
          --create-namespace \
          --namespace nvidia-dra-driver-gpu \
          --set nvidiaDriverRoot=/run/nvidia/driver \
          --set resources.gpus.enabled=false
    

Note that --set nvidiaDriverRoot=/run/nvidia/driver above expects a GPU Operator-provided GPU driver at that location. That configuration parameter must be changed if the GPU drivers are installed directly on the host (typically at /, which is the default value for nvidiaDriverRoot).
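
For example, with drivers installed directly on the host, the install command from step 2 would simply omit the nvidiaDriverRoot override (a sketch; keep the remaining flags as appropriate for your environment):

$ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
      --version="25.3.0" \
      --create-namespace \
      --namespace nvidia-dra-driver-gpu \
      --set resources.gpus.enabled=false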

Common Chart Customization Options#

The following options are available when installing the Helm chart and can be set with --set. These are the most frequently used parameters; the full list, with defaults, can be displayed as shown after the table.

| Parameter | Description | Default |
| --- | --- | --- |
| nvidiaDriverRoot | Specifies the driver root on the host. For GPU Operator-managed drivers (recommended), use /run/nvidia/driver. For pre-installed drivers, use /. | / |
| nvidiaCtkPath | Specifies the path of the NVIDIA Container Toolkit CLI binary (nvidia-ctk) on the host. For a GPU Operator-installed NVIDIA Container Toolkit (recommended), use /usr/local/nvidia/toolkit/nvidia-ctk. For a pre-installed NVIDIA Container Toolkit, use /usr/bin/nvidia-ctk. | /usr/bin/nvidia-ctk |
| resources.gpus.enabled | Specifies whether the NVIDIA DRA Driver for GPUs manages GPU resource allocation. This feature is in Technology Preview and is recommended for testing only, not production environments. To use with MNNVL, set to false. | true |
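
All supported parameters and their defaults can be listed with:

$ helm show values nvidia/nvidia-dra-driver-gpu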

Verify installation#

  1. Validate that the components are running and in a READY state.

    $ kubectl get pod -n nvidia-dra-driver-gpu
    

    Example Output

    NAME                                                           READY   STATUS    RESTARTS   AGE
    nvidia-dra-driver-k8s-dra-driver-controller-67cb99d84b-5q7kj   1/1     Running   0          7m26s
    nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-7kdg9          1/1     Running   0          7m27s
    nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-bd6gn          1/1     Running   0          7m27s
    nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-bzm6p          1/1     Running   0          7m26s
    nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-xjm4p          1/1     Running   0          7m27s
    
  2. Confirm that all GPU nodes are labeled with clique IDs. The following command uses jq to format the output.

    $ (echo -e "NODE\tLABEL\tCLIQUE"; kubectl get nodes -o json | \
      jq -r '.items[] | [.metadata.name, "nvidia.com/gpu.clique", .metadata.labels["nvidia.com/gpu.clique"]] | @tsv') | \
      column -t
    

    Example Output

    NODE                  LABEL                  CLIQUE
    node1                 nvidia.com/gpu.clique  9277d399-0674-44a9-b64e-d85bb19ce2b0.32766
    node2                 nvidia.com/gpu.clique  9277d399-0674-44a9-b64e-d85bb19ce2b0.32766
    

NVIDIA GPU Feature Discovery adds a clique ID to each GPU node. This is a unique identifier within an NVLink domain (GPUs physically connected over NVLink) that indicates which GPUs within that domain are physically capable of talking to each other.

The partitioning of GPUs into cliques is done at the NVSwitch layer, not at the individual node layer. All GPUs on a given node are guaranteed to have the same <ClusterUUID.CliqueID> pair.

The ClusterUUID is a unique identifier for a given NVLink domain. It can be queried on a GPU-by-GPU basis with the nvidia-smi command-line tool. All GPUs on a given node are guaranteed to have the same ClusterUUID.
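
On MNNVL-capable systems with recent drivers, these values appear in the fabric section of the device query; for example (a sketch, and the exact field names can vary by driver version):

$ nvidia-smi -q | grep -A 4 Fabric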

About the ComputeDomain Custom Resource#

The NVIDIA DRA Driver for GPUs introduces a Kubernetes custom resource named ComputeDomain, which you use to define multi-node resource requirements. As you deploy multi-node workloads, you reference the ComputeDomain, and the DRA Driver for GPUs automatically provisions the resources required for a set of GPUs to directly read and write each other’s memory over high-bandwidth NVLink. The ComputeDomain custom resource defines a Kubernetes DRA ResourceClaimTemplate and the number of nodes (numNodes) needed to run your multi-node workload on Multi-Node NVLink (MNNVL) capable GPUs.

Sample NVIDIA DRA Driver ComputeDomain Custom Resource Manifest#
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: imex-channel-injection
spec:
  numNodes: 1
  channel:
    resourceClaimTemplate:
      name: imex-channel-0

You can then reference the ResourceClaimTemplate in your workload specs via resourceClaims[].resourceClaimTemplateName:

apiVersion: v1
kind: Pod
metadata:
  name: imex-channel-injection
spec:
  ...
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: imex-channel-0
  resourceClaims:
  - name: imex-channel-0
    resourceClaimTemplateName: imex-channel-0

If a subset of the nodes associated with a ComputeDomain are capable of communicating over MNNVL, the NVIDIA DRA Driver for GPUs will set up a one-off IMEX domain to allow GPUs to communicate over their multi-node NVLink connections. Multiple IMEX domains will be created as necessary depending on the number and availability of nodes allocated to the ComputeDomain.

A multi-node workload should run in its own compute domain. When you create the compute domain, you specify how many nodes should be part of it in the numNodes field. This can be any number: less than a rack, equal to a rack, or more than a rack. The compute domain controller creates zero or more IMEX domains depending on where the workers of a multi-node job that references the compute domain actually land in your cluster.

When a worker for a multi-node job that references a ComputeDomain’s ResourceClaimTemplate is scheduled on your cluster, the DRA Driver for GPUs starts an IMEX daemon on the node the worker lands on and blocks the worker from running until the compute domain is ready. Once the number of running IMEX daemons equals the number of nodes specified in the compute domain, the DRA Driver for GPUs marks the compute domain as ready and releases the worker pods, allowing them to start running. Because the compute domain is per workload, only one channel is needed to link all of the workload’s worker pods.

The value of the <cluster-uuid, clique-id> tuple associated with the node where a workload lands determines which IMEX domain it will be a part of. Nodes with the same <cluster-uuid, clique-id> values will be part of the same IMEX domain and will be able to communicate over MNNVL with each other. Nodes with different <cluster-uuid, clique-id> values will be associated with different IMEX domains and will not be able to communicate over MNNVL with each other. Nodes without a <cluster-uuid, clique-id> setting at all are still allowed, but no IMEX daemon is started on such nodes and no MNNVL communication with them is possible from any other node in the compute domain. Those nodes are still able to communicate over InfiniBand or Ethernet.

Once all workloads running in a ComputeDomain have run to completion, the node label is removed even if the ComputeDomain itself hasn’t been deleted yet. This allows those nodes to be reused by other ComputeDomains.
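
Because ComputeDomain is a regular namespaced custom resource, you can inspect its state at any time with standard kubectl commands:

$ kubectl get computedomains.resource.nvidia.com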

Configuration Options for ComputeDomain#

The following table describes some of the fields in the ComputeDomain custom resource.

| Field | Description | Default Value |
| --- | --- | --- |
| channel.resourceClaimTemplate.name (required) | Specifies the name of the ResourceClaimTemplate to create. | None |
| numNodes (required) | Specifies the number of nodes in the ComputeDomain. | None |

Node and Pod Affinity Strategies#

For the DRA Driver for GPUs, a ComputeDomain defines a group of compute nodes that a workload runs across. Even nodes that are not MNNVL-capable can be part of the same ComputeDomain, so you must apply node affinity and pod affinity rules to your pods to make sure your workloads land on MNNVL-capable nodes.

For example, you could set a pod affinity with a required topologyKey of nvidia.com/gpu.clique when all workload pods must be deployed into the same NVLink domain but you don’t care which one, or a preferred topologyKey of nvidia.com/gpu.clique when workloads may span MNNVL domains but you want them packed as tightly as possible. See the sketch below.
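
The following sketch shows the required variant, taken from the nvbandwidth example later on this page; the nvbandwidth-test-replica label is that example’s worker label, and any label that selects the workload’s own pods can be used instead:

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: nvbandwidth-test-replica
          operator: In
          values:
          - mpi-worker
      topologyKey: nvidia.com/gpu.clique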

Example: Create a ComputeDomain and Run a Workload#

  1. Create a file named imex-channel-injection.yaml with the following contents.

    ---
    apiVersion: resource.nvidia.com/v1beta1
    kind: ComputeDomain
    metadata:
      name: imex-channel-injection
    spec:
      numNodes: 1
      channel:
        resourceClaimTemplate:
          name: imex-channel-0
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: imex-channel-injection
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu.clique
                operator: Exists
      containers:
      - name: ctr
        image: ubuntu:22.04
        command: ["bash", "-c"]
        args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"]
        resources:
          claims:
          - name: imex-channel-0
      resourceClaims:
      - name: imex-channel-0
        resourceClaimTemplateName: imex-channel-0
    
  2. Apply the manifest.

    $ kubectl apply -f imex-channel-injection.yaml
    
  3. Optional: View the imex-channel-injection pod.

    $ kubectl get pods
    

    Example Output

    NAME                     READY   STATUS    RESTARTS   AGE
    imex-channel-injection   1/1     Running   0          3s
    
  4. Optional: View logs for the imex-channel-injection pod, where the IMEX channel was injected.

    $ kubectl logs imex-channel-injection
    

    Example Output

    total 0
    drwxr-xr-x 2 root root     60 Feb 19 10:43 .
    drwxr-xr-x 6 root root    380 Feb 19 10:43 ..
    crw-rw-rw- 1 root root 507, 0 Feb 19 10:43 channel0
    
  5. Optional: View the ComputeDomain pod.

    $ kubectl get pods -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain
    

    Example Output

    NAME                                 READY   STATUS    RESTARTS   AGE
    imex-channel-injection-6k9sx-ffgpf   1/1     Running   0          3s
    
  6. Optional: View IMEX channel creation logs.

    $ kubectl logs -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain --tail=-1
    

    Example Output

    /etc/nvidia-imex/nodes_config.cfg:
    10.115.131.8
    IMEX Log initializing at: 3/27/2025 15:47:10.092
    [Mar 27 2025 15:47:10] [INFO] [tid 39] IMEX version 570.124.06 is running with the following configuration options
    
    [Mar 27 2025 15:47:10] [INFO] [tid 39] Logging level = 4
    
    [Mar 27 2025 15:47:10] [INFO] [tid 39] Logging file name/path = /var/log/nvidia-imex.log
    
    [Mar 27 2025 15:47:10] [INFO] [tid 39] Append to log file = 0
    
    [Mar 27 2025 15:47:10] [INFO] [tid 39] Max Log file size = 1024 (MBs)
    
    [Mar 27 2025 15:47:10] [INFO] [tid 39] Use Syslog file = 0
    
    [Mar 27 2025 15:47:10] [INFO] [tid 39] IMEX Library communication bind interface =
    
    [Mar 27 2025 15:47:10] [INFO] [tid 39] IMEX library communication bind port = 50000
    
    [Mar 27 2025 15:47:10] [INFO] [tid 39] Identified this node as ID 0, using bind IP of '10.115.131.8', and network interface of enP5p9s0
    [Mar 27 2025 15:47:10] [INFO] [tid 39] nvidia-imex persistence file /var/run/nvidia-imex/persist.dat does not exist.  Assuming no previous importers.
    [Mar 27 2025 15:47:10] [INFO] [tid 39] NvGpu Library version matched with GPU Driver version
    [Mar 27 2025 15:47:10] [INFO] [tid 63] Started processing of incoming messages.
    [Mar 27 2025 15:47:10] [INFO] [tid 64] Started processing of incoming messages.
    [Mar 27 2025 15:47:10] [INFO] [tid 65] Started processing of incoming messages.
    [Mar 27 2025 15:47:10] [INFO] [tid 39] Creating gRPC channels to all peers (nPeers = 1).
    [Mar 27 2025 15:47:10] [INFO] [tid 66] Started processing of incoming messages.
    [Mar 27 2025 15:47:10] [INFO] [tid 39] IMEX_WAIT_FOR_QUORUM != FULL, continuing initialization without waiting for connections to all nodes.
    [Mar 27 2025 15:47:10] [INFO] [tid 67] Connection established to node 0 with ip address 10.115.131.8. Number of times connected: 1
    [Mar 27 2025 15:47:10] [INFO] [tid 39] GPU event successfully subscribed
    
  7. Delete the imex-channel-injection example.

    $ kubectl delete -f imex-channel-injection.yaml
    

    Example Output

    computedomain.resource.nvidia.com "imex-channel-injection" deleted
    pod "imex-channel-injection" deleted
    

Run a Multi-node nvbandwidth Test Requiring IMEX Channels with MPI#

This example demonstrates how to run a workload across multiple nodes using a ComputeDomain. The nvbandwidth test will measure the bandwidth between GPUs across different nodes using IMEX channels, helping you verify that your MNNVL setup is working correctly.

Example notes:

  • This example uses Kubeflow MPI Operator.

  • This example is configured for a 2 node cluster with 4 GPUs per node.

    If you are using a cluster with a different number of nodes and GPUs per node, you must adjust the following parameters in the sample files:

| Parameter to update | Description | Value in example |
| --- | --- | --- |
| ComputeDomain.spec.numNodes | Total number of nodes in the cluster | 2 |
| MPIJob.spec.slotsPerWorker | Number of GPUs per node; must match the ppr number in the mpirun arguments | 4 |
| MPIJob.spec.mpiReplicaSpecs.Worker.replicas | Number of worker nodes | 2 |
| mpirun argument --map-by ppr:4:node | Number of GPUs per node as the processes-per-resource (ppr) number | 4 |
| mpirun argument -np | Total number of processes: GPUs per node multiplied by the number of nodes | 8 |
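
For example, adapting the manifests to a hypothetical 3-node cluster with 8 GPUs per node, you would set numNodes: 3, slotsPerWorker: 8, replicas: 3, --map-by ppr:8:node, and -np "24" (8 GPUs per node × 3 nodes).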

Example Steps:

  1. Install Kubeflow MPI Operator.

    $ kubectl create -f https://github.com/kubeflow/mpi-operator/releases/download/v0.6.0/mpi-operator.yaml
    
  2. Create an nvbandwidth test job file called nvbandwidth-test-job.yaml.

    ---
    apiVersion: resource.nvidia.com/v1beta1
    kind: ComputeDomain
    metadata:
      name: nvbandwidth-test-compute-domain
    spec:
      # Update numNodes to match the total number of nodes in your cluster
      numNodes: 2
      channel:
        resourceClaimTemplate:
          name: nvbandwidth-test-compute-domain-channel
    
    ---
    apiVersion: kubeflow.org/v2beta1
    kind: MPIJob
    metadata:
      name: nvbandwidth-test
    spec:
      # Update slotsPerWorker to match the number of GPUs per node
      slotsPerWorker: 4
      launcherCreationPolicy: WaitForWorkersReady
      runPolicy:
        cleanPodPolicy: Running
      sshAuthMountPath: /home/mpiuser/.ssh
      mpiReplicaSpecs:
        Launcher:
          replicas: 1
          template:
            metadata:
              labels:
                nvbandwidth-test-replica: mpi-launcher
            spec:
              affinity:
                nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    nodeSelectorTerms:
                    - matchExpressions:
                      - key: node-role.kubernetes.io/control-plane
                        operator: Exists
              containers:
              - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
                name: mpi-launcher
                securityContext:
                  runAsUser: 1000
                command:
                - mpirun
                args:
                - --bind-to
                - core
                - --map-by
                # Update the number (4) to match the number of GPUs per node
                - ppr:4:node
                - -np
                # Update the number (8) to match the total number of GPUs in the cluster; this example has 2 nodes * 4 GPUs per node
                - "8"
                - --report-bindings
                - -q
                - nvbandwidth
                - -t
                - multinode_device_to_device_memcpy_read_ce
        Worker:
          # Update replicas to match the number of worker nodes
          replicas: 2
          template:
            metadata:
              labels:
                nvbandwidth-test-replica: mpi-worker
            spec:
              affinity:
                podAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                  - labelSelector:
                      matchExpressions:
                      - key: nvbandwidth-test-replica
                        operator: In
                        values:
                        - mpi-worker
                    topologyKey: nvidia.com/gpu.clique
              containers:
              - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
                name: mpi-worker
                securityContext:
                  runAsUser: 1000
                command:
                - /usr/sbin/sshd
                args:
                - -De
                - -f
                - /home/mpiuser/.sshd_config
                resources:
                  limits:
                    nvidia.com/gpu: 4
                  claims:
                  - name: compute-domain-channel
              resourceClaims:
              - name: compute-domain-channel
                resourceClaimTemplateName: nvbandwidth-test-compute-domain-channel
    
  3. Apply the manifest.

    $ kubectl apply -f nvbandwidth-test-job.yaml
    

    Example Output

    computedomain.resource.nvidia.com/nvbandwidth-test-compute-domain configured
    mpijob.kubeflow.org/nvbandwidth-test configured
    
  4. Verify that the nvbandwidth pods were created.

    $ kubectl get pods
    

    Example Output

    NAME                              READY   STATUS    RESTARTS   AGE
    nvbandwidth-test-launcher-lzv84   1/1     Running   0          8s
    nvbandwidth-test-worker-0         1/1     Running   0          15s
    nvbandwidth-test-worker-1         1/1     Running   0          15s
    
  5. Verify that the ComputeDomain pods were created for each node.

    $ kubectl get pods -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain
    

    Example Output

    NAME                                          READY   STATUS    RESTARTS   AGE
    nvbandwidth-test-compute-domain-ht24d-9jhmj   1/1     Running   0          20s
    nvbandwidth-test-compute-domain-ht24d-rcn2c   1/1     Running   0          20s
    
  6. Verify the nvbandwidth test results.

    $ kubectl logs --tail=-1 -l job-name=nvbandwidth-test-launcher
    

    Example Output

    Warning: Permanently added '[nvbandwidth-test-worker-0.nvbandwidth-test.default.svc]:2222' (ECDSA) to the list of known hosts.
    Warning: Permanently added '[nvbandwidth-test-worker-1.nvbandwidth-test.default.svc]:2222' (ECDSA) to the list of known hosts.
    [nvbandwidth-test-worker-0:00025] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
    [nvbandwidth-test-worker-0:00025] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
    [nvbandwidth-test-worker-0:00025] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
    [nvbandwidth-test-worker-0:00025] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
    [nvbandwidth-test-worker-1:00025] MCW rank 4 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
    [nvbandwidth-test-worker-1:00025] MCW rank 5 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
    [nvbandwidth-test-worker-1:00025] MCW rank 6 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
    [nvbandwidth-test-worker-1:00025] MCW rank 7 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
    nvbandwidth Version: v0.7
    Built from Git version: v0.7
    
    MPI version: Open MPI v4.1.4, package: Debian OpenMPI, ident: 4.1.4, repo rev: v4.1.4, May 26, 2022
    CUDA Runtime Version: 12080
    CUDA Driver Version: 12080
    Driver Version: 570.124.06
    
    Process 0 (nvbandwidth-test-worker-0): device 0: HGX GB200 (00000008:01:00)
    Process 1 (nvbandwidth-test-worker-0): device 1: HGX GB200 (00000009:01:00)
    Process 2 (nvbandwidth-test-worker-0): device 2: HGX GB200 (00000018:01:00)
    Process 3 (nvbandwidth-test-worker-0): device 3: HGX GB200 (00000019:01:00)
    Process 4 (nvbandwidth-test-worker-1): device 0: HGX GB200 (00000008:01:00)
    Process 5 (nvbandwidth-test-worker-1): device 1: HGX GB200 (00000009:01:00)
    Process 6 (nvbandwidth-test-worker-1): device 2: HGX GB200 (00000018:01:00)
    Process 7 (nvbandwidth-test-worker-1): device 3: HGX GB200 (00000019:01:00)
    
    Running multinode_device_to_device_memcpy_read_ce.
    memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
              0         1         2         3         4         5         6         7
    0       N/A    798.02    798.25    798.02    798.02    797.88    797.73    797.95
    1    798.10       N/A    797.80    798.02    798.02    798.25    797.88    798.02
    2    797.95    797.95       N/A    797.73    797.80    797.95    797.95    797.65
    3    798.10    798.02    797.95       N/A    798.02    798.10    797.88    797.73
    4    797.80    798.02    798.02    798.02       N/A    797.95    797.80    798.02
    5    797.80    797.95    798.10    798.10    797.95       N/A    797.95    797.88
    6    797.73    797.95    798.10    798.02    797.95    797.88       N/A    797.80
    7    797.88    798.02    797.95    798.02    797.88    797.95    798.02       N/A
    
    SUM multinode_device_to_device_memcpy_read_ce 44685.29
    
    NOTE: The reported results may not reflect the full capabilities of the platform.
    
  7. Delete the test.

    $ kubectl delete -f nvbandwidth-test-job.yaml
    

    Example Output

    computedomain.resource.nvidia.com "nvbandwidth-test-compute-domain" deleted
    mpijob.kubeflow.org "nvbandwidth-test" deleted