NVIDIA DRA Driver for GPUs#
This page provides an overview of the NVIDIA DRA Driver for GPUs, its supported functionality, instructions for how to install it alongside the GPU Operator, and usage examples.
Before continuing, you should be familiar with the concept of Dynamic Resource Allocation (DRA) in Kubernetes (docs).
Supported Use Cases#
The NVIDIA DRA Driver for GPUs v25.3.0 enables allocating:

- ComputeDomains, for enabling Multi-Node NVLink (MNNVL) workloads on NVIDIA GB200 systems (full support).
- GPUs, as an alternative to the traditional device plugin method (Technology Preview support).
Future versions of the GPU Operator (newer than 25.3.x) will include the NVIDIA DRA Driver for GPUs. For now, it needs to be installed manually (via its Helm chart, see below).
Multi-Node NVLink Support for NVIDIA GB200#
NVIDIA GB200 systems are designed specifically around Multi-Node NVLink (MNNVL) to turn a rack of GPU machines – each with a small number of GPUs – into a giant supercomputer with up to 72 GPUs communicating at full NVLink bandwidth.
The NVIDIA DRA Driver for GPUs supports MNNVL by introducing a new concept called ComputeDomain.
Under the hood, when a workload is created that references a specific ComputeDomain, the NVIDIA DRA Driver for GPUs handles establishing a shared IMEX domain and IMEX channel that guarantee MNNVL reachability between all pods in the workload.
GPU resource allocation (Technology Preview)#
Note
The GPU allocation features are not supported in production environments and are not functionally complete. Technology Preview features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. These releases may not have any documentation, and testing is limited.
Full support for defining and allocating GPU resources using DRA is planned for a future release.
Installation#
The NVIDIA DRA Driver for GPUs is an additional component that can be installed alongside the GPU Operator on your Kubernetes cluster.
Prerequisites#
A Multi-Node NVIDIA GB200 system.
A Kubernetes cluster (v1.32 or newer) with the DynamicResourceAllocation feature gate enabled and the resource.k8s.io API group enabled.
The following is a sample for enabling the required feature gates and API groups. Refer to the Kubernetes documentation for full details on enabling DRA on your cluster.
Sample Kubeadm Init Config with DRA Feature Gates Enabled#

apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
apiServer:
  extraArgs:
    - name: "feature-gates"
      value: "DynamicResourceAllocation=true"
    - name: "runtime-config"
      value: "resource.k8s.io/v1beta1=true"
controllerManager:
  extraArgs:
    - name: "feature-gates"
      value: "DynamicResourceAllocation=true"
scheduler:
  extraArgs:
    - name: "feature-gates"
      value: "DynamicResourceAllocation=true"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  DynamicResourceAllocation: true
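After the control plane restarts, you can confirm that the DRA API group is being served; the following is a minimal sketch (both commands succeed, possibly with an empty list, once resource.k8s.io/v1beta1 is enabled):

# Sketch: verify that the resource.k8s.io API group and DRA resource types are served.
$ kubectl api-resources --api-group=resource.k8s.io
# Listing DeviceClasses should succeed; it may report "No resources found"
# until a DRA driver that publishes device classes is installed.
$ kubectl get deviceclasses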
The NVIDIA GPU Operator v25.3.0 or later installed with CDI enabled on all nodes and NVIDIA GPU Driver 565 or later.
A sample Helm install command below includes enabling CDI with cdi.enabled=true. Refer to the install documentation for details on enabling CDI.

$ helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v25.3.1 \
    --set cdi.enabled=true
If you want to install the DRA Driver for GPUs using pre-installed drivers, you must install NVIDIA GPU Driver 565 or later, the corresponding IMEX packages on GPU nodes, and disable the IMEX systemd service before installing the GPU Operator. Refer to the documentation on installing the GPU Operator with pre-installed drivers for more details.
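For example, before installing the GPU Operator on nodes with pre-installed drivers, you might disable the host-managed IMEX service on each GPU node; the following is a minimal sketch, assuming the IMEX package provides a systemd unit named nvidia-imex (the unit name may differ across driver packages):

# Sketch: stop and disable the host IMEX systemd service on each GPU node
# (assumed unit name: nvidia-imex).
$ sudo systemctl stop nvidia-imex.service
$ sudo systemctl disable nvidia-imex.service
$ systemctl status nvidia-imex.service --no-pager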
Install the NVIDIA DRA Driver for GPUs with Helm#
Add the NVIDIA Helm repository:
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update
Install the NVIDIA DRA Driver for GPUs:
$ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
    --version="25.3.0" \
    --create-namespace \
    --namespace nvidia-dra-driver-gpu \
    --set nvidiaDriverRoot=/run/nvidia/driver \
    --set resources.gpus.enabled=false
Note that --set nvidiaDriverRoot=/run/nvidia/driver above expects a GPU Operator-provided GPU driver at that location. Change this parameter if the GPU drivers are installed directly on the host (typically at /, which is the default value for nvidiaDriverRoot).
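For example, with GPU drivers installed directly on the host, the install command from above might look like the following sketch, identical apart from the driver root, which is set to its default of /:

$ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
    --version="25.3.0" \
    --create-namespace \
    --namespace nvidia-dra-driver-gpu \
    --set nvidiaDriverRoot=/ \
    --set resources.gpus.enabled=false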
Common Chart Customization Options#
The following options are available when installing the Helm chart.
These options can be used with --set when installing with Helm. These are just the most frequently used options; all parameters can be listed by running helm show values nvidia/nvidia-dra-driver-gpu.
| Parameter | Description | Default |
|---|---|---|
| nvidiaDriverRoot | Specifies the driver root on the host. For GPU Operator-managed drivers (recommended), use /run/nvidia/driver. | / |
| nvidiaCtkPath | Specifies the path of the NVIDIA Container Toolkit CLI binary (nvidia-ctk) on the host. For a GPU Operator-installed NVIDIA Container Toolkit (recommended), use /usr/local/nvidia/toolkit/nvidia-ctk. | |
| resources.gpus.enabled | Specifies whether the NVIDIA DRA Driver for GPUs manages GPU resource allocation. This feature is in Technology Preview and is only recommended for testing, not production environments. To use the driver only for MNNVL (ComputeDomains), set this to false. | true |
Verify installation#
Validate that the components are running and in a READY state.

$ kubectl get pod -n nvidia-dra-driver-gpu
Example Output
NAME                                                           READY   STATUS    RESTARTS   AGE
nvidia-dra-driver-k8s-dra-driver-controller-67cb99d84b-5q7kj   1/1     Running   0          7m26s
nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-7kdg9          1/1     Running   0          7m27s
nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-bd6gn          1/1     Running   0          7m27s
nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-bzm6p          1/1     Running   0          7m26s
nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-xjm4p          1/1     Running   0          7m27s
Confirm that all GPU nodes are labeled with clique IDs. The following command uses jq to format the output.
$ (echo -e "NODE\tLABEL\tCLIQUE"; kubectl get nodes -o json | \
    jq -r '.items[] | [.metadata.name, "nvidia.com/gpu.clique", .metadata.labels["nvidia.com/gpu.clique"]] | @tsv') | \
    column -t
Example Output
NODE    LABEL                   CLIQUE
node1   nvidia.com/gpu.clique   9277d399-0674-44a9-b64e-d85bb19ce2b0.32766
node2   nvidia.com/gpu.clique   9277d399-0674-44a9-b64e-d85bb19ce2b0.32766
The NVIDIA GPU Feature Discovery adds a Clique ID to each GPU node. This is a unique identifier within an NVLink domain (physically connected GPUs over NVLink) that indicates which GPUs within that domain are physically capable of talking to each other.
The partitioning of GPUs into a set of cliques is done at the NVSwitch layer, not at the individual node layer. All GPUs on a given node are guaranteed to have the same <ClusterUUID.Clique ID> pair.
The ClusterUUID is a unique identifier for a given NVLink domain. It can be queried on a GPU-by-GPU basis with the nvidia-smi command-line tool. All GPUs on a given node are guaranteed to have the same ClusterUUID.
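For example, the Fabric section of nvidia-smi -q reports this information per GPU; the following is a minimal sketch (the exact field names, such as ClusterUUID and CliqueId, can vary by driver version):

# Sketch: show the fabric cluster UUID and clique ID reported for each GPU.
$ nvidia-smi -q | grep -i -A 4 "Fabric"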
About the ComputeDomain Custom Resource#
The NVIDIA DRA Driver for GPUs introduces a Kubernetes custom resource named ComputeDomain, which you use to define multi-node resource requirements. As you deploy multi-node workloads, you can reference the ComputeDomain, and the DRA Driver for GPUs automatically provisions the resources required to allow a set of GPUs to directly read and write each other's memory over high-bandwidth NVLink.
The ComputeDomain custom resource defines a Kubernetes DRA ResourceClaimTemplate and the numNodes needed to run your multi-node workload on Multi-Node NVLink (MNNVL) capable GPUs.
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: imex-channel-injection
spec:
  numNodes: 1
  channel:
    resourceClaimTemplate:
      name: imex-channel-0
You can then reference the ResourceClaimTemplate in your workload specs as resourceClaims.resourceClaimTemplateName.
apiVersion: v1
kind: Pod
metadata:
  name: imex-channel-injection
spec:
  ...
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: imex-channel-0
  resourceClaims:
  - name: imex-channel-0
    resourceClaimTemplateName: imex-channel-0
If a subset of the nodes associated with a ComputeDomain is capable of communicating over MNNVL, the NVIDIA DRA Driver for GPUs sets up a one-off IMEX domain that allows their GPUs to communicate over their multi-node NVLink connections. Multiple IMEX domains are created as necessary, depending on the number and availability of nodes allocated to the ComputeDomain.
A multi-node workload should run in its own compute domain. When you create the compute domain, you specify how many nodes you want to be part of it in the numNodes field. This can be any number: less than a rack, equal to a rack, or more than a rack.
The compute domain controller can create zero or more IMEX domains, depending on where the workers of a multi-node job that reference the compute domain actually land in your cluster.
When a worker for a multi-node job that references a ComputeDomain's ResourceClaimTemplate is scheduled on your cluster, the DRA Driver for GPUs triggers an IMEX daemon to be started on the node the worker lands on and blocks the worker from running until the compute domain is ready. Once the number of running IMEX daemons equals the number of nodes specified in the compute domain, the DRA Driver for GPUs marks the compute domain as ready and releases the worker pods, allowing them to start running. Since the compute domain is per workload, only one channel is needed to link all of the worker pods of the workload.
The value of the <cluster-uuid, clique-id> tuple associated with the node where a workload lands determines which IMEX domain it will be part of. Nodes with the same <cluster-uuid, clique-id> values are part of the same IMEX domain and can communicate over MNNVL with each other. Nodes with different <cluster-uuid, clique-id> values are associated with different IMEX domains and cannot communicate over MNNVL with each other. Nodes without a <cluster-uuid, clique-id> setting at all are still allowed, but no IMEX daemon is started on such nodes and no MNNVL communication with them is possible from any other nodes in the compute domain. Such nodes are still able to communicate over InfiniBand or Ethernet.
Once all workloads running in a ComputeDomain have run to completion, the corresponding node label is removed even if the ComputeDomain itself hasn't been deleted yet. This allows these nodes to be reused for other ComputeDomains.
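To observe this lifecycle, you can inspect a ComputeDomain object and the per-node daemon pods that the driver manages; the following is a minimal sketch using the CRD and pod label shown elsewhere on this page (status field names may vary by driver version):

# Sketch: list ComputeDomains and inspect one, including its status.
$ kubectl get computedomains.resource.nvidia.com
$ kubectl get computedomain imex-channel-injection -o yaml
# List the per-node ComputeDomain daemon pods managed by the DRA driver.
$ kubectl get pods -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain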
Configuration Options for ComputeDomain#
The following table describes some of the fields in the ComputeDomain custom resource.
| Field | Description | Default Value |
|---|---|---|
| channel.resourceClaimTemplate.name | Specifies the name of the ResourceClaimTemplate created for the ComputeDomain, which workloads reference in their resource claims. | None |
| numNodes | Specifies the number of nodes in the ComputeDomain. | None |
Node and Pod Affinity Strategies#
For the DRA Driver for GPUs, a ComputeDomain means running workloads across a group of compute nodes. Even if some nodes are not MNNVL-capable, they can still be part of the same ComputeDomain. You must apply node affinity and pod affinity rules to your pods to make sure your workloads run on MNNVL-capable nodes.
For example, you could set a pod affinity with a required topologyKey of nvidia.com/gpu.clique when you require all workloads to be deployed into the same NVLink domain but don't care which one. Or use a preferred topologyKey of nvidia.com/gpu.clique for workloads that may span MNNVL domains but that you want packed as tightly as possible. A sketch of both variants follows.
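The following is a minimal sketch of both variants for a workload pod spec; the app: my-mnnvl-workload label is a hypothetical placeholder and should match the labels on your own worker pods.

# Sketch: required pod affinity -- all matching pods must land in the same clique.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: my-mnnvl-workload   # hypothetical label; use your workload's label
      topologyKey: nvidia.com/gpu.clique

# Sketch: preferred pod affinity -- pack pods into as few cliques as possible,
# but allow spreading across cliques when necessary.
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-mnnvl-workload   # hypothetical label
        topologyKey: nvidia.com/gpu.clique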
Example: Create a ComputeDomain and Run a Workload#
Create a file like imex-channel-injection.yaml below.

---
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: imex-channel-injection
spec:
  numNodes: 1
  channel:
    resourceClaimTemplate:
      name: imex-channel-0
---
apiVersion: v1
kind: Pod
metadata:
  name: imex-channel-injection
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.clique
            operator: Exists
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: imex-channel-0
  resourceClaims:
  - name: imex-channel-0
    resourceClaimTemplateName: imex-channel-0
Apply the manifest.
$ kubectl apply -f imex-channel-injection.yaml
Optional: View the imex-channel-injection pod.

$ kubectl get pods
Example Output
NAME                     READY   STATUS    RESTARTS   AGE
imex-channel-injection   1/1     Running   0          3s
Optional: View logs for the imex-channel-injection pod, where the IMEX channel was injected.

$ kubectl logs imex-channel-injection
Example Output
total 0
drwxr-xr-x 2 root root     60 Feb 19 10:43 .
drwxr-xr-x 6 root root    380 Feb 19 10:43 ..
crw-rw-rw- 1 root root 507, 0 Feb 19 10:43 channel0
Optional: View the ComputeDomain pod.
$ kubectl get pods -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain
Example Output
NAME                                 READY   STATUS    RESTARTS   AGE
imex-channel-injection-6k9sx-ffgpf   1/1     Running   0          3s
Optional: View IMEX channel creation logs.
$ kubectl logs -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain --tail=-1
Example Output
/etc/nvidia-imex/nodes_config.cfg: 10.115.131.8
IMEX Log initializing at: 3/27/2025 15:47:10.092
[Mar 27 2025 15:47:10] [INFO] [tid 39] IMEX version 570.124.06 is running with the following configuration options
[Mar 27 2025 15:47:10] [INFO] [tid 39] Logging level = 4
[Mar 27 2025 15:47:10] [INFO] [tid 39] Logging file name/path = /var/log/nvidia-imex.log
[Mar 27 2025 15:47:10] [INFO] [tid 39] Append to log file = 0
[Mar 27 2025 15:47:10] [INFO] [tid 39] Max Log file size = 1024 (MBs)
[Mar 27 2025 15:47:10] [INFO] [tid 39] Use Syslog file = 0
[Mar 27 2025 15:47:10] [INFO] [tid 39] IMEX Library communication bind interface =
[Mar 27 2025 15:47:10] [INFO] [tid 39] IMEX library communication bind port = 50000
[Mar 27 2025 15:47:10] [INFO] [tid 39] Identified this node as ID 0, using bind IP of '10.115.131.8', and network interface of enP5p9s0
[Mar 27 2025 15:47:10] [INFO] [tid 39] nvidia-imex persistence file /var/run/nvidia-imex/persist.dat does not exist. Assuming no previous importers.
[Mar 27 2025 15:47:10] [INFO] [tid 39] NvGpu Library version matched with GPU Driver version
[Mar 27 2025 15:47:10] [INFO] [tid 63] Started processing of incoming messages.
[Mar 27 2025 15:47:10] [INFO] [tid 64] Started processing of incoming messages.
[Mar 27 2025 15:47:10] [INFO] [tid 65] Started processing of incoming messages.
[Mar 27 2025 15:47:10] [INFO] [tid 39] Creating gRPC channels to all peers (nPeers = 1).
[Mar 27 2025 15:47:10] [INFO] [tid 66] Started processing of incoming messages.
[Mar 27 2025 15:47:10] [INFO] [tid 39] IMEX_WAIT_FOR_QUORUM != FULL, continuing initialization without waiting for connections to all nodes.
[Mar 27 2025 15:47:10] [INFO] [tid 67] Connection established to node 0 with ip address 10.115.131.8. Number of times connected: 1
[Mar 27 2025 15:47:10] [INFO] [tid 39] GPU event successfully subscribed
Delete the imex-channel-injection example.

$ kubectl delete -f imex-channel-injection.yaml
Example Output
computedomain.resource.nvidia.com "imex-channel-injection" deleted pod "imex-channel-injection" deleted
Run a Multi-node nvbandwidth Test Requiring IMEX Channels with MPI#
This example demonstrates how to run a workload across multiple nodes using a ComputeDomain. The nvbandwidth test will measure the bandwidth between GPUs across different nodes using IMEX channels, helping you verify that your MNNVL setup is working correctly.
Example notes:
- This example uses the Kubeflow MPI Operator.
- This example is configured for a two-node cluster with 4 GPUs per node.
- If you are using a cluster with a different number of nodes or GPUs per node, you must adjust the parameters in the sample files as described in the following table:
| Parameter to update | Description | Value in example |
|---|---|---|
| numNodes | Total number of nodes in the cluster. | 2 |
| slotsPerWorker | Number of GPUs per node; this should match the ppr number. | 4 |
| replicas | Number of worker nodes. | 2 |
| ppr:&lt;N&gt;:node | Processes per node in the mpirun --map-by argument; should match the number of GPUs per node. | 4 |
| -np | Total number of processes: the number of GPUs per node multiplied by the number of nodes in the cluster. | 8 |
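As a quick way to derive these values from a running cluster, you can count the GPU nodes and per-node allocatable GPUs; the following is a minimal sketch, assuming GPU nodes carry the nvidia.com/gpu.present=true label applied by node feature discovery:

# Number of GPU nodes (numNodes, Worker replicas):
$ kubectl get nodes -l nvidia.com/gpu.present=true --no-headers | wc -l

# Allocatable GPUs per node (slotsPerWorker and the ppr value):
$ kubectl get nodes -l nvidia.com/gpu.present=true \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'

# Total processes for -np = GPUs per node multiplied by the number of nodes.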
Example Steps:
Install Kubeflow MPI Operator.
$ kubectl create -f https://github.com/kubeflow/mpi-operator/releases/download/v0.6.0/mpi-operator.yaml
Create a nvbandwidth test job file called nvbandwidth-test-job.yaml.

---
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: nvbandwidth-test-compute-domain
spec:
  # Update numNodes to match the total number of nodes in your cluster
  numNodes: 2
  channel:
    resourceClaimTemplate:
      name: nvbandwidth-test-compute-domain-channel
---
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nvbandwidth-test
spec:
  # Update slotsPerWorker to match the number of GPUs per node
  slotsPerWorker: 4
  launcherCreationPolicy: WaitForWorkersReady
  runPolicy:
    cleanPodPolicy: Running
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          labels:
            nvbandwidth-test-replica: mpi-launcher
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: node-role.kubernetes.io/control-plane
                    operator: Exists
          containers:
          - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - mpirun
            args:
            - --bind-to
            - core
            - --map-by
            # Update the number (4) to match the number of GPUs per node
            - ppr:4:node
            - -np
            # Update the number (8) to match the total number of GPUs in the cluster, this example has 2 nodes * 4 GPUs per node
            - "8"
            - --report-bindings
            - -q
            - nvbandwidth
            - -t
            - multinode_device_to_device_memcpy_read_ce
    Worker:
      # Update replicas to match the number of worker nodes
      replicas: 2
      template:
        metadata:
          labels:
            nvbandwidth-test-replica: mpi-worker
        spec:
          affinity:
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: nvbandwidth-test-replica
                    operator: In
                    values:
                    - mpi-worker
                topologyKey: nvidia.com/gpu.clique
          containers:
          - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
            name: mpi-worker
            securityContext:
              runAsUser: 1000
            env:
            command:
            - /usr/sbin/sshd
            args:
            - -De
            - -f
            - /home/mpiuser/.sshd_config
            resources:
              limits:
                nvidia.com/gpu: 4
              claims:
              - name: compute-domain-channel
          resourceClaims:
          - name: compute-domain-channel
            resourceClaimTemplateName: nvbandwidth-test-compute-domain-channel
Apply the manifest.
$ kubectl apply -f nvbandwidth-test-job.yaml
Example Output
computedomain.resource.nvidia.com/nvbandwidth-test-compute-domain configured
mpijob.kubeflow.org/nvbandwidth-test configured
Verify that the nvbandwidth pods were created.
$ kubectl get pods
Example Output
NAME                              READY   STATUS    RESTARTS   AGE
nvbandwidth-test-launcher-lzv84   1/1     Running   0          8s
nvbandwidth-test-worker-0         1/1     Running   0          15s
nvbandwidth-test-worker-1         1/1     Running   0          15s
Verify that the ComputeDomain pods were created for each node.
$ kubectl get pods -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain
Example Output
NAME                                          READY   STATUS    RESTARTS   AGE
nvbandwidth-test-compute-domain-ht24d-9jhmj   1/1     Running   0          20s
nvbandwidth-test-compute-domain-ht24d-rcn2c   1/1     Running   0          20s
Verify the nvbandwidth test results.
$ kubectl logs --tail=-1 -l job-name=nvbandwidth-test-launcher
Example Output
Warning: Permanently added '[nvbandwidth-test-worker-0.nvbandwidth-test.default.svc]:2222' (ECDSA) to the list of known hosts. Warning: Permanently added '[nvbandwidth-test-worker-1.nvbandwidth-test.default.svc]:2222' (ECDSA) to the list of known hosts. [nvbandwidth-test-worker-0:00025] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.] [nvbandwidth-test-worker-0:00025] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.] [nvbandwidth-test-worker-0:00025] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.] [nvbandwidth-test-worker-0:00025] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.] [nvbandwidth-test-worker-1:00025] MCW rank 4 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.] [nvbandwidth-test-worker-1:00025] MCW rank 5 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.] [nvbandwidth-test-worker-1:00025] MCW rank 6 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.] [nvbandwidth-test-worker-1:00025] MCW rank 7 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.] 
nvbandwidth Version: v0.7
Built from Git version: v0.7
MPI version: Open MPI v4.1.4, package: Debian OpenMPI, ident: 4.1.4, repo rev: v4.1.4, May 26, 2022
CUDA Runtime Version: 12080
CUDA Driver Version: 12080
Driver Version: 570.124.06

Process 0 (nvbandwidth-test-worker-0): device 0: HGX GB200 (00000008:01:00)
Process 1 (nvbandwidth-test-worker-0): device 1: HGX GB200 (00000009:01:00)
Process 2 (nvbandwidth-test-worker-0): device 2: HGX GB200 (00000018:01:00)
Process 3 (nvbandwidth-test-worker-0): device 3: HGX GB200 (00000019:01:00)
Process 4 (nvbandwidth-test-worker-1): device 0: HGX GB200 (00000008:01:00)
Process 5 (nvbandwidth-test-worker-1): device 1: HGX GB200 (00000009:01:00)
Process 6 (nvbandwidth-test-worker-1): device 2: HGX GB200 (00000018:01:00)
Process 7 (nvbandwidth-test-worker-1): device 3: HGX GB200 (00000019:01:00)

Running multinode_device_to_device_memcpy_read_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
          0         1         2         3         4         5         6         7
0       N/A    798.02    798.25    798.02    798.02    797.88    797.73    797.95
1    798.10       N/A    797.80    798.02    798.02    798.25    797.88    798.02
2    797.95    797.95       N/A    797.73    797.80    797.95    797.95    797.65
3    798.10    798.02    797.95       N/A    798.02    798.10    797.88    797.73
4    797.80    798.02    798.02    798.02       N/A    797.95    797.80    798.02
5    797.80    797.95    798.10    798.10    797.95       N/A    797.95    797.88
6    797.73    797.95    798.10    798.02    797.95    797.88       N/A    797.80
7    797.88    798.02    797.95    798.02    797.88    797.95    798.02       N/A

SUM multinode_device_to_device_memcpy_read_ce 44685.29

NOTE: The reported results may not reflect the full capabilities of the platform.
Delete the test.
$ kubectl delete -f nvbandwidth-test-job.yaml
Example Output
computedomain.resource.nvidia.com "nvbandwidth-test-compute-domain" deleted mpijob.kubeflow.org "nvbandwidth-test" deleted