Troubleshooting the NVIDIA GPU Operator#

This page outlines common issues and troubleshooting steps for the NVIDIA GPU Operator.

If you are facing an issue that is not covered by this page, please file an issue in the NVIDIA GPU Operator GitHub repository.

GPU Operator pods are stuck in Init#

Observation

The output from kubectl get pods -n gpu-operator shows something like:

gpu-feature-discovery-tmblp                        0/1   Init:0/1  0         11m
nvidia-container-toolkit-daemonset-mqzwq           0/1   Init:0/1  0         2m
nvidia-dcgm-exporter-qpxxl                         0/1   Init:0/1  0         8m32s
nvidia-device-plugin-daemonset-tl9k7               0/1   Init:0/1  0         11m
nvidia-operator-validator-th4w7                    0/1   Init:0/4  0         10m
nvidia-driver-daemonset-4rtiu                      0/2   Running   3         12m

Root Cause

This most likely indicates an issue with the nvidia-driver-daemonset. Note that the operand pods only come up after the driver daemonset and toolkit pods start successfully.

  1. Check the driver daemonset pod logs:

    • To retrieve the main driver container logs:

      kubectl logs -n gpu-operator nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr
      
    • If you see Init:Error in the kubectl output, then retrieve the k8s-driver-manager logs:

      kubectl logs -n gpu-operator nvidia-driver-daemonset-p97x5 -c k8s-driver-manager
      
  2. Check the dmesg logs

    • dmesg displays the messages generated by the Linux kernel. It helps detect issues with loading the GPU driver modules, especially when the driver daemonset logs don’t provide much information.

    • You can retrieve the dmesg output either with kubectl exec or by running dmesg directly in your host terminal.

    kubectl exec

    kubectl exec -n gpu-operator -it nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- dmesg
    

    Run dmesg in your host terminal

    sudo dmesg
    

    TIP: You can also grep for NVRM or Xid to view logs emitted by the driver’s kernel module.

    sudo dmesg | grep -i NVRM
    

    OR

    sudo dmesg | grep -i Xid
    
  3. Ensure that your driver daemonset has internet access to download deb/rpm packages during runtime:

    • Check your Kubernetes cluster’s VPC, security group, and DNS settings

    • Consider exec’ing into a container shell and testing internet connectivity with a simple ping command
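    For example, a minimal connectivity check might look like the following. The pod name is a placeholder taken from the earlier examples, and ping may need to be replaced with another tool (such as curl) if it is not available in the driver image:

    # Open a shell in the main driver container (pod name is an example)
    kubectl exec -n gpu-operator -it nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- bash

    # From inside the container, test connectivity to a package repository host (example host)
    ping -c 3 developer.download.nvidia.com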

No runtime for “nvidia” is configured#

Observation

When you run kubectl describe for one of the gpu-operator pods, you see an error like:

Warning  FailedCreatePodSandBox  2m37s (x94 over 22m)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

Root Cause

This means that the RuntimeClass is unable to find the runtime handler named “nvidia” in your container runtime’s configuration. The runtime handler is added by the nvidia-container-toolkit, so this error message is likely related to startup issues with the nvidia-container-toolkit.

Action

  1. Check the nvidia-container-toolkit logs

    • To retrieve the toolkit pod logs:

      kubectl logs -n gpu-operator nvidia-container-toolkit-daemonset-2rhwg -c nvidia-container-toolkit-ctr
      
  2. Check the driver daemonset logs

  3. Review the container runtime configuration TOML

    • CRI-O and containerd are the two main container runtimes supported by the toolkit. You can view the runtime configuration file and verify that the “nvidia” container runtime handler actually exists.

    • Here are some ways to retrieve the container runtime config:

      • If using “containerd”, run containerd config dump to retrieve the active containerd configuration

      • If using “cri-o”, run the crio status config command to retrieve the active cri-o configuration
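    As a minimal sketch for a containerd-based node (file paths can vary by distribution and installation method), you can dump the active configuration and search for the “nvidia” handler:

      # Dump the active containerd configuration and look for the "nvidia" runtime handler
      sudo containerd config dump | grep -A 5 nvidia

      # Alternatively, inspect the configuration file directly (default path shown)
      sudo grep -A 5 nvidia /etc/containerd/config.toml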

Operator validator pods crashing with “error code system not yet initialized”#

When the operator validator pods are crashing with this error, this most likely points to a GPU node that is NVSwitch-based and requires the nvidia-fabricmanager to be installed. NVSwitch-based systems, like NVIDIA DGX and NVIDIA HGX server systems, require the memory fabric to be set up after the GPU driver is installed. Learn more about the Fabric Manager from the Fabric Manager user guide.

Action

  1. nvidia-smi -q

    • Exec into the driver container and run nvidia-smi -q if you are using the GPU driver daemonset.

      kubectl exec -n gpu-operator -it nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- nvidia-smi -q
      
    • The nvidia-smi -q command displays verbose output with all the attributes of a GPU.

    • If you see the following in the nvidia-smi -q command output, then the nvidia-fabricmanager needs to be installed:

      Fabric
           State                             : In Progress
           Status                            : N/A
           CliqueId                          : N/A
           ClusterUUID                       : N/A
      

    NOTE: If your driver is pre-installed on your host system, run nvidia-smi -q in your host’s shell terminal

  2. Refer to the nvidia-driver-daemonset logs

    • The driver daemonset has the logic to detect NVSwitches and install the nvidia-fabricmanager if they are found

    • Check the driver daemonset logs to confirm if the NVSwitch devices were detected and/or if the nvidia-fabricmanager was installed successfully

  3. Check the Fabric Manager logs

    • If the operator validator pods are still crashing despite fabric manager being installed, you may need to look up the fabric manager logs

    • Exec into the driver container and run cat /var/log/fabricmanager.log if the gpu driver daemonset is deployed

      kubectl exec -n gpu-operator -it nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- cat /var/log/fabricmanager.log
      
    • If you are using a host-installed driver, SSH into the host and run cat /var/log/fabricmanager.log
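    As a minimal sketch for the host-installed case, assuming the standard nvidia-fabricmanager systemd unit name (which can differ by distribution):

      # Check whether the Fabric Manager service is running on the host
      sudo systemctl status nvidia-fabricmanager

      # Follow its recent log output
      sudo journalctl -u nvidia-fabricmanager --since "1 hour ago"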

GPU Feature Discovery crashing with CreateContainerError/CrashLoopBackoff#

When the GPU Feature Discovery pods start crashing and you see the error below in the kubectl describe output, the root cause is likely a driver/hardware issue.

....
....
 Containers:
   gpu-feature-discovery:
    Container ID:   containerd://947879d0f2a3e3a11187c3435c2e13f1d8962540b8853cebb409eaa47f661c34
    Image:          nvcr.io/nvidia/gpu-feature-discovery:v0.8.0-ubi8
    Image ID:       nvcr.io/nvidia/gpu-feature-discovery@sha256:84ce86490d0d313ed6517f2ac3a271e1179d7478d86c772da3846727d7feddc3
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: driver rpc error: timed out: unknown

Action

  1. Check dmesg logs

    • dmesg can be used to surface any issues stemming from the GPU driver or hardware.

    • You can fine-tune your search by grepping for NVRM or Xid in the dmesg output.

    • Your command would look like sudo dmesg | grep -i NVRM or sudo dmesg | grep -i Xid.

    • If the output from the previous command has something like the snippet below, then it is likely a GPU driver/hardware issue.

      # dmesg |grep -i xid
      NVRM: Xid (PCI:0000:ca:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
      

    This error message indicates an Xid error with the code 79. For more information on Xid errors and their various error codes, refer to this page.

  2. Check nvidia-device-plugin-daemonset logs

    • The nvidia-device-plugin has a health checker module which periodically monitors the NVML event stream for any Xid errors and marks a GPU as unhealthy if an Xid error is reported against it

    • Retrieve the nvidia-device-plugin-daemonset pod logs

      kubectl logs -n gpu-operator nvidia-device-plugin-daemonset-9bmvc -c nvidia-device-plugin
      
    • If there are Xid errors, the device plugin logs should look something like

      XidCriticalError: Xid=48 on Device=GPU-e3dbf294-2783-f38b-4274-5bc836df5be1; marking device as unhealthy.
      
      'nvidia.com/gpu' device marked unhealthy: GPU-e3dbf294-2783-f38b-4274-5bc836df5be1
      

GPU Node does not have the expected number of GPUs#

When inspecting your GPU node, you may not see the expected number of “Allocatable” GPUs advertised in the node.

For example, given a GPU node with 8 GPUs, its kubectl describe output may look something like the snippet below:

Name:               gpu-node-1
Roles:              worker
......
......
Addresses:
  InternalIP:  10.158.144.58
  Hostname:    gpu-node-1
Capacity:
  cpu:                     96
  ephemeral-storage:       106935552Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  527422416Ki
  nvidia.com/gpu:          7
  pods:                    110
Allocatable:
  cpu:                     96
  ephemeral-storage:       98551804561
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  527320016Ki
  nvidia.com/gpu:          7
  pods:                    110
....
....

The above node advertises only 7 GPU devices as allocatable when we expect it to display 8.

Action

  1. Check for any Xid errors in the nvidia-device-plugin-daemonset pod logs. If an Xid error is raised for a GPU, the device plugin automatically marks the GPU as unhealthy and takes it off the list of “Allocatable” GPUs. Here are some example device-plugin logs in the event of an Xid error:

    I0624 22:58:05.486593       1 health.go:159] Processing event {Device:{Handle:0x7f7597647848} EventType:8 EventData:109 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
    I0624 22:58:05.486697       1 health.go:185] XidCriticalError: Xid=79 on Device=GPU-adb24b25-1db1-436e-d958-ddee5da83d07; marking device as unhealthy.
    I0624 22:58:05.486727       1 server.go:276] 'nvidia.com/gpu' device marked unhealthy: GPU-adb24b25-1db1-436e-d958-ddee5da83d07
    
  2. You can also check for Xid errors in the GPU node’s dmesg logs.

    sudo dmesg | grep -i xid
    
  3. For more information on Xid error codes and how to resolve them, refer to the Xid Errors page.
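As a quick cross-check, you can compare the number of GPUs the driver reports against what the node advertises. The pod and node names below are placeholders taken from the earlier examples:

# List the GPUs visible to the driver (run in the driver container)
kubectl exec -n gpu-operator -it nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- nvidia-smi -L

# Show the number of GPUs the node advertises as allocatable
kubectl get node gpu-node-1 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'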

DCGM Exporter pods go into CrashLoopBackoff#

By default, the gpu-operator deploys only the dcgm-exporter and disables the standalone dcgm. In this setup, the dcgm-exporter spawns a dcgm process locally. If, however, dcgm is enabled and deployed as a separate pod/container, then the dcgm-exporter attempts to connect to the dcgm pod through a Kubernetes service. If the cluster networking settings aren’t applied correctly, you will likely see error messages like the following in the dcgm-exporter logs:

time="2025-06-25T20:09:25Z" level=info msg="Attemping to connect to remote hostengine at nvidia-dcgm:5555"
time="2025-06-25T20:09:30Z" level=error msg="Encountered a failure." stacktrace="goroutine 1 [running]:\nruntime/debug.Stack()
/usr/local/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1.1()
/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:283 +0x3d\npanic({0x18b42c0?, 0x2a8d3e0?})
/usr/local/go/src/runtime/panic.go:770

Action

  1. If you have NetworkPolicies set up, ensure that they are configured to allow the dcgm-exporter pod to communicate with the dcgm pod

  2. Ensure that you don’t have security groups or network firewall settings preventing pod-to-pod traffic, whether intra-node or inter-node.
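As a starting point, you can verify that no NetworkPolicy blocks traffic in the namespace and that the dcgm Service exists on the port shown in the log (the service name nvidia-dcgm and port 5555 are taken from the log message above):

# List NetworkPolicies that could restrict pod-to-pod traffic in the namespace
kubectl get networkpolicy -n gpu-operator

# Confirm the dcgm Service exists and exposes port 5555
kubectl get svc -n gpu-operator nvidia-dcgm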

GPU driver upgrades are not progressing#

Despite initiating a cluster-wide driver upgrade, not every driver daemonset pod gets updated to the desired version, and this state may persist for a long period of time.

$ kubectl get daemonsets -n gpu-operator nvidia-driver-daemonset
NAME                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                       AGE
nvidia-driver-daemonset   4         4         4       3            4           nvidia.com/gpu.deploy.driver=true   14d

Action

  1. Check for any nodes that have the upgrade-failed label.

    kubectl get nodes -l nvidia.com/gpu-driver-upgrade-state=upgrade-failed
    
  2. Check the driver daemonset pod logs in these nodes

  3. If the driver daemonset pod logs aren’t informative, check the node’s dmesg

  4. Once the issue is resolved, you can re-label the node with the command below:

    kubectl label node <node-name> "nvidia.com/gpu-driver-upgrade-state=upgrade-required"
    
  5. If the driver upgrade is still stuck, delete the driver pod on the node.
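As a minimal sketch for steps 2 and 5, with the node and pod names as placeholders:

# Find the driver pod running on a node that failed to upgrade
kubectl get pods -n gpu-operator -o wide --field-selector spec.nodeName=<node-name>

# If the upgrade remains stuck, delete the driver pod so that the daemonset recreates it
kubectl delete pod -n gpu-operator <driver-pod-name>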

Pods stuck in Pending state in mixed MIG + full GPU environments#

Issue

For drivers 570.124.06, 570.133.20, 570.148.08, and 570.158.01, GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs. For more detailed information, see GitHub issue NVIDIA/gpu-operator#1361.

Observation

When a GPU pod is created on a node that has a mix of MIG slices and full GPUs, the GPU pod gets stuck indefinitely in the Pending state.

Root Cause

This is due to a regression in NVML introduced in the R570 drivers starting from 570.124.06.

Action

It’s recommended that you downgrade to driver version 570.86.15 to work around this issue.
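To confirm whether a node is running one of the affected driver versions, you can query the driver version from the driver container (the pod name is a placeholder):

# Print the installed driver version reported by each GPU
kubectl exec -n gpu-operator nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- nvidia-smi --query-gpu=driver_version --format=csv,noheader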

GPU Operator Validator: Failed to Create Pod Sandbox#

Issue

On some occasions, the driver container is unable to unload the nouveau Linux kernel module.

Observation

  • The output of kubectl describe pod -n gpu-operator -l app=nvidia-operator-validator includes the following event:

    Events:
      Type     Reason                  Age                 From     Message
      ----     ------                  ----                ----     -------
      Warning  FailedCreatePodSandBox  8s (x21 over 9m2s)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
    
  • Running one of the following commands on the node indicates that the nouveau Linux kernel module is loaded:

    $ lsmod | grep -i nouveau
    $ dmesg | grep -i nouveau
    $ journalctl -xb | grep -i nouveau
    

Root Cause

The nouveau Linux kernel module is loaded and the driver container is unable to unload the module. Because the nouveau module is loaded, the driver container cannot load the nvidia module.

Action

On each node, run the following commands to prevent loading the nouveau Linux kernel module on boot:

$ sudo tee /etc/modules-load.d/ipmi.conf <<< "ipmi_msghandler" \
    && sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<< "blacklist nouveau" \
    && sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf <<< "options nouveau modeset=0"

$ sudo update-initramfs -u

$ sudo init 6
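After the node reboots, you can verify that nouveau is no longer loaded and that the nvidia modules load instead:

# Should produce no output once nouveau is successfully blacklisted
$ lsmod | grep -i nouveau

# Should list the nvidia kernel modules after the driver container loads them
$ lsmod | grep -i nvidia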

No GPU Driver or Operand Pods Running#

Issue

On some clusters, taints are applied to nodes with a taint effect of NoSchedule.

Observation

  • Running kubectl get ds -n gpu-operator shows 0 for DESIRED, CURRENT, READY and so on.

    NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                         AGE
    gpu-feature-discovery             0         0         0       0            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true                                                                      11m
    ...
    

Root Cause

The NoSchedule taint prevents the Operator from deploying the GPU Driver and other Operand pods.

Action

Describe each node, identify the taints, and either remove the taints from the nodes or add them as tolerations to the daemonsets.
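For example, you can list the taints on each node and, where appropriate, remove one. The node name and taint key are placeholders:

# Show the taints applied to every node
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Remove a NoSchedule taint from a node (note the trailing "-")
$ kubectl taint nodes <node-name> <taint-key>:NoSchedule-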

GPU Operator Pods Stuck in Crash Loop#

Issue

On large clusters, such as those with 300 or more nodes, the GPU Operator pods can get stuck in a crash loop.

Observation

  • The GPU Operator pod is not running:

    $ kubectl get pod -n gpu-operator -l app=gpu-operator
    

    Example Output

    NAME                            READY   STATUS             RESTARTS      AGE
    gpu-operator-568c7ff7f6-chg5b   0/1     CrashLoopBackOff   4 (85s ago)   4m42s
    
  • The node that is running the GPU Operator pod has sufficient resources and the node is Ready:

    $ kubectl describe node <node-name>
    

    Example Output

    Conditions:
      Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
      ----                 ------  -----------------                 ------------------                ------                       -------
      MemoryPressure       False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
      DiskPressure         False   Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:03 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
      PIDPressure          False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
      Ready                True    Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:13 +0000   KubeletReady                 kubelet is posting ready status
    

Root Cause

The memory resource limit for the GPU Operator is too low for the cluster size.

Action

Increase the memory request and limit for the GPU Operator pod:

  • Set the memory request to a value that matches the average memory consumption over a large time window.

  • Set the memory limit to match the spikes in memory consumption that occur occasionally.

  1. Increase the memory resource limit for the GPU Operator pod:

    $ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
        -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/limits/memory", "value":"1400Mi"}]'
    
  2. Optional: Increase the memory resource request for the pod:

    $ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
        -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/requests/memory", "value":"600Mi"}]'
    

Monitor the GPU Operator pod. Increase the memory request and limit again if the pod remains stuck in a crash loop.
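If your cluster runs metrics-server, you can watch the pod's actual memory consumption to guide the new request and limit values:

# Show the current CPU and memory usage of the GPU Operator pod (requires metrics-server)
$ kubectl top pod -n gpu-operator -l app=gpu-operator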

infoROM is corrupted (nvidia-smi return code 14)#

Issue

The nvidia-operator-validator pod fails and the nvidia-driver-daemonset pods fail as well.

Observation

The output from the driver validation container indicates that the infoROM is corrupt:

$ kubectl logs -n gpu-operator nvidia-operator-validator-xxxxx -c driver-validation

Example Output

| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   42C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:0B:00.0
14

The GPU emits some warning messages related to infoROM. The return values for the nvidia-smi command are listed below.

RETURN VALUE

The return code reflects whether the operation succeeded or failed and, in the case of failure, the reason for it.

·      Return code 0 - Success
·      Return code 2 - A supplied argument or flag is invalid
·      Return code 3 - The requested operation is not available on the target device
·      Return code 4 - The current user does not have permission to access this device or perform this operation
·      Return code 6 - A query to find an object was unsuccessful
·      Return code 8 - A device's external power cables are not properly attached
·      Return code 9 - NVIDIA driver is not loaded
·      Return code 10 - The NVIDIA kernel detected an interrupt issue with a GPU
·      Return code 12 - The NVML shared library couldn't be found or loaded
·      Return code 13 - The local version of NVML doesn't implement this function
·      Return code 14 - infoROM is corrupted
·      Return code 15 - The GPU has fallen off the bus or has otherwise become inaccessible
·      Return code 255 - Other error or internal driver error occurred
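For example, you can capture the nvidia-smi return code directly in the driver container; a value of 14 confirms the corrupted infoROM (the pod name is a placeholder):

# Run nvidia-smi and print its return code from inside the driver container
$ kubectl exec -n gpu-operator nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- sh -c 'nvidia-smi > /dev/null; echo "nvidia-smi return code: $?"'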

Root Cause

The nvidia-smi command must return a success code (return code 0) for the driver-validation container to pass and for the GPU Operator to successfully deploy the driver pod on the node.

Action

Replace the faulty GPU.

EFI + Secure Boot#

Issue

GPU Driver pod fails to deploy.

Root Cause

EFI Secure Boot is currently not supported with the GPU Operator.

Action

Disable EFI Secure Boot on the server.

File an issue#

If you are facing an issue with the gpu-operator or its operands that is not documented in this guide, you can run the must-gather utility to prepare a bug report.

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

This utility collects relevant information from your cluster that is needed for diagnosing and debugging issues. The final output is an archive file that contains the manifests and logs of all the components managed by the gpu-operator.