Troubleshooting the NVIDIA GPU Operator#
This page outlines common issues and troubleshooting steps for the NVIDIA GPU Operator.
If you are facing an issue that is not covered by this page, please file an issue in the NVIDIA GPU Operator GitHub repository.
The nouveau driver fails to initialize the GPU#
Observation
- The GPU driver fails to initialize the GPU with the error Failed to enable MSI-X in the system journal logs.
- All GPU Operator pods become stuck in the Init state.
Root Cause
- The nouveau Linux kernel module is loaded.
Action
The nouveau driver must be denylisted when using NVIDIA vGPU.
Follow the instructions in the NVIDIA AI Enterprise: VMware Deployment Guide
to disable nouveau on your OS/distro to resolve this issue.
GPU Operator pods are stuck in Init#
Observation
The output of kubectl get pods -n gpu-operator looks similar to the following:
gpu-feature-discovery-tmblp                        0/1   Init:0/1  0         11m
nvidia-container-toolkit-daemonset-mqzwq           0/1   Init:0/1  0         2m
nvidia-dcgm-exporter-qpxxl                         0/1   Init:0/1  0         8m32s
nvidia-device-plugin-daemonset-tl9k7               0/1   Init:0/1  0         11m
nvidia-operator-validator-th4w7                    0/1   Init:0/4  0         10m
nvidia-driver-daemonset-4rtiu                      0/2   Running   3         12m
Root Cause
This most likely points to an issue with the nvidia-driver-daemonset. Note that the operand pods only come up after the driver daemonset and toolkit pods have come up successfully.
- Check the driver daemonset pod logs.
  - To retrieve the main driver container logs:
    kubectl logs -n gpu-operator nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr
  - If you see Init:Error in the kubectl output, retrieve the k8s-driver-manager logs:
    kubectl logs -n gpu-operator nvidia-driver-daemonset-p97x5 -c k8s-driver-manager
- Check the dmesg logs.
  - dmesg displays the messages generated by the Linux kernel. It helps detect issues loading the GPU driver modules, especially when the driver daemonset logs don’t provide much information.
  - You can retrieve the dmesg output either with kubectl exec or by running dmesg in your host terminal.
    - With kubectl exec:
      kubectl exec -n gpu-operator -it nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- dmesg
    - In your host terminal:
      sudo dmesg
  - TIP: You can also grep for NVRM or Xid to view logs emitted by the driver’s kernel module:
    sudo dmesg | grep -i NVRM
    or
    sudo dmesg | grep -i Xid
- Ensure that your driver daemonset has internet access to download deb/rpm packages at runtime.
  - Check your Kubernetes cluster’s VPC, security groups, and DNS settings.
  - Consider executing into a container shell and testing internet connectivity with a simple ping command.
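When many pods are listed, it can help to narrow the output down to the pods that are not fully ready before pulling logs. The following is a minimal sketch, not part of the GPU Operator tooling: the function name not_ready_pods is hypothetical, and it assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...).

```shell
# Print the name and status of every pod whose READY count is below its total.
# Reads `kubectl get pods --no-headers` style output on stdin.
# Assumed columns: NAME READY STATUS RESTARTS AGE.
not_ready_pods() {
  awk '{
    split($2, ready, "/")                      # READY looks like "0/1"
    if (ready[1] != ready[2]) print $1, $3     # name and status
  }'
}

# Example (requires cluster access):
#   kubectl get pods -n gpu-operator --no-headers | not_ready_pods
```

If the driver daemonset pod appears in the result, start with its logs, since the other operand pods wait for it.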
 
No runtime for “nvidia” is configured#
Observation
When you run kubectl describe for one of the gpu-operator pods, you see an error like:
Warning  FailedCreatePodSandBox  2m37s (x94 over 22m)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
Root Cause
This means that the RuntimeClass is unable to find the runtime handler named “nvidia” in your container runtime’s configuration.
The runtime handler is added by the nvidia-container-toolkit, so this error message is likely related to startup issues with the nvidia-container-toolkit.
Action
- Check the nvidia-container-toolkit logs.
  - To retrieve the toolkit pod logs:
    kubectl logs -n gpu-operator nvidia-container-toolkit-daemonset-2rhwg -c nvidia-container-toolkit-ctr
- Check the driver daemonset logs.
  - Ensure the driver daemonset is up and running. Refer to GPU Operator pods are stuck in Init.
- Review the container runtime configuration TOML.
  - CRI-O and containerd are the two main container runtimes supported by the toolkit. You can view the runtime configuration file and verify that the “nvidia” container runtime handler actually exists.
  - Here are some ways to retrieve the container runtime config:
    - If using “containerd”, run the containerd config dump command to retrieve the active containerd configuration.
    - If using “cri-o”, run the crio status config command to retrieve the active cri-o configuration.
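The check for the “nvidia” handler can be scripted once you have the configuration text. The sketch below is an illustration only: the function name has_nvidia_runtime is hypothetical, and it assumes the handler appears as a TOML table header ending in runtimes.nvidia], which is the usual shape in containerd and CRI-O configurations but may differ in your setup.

```shell
# Succeeds (exit status 0) when the runtime configuration read from stdin
# declares a runtime table named "nvidia", for example:
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#   [crio.runtime.runtimes.nvidia]
# Assumption: the handler shows up as a table header ending in "runtimes.nvidia]".
has_nvidia_runtime() {
  grep -Eq 'runtimes\.nvidia\]'
}

# Example (run on the GPU node):
#   containerd config dump | has_nvidia_runtime && echo "nvidia handler found"
#   crio status config     | has_nvidia_runtime && echo "nvidia handler found"
```

If the handler is missing, the toolkit pod logs are the next place to look.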
 
 
Operator validator pods crashing with “error code system not yet initialized”#
When the operator validator pods crash with this error, the GPU node is most likely NVSwitch-based and requires nvidia-fabricmanager to be installed. NVSwitch-based systems, like NVIDIA DGX and NVIDIA HGX server systems, require the memory fabric to be set up after the GPU driver is installed. Learn more about the Fabric Manager from the Fabric Manager user guide.
Action
- Run nvidia-smi -q.
  - If you are using the GPU driver daemonset, execute into the driver container and run nvidia-smi -q:
    kubectl exec -n gpu-operator -it nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- nvidia-smi -q
  - nvidia-smi -q displays verbose output with all the attributes of a GPU.
  - If you see the following in the nvidia-smi -q output, then nvidia-fabricmanager needs to be installed:
    Fabric
        State       : In Progress
        Status      : N/A
        CliqueId    : N/A
        ClusterUUID : N/A
  - Note: If your driver is pre-installed on your host system, run nvidia-smi -q in your host’s shell terminal.
- Refer to the nvidia-driver-daemonset logs.
  - The driver daemonset has the logic to detect NVSwitches and install nvidia-fabricmanager if they are found.
  - Check the driver daemonset logs to confirm whether the NVSwitch devices were detected and whether nvidia-fabricmanager was installed successfully.
- Check the Fabric Manager logs.
  - If the operator validator pods are still crashing despite Fabric Manager being installed, look up the Fabric Manager logs.
  - If the GPU driver daemonset is deployed, execute into the driver container and run cat /var/log/fabricmanager.log:
    kubectl exec -n gpu-operator -it nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- cat /var/log/fabricmanager.log
  - If you are using a host-installed driver, SSH into the host and run cat /var/log/fabricmanager.log.
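The verbose nvidia-smi -q output is long, so extracting only the fabric state can make the check quicker. This is a rough sketch: the function name fabric_state is hypothetical, and it assumes the output contains a section header line reading Fabric followed by a State field, as in the snippet shown in the steps above.

```shell
# Print the "State" value from each Fabric section of `nvidia-smi -q`
# output read from stdin (one line per GPU that reports a Fabric section).
# Assumption: the section header is a line containing only "Fabric".
fabric_state() {
  awk -F' *: *' '
    /^[[:space:]]*Fabric[[:space:]]*$/ { in_fabric = 1; next }
    in_fabric && $1 ~ /^[[:space:]]*State$/ { print $2; in_fabric = 0 }
  '
}

# Example (requires cluster access):
#   kubectl exec -n gpu-operator nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- nvidia-smi -q | fabric_state
```

A value of In Progress together with N/A for the remaining fields matches the condition described above.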
 
GPU Feature Discovery crashing with CreateContainerError/CrashLoopBackoff#
When the GPU Feature Discovery pods start crashing and you see the error below in the kubectl describe output, the root cause is likely a driver/hardware issue.
....
....
 Containers:
   gpu-feature-discovery:
    Container ID:   containerd://947879d0f2a3e3a11187c3435c2e13f1d8962540b8853cebb409eaa47f661c34
    Image:          nvcr.io/nvidia/gpu-feature-discovery:v0.8.0-ubi8
    Image ID:       nvcr.io/nvidia/gpu-feature-discovery@sha256:84ce86490d0d313ed6517f2ac3a271e1179d7478d86c772da3846727d7feddc3
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: driver rpc error: timed out: unknown
Action
- Check the dmesg logs.
  - dmesg can be used to find issues stemming from the GPU driver or hardware.
  - You can fine-tune your search by grepping for NVRM or Xid in the dmesg output:
    sudo dmesg | grep -i NVRM
    or
    sudo dmesg | grep -i Xid
  - If the output from the previous command looks like the snippet below, then it is likely a GPU driver/hardware issue.
    # dmesg | grep -i xid
    NVRM: Xid (PCI:0000:ca:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
  - This message indicates an Xid error with the code 79. For more information on Xid errors and their error codes, refer to the Xid Errors page.
- Check the nvidia-device-plugin-daemonset logs.
  - The nvidia-device-plugin has a health checker module that periodically monitors the NVML event stream for Xid errors and marks a GPU as unhealthy if an Xid error is reported against it.
  - Retrieve the nvidia-device-plugin-daemonset pod logs:
    kubectl logs -n gpu-operator nvidia-device-plugin-daemonset-9bmvc -c nvidia-device-plugin
  - If there are Xid errors, the device plugin logs should look something like:
    XidCriticalError: Xid=48 on Device=GPU-e3dbf294-2783-f38b-4274-5bc836df5be1; marking device as unhealthy.
    'nvidia.com/gpu' device marked unhealthy: GPU-e3dbf294-2783-f38b-4274-5bc836df5be1
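On a busy node the kernel log can contain many NVRM lines, so reducing them to the affected PCI address and Xid code can help. The following sketch assumes the message layout shown in the dmesg example above; the function name xid_events is hypothetical.

```shell
# Extract "<PCI address> <Xid code>" pairs from kernel log text on stdin.
# Matches lines shaped like:
#   NVRM: Xid (PCI:0000:ca:00): 79, pid='<unknown>', name=<unknown>, ...
# Assumption: the driver logs Xid events in exactly this layout.
xid_events() {
  sed -nE 's/.*NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): ([0-9]+),.*/\1 \2/p'
}

# Example (run on the GPU node):
#   sudo dmesg | xid_events
```

Each output line can then be looked up by its code on the Xid Errors page.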
 
GPU Node does not have the expected number of GPUs#
When inspecting your GPU node, you may not see the expected number of “Allocatable” GPUs advertised in the node.
For example, given a GPU node with 8 GPUs, its kubectl describe output may look something like the snippet below:
Name:               gpu-node-1
Roles:              worker
......
......
Addresses:
  InternalIP:  10.158.144.58
  Hostname:    gpu-node-1
Capacity:
  cpu:                     96
  ephemeral-storage:       106935552Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  527422416Ki
  nvidia.com/gpu:          7
  pods:                    110
Allocatable:
  cpu:                     96
  ephemeral-storage:       98551804561
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  527320016Ki
  nvidia.com/gpu:          7
  pods:                    110
....
....
The node above advertises only 7 GPU devices as allocatable, whereas 8 are expected.
Action
- Check for any Xid errors in the nvidia-device-plugin-daemonset pod logs. If an Xid error is raised for a GPU, the device plugin automatically marks the GPU as unhealthy and takes it off the list of “Allocatable” GPUs. Here are some example device-plugin logs in the event of an Xid error:
  I0624 22:58:05.486593 1 health.go:159] Processing event {Device:{Handle:0x7f7597647848} EventType:8 EventData:109 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
  I0624 22:58:05.486697 1 health.go:185] XidCriticalError: Xid=79 on Device=GPU-adb24b25-1db1-436e-d958-ddee5da83d07; marking device as unhealthy.
  I0624 22:58:05.486727 1 server.go:276] 'nvidia.com/gpu' device marked unhealthy: GPU-adb24b25-1db1-436e-d958-ddee5da83d07
- You can also check for Xid errors in the GPU node’s dmesg logs:
  sudo dmesg | grep -i xid
- For more information on Xid error codes and how to resolve them, refer to the Xid Errors page.
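To see at a glance which GPUs the device plugin took out of the allocatable pool, the log lines can be reduced to device and Xid code. This is a sketch under the assumption that the log format matches the example lines above; the function name unhealthy_gpus is hypothetical.

```shell
# Print each unique "<GPU UUID> <Xid code>" pair that the device plugin
# reported as a critical Xid error. Reads device plugin pod logs on stdin.
# Assumption: log lines contain "XidCriticalError: Xid=<n> on Device=<uuid>;".
unhealthy_gpus() {
  sed -nE 's/.*XidCriticalError: Xid=([0-9]+) on Device=(GPU-[0-9a-f-]+);.*/\2 \1/p' | sort -u
}

# Example (requires cluster access; replace the pod name with your own):
#   kubectl logs -n gpu-operator nvidia-device-plugin-daemonset-9bmvc -c nvidia-device-plugin | unhealthy_gpus
```

The number of lines printed should account for the gap between the expected and advertised GPU counts.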
DCGM Exporter pods go into CrashLoopBackoff#
By default, the GPU Operator deploys only the dcgm-exporter and disables the standalone dcgm. In this setup, the dcgm-exporter spawns a dcgm process locally. If, however, dcgm is enabled and deployed as a separate pod/container, the dcgm-exporter attempts to connect to the dcgm pod through a Kubernetes service. If the cluster networking settings aren’t applied correctly, you will likely see error messages like the following in the dcgm-exporter logs:
time="2025-06-25T20:09:25Z" level=info msg="Attempting to connect to remote hostengine at nvidia-dcgm:5555"
time="2025-06-25T20:09:30Z" level=error msg="Encountered a failure." stacktrace="goroutine 1 [running]:\nruntime/debug.Stack()
/usr/local/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1.1()
/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:283 +0x3d\npanic({0x18b42c0?, 0x2a8d3e0?})
/usr/local/go/src/runtime/panic.go:770
Action
- If you have NetworkPolicies set up, ensure that they are configured to allow the dcgm-exporter pod to communicate with the dcgm pod.
- Ensure that you don’t have security groups or network firewall settings preventing pod-to-pod traffic, whether intranode or internode.
GPU driver upgrades are not progressing#
Despite initiating a cluster-wide driver upgrade, not every driver daemonset pod is updated to the desired version, and this state may persist for a long time. In the example below, only 3 of the 4 pods are up to date:
$ kubectl get daemonsets -n gpu-operator nvidia-driver-daemonset
NAME                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                       AGE
nvidia-driver-daemonset   4         4         4       3            4           nvidia.com/gpu.deploy.driver=true   14d
Action
- Check for any nodes that have the upgrade-failed label:
  kubectl get nodes -l nvidia.com/gpu-driver-upgrade-state=upgrade-failed
- Check the driver daemonset pod logs on these nodes.
- If the driver daemonset pod logs aren’t informative, check the node’s dmesg.
- Once the issue is resolved, you can re-label the node with the command below:
  kubectl label node <node-name> "nvidia.com/gpu-driver-upgrade-state=upgrade-required"
- If the driver upgrade is still stuck, delete the driver pod on the node.
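Spotting a stalled upgrade comes down to comparing the DESIRED and UP-TO-DATE columns of the daemonset listing shown earlier. A minimal sketch of that comparison follows; the function name outdated_daemonsets is hypothetical, and it assumes the default column order of kubectl get daemonsets.

```shell
# Read `kubectl get daemonsets` output (including the header line) on stdin
# and print "<name> <count>" for each daemonset whose UP-TO-DATE count is
# below DESIRED, where <count> is the number of pods still to be updated.
# Assumed columns: NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE ...
outdated_daemonsets() {
  awk 'NR > 1 && $5 < $2 { print $1, $2 - $5 }'
}

# Example (requires cluster access):
#   kubectl get daemonsets -n gpu-operator | outdated_daemonsets
```

For the example listing in this section, the result is one line reporting a single outdated pod.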
Pods stuck in Pending state in mixed MIG + full GPU environments#
Issue
For drivers 570.124.06, 570.133.20, 570.148.08, and 570.158.01, GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs. For more detailed information, see GitHub issue NVIDIA/gpu-operator#1361.
Observation
When a GPU pod is created on a node that has a mix of MIG slices and full GPUs,
the GPU pod gets stuck indefinitely in the Pending state.
Root Cause
This is due to a regression in NVML introduced in the R570 drivers starting from 570.124.06.
Action
NVIDIA recommends that you downgrade to driver version 570.86.15 to work around this issue.
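To check a node against the affected releases, the version list from this section can be encoded in a small helper. This is only a sketch: the function name is_affected_driver is hypothetical, and the list contains just the four versions named above, so it does not cover releases published later.

```shell
# Succeeds (exit status 0) when the given driver version is one of the
# releases this section lists as affected by the mixed MIG + full GPU
# scheduling issue. The list is taken from the text above only.
is_affected_driver() {
  case "$1" in
    570.124.06|570.133.20|570.148.08|570.158.01) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (run where nvidia-smi is available):
#   version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
#   is_affected_driver "$version" && echo "driver $version is affected"
```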
GPU Operator Validator: Failed to Create Pod Sandbox#
Issue
On some occasions, the driver container is unable to unload the nouveau Linux kernel module.
Observation
- Running kubectl describe pod -n gpu-operator -l app=nvidia-operator-validator includes the following event:
  Events:
    Type     Reason                  Age                 From     Message
    ----     ------                  ----                ----     -------
    Warning  FailedCreatePodSandBox  8s (x21 over 9m2s)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
- Running one of the following commands on the node indicates that the nouveau Linux kernel module is loaded:
  $ lsmod | grep -i nouveau
  $ dmesg | grep -i nouveau
  $ journalctl -xb | grep -i nouveau
Root Cause
The nouveau Linux kernel module is loaded and the driver container is unable to unload the module.
Because the nouveau module is loaded, the driver container cannot load the nvidia module.
Action
On each node, run the following commands to prevent loading the nouveau Linux kernel module on boot:
$ sudo tee /etc/modules-load.d/ipmi.conf <<< "ipmi_msghandler" \
    && sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<< "blacklist nouveau" \
    && sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf <<< "options nouveau modeset=0"
$ sudo update-initramfs -u
$ sudo init 6
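After the node reboots, it is worth confirming that the module is no longer loaded. The sketch below checks lsmod output for an exact module name match; the function name nouveau_loaded is hypothetical.

```shell
# Succeeds (exit status 0) when `lsmod` output read from stdin lists the
# nouveau module. Only the first column (module name) is compared, so
# modules that merely depend on nouveau do not count as a match.
nouveau_loaded() {
  awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Example (run on the node after the reboot):
#   if lsmod | nouveau_loaded; then echo "nouveau is still loaded"; fi
```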
No GPU Driver or Operand Pods Running#
Issue
On some clusters, taints are applied to nodes with a taint effect of NoSchedule.
Observation
- Running kubectl get ds -n gpu-operator shows 0 for DESIRED, CURRENT, READY, and so on.
  NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
  gpu-feature-discovery   0         0         0       0            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   11m
  ...
Root Cause
The NoSchedule taint prevents the Operator from deploying the GPU Driver and other Operand pods.
Action
Describe each node, identify the taints, and either remove the taints from the nodes or add the taints as tolerations to the daemon sets.
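Identifying the taints can be done by filtering the node description. The following sketch assumes the usual kubectl describe node layout, in which each taint is printed as key=value:effect or key:effect at the end of a line; the function name noschedule_taints is hypothetical.

```shell
# Print every taint with the NoSchedule effect found in
# `kubectl describe node` output read from stdin.
# Assumption: each taint ends its line, e.g.
#   Taints:             dedicated=gpu:NoSchedule
noschedule_taints() {
  grep -oE '[^[:space:]]+:NoSchedule$'
}

# Example (requires cluster access):
#   kubectl describe node <node-name> | noschedule_taints
# A taint can then be removed by appending "-" to it, for instance:
#   kubectl taint nodes <node-name> dedicated=gpu:NoSchedule-
```

The taint dedicated=gpu shown in the comments is a made-up value for illustration; use the taints reported for your own nodes.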
GPU Operator Pods Stuck in Crash Loop#
Issue
On large clusters, such as 300 or more nodes, the GPU Operator pods can get stuck in a crash loop.
Observation
- The GPU Operator pod is not running:
  $ kubectl get pod -n gpu-operator -l app=gpu-operator
  Example Output
  NAME                            READY   STATUS             RESTARTS      AGE
  gpu-operator-568c7ff7f6-chg5b   0/1     CrashLoopBackOff   4 (85s ago)   4m42s
- The node that is running the GPU Operator pod has sufficient resources and the node is Ready:
  $ kubectl describe node <node-name>
  Example Output
  Conditions:
    Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
    ----             ------  -----------------                 ------------------                ------                       -------
    MemoryPressure   False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
    DiskPressure     False   Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:03 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
    PIDPressure      False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
    Ready            True    Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:13 +0000   KubeletReady                 kubelet is posting ready status
Root Cause
The memory resource limit for the GPU Operator is too low for the cluster size.
Action
Increase the memory request and limit for the GPU Operator pod:
- Set the memory request to a value that matches the average memory consumption over a large time window. 
- Set the memory limit to match the spikes in memory consumption that occur occasionally. 
- Increase the memory resource limit for the GPU Operator pod:
  $ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
      -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/limits/memory", "value":"1400Mi"}]'
- Optional: Increase the memory resource request for the pod:
  $ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
      -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/requests/memory", "value":"600Mi"}]'
Monitor the GPU Operator pod. Increase the memory request and limit again if the pod remains stuck in a crash loop.
infoROM is corrupted (nvidia-smi return code 14)#
Issue
The nvidia-operator-validator pod fails, and the nvidia-driver-daemonset pods fail as well.
Observation
The output from the driver validation container indicates that the infoROM is corrupt:
$ kubectl logs -n gpu-operator nvidia-operator-validator-xxxxx -c driver-validation
Example Output
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   42C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:0B:00.0
14
The GPU emits some warning messages related to infoROM.
The return values for the nvidia-smi command are listed below.
RETURN VALUE
Return code reflects whether the operation succeeded or failed and what
was the reason of failure.
·      Return code 0 - Success
·      Return code 2 - A supplied argument or flag is invalid
·      Return code 3 - The requested operation is not available on target device
·      Return code 4 - The current user does not have permission to access this device or perform this operation
·      Return code 6 - A query to find an object was unsuccessful
·      Return code 8 - A device's external power cables are not properly attached
·      Return code 9 - NVIDIA driver is not loaded
·      Return code 10 - NVIDIA Kernel detected an interrupt issue with a GPU
·      Return code 12 - NVML Shared Library couldn't be found or loaded
·      Return code 13 - Local version of NVML doesn't implement this function
·      Return code 14 - infoROM is corrupted
·      Return code 15 - The GPU has fallen off the bus or has otherwise become inaccessible
·      Return code 255 - Other error or internal driver error occurred
Root Cause
The nvidia-smi command must return a success code (return code 0) for the driver-validation container to pass and for the GPU Operator to successfully deploy the driver pod on the node. Here it returns 14 because the infoROM is corrupted.
Action
Replace the faulty GPU.
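When triaging validator failures across several nodes, translating the numeric return code into its meaning can save a lookup. The sketch below encodes the return value list quoted earlier in this section; the function name nvidia_smi_rc_message is hypothetical, and codes not in that list are reported as unknown rather than guessed.

```shell
# Print the documented meaning of an nvidia-smi return code.
# The mapping mirrors the RETURN VALUE list quoted in this section.
nvidia_smi_rc_message() {
  case "$1" in
    0)   echo "Success" ;;
    2)   echo "A supplied argument or flag is invalid" ;;
    3)   echo "The requested operation is not available on target device" ;;
    4)   echo "The current user does not have permission to access this device or perform this operation" ;;
    6)   echo "A query to find an object was unsuccessful" ;;
    8)   echo "A device's external power cables are not properly attached" ;;
    9)   echo "NVIDIA driver is not loaded" ;;
    10)  echo "NVIDIA Kernel detected an interrupt issue with a GPU" ;;
    12)  echo "NVML Shared Library couldn't be found or loaded" ;;
    13)  echo "Local version of NVML doesn't implement this function" ;;
    14)  echo "infoROM is corrupted" ;;
    15)  echo "The GPU has fallen off the bus or has otherwise become inaccessible" ;;
    255) echo "Other error or internal driver error occurred" ;;
    *)   echo "Unknown return code: $1" ;;   # not in the quoted list
  esac
}

# Example (run where nvidia-smi is available):
#   nvidia-smi > /dev/null; nvidia_smi_rc_message "$?"
```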
EFI + Secure Boot#
Issue
GPU Driver pod fails to deploy.
Root Cause
EFI Secure Boot is currently not supported with the GPU Operator.
Action
Disable EFI Secure Boot on the server.
File an issue#
If you are facing a gpu-operator and/or operand(s) issue that is not documented in this guide, you can run the must-gather utility to prepare a bug report.
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
This utility is used to collect relevant information from your cluster that is needed for diagnosing and debugging issues. The final output is an archive file which contains the manifests and logs of all the components managed by gpu-operator.