Troubleshooting the NVIDIA GPU Operator#
This page outlines common issues and troubleshooting steps for the NVIDIA GPU Operator.
If you are facing an issue that is not covered by this page, please file an issue in the NVIDIA GPU Operator GitHub repository.
GPU Operator pods are stuck in Init#
Observation
The output from kubectl get pods -n gpu-operator shows something like:
gpu-feature-discovery-tmblp 0/1 Init:0/1 0 11m
nvidia-container-toolkit-daemonset-mqzwq 0/1 Init:0/1 0 2m
nvidia-dcgm-exporter-qpxxl 0/1 Init:0/1 0 8m32s
nvidia-device-plugin-daemonset-tl9k7 0/1 Init:0/1 0 11m
nvidia-operator-validator-th4w7 0/1 Init:0/4 0 10m
nvidia-driver-daemonset-4rtiu 0/2 Running 3 12m
Root Cause
This most likely points to an issue with the nvidia-driver-daemonset. Note that the operand pods only come up after the driver daemonset and toolkit pods start successfully.
Action
Check the driver daemonset pod logs
To retrieve the main driver container logs:
kubectl logs -n gpu-operator nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr
If you see Init:Error in the kubectl output, retrieve the k8s-driver-manager logs:
kubectl logs -n gpu-operator nvidia-driver-daemonset-p97x5 -c k8s-driver-manager
Check the dmesg logs
dmesg displays the messages generated by the Linux kernel. It helps detect issues loading the GPU driver modules, especially when the driver daemonset logs don't provide much information.
You can retrieve the dmesg output either with kubectl exec or by running dmesg in your host terminal.
Using kubectl exec:
kubectl exec -n gpu-operator -it nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- dmesg
Running dmesg in your host terminal:
sudo dmesg
TIP: You can also grep for NVRM or Xid to view logs emitted by the driver’s kernel module.
sudo dmesg | grep -i NVRM
OR
sudo dmesg | grep -i Xid
Ensure that your driver daemonset has internet access to download deb/rpm packages during runtime:
Check your Kubernetes cluster's VPC, security groups, and DNS settings.
Consider exec'ing into a container shell and testing internet connectivity with a simple ping command, as shown below.
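As a quick connectivity check, you can exec into the driver container and try to reach an external host. This is a minimal sketch: the pod name is a placeholder from the examples above, and the container image may not include every networking tool, so substitute curl or wget if ping is unavailable.
# Test DNS resolution and outbound reachability from inside the driver container
kubectl exec -n gpu-operator -it nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- ping -c 3 developer.download.nvidia.com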
No runtime for “nvidia” is configured#
Observation
When running kubectl describe for one of the gpu-operator pods, you see an error like:
Warning FailedCreatePodSandBox 2m37s (x94 over 22m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
Root Cause
This means that the RuntimeClass is unable to find the runtime handler named "nvidia" in your container runtime's configuration. The runtime handler is added by the nvidia-container-toolkit, so this error message is likely related to startup issues with the nvidia-container-toolkit.
Action
Check the nvidia-container-toolkit logs
To retrieve the toolkit pod logs:
kubectl logs -n gpu-operator nvidia-container-toolkit-daemonset-2rhwg -c nvidia-container-toolkit-ctr
Check the driver daemonset logs
Ensure the driver daemonset is up and running. Refer to GPU Operator pods are stuck in Init.
Review the container runtime configuration TOML
CRI-O and containerd are the two main container runtimes supported by the toolkit. You can view the runtime configuration file and verify that the "nvidia" container runtime handler actually exists.
Here are some ways to retrieve the container runtime config:
If using containerd, run the containerd config command to retrieve the active containerd configuration (see the sketch after this list).
If using cri-o, run the crio status config command to retrieve the active cri-o configuration.
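For example, on a containerd-based node you could check whether the nvidia runtime handler is present. This is a sketch only: the config file path and section names can vary by distribution and by how the toolkit was configured.
# Dump the active containerd configuration and look for the nvidia runtime handler
sudo containerd config dump | grep -i nvidia
# Or inspect the config file the NVIDIA Container Toolkit typically modifies (default path shown; yours may differ)
grep -A 5 'runtimes.nvidia' /etc/containerd/config.toml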
Operator validator pods crashing with “error code system not yet initialized”#
When the operator validator pods are crashing with this error, this most likely points to a GPU node that is NVSwitch-based and requires nvidia-fabricmanager to be installed. NVSwitch-based systems, like NVIDIA DGX and NVIDIA HGX server systems, require the memory fabric to be set up after the GPU driver is installed. Learn more about the Fabric Manager from the Fabric Manager user guide.
Action
Run nvidia-smi -q
If you are using the GPU driver daemonset, exec into the driver container and run nvidia-smi -q:
kubectl exec -n gpu-operator -it nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- nvidia-smi -q
nvidia-smi -q displays a verbose output with all the attributes of a GPU. If you see the following in the nvidia-smi -q command output, then nvidia-fabricmanager needs to be installed:
Fabric
    State       : In Progress
    Status      : N/A
    CliqueId    : N/A
    ClusterUUID : N/A
NOTE: If your driver is pre-installed on your host system, run nvidia-smi -q in your host's shell terminal.
Refer to the nvidia-driver-daemonset logs
The driver daemonset has the logic to detect NVSwitches and install nvidia-fabricmanager if they are found. Check the driver daemonset logs to confirm whether the NVSwitch devices were detected and whether nvidia-fabricmanager was installed successfully.
Check the Fabric Manager logs
If the operator validator pods are still crashing despite Fabric Manager being installed, you may need to look at the Fabric Manager logs.
If the GPU driver daemonset is deployed, exec into the driver container and run cat /var/log/fabricmanager.log:
kubectl exec -n gpu-operator -it nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- cat /var/log/fabricmanager.log
If you are using a host-installed driver, SSH into the host and run:
cat /var/log/fabricmanager.log
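To narrow the verbose nvidia-smi -q output down to the fabric status, you can filter it. This sketch reuses the placeholder pod name from the steps above; on a healthy NVSwitch system the State and Status fields typically report Completed and Success.
# Show only the Fabric section of the nvidia-smi -q output
kubectl exec -n gpu-operator -it nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- nvidia-smi -q | grep -A 4 -i 'Fabric'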
GPU Feature Discovery crashing with CreateContainerError/CrashLoopBackoff#
When the GPU Feature Discovery pods start crashing and you see the error below in the kubectl describe output, the root cause is likely a driver/hardware issue.
....
....
Containers:
gpu-feature-discovery:
    Container ID:  containerd://947879d0f2a3e3a11187c3435c2e13f1d8962540b8853cebb409eaa47f661c34
    Image:         nvcr.io/nvidia/gpu-feature-discovery:v0.8.0-ubi8
    Image ID:      nvcr.io/nvidia/gpu-feature-discovery@sha256:84ce86490d0d313ed6517f2ac3a271e1179d7478d86c772da3846727d7feddc3
    Port:          <none>
    Host Port:     <none>
    State:         Waiting
      Reason:      CrashLoopBackOff
    Last State:    Terminated
      Reason:      StartError
      Message:     failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: driver rpc error: timed out: unknown
Action
Check the dmesg logs
dmesg can be used to surface issues stemming from the GPU driver or hardware. You can fine-tune your search by grepping for NVRM or Xid in the dmesg output. Your command would look like:
sudo dmesg | grep -i NVRM
or
sudo dmesg | grep -i Xid
If the output from the previous command has something like the snippet below, then it is likely a GPU driver/hardware issue.
# dmesg | grep -i xid
NVRM: Xid (PCI:0000:ca:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
This error message indicates an Xid error with code 79. For more information on Xid errors and their various error codes, refer to this page.
Check the nvidia-device-plugin-daemonset logs
The nvidia-device-plugin has a health checker module that periodically monitors the NVML event stream for Xid errors and marks a GPU as unhealthy if an Xid error is reported against it.
Retrieve the nvidia-device-plugin-daemonset pod logs:
kubectl logs -n gpu-operator nvidia-device-plugin-daemonset-9bmvc -c nvidia-device-plugin
If there are Xid errors, the device plugin logs should look something like:
XidCriticalError: Xid=48 on Device=GPU-e3dbf294-2783-f38b-4274-5bc836df5be1; marking device as unhealthy.
'nvidia.com/gpu' device marked unhealthy: GPU-e3dbf294-2783-f38b-4274-5bc836df5be1
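If you want to scan every device-plugin pod at once rather than picking a single pod name, a label selector can help. This is a sketch and assumes the default app label applied by the operator; adjust it if your deployment labels differ.
# Search all device-plugin pods for Xid health events
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset --tail=-1 | grep -i xid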
GPU Node does not have the expected number of GPUs#
When inspecting your GPU node, you may not see the expected number of “Allocatable” GPUs advertised in the node.
For example, given a GPU node with 8 GPUs, its kubectl describe output may look something like the snippet below:
Name: gpu-node-1
Roles: worker
......
......
Addresses:
InternalIP: 10.158.144.58
Hostname: gpu-node-1
Capacity:
cpu: 96
ephemeral-storage: 106935552Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 527422416Ki
nvidia.com/gpu: 7
pods: 110
Allocatable:
cpu: 96
ephemeral-storage: 98551804561
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 527320016Ki
nvidia.com/gpu: 7
pods: 110
....
....
The above node advertises only 7 GPU devices as allocatable when we expect it to display 8.
Action
Check for any Xid errors in the nvidia-device-plugin-daemonset pod logs. If an Xid error is raised for a GPU, the device plugin automatically marks the GPU as unhealthy and takes it off the list of "Allocatable" GPUs. Here are some example device-plugin logs in the event of an Xid error:
I0624 22:58:05.486593       1 health.go:159] Processing event {Device:{Handle:0x7f7597647848} EventType:8 EventData:109 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I0624 22:58:05.486697       1 health.go:185] XidCriticalError: Xid=79 on Device=GPU-adb24b25-1db1-436e-d958-ddee5da83d07; marking device as unhealthy.
I0624 22:58:05.486727       1 server.go:276] 'nvidia.com/gpu' device marked unhealthy: GPU-adb24b25-1db1-436e-d958-ddee5da83d07
You can also check for Xid errors in the GPU node's dmesg logs:
sudo dmesg | grep -i xid
For more information on Xid error codes and how to resolve them, refer to the Xid Errors page.
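As a quick sanity check, you can compare the number of GPUs the driver sees with the number the node advertises as allocatable. This is a minimal sketch, assuming the node name gpu-node-1 from the example above and that nvidia-smi is available on the host or in the driver container.
# GPUs visible to the driver (run on the host or inside the driver container)
nvidia-smi -L | wc -l
# GPUs advertised as allocatable by Kubernetes
kubectl get node gpu-node-1 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'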
DCGM Exporter pods go into CrashLoopBackoff#
By default, the gpu-operator deploys only the dcgm-exporter while disabling the standalone dcgm. In this setup, the dcgm-exporter spawns a dcgm process locally. If, however, dcgm is enabled and deployed as a separate pod/container, then the dcgm-exporter attempts to connect to the dcgm pod through a Kubernetes service. If the cluster networking settings aren't applied correctly, you would likely see error messages like the following in the dcgm-exporter logs:
time="2025-06-25T20:09:25Z" level=info msg="Attemping to connect to remote hostengine at nvidia-dcgm:5555"
time="2025-06-25T20:09:30Z" level=error msg="Encountered a failure." stacktrace="goroutine 1 [running]:\nruntime/debug.Stack()
/usr/local/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1.1()
/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:283 +0x3d\npanic({0x18b42c0?, 0x2a8d3e0?})
/usr/local/go/src/runtime/panic.go:770
Action
If you have NetworkPolicies set up, ensure that they are configured to allow the dcgm-exporter pod to communicate with the dcgm pod.
Ensure that you don't have security groups or network firewall settings preventing pod-to-pod traffic, whether intra-node or inter-node.
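As a starting point, you can confirm that the DCGM service the exporter is trying to reach exists and check for NetworkPolicies in the namespace. This is a sketch assuming the default nvidia-dcgm service name that appears in the log message above.
# Confirm the DCGM hostengine service exists and note its ClusterIP and port (5555 by default)
kubectl get svc -n gpu-operator nvidia-dcgm
# List any NetworkPolicies in the namespace that could block dcgm-exporter to dcgm traffic
kubectl get networkpolicy -n gpu-operator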
GPU driver upgrades are not progressing#
Despite initiating a cluster-wide driver upgrade, not every driver daemonset pod gets updated to the desired version, and this state may persist for a long period of time.
$ kubectl get daemonsets -n gpu-operator nvidia-driver-daemonset
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
nvidia-driver-daemonset 4 4 4 3 4 nvidia.com/gpu.deploy.driver=true 14d
Action
Check for any nodes that have the upgrade-failed label:
kubectl get nodes -l nvidia.com/gpu-driver-upgrade-state=upgrade-failed
Check the driver daemonset pod logs on these nodes
If the driver daemonset pod logs aren't informative, check the node's dmesg output.
Once the issue is resolved, you can re-label the node with the command below:
kubectl label node <node-name> "nvidia.com/gpu-driver-upgrade-state=upgrade-required" --overwrite
If the driver upgrade is still stuck, delete the driver pod on the node.
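The following sketch pulls these steps together. The node name is a placeholder, and the app label assumes the default driver daemonset labels, so adjust it if you have customized them.
# Inspect the upgrade state of every node at a glance
kubectl get nodes -L nvidia.com/gpu-driver-upgrade-state
# If the upgrade remains stuck after re-labeling, delete the driver pod on the affected node
kubectl delete pod -n gpu-operator -l app=nvidia-driver-daemonset --field-selector spec.nodeName=<node-name>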
Pods stuck in Pending state in mixed MIG + full GPU environments#
Issue
For drivers 570.124.06, 570.133.20, 570.148.08, and 570.158.01, GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs. For more detailed information, see GitHub issue NVIDIA/gpu-operator#1361.
Observation
When a GPU pod is created on a node that has a mix of MIG slices and full GPUs, the GPU pod gets stuck indefinitely in the Pending state.
Root Cause
This is due to a regression in NVML introduced in the R570 drivers starting from 570.124.06.
Action
It’s recommended that you downgrade to driver version 570.86.15 to work around this issue.
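To confirm whether a node is running one of the affected driver versions, you can query the driver version from the driver container. The pod name below is a placeholder from the earlier examples; if the driver is pre-installed on the host, run the nvidia-smi command directly on the host instead.
# Print the driver version reported by the GPU driver
kubectl exec -n gpu-operator -it nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- nvidia-smi --query-gpu=driver_version --format=csv,noheader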
GPU Operator Validator: Failed to Create Pod Sandbox#
Issue
On some occasions, the driver container is unable to unload the nouveau Linux kernel module.
Observation
Running kubectl describe pod -n gpu-operator -l app=nvidia-operator-validator includes the following event:
Events:
  Type     Reason                  Age                 From     Message
  ----     ------                  ----                ----     -------
  Warning  FailedCreatePodSandBox  8s (x21 over 9m2s)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
Running one of the following commands on the node indicates that the nouveau Linux kernel module is loaded:
$ lsmod | grep -i nouveau
$ dmesg | grep -i nouveau
$ journalctl -xb | grep -i nouveau
Root Cause
The nouveau Linux kernel module is loaded and the driver container is unable to unload the module. Because the nouveau module is loaded, the driver container cannot load the nvidia module.
Action
On each node, run the following commands to prevent loading the nouveau Linux kernel module on boot:
$ sudo tee /etc/modules-load.d/ipmi.conf <<< "ipmi_msghandler" \
&& sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<< "blacklist nouveau" \
&& sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf <<< "options nouveau modeset=0"
$ sudo update-initramfs -u
$ sudo init 6
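After the node reboots, you can verify that nouveau stays unloaded and that the nvidia modules appear once the driver container has run. A quick check on the node:
# Should produce no output once nouveau is blacklisted
lsmod | grep -i nouveau
# Should list the nvidia kernel modules after the driver container has loaded them
lsmod | grep -E '^nvidia'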
No GPU Driver or Operand Pods Running#
Issue
On some clusters, taints are applied to nodes with a taint effect of NoSchedule.
Observation
Running kubectl get ds -n gpu-operator shows 0 for DESIRED, CURRENT, READY, and so on:
NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery   0         0         0       0            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   11m
...
Root Cause
The NoSchedule taint prevents the Operator from deploying the GPU Driver and other Operand pods.
Action
Describe each node, identify the taints, and either remove the taints from the nodes or add the taints as tolerations to the daemon sets.
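One way to inspect the taints and then tolerate them across the operands is shown below. This is a sketch only: the taint key and value are placeholders, and the ClusterPolicy name and daemonsets.tolerations field are assumed to match a default GPU Operator installation, so verify them against your own ClusterPolicy before applying.
# Show the taints on every node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
# Propagate a matching toleration to the operand daemon sets through the ClusterPolicy
kubectl patch clusterpolicy/cluster-policy --type merge -p '{"spec":{"daemonsets":{"tolerations":[{"key":"example-taint-key","operator":"Equal","value":"example-value","effect":"NoSchedule"}]}}}'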
GPU Operator Pods Stuck in Crash Loop#
Issue
On large clusters, such as 300 or more nodes, the GPU Operator pods can get stuck in a crash loop.
Observation
The GPU Operator pod is not running:
$ kubectl get pod -n gpu-operator -l app=gpu-operator
Example Output
NAME                            READY   STATUS             RESTARTS      AGE
gpu-operator-568c7ff7f6-chg5b   0/1     CrashLoopBackOff   4 (85s ago)   4m42s
The node that is running the GPU Operator pod has sufficient resources and the node is Ready:
$ kubectl describe node <node-name>
Example Output
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:03 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:13 +0000   KubeletReady                 kubelet is posting ready status
Root Cause
The memory resource limit for the GPU Operator is too low for the cluster size.
Action
Increase the memory request and limit for the GPU Operator pod:
Set the memory request to a value that matches the average memory consumption over a large time window.
Set the memory limit to match the spikes in memory consumption that occur occasionally.
Increase the memory resource limit for the GPU Operator pod:
$ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
    -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/limits/memory", "value":"1400Mi"}]'
Optional: Increase the memory resource request for the pod:
$ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
    -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/requests/memory", "value":"600Mi"}]'
Monitor the GPU Operator pod. Increase the memory request and limit again if the pod remains stuck in a crash loop.
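To pick sensible values, you can observe the operator pod's actual memory usage over time. This assumes a metrics-server (or equivalent metrics API) is installed in the cluster.
# Report current CPU and memory usage of the GPU Operator pod
kubectl top pod -n gpu-operator -l app=gpu-operator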
infoROM is corrupted (nvidia-smi return code 14)#
Issue
The nvidia-operator-validator pod fails and the nvidia-driver-daemonset pods fail as well.
Observation
The output from the driver validation container indicates that the infoROM is corrupt:
$ kubectl logs -n gpu-operator nvidia-operator-validator-xxxxx -c driver-validation
Example Output
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:0B:00.0 Off | 0 |
| N/A 42C P0 29W / 250W | 0MiB / 16280MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:0B:00.0
14
The GPU emits some warning messages related to infoROM.
The return values for the nvidia-smi command are listed below.
RETURN VALUE
Return code reflects whether the operation succeeded or failed and what was the reason of failure.
· Return code 0 - Success
· Return code 2 - A supplied argument or flag is invalid
· Return code 3 - The requested operation is not available on target device
· Return code 4 - The current user does not have permission to access this device or perform this operation
· Return code 6 - A query to find an object was unsuccessful
· Return code 8 - A device's external power cables are not properly attached
· Return code 9 - NVIDIA driver is not loaded
· Return code 10 - NVIDIA Kernel detected an interrupt issue with a GPU
· Return code 12 - NVML Shared Library couldn't be found or loaded
· Return code 13 - Local version of NVML doesn't implement this function
· Return code 14 - infoROM is corrupted
· Return code 15 - The GPU has fallen off the bus or has otherwise become inaccessible
· Return code 255 - Other error or internal driver error occurred
Root Cause
The nvidia-smi command should return a success code (return code 0) for the driver-validator container to pass and for the GPU Operator to successfully deploy the driver pod on the node.
Action
Replace the faulty GPU.
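Before replacing the GPU, you can confirm the return code yourself by running nvidia-smi and printing its exit status. The pod name below is a placeholder from the earlier examples; if the driver is pre-installed, run the same shell snippet directly on the host.
# Run nvidia-smi and print its return code (14 indicates a corrupted infoROM)
kubectl exec -n gpu-operator -it nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- sh -c 'nvidia-smi; echo "return code: $?"'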
EFI + Secure Boot#
Issue
GPU Driver pod fails to deploy.
Root Cause
EFI Secure Boot is currently not supported with the GPU Operator.
Action
Disable EFI Secure Boot on the server.
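To check whether Secure Boot is currently enabled on a node before changing the firmware settings, you can use mokutil, assuming it is installed on the host.
# Reports "SecureBoot enabled" or "SecureBoot disabled"
sudo mokutil --sb-state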
File an issue#
If you are facing a gpu-operator and/or operand(s) issue that is not documented in this guide, you can run the must-gather utility to prepare a bug report.
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
This utility collects relevant information from your cluster that is needed for diagnosing and debugging issues. The final output is an archive file that contains the manifests and logs of all the components managed by the gpu-operator.