Troubleshooting the NVIDIA GPU Operator#
GPU Operator Validator: Failed to Create Pod Sandbox#
Issue
On some occasions, the driver container is unable to unload the nouveau Linux kernel module.
Observation
- Running kubectl describe pod -n gpu-operator -l app=nvidia-operator-validator includes the following event:

  Events:
    Type     Reason                  Age                 From     Message
    ----     ------                  ----                ----     -------
    Warning  FailedCreatePodSandBox  8s (x21 over 9m2s)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
- Running one of the following commands on the node indicates that the nouveau Linux kernel module is loaded:

  $ lsmod | grep -i nouveau
  $ dmesg | grep -i nouveau
  $ journalctl -xb | grep -i nouveau
Root Cause
The nouveau Linux kernel module is loaded and the driver container is unable to unload it.
While the nouveau module remains loaded, the driver container cannot load the nvidia kernel module.
Action
On each node, run the following commands to prevent loading the nouveau Linux kernel module on boot:
$ sudo tee /etc/modules-load.d/ipmi.conf <<< "ipmi_msghandler" \
    && sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<< "blacklist nouveau" \
    && sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf <<< "options nouveau modeset=0"
$ sudo update-initramfs -u
$ sudo init 6
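The update-initramfs command applies to Debian and Ubuntu nodes. On RHEL-family nodes, a rough equivalent, assuming dracut manages the initramfs, is the following:

$ sudo dracut --force    # rebuild the initramfs so the nouveau blacklist takes effect
$ sudo reboot

After the node reboots, rerunning lsmod | grep -i nouveau should produce no output, confirming that the module is no longer loaded.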
No GPU Driver or Operand Pods Running#
Issue
On some clusters, taints are applied to nodes with a taint effect of NoSchedule.
Observation
- Running kubectl get ds -n gpu-operator shows 0 for DESIRED, CURRENT, READY, and so on:

  NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
  gpu-feature-discovery   0         0         0       0            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   11m
  ...
Root Cause
The NoSchedule taint prevents the Operator from deploying the GPU Driver and other Operand pods.
Action
Describe each node, identify the taints, and either remove the taints from the nodes or add the taints as tolerations to the daemon sets.
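For example, the following sketch shows how to inspect the taints on a node and then either remove a taint or tolerate it in the operand daemon sets through the Helm chart. The node name, the taint key and value, and the daemonsets.tolerations setting are placeholders and assumptions; adjust them to your cluster and chart version:

# Inspect the taints on a node (node name is a placeholder).
$ kubectl describe node node-1 | grep -A3 Taints

# Option 1: remove the taint from the node (the trailing "-" removes it).
$ kubectl taint nodes node-1 example-key=example-value:NoSchedule-

# Option 2: tolerate the taint in all operand daemon sets through the Helm chart
# (assumes the chart exposes daemonsets.tolerations and Helm 3.10 or later for --set-json).
$ helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator \
    --set-json 'daemonsets.tolerations=[{"key":"example-key","operator":"Exists","effect":"NoSchedule"}]'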
GPU Operator Pods Stuck in Crash Loop#
Issue
On large clusters, such as those with 300 or more nodes, the GPU Operator pods can get stuck in a crash loop.
Observation
- The GPU Operator pod is not running:

  $ kubectl get pod -n gpu-operator -l app=gpu-operator

  Example Output

  NAME                            READY   STATUS             RESTARTS      AGE
  gpu-operator-568c7ff7f6-chg5b   0/1     CrashLoopBackOff   4 (85s ago)   4m42s
- The node that is running the GPU Operator pod has sufficient resources and the node is Ready:

  $ kubectl describe node <node-name>

  Example Output

  Conditions:
    Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
    ----             ------  -----------------                 ------------------                ------                       -------
    MemoryPressure   False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
    DiskPressure     False   Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:03 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
    PIDPressure      False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
    Ready            True    Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:13 +0000   KubeletReady                 kubelet is posting ready status
Root Cause
The memory resource limit for the GPU Operator is too low for the cluster size.
Action
Increase the memory request and limit for the GPU Operator pod:
- Set the memory request to a value that matches the average memory consumption over a large time window.
- Set the memory limit to match the spikes in memory consumption that occur occasionally. 
- Increase the memory resource limit for the GPU Operator pod:

  $ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
      -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/limits/memory", "value":"1400Mi"}]'
- Optional: Increase the memory resource request for the pod:

  $ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
      -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/requests/memory", "value":"600Mi"}]'
Monitor the GPU Operator pod. Increase the memory request and limit again if the pod remains stuck in a crash loop.
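One way to size the request and limit, assuming the metrics-server is installed in the cluster, is to watch the pod's actual memory consumption and to check whether earlier restarts were out-of-memory kills:

# Observe the operator pod's memory usage over time (requires metrics-server).
$ kubectl top pod -n gpu-operator -l app=gpu-operator

# Check whether previous restarts were terminated with an OOMKilled reason.
$ kubectl describe pod -n gpu-operator -l app=gpu-operator | grep -i -A 3 "last state"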
infoROM is corrupted (nvidia-smi return code 14)#
Issue
The nvidia-operator-validator pod fails and the nvidia-driver-daemonset pods fail as well.
Observation
The output from the driver validation container indicates that the infoROM is corrupt:
$ kubectl logs -n gpu-operator nvidia-operator-validator-xxxxx -c driver-validation
Example Output
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   42C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:0B:00.0
14
The warning indicates that the infoROM on the GPU is corrupt, and the trailing 14 is the return code from the nvidia-smi command.
The return values for the nvidia-smi command are listed below.
RETURN VALUE

The return code reflects whether the operation succeeded or failed and the reason for any failure.

- Return code 0 - Success
- Return code 2 - A supplied argument or flag is invalid
- Return code 3 - The requested operation is not available on the target device
- Return code 4 - The current user does not have permission to access this device or perform this operation
- Return code 6 - A query to find an object was unsuccessful
- Return code 8 - A device's external power cables are not properly attached
- Return code 9 - NVIDIA driver is not loaded
- Return code 10 - NVIDIA Kernel detected an interrupt issue with a GPU
- Return code 12 - NVML Shared Library couldn't be found or loaded
- Return code 13 - Local version of NVML doesn't implement this function
- Return code 14 - infoROM is corrupted
- Return code 15 - The GPU has fallen off the bus or has otherwise become inaccessible
- Return code 255 - Other error or internal driver error occurred
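To confirm the return code directly, one option is to run nvidia-smi inside the driver container on the affected node and inspect its exit status. The pod name below is a placeholder, and the container name nvidia-driver-ctr is an assumption that can differ by Operator version:

$ kubectl exec -n gpu-operator nvidia-driver-daemonset-xxxxx -c nvidia-driver-ctr -- nvidia-smi
$ echo $?    # an exit status of 14 confirms the corrupted infoROM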
Root Cause
The nvidia-smi command must return a success code (return code 0) for the driver-validation container to pass and for the GPU Operator to successfully deploy the driver pod on the node.
Action
Replace the faulty GPU.
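Before the GPU is replaced, the node is typically cordoned and drained so that workloads are rescheduled elsewhere; a minimal sketch, assuming the node is named node-1:

# Mark the node unschedulable and evict its workloads before the hardware swap.
$ kubectl cordon node-1
$ kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

# After the GPU is replaced and the node rejoins the cluster, make it schedulable again.
$ kubectl uncordon node-1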
EFI + Secure Boot#
Issue
GPU Driver pod fails to deploy.
Root Cause
EFI Secure Boot is currently not supported with the GPU Operator.
Action
Disable EFI Secure Boot on the server.
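To confirm whether Secure Boot is enabled on a node before changing firmware settings, one option, assuming the mokutil utility is installed, is the following:

# Query the Secure Boot state; prints "SecureBoot enabled" or "SecureBoot disabled".
$ sudo mokutil --sb-state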