Troubleshooting#
Use this page when Confidential Containers installation or workload deployment steps fail.
Refer to the NVIDIA GPU Operator troubleshooting guide for general operator issues such as driver daemonsets, the container toolkit, and validator pods. The sections below cover Confidential Containers-specific deploy failures: CC node labels, Kata runtime installation, and host prerequisites.
If these steps do not resolve your issue, refer to Getting Help.
View GPU Operator Logs#
Get the list of GPU Operator pods:
$ kubectl get pods -n gpu-operator
Example Output:
NAME                                                              READY   STATUS    RESTARTS   AGE
gpu-operator-1766001809-node-feature-discovery-gc-75776475sxzkp   1/1     Running   0          86s
gpu-operator-1766001809-node-feature-discovery-master-6869lxq2g   1/1     Running   0          86s
gpu-operator-1766001809-node-feature-discovery-worker-mh4cv       1/1     Running   0          86s
gpu-operator-f48fd66b-vtfrl                                       1/1     Running   0          86s
nvidia-cc-manager-7z74t                                           1/1     Running   0          61s
nvidia-kata-sandbox-device-plugin-daemonset-d5rvg                 1/1     Running   0          30s
nvidia-sandbox-validator-6xnzc                                    1/1     Running   0          30s
nvidia-vfio-manager-h229x                                         1/1     Running   0          62s
Get specific logs for a pod:
$ kubectl logs -n gpu-operator <pod-name>
Replace <pod-name> with the name of the GPU Operator pod from kubectl get pods -n gpu-operator.
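When the pod list is long, it can help to narrow it to the pods worth inspecting first. The following is a minimal sketch, not part of the GPU Operator tooling: the helper name not_ready_pods is hypothetical, and it assumes the default kubectl get pods column layout (NAME, READY, STATUS). Pods that finished normally (for example, with a Completed status) are also reported, because their STATUS is not Running.

```shell
# Hypothetical helper: read `kubectl get pods` table output on stdin and
# print the name of every pod that is not Running or not fully ready.
not_ready_pods() {
  awk 'NR > 1 {
    # READY looks like "1/1": ready containers / total containers.
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}
```

After defining the function in your shell, pipe the pod list into it and pass each reported name to kubectl logs:
$ kubectl get pods -n gpu-operator | not_ready_pods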
View Kata Containers Logs#
Get the list of Kata Containers pods:
$ kubectl get pods -n kata-system
Example Output:
NAME                     READY   STATUS    RESTARTS   AGE
kata-deploy-<pod-name>   1/1     Running   0          6m37s
View the logs for the Kata Containers pod:
$ kubectl logs -n kata-system <pod-name>
Replace <pod-name> with the name of the Kata Containers pod from kubectl get pods -n kata-system.
nvidia.com/cc.mode.state Not Matching nvidia.com/cc.mode#
When changing the Confidential Computing mode (refer to Managing the Confidential Computing Mode), the Confidential Computing Manager updates the nvidia.com/cc.mode.state label to reflect the current state of the Confidential Computing mode.
If nvidia.com/cc.mode.state does not match the desired CC mode set in nvidia.com/cc.mode (on, off, or ppcie), the mode change is still in progress.
Wait a few minutes, then check the labels again, with NODE_NAME set to the name of the node:
$ kubectl get node $NODE_NAME -o json | \
jq '.metadata.labels | with_entries(select(.key | startswith("nvidia.com/cc")))'
Example Output:
{
  "nvidia.com/cc.mode": "on",
  "nvidia.com/cc.mode.state": "on",
  "nvidia.com/cc.ready.state": "true"
}
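The comparison between the two labels can be scripted. The following is a minimal sketch under stated assumptions: the function name cc_mode_status and the result strings in-sync, in-progress, and failed are hypothetical and chosen for illustration. Only the label names and the failed state value come from this page.

```shell
# Hypothetical helper: classify a CC mode transition from two label values.
#   $1 = value of nvidia.com/cc.mode        (desired mode)
#   $2 = value of nvidia.com/cc.mode.state  (state reported by the CC manager)
cc_mode_status() {
  desired=$1
  state=$2
  if [ "$state" = "failed" ]; then
    echo "failed"
  elif [ -n "$desired" ] && [ "$desired" = "$state" ]; then
    echo "in-sync"
  else
    # Labels differ, or one of them is not set yet.
    echo "in-progress"
  fi
}
```

One way to feed it, assuming NODE_NAME is set, is to read each label with a kubectl JSONPath expression (dots inside the label key are escaped with a backslash):
$ desired=$(kubectl get node $NODE_NAME -o jsonpath='{.metadata.labels.nvidia\.com/cc\.mode}')
$ state=$(kubectl get node $NODE_NAME -o jsonpath='{.metadata.labels.nvidia\.com/cc\.mode\.state}')
$ cc_mode_status "$desired" "$state"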
nvidia.com/cc.mode.state is failed#
When nvidia.com/cc.mode.state is failed, the Confidential Computing mode update on the GPU did not succeed.
Checks:
Confirm no user workloads are running on the node before changing CC mode. List pods scheduled on the node:
$ export NODE_NAME="<node-name>"
$ kubectl get pods -A --field-selector spec.nodeName=$NODE_NAME -o wide
This lists pods on $NODE_NAME. kube-system DaemonSets (for example CNI or kube-proxy) are expected on every worker node. gpu-operator and kata-system pods are expected only if this node is configured for Confidential Containers (labeled nvidia.com/gpu.workload.config=vm-passthrough or cluster-wide sandboxWorkloads.defaultWorkload=vm-passthrough). Delete or reschedule any other Running pods (especially GPU workloads) before changing CC mode.
View nvidia-cc-manager pod logs:
$ kubectl logs -n gpu-operator nvidia-cc-manager-<pod-name>
Replace nvidia-cc-manager-<pod-name> with the full name of the nvidia-cc-manager pod from kubectl get pods -n gpu-operator.
Confirm hardware virtualization and ACS are enabled in the host BIOS. One way to check hardware virtualization is to look for vmx (Intel) or svm (AMD) in /proc/cpuinfo. For ACS, coordinate with your Hardware IT Administrator if needed.
Re-apply the desired mode label to retry the transition:
$ kubectl label node $NODE_NAME nvidia.com/cc.mode=on --overwrite
For mode configuration options, refer to Managing the Confidential Computing Mode.
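The workload check and the virtualization check above can be scripted as two small filters. This is a sketch, not NVIDIA tooling: the function names user_pods and virt_flag are hypothetical, user_pods assumes the column layout of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS), and it treats the kube-system, gpu-operator, and kata-system namespaces as expected, following the description above. virt_flag covers only hardware virtualization; it cannot confirm ACS.

```shell
# Hypothetical helper: read `kubectl get pods -A` output on stdin and print
# namespace/name for Running pods outside the namespaces expected on a
# Confidential Containers node.
user_pods() {
  awk 'NR > 1 && $4 == "Running" &&
       $1 != "kube-system" && $1 != "gpu-operator" && $1 != "kata-system" {
    print $1 "/" $2
  }'
}

# Hypothetical helper: read /proc/cpuinfo content on stdin and print the
# first hardware virtualization flag found (vmx or svm), or nothing.
virt_flag() {
  awk '/^flags/ {
    for (i = 1; i <= NF; i++)
      if ($i == "vmx" || $i == "svm") { print $i; exit }
  }'
}
```

Example usage, assuming NODE_NAME is set for the first command and the second runs on the worker host:
$ kubectl get pods -A --field-selector spec.nodeName=$NODE_NAME | user_pods
$ virt_flag < /proc/cpuinfo
Empty output from user_pods suggests no unexpected workloads are running. Empty output from virt_flag suggests hardware virtualization is disabled or not exposed to the host.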
Pod Stuck in ContainerCreating with device cold plug failed error#
If kubectl describe pod <pod-name> -n <namespace> shows the following error and the pod is stuck in the ContainerCreating state, the KubeletPodResourcesGet feature gate is not enabled on the worker node.
Refer to the Kubelet Configuration section in Prerequisites for more information on setting the feature gate.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 19s (x16 over 34s) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox "d0a43b5d3c6c433f011efbfacb6de3f7ac448f3d09a272cef8d43249712b12b1": failed to create containerd task: failed to create shim task: device cold plug failed: cold plug: GetPodResources failed for pod(cuda-vectoradd-kata) in namespace(default): rpc error: code = Unknown desc = PodResources API Get method disabled
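To see whether the feature gate is explicitly set on a node, one option is to inspect the kubelet's running configuration. The sketch below makes several assumptions: the function name pod_resources_get_enabled is hypothetical, the kubelet configz endpoint is reachable through the API server node proxy, and your account is permitted to use it. If the gate is absent from the output, it is not set explicitly and the effective value depends on the default for your Kubernetes version.

```shell
# Hypothetical helper: read kubelet configuration JSON on stdin and succeed
# (exit status 0) only if KubeletPodResourcesGet is explicitly set to true.
pod_resources_get_enabled() {
  grep -Eq '"KubeletPodResourcesGet"[[:space:]]*:[[:space:]]*true'
}
```

Example usage, assuming NODE_NAME is set:
$ kubectl get --raw "/api/v1/nodes/$NODE_NAME/proxy/configz" | pod_resources_get_enabled && echo "explicitly enabled"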
Pod Stuck in Pending State with Insufficient nvidia.com/pgpu Error#
If kubectl describe pod <pod-name> -n <namespace> shows the pod stuck in the Pending state, the scheduler cannot place the pod on a node with available passthrough GPU capacity.
Events:
Type Reason Age From Message
---- ------ --- ---- -------
Warning FailedScheduling ... default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/pgpu.
Common causes:
The worker node is not configured for Confidential Containers workloads.
GPU Operator Confidential Containers operands are missing or not Running on the worker node.
nvidia.com/pgpu capacity on the node is zero because GPUs are not bound to vfio-pci on the host.
All passthrough GPUs on eligible nodes are already allocated to other pods.
Resolution:
Confirm GPU Operator operands are Running on the worker node:
$ kubectl get pods -n gpu-operator -o wide --field-selector spec.nodeName=<node-name>
Expected Confidential Containers operands include nvidia-cc-manager, nvidia-vfio-manager, nvidia-kata-sandbox-device-plugin, and nvidia-sandbox-validator. If an operand is not Running, refer to View GPU Operator Logs.
Confirm the node is configured for Confidential Containers workloads:
$ kubectl describe node <node-name> | grep nvidia.com/gpu.workload.config
Example Output:
nvidia.com/gpu.workload.config: vm-passthrough
If the label is missing, add it:
$ kubectl label node <node-name> nvidia.com/gpu.workload.config=vm-passthrough
If you set the cluster-wide default during installation instead of per-node labeling, confirm sandboxWorkloads.defaultWorkload is vm-passthrough. Refer to Common GPU Operator Configuration Settings in Detailed Install Guide.
Check nvidia.com/pgpu capacity on the node:
$ kubectl describe node <node-name> | grep nvidia.com/pgpu
Example Output:
nvidia.com/pgpu: 8
nvidia.com/pgpu: 8
If capacity and allocatable are zero, GPUs are not available for scheduling. On the worker host, confirm VFIO binding:
$ lspci -nnk -d 10de:
Example Output (expected):
65:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:xxxx] (rev a1)
        Kernel driver in use: vfio-pci
If the output shows Kernel driver in use: nvidia or nouveau, remove host drivers as described in Ensure No Host NVIDIA GPU Drivers Are Present. Confirm IOMMU is enabled:
$ ls /sys/kernel/iommu_groups
If the directory is empty or missing, configure IOMMU as described in Prerequisites, then reboot the host. Review the nvidia-vfio-manager pod logs on the affected node as described in View GPU Operator Logs. After fixing host prerequisites, wait for operand pods to reconcile and confirm nvidia.com/pgpu is non-zero.
If the node shows non-zero nvidia.com/pgpu capacity but the pod is still Pending, all GPUs may be in use. Check allocatable capacity and running workloads on the node.
Refer to the optional VFIO validation step in Detailed Install Guide.
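On hosts with several GPUs, scanning the lspci output by eye is error-prone. The following is a minimal sketch of a filter that reports only the devices not bound to vfio-pci. The function name non_vfio_gpus is hypothetical, and the parsing assumes the usual lspci -nnk layout, where each device line starts at the first column and its detail lines are indented. The 10de: vendor filter matches every NVIDIA PCI device, so non-GPU devices from the same vendor can also appear.

```shell
# Hypothetical helper: read `lspci -nnk -d 10de:` output on stdin and print
# "<pci-address> <driver>" for each device whose kernel driver is not
# vfio-pci. Devices with no driver bound are reported with "none".
non_vfio_gpus() {
  awk '
    # Device lines start with the PCI address; detail lines are indented.
    /^[0-9a-f]/ { dev = $1; order[++n] = dev }
    /Kernel driver in use:/ { driver[dev] = $NF }
    END {
      for (i = 1; i <= n; i++) {
        d = driver[order[i]]
        if (d != "vfio-pci") print order[i], (d == "" ? "none" : d)
      }
    }
  '
}
```

Example usage on the worker host:
$ lspci -nnk -d 10de: | non_vfio_gpus
No output means every listed device reports vfio-pci as its kernel driver.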
Getting Help#
If the steps on this page do not resolve your issue, use the resources below based on which component is failing.
NVIDIA GPU Operator and Confidential Computing Operands#
For issues with GPU Operator pods or Confidential Containers operands (nvidia-cc-manager, nvidia-vfio-manager, nvidia-kata-sandbox-device-plugin, and nvidia-sandbox-validator):
Review the NVIDIA GPU Operator troubleshooting guide.
If the issue is not documented there, run the GPU Operator must-gather utility to collect cluster diagnostics:
$ curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
$ chmod +x must-gather.sh
$ ./must-gather.sh
The utility produces an archive with manifests and logs from GPU Operator-managed components.
Prepare a bug report and file an issue in the NVIDIA GPU Operator GitHub repository.
Kata Containers#
For issues with kata-deploy, missing runtime classes, or Kata runtime failures:
Search the Kata Containers GitHub issues for similar reports.
If no existing issue matches your problem, open a new issue in that repository.
Include your environment details, Kata chart version, kata-deploy pod logs, and cluster configuration.
Attestation and Upstream Confidential Containers#
For attestation, Trustee, sealed secrets, or other upstream Confidential Containers features, refer to the Confidential Containers documentation and the Confidential Containers GitHub repository.
For NVIDIA Confidential Computing licensing requirements, refer to Licensing.