Troubleshooting#

Use this page when Confidential Containers installation or workload deployment steps fail.

Refer to the NVIDIA GPU Operator troubleshooting guide for general operator issues such as driver daemonsets, the container toolkit, and validator pods. The sections below cover Confidential Containers-specific deploy failures: CC node labels, Kata runtime installation, and host prerequisites.

If these steps do not resolve your issue, refer to Getting Help.

| Symptom | Cause | Fix |
|---|---|---|
| Pod stuck in `Pending` with `Insufficient nvidia.com/pgpu` | Node not labeled for Confidential Containers, GPUs not bound to `vfio-pci`, or all GPUs already allocated. | Check that the operands are running, verify the `vm-passthrough` label, and confirm `vfio-pci` binding. |
| Pod stuck in `ContainerCreating` with `device cold plug failed` | `KubeletPodResourcesGet` feature gate not enabled on the worker node. | Enable the feature gate as described in Prerequisites. |
| `nvidia.com/cc.mode.state` not matching `nvidia.com/cc.mode` | Mode transition still in progress, or blocked by a workload using a GPU on the node. | Wait 1-2 minutes, then check the cc.mode labels. If GPU Operator pods, specifically `nvidia-vfio-manager`, are stuck in `Terminating`, make sure that no user workloads are running on the node. |
| `nvidia.com/cc.mode.state` is `failed` | Workload running during mode change, or BIOS/ACS misconfigured. | Drain workloads, re-apply the mode label, and confirm ACS is enabled in the BIOS. |

View GPU Operator Logs#

Use this section to collect GPU Operator pod logs when an operand is not running or reporting errors.

Get the list of GPU Operator pods:

$ kubectl get pods -n gpu-operator

Example Output:

NAME                                                              READY   STATUS    RESTARTS   AGE
gpu-operator-1766001809-node-feature-discovery-gc-75776475sxzkp   1/1     Running   0          86s
gpu-operator-1766001809-node-feature-discovery-master-6869lxq2g   1/1     Running   0          86s
gpu-operator-1766001809-node-feature-discovery-worker-mh4cv       1/1     Running   0          86s
gpu-operator-f48fd66b-vtfrl                                       1/1     Running   0          86s
nvidia-cc-manager-7z74t                                           1/1     Running   0          61s
nvidia-kata-sandbox-device-plugin-daemonset-d5rvg                 1/1     Running   0          30s
nvidia-sandbox-validator-6xnzc                                    1/1     Running   0          30s
nvidia-vfio-manager-h229x                                         1/1     Running   0          62s

Get specific logs for a pod:
```
$ kubectl logs -n gpu-operator <pod-name>
```
Replace `<pod-name>` with the name of a GPU Operator pod from the `kubectl get pods -n gpu-operator` output. Logs from the `nvidia-kata-sandbox-device-plugin-daemonset`, `nvidia-cc-manager`, and `nvidia-vfio-manager` pods are usually the most useful for troubleshooting Confidential Containers issues.
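
To collect logs from all three of those operands in one pass, a small filter over the pod list can help. The following is a minimal sketch, not part of the GPU Operator tooling: `cc_operand_pods` is a hypothetical helper name, and it assumes the pod name prefixes shown in the example output above.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print the
# names of the Confidential Containers operand pods listed above.
cc_operand_pods() {
  awk '$1 ~ /^nvidia-(cc-manager|vfio-manager|kata-sandbox-device-plugin)/ { print $1 }'
}

# Example usage (requires cluster access, so it is left commented out):
# kubectl get pods -n gpu-operator --no-headers | cc_operand_pods |
#   while read -r pod; do
#     kubectl logs -n gpu-operator "$pod" > "$pod.log"
#   done
```

Each matching pod's log is written to a file named after the pod, which is convenient to attach to a bug report.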

View Kata Containers Logs#

Use this section to collect Kata Containers logs and confirm runtime classes are installed.

Confirm the expected runtime classes are registered:
```
$ kubectl get runtimeclass
```
After a successful Kata Containers deployment, you should see kata-qemu-nvidia-gpu-snp (AMD SEV-SNP) and kata-qemu-nvidia-gpu-tdx (Intel TDX) in the output. If these are missing, continue to the next steps.
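
The runtime class check above can be scripted so that only the missing classes are reported. This is a sketch under the assumption that both runtime classes named above are expected on your cluster; `missing_runtimeclasses` is a hypothetical helper name.

```shell
# Hypothetical helper: read RuntimeClass names (one per line) on stdin and
# print each expected Kata runtime class that is absent.
missing_runtimeclasses() {
  found=$(cat)
  for rc in kata-qemu-nvidia-gpu-snp kata-qemu-nvidia-gpu-tdx; do
    printf '%s\n' "$found" | grep -qxF -- "$rc" || printf '%s\n' "$rc"
  done
}

# Example usage (requires cluster access, so it is left commented out):
# kubectl get runtimeclass --no-headers -o custom-columns=NAME:.metadata.name |
#   missing_runtimeclasses
```

Empty output means both runtime classes are registered.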

Get the list of Kata Containers pods:

$ kubectl get pods -n kata-system

Example Output:

NAME                       READY   STATUS    RESTARTS   AGE
kata-deploy-<pod-name>       1/1     Running   0          6m37s

View the logs for the Kata Containers pod:
```
$ kubectl logs -n kata-system <pod-name>
```
Replace <pod-name> with the name of the Kata Containers pod.

Example Output:
```
Install completed
daemonset mode: waiting for SIGTERM
```

If the logs show errors or runtime classes are still missing after a successful Helm deploy, collect the log output and refer to Getting Help for Kata-specific troubleshooting resources.

`nvidia.com/cc.mode.state` Not Matching `nvidia.com/cc.mode`#

When you change the Confidential Computing mode (refer to Managing the Confidential Computing Mode), the Confidential Computing Manager updates the `nvidia.com/cc.mode.state` label to reflect the current state of the Confidential Computing mode. If `nvidia.com/cc.mode.state` does not match the desired CC mode in `nvidia.com/cc.mode` (`on`, `off`, or `ppcie`), one of the following is likely:

The GPU is still updating to the Confidential Computing mode. Wait a few more minutes, then check the labels again.
The transition is blocked by a user workload with a resource claim for a GPU on the node. This is usually accompanied by the VFIO manager stuck in the Terminating state or by CC Manager logs showing that the mode transition is still in progress. Remove the workload to unblock the mode transition.

Checks:

Set the NODE_NAME environment variable to the node you want to check:

$ export NODE_NAME="<node-name>"

Check the cc.mode labels:

$ kubectl get node $NODE_NAME -o json | \
      jq '.metadata.labels | with_entries(select(.key | startswith("nvidia.com/cc")))'

Example Output:

{
   "nvidia.com/cc.mode": "on",
   "nvidia.com/cc.mode.state": "off",
   "nvidia.com/cc.ready.state": "false"
}
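
The comparison between the two labels can be expressed as a small classifier. This is a minimal sketch; `cc_mode_status` and its three result strings are hypothetical names chosen for illustration, not values reported by the GPU Operator.

```shell
# Hypothetical helper: classify the desired mode (nvidia.com/cc.mode) against
# the reported state (nvidia.com/cc.mode.state).
cc_mode_status() {
  mode=$1
  state=$2
  if [ "$state" = "failed" ]; then
    echo "failed"      # see the section on cc.mode.state being failed
  elif [ "$mode" = "$state" ]; then
    echo "in-sync"     # transition complete
  else
    echo "pending"     # still transitioning, or blocked by a workload
  fi
}

# Example usage (requires cluster access and jq, so it is left commented out):
# labels=$(kubectl get node "$NODE_NAME" -o json)
# cc_mode_status \
#   "$(printf '%s' "$labels" | jq -r '.metadata.labels["nvidia.com/cc.mode"]')" \
#   "$(printf '%s' "$labels" | jq -r '.metadata.labels["nvidia.com/cc.mode.state"]')"
```

For the example output above (`on` and `off`), the helper prints `pending`.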

Get the list of GPU Operator pods:

$ kubectl get pods -n gpu-operator

Example Output:

NAME                                                          READY   STATUS        RESTARTS      AGE
gpu-operator-6474ddf79d-s4gcb                                 1/1     Running       0             10m
gpu-operator-node-feature-discovery-gc-8fb8d5d8d-mvvfz        1/1     Running       0             10m
gpu-operator-node-feature-discovery-master-5bbc6d887b-66wrs   1/1     Running       0             10m
gpu-operator-node-feature-discovery-worker-9xbb8              1/1     Running       0             10m
nvidia-cc-manager-tqdc6                                       1/1     Running       0             10m
nvidia-kata-sandbox-device-plugin-daemonset-b5tp7             1/1     Running       0             10m
nvidia-vfio-manager-b7jtv                                     0/1     Terminating   0             10m

Get the logs for the nvidia-cc-manager pod:
```
$ kubectl logs -n gpu-operator nvidia-cc-manager-<pod-name>
```
Replace <pod-name> with the name of the nvidia-cc-manager pod from kubectl get pods -n gpu-operator.

Example Output:
```
2026-06-18 19:47:26,095 - k8s-cc-manager - INFO - Resetting 2 GPU(s) to apply CC mode
2026-06-18 19:47:26,095 - k8s-cc-manager - INFO - Resetting GPU 0000:0d:00.0
2026-06-18 19:47:26,475 - k8s-cc-manager - INFO - Resetting GPU 0000:b5:00.0
```
If the CC Manager logs show that the mode transition is still in progress, as in the example above, the GPU is still updating to the Confidential Computing mode. When the mode transition is complete, the CC Manager logs show `Successfully set CC mode to '<off|on|ppcie>' on all GPUs`.
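
Searching the logs for the completion message can be scripted. This is a sketch that assumes the message text quoted above; `cc_transition_complete` is a hypothetical helper name. It also matches a completion message from an earlier transition if one is still in the log, so treat a match as a hint and confirm with the node labels.

```shell
# Hypothetical helper: succeed (exit 0) if the log text on stdin contains the
# completion message quoted above, otherwise fail (exit 1).
cc_transition_complete() {
  grep -q "Successfully set CC mode to '.*' on all GPUs"
}

# Example usage (requires cluster access, so it is left commented out):
# kubectl logs -n gpu-operator nvidia-cc-manager-<pod-name> | cc_transition_complete \
#   && echo "mode transition complete" \
#   || echo "mode transition not confirmed in logs"
```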

Check which pods have a passthrough GPU allocated on the node:

$ kubectl get pods -A --field-selector spec.nodeName=$NODE_NAME -o json | \
    jq -r '.items[] | select(any(.spec.containers[]; .resources.requests["nvidia.com/pgpu"] // empty)) | "\(.metadata.namespace)/\(.metadata.name)"'

Example Output:

default/cuda-vectoradd-kata

Any pods returned have a passthrough GPU allocated and are blocking the mode transition.

Delete the pod(s) to unblock the mode transition.
```
$ kubectl delete pod <pod-name> -n <namespace>
```
Example Output:
```
pod "<pod-name>" deleted
```
Repeat this step for each pod that is blocking the mode transition. Once all user pods are deleted, the mode transition will resume automatically.

Confirm the mode transition is complete:

$ kubectl get pods -n gpu-operator

Example Output:

NAME                                                          READY   STATUS    RESTARTS   AGE
gpu-operator-6474ddf79d-s4gcb                                 1/1     Running   0          10m
gpu-operator-node-feature-discovery-gc-8fb8d5d8d-mvvfz        1/1     Running   0          10m
gpu-operator-node-feature-discovery-master-5bbc6d887b-66wrs   1/1     Running   0          10m
gpu-operator-node-feature-discovery-worker-9xbb8              1/1     Running   0          10m
nvidia-cc-manager-tqdc6                                       1/1     Running   0          10m
nvidia-kata-sandbox-device-plugin-daemonset-b5tp7             1/1     Running   0          10m
nvidia-vfio-manager-b7jtv                                     1/1     Running   0          1m

`nvidia.com/cc.mode.state` is `failed`#

A `nvidia.com/cc.mode.state` value of `failed` means there was a problem updating the Confidential Computing mode on the GPU. This typically indicates that your GPU does not support Confidential Computing mode or that there is a hardware error. You may need to contact your Hardware IT Administrator to confirm that the GPU is supported and the hardware is functioning correctly. Refer to the Prerequisites for more information on required hardware.

Checks:

Set the NODE_NAME environment variable to the node you want to check:

$ export NODE_NAME="<node-name>"

Confirm no user workloads are running on the node before changing CC mode. List pods scheduled on the node:
```
$ kubectl get pods -A --field-selector spec.nodeName=$NODE_NAME -o wide
```
This lists pods on $NODE_NAME. kube-system DaemonSets (for example CNI or kube-proxy) are expected on every worker node. gpu-operator and kata-system pods are expected only if this node is configured for Confidential Containers (labeled nvidia.com/gpu.workload.config=vm-passthrough or cluster-wide sandboxWorkloads.defaultWorkload=vm-passthrough). Delete or reschedule any other Running pods (especially GPU workloads) before changing CC mode.
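
To narrow the pod list to the ones that need attention, the expected namespaces can be filtered out. This is a simplified sketch: `unexpected_running_pods` is a hypothetical helper name, and it treats every pod in `kube-system`, `gpu-operator`, and `kata-system` as expected, which may be broader than what applies to your cluster.

```shell
# Hypothetical helper: read `kubectl get pods -A --no-headers` output on stdin
# and print namespace/name for Running pods outside the expected namespaces.
# Columns assumed: NAMESPACE NAME READY STATUS ...
unexpected_running_pods() {
  awk '$1 != "kube-system" && $1 != "gpu-operator" && $1 != "kata-system" \
       && $4 == "Running" { print $1 "/" $2 }'
}

# Example usage (requires cluster access, so it is left commented out):
# kubectl get pods -A --field-selector spec.nodeName=$NODE_NAME --no-headers |
#   unexpected_running_pods
```
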
View nvidia-cc-manager pod logs:
```
$ kubectl logs -n gpu-operator nvidia-cc-manager-<pod-name>
```
Replace <pod-name> with the name of the nvidia-cc-manager pod from kubectl get pods -n gpu-operator.
Confirm hardware virtualization and ACS are enabled in the host BIOS. One way to check hardware virtualization is to look for the vmx (Intel) or svm (AMD) flag in /proc/cpuinfo on the host. This check does not cover ACS; for ACS, coordinate with your Hardware IT Administrator if needed.
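
The /proc/cpuinfo check can be run as a small function on the worker host. This is a minimal sketch; `cpu_virt_flag` is a hypothetical helper name, and the optional file argument exists only so the function can be exercised against a saved copy of cpuinfo. It reports what the kernel exposes and says nothing about ACS.

```shell
# Hypothetical helper: print which hardware virtualization flag appears in a
# cpuinfo file (default: /proc/cpuinfo): vmx (Intel), svm (AMD), or none.
cpu_virt_flag() {
  file=${1:-/proc/cpuinfo}
  if grep -qw vmx "$file"; then
    echo "vmx"
  elif grep -qw svm "$file"; then
    echo "svm"
  else
    echo "none"
  fi
}

# Example usage on the worker host:
# cpu_virt_flag
```

A result of `none` suggests that virtualization is disabled in the BIOS or not exposed to this host.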

Re-apply the desired mode label to retry the transition. This example uses `on`; substitute `off` or `ppcie` as needed:

$ kubectl label node $NODE_NAME nvidia.com/cc.mode=on --overwrite

Confirm the mode transition is complete by checking the CC mode labels:
```
$ kubectl get node $NODE_NAME -o json | \
      jq '.metadata.labels | with_entries(select(.key | startswith("nvidia.com/cc")))'
```
The nvidia.com/cc.mode.state label should match the desired mode when the transition is complete. If the state is still failed, refer to Getting Help.

For mode configuration options, refer to Managing the Confidential Computing Mode.

Pod Stuck in `ContainerCreating` with `device cold plug failed` error#

If `kubectl describe pod <pod-name> -n <namespace>` shows the following error and the pod is stuck in the ContainerCreating state, the KubeletPodResourcesGet feature gate is not enabled on the worker node. Refer to the Kubelet Configuration section in Prerequisites for more information on setting the feature gate.

Events:
  Type     Reason                  Age                 From     Message
  ----     ------                  ----                ----     -------
  Warning  FailedCreatePodSandBox  19s (x16 over 34s)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox "d0a43b5d3c6c433f011efbfacb6de3f7ac448f3d09a272cef8d43249712b12b1": failed to create containerd task: failed to create shim task: device cold plug failed: cold plug: GetPodResources failed for pod(cuda-vectoradd-kata) in namespace(default): rpc error: code = Unknown desc = PodResources API Get method disabled
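
One way to check the feature gate on the worker node is to inspect the kubelet configuration file. This is a sketch under stated assumptions: `podresources_get_enabled` is a hypothetical helper name, the default path `/var/lib/kubelet/config.yaml` is an assumption that may not match your distribution, and the check does not detect a gate set through kubelet command-line flags.

```shell
# Hypothetical helper: succeed (exit 0) if the kubelet configuration file sets
# KubeletPodResourcesGet to true under featureGates.
# The default path is an assumption; pass your own path as the first argument.
podresources_get_enabled() {
  file=${1:-/var/lib/kubelet/config.yaml}
  grep -Eq '^[[:space:]]*KubeletPodResourcesGet:[[:space:]]*true[[:space:]]*$' "$file"
}

# Example usage on the worker node:
# podresources_get_enabled && echo "feature gate enabled" \
#   || echo "feature gate not set in kubelet config file"
```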

Pod Stuck in `Pending` State with `Insufficient nvidia.com/pgpu` Error#

If `kubectl describe pod <pod-name> -n <namespace>` shows the pod stuck in the Pending state with the following FailedScheduling event, the scheduler cannot find a node with available passthrough GPU capacity for the pod.

Events:
  Type     Reason   Age   From      Message
  ----     ------   ---   ----      -------
  Warning  FailedScheduling  ...  default-scheduler   0/1 nodes are available: 1 Insufficient nvidia.com/pgpu.

Common causes:

The worker node is not configured for Confidential Containers workloads.
GPU Operator Confidential Containers operands are missing or not Running on the worker node.
nvidia.com/pgpu capacity on the node is zero because GPUs are not bound to vfio-pci on the host.
All passthrough GPUs on eligible nodes are already allocated to other pods.

Resolution:

Set the NODE_NAME environment variable to the affected node so you can copy-paste the commands below:

$ export NODE_NAME="<node-name>"

Confirm GPU Operator operands are Running on the worker node:
```
$ kubectl get pods -n gpu-operator -o wide --field-selector spec.nodeName=$NODE_NAME
```
Expected Confidential Containers operands include nvidia-cc-manager, nvidia-vfio-manager, nvidia-kata-sandbox-device-plugin, and nvidia-sandbox-validator. If an operand is not Running, refer to View GPU Operator Logs.
Confirm the node is configured for Confidential Containers workloads:
```
$ kubectl describe node $NODE_NAME | grep nvidia.com/gpu.workload.config
```
Example Output:
```
nvidia.com/gpu.workload.config: vm-passthrough
```
If the label is missing, add it:
```
$ kubectl label node $NODE_NAME nvidia.com/gpu.workload.config=vm-passthrough
```
If you set the cluster-wide default during installation instead of per-node labeling, confirm sandboxWorkloads.defaultWorkload is vm-passthrough. Refer to Common GPU Operator Configuration Settings in Detailed Install Guide.
Check nvidia.com/pgpu capacity on the node:
```
$ kubectl describe node $NODE_NAME | grep nvidia.com/pgpu
```
Example Output:
```
nvidia.com/pgpu:  8
nvidia.com/pgpu:  8
```
If capacity and allocatable are zero, GPUs are not available for scheduling. On the worker host, confirm VFIO binding (10de is the NVIDIA PCI vendor ID):
```
$ lspci -nnk -d 10de:
```
Example Output (expected):
```
65:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:xxxx] (rev a1)
        Kernel driver in use: vfio-pci
```
If the output shows Kernel driver in use: nvidia or nouveau, remove host drivers as described in Ensure No Host NVIDIA GPU Drivers Are Present. Confirm IOMMU is enabled:
```
$ ls /sys/kernel/iommu_groups
```
If the directory is empty or missing, configure IOMMU as described in Prerequisites, then reboot the host. Review the nvidia-vfio-manager pod logs for the affected node as described in View GPU Operator Logs. After fixing host prerequisites, wait for the operand pods to reconcile, then confirm that nvidia.com/pgpu is non-zero.
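
On hosts with many GPUs, the VFIO binding check above can be scripted to report only the devices that are not bound to `vfio-pci`. This is a minimal sketch: `non_vfio_gpus` is a hypothetical helper name, and it reports every NVIDIA PCI device that `lspci` lists, which can include devices other than GPUs.

```shell
# Hypothetical helper: read `lspci -nnk -d 10de:` output on stdin and print
# "<slot> <driver>" for each device whose kernel driver is not vfio-pci
# ("none" when no driver is bound).
non_vfio_gpus() {
  awk '
    # A device line starts in column 0 with the PCI slot address.
    /^[0-9a-fA-F]/ {
      if (dev != "" && drv != "vfio-pci") print dev, (drv == "" ? "none" : drv)
      dev = $1; drv = ""
    }
    /Kernel driver in use:/ { drv = $NF }
    END {
      if (dev != "" && drv != "vfio-pci") print dev, (drv == "" ? "none" : drv)
    }
  '
}

# Example usage on the worker host:
# lspci -nnk -d 10de: | non_vfio_gpus
```

Empty output means every listed device is bound to `vfio-pci`.
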
If the node shows non-zero nvidia.com/pgpu capacity but the pod is still Pending, all passthrough GPUs on eligible nodes may be allocated to other pods. Check which pods currently have a passthrough GPU allocated:
```
$ kubectl get pods -A --field-selector spec.nodeName=$NODE_NAME -o json | \
    jq -r '.items[] | select(any(.spec.containers[]; .resources.requests["nvidia.com/pgpu"] // empty)) | "\(.metadata.namespace)/\(.metadata.name)"'
```
Wait for a workload to complete or free capacity before retrying.

For an additional check, refer to the optional VFIO validation step in the Detailed Install Guide.

Getting Help#

If the steps on this page do not resolve your issue, use the resources below based on which component is failing.

NVIDIA GPU Operator and Confidential Computing Operands#

For issues with GPU Operator pods or Confidential Containers operands (nvidia-cc-manager, nvidia-vfio-manager, nvidia-kata-sandbox-device-plugin, and nvidia-sandbox-validator):

Review the NVIDIA GPU Operator troubleshooting guide.
If the issue is not documented there, run the GPU Operator must-gather utility to collect cluster diagnostics:
```
$ curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
$ chmod +x must-gather.sh
$ ./must-gather.sh
```
The utility produces an archive with manifests and logs from GPU Operator-managed components.
Prepare a bug report and file an issue in the NVIDIA GPU Operator GitHub repository.

Kata Containers#

For issues with kata-deploy, missing runtime classes, or Kata runtime failures:

Search the Kata Containers GitHub issues for similar reports.
If no existing issue matches your problem, open a new issue in that repository.

Include your environment details, Kata chart version, kata-deploy pod logs, and cluster configuration.

Attestation and Upstream Confidential Containers#

For attestation, Trustee, sealed secrets, or other upstream Confidential Containers features, refer to the Confidential Containers documentation and the Confidential Containers GitHub repository.

For NVIDIA Confidential Computing licensing requirements, refer to Licensing.
