Troubleshooting#

Use this page when Confidential Containers installation or workload deployment steps fail.

Refer to the NVIDIA GPU Operator troubleshooting guide for general operator issues such as driver daemonsets, the container toolkit, and validator pods. The sections below cover deployment failures specific to Confidential Containers: CC node labels, Kata runtime installation, and host prerequisites.

If these steps do not resolve your issue, refer to Getting Help.

View GPU Operator Logs#

  1. Get the list of GPU Operator pods:

    $ kubectl get pods -n gpu-operator
    

    Example Output:

    NAME                                                              READY   STATUS    RESTARTS   AGE
    gpu-operator-1766001809-node-feature-discovery-gc-75776475sxzkp   1/1     Running   0          86s
    gpu-operator-1766001809-node-feature-discovery-master-6869lxq2g   1/1     Running   0          86s
    gpu-operator-1766001809-node-feature-discovery-worker-mh4cv       1/1     Running   0          86s
    gpu-operator-f48fd66b-vtfrl                                       1/1     Running   0          86s
    nvidia-cc-manager-7z74t                                           1/1     Running   0          61s
    nvidia-kata-sandbox-device-plugin-daemonset-d5rvg                 1/1     Running   0          30s
    nvidia-sandbox-validator-6xnzc                                    1/1     Running   0          30s
    nvidia-vfio-manager-h229x                                         1/1     Running   0          62s
    
  2. Get specific logs for a pod:

    $ kubectl logs -n gpu-operator <pod-name>
    

    Replace <pod-name> with the name of the GPU Operator pod from kubectl get pods -n gpu-operator.
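When several operands are failing at once, it can be quicker to save the logs of every pod in the namespace and search them offline. The following is a minimal sketch, not part of the GPU Operator tooling: `dump_namespace_logs` is a hypothetical helper name, and the output directory is whatever you pass in.

```shell
# Hypothetical helper: save the logs of every pod in a namespace,
# one file per pod, under <out_dir>.
dump_namespace_logs() {
  ns="$1"
  out_dir="$2"
  mkdir -p "$out_dir" || return 1
  # "-o name" prints one "pod/<name>" entry per line.
  kubectl get pods -n "$ns" -o name | while IFS= read -r pod; do
    name="${pod#pod/}"    # "pod/foo" -> "foo"
    # Capture stdout and stderr so that errors land in the same file.
    kubectl logs -n "$ns" "$pod" --all-containers=true \
      > "$out_dir/$name.log" 2>&1 < /dev/null ||
      echo "warning: could not read logs for $name" >&2
  done
}

# Example: dump_namespace_logs gpu-operator ./gpu-operator-logs
```

The same function works for the kata-system namespace.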

View Kata Containers Logs#

  1. Get the list of Kata Containers pods:

    $ kubectl get pods -n kata-system
    

    Example Output:

    NAME                       READY   STATUS    RESTARTS   AGE
    kata-deploy-<pod-name>     1/1     Running   0          6m37s
    
  2. View the logs for the Kata Containers pod:

    $ kubectl logs -n kata-system <pod-name>
    

    Replace <pod-name> with the name of the Kata Containers pod from kubectl get pods -n kata-system.
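To decide which pod's logs to read first, you can filter the pod listing down to pods that are not healthy. This is a small sketch under the assumption that the input is the default table printed by `kubectl get pods -n <namespace>` (NAME, READY, STATUS, ...); `list_unhealthy_pods` is a hypothetical name, and the filter looks only at the STATUS column.

```shell
# Hypothetical helper: print the names of pods whose STATUS column is
# neither "Running" nor "Completed". Reads the table printed by
# `kubectl get pods -n <namespace>` on standard input.
list_unhealthy_pods() {
  # NR > 1 skips the header row; $1 is NAME and $3 is STATUS.
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Example: kubectl get pods -n kata-system | list_unhealthy_pods
```

The sketch does not handle `kubectl get pods -A` output, where a NAMESPACE column shifts the fields, and it does not flag pods that are Running but not Ready.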

nvidia.com/cc.mode.state Not Matching nvidia.com/cc.mode#

When you change the Confidential Computing mode (refer to Managing the Confidential Computing Mode), the Confidential Computing Manager updates the nvidia.com/cc.mode.state label to reflect the current state of the Confidential Computing mode. If nvidia.com/cc.mode.state does not match the desired mode in nvidia.com/cc.mode (on, off, or ppcie), the mode change is usually still in progress. Wait a few minutes, then check the labels again. Set NODE_NAME to the name of the node before running the command. If the state is failed, refer to the next section.

$ kubectl get node $NODE_NAME -o json | \
      jq '.metadata.labels | with_entries(select(.key | startswith("nvidia.com/cc")))'

Example Output:

{
   "nvidia.com/cc.mode": "on",
   "nvidia.com/cc.mode.state": "on",
   "nvidia.com/cc.ready.state": "true"
}
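Instead of re-running the label query by hand, you can poll until the two labels agree. The sketch below assumes only the label names from this section; the function names, the 600-second timeout, and the 15-second interval are illustrative choices, not documented values.

```shell
# Read one label from a node. $2 is the label key with dots escaped
# for jsonpath, for example 'nvidia\.com/cc\.mode'.
get_node_label() {
  kubectl get node "$1" -o "jsonpath={.metadata.labels.$2}"
}

# Hypothetical helper: poll until nvidia.com/cc.mode.state equals
# nvidia.com/cc.mode. Returns 0 on match, 1 if the state is "failed",
# and 2 on timeout.
wait_for_cc_mode() {
  node="$1"
  timeout="${2:-600}"    # seconds (assumed default)
  interval="${3:-15}"    # seconds (assumed default)
  elapsed=0
  while [ "$elapsed" -le "$timeout" ]; do
    want=$(get_node_label "$node" 'nvidia\.com/cc\.mode')
    have=$(get_node_label "$node" 'nvidia\.com/cc\.mode\.state')
    echo "desired=$want state=$have"
    if [ -n "$want" ] && [ "$want" = "$have" ]; then
      return 0
    fi
    if [ "$have" = "failed" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 2
}

# Example: wait_for_cc_mode "$NODE_NAME" 600 15
```

A return value of 1 corresponds to the failed state covered in the next section.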

nvidia.com/cc.mode.state is failed#

A nvidia.com/cc.mode.state value of failed means the Confidential Computing mode could not be updated on the GPU.

Checks:

  1. Confirm no user workloads are running on the node before changing CC mode. List pods scheduled on the node:

    $ export NODE_NAME="<node-name>"
    $ kubectl get pods -A --field-selector spec.nodeName=$NODE_NAME -o wide
    

    This lists pods on $NODE_NAME. kube-system DaemonSets (for example CNI or kube-proxy) are expected on every worker node. gpu-operator and kata-system pods are expected only if this node is configured for Confidential Containers (labeled nvidia.com/gpu.workload.config=vm-passthrough or cluster-wide sandboxWorkloads.defaultWorkload=vm-passthrough). Delete or reschedule any other Running pods (especially GPU workloads) before changing CC mode.

  2. View nvidia-cc-manager pod logs:

    $ kubectl logs -n gpu-operator nvidia-cc-manager-<pod-name>
    

    Replace nvidia-cc-manager-<pod-name> with the full name of the nvidia-cc-manager pod, for example nvidia-cc-manager-7z74t, as listed by kubectl get pods -n gpu-operator.

  3. Confirm hardware virtualization and ACS are enabled in the host BIOS. To check hardware virtualization, look for the vmx (Intel) or svm (AMD) flag in /proc/cpuinfo. For ACS, coordinate with your hardware or IT administrator if needed.

  4. Re-apply the desired mode label to retry the transition:

    $ kubectl label node $NODE_NAME nvidia.com/cc.mode=on --overwrite
    

For mode configuration options, refer to Managing the Confidential Computing Mode.
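The first and third checks above can be scripted. The sketch below is illustrative only: both function names are hypothetical, the namespace list mirrors the namespaces named in check 1, and the CPU check covers hardware virtualization but not ACS.

```shell
# Hypothetical helper: print Running pods outside the namespaces that
# are expected on a Confidential Containers node. Reads the output of
#   kubectl get pods -A --field-selector spec.nodeName=$NODE_NAME -o wide
# on standard input (columns: NAMESPACE NAME READY STATUS ...).
list_unexpected_pods() {
  awk 'NR > 1 && $4 == "Running" &&
       $1 != "kube-system" && $1 != "gpu-operator" &&
       $1 != "kata-system" { print $1 "/" $2 }'
}

# Hypothetical helper: report whether the CPU flags include vmx or svm.
# $1 is the file to read and defaults to /proc/cpuinfo.
check_virt_flags() {
  cpuinfo="${1:-/proc/cpuinfo}"
  if grep -E '^flags' "$cpuinfo" | grep -qw vmx; then
    echo "vmx (Intel VT-x) present"
  elif grep -E '^flags' "$cpuinfo" | grep -qw svm; then
    echo "svm (AMD-V) present"
  else
    echo "no vmx or svm flag found"
    return 1
  fi
}

# Example (run check_virt_flags on the worker host):
#   kubectl get pods -A --field-selector spec.nodeName=$NODE_NAME -o wide |
#     list_unexpected_pods
#   check_virt_flags
```

Any pod printed by list_unexpected_pods is a candidate to delete or reschedule before you retry the mode change.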

Pod Stuck in ContainerCreating with device cold plug failed error#

If the pod is stuck in the ContainerCreating state and kubectl describe pod <pod-name> -n <namespace> shows the following error, the KubeletPodResourcesGet feature gate is not enabled on the worker node. Refer to the Kubelet Configuration section in Prerequisites for more information on setting the feature gate.

Events:
  Type     Reason                  Age                 From     Message
  ----     ------                  ----                ----     -------
   Warning  FailedCreatePodSandBox  19s (x16 over 34s)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox "d0a43b5d3c6c433f011efbfacb6de3f7ac448f3d09a272cef8d43249712b12b1": failed to create containerd task: failed to create shim task: device cold plug failed: cold plug: GetPodResources failed for pod(cuda-vectoradd-kata) in namespace(default): rpc error: code = Unknown desc = PodResources API Get method disabled
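One way to confirm the kubelet setting without logging in to the node is to read the kubelet's running configuration through the API server's node proxy. The sketch below makes several assumptions: that your account is allowed to access nodes/proxy, that the configz endpoint is reachable in your cluster, and that the gate was set explicitly, because a gate left at its default value does not appear in the output. `kubelet_feature_gate_enabled` is a hypothetical name.

```shell
# Hypothetical helper: succeed if the kubelet on node $1 reports
# feature gate $2 as explicitly enabled in its running configuration.
kubelet_feature_gate_enabled() {
  node="$1"
  gate="$2"
  # configz returns the kubelet configuration as JSON; look for
  # "<gate>": true inside it.
  kubectl get --raw "/api/v1/nodes/$node/proxy/configz" |
    grep -Eq "\"$gate\":[[:space:]]*true"
}

# Example:
#   if kubelet_feature_gate_enabled "$NODE_NAME" KubeletPodResourcesGet; then
#     echo "feature gate enabled"
#   else
#     echo "feature gate not explicitly enabled"
#   fi
```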

Pod Stuck in Pending State with Insufficient nvidia.com/pgpu Error#

If the pod is stuck in the Pending state and kubectl describe pod <pod-name> -n <namespace> shows the following FailedScheduling event, the scheduler cannot find a node with available passthrough GPU capacity for the pod.

Events:
  Type     Reason            Age  From               Message
  ----     ------            ---  ----               -------
  Warning  FailedScheduling  ...  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/pgpu.

Common causes:

  • The worker node is not configured for Confidential Containers workloads.

  • GPU Operator Confidential Containers operands are missing or not Running on the worker node.

  • nvidia.com/pgpu capacity on the node is zero because GPUs are not bound to vfio-pci on the host.

  • All passthrough GPUs on eligible nodes are already allocated to other pods.
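To narrow down which of these causes applies, it helps to read the allocatable nvidia.com/pgpu count straight from the node object. The following sketch uses hypothetical function names and only separates "nothing advertised" (the first three causes) from "GPUs advertised but possibly in use" (the last cause).

```shell
# Print the allocatable nvidia.com/pgpu count for node $1.
# Prints nothing if the node does not advertise the resource.
pgpu_allocatable() {
  kubectl get node "$1" -o 'jsonpath={.status.allocatable.nvidia\.com/pgpu}'
}

# Hypothetical helper: map the count to the likely group of causes.
diagnose_pgpu() {
  count=$(pgpu_allocatable "$1")
  case "$count" in
    ''|0)
      echo "node $1 advertises no nvidia.com/pgpu: check node configuration, operands, and VFIO binding"
      ;;
    *)
      echo "node $1 advertises $count nvidia.com/pgpu: check whether they are already allocated"
      ;;
  esac
}

# Example: diagnose_pgpu <node-name>
```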

Resolution:

  1. Confirm GPU Operator operands are Running on the worker node:

    $ kubectl get pods -n gpu-operator -o wide --field-selector spec.nodeName=<node-name>
    

    Expected Confidential Containers operands include nvidia-cc-manager, nvidia-vfio-manager, nvidia-kata-sandbox-device-plugin, and nvidia-sandbox-validator. If an operand is not Running, refer to View GPU Operator Logs.

  2. Confirm the node is configured for Confidential Containers workloads:

    $ kubectl describe node <node-name> | grep nvidia.com/gpu.workload.config
    

    Example Output:

    nvidia.com/gpu.workload.config=vm-passthrough
    

    If the label is missing, add it:

    $ kubectl label node <node-name> nvidia.com/gpu.workload.config=vm-passthrough
    

    If you set the cluster-wide default during installation instead of labeling individual nodes, confirm that sandboxWorkloads.defaultWorkload is vm-passthrough. Refer to Common GPU Operator Configuration Settings in the Detailed Install Guide.

  3. Check nvidia.com/pgpu capacity on the node:

    $ kubectl describe node <node-name> | grep nvidia.com/pgpu
    

    Example Output (the first line is capacity, the second is allocatable):

    nvidia.com/pgpu:  8
    nvidia.com/pgpu:  8
    

    If capacity and allocatable are zero, GPUs are not available for scheduling. On the worker host, confirm VFIO binding:

    $ lspci -nnk -d 10de:
    

    Example Output (expected):

    65:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:xxxx] (rev a1)
            Kernel driver in use: vfio-pci
    

    If the output shows Kernel driver in use: nvidia or nouveau, remove the host drivers as described in Ensure No Host NVIDIA GPU Drivers Are Present.

    Confirm IOMMU is enabled:

    $ ls /sys/kernel/iommu_groups
    

    If the directory is empty or missing, configure IOMMU as described in Prerequisites, then reboot the host. Review the nvidia-vfio-manager pod logs for the affected node as described in View GPU Operator Logs. After fixing the host prerequisites, wait for the operand pods to reconcile, then confirm that nvidia.com/pgpu is non-zero.

  4. If the node shows non-zero nvidia.com/pgpu capacity but the pod is still Pending, all GPUs may be in use. Check allocatable capacity and running workloads on the node.
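On hosts with several GPUs, the VFIO binding check in step 3 is easier to read when only the problem devices are printed. This is a minimal sketch that assumes the `lspci -nnk -d 10de:` output layout shown above; `list_non_vfio_gpus` is a hypothetical name.

```shell
# Hypothetical helper: print every device from `lspci -nnk -d 10de:`
# output (read on standard input) that is not bound to vfio-pci.
list_non_vfio_gpus() {
  awk '
    # Print the previous device if its driver is missing or not vfio-pci.
    function report() {
      if (dev != "" && drv != "vfio-pci")
        print dev, "driver=" (drv == "" ? "none" : drv)
    }
    # Device lines start with the PCI address; detail lines are indented.
    /^[0-9a-fA-F]/          { report(); dev = $1; drv = "" }
    /Kernel driver in use:/ { drv = $NF }
    END                     { report() }
  '
}

# Example (on the worker host): lspci -nnk -d 10de: | list_non_vfio_gpus
```

Empty output means every listed NVIDIA device reports vfio-pci as its kernel driver.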

For an additional check of the VFIO setup, refer to the optional VFIO validation step in the Detailed Install Guide.

Getting Help#

If the steps on this page do not resolve your issue, use the resources below based on which component is failing.

NVIDIA GPU Operator and Confidential Computing Operands#

For issues with GPU Operator pods or Confidential Containers operands (nvidia-cc-manager, nvidia-vfio-manager, nvidia-kata-sandbox-device-plugin, and nvidia-sandbox-validator):

  1. Review the NVIDIA GPU Operator troubleshooting guide.

  2. If the issue is not documented there, run the GPU Operator must-gather utility to collect cluster diagnostics:

    $ curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
    $ chmod +x must-gather.sh
    $ ./must-gather.sh
    

    The utility produces an archive with manifests and logs from GPU Operator-managed components.

  3. Prepare a bug report and file an issue in the NVIDIA GPU Operator GitHub repository.

Kata Containers#

For issues with kata-deploy, missing runtime classes, or Kata runtime failures:

  1. Search the Kata Containers GitHub issues for similar reports.

  2. If no existing issue matches your problem, open a new issue in that repository.

    Include your environment details, Kata chart version, kata-deploy pod logs, and cluster configuration.

Attestation and Upstream Confidential Containers#

For attestation, Trustee, sealed secrets, or other upstream Confidential Containers features, refer to the Confidential Containers documentation and the Confidential Containers GitHub repository.

For NVIDIA Confidential Computing licensing requirements, refer to Licensing.