Detailed Install Guide

This page lists the steps for a Kubernetes Cluster Administrator to deploy Kata Containers and the NVIDIA GPU Operator to your cluster and configure it for Confidential Containers. For persona responsibilities and documentation structure, refer to Personas.

If you want the fastest path and intend to run Confidential Containers on every node in your cluster, use the Quickstart Install instead. Use this guide when you need per-node control, such as running Confidential Containers on some nodes and traditional GPU workloads on others, or when you want additional configuration options.

Install Overview

This guide assumes you completed Prerequisites on an existing Kubernetes cluster with GPU worker nodes.

Install workflow:

  1. Prerequisites: prepare worker hosts and cluster software.

  2. Label nodes to deploy Confidential Containers components: select GPU workers for Confidential Containers workloads.

  3. Install Kata Containers: install runtime classes and node-level Kata components.

  4. Install the NVIDIA GPU Operator: deploy Confidential Containers operands on target nodes.

  5. Run a Sample Workload: confirm the deployment end to end.

Success criteria: Helm releases report STATUS: deployed, the kata-deploy pod is Running, SNP and TDX runtime classes are available, GPU Operator operands are healthy on target nodes, and the sample workload logs include Test PASSED.

When you finish this page, nodes are labeled for Confidential Container component deployment, Kata runtime classes are available, and GPU Operator operands are running on those nodes. Continue to Run a Sample Workload if you have not run it yet.

Label Nodes for Confidential Containers Components

The GPU Operator reads labels to determine what software components to deploy to a node. To configure a node for Confidential Container workloads, you label the node with the nvidia.com/gpu.workload.config=vm-passthrough label. Then, when the GPU Operator is installed in a subsequent step, it will deploy the software components needed to run Confidential Containers to the node.

A node can only run one container runtime at a time, so a node configured for Confidential Container workloads cannot run traditional GPU container workloads. The labeling approach is useful if you want to run Confidential Containers workloads on some nodes and traditional GPU container workloads on other nodes in your cluster.

For more details on how the GPU Operator deploys components to your cluster, refer to the GPU Operator Cluster Topology Considerations section in the architecture overview.

Tip

Skip this section if you plan to use all nodes in your cluster to run Confidential Containers and instead set sandboxWorkloads.defaultWorkload=vm-passthrough when installing the GPU Operator.

  1. Get a list of the nodes in your cluster:

    $ kubectl get nodes
    

    Example Output:

    NAME          STATUS   ROLES           AGE   VERSION
    node-01       Ready    <none>          10d   v1.34.0
    node-02       Ready    <none>          10d   v1.34.0
    

    Identify the GPU worker node or nodes you want to configure for Confidential Containers. Use one node name in the next step, and repeat the remaining steps for each additional node.

  2. Set the NODE_NAME environment variable to the name of the node you want to configure:

    $ export NODE_NAME="<node-name>"
    

    Note

    Commands in this guide use the $NODE_NAME environment variable to reference this node.

  3. Label the node for Confidential Containers:

    $ kubectl label node $NODE_NAME nvidia.com/gpu.workload.config=vm-passthrough
    

    Example Output:

    node/<node-name> labeled
    

    The node/<node-name> labeled message confirms the label was applied.

    Note

    If the command prints node/<node-name> not labeled, the label is probably already set to this value. Continue to the next step to verify the label.

  4. Verify the node label was added:

    $ kubectl describe node $NODE_NAME | grep nvidia.com/gpu.workload.config
    

    Example Output:

    nvidia.com/gpu.workload.config=vm-passthrough
    

Success criteria: All nodes you intend to use for Confidential Container workloads have the nvidia.com/gpu.workload.config=vm-passthrough label. This label signals the GPU Operator to deploy the software components needed to run Confidential Containers to each labeled node and to configure that node to run only a Confidential runtime.
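
When several nodes need the label, the steps above can be wrapped in a small loop. The following is a minimal sketch rather than part of the documented tooling: the label_cc_nodes function name and the KUBECTL override variable are hypothetical, while the label key and value are the ones used in this guide.

```shell
# Hypothetical helper: label each named node for Confidential Containers.
# KUBECTL can be overridden (for example, KUBECTL=echo for a dry run).
KUBECTL="${KUBECTL:-kubectl}"

label_cc_nodes() {
  for node in "$@"; do
    # Stop at the first node that cannot be labeled.
    "$KUBECTL" label node "$node" \
      nvidia.com/gpu.workload.config=vm-passthrough || return 1
  done
}

# Example usage (node names are placeholders):
#   label_cc_nodes node-01 node-02
# List every node that carries the label:
#   kubectl get nodes -l nvidia.com/gpu.workload.config=vm-passthrough
```

Setting KUBECTL=echo prints the commands instead of running them, which is a convenient way to review the node list before changing the cluster.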

After all your desired nodes are labeled, you can continue to the next step to install Kata Containers.

Install the Kata Containers Helm Chart

Install Kata Containers using the kata-deploy Helm chart. The kata-deploy chart installs all required components from the Kata Containers project including the Kata Containers runtime binary, runtime configuration, UVM kernel, and images that NVIDIA uses for Confidential Containers and native Kata containers.

The minimum required version is 3.29.0.

  1. Set the chart version and registry path:

    $ export VERSION="3.29.0"
    $ export CHART="oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy"
    
  2. Install the kata-deploy Helm chart:

    $ helm install kata-deploy "${CHART}" \
       --namespace kata-system --create-namespace \
       --set nfd.enabled=false \
       --wait --timeout 10m \
       --version "${VERSION}"
    

    Example Output immediately after running the command:

    Pulled: ghcr.io/kata-containers/kata-deploy-charts/kata-deploy:3.29.0
    Digest: sha256:aea41018779716ce2e0bf406d701637d10fb5a0792db51a08dfd3f76701eb933
    

    The --wait flag in the install command instructs Helm to wait until the release is deployed before returning. It can take 2-3 minutes for the command to return more output.

    Note

    There is a known Helm issue on single-node clusters that can cause the Helm command to finish before all deployed pods have finished initializing. If you are deploying to a single-node cluster, you might need to wait a few additional minutes after the Helm command completes for the kata-deploy pod to reach the Running state.

    Example Output when the release is deployed:

    Pulled: ghcr.io/kata-containers/kata-deploy-charts/kata-deploy:3.29.0
    Digest: sha256:aea41018779716ce2e0bf406d701637d10fb5a0792db51a08dfd3f76701eb933
    LAST DEPLOYED: Wed Apr  1 17:03:00 2026
    NAMESPACE: kata-system
    STATUS: deployed
    REVISION: 1
    DESCRIPTION: Install complete
    TEST SUITE: None
    

    STATUS: deployed confirms the Helm release succeeded and the chart resources were applied. This does not yet confirm the Kata components are healthy, so continue to the verification steps below before you install the GPU Operator.

    Note

    Both kata-deploy and the GPU Operator deploy Node Feature Discovery (NFD) by default. The install command includes --set nfd.enabled=false to prevent kata-deploy from deploying NFD. The GPU Operator will deploy and manage NFD in the next step.

  3. Verify that the kata-deploy pod is running:

    $ kubectl get pods -n kata-system | grep kata-deploy
    

    Example Output:

    kata-deploy-b2lzs       1/1     Running   0             6m37s
    

    A READY value of 1/1 and a STATUS of Running mean the kata-deploy pod installed the Kata components on the node successfully. If the pod is Pending, ContainerCreating, or CrashLoopBackOff, wait a minute and re-run the command. If it does not reach Running, refer to the log steps below.

  4. Verify that the kata-qemu-nvidia-gpu-snp and kata-qemu-nvidia-gpu-tdx runtime classes are available:

    After helm install completes with STATUS: deployed, the kata-deploy chart has created the Kata RuntimeClass resources on the cluster. This check is the required checkpoint before you continue to Install the NVIDIA GPU Operator.

    $ kubectl get runtimeclass | grep kata-qemu-nvidia-gpu
    

    Example Output:

    NAME                       HANDLER                    AGE
    kata-qemu-nvidia-gpu       kata-qemu-nvidia-gpu       40s
    kata-qemu-nvidia-gpu-snp   kata-qemu-nvidia-gpu-snp   40s
    kata-qemu-nvidia-gpu-tdx   kata-qemu-nvidia-gpu-tdx   40s
    

    The kata-deploy chart installs several runtime classes. The kata-qemu-nvidia-gpu runtime class is used with Kata Containers in a non-Confidential Containers scenario. The kata-qemu-nvidia-gpu-snp (AMD-based systems) and kata-qemu-nvidia-gpu-tdx (Intel-based systems) runtime classes are used to deploy Confidential Containers workloads.

    If the SNP or TDX runtime classes are not listed, the install did not complete correctly. On a single-node cluster, Helm might have returned before the kata-deploy pod reached Running (refer to the note above); in that case, retry after a few minutes. Otherwise, refer to the log steps below.

Success criteria: Helm reports STATUS: deployed, the kata-deploy pod is Running, and both kata-qemu-nvidia-gpu-snp and kata-qemu-nvidia-gpu-tdx are available on the cluster. After all checks pass, continue to Install the NVIDIA GPU Operator.
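
The runtime class check above can also be scripted, for example in an install pipeline. The following is a minimal sketch under stated assumptions: check_cc_runtimeclasses is a hypothetical helper name, and it only inspects text piped in from kubectl get runtimeclass, so it does not contact the cluster itself.

```shell
# Hypothetical helper: reads `kubectl get runtimeclass` output on stdin and
# reports any Confidential Containers runtime class that is missing.
check_cc_runtimeclasses() {
  names="$(awk '{ print $1 }')"   # first column holds the class name
  missing=0
  for rc in kata-qemu-nvidia-gpu-snp kata-qemu-nvidia-gpu-tdx; do
    if ! printf '%s\n' "$names" | grep -Fqx "$rc"; then
      echo "missing runtime class: $rc" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage:
#   kubectl get runtimeclass | check_cc_runtimeclasses && echo "runtime classes OK"
```

The function returns a non-zero status when either class is absent, so it can gate the GPU Operator install step.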

If you have an issue deploying the kata-deploy pod or are not seeing the expected runtime classes, use the following steps to view the logs:

  1. Get the kata-deploy pod name:

    $ kubectl get pods -n kata-system | grep kata-deploy
    

    Example Output:

    NAME                       READY   STATUS    RESTARTS      AGE
    kata-deploy-<pod-name>       1/1     Running   0             6m37s
    
  2. View the logs for the kata-deploy pod:

    $ kubectl logs -n kata-system kata-deploy-<pod-name>
    

    Replace <pod-name> with the name of the kata-deploy pod from the first command’s output.

    Example Output:

    Install completed
    daemonset mode: waiting for SIGTERM
    

    If the pod is in CrashLoopBackOff, the logs show repeated errors, or runtime classes are missing after a successful Helm deploy, collect the log output and check for similar reports in the Kata Containers GitHub repository. If no existing issue matches your problem, open a new issue in that repository with your kata-deploy logs, chart version (3.29.0), and cluster details.

Install the NVIDIA GPU Operator

Install the NVIDIA GPU Operator and configure it to deploy Confidential Container components.

  1. Add and update the NVIDIA Helm repository:

    $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
       && helm repo update
    

    Example Output:

    "nvidia" has been added to your repositories
    Hang tight while we grab the latest from your chart repositories...
    ...Successfully got an update from the "nvidia" chart repository
    Update Complete. ⎈Happy Helming!⎈
    
  2. Install the GPU Operator with the following configuration:

    Tip

    Add --set sandboxWorkloads.defaultWorkload=vm-passthrough to configure every worker node for Confidential Containers workloads. Refer to the Label Nodes for Confidential Containers Components section for more details on this use case.

    $ helm install --wait --timeout 10m --generate-name \
       -n gpu-operator --create-namespace \
       nvidia/gpu-operator \
       --set sandboxWorkloads.enabled=true \
       --set sandboxWorkloads.mode=kata \
       --set nfd.enabled=true \
       --set nfd.nodefeaturerules=true \
       --version=v26.3.1
    

    Example Output:

    NAME: gpu-operator
    LAST DEPLOYED: Tue Mar 10 17:58:12 2026
    NAMESPACE: gpu-operator
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    

    STATUS: deployed confirms the Helm release succeeded. The --wait flag instructs Helm to wait until the release is deployed before returning. It may take 3-5 minutes for the Helm command to complete and for all GPU Operator pods to be in the Running state.

    For additional installation settings, refer to Common GPU Operator Configuration Settings.

  3. Verify that all GPU Operator pods, especially the Confidential Computing Manager, Kata Device Plugin and VFIO Manager operands, are running:

    $ kubectl get pods -n gpu-operator
    

    Example Output:

    NAME                                                              READY   STATUS    RESTARTS   AGE
    gpu-operator-1766001809-node-feature-discovery-gc-75776475sxzkp   1/1     Running   0          86s
    gpu-operator-1766001809-node-feature-discovery-master-6869lxq2g   1/1     Running   0          86s
    gpu-operator-1766001809-node-feature-discovery-worker-mh4cv       1/1     Running   0          86s
    gpu-operator-f48fd66b-vtfrl                                       1/1     Running   0          86s
    nvidia-cc-manager-7z74t                                           1/1     Running   0          61s
    nvidia-kata-sandbox-device-plugin-daemonset-d5rvg                 1/1     Running   0          30s
    nvidia-sandbox-validator-6xnzc                                    1/1     Running   0          30s
    nvidia-vfio-manager-h229x                                         1/1     Running   0          62s
    

    Each pod should report a READY value of 1/1 and a STATUS of Running or Completed. The nvidia-cc-manager, nvidia-kata-sandbox-device-plugin-daemonset, and nvidia-vfio-manager operands are specific to Confidential Containers and must be present on labeled nodes. Pods may briefly show Pending or Init while they start, which is expected. When all operands are Running or Completed, the GPU Operator components are deployed and you can continue.

    For more details on each of the GPU Operator components, refer to the GPU Operator Cluster Topology Considerations section in the architecture overview.

  4. Optional: If you have host access to the worker node, you can perform the following validation step:

    1. Confirm that the host uses the vfio-pci device driver for GPUs:

      $ lspci -nnk -d 10de:
      

      Example Output:

      65:00.0 3D controller [0302]: NVIDIA Corporation xxxxxxx [xxx] [10de:xxxx] (rev xx)
              Subsystem: NVIDIA Corporation xxxxxxx [xxx] [10de:xxxx]
              Kernel driver in use: vfio-pci
              Kernel modules: nvidiafb, nouveau
      

      Kernel driver in use: vfio-pci confirms the GPU is bound for VFIO passthrough into the confidential virtual machine. If the driver in use is nvidia or nouveau instead, the GPU is not ready for passthrough. Confirm your node meets the Prerequisites section, including removing any NVIDIA GPU drivers on the host.
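
If you validate several hosts, the lspci output from the optional step above can be scanned automatically. The following is a minimal sketch: devices_not_on_vfio is a hypothetical helper name, and it assumes every device listed by the 10de: vendor filter is expected to be bound to vfio-pci, which might not hold on hosts with other NVIDIA PCI functions.

```shell
# Hypothetical helper: reads `lspci -nnk -d 10de:` output on stdin and prints
# every NVIDIA device whose kernel driver is not vfio-pci ("none" if unbound).
# Exits non-zero when at least one such device is found.
devices_not_on_vfio() {
  awk '
    function report() {
      if (dev != "" && drv != "vfio-pci") {
        print dev, (drv == "" ? "none" : drv)
        bad = 1
      }
    }
    # A line that starts with a hex digit is a device header (PCI address).
    /^[0-9a-fA-F]/          { report(); dev = $1; drv = "" }
    /Kernel driver in use:/ { drv = $NF }
    END                     { report(); exit bad + 0 }
  '
}

# Example usage on the worker host:
#   lspci -nnk -d 10de: | devices_not_on_vfio && echo "all NVIDIA devices use vfio-pci"
```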

Success criteria: All GPU Operator pods are Running or Completed. Your cluster is now configured to deploy workloads in Kata Containers. Continue to Run a Sample Workload to confirm everything is working as expected.
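
To apply the success criteria without reading the pod table by eye, the output of kubectl get pods can be filtered for pods that are not yet healthy. The following is a minimal sketch: pods_not_ready is a hypothetical helper name, and it only parses the text it receives on standard input.

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints the
# name and status of each pod whose STATUS is neither Running nor Completed.
# Exits non-zero when at least one such pod is found.
pods_not_ready() {
  awk 'NF >= 3 && $1 != "NAME" && $3 != "Running" && $3 != "Completed" {
         print $1, $3
         bad = 1
       }
       END { exit bad + 0 }'
}

# Example usage:
#   kubectl get pods -n gpu-operator | pods_not_ready || echo "some pods are not ready yet"
```

Pods that are still starting, for example in Pending or an Init state, are listed until they reach Running or Completed.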

If you are not seeing the expected output, view the logs for the GPU Operator pods or refer to Troubleshooting.

$ kubectl logs -n gpu-operator <pod-name>

Replace <pod-name> with the name of the GPU Operator pod from kubectl get pods -n gpu-operator.

Tip

For general GPU Operator issues such as driver or toolkit failures, refer to the NVIDIA GPU Operator troubleshooting guide. For deployment failures specific to Confidential Containers, refer to Troubleshooting. Common symptoms include Insufficient nvidia.com/pgpu and device cold plug failed.

Common GPU Operator Configuration Settings

The following are the available GPU Operator configuration settings to enable Confidential Containers:

Parameter: sandboxWorkloads.enabled
Description: Enables sandbox workload management in the GPU Operator for virtual machine-style workloads and related operands.
Default: false

Parameter: sandboxWorkloads.defaultWorkload
Description: Specifies the default type of workload for the cluster, one of container, vm-passthrough, or vm-vgpu. Set to vm-passthrough if you plan to run all or mostly virtual machines in your cluster.
Default: container

Parameter: sandboxWorkloads.mode
Description: Specifies the sandbox mode to use when deploying sandbox workloads. Accepted values are kubevirt (default) and kata. Set to kata to run Confidential Containers workloads in Kata Containers.
Default: kubevirt

Parameter: kataSandboxDevicePlugin.env
Description: Optional list of environment variables passed to the NVIDIA Kata Device Plugin pod. Each list item is an EnvVar object with required name and optional value fields. Use the setting to configure P_GPU_ALIAS or NVSWITCH_ALIAS for the Kata sandbox device plugin. Refer to the Configuring GPU or NVSwitch Resource Types Name section for more details.
Default: [] (empty list)
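
The same settings can be kept in a Helm values file instead of repeating --set flags. The following is a minimal sketch that mirrors the flags used in the install step of this guide. The file name cc-values.yaml and the VALUES_FILE variable are hypothetical choices, and the YAML nesting assumes the dotted --set keys map directly onto values keys, which is how Helm normally interprets them.

```shell
# Write a Helm values file that mirrors the --set flags used in this guide.
# The file name is a hypothetical choice; override VALUES_FILE to change it.
VALUES_FILE="${VALUES_FILE:-cc-values.yaml}"

cat > "$VALUES_FILE" <<'EOF'
sandboxWorkloads:
  enabled: true
  mode: kata
  # defaultWorkload: vm-passthrough   # uncomment to configure every node
nfd:
  enabled: true
  nodefeaturerules: true
EOF

# Example usage:
#   helm install --wait --timeout 10m --generate-name \
#      -n gpu-operator --create-namespace \
#      nvidia/gpu-operator -f "$VALUES_FILE" --version=v26.3.1
```

A values file is easier to review and keep under version control than a long command line, and it avoids shell quoting issues with keys that contain brackets.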

Configuring GPU or NVSwitch Resource Types Name

By default, the NVIDIA GPU Operator creates one resource type for GPUs, nvidia.com/pgpu, and one for NVSwitches, nvidia.com/nvswitch. You can reference these names in your manifests to request GPU or NVSwitch resources for your workload. If you want to use different names, set the P_GPU_ALIAS or NVSWITCH_ALIAS environment variables in the Kata device plugin to your preferred names. In clusters where all GPUs are the same model, a single resource type is typically sufficient.

In heterogeneous clusters, where you have different GPU types on your nodes, you might want to use specific GPU types for your workload. To do this, specify an empty P_GPU_ALIAS environment variable in the Kata sandbox device plugin by adding the following to your GPU Operator installation: --set kataSandboxDevicePlugin.env[0].name=P_GPU_ALIAS and --set kataSandboxDevicePlugin.env[0].value="".

When this variable is set to "", the Kata device plugin creates GPU model-specific resource types, for example nvidia.com/GH100_H200_141GB, instead of the default nvidia.com/pgpu type. Use the exposed device resource types in pod specs by specifying respective resource limits.

Similarly, you can set NVSWITCH_ALIAS to "" to advertise model-specific NVSwitch resource types.
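
To illustrate how a pod spec requests a model-specific resource type through resource limits, the following minimal sketch writes an example manifest to a file. The pod name, file name, and image placeholder are hypothetical, the runtime class and resource name are taken from the examples in this guide, and a real Confidential Containers workload might need additional settings; Run a Sample Workload remains the reference for a complete manifest.

```shell
# Write an illustrative pod manifest that requests one model-specific GPU.
# POD_FILE, the pod name, and the image are placeholders for this sketch.
POD_FILE="${POD_FILE:-cc-gpu-example.yaml}"

cat > "$POD_FILE" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cc-gpu-example
spec:
  runtimeClassName: kata-qemu-nvidia-gpu-snp   # use kata-qemu-nvidia-gpu-tdx on Intel-based systems
  restartPolicy: Never
  containers:
    - name: app
      image: <your-image>
      resources:
        limits:
          nvidia.com/GH100_H200_141GB: 1   # replace with a resource type listed on your node
EOF

# Example usage after editing the placeholders:
#   kubectl apply -f "$POD_FILE"
```

With the default alias, the limit key would be nvidia.com/pgpu instead of the model-specific name.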

The following example installs the GPU Operator with both P_GPU_ALIAS and NVSWITCH_ALIAS configured:

$ helm install --wait --timeout 10m --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set sandboxWorkloads.enabled=true \
     --set sandboxWorkloads.mode=kata \
     --set nfd.enabled=true \
     --set nfd.nodefeaturerules=true \
     --set kataSandboxDevicePlugin.env[0].name=P_GPU_ALIAS \
     --set kataSandboxDevicePlugin.env[0].value="" \
     --set kataSandboxDevicePlugin.env[1].name=NVSWITCH_ALIAS \
     --set kataSandboxDevicePlugin.env[1].value="" \
     --version=v26.3.1

After installing the GPU Operator, you can view the GPU or NVSwitch resource types available on a node by running the following command:

$ kubectl get node $NODE_NAME -o json | grep nvidia.com

Note

The NODE_NAME environment variable was set in the Label Nodes section. If you want to view the resource types for a different node, you can update the NODE_NAME environment variable and run the command again.

Example Output:

"nvidia.com/GH100_H200_141GB": "1"

You should see the resource type information for the GPUs and NVSwitches on the node.
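
Because grep nvidia.com also matches node labels and annotations, the output can be narrowed to entries that look like resource counts. The following is a minimal sketch: nvidia_resource_counts is a hypothetical helper name, and it assumes resource quantities appear as quoted integers in the JSON, so any label with a purely numeric value would still be listed.

```shell
# Hypothetical helper: reads `kubectl get node <name> -o json` output on stdin
# and prints each distinct "nvidia.com/<resource>": "<count>" pair once.
nvidia_resource_counts() {
  # Keep only nvidia.com keys whose value is a quoted integer, then
  # de-duplicate because capacity and allocatable repeat the same entries.
  grep -o '"nvidia\.com/[^"]*": *"[0-9][0-9]*"' | sort -u
}

# Example usage:
#   kubectl get node "$NODE_NAME" -o json | nvidia_resource_counts
```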

Next Steps

  • Run a Sample Workload to verify your deployment.

  • To help manage the lifecycle of Kata Containers, install the Kata Lifecycle Manager. This Argo Workflows-based tool manages Kata Containers upgrades and day-two operations.