GPU Operator with Kata Containers

About the Operator with Kata Containers

Note

Technology Preview features are not supported in production environments and are not functionally complete. Technology Preview features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. These releases may not have any documentation, and testing is limited.

Kata Containers are similar to, but subtly different from, traditional containers such as Docker containers.

A traditional container packages software for user-space isolation from the host, but the container runs on the host and shares the operating system kernel with the host. Sharing the operating system kernel is a potential vulnerability.

A Kata container runs in a virtual machine on the host. The virtual machine has a separate operating system and operating system kernel. Hardware virtualization and a separate kernel provide improved workload isolation in comparison with traditional containers.

The NVIDIA GPU Operator works with the Kata container runtime. Kata uses a hypervisor, such as QEMU, to provide a lightweight virtual machine with a single purpose: to run a Kubernetes pod.

The following diagram shows the software components that Kubernetes uses to run a Kata container.

Kubelet → CRI → Kata Runtime → Lightweight QEMU VM → Lightweight Guest OS → Pod → Container

Software Components with Kata Container Runtime

NVIDIA supports Kata Containers by using Helm to run a daemon set that installs the Kata runtime and QEMU.

The daemon set runs the kata-deploy.sh script that performs the following actions on each node that is labeled to run Kata Containers:

  • Downloads an NVIDIA optimized Linux kernel image and initial RAM disk that provides the lightweight operating system for the virtual machines that run in QEMU. These artifacts are downloaded from the NVIDIA container registry, nvcr.io, on each worker node.

  • Configures each worker node with a runtime class, kata-qemu-nvidia-gpu.

About NVIDIA Kata Manager

When you configure the GPU Operator for Kata Containers, the Operator deploys NVIDIA Kata Manager as an operand.

The manager performs the following actions on each node that is labeled to run Kata Containers:

  • Configures containerd with the kata-qemu-nvidia-gpu runtime class.

  • Creates a CDI specification, /var/run/cdi/nvidia.com-pgpu.yaml, for each GPU on the node.

  • Loads the vhost-vsock and vhost-net Linux kernel modules.
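If you have host access to a worker node, you can confirm the last action with a quick check. This is a minimal sketch, not part of the product; it assumes a Linux host where loaded modules are listed in /proc/modules (module names appear there with underscores):

```shell
# Check whether the vhost modules that NVIDIA Kata Manager loads are present.
# /proc/modules lists loaded modules as vhost_vsock and vhost_net.
for m in vhost_vsock vhost_net; do
  if grep -q "^$m " /proc/modules 2>/dev/null; then
    echo "$m: loaded"
  else
    echo "$m: not loaded"
  fi
done
```

The script prints one status line per module, so it is safe to run on hosts where the modules are not yet loaded.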

Benefits of Using Kata Containers

The primary benefits of Kata Containers are as follows:

  • Running untrusted workloads in a container. The virtual machine provides a layer of defense against the untrusted code.

  • Limiting access to hardware devices such as NVIDIA GPUs. The virtual machine is provided access to specific devices. This approach ensures that the workload cannot access additional devices.

  • Transparent deployment of unmodified containers.

Limitations and Restrictions

  • GPUs are available to containers as a single GPU in passthrough mode only. Multi-GPU passthrough and vGPU are not supported.

  • Support is limited to initial installation and configuration only. Upgrade and configuration of existing clusters for Kata Containers is not supported.

  • Support for Kata Containers is limited to the implementation described on this page. The Operator does not support Red Hat OpenShift sandbox containers.

  • Uninstalling the GPU Operator or the NVIDIA Kata Manager does not remove the /opt/nvidia-gpu-operator/artifacts/runtimeclasses/ directory on the worker nodes.

  • NVIDIA supports the Operator and Kata Containers with the containerd runtime only.

Cluster Topology Considerations

You can configure all the worker nodes in your cluster for Kata Containers, or you can configure some nodes for Kata Containers and the others for traditional containers. Consider the following example.

Node A is configured to run traditional containers.

Node B is configured to run Kata Containers.

Node A receives the following software components:

  • NVIDIA Driver Manager for Kubernetes – to install the data-center driver.

  • NVIDIA Container Toolkit – to ensure that containers can access GPUs.

  • NVIDIA Device Plugin for Kubernetes – to discover and advertise GPU resources to kubelet.

  • NVIDIA DCGM and DCGM Exporter – to monitor GPUs.

  • NVIDIA MIG Manager for Kubernetes – to manage MIG-capable GPUs.

  • Node Feature Discovery – to detect CPU, kernel, and host features and label worker nodes.

  • NVIDIA GPU Feature Discovery – to detect NVIDIA GPUs and label worker nodes.

Node B receives the following software components:

  • NVIDIA Kata Manager for Kubernetes – to manage the NVIDIA artifacts such as the NVIDIA optimized Linux kernel image and initial RAM disk.

  • NVIDIA Sandbox Device Plugin – to discover and advertise the passthrough GPUs to kubelet.

  • NVIDIA VFIO Manager – to load the vfio-pci device driver and bind it to all GPUs on the node.

  • Node Feature Discovery – to detect CPU security features, NVIDIA GPUs, and label worker nodes.

Prerequisites

  • Your hosts are configured to enable hardware virtualization and Access Control Services (ACS). With some AMD CPUs and BIOSes, ACS might be grouped under Advanced Error Reporting (AER). You typically enable these features in the host BIOS.

  • Your hosts are configured to support IOMMU.

    If the output of ls -1 /sys/kernel/iommu_groups | wc -l is greater than 0, your host is configured for IOMMU.

    If a host is not configured or you are unsure, add the intel_iommu=on Linux kernel command-line argument. For most Linux distributions, you add the argument to the /etc/default/grub file:

    ...
    GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on modprobe.blacklist=nouveau"
    ...
    

    On Ubuntu systems, run sudo update-grub after making the change to configure the bootloader. On other systems, you might need to run sudo dracut after making the change. Refer to the documentation for your operating system. Reboot the host after configuring the bootloader.
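The IOMMU check above can be wrapped in a small script that reports the result in one step. A minimal sketch, assuming a Linux host with sysfs mounted at /sys:

```shell
# Count IOMMU groups; a count greater than 0 means the host is configured
# for IOMMU. The directory is absent or empty when IOMMU is disabled.
count=$(ls -1 /sys/kernel/iommu_groups 2>/dev/null | wc -l)
echo "IOMMU groups: $count"
if [ "$count" -gt 0 ]; then
  echo "IOMMU is enabled"
else
  echo "IOMMU is not configured; add the kernel argument and reboot"
fi
```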

  • You have a Kubernetes cluster and you have cluster administrator privileges.

Overview of Installation and Configuration

Installing and configuring your cluster to support the NVIDIA GPU Operator with Kata Containers involves the following steps:

  1. Label the worker nodes that you want to use with Kata Containers.

    This step ensures that you can continue to run traditional container workloads with GPUs or vGPU on some nodes in your cluster. Alternatively, you can set the default sandbox workload to vm-passthrough to run Kata Containers on all worker nodes.

  2. Install the Kata Deploy Helm chart.

    This step runs the kata-deploy.sh script that installs the Kata Containers runtime on each labeled node.

  3. Install the NVIDIA GPU Operator.

    You install the Operator and specify options to deploy the operands that are required for Kata Containers.

After installation, you can run a sample workload.

Kata Deploy Helm Chart Customizations

The Kata Deploy Helm chart supports the following configurable values:

  • kataDeploy.allowedHypervisorAnnotations – Specifies the hypervisor annotations to enable in the Kata configuration file on each node. Specify a space-separated string of values such as enable_iommu initrd kernel. Default: None.

  • kataDeploy.createRuntimeClasses – When set to true, the kata-deploy.sh script installs the runtime classes on the nodes. Default: false.

  • kataDeploy.createDefaultRuntimeClass – When set to true, the kata-deploy.sh script sets the runtime class specified in the defaultShim field as the default Kata runtime class. Default: false.

  • kataDeploy.debug – When set to true, the kata-deploy.sh script enables debugging and a debug console in the Kata configuration file on each node. Default: false.

  • kataDeploy.defaultShim – Specifies the shim to set as the default Kata runtime class. This field is ignored unless you specify createDefaultRuntimeClass: true. Default: None.

  • kataDeploy.imagePullPolicy – Specifies the image pull policy for the kata-deploy container. Default: Always.

  • kataDeploy.k8sDistribution – Specifies the Kubernetes platform. The Helm chart uses the value to set the platform-specific location of the containerd configuration file. Supported values are k8s, k3s, rke2, and k0s. Default: k8s.

  • kataDeploy.repository – Specifies the image repository for the kata-deploy container. Default: nvcr.io/nvidia/cloud-native.

  • kataDeploy.shims – Specifies the shim binaries to install on each node. Specify a space-separated string of values. Default: qemu-nvidia-gpu.

  • kataDeploy.version – Specifies the version of the kata-deploy container to run. Default: latest.
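As an illustration, these values can be collected in a Helm values file instead of repeated --set flags. This is a minimal sketch, not a recommended configuration; the keys mirror the parameters documented above and the values are examples only:

```yaml
# Hypothetical values.yaml sketch for the kata-deploy chart.
kataDeploy:
  createRuntimeClasses: true        # install the runtime classes on each node
  createDefaultRuntimeClass: true   # make defaultShim the default runtime class
  defaultShim: qemu-nvidia-gpu      # shim to set as the default runtime class
  k8sDistribution: k8s              # locates the containerd configuration file
  imagePullPolicy: Always
```

You would pass such a file with helm install -f values.yaml in place of the --set flags shown in the installation steps.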

Install the Kata Deploy Helm Chart

Perform the following steps to install the Helm chart:

  1. Label the nodes to run virtual machines in containers. Label only the nodes that you want to run with Kata Containers:

    $ kubectl label node <node-name> nvidia.com/gpu.workload.config=vm-passthrough
    
  2. Add and update the NVIDIA Helm repository:

    $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
       && helm repo update
    
  3. Specify at least the following options when you install the chart:

    $ helm install --wait --generate-name \
       -n kube-system \
       nvidia/kata-deploy \
       --set kataDeploy.createRuntimeClasses=true
    
  4. Optional: Verify the installation.

    • Confirm the kata-deploy containers are running:

      $ kubectl get pods -n kube-system -l name=kata-deploy
      
    • Confirm the runtime class is installed:

      $ kubectl get runtimeclass kata-qemu-nvidia-gpu
      

      Example Output

      NAME                   HANDLER                AGE
      kata-qemu-nvidia-gpu   kata-qemu-nvidia-gpu   23s
      

Install the NVIDIA GPU Operator

Procedure

Perform the following steps to install the Operator for use with Kata Containers:

  1. Add and update the NVIDIA Helm repository:

    $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
       && helm repo update
    
  2. Specify at least the following options when you install the Operator. If you want to run Kata Containers by default on all worker nodes, also specify --set sandboxWorkloads.defaultWorkload=vm-passthrough.

    $ helm install --wait --generate-name \
       -n gpu-operator --create-namespace \
       nvidia/gpu-operator \
       --set sandboxWorkloads.enabled=true \
       --set kataManager.enabled=true \
       --set kataManager.config.runtimeClasses=null
    

    Example Output

    NAME: gpu-operator
    LAST DEPLOYED: Tue Jul 25 19:19:07 2023
    NAMESPACE: gpu-operator
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    

Verification

  1. Verify that the Kata Manager and VFIO Manager operands are running:

    $ kubectl get pods -n gpu-operator
    

    Example Output

    NAME                                                         READY   STATUS      RESTARTS   AGE
    gpu-operator-57bf5d5769-nb98z                                1/1     Running     0          6m21s
    gpu-operator-node-feature-discovery-master-b44f595bf-5sjxg   1/1     Running     0          6m21s
    gpu-operator-node-feature-discovery-worker-lwhdr             1/1     Running     0          6m21s
    nvidia-kata-manager-bw5mb                                    1/1     Running     0          3m36s
    nvidia-sandbox-device-plugin-daemonset-cr4s6                 1/1     Running     0          2m37s
    nvidia-sandbox-validator-9wjm4                               1/1     Running     0          2m37s
    nvidia-vfio-manager-vg4wp                                    1/1     Running     0          3m36s
    
  2. Verify that the kata-qemu-nvidia-gpu runtime class is available:

    $ kubectl get runtimeclass
    

    Example Output

    NAME                       HANDLER                    AGE
    kata-qemu-nvidia-gpu       kata-qemu-nvidia-gpu       96s
    nvidia                     nvidia                     97s
    
  3. Optional: If you have host access to the worker node, confirm that the host uses the vfio-pci device driver for GPUs:

    $ lspci -nnk -d 10de:
    

    Example Output

    65:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [A10] [10de:2236] (rev a1)
            Subsystem: NVIDIA Corporation GA102GL [A10] [10de:1482]
            Kernel driver in use: vfio-pci
            Kernel modules: nvidiafb, nouveau
    

Run a Sample Workload

A pod specification for a Kata container requires the following:

  • A Kata runtime class.

  • A passthrough GPU resource.

  1. Determine the passthrough GPU resource names:

    $ kubectl get nodes -l nvidia.com/gpu.present -o json | \
      jq '.items[0].status.allocatable |
        with_entries(select(.key | startswith("nvidia.com/"))) |
        with_entries(select(.value != "0"))'
    

    Example Output

    {
       "nvidia.com/GA102GL_A10": "1"
    }
    
  2. Create a file, such as cuda-vectoradd-kata.yaml, like the following example:

    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vectoradd-kata
      annotations:
        cdi.k8s.io/gpu: "nvidia.com/pgpu=0"
        io.katacontainers.config.hypervisor.default_memory: "16384"
    spec:
      runtimeClassName: kata-qemu-nvidia-gpu
      restartPolicy: OnFailure
      containers:
      - name: cuda-vectoradd
        image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
        resources:
          limits:
            "nvidia.com/GA102GL_A10": 1
    

    The io.katacontainers.config.hypervisor.default_memory annotation starts the VM with 16 GB of memory. Modify the value to accommodate your workload.

  3. Create the pod:

    $ kubectl apply -f cuda-vectoradd-kata.yaml
    
  4. View the logs from the pod:

    $ kubectl logs -n default cuda-vectoradd-kata
    

    Example Output

    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED
    Done
    
  5. Delete the pod:

    $ kubectl delete -f cuda-vectoradd-kata.yaml
    

Troubleshooting Workloads

If the sample workload does not run, confirm that you labeled the nodes to run virtual machines in containers:

$ kubectl get nodes -l nvidia.com/gpu.workload.config=vm-passthrough

Example Output

NAME               STATUS   ROLES    AGE   VERSION
kata-worker-1      Ready    <none>   10d   v1.27.3
kata-worker-2      Ready    <none>   10d   v1.27.3
kata-worker-3      Ready    <none>   10d   v1.27.3

Optional: Configuring a GPU Resource Alias

By default, GPU resources are exposed on nodes with a name like nvidia.com/GA102GL_A10. You can configure the NVIDIA Sandbox Device Plugin so that nodes also expose GPUs with an alias like nvidia.com/pgpu.

  1. Patch the cluster policy with a command like the following example:

    $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type=merge \
        -p '{"spec": {"sandboxDevicePlugin": {"env":[{"name": "P_GPU_ALIAS", "value":"pgpu"}]}}}'
    

    The sandbox device plugin daemon set pods restart.

  2. Optional: Describe a node to confirm the alias:

    $ kubectl describe node <node-name>
    

    Partial Output

    ...
    Allocatable:
      cpu:                     16
      ephemeral-storage:       1922145660Ki
      hugepages-1Gi:           0
      hugepages-2Mi:           0
      memory:                  65488292Ki
      nvidia.com/GA102GL_A10:  0
      nvidia.com/pgpu:         1
    

About the Pod Annotation

The cdi.k8s.io/gpu: "nvidia.com/pgpu=0" annotation is used when the pod sandbox is created. The annotation ensures that the virtual machine created by the Kata runtime is created with the correct PCIe topology so that GPU passthrough succeeds.

The annotation refers to a Container Device Interface (CDI) device, nvidia.com/pgpu=0. The pgpu indicates a passthrough GPU and the 0 indicates the device index. The index is defined by the order in which the GPUs are enumerated on the PCI bus. The index does not correlate to a CUDA device index.

The NVIDIA Kata Manager creates a CDI specification on the GPU nodes. The file includes a device entry for each passthrough device.

The following sample /var/run/cdi/nvidia.com-pgpu.yaml file shows one GPU that is bound to the vfio-pci driver:

cdiVersion: 0.5.0
containerEdits: {}
devices:
- containerEdits:
    deviceNodes:
    - path: /dev/vfio/10
  name: "0"
kind: nvidia.com/pgpu
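To see which device indices are available on a node, you can list the name entries directly from the spec file. A quick sketch, assuming the documented path; it prints a notice if the file does not exist on the host where you run it:

```shell
# List CDI device names (the indices used in the cdi.k8s.io/gpu annotation)
# from the spec file that NVIDIA Kata Manager writes on each GPU node.
spec=/var/run/cdi/nvidia.com-pgpu.yaml
if [ -f "$spec" ]; then
  grep -E '^[[:space:]]*name:' "$spec"
else
  echo "CDI spec not found at $spec"
fi
```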