Accelerating workloads with NVIDIA GPUs on Red Hat Device Edge#

Introduction#

Red Hat Device Edge delivers an enterprise-ready distribution of the Red Hat-led open source community project MicroShift with Red Hat Enterprise Linux (RHEL) and Red Hat Ansible Automation Platform.

MicroShift is a Kubernetes implementation derived from OpenShift, focusing on a minimal footprint for single-node deployments in resource-constrained locations. MicroShift became generally available (GA) with release 4.14, and at the time of this writing, the current release is 4.21. This platform enables you to deploy bare metal, virtual, containerized, or Kubernetes workloads to edge environments, with support for streamlined over-the-air updates for managed RHEL devices in hard-to-service locations.

This guide provides procedures to enable workloads to use NVIDIA GPUs on an x86 system running Red Hat Device Edge. The procedures documented here are validated for production use and are supported through NVIDIA Enterprise Support agreement and Red Hat Production Support Terms of Service.

Note

This documentation supports NVIDIA GPU enablement on Red Hat Device Edge deployments using RPM-based installations: Standard RHEL installations using RPM packages.

Document overview

This guide is organized into the following sections:

Use the table of contents above to navigate directly to specific sections.

Architecture Overview#

The NVIDIA GPU integration with Red Hat Device Edge consists of multiple layers working together to expose GPU resources to containerized workloads running in MicroShift.

NVIDIA GPU with Red Hat Device Edge Architecture Stack

Architecture Stack: Hardware → OS → Drivers → Container Runtime → Kubernetes → Device Plugin → Applications#

The component stack flows from hardware to application:

  • Hardware Layer: NVIDIA GPU (x86 PCIe device)

  • Operating System Layer: Red Hat Enterprise Linux 9.x

  • Driver Layer: NVIDIA GPU Driver (R580 Production Branch) - kernel modules, libraries, and management tools

  • Container Runtime Layer: CRI-O + NVIDIA Container Toolkit - configures containers for GPU access

  • Kubernetes Layer: MicroShift - orchestrates container workloads

  • Device Plugin Layer: NVIDIA Device Plugin - exposes GPU resources to Kubernetes

  • Application Layer: GPU-accelerated Kubernetes pods requesting nvidia.com/gpu resources

Note

Why Device Plugin instead of GPU Operator? The NVIDIA Device Plugin is preferred for Device Edge/MicroShift because it provides minimal resource overhead. The GPU Operator, used in OpenShift, is optimized for multi-node clusters with dynamic operator-based management, which adds unnecessary complexity for edge deployments.

Bill of Materials#

This section provides the validated software version combinations for this guide.

Supported RHEL and MicroShift combinations

Note

Check Latest Supportability Status: Before deployment, customers and partners should verify the latest supportability status and version compatibility by consulting:

The version combinations listed in this section reflect the validated configurations at the time of publication and may not represent all currently supported combinations.

Supported RHEL and MicroShift Versions#

RHEL Version

MicroShift Versions

RHEL 9.6

4.21, 4.20, 4.19

RHEL 9.4

4.18, 4.17, 4.16

RHEL 9.2

4.15 (EOL), 4.14 (EUS 2 only)

NVIDIA component versions

Note

Check Latest NVIDIA Component Supportability Status: Before deployment, customers and partners should verify the latest supportability status and version compatibility by consulting:

The version combinations listed in this section reflect the validated combinations at the time of publication and may not represent all currently supported combinations. For detailed version compatibility information, see the Version compatibility section.

NVIDIA Component Versions#

Component

Recommended Version

Notes

NVIDIA GPU Driver

R580 (Production Branch)

Current stable Production Branch. R575, R565, R560, R550, R525 are EOL

NVIDIA Container Toolkit

v1.17.9+

Required: >= 1.7.0 (>= 1.11.0 for Tegra systems). Version 1.17.9+ is validated for production use with Device Plugin 0.18.0+

NVIDIA Device Plugin

0.18.0+ (Helm, recommended) or Latest (static manifests)

Helm charts are NVIDIA’s recommended method for production deployments. Static manifests are available as an alternative

Prerequisites#

In this section: This section covers prerequisites for RPM-based installations, including version compatibility requirements, repository access, GPU verification, and RHEL release setting.

Version compatibility requirements

Important

Red Hat Device Edge requires specific RHEL and MicroShift version combinations that align with RHEL Extended Update Support (EUS) releases. Not all RHEL 9.x versions are compatible with all MicroShift versions. Always verify the latest support policy and compatibility matrix before deployment to ensure your RHEL and MicroShift versions are supported together and within their lifecycle phases. For the current compatible and supported versions of MicroShift, including related lifecycle dates, refer to:

Common prerequisites

RPM-based installations require the following prerequisites:

  • MicroShift Installation: Install MicroShift on your Red Hat Enterprise Linux 9.x machine:

  • Repository Access: Access to the following Red Hat repositories on the target system:

    • rhel-9-for-x86_64-baseos-rpms

    • rhel-9-for-x86_64-appstream-rpms

    • MicroShift repositories (as documented in the MicroShift installation guide)

  • GPU Verification: Verify an NVIDIA GPU is installed on the target device:

    $ lspci -nnv | grep -i nvidia
    

    Example Output

    17:00.0 3D controller [0302]: NVIDIA Corporation GA100GL [A30 PCIe] [10de:20b7] (rev a1)
            Subsystem: NVIDIA Corporation Device [10de:1532]
    
  • Setting the RHEL Release Version: Setting the Red Hat Enterprise Linux (RHEL) release version is critical to prevent unintended upgrades into an unsupported configuration. Compatibility is essential, as RPM dependency errors result if a MicroShift update is incompatible with the version of RHEL. Set the target system to a specific RHEL minor release before installing packages to prevent unintended upgrades:

    $ sudo subscription-manager release --set=9.6
    

    Note

    It is recommended to set the release before installing packages to avoid dependency issues. For MicroShift 4.20 or 4.19, use --set=9.6. For MicroShift 4.18, 4.17, or 4.16, use --set=9.4. For MicroShift 4.14 (EUS 2 only), use --set=9.2.

Installation procedures#

In this section: This section provides complete installation steps for the NVIDIA GPU driver, Container Toolkit, and Device Plugin on RPM-based RHEL installations. Components are installed sequentially on a running system.

This guide provides installation procedures for enabling NVIDIA GPU support on Red Hat Device Edge using RPM-based installations.

Installing via RPM packages#

In this section: This section provides sequential installation steps for the NVIDIA GPU driver (Step 1), Container Toolkit (Step 2), and Device Plugin (Step 3) on a running RPM-based RHEL system. Each step includes verification procedures.

About NVIDIA drivers

NVIDIA provides precompiled drivers in RPM repositories that implement the modularity mechanism. This approach is recommended for production deployments as it avoids the need for compiler toolchains and Extra Packages for Enterprise Linux (EPEL) dependencies.

Note

EPEL and DKMS are not supported for production use. While EPEL can provide compiler toolchains and DKMS (Dynamic Kernel Module Support) needed to build NVIDIA drivers from source, both are community-supported and not part of Red Hat Enterprise Linux. DKMS automatically rebuilds kernel modules when new kernels are installed, but, as documented by this KB 1132653, DKMS is not supplied or supported by Red Hat. The precompiled driver approach documented here avoids these risks by using packages from NVIDIA’s official RPM repositories. For more information about EPEL support policies, see How to use Extra Packages for Enterprise Linux (EPEL).

For more information about modularity streams and driver deployment, see Streamlining NVIDIA Driver Deployment on RHEL 8 with Modularity Streams.

Important

Use a supported Production Branch driver version. For current Production Branch information and EOL driver branches, see the Bill of Materials section. For the latest supported driver versions, refer to NVIDIA Datacenter Drivers documentation.

Step 1: Installing the NVIDIA GPU driver

This procedure applies to standard RHEL installations using RPM packages on a new system with an NVIDIA GPU.

  1. Add the NVIDIA CUDA repository:

    $ sudo dnf config-manager --add-repo=https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
    
  2. View available driver modules:

    $ dnf module list nvidia-driver
    
  3. Install the driver. For production deployments, use the standard NVIDIA-signed driver:

    $ sudo dnf module install nvidia-driver:580
    

    Or for the latest version:

    $ sudo dnf module install nvidia-driver:latest
    

    Note

    You can list available driver versions using:

    $ dnf module list nvidia-driver
    

    Note

    Open GPU drivers (tech preview): Starting with RHEL 9.5, Red Hat signed open GPU drivers are available for testing on systems with Secure Boot enabled. These require the NVIDIA CUDA preview repository and codeready-builder-for-rhel-9-x86_64-rpms. Open GPU drivers should only be used for testing, not production. See NVIDIA Open GPU Datacenter Drivers for RHEL9 signed by Red Hat for details.

  4. (Optional) If you require fabric manager and NSCQ support, install the additional packages:

    $ sudo dnf install nvidia-fabric-manager libnvidia-nscq-580
    

    Note

    Replace 580 with your installed driver version if different.

  5. After installing the driver, disable the nouveau driver because it conflicts with the NVIDIA driver:

    $ echo 'blacklist nouveau' | sudo tee /etc/modprobe.d/nouveau-blacklist.conf
    
  6. Update initramfs:

    $ sudo dracut --force
    
  7. Enable the nvidia-persistenced service to reduce driver initialization time:

    $ sudo systemctl enable nvidia-persistenced.service
    

    If you installed fabric manager, also enable it:

    $ sudo systemctl enable nvidia-fabricmanager.service
    
  8. Reboot the machine:

    $ sudo systemctl reboot
    
  9. After the machine boots, verify that the NVIDIA drivers are installed properly:

    $ nvidia-smi
    

    Example Output

    Thu Jun 22 14:29:53 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 580.xx.xx   Driver Version: 580.xx.xx   CUDA Version: 12.x     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A30          Off  | 00000000:17:00.0 Off |                    0 |
    | N/A   29C    P0    35W / 165W |      0MiB / 24576MiB |     25%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    

Step 2: Installing the NVIDIA Container Toolkit

The NVIDIA Container Toolkit enables users to build and run GPU accelerated containers. The toolkit includes a container runtime library and utilities to automatically configure containers to leverage NVIDIA GPUs. You must install it to enable the container runtime to transparently configure the NVIDIA GPUs for the pods deployed in MicroShift.

The NVIDIA container toolkit supports the distributions listed in the NVIDIA Container Toolkit repository.

Important

The NVIDIA Device Plugin requires NVIDIA Container Toolkit >= 1.7.0 (>= 1.11.0 for Tegra-based systems). The recommended version 1.17.9+ exceeds this minimum requirement. For complete compatibility requirements, see the Version compatibility section.

  1. Add the NVIDIA Container Toolkit repository:

    $ curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
        sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
    

    Note

    This uses the stable repository, which is recommended for production deployments.

  2. Install the NVIDIA Container Toolkit:

    $ sudo dnf install nvidia-container-toolkit -y
    
  3. Set the SELinux boolean to allow containers to use devices:

    $ sudo setsebool -P container_use_devices on
    

    Note

    With Container Toolkit v1.17.9+ and RHEL 9.x, the container_use_devices SELinux boolean is sufficient for GPU workloads. Custom SELinux modules are no longer required for standard deployments. If you encounter SELinux permission errors in specific environments, please raise a support case to NVIDIA.

  4. Configure CRI-O to use the NVIDIA runtime. This creates a drop-in configuration file that ensures the NVIDIA runtime is used:

    $ sudo nvidia-ctk runtime configure --runtime=crio --set-as-default \
        --drop-in-config=/etc/crio/crio.conf.d/99-nvidia.conf
    

    Important

    In MicroShift 4.20+, the MicroShift configuration file is already named 10-microshift.conf, ensuring it loads before 99-nvidia.conf. For MicroShift 4.14 and 4.15, the configuration files may not use numerical prefixes. If you encounter issues, rename the MicroShift configuration files to ensure proper load order:

    $ if [ -f /etc/crio/crio.conf.d/microshift.conf ]; then
        sudo mv /etc/crio/crio.conf.d/microshift.conf /etc/crio/crio.conf.d/10-microshift.conf
      fi
    $ if [ -f /etc/crio/crio.conf.d/microshift-ovn.conf ]; then
        sudo mv /etc/crio/crio.conf.d/microshift-ovn.conf /etc/crio/crio.conf.d/11-microshift-ovn.conf
      fi
    
  5. Update the runtime order in the NVIDIA Container Runtime configuration to prioritize crun:

    $ sudo sed -i 's/^runtimes =.*$/runtimes = ["crun", "docker-runc", "runc"]/g' \
        /etc/nvidia-container-runtime/config.toml
    

    Note

    This ensures crun is used as the primary runtime, which is the default for MicroShift and provides better performance.

  6. Restart the CRI-O service to apply the configuration:

    $ sudo systemctl restart crio
    
  7. Verify the Container Toolkit installation:

    $ nvidia-ctk --version
    
    $ systemctl status crio
    

Step 3: Installing the NVIDIA Device Plugin

To enable MicroShift to allocate GPU resources to pods, deploy the NVIDIA Device Plugin. The plugin runs as a daemon set that provides the following features:

  • Exposes the number of GPUs on each node of your cluster.

  • Keeps track of the health of your GPUs.

  • Runs GPU-enabled containers in your Kubernetes cluster.

You can install the NVIDIA Device Plugin using either Helm charts (recommended for production) or static YAML manifests. Both methods are supported, with Helm charts being NVIDIA’s preferred method for production deployments due to better configuration flexibility, especially for features like time-slicing.

Method 1: Installing with Helm Charts (Recommended for Production)

The Helm chart installation method is NVIDIA’s recommended approach for production deployments. It provides better configuration flexibility, easier updates, and support for advanced features like time-slicing. This method requires additional configuration for Pod Security Standards on MicroShift.

Note

The Helm chart installation on MicroShift requires manual Security Context Constraint (SCC) configuration. A pull request is currently under discussion to improve the Helm chart integration with MicroShift 4.15+ by adding better Pod Security Standards support. Monitor this PR for future improvements.

  1. Install Helm on your system:

    $ curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
    

    Or download a specific version manually:

    $ HELM_VERSION="v3.15.0"
    $ curl -LO "https://get.helm.sh/helm-${HELM_VERSION}-linux-amd64.tar.gz"
    $ tar -zxvf "helm-${HELM_VERSION}-linux-amd64.tar.gz"
    $ sudo mv linux-amd64/helm /usr/local/bin/helm
    $ rm -rf linux-amd64 "helm-${HELM_VERSION}-linux-amd64.tar.gz"
    

    Verify the installation:

    $ helm version
    
  2. Add the NVIDIA Device Plugin Helm repository:

    $ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
    $ helm repo update
    
  3. Create a namespace for the device plugin:

    $ oc create namespace nvidia-device-plugin
    
  4. Configure Pod Security Standards for the namespace. The device plugin requires privileged access:

    $ oc label namespace nvidia-device-plugin \
        pod-security.kubernetes.io/enforce=privileged \
        pod-security.kubernetes.io/enforce-version=latest \
        pod-security.kubernetes.io/warn=privileged \
        pod-security.kubernetes.io/warn-version=latest \
        pod-security.kubernetes.io/audit=privileged \
        pod-security.kubernetes.io/audit-version=latest --overwrite
    
  5. Create a ConfigMap for the device plugin configuration. For basic GPU allocation, create a minimal configuration:

    $ cat << EOF > nvdp-configmap.yaml
    version: v1
    flags:
      migStrategy: "none"
      failOnInitError: true
      nvidiaDriverRoot: "/"
      plugin:
        passDeviceSpecs: false
        deviceListStrategy: envvar
        deviceIDStrategy: uuid
    EOF
    

    For time-slicing configuration (allowing multiple pods to share a GPU), use:

    $ cat << EOF > nvdp-configmap.yaml
    version: v1
    flags:
      migStrategy: "none"
      failOnInitError: true
      nvidiaDriverRoot: "/"
      plugin:
        passDeviceSpecs: false
        deviceListStrategy: envvar
        deviceIDStrategy: uuid
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
    EOF
    
  6. Create a values file for Helm with security context configurations:

    $ cat << EOF > nvdp-values.yaml
    podSecurityContext:
      runAsNonRoot: false
      seccompProfile:
        type: RuntimeDefault
    
    securityContext:
      allowPrivilegeEscalation: true
      runAsNonRoot: false
      privileged: true
      seccompProfile:
        type: RuntimeDefault
    
    gfd:
      enabled: true
      securityContext:
        privileged: true
        runAsNonRoot: false
        runAsUser: 0
    
    nfd:
      enableNodeFeatureApi: true
      worker:
        securityContext:
          privileged: true
          runAsNonRoot: false
          runAsUser: 0
          allowPrivilegeEscalation: true
          readOnlyRootFilesystem: true
          capabilities:
            drop: ["ALL"]
    EOF
    
  7. Grant Security Context Constraint (SCC) permissions to the service accounts:

    $ oc adm policy add-scc-to-user privileged -z nvdp-node-feature-discovery-worker -n nvidia-device-plugin
    $ oc adm policy add-scc-to-user privileged -z nvdp-nvidia-device-plugin-service-account -n nvidia-device-plugin
    
  8. Install the device plugin using Helm:

    $ helm upgrade -i nvdp nvdp/nvidia-device-plugin -n nvidia-device-plugin --version 0.18.0 \
        --set-file=config.map.config=nvdp-configmap.yaml \
        -f nvdp-values.yaml
    
  9. Verify the installation:

    $ oc get pod -n nvidia-device-plugin
    

    Example Output

    NAME                                                    READY   STATUS    RESTARTS   AGE
    nvdp-node-feature-discovery-gc-6476cc6bf4-nmtf8         1/1     Running   0          41s
    nvdp-node-feature-discovery-master-58788687cc-4rfsk     1/1     Running   0          41s
    nvdp-node-feature-discovery-worker-8rvk6                1/1     Running   0          41s
    nvdp-nvidia-device-plugin-gpu-feature-discovery-qsnv7   2/2     Running   0          39s
    nvdp-nvidia-device-plugin-ll495                         2/2     Running   0          39s
    
  10. Verify that the node exposes the nvidia.com/gpu resources:

    $ oc get node -o json | jq -r '.items[0].status.capacity'
    

Note

The Helm chart method is particularly useful when you need to:

  • Configure time-slicing for GPU sharing

  • Easily update or uninstall the device plugin

  • Use dynamic configuration without modifying static YAML files

This is NVIDIA’s recommended method for production deployments.

Method 2: Installing with Static Manifests (Alternative)

The static manifest method deploys the device plugin using YAML files placed in MicroShift’s manifests directory. This method is available as an alternative to Helm charts for simpler deployments that don’t require advanced configuration features.

  1. Create the manifests folder:

    $ sudo mkdir -p /etc/microshift/manifests.d/nvidia-device-plugin
    
  2. The device plugin runs in privileged mode, so you need to isolate it from other workloads by running it in its own namespace, nvidia-device-plugin. To add the plugin to the manifests deployed by MicroShift at start time, download the configuration file and save it at /etc/microshift/manifests.d/nvidia-device-plugin/nvidia-device-plugin.yml:

    $ curl -s -L https://gitlab.com/nvidia/kubernetes/device-plugin/-/raw/main/deployments/static/nvidia-device-plugin-privileged-with-service-account.yml | \
        sudo tee /etc/microshift/manifests.d/nvidia-device-plugin/nvidia-device-plugin.yml
    

    Note

    The device plugin repository has moved from GitHub to GitLab. The URL above reflects the current location.

  3. The resources are not created automatically even though the files exist. You need to add them to the kustomize configuration. Do this by adding a single kustomization.yaml file in the manifests.d/nvidia-device-plugin folder that references all the resources you want to create:

    $ cat <<EOF | sudo tee /etc/microshift/manifests.d/nvidia-device-plugin/kustomization.yaml
    ---
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
      - nvidia-device-plugin.yml
    EOF
    
  4. Restart the MicroShift service so that it creates the resources:

    $ sudo systemctl restart microshift
    
  5. After MicroShift restarts, verify that the pod is running in the nvidia-device-plugin namespace:

    $ oc get pod -n nvidia-device-plugin
    

    Example Output

    NAME                                                    READY   STATUS    RESTARTS   AGE
    nvidia-device-plugin-daemonset-jx8s8                    1/1     Running   0          1m
    
  6. Verify in the log that it has registered itself as a device plugin for the nvidia.com/gpu resources:

    $ oc logs -n nvidia-device-plugin nvidia-device-plugin-daemonset-jx8s8
    

    Example Output

    [...]
    2023/06/22 14:25:38 Retreiving plugins.
    2023/06/22 14:25:38 Detected NVML platform: found NVML library
    2023/06/22 14:25:38 Detected non-Tegra platform: /sys/devices/soc0/family file not found
    2023/06/22 14:25:38 Starting GRPC server for 'nvidia.com/gpu'
    2023/06/22 14:25:38 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
    2023/06/22 14:25:38 Registered device plugin for 'nvidia.com/gpu' with Kubelet
    
  7. You can also verify that the node exposes the nvidia.com/gpu resources in its capacity:

    $ oc get node -o json | jq -r '.items[0].status.capacity'
    

    Example Output

    {
      "cpu": "48",
      "ephemeral-storage": "142063152Ki",
      "hugepages-1Gi": "0",
      "hugepages-2Mi": "0",
      "memory": "196686216Ki",
      "nvidia.com/gpu": "1",
      "pods": "250"
    }
    

Note

See also: For support information, see Supportability and Compatibility.

Verifying GPU workloads#

You can run a test workload to verify that the configuration is correct. A simple workload is the CUDA vectorAdd program that NVIDIA provides in a container image.

  1. Create a test namespace:

    $ oc create namespace test
    
  2. Create a file, such as pod-cuda-vector-add.yaml, with a pod specification. Note the spec.containers[0].resources.limits field where the nvidia.com/gpu resource specifies a value of 1.

    $ cat << EOF > pod-cuda-vector-add.yaml
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: test-cuda-vector-add
      namespace: test
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-vector-add
        image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8"
        resources:
          limits:
            nvidia.com/gpu: 1
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
          runAsNonRoot: true
          seccompProfile:
            type: "RuntimeDefault"
    EOF
    
  3. Create the pod:

    $ oc apply -f pod-cuda-vector-add.yaml
    
  4. Verify the pod log has found a CUDA device:

    $ oc logs -n test test-cuda-vector-add
    

    Example Output

    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED
    Done
    
  5. Undeploy the pods in the pod-cuda-vector-add.yaml file:

    $ oc delete -f pod-cuda-vector-add.yaml
    
  6. Delete the test namespace:

    $ oc delete ns test
    

Supportability and Compatibility#

In this section: This section covers support policies from NVIDIA and Red Hat, version compatibility requirements, driver lifecycle information, and additional resources for troubleshooting and further information.

NVIDIA support

The NVIDIA Device Plugin (see NGC catalog), Container Toolkit, and GPU drivers are supported by NVIDIA through their Enterprise Support program. For production deployments, ensure you have an active NVIDIA Enterprise Support agreement.

Red Hat support

Red Hat supports the integration of NVIDIA components with Red Hat Device Edge and MicroShift within the scope of Red Hat’s third-party software support policies. For issues related to NVIDIA components (drivers, container toolkit, device plugin), contact NVIDIA Enterprise Support. For Red Hat components (MicroShift, RHEL, CRI-O), contact Red Hat Support. For integration issues, Red Hat and NVIDIA can collaborate through TSANet when both support agreements are active.

Driver lifecycle

Always use supported Production Branch driver versions. End-of-Life (EOL) driver branches should not be used in production. For current Production Branch information, see the Bill of Materials section. Refer to NVIDIA’s driver support matrix for current supported versions.

Version compatibility

The NVIDIA Device Plugin has the following prerequisites, as documented in the NVIDIA k8s-device-plugin repository:

  • NVIDIA driver version: NVIDIA drivers ~= 384.81 or newer. For production deployments, use a supported Production Branch driver (R580 recommended). See NVIDIA’s driver support matrix for current supported versions and EOL information.

  • NVIDIA Container Toolkit version: NVIDIA Container Toolkit >= 1.7.0 is required (>= 1.11.0 for integrated GPUs on Tegra-based systems). Version 1.17.9+ is recommended for production use. The Container Toolkit must be installed and configured with nvidia-container-runtime as the default low-level runtime.

  • NVIDIA Device Plugin version: Version 0.18.0+ is recommended for Helm deployments. The Device Plugin is compatible with Kubernetes >= 1.10. See the Device Plugin NGC catalog for available versions and release notes.

  • MicroShift version: Use MicroShift 4.20 or 4.19 with RHEL 9.6 (recommended for new deployments). MicroShift is based on Kubernetes and meets the Device Plugin’s Kubernetes version requirement. See the MicroShift system requirements for supported RHEL/MicroShift combinations.

  • RHEL version: Must align with MicroShift version requirements. RHEL 9.6 is recommended for MicroShift 4.20. See the Bill of Materials section for supported combinations.

Compatibility verification checklist:

Before deployment, verify:

  1. Your RHEL minor release matches your MicroShift version (see Bill of Materials section)

  2. Your NVIDIA driver version is >= 384.81 and a supported Production Branch (not EOL)

  3. Your Container Toolkit version is >= 1.7.0 (>= 1.11.0 for Tegra systems) and nvidia-container-runtime is configured as the default low-level runtime

  4. Your Device Plugin version is compatible with Kubernetes >= 1.10 (MicroShift meets this requirement)

  5. All components are within their support lifecycle

Additional resources

Red Hat documentation

Red Hat support and knowledge base

NVIDIA documentation and resources

Support collaboration