Troubleshooting#
Troubleshooting with Docker#
Generating Debugging Logs#
For most common issues, you can generate debugging logs to help identify the root cause of the problem. To generate debug logs:

1. Edit your runtime configuration under /etc/nvidia-container-runtime/config.toml and uncomment the debug=... line.
2. Run your container again to reproduce the issue and generate the logs.
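As a rough sketch, the uncommented lines in /etc/nvidia-container-runtime/config.toml typically look similar to the following; the exact sections and log file paths may differ between toolkit versions, so check the comments in your own config file:

[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"

The resulting log files can then be inspected or attached when reporting an issue.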
Generating Core Dumps#
In the event of a critical failure, core dumps can be automatically generated and can help troubleshoot issues. Refer to core(5) to generate these. In particular, check the following items:

- /proc/sys/kernel/core_pattern is correctly set and points somewhere with write access.
- ulimit -c is set to a sensible default.
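For example, both settings can be checked from a shell, and the core size limit lifted for the current shell, as follows:

$ cat /proc/sys/kernel/core_pattern
$ ulimit -c
$ ulimit -c unlimited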
In case the nvidia-container-cli process becomes unresponsive, gcore(1) can also be used.
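For instance, assuming a single nvidia-container-cli process is running, a core dump could be captured with something along these lines (gcore writes a core.<pid> file in the current directory):

$ sudo gcore $(pidof nvidia-container-cli)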
Conflicting values set for option Signed-By error when running apt update#
When following the installation instructions on Ubuntu or Debian-based systems and updating the package repository, the following error could be triggered:
$ sudo apt-get update
E: Conflicting values set for option Signed-By regarding source https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64/ /: /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg !=
E: The list of sources could not be read.
This is caused by the combination of two things:

- A recent update to the installation instructions to create the repo list file /etc/apt/sources.list.d/nvidia-container-toolkit.list
- The deprecation of apt-key, meaning that the signed-by directive is included in the repo list file

If this error is triggered, it means that another reference to the same repository exists that does not specify the signed-by directive. The most likely candidates would be one or more of the files libnvidia-container.list, nvidia-docker.list, or nvidia-container-runtime.list in the folder /etc/apt/sources.list.d/.
The conflicting repository references can be obtained by running the following command and inspecting the output:
$ grep "nvidia.github.io" /etc/apt/sources.list.d/*
The list of files with possibly conflicting references can be obtained by running:
$ grep -l "nvidia.github.io" /etc/apt/sources.list.d/* | grep -vE "/nvidia-container-toolkit.list\$"
Deleting the listed files should resolve the original error.
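For example, after reviewing the list, the conflicting files could be removed and the package index refreshed with something like:

$ grep -l "nvidia.github.io" /etc/apt/sources.list.d/* | grep -vE "/nvidia-container-toolkit.list\$" | xargs -r sudo rm
$ sudo apt-get update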
Permission denied error when running the nvidia-docker wrapper under SELinux#
When running the nvidia-docker wrapper (provided by the nvidia-docker2 package) in SELinux environments, one may see the following error:
$ sudo nvidia-docker run --gpus=all --rm nvcr.io/nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
/bin/nvidia-docker: line 34: /bin/docker: Permission denied
/bin/nvidia-docker: line 34: /bin/docker: Success
With SELinux reporting the following error:
SELinux is preventing /usr/bin/bash from entrypoint access on the file /usr/bin/docker. For complete SELinux messages run: sealert -l 43932883-bf2e-4e4e-800a-80584c62c218
SELinux is preventing /usr/bin/bash from entrypoint access on the file /usr/bin/docker.
***** Plugin catchall (100. confidence) suggests **************************
If you believe that bash should be allowed entrypoint access on the docker file by default.
Then you should report this as a bug.
You can generate a local policy module to allow this access.
Do
allow this access for now by executing:
# ausearch -c 'nvidia-docker' --raw | audit2allow -M my-nvidiadocker
# semodule -X 300 -i my-nvidiadocker.pp
This occurs because nvidia-docker forwards the command line arguments, with minor modifications, to the docker executable. To address this, specify the NVIDIA runtime in the docker command:
$ sudo docker run --gpus=all --runtime=nvidia --rm nvcr.io/nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Alternatively, a local SELinux policy module can be generated as suggested:
$ ausearch -c 'nvidia-docker' --raw | audit2allow -M my-nvidiadocker
$ semodule -X 300 -i my-nvidiadocker.pp
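If desired, you can confirm that the policy module is installed before retrying the original command; the module name my-nvidiadocker comes from the audit2allow invocation above:

$ sudo semodule -l | grep my-nvidiadocker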
NVML: Insufficient Permissions and SELinux#
Depending on how your Red Hat Enterprise Linux system is configured with SELinux, you might have to specify --security-opt=label=disable on the Docker or Podman command line to share parts of the host OS that cannot be relabeled. Without this option, you might observe the following error when running GPU containers: Failed to initialize NVML: Insufficient Permissions.

However, using this option disables SELinux separation in the container and the container is executed in an unconfined type. Review the SELinux policies on your system.
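As a sketch, using the CUDA image from the earlier examples, the option can be added to the command line as follows:

$ sudo docker run --gpus=all --runtime=nvidia --security-opt=label=disable --rm nvcr.io/nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi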
Containers losing access to GPUs with error: “Failed to initialize NVML: Unknown Error”#
When using the NVIDIA Container Runtime Hook (i.e. the Docker --gpus flag or the NVIDIA Container Runtime in legacy mode) to inject requested GPUs and driver libraries into a container, the hook makes modifications to the container, including setting up cgroup access, without the low-level runtime (e.g. runc) being aware of these changes. The result is that updates to the container may remove access to the requested GPUs.

When the container loses access to the GPU, you will see the following error message from the console output:

Failed to initialize NVML: Unknown Error

The message may differ depending on the type of application that is running in the container. The container needs to be deleted once the issue occurs. When it is restarted, either manually or automatically depending on whether you are using a container orchestration platform, it will regain access to the GPU.
Affected environments#
On certain systems, this behavior is not limited to explicit container updates such as adjusting the CPU and memory limits for a container. On systems where systemd is used to manage the cgroups of the container, reloading the systemd unit files (systemctl daemon-reload) is sufficient to trigger container updates and cause a loss of GPU access.
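On an affected system, the failure mode can be reproduced with a sketch along the following lines; the container name gpu-test is arbitrary, and the second exec may fail with the Failed to initialize NVML: Unknown Error message shown above:

$ sudo docker run -d --name gpu-test --gpus=all nvcr.io/nvidia/cuda:11.6.2-base-ubuntu20.04 sleep infinity
$ sudo docker exec gpu-test nvidia-smi
$ sudo systemctl daemon-reload
$ sudo docker exec gpu-test nvidia-smi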
Mitigations and Workarounds#
Warning
Certain runc versions show similar behavior with the systemd cgroup driver when /dev/char symlinks for the required devices are missing on the system. Refer to GitHub discussion #1133 for more details around this issue. It should be noted that the behavior persisted even if device nodes were requested on the command line.

Newer runc versions do not show this behavior, and newer NVIDIA driver versions ensure that the required symlinks are present, reducing the likelihood of the specific issue occurring for affected runc versions.
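To check whether the symlinks are present on a given system, you can inspect the /dev/char directory; on systems where they exist, you would expect entries pointing at the NVIDIA device nodes:

$ ls -l /dev/char/ | grep nvidia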
Use the following workarounds to prevent containers from losing access to requested GPUs when a systemctl daemon-reload command is run:

- For Docker, use cgroupfs as the cgroup driver for containers. To do this, update /etc/docker/daemon.json to include:

  { "exec-opts": ["native.cgroupdriver=cgroupfs"] }

  and restart Docker by running systemctl restart docker. This ensures that the container will not lose access to devices when systemctl daemon-reload is run. This approach does not change the behavior for explicit container updates, and a container will still lose access to devices in this case.

- Explicitly request the device nodes associated with the requested GPU(s) and any control device nodes when starting the container. For the Docker CLI, this is done by adding the relevant --device flags (see the first sketch after this list). In the case of the NVIDIA Kubernetes Device Plugin, the compatWithCPUManager=true Helm option will ensure the same thing.

- Use the Container Device Interface (CDI) to inject devices into a container (see the second sketch after this list). When CDI is used to inject devices into a container, the required device nodes are included in the modifications made to the container config. This means that even if the container is updated, it will still have access to the required devices.