Host Stub Visibility Issue
Consider the following simple translation unit (TU):
#include <cstdio>
#include <cuda/memory>
template <class T>
__global__ void kernel(T *val) {
  printf("kernel: set val = 42\n");
  *val = 42;
}
__device__ int val;
int main() {
  kernel<<<1, 1>>>(cuda::get_device_address(val));
}
The CUDA compiler frontend will turn this into:
template <class T>
static void __wrapper__device_stub_kernel(T *&ptr) {
  ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);
}
// stub host function
template <class T>
void kernel(T *ptr) {
  __wrapper__device_stub_kernel<T>(ptr);
}
int main() {
  int *ptr{};
  (__cudaPushCallConfiguration(1, 1)) ? (void)0 : kernel(ptr);
}
static void __device_stub__Z6kernelIiEvPT_(int *__par0) {
  __cudaLaunchPrologue(1);
  __cudaSetupArgSimple(__par0, 0UL);
  __cudaLaunch(((char *)((void (*)(int *))kernel)));
}
template <> void __wrapper__device_stub_kernel(int *&__cuda_0) {
  __device_stub__Z6kernelIiEvPT_((int *&)__cuda_0);
}
The CUDA runtime uses the address of the instantiated host stub kernel<int> (referred to below as h_kernel)
as the key in its mapping from host stub functions (h_kernel) to device functions (d_kernel). This works fine
as long as there is a single source of truth for the stub function h_kernel.
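Conceptually, the registration code emitted for each fatbinary fills a runtime-owned map keyed by the stub's address. The following sketch only models that idea; the names and types are invented for illustration and this is not the actual CUDA runtime implementation:
#include <unordered_map>
// Invented types and names -- a conceptual model of the runtime's bookkeeping.
struct DeviceFunction { /* handle into a registered fatbinary image */ };
static std::unordered_map<const void *, DeviceFunction> stub_to_device;
// Conceptually called once per fatbinary during module registration.
void register_kernel(const void *h_kernel, DeviceFunction d_kernel) {
  stub_to_device[h_kernel] = d_kernel;  // the stub's address is the key
}
// Conceptually called when a <<<...>>> launch goes through the host stub.
const DeviceFunction *lookup_kernel(const void *h_kernel) {
  auto it = stub_to_device.find(h_kernel);
  return it == stub_to_device.end() ? nullptr : &it->second;
}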
However, imagine that there are two shared libraries, lib_a and lib_b, both using the same kernel instantiation:
cmake_minimum_required(VERSION 3.18) # version is illustrative
project(HostStubVisibility CUDA CXX)
add_executable(host_stub_visibility main.cu)
add_library(lib_a SHARED tu_a.cu)
add_library(lib_b SHARED tu_b.cu)
target_link_libraries(host_stub_visibility PRIVATE lib_a lib_b)
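For illustration, tu_a.cu and tu_b.cu might each look roughly like the following (a hypothetical sketch, not the repository's exact sources; launch_from_a is an invented helper name):
// tu_a.cu -- hypothetical sketch; tu_b.cu is analogous
#include <cstdio>
template <class T>
__global__ void kernel(T *val) {
  printf("kernel: set val = 42\n");
  *val = 42;
}
// Invented helper so the executable can trigger the launch from this library;
// both libraries end up instantiating the very same kernel<int>.
void launch_from_a(int *d_val) {
  kernel<<<1, 1>>>(d_val);
}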
Each library will have its own fatbinary, d_kernel_a and d_kernel_b, but the compiler-generated host stub
function h_kernel has weak external linkage, so after dynamic linking we end up with only one of them:
| lib | host | device |
|---|---|---|
| a | 0xh_kernel_a | 0xd_kernel_a |
| b | 0xh_kernel_a <- issue | 0xd_kernel_b |
Since the stub function addresses clash, only one entry is stored. When lib_b queries for the kernel using
its address of h_kernel, the entry is found, but it may point to lib_a's fatbinary. The opposite can happen
as well, depending on load order, the linker, and so on; either way, the behavior is undefined.
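One way to observe the clash directly is to print the stub's address from each library, as the example program below does. With the kernel template from the first listing visible in the translation unit, a minimal (hypothetical) helper could look like this:
// Hypothetical helper: reports this library's view of the host stub address.
// In host code, kernel<int> names the host stub, not the device function.
#include <cstdio>
void report_stub_address(const char *who) {
  std::printf("%s: kernel stub address: %p\n", who,
              reinterpret_cast<void *>(&kernel<int>));
}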
Launching lib_b's d_kernel is therefore not possible, and attempting to do so leads to seemingly random errors.
For instance, there appears to be per-instance global state in the CUDA runtime: when __cudaPushCallConfiguration
is called in lib_b, it affects the state of lib_b's runtime (cudart_b), but the launch happens through h_kernel,
which lives in lib_a. This sometimes results in the error "__global__ function call is not configured". However,
there might also be no error at all, and the kernel launch is silently skipped.
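Because the failure can be completely silent, it helps to check both the launch status and a value written by the kernel. A minimal sketch of such a check, assuming the kernel template from the first listing is visible (launch_and_verify is an invented name; it detects the problem, it does not fix it):
#include <cstdio>
#include <cuda_runtime.h>
// Invented helper: launch the kernel and verify that it actually ran.
bool launch_and_verify(int *d_val) {
  kernel<<<1, 1>>>(d_val);
  cudaError_t err = cudaGetLastError();  // catches "...is not configured"
  if (err != cudaSuccess) {
    std::printf("launch error: %s\n", cudaGetErrorString(err));
    return false;
  }
  cudaDeviceSynchronize();
  int out = 0;
  cudaMemcpy(&out, d_val, sizeof(out), cudaMemcpyDeviceToHost);
  return out == 42;  // false: the launch was silently skipped
}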
A simple example program that demonstrates this can be found on GitHub:
./host_stub_visibility/host_stub_visibility
a: kernel stub address: 0x7f43318a415d <== same address as in B
a: kernel is in mapping: no error <== kernel is found in the mapping
b: launched kernel
a: kernel: set val = 42
a: synchronized stream
a: copied from device to host
a: out: 42
a: kernel was launched: out == 42
b: kernel stub address: 0x7f43318a415d <== same address as in A
b: kernel is in mapping: no error <== kernel is found in the mapping
b: launched kernel
b: synchronized stream
b: copied from device to host
b: out: 0
b: kernel was NOT actually launched: out != 42 <== silent failure