Calling foreign functions from Python kernels#

Python kernels can call device functions written in other languages. CUDA C/C++, PTX, and binary objects (cubins, fat binaries, etc.) are directly supported; sources in other languages must be compiled to PTX first. The constituent parts of a Python kernel call to a foreign device function are:

The device function implementation in a foreign language (e.g. CUDA C).
A declaration of the device function in Python.
A kernel that calls the foreign function.

Device function ABI#

Numba’s ABI for calling device functions defines the following prototype in C/C++:

extern "C"
__device__ int
function(
  T* return_value,
  ...
);

Components of the prototype are as follows:

extern "C" is used to prevent name-mangling so that it is easy to declare the function in Python. It can be removed, but then the mangled name must be used in the declaration of the function in Python.
__device__ is required to define the function as a device function.
The return value is always of type int, and is used to signal whether a Python exception occurred. Since Python exceptions don’t occur in foreign functions, this should always be set to 0 by the callee.
The first argument is a pointer to the return value of type T, which is allocated in the local address space [1] and passed in by the caller. If the function returns a value, the pointee should be set by the callee to store the return value.
Subsequent arguments should match the types and order of arguments passed to the function from the Python kernel.

Functions written in other languages must compile to PTX that conforms to this prototype specification.

A function that accepts two floats and returns a float would have the following prototype:

extern "C"
__device__ int
mul_f32_f32(
  float* return_value,
  float x,
  float y
);

Notes

Declaration in Python#

To declare a foreign device function in Python, use declare_device():

numba.cuda.declare_device(name, sig, link=None, use_cooperative=False)#

Declare the signature of a foreign function. Returns a descriptor that can be used to call the function from a Python kernel.

Parameters:

name (str) – The name of the foreign function.
sig – The Numba signature of the function.
link – External code to link when calling the function.
use_cooperative – External code requires cooperative launch.

The returned descriptor name need not match the name of the foreign function. For example, when:

mul = cuda.declare_device('mul_f32_f32', 'float32(float32, float32)' , link="functions.cu")

is declared, calling mul(a, b) inside a kernel will translate into a call to mul_f32_f32(a, b) in the compiled code.

Passing pointers#

Numba’s calling convention requires multiple values to be passed for array arguments. These include the data pointer along with shape, stride, and other information. This is incompatible with the expectations of most C/C++ functions, which generally only expect a pointer to the data. To align the calling conventions between C device code and Python kernels it is necessary to declare array arguments using C pointer types.

For example, a function with the following prototype:

numba/cuda/tests/doc_examples/ffi/functions.cu#

extern "C"
__device__ int
sum_reduce(
  float* return_value,
  float* array,
  int n
);

would be declared as follows:

from test_ex_from_buffer in numba/cuda/tests/doc_examples/test_ffi.py#

signature = "float32(CPointer(float32), int32)"
sum_reduce = cuda.declare_device(
    "sum_reduce", signature, link=functions_cu
)

To obtain a pointer to array data for passing to foreign functions, use the from_buffer() method of a cffi.FFI instance. For example, a kernel using the sum_reduce function could be defined as:

from test_ex_from_buffer in numba/cuda/tests/doc_examples/test_ffi.py#

import cffi

ffi = cffi.FFI()

@cuda.jit
def reduction_caller(result, array):
    array_ptr = ffi.from_buffer(array)
    result[()] = sum_reduce(array_ptr, len(array))

where result and array are both arrays of float32 data.

Linking and Calling functions#

The link keyword argument to the declare_device function accepts Linkable Code items. Either a single Linkable Code item can be passed, or multiple items in a list, tuple, or set.

A Linkable Code item is either:

A string indicating the location of a file in the filesystem, or
A LinkableCode object, for linking code that exists in memory.

Suported code formats that can be linked are:

PTX source code (*.ptx)
CUDA C/C++ source code (*.cu)
CUDA ELF Fat Binaries (*.fatbin)
CUDA ELF Cubins (*.cubin)
CUDA ELF archives (*.a)
CUDA Object files (*.o)
CUDA LTOIR files (*.ltoir)

CUDA C/C++ source code will be compiled with the NVIDIA Runtime Compiler (NVRTC) and linked into the kernel as either PTX or LTOIR, depending on whether LTO is enabled. Other files will be passed directly to the CUDA Linker.

A LinkableCode object may have setup and teardown callback functions that perform module-specific initialization and cleanup tasks.

Setup functions are invoked once for every new module loaded.
Teardown functions are invoked just prior to module unloading.

Both setup and teardown callbacks are called with a handle to the relevant module. In practice, Numba creates a new module each time a kernel is compiled for a specific set of argument types.

For each module, the setup callback is invoked once only. When a module is executed by multiple threads, only one thread will execute the setup callback.

The callbacks are defined as follows:

def setup_callback(mod: cuda.cudadrv.drvapi.cu_module):...
def teardown_callback(mod: cuda.cudadrv.drvapi.cu_module):...

LinkableCode objects are initialized using the parameters of their base class:

class numba.cuda.LinkableCode( data, name=None, setup_callback=None, teardown_callback=None, nrt=False, )#

An object that holds code to be linked from memory.

Parameters:

data – A buffer, StringIO or BytesIO containing the data to link. If a file object is passed, the content in the object is read when data property is accessed.
name – The name of the file to be referenced in any compilation or linking errors that may be produced.
setup_callback – A function called prior to the launch of a kernel contained within a module that has this code object linked into it.
teardown_callback – A function called just prior to the unloading of a module that has this code object linked into it.
nrt – If True, assume this object contains NRT function calls and add NRT source code to the final link.

However, one should instantiate an instance of the class that represents the type of item being linked:

class numba.cuda.PTXSource( data, name=None, setup_callback=None, teardown_callback=None, nrt=False, )#: PTX source code in memory.

class numba.cuda.CUSource( data, name=None, setup_callback=None, teardown_callback=None, nrt=False, )#: CUDA C/C++ source code in memory.

class numba.cuda.Fatbin( data, name=None, setup_callback=None, teardown_callback=None, nrt=False, )#: An ELF Fatbin in memory.

class numba.cuda.Cubin( data, name=None, setup_callback=None, teardown_callback=None, nrt=False, )#: An ELF Cubin in memory.

class numba.cuda.Archive( data, name=None, setup_callback=None, teardown_callback=None, nrt=False, )#: An archive of objects in memory.

class numba.cuda.Object( data, name=None, setup_callback=None, teardown_callback=None, nrt=False, )#: An object file in memory.

class numba.cuda.LTOIR( data, name=None, setup_callback=None, teardown_callback=None, nrt=False, )#: An LTOIR file in memory.

Legacy `@cuda.jit` decorator `link` support#

The link keyword argument of the @cuda.jit decorator also accepts a list of Linkable Code items, which will then be linked into the kernel. This facility is provided for backwards compatibility; it is recommended that Linkable Code items are always specified in the declare_device call, so that the user of the declared API is not burdened with specifying the items to link themselves when writing a kernel.

As an example of how this legacy mechanism looked at the point of use: the following kernel calls the mul() function declared above with the implementation mul_f32_f32() as if it were in a file called functions.cu that had not been declared as part of the link argument in the declaration:

@cuda.jit(link=['functions.cu'])
def multiply_vectors(r, x, y):
    i = cuda.grid(1)

    if i < len(r):
        r[i] = mul(x[i], y[i])

C/C++ Support#

Support for compiling and linking of CUDA C/C++ code is provided through the use of NVRTC subject to the following considerations:

A suitable version of the NVRTC library must be available.
The CUDA include path is assumed by default to be /usr/local/cuda/include on Linux and $env:CUDA_PATH\include on Windows. It can be modified using the environment variable NUMBA_CUDA_INCLUDE_PATH.
The CUDA include directory will be made available to NVRTC on the include path.
Additional search paths can be set to the environment variable NUMBA_CUDA_NVRTC_EXTRA_SEARCH_PATHS. Multiple paths should be colon separated.

Extra Search Paths Example#

This example demonstrates calling a foreign function that includes additional headers not in the default Numba-CUDA search paths.

The definitions of the C++ template APIs are in two different locations:

numba/cuda/tests/doc_examples/ffi/include/mul.cuh#

// Templated multiplication function: mymul
template <typename T>
__device__ T mymul(T a, T b) { return a * b; }

numba/cuda/tests/data/include/add.cuh#

// Templated addition function: myadd
template <typename T>
__device__ T myadd(T a, T b) { return a + b; }

Neither of the headers are in the default search paths of Numba-CUDA, but the foreign device function saxpy depends on them:

numba/cuda/tests/data/doc_examples/ffi/saxpy.cu#

#include <add.cuh> // In numba/cuda/tests/data/include
#include <mul.cuh> // In numba/cuda/tests/doc_examples/ffi/include

extern "C"
__device__ int saxpy(float *ret, float a, float x, float y)
{
    *ret = myadd(mymul(a, x), y);
    return 0;
}

In the Python code, assume that mul_dir and add_dir are set to the paths that contain mul.cuh and add.cuh respectively. The paths are joined with : before setting config.CUDA_NVRTC_EXTRA_SEARCH_PATHS:

from test_ex_extra_includes in numba/cuda/tests/doc_examples/test_ffi.py#

from numba import config

includedir = ":".join([mul_dir, add_dir])
config.CUDA_NVRTC_EXTRA_SEARCH_PATHS = includedir

Next, use saxpy as intended:

from test_ex_extra_includes in numba/cuda/tests/doc_examples/test_ffi.py#

sig = "float32(float32, float32, float32)"
saxpy = cuda.declare_device("saxpy", sig=sig, link=saxpy_cu)

@cuda.jit
def vector_saxpy(a, x, y, res):
    i = cuda.grid(1)
    if i < len(res):
        res[i] = saxpy(a, x[i], y[i])

Complete Example#

This example demonstrates calling a foreign function written in CUDA C to multiply pairs of numbers from two arrays.

The foreign function is written as follows:

numba/cuda/tests/doc_examples/ffi/functions.cu#

// Foreign function example: multiplication of a pair of floats

extern "C" __device__ int
mul_f32_f32(
  float* return_value,
  float x,
  float y)
{
  // Compute result and store in caller-provided slot
  *return_value = x * y;

  // Signal that no Python exception occurred
  return 0;
}

The Python code and kernel are:

from test_ex_linking_cu in numba/cuda/tests/doc_examples/test_ffi.py#

from numba import cuda
import numpy as np
import os

# Path to the source containing the foreign function
# (here assumed to be in a subdirectory called "ffi")
basedir = os.path.dirname(os.path.abspath(__file__))
functions_cu = os.path.join(basedir, "ffi", "functions.cu")

# Declaration of the foreign function
mul = cuda.declare_device(
    "mul_f32_f32", "float32(float32, float32)", link=functions_cu
)

# A kernel that calls mul; functions.cu is linked automatically due to
# the call to mul.
@cuda.jit
def multiply_vectors(r, x, y):
    i = cuda.grid(1)

    if i < len(r):
        r[i] = mul(x[i], y[i])

# Generate random data
N = 32
np.random.seed(1)
x = np.random.rand(N).astype(np.float32)
y = np.random.rand(N).astype(np.float32)
r = np.zeros_like(x)

# Run the kernel
multiply_vectors[1, 32](r, x, y)

# Sanity check - ensure the results match those expected
np.testing.assert_array_equal(r, x * y)

Note

The example above is minimal in order to illustrate a foreign function call - it would not be expected to be particularly performant due to the small grid and light workload of the foreign function.

Calling foreign functions from Python kernels#

Device function ABI#

Declaration in Python#

Passing pointers#

Linking and Calling functions#

Legacy @cuda.jit decorator link support#

C/C++ Support#

Extra Search Paths Example#

Complete Example#

Legacy `@cuda.jit` decorator `link` support#