Writing CUDA Kernels#

Numba CUDA MLIR supports programming NVIDIA CUDA GPUs by directly compiling a restricted subset of Python code into CUDA kernels and device functions following the CUDA execution model.

Introduction#

CUDA has an execution model unlike the traditional sequential model used for programming CPUs. In CUDA, the code you write will be executed by multiple threads at once (often hundreds or thousands). Your solution will be modeled by defining a thread hierarchy of grid, blocks and threads.

Numba CUDA MLIR exposes facilities to declare and manage this hierarchy of threads. The facilities are largely similar to those exposed in CUDA C++.

Three kinds of GPU memory are exposed: global device memory (the large off-chip memory visible across all threads), on-chip shared memory and local memory. For all but the simplest algorithms, it is important that you carefully consider how to use and access memory in order to minimize bandwidth requirements and contention.

Terminology#

Several important terms in the topic of CUDA programming are listed here:

host: the CPU
device: the GPU
host memory: the system main memory
device memory: onboard memory on a GPU card
kernels: a GPU function launched by the host and executed on the device
device function: a GPU function executed on the device which can only be called from the device (i.e. from a kernel or another device function)

Programming model#

Most CUDA programming facilities exposed by Numba CUDA MLIR map directly to CUDA C++ equivalents. Therefore, it is recommended you read the official CUDA Programming Guide.

Kernel declaration#

A kernel function is a GPU function that is called from CPU code. It has two fundamental characteristics:

Kernels cannot explicitly return a value; all result data must be written to an array passed to the function (if computing a scalar, you will probably pass a one-element array);
Kernels explicitly declare their thread hierarchy when called: i.e. the number of thread blocks and the number of threads per block (note that while a kernel is compiled once, it can be called multiple times with different block sizes or grid sizes).

At first sight, writing a CUDA kernel with Numba CUDA MLIR looks very much like writing a JIT function for the CPU with Numba:

@cuda.jit
def increment_by_one(an_array):
    """
    Increment all array elements by one.
    """
    # code elided here; read further for different implementations

Kernel invocation#

A kernel is typically launched in the following way:

threadsperblock = 32
blockspergrid = (an_array.size + (threadsperblock - 1)) // threadsperblock
increment_by_one[blockspergrid, threadsperblock](an_array)

We notice two steps here:

Instantiate the kernel proper, by specifying a number of blocks (or “blocks per grid”), and a number of threads per block. The product of the two will give the total number of threads launched. Kernel instantiation is done by taking the compiled kernel function (here increment_by_one) and indexing it with a tuple of integers.
Running the kernel, by passing it the input array (and any separate output arrays if necessary). Kernels run asynchronously: launches queue their execution on the device and then return immediately. You can use cuda.synchronize() to wait for all previous kernel launches to finish executing.

Note

Passing an array that resides in host memory will implicitly cause a copy back to the host, which will be synchronous. In this case, the kernel launch will not return until the data is copied back, and therefore appears to execute synchronously.

Choosing the block size#

It might seem curious to have a two-level hierarchy when declaring the number of threads needed by a kernel. The block size (i.e. number of threads per block) is often crucial:

On the software side, the block size determines how many threads share a given area of shared memory.
On the hardware side, the block size must be large enough for full occupation of execution units; recommendations can be found in the CUDA Programming Guide.

Multi-dimensional blocks and grids#

To help deal with multi-dimensional arrays, CUDA allows you to specify multi-dimensional blocks and grids. In the example above, you could make blockspergrid and threadsperblock tuples of one, two or three integers. Compared to 1D declarations of equivalent sizes, this doesn’t change anything to the efficiency or behaviour of generated code, but can help you write your algorithms in a more natural way.

Thread positioning#

When running a kernel, the kernel function’s code is executed by every thread once. It therefore has to know which thread it is in, in order to know which array element(s) it is responsible for (complex algorithms may define more complex responsibilities, but the underlying principle is the same).

One way is for the thread to determine its position in the grid and block and manually compute the corresponding array position:

@cuda.jit
def increment_by_one(an_array):
    # Thread id in a 1D block
    tx = cuda.threadIdx.x
    # Block id in a 1D grid
    ty = cuda.blockIdx.x
    # Block width, i.e. number of threads per block
    bw = cuda.blockDim.x
    # Compute flattened index inside the array
    pos = tx + ty * bw
    if pos < an_array.size:  # Check array boundaries
        an_array[pos] += 1

Note

Unless you are sure the block size and grid size is a divisor of your array size, you must check boundaries as shown above.

threadIdx, blockIdx, blockDim and gridDim are special objects provided for the sole purpose of knowing the geometry of the thread hierarchy and the position of the current thread within that geometry.

These objects can be 1D, 2D or 3D, depending on how the kernel was invoked. To access the value at each dimension, use the x, y and z attributes of these objects, respectively.

cuda.threadIdx: The thread indices in the current thread block. For 1D blocks, the index (given by the x attribute) is an integer spanning the range from 0 inclusive to cuda.blockDim exclusive. A similar rule exists for each dimension when more than one dimension is used.

cuda.blockDim: The shape of the block of threads, as declared when instantiating the kernel. This value is the same for all threads in a given kernel, even if they belong to different blocks (i.e. each block is “full”).

cuda.blockIdx: The block indices in the grid of threads launched a kernel. For a 1D grid, the index (given by the x attribute) is an integer spanning the range from 0 inclusive to cuda.gridDim exclusive. A similar rule exists for each dimension when more than one dimension is used.

cuda.gridDim: The shape of the grid of blocks, i.e. the total number of blocks launched by this kernel invocation, as declared when instantiating the kernel.

Absolute positions#

Simple algorithms will tend to always use thread indices in the same way as shown in the example above. Numba CUDA MLIR provides additional facilities to automate such calculations:

cuda.grid(ndim): Return the absolute position of the current thread in the entire grid of blocks. ndim should correspond to the number of dimensions declared when instantiating the kernel. If ndim is 1, a single integer is returned. If ndim is 2 or 3, a tuple of the given number of integers is returned.

cuda.gridsize(ndim): Return the absolute size (or shape) in threads of the entire grid of blocks. ndim has the same meaning as in grid() above.

With these functions, the incrementation example can become:

@cuda.jit
def increment_by_one(an_array):
    pos = cuda.grid(1)
    if pos < an_array.size:
        an_array[pos] += 1

The same example for a 2D array and grid of threads would be:

@cuda.jit
def increment_a_2D_array(an_array):
    x, y = cuda.grid(2)
    if x < an_array.shape[0] and y < an_array.shape[1]:
       an_array[x, y] += 1

Note the grid computation when instantiating the kernel must still be done manually, for example:

threadsperblock = (16, 16)
blockspergrid_x = math.ceil(an_array.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(an_array.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
increment_a_2D_array[blockspergrid, threadsperblock](an_array)

Writing Device Functions#

CUDA device functions can only be invoked from within the device (by a kernel or another device function). To define a device function:

from numba_cuda_mlir import cuda

@cuda.jit(device=True)
def a_device_function(a, b):
    return a + b

Unlike a kernel function, a device function can return a value like normal functions.