Cooperative Groups
Supported features
Numba’s Cooperative Groups support presently provides grid groups and grid synchronization, along with cooperative kernel launches.
Cooperative groups are supported on Linux, and on Windows for devices in TCC mode.
Using Grid Groups
To get the current grid group, use the cg.this_grid() function:
g = cuda.cg.this_grid()
Synchronizing the grid is done with the sync() method of the grid group:
g.sync()
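For example, a kernel that uses the grid group as a global barrier between two phases might look like the following. This is a minimal sketch, not from the original documentation; the kernel name normalize and the arrays x and total are illustrative:

from numba import cuda

@cuda.jit
def normalize(x, total):
    i = cuda.grid(1)
    g = cuda.cg.this_grid()

    # Phase 1: every thread contributes its element to a global sum
    if i < x.size:
        cuda.atomic.add(total, 0, x[i])
    # Global barrier: no thread proceeds until the contributions
    # from all blocks are complete and visible
    g.sync()
    # Phase 2: every thread can now safely read the finished sum
    if i < x.size:
        x[i] = x[i] / total[0]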
Cooperative Launches
Unlike the CUDA C/C++ API, a cooperative launch is invoked using the same syntax as a normal kernel launch - Numba automatically determines whether a cooperative launch is required based on whether a grid group is synchronized in the kernel.
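For instance, launching the hypothetical normalize kernel sketched above looks exactly like a normal launch; because the kernel calls g.sync(), Numba performs a cooperative launch behind the scenes:

import numpy as np

x = np.random.rand(4096)   # float64 input
total = np.zeros(1)        # accumulator for the global sum

blockdim = 128
griddim = (x.size + blockdim - 1) // blockdim

# No special syntax - this is implicitly a cooperative launch
normalize[griddim, blockdim](x, total)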
The grid size limit for a cooperative launch is more restrictive than for a
normal launch - the grid must be no larger than the maximum number of active
blocks on the device on which it is launched. To get the maximum grid size for a
cooperative launch of a kernel with a given block size and dynamic shared
memory requirement, use the max_cooperative_grid_blocks() method of kernel
overloads:
- _Kernel.max_cooperative_grid_blocks(blockdim, dynsmemsize=0)
  Calculates the maximum number of blocks that can be launched for this kernel in a cooperative grid in the current context, for the given block and dynamic shared memory sizes.
  - Parameters:
    - blockdim – Block dimensions, either as a scalar for a 1D block, or a tuple for 2D or 3D blocks.
    - dynsmemsize – Dynamic shared memory size in bytes.
  - Returns: The maximum number of blocks in the grid.
This can be used to ensure that the kernel is launched with no more than the
maximum number of blocks. Exceeding the maximum number of blocks for the
cooperative launch will result in a CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE
error.
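One way to guard against this, sketched here with the hypothetical normalize kernel, blockdim and griddim from the earlier snippets, is to check the requested grid size against the limit before launching:

# After compilation there is one overload per signature; for this
# sketch we take the only one that was compiled
overload = next(iter(normalize.overloads.values()))
max_blocks = overload.max_cooperative_grid_blocks(blockdim)
if griddim > max_blocks:
    raise RuntimeError(f"Grid of {griddim} blocks exceeds the maximum of "
                       f"{max_blocks} for a cooperative launch")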
Applications and Example
Grid group synchronization can be used to implement a global barrier across all threads in the grid - applications of this include a global reduction to a single value, or looping over rows of a large matrix sequentially using the entire grid to operate on column elements in parallel.
In the following example, rows are written sequentially by the grid. Each thread in the grid reads a value from the previous row written by its opposite thread. A grid sync is needed to ensure that threads in the grid don't run ahead of threads in other blocks, or fail to see updates from their opposite thread.
First we’ll define our kernel:
from numba import cuda, int32
import numpy as np

sig = (int32[:,::1],)

@cuda.jit(sig)
def sequential_rows(M):
    col = cuda.grid(1)
    g = cuda.cg.this_grid()

    rows = M.shape[0]
    cols = M.shape[1]

    for row in range(1, rows):
        opposite = cols - col - 1
        # Each row's elements are one greater than the previous row
        M[row, col] = M[row - 1, opposite] + 1
        # Wait until all threads have written their column element,
        # and that the write is visible to all other threads
        g.sync()
Then create some empty input data and determine the grid and block sizes:
# Empty input data
A = np.zeros((1024, 1024), dtype=np.int32)
# A somewhat arbitrary choice (one warp), but generally smaller block sizes
# allow more blocks to be launched (noting that other limitations on
# occupancy apply such as shared memory size)
blockdim = 32
griddim = A.shape[1] // blockdim
Finally we launch the kernel and print the result:
# Kernel launch - this is implicitly a cooperative launch
sequential_rows[griddim, blockdim](A)

# What do the results look like?
# print(A)
#
# [[   0    0    0 ...    0    0    0]
#  [   1    1    1 ...    1    1    1]
#  [   2    2    2 ...    2    2    2]
#  ...
#  [1021 1021 1021 ... 1021 1021 1021]
#  [1022 1022 1022 ... 1022 1022 1022]
#  [1023 1023 1023 ... 1023 1023 1023]]
The maximum grid size for sequential_rows can be queried using:
overload = sequential_rows.overloads[(int32[:,::1],)]
max_blocks = overload.max_cooperative_grid_blocks(blockdim)
print(max_blocks)
# 1152 (e.g. on Quadro RTX 8000 with Numba 0.52.1 and CUDA 11.0)