cuda.coop._experimental: Cooperative Algorithms
The cuda.coop._experimental library provides cooperative algorithms that operate
at the level of blocks and warps. It is designed to be used within
Numba CUDA kernels.
Note: this API is marked as experimental; we anticipate that the Python package namespace and API details will change in a subsequent release.
Here’s an example showing how to use the cuda.coop._experimental library to
perform block-level reduction within a Numba CUDA kernel.
Block-level reduction example (complete source available on GitHub)
```python
import numba
import numpy as np
from numba import cuda

# ``coop`` below refers to the cuda.coop._experimental package,
# imported under a short alias.

def custom_reduce_example():
    """Demonstrate block reduction with a custom operator (maximum)."""
    def max_op(a, b):
        return a if a > b else b

    threads_per_block = 128
    block_reduce = coop.block.make_reduce(numba.int32, threads_per_block, max_op)

    @cuda.jit(link=block_reduce.files)
    def kernel(input, output):
        # Each thread contributes one element
        block_output = block_reduce(input[cuda.threadIdx.x])
        # Only thread 0 writes the result
        if cuda.threadIdx.x == 0:
            output[0] = block_output

    # Create test data
    h_input = np.random.randint(0, 100, threads_per_block, dtype=np.int32)
    d_input = cuda.to_device(h_input)
    d_output = cuda.device_array(1, dtype=np.int32)

    # Launch one block of threads_per_block threads
    kernel[1, threads_per_block](d_input, d_output)

    h_output = d_output.copy_to_host()
    h_expected = np.max(h_input)
    assert h_output[0] == h_expected
    print(f"Block max reduction: {h_output[0]} (expected: {h_expected})")
    return h_output[0]
```
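For reference, the block-wide result computed by the kernel above has the same semantics as a sequential reduction with the custom operator over one block's worth of input. The following host-side sketch (plain Python/NumPy, not part of the cuda.coop API) shows what the cooperative reduction computes:

```python
import functools
import numpy as np

def max_op(a, b):
    # Same custom operator used in the kernel example
    return a if a > b else b

threads_per_block = 128
h_input = np.random.randint(0, 100, threads_per_block, dtype=np.int32)

# functools.reduce applies max_op pairwise across the block's elements,
# mirroring the result of the cooperative block reduction on the GPU
# (the GPU version performs the combination in parallel across threads).
block_max = functools.reduce(max_op, h_input)
assert block_max == np.max(h_input)
```

Because `max_op` is associative, the parallel tree-shaped combination performed by the block reduction yields the same answer as this left-to-right fold.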
Example Collections
For complete runnable examples and more advanced usage patterns, see our full collection of examples.