cooperative: Cooperative Algorithms
The cuda.cccl.cooperative library provides cooperative algorithms that operate at the level of blocks and warps. It is designed to be used within Numba CUDA kernels.
Here’s an example showing how to use the cuda.cccl.cooperative library to perform a block-level reduction within a Numba CUDA kernel.
Block-level reduction example:
import numba
import numpy as np
from numba import cuda

# Assumption: the import path for the cooperative module; it may vary by
# cuda-cccl version (older releases used cuda.cooperative.experimental).
import cuda.cccl.cooperative.experimental as coop


def custom_reduce_example():
    """Demonstrate block reduction with a custom operator (maximum)."""

    def max_op(a, b):
        return a if a > b else b

    threads_per_block = 128
    block_reduce = coop.block.reduce(numba.int32, threads_per_block, max_op)

    @cuda.jit(link=block_reduce.files)
    def kernel(input, output):
        # Each thread contributes one element
        block_output = block_reduce(input[cuda.threadIdx.x])

        # Only thread 0 writes the result
        if cuda.threadIdx.x == 0:
            output[0] = block_output

    # Create test data
    h_input = np.random.randint(0, 100, threads_per_block, dtype=np.int32)
    d_input = cuda.to_device(h_input)
    d_output = cuda.device_array(1, dtype=np.int32)

    # Launch kernel
    kernel[1, threads_per_block](d_input, d_output)

    h_output = d_output.copy_to_host()
    h_expected = np.max(h_input)
    assert h_output[0] == h_expected
    print(f"Block max reduction: {h_output[0]} (expected: {h_expected})")
    return h_output[0]
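Note that the object returned by coop.block.reduce carries the compiled device code the primitive needs; passing its .files list to cuda.jit(link=...) links that code into the kernel, as the example above does.

The library also exposes warp-scope primitives. Below is a minimal sketch of a warp-level sum; it assumes the warp-level API mirrors the block-level one above, with coop.warp.sum taking only a dtype (the warp width is fixed at 32 threads), so treat the exact names and signatures here as assumptions rather than confirmed API:

import numba
import numpy as np
from numba import cuda

import cuda.cccl.cooperative.experimental as coop  # assumed import path, as above

threads_per_warp = 32

# Assumption: warp-level primitives take only a dtype, since the warp
# width is fixed at 32 threads.
warp_sum = coop.warp.sum(numba.int32)

@cuda.jit(link=warp_sum.files)
def warp_kernel(input, output):
    # Every lane in the warp contributes one element.
    warp_output = warp_sum(input[cuda.threadIdx.x])

    # Only lane 0 writes the reduced value.
    if cuda.threadIdx.x == 0:
        output[0] = warp_output

h_input = np.arange(threads_per_warp, dtype=np.int32)
d_input = cuda.to_device(h_input)
d_output = cuda.device_array(1, dtype=np.int32)

# Launch a single warp.
warp_kernel[1, threads_per_warp](d_input, d_output)
assert d_output.copy_to_host()[0] == h_input.sum()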
Example Collections
For complete runnable examples and more advanced usage patterns, see our full collection of examples.