cooperative: Cooperative Algorithms

The cuda.cccl.cooperative library provides cooperative algorithms that operate at the level of blocks and warps. It is designed to be used within Numba CUDA kernels.

Here’s an example showing how to use the cuda.cccl.cooperative library to perform a block-level reduction within a Numba CUDA kernel.

Block-level reduction example (complete source available on GitHub):
import numba
import numpy as np
from numba import cuda

import cuda.cccl.cooperative.experimental as coop


def custom_reduce_example():
    """Demonstrate block reduction with a custom operator (maximum)."""

    def max_op(a, b):
        return a if a > b else b

    threads_per_block = 128
    block_reduce = coop.block.reduce(numba.int32, threads_per_block, max_op)

    @cuda.jit(link=block_reduce.files)
    def kernel(input, output):
        # Each thread contributes one element
        block_output = block_reduce(input[cuda.threadIdx.x])

        # Only thread 0 writes the result
        if cuda.threadIdx.x == 0:
            output[0] = block_output

    # Create test data
    h_input = np.random.randint(0, 100, threads_per_block, dtype=np.int32)
    d_input = cuda.to_device(h_input)
    d_output = cuda.device_array(1, dtype=np.int32)

    # Launch a single block of threads_per_block threads
    kernel[1, threads_per_block](d_input, d_output)
    h_output = d_output.copy_to_host()
    h_expected = np.max(h_input)

    assert h_output[0] == h_expected
    print(f"Block max reduction: {h_output[0]} (expected: {h_expected})")
    return h_output[0]
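Note that coop.block.reduce specializes the reduction for the given dtype and block size, and the generated device code must be linked into the kernel; this is why the example passes link=block_reduce.files to cuda.jit.

Warp-level algorithms follow the same pattern. The sketch below adapts the example to a single warp of 32 threads; it assumes a coop.warp.reduce(dtype, op) entry point that mirrors the block-level signature and that, as with CUB's WarpReduce, only lane 0 is guaranteed to hold the reduced value. Treat this as a sketch and consult the API reference for the exact signature.

import numba
import numpy as np
from numba import cuda

import cuda.cccl.cooperative.experimental as coop


def warp_max_example():
    """Sketch: warp-level max reduction (assumes coop.warp.reduce mirrors coop.block.reduce)."""

    def max_op(a, b):
        return a if a > b else b

    threads_per_warp = 32
    warp_reduce = coop.warp.reduce(numba.int32, max_op)  # assumed signature

    @cuda.jit(link=warp_reduce.files)
    def kernel(input, output):
        # Each lane of the single warp contributes one element
        warp_output = warp_reduce(input[cuda.threadIdx.x])

        # Only lane 0 is guaranteed to hold the reduced value
        if cuda.threadIdx.x == 0:
            output[0] = warp_output

    h_input = np.random.randint(0, 100, threads_per_warp, dtype=np.int32)
    d_input = cuda.to_device(h_input)
    d_output = cuda.device_array(1, dtype=np.int32)

    kernel[1, threads_per_warp](d_input, d_output)
    assert d_output.copy_to_host()[0] == np.max(h_input)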

Example Collections

For complete runnable examples and more advanced usage patterns, see our full collection of examples.

External API References