cooperative: Cooperative Algorithms

The cuda.cccl.cooperative library provides cooperative algorithms that operate at the level of blocks and warps. It is designed to be used within Numba CUDA kernels.

Here’s an example showing how to use the cuda.cccl.cooperative library to perform a block-level reduction within a Numba CUDA kernel.

Block-level reduction example (complete source available on GitHub):
import numba
import numpy as np
from numba import cuda

import cuda.cccl.cooperative.experimental as coop


def custom_reduce_example():
    """Demonstrate block reduction with a custom operator (maximum)."""

    def max_op(a, b):
        return a if a > b else b

    threads_per_block = 128
    block_reduce = coop.block.reduce(numba.int32, threads_per_block, max_op)

    @cuda.jit(link=block_reduce.files)
    def kernel(input, output):
        # Each thread contributes one element
        block_output = block_reduce(input[cuda.threadIdx.x])

        # Only thread 0 writes the result
        if cuda.threadIdx.x == 0:
            output[0] = block_output

    # Create test data
    h_input = np.random.randint(0, 100, threads_per_block, dtype=np.int32)
    d_input = cuda.to_device(h_input)
    d_output = cuda.device_array(1, dtype=np.int32)

    # Launch a single block of threads_per_block threads
    kernel[1, threads_per_block](d_input, d_output)
    h_output = d_output.copy_to_host()
    h_expected = np.max(h_input)

    assert h_output[0] == h_expected
    print(f"Block max reduction: {h_output[0]} (expected: {h_expected})")
    return h_output[0]
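Note that coop.block.reduce specializes the reduction for the given dtype and block size, and the generated device code must be linked into the kernel; this is why the example passes link=block_reduce.files to cuda.jit.

Warp-level algorithms follow the same pattern. The sketch below adapts the example to a single warp of 32 threads; it assumes a coop.warp.reduce(dtype, op) entry point that mirrors the block-level signature and that, as with CUB's WarpReduce, only lane 0 is guaranteed to hold the reduced value. Treat this as a sketch and consult the API reference for the exact signature.

import numba
import numpy as np
from numba import cuda

import cuda.cccl.cooperative.experimental as coop


def warp_max_example():
    """Sketch: warp-level max reduction (assumes coop.warp.reduce mirrors coop.block.reduce)."""

    def max_op(a, b):
        return a if a > b else b

    threads_per_warp = 32
    warp_reduce = coop.warp.reduce(numba.int32, max_op)  # assumed signature

    @cuda.jit(link=warp_reduce.files)
    def kernel(input, output):
        # Each lane of the single warp contributes one element
        warp_output = warp_reduce(input[cuda.threadIdx.x])

        # Only lane 0 is guaranteed to hold the reduced value
        if cuda.threadIdx.x == 0:
            output[0] = warp_output

    h_input = np.random.randint(0, 100, threads_per_warp, dtype=np.int32)
    d_input = cuda.to_device(h_input)
    d_output = cuda.device_array(1, dtype=np.int32)

    kernel[1, threads_per_warp](d_input, d_output)
    assert d_output.copy_to_host()[0] == np.max(h_input)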

Example Collections

For complete runnable examples and more advanced usage patterns, see our full collection of examples.

External API References