cuda.coop._experimental: Cooperative Algorithms#

The cuda.coop._experimental library provides cooperative algorithms that operate at the thread-block and warp level. It is designed for use within Numba CUDA kernels.

Note: this API is marked as experimental, and we anticipate the Python package namespace and API details will change in a subsequent release.

Here’s an example showing how to use the cuda.coop._experimental library to perform a block-level reduction within a Numba CUDA kernel.

Block-level reduction example (view complete source on GitHub)#
import numba
import numpy as np
from numba import cuda
import cuda.coop._experimental as coop  # experimental namespace; subject to change


def custom_reduce_example():
    """Demonstrate block reduction with custom operator (maximum)."""

    def max_op(a, b):
        return a if a > b else b

    threads_per_block = 128
    block_reduce = coop.block.make_reduce(numba.int32, threads_per_block, max_op)

    @cuda.jit(link=block_reduce.files)
    def kernel(input, output):
        # Each thread contributes one element
        block_output = block_reduce(input[cuda.threadIdx.x])

        # Only thread 0 writes the result
        if cuda.threadIdx.x == 0:
            output[0] = block_output

    # Create test data
    h_input = np.random.randint(0, 100, threads_per_block, dtype=np.int32)
    d_input = cuda.to_device(h_input)
    d_output = cuda.device_array(1, dtype=np.int32)

    # Launch kernel
    kernel[1, threads_per_block](d_input, d_output)
    h_output = d_output.copy_to_host()
    h_expected = np.max(h_input)

    assert h_output[0] == h_expected
    print(f"Block max reduction: {h_output[0]} (expected: {h_expected})")
    return h_output[0]
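Conceptually, the device-side `block_reduce` call combines the one value contributed by each of the 128 threads into a single result using the supplied binary operator, typically via a pairwise tree reduction. The host-side sketch below illustrates that semantics in plain Python; `reduce_tree` is a hypothetical helper written for illustration and is not part of the library.

```python
import numpy as np


def reduce_tree(values, op):
    """Pairwise tree reduction: at each step, elements a stride apart
    are combined, halving the number of active slots (illustrative only)."""
    vals = list(values)
    n = len(vals)
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            vals[i] = op(vals[i], vals[i + stride])
        stride *= 2
    return vals[0]


# Mirrors the kernel example: 128 values reduced with a maximum operator.
data = np.random.randint(0, 100, 128, dtype=np.int32)
assert reduce_tree(data, max) == data.max()
```

This also shows why the operator should be associative: the tree combines values in a different order than a sequential loop would, so non-associative operators can give different results on the GPU.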

Example Collections#

For complete runnable examples and more advanced usage patterns, see our full collection of examples.

API Reference#