CUDA Cooperative

Warning

Python exposure of cooperative algorithms is in public beta. The API is subject to change without notice.

cuda.cooperative.experimental.warp.exclusive_sum(dtype, threads_in_warp=32)

Computes an exclusive warp-wide prefix sum using addition (+) as the scan operator. The value of 0 is applied as the initial value, and is assigned to the output in lane0.

Example

The code snippet below illustrates an exclusive prefix sum of 32 integer items. We start by importing the necessary modules and patching the Numba linker to enable LTO:

import numba
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the exclusive_sum API:

# Specialize exclusive sum for a warp of threads
warp_exclusive_sum = cudax.warp.exclusive_sum(numba.int32)

# Link the exclusive sum to a CUDA kernel
@cuda.jit(link=warp_exclusive_sum.files)
def kernel(data):
    # Collectively compute the warp-wide exclusive prefix sum
    data[cuda.threadIdx.x] = warp_exclusive_sum(data[cuda.threadIdx.x])

Suppose the set of input data across the warp of threads is { 1, 1, 1, ..., 1 }. The corresponding output in those threads will be { 0, 1, 2, ..., 31 }.
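
Continuing the example, the kernel can be launched from the host with a single block containing one warp of 32 threads. The host-side snippet below is a sketch and is not part of the original example; it assumes the kernel and warp_exclusive_sum defined above are in scope.

import numpy as np

# Launch one block that holds a single warp of 32 threads
d_data = cuda.to_device(np.ones(32, dtype=np.int32))
kernel[1, 32](d_data)

# Each lane now holds the exclusive prefix sum of 32 ones: 0, 1, ..., 31
np.testing.assert_array_equal(d_data.copy_to_host(), np.arange(32, dtype=np.int32))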

Parameters
  • dtype – Data type being scanned

  • threads_in_warp – The number of threads in a warp

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.warp.merge_sort_keys(dtype, items_per_thread, compare_op, threads_in_warp=32, methods=None)

Performs a warp-wide merge sort over a blocked arrangement of keys.

Example

The code snippet below illustrates a sort of 128 integer keys that are partitioned in a blocked arrangement across a warp of 32 threads, where each thread owns 4 consecutive keys. We start by importing the necessary modules and patching the Numba linker to enable LTO:

import numba
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the merge_sort_keys API:

# Define comparison operator
def compare_op(a, b):
    return a > b

# Specialize merge sort for a warp of threads owning 4 integer items each
items_per_thread = 4
warp_merge_sort = cudax.warp.merge_sort_keys(
    numba.int32, items_per_thread, compare_op
)

# Link the merge sort to a CUDA kernel
@cuda.jit(link=warp_merge_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)

    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    warp_merge_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]

Suppose the set of input thread_keys across the warp of threads is { [0, 1, 2, 3], [4, 5, 6, 7], ..., [124, 125, 126, 127] }. The corresponding output thread_keys in those threads will be { [127, 126, 125, 124], [123, 122, 121, 120], ..., [3, 2, 1, 0] }.
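
Continuing the example, a host-side launch might look like the sketch below; it is not part of the original example and assumes the kernel defined above is in scope.

import numpy as np

# 128 ascending keys, one block containing a single warp of 32 threads
h_keys = np.arange(128, dtype=np.int32)
d_keys = cuda.to_device(h_keys)
kernel[1, 32](d_keys)

# compare_op orders larger keys first, so the result is descending
np.testing.assert_array_equal(d_keys.copy_to_host(), h_keys[::-1])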

Parameters
  • dtype – Numba data type of the keys to be sorted

  • threads_in_warp – The number of threads in a warp

  • items_per_thread – The number of items each thread owns

  • compare_op – Comparison function object which returns true if the first argument is ordered before the second one

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.warp.reduce(dtype, binary_op, threads_in_warp=32, methods=None)

Computes a warp-wide reduction for lane0 using the specified binary reduction functor. Each thread contributes one input element.

Warning

The return value is undefined in lanes other than lane0.

Example

The code snippet below illustrates a max reduction of 32 integer items that are partitioned across a warp of threads. We start by importing the necessary modules and patching the Numba linker to enable LTO:

import numba
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the reduce API:

# Define the binary reduction operator (max)
def op(a, b):
    return a if a > b else b

# Specialize reduce for a warp of threads
warp_reduce = cudax.warp.reduce(numba.int32, op)

@cuda.jit(link=warp_reduce.files)
def kernel(input, output):
    warp_output = warp_reduce(input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = warp_output

Suppose the set of inputs across the warp of threads is { 0, 1, 2, 3, ..., 31 }. The corresponding output in lane0 will be 31.
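
A matching host-side launch, given as a sketch rather than as part of the original example:

import numpy as np

d_input = cuda.to_device(np.arange(32, dtype=np.int32))
d_output = cuda.device_array(1, dtype=np.int32)
kernel[1, 32](d_input, d_output)

# lane0 wrote the maximum of 0..31
assert d_output.copy_to_host()[0] == 31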

Parameters
  • dtype – Data type being reduced

  • threads_in_warp – The number of threads in a warp

  • binary_op – Binary reduction function

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.warp.sum(dtype, threads_in_warp=32)

Computes a warp-wide reduction for lane0 using addition (+) as the reduction operator. Each thread contributes one input element.

Warning

The return value is undefined in lanes other than lane0.

Example

The code snippet below illustrates a sum of 32 integer items that are partitioned across a warp of threads. We start by importing the necessary modules and patching the Numba linker to enable LTO:

import numba
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the sum API:

warp_sum = cudax.warp.sum(numba.int32)

@cuda.jit(link=warp_sum.files)
def kernel(input, output):
    warp_output = warp_sum(input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = warp_output

Suppose the set of inputs across the warp of threads is { 1, 1, 1, 1, ..., 1 }. The corresponding output in lane0 will be 32.

Parameters
  • dtype – Data type being reduced

  • threads_in_warp – The number of threads in a warp

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.block.exclusive_sum(dtype: Type[numba.types.Number], threads_per_block: int, items_per_thread: int = 1, prefix_op: Optional[Callable] = None, algorithm: Literal['raking', 'raking_memoize', 'warp_scans'] = 'raking') -> Callable

Computes an exclusive block-wide prefix sum.
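
No example accompanies this entry in the original documentation. The snippet below is a minimal sketch modeled on the block.sum example later on this page; the assumption that the returned callable is invoked with one item per thread and returns that thread's exclusive prefix sum is ours, and the imports and linker patching shown in the other examples are assumed to have been done.

# Sketch: specialize an exclusive prefix sum for a 1D block of 128 threads
threads_per_block = 128
block_exclusive_sum = cudax.block.exclusive_sum(numba.int32, threads_per_block)

@cuda.jit(link=block_exclusive_sum.files)
def kernel(data):
    # Assumed invocation: one item in, this thread's exclusive prefix sum out
    data[cuda.threadIdx.x] = block_exclusive_sum(data[cuda.threadIdx.x])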

cuda.cooperative.experimental.block.inclusive_sum(dtype: Type[numba.types.Number], threads_per_block: int, items_per_thread: int = 1, prefix_op: Optional[Callable] = None, algorithm: Literal['raking', 'raking_memoize', 'warp_scans'] = 'raking') -> Callable

Computes an inclusive block-wide prefix sum.
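
Under the same assumptions as the exclusive_sum sketch above, the inclusive variant differs only in the specialized primitive:

# Sketch: specialize an inclusive prefix sum for a 1D block of 128 threads
threads_per_block = 128
block_inclusive_sum = cudax.block.inclusive_sum(numba.int32, threads_per_block)

@cuda.jit(link=block_inclusive_sum.files)
def kernel(data):
    # Assumed invocation: one item in, this thread's inclusive prefix sum out
    data[cuda.threadIdx.x] = block_inclusive_sum(data[cuda.threadIdx.x])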

cuda.cooperative.experimental.block.load(dtype, threads_per_block, items_per_thread=1, algorithm='direct')

Creates an operation that performs a block-wide load.

Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:

  • (src: numba.types.Array, dest: numba.types.Array) -> None: Each thread loads items_per_thread items from src into dest. dest must contain at least items_per_thread items.

Different data movement strategies can be selected via the algorithm parameter:

  • algorithm=”direct” (default): A blocked arrangement of data is read directly from memory.

  • algorithm=”striped”: A striped arrangement of data is read directly from memory.

  • algorithm=”vectorize”: A blocked arrangement of data is read directly from memory using CUDA’s built-in vectorized loads as a coalescing optimization.

  • algorithm=”transpose”: A striped arrangement of data is read directly from memory and is then locally transposed into a blocked arrangement.

  • algorithm=”warp_transpose”: A warp-striped arrangement of data is read directly from memory and is then locally transposed into a blocked arrangement.

  • algorithm=”warp_transpose_timesliced”: A warp-striped arrangement of data is read directly from memory and is then locally transposed into a blocked arrangement one warp at a time.

For more details, [read the corresponding CUB C++ documentation](https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockLoad.html).

Parameters
  • dtype – Data type being loaded

  • threads_per_block – The number of threads in a block, either an integer or a tuple of 2 or 3 integers

  • items_per_thread – The number of items each thread loads

  • algorithm – The data movement algorithm to use

Example

The code snippet below illustrates a striped load and store of 128 integer items by 32 threads, with each thread handling 4 integers.

import numba
import numpy as np
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)
threads_per_block = 32
items_per_thread = 4
block_load = cudax.block.load(
    numba.int32, threads_per_block, items_per_thread, "striped"
)
block_store = cudax.block.store(
    numba.int32, threads_per_block, items_per_thread, "striped"
)

@cuda.jit(link=block_load.files + block_store.files)
def kernel(input, output):
    tmp = cuda.local.array(items_per_thread, numba.int32)
    block_load(input, tmp)
    block_store(output, tmp)
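
The kernel above can be launched from the host as in the following sketch, which is not part of the original example:

d_input = cuda.to_device(np.arange(128, dtype=np.int32))
d_output = cuda.device_array(128, dtype=np.int32)

# One block of 32 threads; each thread moves 4 items in a striped arrangement
kernel[1, threads_per_block](d_input, d_output)

np.testing.assert_array_equal(d_input.copy_to_host(), d_output.copy_to_host())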

cuda.cooperative.experimental.block.merge_sort_keys(dtype, threads_per_block, items_per_thread, compare_op, methods=None)

Performs a block-wide merge sort over a blocked arrangement of keys.

Example

The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads, where each thread owns 4 consecutive keys. We start by importing the necessary modules and patching the Numba linker to enable LTO:

import numba
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the merge_sort_keys API:

# Define comparison operator
def compare_op(a, b):
    return a > b

# Specialize merge sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_merge_sort = cudax.block.merge_sort_keys(
    numba.int32, threads_per_block, items_per_thread, compare_op
)

# Link the merge sort to a CUDA kernel
@cuda.jit(link=block_merge_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)

    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_merge_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]

Suppose the set of input thread_keys across the block of threads is { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }. The corresponding output thread_keys in those threads will be { [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }.

Parameters
  • dtype – Numba data type of the keys to be sorted

  • threads_per_block – The number of threads in a block

  • items_per_thread – The number of items each thread owns

  • compare_op – Comparison function object which returns true if the first argument is ordered before the second one

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.block.radix_sort_keys(dtype, threads_per_block, items_per_thread)

Performs an ascending block-wide radix sort over a blocked arrangement of keys.

Example

The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads, where each thread owns 4 consecutive keys. We start by importing the necessary modules and patching the Numba linker to enable LTO:

import numba
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the radix_sort_keys API:

# Specialize radix sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_radix_sort = cudax.block.radix_sort_keys(
    numba.int32, threads_per_block, items_per_thread
)

# Link the radix sort to a CUDA kernel
@cuda.jit(link=block_radix_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)

    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_radix_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]

Suppose the set of input thread_keys across the block of threads is { [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }. The corresponding output thread_keys in those threads will be { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }.
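
A host-side launch for this example might look like the sketch below (not part of the original example):

import numpy as np

# Sort 512 random keys with a single block of 128 threads
h_keys = np.random.randint(0, 1024, 512).astype(np.int32)
d_keys = cuda.to_device(h_keys)
kernel[1, threads_per_block](d_keys)

np.testing.assert_array_equal(d_keys.copy_to_host(), np.sort(h_keys))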

Parameters
  • dtype – Numba data type of the keys to be sorted

  • threads_per_block – The number of threads in a block

  • items_per_thread – The number of items each thread owns

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.block.radix_sort_keys_descending(dtype, threads_per_block, items_per_thread)

Performs a descending block-wide radix sort over a blocked arrangement of keys.

Example

The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads, where each thread owns 4 consecutive keys. We start by importing the necessary modules and patching the Numba linker to enable LTO:

import numba
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the radix_sort_keys_descending API:

# Specialize radix sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_radix_sort = cudax.block.radix_sort_keys_descending(
    numba.int32, threads_per_block, items_per_thread
)

# Link the radix sort to a CUDA kernel
@cuda.jit(link=block_radix_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)

    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_radix_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]

Suppose the set of input thread_keys across the block of threads is { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }. The corresponding output thread_keys in those threads will be { [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }.

Parameters
  • dtype – Numba data type of the keys to be sorted

  • threads_per_block – The number of threads in a block

  • items_per_thread – The number of items each thread owns

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.block.reduce(dtype, threads_per_block, binary_op, items_per_thread=1, methods=None)

Creates an operation that computes a block-wide reduction for thread0 using the specified binary reduction functor.

Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:

  • (item: dtype) -> dtype: Each thread contributes a single item to the reduction.

  • (items: numba.types.Array) -> dtype: Each thread contributes an array of items to the reduction. The array must contain at least items_per_thread items; only the first items_per_thread items will be included in the reduction.

  • (item: dtype, num_valid: int) -> dtype: The first num_valid threads contribute a single item to the reduction. The items contributed by all other threads are ignored.

Parameters
  • dtype – Data type being reduced

  • threads_per_block – The number of threads in a block

  • binary_op – Binary reduction function

  • items_per_thread – The number of items each thread contributes to the reduction

  • methods – A dict of methods for user-defined types

Warning

The return value is undefined in threads other than thread0.

Example

The code snippet below illustrates a max reduction of 128 integer items that are partitioned across 128 threads.

import numba
import numpy as np
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)
def op(a, b):
    return a if a > b else b

threads_per_block = 128
block_reduce = cudax.block.reduce(numba.int32, threads_per_block, op)

@cuda.jit(link=block_reduce.files)
def kernel(input, output):
    block_output = block_reduce(input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = block_output

Suppose the set of inputs across the block of threads is { 0, 1, 2, 3, ..., 127 }. The corresponding output in thread0 will be 127.
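
The (items: numba.types.Array) and (item, num_valid) overloads listed above follow the same pattern. The snippet below sketches the num_valid form by reusing block_reduce from the example; the extra kernel parameter is ours and is only illustrative:

@cuda.jit(link=block_reduce.files)
def partial_kernel(input, output, num_valid):
    # Only the first num_valid threads contribute to the reduction
    block_output = block_reduce(input[cuda.threadIdx.x], num_valid)

    if cuda.threadIdx.x == 0:
        output[0] = block_output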

cuda.cooperative.experimental.block.store(dtype, threads_per_block, items_per_thread=1, algorithm='direct')

Creates an operation that performs a block-wide store.

Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:

  • (dest: numba.types.Array, src: numba.types.Array) -> None: Each thread stores items_per_thread items from src into dest. src must contain at least items_per_thread items.

Different data movement strategies can be selected via the algorithm parameter:

  • algorithm=”direct” (default): A blocked arrangement of data is written directly to memory.

  • algorithm=”striped”: A striped arrangement of data is written directly to memory.

  • algorithm=”vectorize”: A blocked arrangement of data is written directly to memory using CUDA’s built-in vectorized stores as a coalescing optimization.

  • algorithm=”transpose”: A blocked arrangement is locally transposed into a striped arrangement which is then written to memory.

  • algorithm=”warp_transpose”: A blocked arrangement is locally transposed into a warp-striped arrangement which is then written to memory.

  • algorithm=”warp_transpose_timesliced”: A blocked arrangement is locally transposed into a warp-striped arrangement which is then written to memory. To reduce the shared memory requirement, only one warp’s worth of shared memory is provisioned and is subsequently time-sliced among warps.

For more details, [read the corresponding CUB C++ documentation](https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockStore.html).

Parameters
  • dtype – Data type being stored

  • threads_per_block – The number of threads in a block, either an integer or a tuple of 2 or 3 integers

  • items_per_thread – The number of items each thread stores

  • algorithm – The data movement algorithm to use

Example

The code snippet below illustrates a striped load and store of 128 integer items by 32 threads, with each thread handling 4 integers.

import numba
import numpy as np
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)
threads_per_block = 32
items_per_thread = 4
block_load = cudax.block.load(
    numba.int32, threads_per_block, items_per_thread, "striped"
)
block_store = cudax.block.store(
    numba.int32, threads_per_block, items_per_thread, "striped"
)

@cuda.jit(link=block_load.files + block_store.files)
def kernel(input, output):
    tmp = cuda.local.array(items_per_thread, numba.int32)
    block_load(input, tmp)
    block_store(output, tmp)

cuda.cooperative.experimental.block.sum(dtype, threads_per_block, items_per_thread=1, methods=None)

Creates an operation that computes a block-wide reduction for thread0 using addition (+) as the reduction operator.

Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:

  • (item: dtype) -> dtype: Each thread contributes a single item to the reduction.

  • (items: numba.types.Array) -> dtype: Each thread contributes an array of items to the reduction. The array must contain at least items_per_thread items; only the first items_per_thread items will be included in the reduction.

  • (item: dtype, num_valid: int) -> dtype: The first num_valid threads contribute a single item to the reduction. The items contributed by all other threads are ignored.

Parameters
  • dtype – Data type being reduced

  • threads_per_block – The number of threads in a block

  • items_per_thread – The number of items each thread owns

  • methods – A dict of methods for user-defined types

Warning

The return value is undefined in threads other than thread0.

Example

The code snippet below illustrates a sum of 128 integer items that are partitioned across 128 threads.

import numba
import numpy as np
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)
threads_per_block = 128
block_sum = cudax.block.sum(numba.int32, threads_per_block)

@cuda.jit(link=block_sum.files)
def kernel(input, output):
    block_output = block_sum(input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = block_output

Suppose the set of inputs across the block of threads is { 1, 1, 1, 1, ..., 1 }. The corresponding output in thread0 will be 128.
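
Finally, a host-side launch for this example, given as a sketch that is not part of the original documentation:

d_input = cuda.to_device(np.ones(threads_per_block, dtype=np.int32))
d_output = cuda.device_array(1, dtype=np.int32)
kernel[1, threads_per_block](d_input, d_output)

# thread0 wrote the block-wide sum of 128 ones
assert d_output.copy_to_host()[0] == threads_per_block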