CUDA Cooperative

Warning

Python exposure of cooperative algorithms is in public beta. The API is subject to change without notice.

cuda.cooperative.experimental.warp.exclusive_sum(dtype, threads_in_warp=32)

Computes an exclusive warp-wide prefix sum using addition (+) as the scan operator. The value of 0 is applied as the initial value, and is assigned to the output in lane0.

Example

The code snippet below illustrates an exclusive prefix sum of 32 integer items. We start by importing the necessary modules and patching the Numba linker to enable LTO:

import numba
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the exclusive_sum API:

# Specialize exclusive sum for a warp of threads
warp_exclusive_sum = cudax.warp.exclusive_sum(numba.int32)

# Link the exclusive sum to a CUDA kernel
@cuda.jit(link=warp_exclusive_sum.files)
def kernel(data):
    # Collectively compute the warp-wide exclusive prefix sum
    data[cuda.threadIdx.x] = warp_exclusive_sum(data[cuda.threadIdx.x])

Suppose the set of input data across the warp of threads is { 1, 1, 1, ..., 1 }. The corresponding output in those threads will be { 0, 1, 2, ..., 31 }.
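
Continuing the example, the kernel can be launched from the host with a single block containing one warp of 32 threads. The host-side snippet below is a sketch and is not part of the original example; it assumes the kernel and warp_exclusive_sum defined above are in scope.

import numpy as np

# Launch one block that holds a single warp of 32 threads
d_data = cuda.to_device(np.ones(32, dtype=np.int32))
kernel[1, 32](d_data)

# Each lane now holds the exclusive prefix sum of 32 ones: 0, 1, ..., 31
np.testing.assert_array_equal(d_data.copy_to_host(), np.arange(32, dtype=np.int32))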

Parameters
  • dtype – Data type being scanned

  • threads_in_warp – The number of threads in a warp

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.warp.merge_sort_keys(dtype, items_per_thread, compare_op, threads_in_warp=32, methods=None)

Performs a warp-wide merge sort over a blocked arrangement of keys.

Example

The code snippet below illustrates a sort of 128 integer keys that are partitioned in a blocked arrangement across a warp of 32 threads, where each thread owns 4 consecutive keys. We start by importing the necessary modules and patching the Numba linker to enable LTO:

import numba
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the merge_sort_keys API:

# Define comparison operator
def compare_op(a, b):
    return a > b

# Specialize merge sort for a warp of threads owning 4 integer items each
items_per_thread = 4
warp_merge_sort = cudax.warp.merge_sort_keys(
    numba.int32, items_per_thread, compare_op
)

# Link the merge sort to a CUDA kernel
@cuda.jit(link=warp_merge_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)

    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    warp_merge_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]

Suppose the set of input thread_keys across the warp of threads is { [0, 1, 2, 3], [4, 5, 6, 7], ..., [124, 125, 126, 127] }. The corresponding output thread_keys in those threads will be { [127, 126, 125, 124], [123, 122, 121, 120], ..., [3, 2, 1, 0] }.
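
Continuing the example, a host-side launch might look like the sketch below; it is not part of the original example and assumes the kernel defined above is in scope.

import numpy as np

# 128 ascending keys, one block containing a single warp of 32 threads
h_keys = np.arange(128, dtype=np.int32)
d_keys = cuda.to_device(h_keys)
kernel[1, 32](d_keys)

# compare_op orders larger keys first, so the result is descending
np.testing.assert_array_equal(d_keys.copy_to_host(), h_keys[::-1])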

Parameters
  • dtype – Numba data type of the keys to be sorted

  • threads_in_warp – The number of threads in a warp

  • items_per_thread – The number of items each thread owns

  • compare_op – Comparison function object which returns true if the first argument is ordered before the second one

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.warp.reduce(dtype, binary_op, threads_in_warp=32, methods=None)

Computes a warp-wide reduction for lane0 using the specified binary reduction functor. Each thread contributes one input element.

Warning

The return value is undefined in lanes other than lane0.

Example

The code snippet below illustrates a max reduction of 32 integer items that are partitioned across a warp of threads. We start by importing the necessary modules and patching the Numba linker to enable LTO:

import numba
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the reduce API:

# Define the binary reduction operator (max)
def op(a, b):
    return a if a > b else b

# Specialize reduce for a warp of threads
warp_reduce = cudax.warp.reduce(numba.int32, op)

@cuda.jit(link=warp_reduce.files)
def kernel(input, output):
    warp_output = warp_reduce(input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = warp_output

Suppose the set of inputs across the warp of threads is { 0, 1, 2, 3, ..., 31 }. The corresponding output in lane0 will be 31.
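
A matching host-side launch, given as a sketch rather than as part of the original example:

import numpy as np

d_input = cuda.to_device(np.arange(32, dtype=np.int32))
d_output = cuda.device_array(1, dtype=np.int32)
kernel[1, 32](d_input, d_output)

# lane0 wrote the maximum of 0..31
assert d_output.copy_to_host()[0] == 31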

Parameters
  • dtype – Data type being reduced

  • threads_in_warp – The number of threads in a warp

  • binary_op – Binary reduction function

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.warp.sum(dtype, threads_in_warp=32)

Computes a warp-wide reduction for lane0 using addition (+) as the reduction operator. Each thread contributes one input element.

Warning

The return value is undefined in lanes other than lane0.

Example

The code snippet below illustrates a sum of 32 integer items that are partitioned across a warp of threads. We start by importing the necessary modules and patching the Numba linker to enable LTO:

import numba
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the sum API:

warp_sum = cudax.warp.sum(numba.int32)

@cuda.jit(link=warp_sum.files)
def kernel(input, output):
    warp_output = warp_sum(input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = warp_output

Suppose the set of inputs across the warp of threads is { 1, 1, 1, 1, ..., 1 }. The corresponding output in lane0 will be 32.

Parameters
  • dtype – Data type being reduced

  • threads_in_warp – The number of threads in a warp

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.block.exclusive_sum(dtype: Type[numba.types.Number], threads_per_block: int, items_per_thread: int = 1, prefix_op: Optional[Callable] = None, algorithm: Literal['raking', 'raking_memoize', 'warp_scans'] = 'raking') -> Callable

Computes an exclusive block-wide prefix sum.
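
No example accompanies this entry in the original documentation. The snippet below is a minimal sketch modeled on the block.sum example later on this page; the assumption that the returned callable is invoked with one item per thread and returns that thread's exclusive prefix sum is ours, and the imports and linker patching shown in the other examples are assumed to have been done.

# Sketch: specialize an exclusive prefix sum for a 1D block of 128 threads
threads_per_block = 128
block_exclusive_sum = cudax.block.exclusive_sum(numba.int32, threads_per_block)

@cuda.jit(link=block_exclusive_sum.files)
def kernel(data):
    # Assumed invocation: one item in, this thread's exclusive prefix sum out
    data[cuda.threadIdx.x] = block_exclusive_sum(data[cuda.threadIdx.x])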

cuda.cooperative.experimental.block.inclusive_sum(dtype: Type[numba.types.Number], threads_per_block: int, items_per_thread: int = 1, prefix_op: Optional[Callable] = None, algorithm: Literal['raking', 'raking_memoize', 'warp_scans'] = 'raking') -> Callable

Computes an inclusive block-wide prefix sum.
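
Under the same assumptions as the exclusive_sum sketch above, the inclusive variant differs only in the specialized primitive:

# Sketch: specialize an inclusive prefix sum for a 1D block of 128 threads
threads_per_block = 128
block_inclusive_sum = cudax.block.inclusive_sum(numba.int32, threads_per_block)

@cuda.jit(link=block_inclusive_sum.files)
def kernel(data):
    # Assumed invocation: one item in, this thread's inclusive prefix sum out
    data[cuda.threadIdx.x] = block_inclusive_sum(data[cuda.threadIdx.x])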

cuda.cooperative.experimental.block.load(dtype, threads_per_block, items_per_thread=1, algorithm='direct')

Creates an operation that performs a block-wide load.

Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:

  • (src: numba.types.Array, dest: numba.types.Array) -> None: Each thread loads items_per_thread items from src into dest. dest must contain at least items_per_thread items.

Different data movement strategies can be selected via the algorithm parameter:

  • algorithm=”direct” (default): A blocked arrangement of data is read directly from memory.

  • algorithm=”striped”: A striped arrangement of data is read directly from memory.

  • algorithm=”vectorize”: A blocked arrangement of data is read directly from memory using CUDA’s built-in vectorized loads as a coalescing optimization.

  • algorithm=”transpose”: A striped arrangement of data is read directly from memory and is then locally transposed into a blocked arrangement.

  • algorithm=”warp_transpose”: A warp-striped arrangement of data is read directly from memory and is then locally transposed into a blocked arrangement.

  • algorithm=”warp_transpose_timesliced”: A warp-striped arrangement of data is read directly from memory and is then locally transposed into a blocked arrangement one warp at a time.

For more details, [read the corresponding CUB C++ documentation](https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockLoad.html).

Parameters
  • dtype – Data type being loaded

  • threads_per_block – The number of threads in a block, either an integer or a tuple of 2 or 3 integers

  • items_per_thread – The number of items each thread loads

  • algorithm – The data movement algorithm to use

Example

The code snippet below illustrates a striped load and store of 128 integer items by 32 threads, with each thread handling 4 integers.

import numba
import numpy as np
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)
threads_per_block = 32
items_per_thread = 4
block_load = cudax.block.load(
    numba.int32, threads_per_block, items_per_thread, "striped"
)
block_store = cudax.block.store(
    numba.int32, threads_per_block, items_per_thread, "striped"
)

@cuda.jit(link=block_load.files + block_store.files)
def kernel(input, output):
    tmp = cuda.local.array(items_per_thread, numba.int32)
    block_load(input, tmp)
    block_store(output, tmp)
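
The kernel above can be launched from the host as in the following sketch, which is not part of the original example:

d_input = cuda.to_device(np.arange(128, dtype=np.int32))
d_output = cuda.device_array(128, dtype=np.int32)

# One block of 32 threads; each thread moves 4 items in a striped arrangement
kernel[1, threads_per_block](d_input, d_output)

np.testing.assert_array_equal(d_input.copy_to_host(), d_output.copy_to_host())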

cuda.cooperative.experimental.block.merge_sort_keys(dtype, threads_per_block, items_per_thread, compare_op, methods=None)

Performs a block-wide merge sort over a blocked arrangement of keys.

Example

The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads, where each thread owns 4 consecutive keys. We start by importing the necessary modules and patching the Numba linker to enable LTO:

import numba
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the merge_sort_keys API:

# Define comparison operator
def compare_op(a, b):
    return a > b

# Specialize merge sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_merge_sort = cudax.block.merge_sort_keys(
    numba.int32, threads_per_block, items_per_thread, compare_op
)

# Link the merge sort to a CUDA kernel
@cuda.jit(link=block_merge_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)

    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_merge_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]

Suppose the set of input thread_keys across the block of threads is { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }. The corresponding output thread_keys in those threads will be { [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }.

Parameters
  • dtype – Numba data type of the keys to be sorted

  • threads_per_block – The number of threads in a block

  • items_per_thread – The number of items each thread owns

  • compare_op – Comparison function object which returns true if the first argument is ordered before the second one

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.block.radix_sort_keys(dtype, threads_per_block, items_per_thread)

Performs an ascending block-wide radix sort over a blocked arrangement of keys.

Example

The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads, where each thread owns 4 consecutive keys. We start by importing the necessary modules and patching the Numba linker to enable LTO:

import numba
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the radix_sort_keys API:

# Specialize radix sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_radix_sort = cudax.block.radix_sort_keys(
    numba.int32, threads_per_block, items_per_thread
)

# Link the radix sort to a CUDA kernel
@cuda.jit(link=block_radix_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)

    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_radix_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]

Suppose the set of input thread_keys across the block of threads is { [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }. The corresponding output thread_keys in those threads will be { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }.
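
A host-side launch for this example might look like the sketch below (not part of the original example):

import numpy as np

# Sort 512 random keys with a single block of 128 threads
h_keys = np.random.randint(0, 1024, 512).astype(np.int32)
d_keys = cuda.to_device(h_keys)
kernel[1, threads_per_block](d_keys)

np.testing.assert_array_equal(d_keys.copy_to_host(), np.sort(h_keys))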

Parameters
  • dtype – Numba data type of the keys to be sorted

  • threads_per_block – The number of threads in a block

  • items_per_thread – The number of items each thread owns

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.block.radix_sort_keys_descending(dtype, threads_per_block, items_per_thread)

Performs a descending block-wide radix sort over a blocked arrangement of keys.

Example

The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads, where each thread owns 4 consecutive keys. We start by importing the necessary modules and patching the Numba linker to enable LTO:

import numba
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the radix_sort_keys_descending API:

# Specialize radix sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_radix_sort = cudax.block.radix_sort_keys_descending(
    numba.int32, threads_per_block, items_per_thread
)

# Link the radix sort to a CUDA kernel
@cuda.jit(link=block_radix_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)

    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_radix_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]

Suppose the set of input thread_keys across the block of threads is { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }. The corresponding output thread_keys in those threads will be { [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }.

Parameters
  • dtype – Numba data type of the keys to be sorted

  • threads_per_block – The number of threads in a block

  • items_per_thread – The number of items each thread owns

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.block.reduce(dtype, threads_per_block, binary_op, items_per_thread=1, methods=None)

Creates an operation that computes a block-wide reduction for thread0 using the specified binary reduction functor.

Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:

  • (item: dtype) -> dtype: Each thread contributes a single item to the reduction.

  • (items: numba.types.Array) -> dtype: Each thread contributes an array of items to the reduction. The array must contain at least items_per_thread items; only the first items_per_thread items will be included in the reduction.

  • (item: dtype, num_valid: int) -> dtype: The first num_valid threads contribute a single item to the reduction. The items contributed by all other threads are ignored.

Parameters
  • dtype – Data type being reduced

  • threads_per_block – The number of threads in a block

  • binary_op – Binary reduction function

  • items_per_thread – The number of items each thread contributes to the reduction

  • methods – A dict of methods for user-defined types

Warning

The return value is undefined in threads other than thread0.

Example

The code snippet below illustrates a max reduction of 128 integer items that are partitioned across 128 threads.

import numba
import numpy as np
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)
def op(a, b):
    return a if a > b else b

threads_per_block = 128
block_reduce = cudax.block.reduce(numba.int32, threads_per_block, op)

@cuda.jit(link=block_reduce.files)
def kernel(input, output):
    block_output = block_reduce(input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = block_output

Suppose the set of inputs across the block of threads is { 0, 1, 2, 3, ..., 127 }. The corresponding output in thread0 will be 127.
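
The (items: numba.types.Array) and (item, num_valid) overloads listed above follow the same pattern. The snippet below sketches the num_valid form by reusing block_reduce from the example; the extra kernel parameter is ours and is only illustrative:

@cuda.jit(link=block_reduce.files)
def partial_kernel(input, output, num_valid):
    # Only the first num_valid threads contribute to the reduction
    block_output = block_reduce(input[cuda.threadIdx.x], num_valid)

    if cuda.threadIdx.x == 0:
        output[0] = block_output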

cuda.cooperative.experimental.block.store(dtype, threads_per_block, items_per_thread=1, algorithm='direct')

Creates an operation that performs a block-wide store.

Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:

  • (dest: numba.types.Array, src: numba.types.Array) -> None: Each thread stores items_per_thread items from src into dest. src must contain at least items_per_thread items.

Different data movement strategies can be selected via the algorithm parameter:

  • algorithm=”direct” (default): A blocked arrangement of data is written directly to memory.

  • algorithm=”striped”: A striped arrangement of data is written directly to memory.

  • algorithm=”vectorize”: A blocked arrangement of data is written directly to memory using CUDA’s built-in vectorized stores as a coalescing optimization.

  • algorithm=”transpose”: A blocked arrangement is locally transposed into a striped arrangement which is then written to memory.

  • algorithm=”warp_transpose”: A blocked arrangement is locally transposed into a warp-striped arrangement which is then written to memory.

  • algorithm=”warp_transpose_timesliced”: A blocked arrangement is locally transposed into a warp-striped arrangement which is then written to memory. To reduce the shared memory requirement, only one warp’s worth of shared memory is provisioned and is subsequently time-sliced among warps.

For more details, [read the corresponding CUB C++ documentation](https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockStore.html).

Parameters
  • dtype – Data type being stored

  • threads_per_block – The number of threads in a block, either an integer or a tuple of 2 or 3 integers

  • items_per_thread – The number of items each thread stores

  • algorithm – The data movement algorithm to use

Example

The code snippet below illustrates a striped load and store of 128 integer items by 32 threads, with each thread handling 4 integers.

import numba
import numpy as np
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)
threads_per_block = 32
items_per_thread = 4
block_load = cudax.block.load(
    numba.int32, threads_per_block, items_per_thread, "striped"
)
block_store = cudax.block.store(
    numba.int32, threads_per_block, items_per_thread, "striped"
)

@cuda.jit(link=block_load.files + block_store.files)
def kernel(input, output):
    tmp = cuda.local.array(items_per_thread, numba.int32)
    block_load(input, tmp)
    block_store(output, tmp)

cuda.cooperative.experimental.block.sum(dtype, threads_per_block, items_per_thread=1, methods=None)

Creates an operation that computes a block-wide reduction for thread0 using addition (+) as the reduction operator.

Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:

  • (item: dtype) -> dtype: Each thread contributes a single item to the reduction.

  • (items: numba.types.Array) -> dtype: Each thread contributes an array of items to the reduction. The array must contain at least items_per_thread items; only the first items_per_thread items will be included in the reduction.

  • (item: dtype, num_valid: int) -> dtype: The first num_valid threads contribute a single item to the reduction. The items contributed by all other threads are ignored.

Parameters
  • dtype – Data type being reduced

  • threads_per_block – The number of threads in a block

  • items_per_thread – The number of items each thread owns

  • methods – A dict of methods for user-defined types

Warning

The return value is undefined in threads other than thread0.

Example

The code snippet below illustrates a sum of 128 integer items that are partitioned across 128 threads.

import numba
import numpy as np
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)
threads_per_block = 128
block_sum = cudax.block.sum(numba.int32, threads_per_block)

@cuda.jit(link=block_sum.files)
def kernel(input, output):
    block_output = block_sum(input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = block_output

Suppose the set of inputs across the block of threads is { 1, 1, 1, 1, ..., 1 }. The corresponding output in thread0 will be 128.
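
Finally, a host-side launch for this example, given as a sketch that is not part of the original documentation:

d_input = cuda.to_device(np.ones(threads_per_block, dtype=np.int32))
d_output = cuda.device_array(1, dtype=np.int32)
kernel[1, threads_per_block](d_input, d_output)

# thread0 wrote the block-wide sum of 128 ones
assert d_output.copy_to_host()[0] == threads_per_block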