cuda.coop API Reference#

Warning

cuda.coop is in public beta. The API is subject to change without notice.

cuda.coop.warp.make_exclusive_sum(dtype, threads_in_warp=32)#

Creates an exclusive warp-wide prefix sum primitive using addition (+) as the scan operator. The initial value 0 is assigned to the output in lane 0.

Example

The code snippet below illustrates an exclusive prefix sum of 32 integer items:

numba.config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

The following snippet shows how to invoke the returned warp_exclusive_sum primitive:

# Specialize exclusive sum for a warp of threads
warp_exclusive_sum = coop.warp.make_exclusive_sum(numba.int32)

# Link the exclusive sum to a CUDA kernel
@cuda.jit(link=warp_exclusive_sum.files)
def kernel(data):
    # Collectively compute the warp-wide exclusive prefix sum
    data[cuda.threadIdx.x] = warp_exclusive_sum(data[cuda.threadIdx.x])

Suppose the set of inputs across the warp of threads is { 1, 1, 1, ..., 1 }, one item per thread. The corresponding output will be { 0, 1, 2, ..., 31 }.
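The effect of the primitive can be modeled on the host with NumPy (a reference sketch of the semantics, not the device implementation):

```python
import numpy as np

def warp_exclusive_sum_ref(values):
    # Exclusive prefix sum: lane i receives the sum of lanes 0..i-1;
    # lane 0 receives the initial value 0.
    out = np.zeros_like(values)
    out[1:] = np.cumsum(values)[:-1]
    return out

print(warp_exclusive_sum_ref(np.ones(32, dtype=np.int32))[:4].tolist())  # [0, 1, 2, 3]
```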

Parameters:
  • dtype – Data type being scanned

  • threads_in_warp – The number of threads in a warp

Returns:

A callable object that can be linked to and invoked from a CUDA kernel

cuda.coop.warp.make_reduce(dtype, binary_op, threads_in_warp=32, methods=None)#

Creates a warp-wide reduction primitive for lane 0 using the specified binary reduction functor. Each thread contributes one input element.

Warning

The return value is undefined in lanes other than lane 0.

Example

The code snippet below illustrates a max reduction of 32 integer items that are partitioned across a warp of threads.

numba.config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

The following snippet shows how to invoke the returned warp_reduce primitive:

# Define reduction operator
def op(a, b):
    return a if a > b else b

warp_reduce = coop.warp.make_reduce(numba.int32, op)

@cuda.jit(link=warp_reduce.files)
def kernel(input, output):
    warp_output = warp_reduce(input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = warp_output

Suppose the set of inputs across the warp of threads is { 0, 1, 2, 3, ..., 31 }. The corresponding output in lane 0 will be { 31 }.

Parameters:
  • dtype – Data type being reduced

  • binary_op – Binary reduction function

  • threads_in_warp – The number of threads in a warp

  • methods – A dict of methods for user-defined types

Returns:

A callable object that can be linked to and invoked from a CUDA kernel

cuda.coop.warp.make_sum(dtype, threads_in_warp=32)#

Creates a warp-wide reduction primitive for lane 0 using addition (+) as the reduction operator. Each thread contributes one input element.

Warning

The return value is undefined in lanes other than lane 0.

Example

The code snippet below illustrates a reduction of 32 integer items that are partitioned across a warp of threads.

numba.config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

The following snippet shows how to invoke the returned warp_sum primitive:

warp_sum = coop.warp.make_sum(numba.int32)

@cuda.jit(link=warp_sum.files)
def kernel(input, output):
    warp_output = warp_sum(input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = warp_output

Suppose the set of inputs across the warp of threads is { 1, 1, 1, 1, ..., 1 }. The corresponding output in lane 0 will be { 32 }.

Parameters:
  • dtype – Data type being reduced

  • threads_in_warp – The number of threads in a warp

Returns:

A callable object that can be linked to and invoked from a CUDA kernel

cuda.coop.warp.make_merge_sort_keys(
dtype,
items_per_thread,
compare_op,
threads_in_warp=32,
methods=None,
)#

Creates a warp-wide merge sort primitive over a blocked arrangement of keys.

Example

The code snippet below illustrates a sort of 128 integer keys that are partitioned in a blocked arrangement across a warp of 32 threads where each thread owns 4 consecutive keys.

The following snippet shows how to invoke the returned warp_merge_sort primitive:

# Define comparison operator
def compare_op(a, b):
    return a > b

# Specialize merge sort for a warp of threads owning 4 integer items each
items_per_thread = 4
warp_merge_sort = coop.warp.make_merge_sort_keys(
    numba.int32, items_per_thread, compare_op
)

# Link the merge sort to a CUDA kernel
@cuda.jit(link=warp_merge_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)

    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    warp_merge_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]

Suppose the set of input thread_keys across the warp of threads is { [0, 1, 2, 3], [4, 5, 6, 7], ..., [124, 125, 126, 127] }. The corresponding output thread_keys in those threads will be { [127, 126, 125, 124], [123, 122, 121, 120], ..., [3, 2, 1, 0] }.
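The ordering convention of compare_op can be sanity-checked on the host with a comparator-based sort via the standard functools.cmp_to_key (an illustrative model only, not the device code path):

```python
import functools

def compare_op(a, b):
    return a > b  # order larger keys first, i.e. a descending sort

def merge_sort_ref(keys, compare_op):
    # compare_op(a, b) == True means a is ordered before b
    cmp = lambda a, b: -1 if compare_op(a, b) else 1
    return sorted(keys, key=functools.cmp_to_key(cmp))

print(merge_sort_ref([0, 1, 2, 3], compare_op))  # [3, 2, 1, 0]
```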

Parameters:
  • dtype – Numba data type of the keys to be sorted

  • items_per_thread – The number of items each thread owns

  • compare_op – Comparison function object. Returns true if the first argument is ordered before the second.

  • threads_in_warp – The number of threads in a warp

  • methods – A dict of methods for user-defined types

Returns:

A callable object that can be linked to and invoked from a CUDA kernel

class cuda.coop.block.BlockExchangeType(*values)#

Enum representing the type of block exchange operation. Currently only StripedToBlocked is supported.

StripedToBlocked = 1#

cuda.coop.block.make_exchange(
block_exchange_type: BlockExchangeType,
dtype: str | type | np.number | np.dtype | numba.types.Type,
threads_per_block: dim3 | int | Tuple[int, int] | Tuple[int, int, int],
items_per_thread: int,
warp_time_slicing: bool = False,
methods: dict = None,
)#

Creates a block-wide exchange primitive for rearranging data partitioned across CUDA thread blocks.

Example

The snippet below shows how to create and invoke the returned block_exchange primitive for striped-to-blocked exchange.

block_exchange = coop.block.make_exchange(
    coop.block.BlockExchangeType.StripedToBlocked,
    dtype=numba.int32,
    threads_per_block=128,
    items_per_thread=4,
)
temp_storage_bytes = block_exchange.temp_storage_bytes

@cuda.jit(link=block_exchange.files)
def kernel(thread_data):
    temp_storage = cuda.shared.array(
        shape=temp_storage_bytes,
        dtype=numba.uint8,
    )
    block_exchange(temp_storage, thread_data)
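The striped-to-blocked re-partitioning can be modeled on the host with NumPy using toy sizes (a sketch of the arrangement semantics, not the device implementation):

```python
import numpy as np

T, I = 4, 3  # toy threads_per_block, items_per_thread

logical = np.arange(T * I)

# Striped arrangement: thread t owns logical[t], logical[t + T], logical[t + 2*T], ...
striped = np.stack([logical[t::T] for t in range(T)])

# StripedToBlocked re-partitions the same logical sequence so that
# thread t ends up owning I consecutive elements.
blocked = striped.T.ravel().reshape(T, I)

print(blocked[1].tolist())  # [3, 4, 5]
```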

Parameters:
  • block_exchange_type – Exchange mode to perform. Currently, only StripedToBlocked is supported.

  • dtype (cuda.coop._typing.DtypeType) – Data type of input and output values.

  • threads_per_block (cuda.coop._typing.DimType) – Number of threads in the block.

  • items_per_thread (int) – Number of items owned by each thread.

  • warp_time_slicing (bool, optional) – Whether to use warp time-slicing. If true, shared memory usage is reduced at the expense of parallelism.

  • methods (dict, optional) – Optional method dictionary for user-defined types.

Returns:

A cuda.coop._types.Invocable object representing the specialized primitive, callable from a Numba JIT’d CUDA kernel.

cuda.coop.block.make_exclusive_scan(
dtype: str | type | np.number | np.dtype | numba.types.Type,
threads_per_block: dim3 | int | Tuple[int, int] | Tuple[int, int, int],
scan_op: Literal['add', 'plus', 'mul', 'multiplies', 'min', 'minimum', 'max', 'maximum', 'bit_and', 'bit_or', 'bit_xor'] | Literal['+', '*', '&', '|', '^'] | Callable[[numba.types.Number, numba.types.Number], numba.types.Number] | Callable[[np.ndarray, np.ndarray], np.ndarray] | Callable[[np.number, np.number], np.number],
items_per_thread: int,
initial_value: Any = None,
prefix_op: Callable = None,
algorithm: Literal['raking', 'raking_memoize', 'warp_scans'] = 'raking',
methods: dict = None,
) Callable#

Creates an exclusive block-wide prefix scan primitive with the specified scan operator.

Example

The snippet below shows how to create and invoke the returned block_exclusive_scan primitive.

block_exclusive_scan = coop.block.make_exclusive_scan(
    dtype=numba.int32,
    threads_per_block=128,
    scan_op="max",
    items_per_thread=4,
)

@cuda.jit(link=block_exclusive_scan.files)
def kernel(thread_data):
    block_exclusive_scan(thread_data, thread_data)
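For scan_op="max", the exclusive scan semantics can be modeled on the host; the value the first element receives without an explicit initial_value is assumed here to be the operator's identity (the minimum representable int32 for a max scan):

```python
import numpy as np

def exclusive_max_scan_ref(values):
    # out[i] = max(values[0..i-1]); out[0] gets the assumed identity for "max".
    out = np.empty_like(values)
    out[0] = np.iinfo(values.dtype).min
    out[1:] = np.maximum.accumulate(values)[:-1]
    return out

print(exclusive_max_scan_ref(np.array([2, 5, 1, 7], dtype=np.int32))[1:].tolist())  # [2, 5, 5]
```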

Parameters:
  • dtype (DtypeType) – Data type of the input and output values.

  • threads_per_block (DimType) – Number of threads in the block.

  • scan_op (ScanOpType) – Scan operator.

  • items_per_thread (int, optional) – Number of items owned by each thread.

  • initial_value (Any, optional) – Optional initial value when supported.

  • prefix_op (Callable, optional) – Optional block prefix callback operator.

  • algorithm (Literal["raking", "raking_memoize", "warp_scans"], optional) – Scan algorithm.

  • methods (dict, optional) – Optional method dictionary for user-defined types.

Returns:

Callable primitive object for exclusive prefix scan.

Return type:

Callable

cuda.coop.block.make_exclusive_sum(
dtype: str | type | np.number | np.dtype | numba.types.Type,
threads_per_block: dim3 | int | Tuple[int, int] | Tuple[int, int, int],
items_per_thread: int,
prefix_op: Callable = None,
algorithm: Literal['raking', 'raking_memoize', 'warp_scans'] = 'raking',
methods: dict = None,
) Callable#

Creates an exclusive block-wide prefix sum primitive using addition (+) as the scan operator.

Example

The code snippet below illustrates an exclusive prefix sum of 512 integer items in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.

numba.config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

The following snippet shows how to invoke the returned block_exclusive_sum primitive:

items_per_thread = 4
threads_per_block = 128

# Specialize exclusive sum for a 1D block of 128 threads owning 4 integer items each
block_exclusive_sum = coop.block.make_exclusive_sum(
    numba.int32, threads_per_block, items_per_thread
)

# Link the exclusive sum to a CUDA kernel
@cuda.jit(link=block_exclusive_sum.files)
def kernel(data):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_data = cuda.local.array(shape=items_per_thread, dtype=numba.int32)
    for i in range(items_per_thread):
        thread_data[i] = data[cuda.threadIdx.x * items_per_thread + i]

    # Collectively compute the block-wide exclusive prefix sum
    block_exclusive_sum(thread_data, thread_data)

    # Copy the scanned items back to the output
    for i in range(items_per_thread):
        data[cuda.threadIdx.x * items_per_thread + i] = thread_data[i]

Suppose the set of input thread_data across the block of threads is { [1, 1, 1, 1], [1, 1, 1, 1], ..., [1, 1, 1, 1] }.

The corresponding output thread_data in those threads will be { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }.

Parameters:
  • dtype (DtypeType) – Data type of the input and output values.

  • threads_per_block (DimType) – Number of threads in the block.

  • items_per_thread (int, optional) – Number of items owned by each thread.

  • prefix_op (Callable, optional) – Optional block prefix callback operator.

  • algorithm (Literal["raking", "raking_memoize", "warp_scans"], optional) – Scan algorithm.

  • methods (dict, optional) – Optional method dictionary for user-defined types.

Returns:

Callable primitive object for exclusive prefix sum.

Return type:

Callable

cuda.coop.block.make_inclusive_scan(
dtype: str | type | np.number | np.dtype | numba.types.Type,
threads_per_block: dim3 | int | Tuple[int, int] | Tuple[int, int, int],
scan_op: Literal['add', 'plus', 'mul', 'multiplies', 'min', 'minimum', 'max', 'maximum', 'bit_and', 'bit_or', 'bit_xor'] | Literal['+', '*', '&', '|', '^'] | Callable[[numba.types.Number, numba.types.Number], numba.types.Number] | Callable[[np.ndarray, np.ndarray], np.ndarray] | Callable[[np.number, np.number], np.number],
items_per_thread: int,
initial_value: Any = None,
prefix_op: Callable = None,
algorithm: Literal['raking', 'raking_memoize', 'warp_scans'] = 'raking',
methods: dict = None,
) Callable#

Creates an inclusive block-wide prefix scan primitive with the specified scan operator.

Example

The snippet below shows how to create and invoke the returned block_inclusive_scan primitive.

block_inclusive_scan = coop.block.make_inclusive_scan(
    dtype=numba.int32,
    threads_per_block=128,
    scan_op="min",
    items_per_thread=4,
)

@cuda.jit(link=block_inclusive_scan.files)
def kernel(thread_data):
    block_inclusive_scan(thread_data, thread_data)
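For scan_op="min", the inclusive scan semantics can be modeled on the host with NumPy (a reference sketch, not the device implementation):

```python
import numpy as np

def inclusive_min_scan_ref(values):
    # out[i] = min(values[0..i]) — each element includes its own input.
    return np.minimum.accumulate(values)

print(inclusive_min_scan_ref(np.array([5, 3, 4, 1], dtype=np.int32)).tolist())  # [5, 3, 3, 1]
```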

Parameters:
  • dtype (DtypeType) – Data type of the input and output values.

  • threads_per_block (DimType) – Number of threads in the block.

  • scan_op (ScanOpType) – Scan operator.

  • items_per_thread (int, optional) – Number of items owned by each thread.

  • initial_value (Any, optional) – Optional initial value when supported.

  • prefix_op (Callable, optional) – Optional block prefix callback operator.

  • algorithm (Literal["raking", "raking_memoize", "warp_scans"], optional) – Scan algorithm.

  • methods (dict, optional) – Optional method dictionary for user-defined types.

Returns:

Callable primitive object for inclusive prefix scan.

Return type:

Callable

cuda.coop.block.make_inclusive_sum(
dtype: str | type | np.number | np.dtype | numba.types.Type,
threads_per_block: dim3 | int | Tuple[int, int] | Tuple[int, int, int],
items_per_thread: int,
prefix_op: Callable = None,
algorithm: Literal['raking', 'raking_memoize', 'warp_scans'] = 'raking',
methods: dict = None,
) Callable#

Creates an inclusive block-wide prefix sum primitive using addition (+) as the scan operator.

Example

The snippet below shows how to create and invoke the returned block_inclusive_sum primitive.

block_inclusive_sum = coop.block.make_inclusive_sum(
    dtype=numba.int32,
    threads_per_block=128,
    items_per_thread=4,
)

@cuda.jit(link=block_inclusive_sum.files)
def kernel(thread_data):
    block_inclusive_sum(thread_data, thread_data)
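The inclusive prefix sum is equivalent to a running cumulative sum over the block's logical sequence, which can be modeled on the host as:

```python
import numpy as np

def inclusive_sum_ref(values):
    # Inclusive prefix sum: out[i] = values[0] + ... + values[i].
    return np.cumsum(values)

print(inclusive_sum_ref(np.ones(8, dtype=np.int32)).tolist())  # [1, 2, 3, 4, 5, 6, 7, 8]
```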

Parameters:
  • dtype (DtypeType) – Data type of the input and output values.

  • threads_per_block (DimType) – Number of threads in the block.

  • items_per_thread (int, optional) – Number of items owned by each thread.

  • prefix_op (Callable, optional) – Optional block prefix callback operator.

  • algorithm (Literal["raking", "raking_memoize", "warp_scans"], optional) – Scan algorithm.

  • methods (dict, optional) – Optional method dictionary for user-defined types.

Returns:

Callable primitive object for inclusive prefix sum.

Return type:

Callable

cuda.coop.block.make_load(
dtype,
threads_per_block,
items_per_thread=1,
algorithm='direct',
)#

Creates a block-wide load primitive.

Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:

  • (src: numba.types.Array, dest: numba.types.Array) -> None: Each thread loads items_per_thread items from src into dest. dest must contain at least items_per_thread items.

Different data movement strategies can be selected via the algorithm parameter:

  • algorithm=”direct” (default): Reads blocked data directly.

  • algorithm=”striped”: Reads striped data directly.

  • algorithm=”vectorize”: Reads blocked data directly using CUDA’s built-in vectorized loads as a coalescing optimization.

  • algorithm=”transpose”: Reads striped data and then locally transposes it into a blocked arrangement.

  • algorithm=”warp_transpose”: Reads warp-striped data and then locally transposes it into a blocked arrangement.

  • algorithm=”warp_transpose_timesliced”: Reads warp-striped data and then locally transposes it into a blocked arrangement one warp at a time.

For more details, read the corresponding CUB C++ documentation: https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockLoad.html
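The difference between the blocked ("direct") and striped arrangements comes down to the index each thread reads from the source array; a quick sketch with toy sizes:

```python
T, I = 4, 3  # toy threads_per_block, items_per_thread
tid = 2      # a sample thread index

blocked_idx = [tid * I + i for i in range(I)]  # "direct": I consecutive items
striped_idx = [tid + i * T for i in range(I)]  # "striped": items a block-stride apart

print(blocked_idx)  # [6, 7, 8]
print(striped_idx)  # [2, 6, 10]
```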

Parameters:
  • dtype – Data type being loaded

  • threads_per_block – Number of threads in a block. Can be an integer or a tuple of 2 or 3 integers.

  • items_per_thread – The number of items each thread loads

  • algorithm – The data movement algorithm to use

Example

The code snippet below illustrates a striped load and store of 128 integer items by 32 threads, with each thread handling 4 integers.

import numba
import numpy as np
from numba import cuda

from cuda import coop

The following snippet shows how to invoke the returned block_load and block_store primitives:

threads_per_block = 32
items_per_thread = 4
block_load = coop.block.make_load(
    numba.int32, threads_per_block, items_per_thread, "striped"
)
block_store = coop.block.make_store(
    numba.int32, threads_per_block, items_per_thread, "striped"
)

@cuda.jit(link=block_load.files + block_store.files)
def kernel(input, output):
    tmp = cuda.local.array(items_per_thread, numba.int32)
    block_load(input, tmp)
    block_store(output, tmp)

cuda.coop.block.make_merge_sort_keys(
dtype: str | type | np.dtype | numba.types.Type,
threads_per_block: int,
items_per_thread: int,
compare_op: Callable,
methods: Literal['construct', 'assign'] = None,
)#

Creates a block-wide merge sort primitive over a blocked arrangement of keys.

Example

The code snippet below illustrates a sort of 512 integer keys partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys.

numba.config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

The following snippet shows how to invoke the returned block_merge_sort primitive:

# Define comparison operator
def compare_op(a, b):
    return a > b

# Specialize merge sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_merge_sort = coop.block.make_merge_sort_keys(
    numba.int32, threads_per_block, items_per_thread, compare_op
)

# Link the merge sort to a CUDA kernel
@cuda.jit(link=block_merge_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)

    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_merge_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]

Suppose the set of input thread_keys across the block of threads is { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }. The corresponding output thread_keys in those threads will be { [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }.

Parameters:
  • dtype – Numba data type of the keys to be sorted

  • threads_per_block – The number of threads in a block, either an integer or a tuple of 2 or 3 integers

  • items_per_thread – The number of items each thread owns

  • compare_op – Comparison function object. Returns true if the first argument is ordered before the second.

  • methods – A dict of methods for user-defined types

Returns:

A callable object that can be linked to and invoked from a CUDA kernel

cuda.coop.block.make_radix_sort_keys(dtype, threads_per_block, items_per_thread)#

Creates an ascending block-wide radix sort primitive over a blocked arrangement of keys.

Example

The code snippet below illustrates a sort of 512 integer keys partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys.

numba.config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

The following snippet shows how to invoke the returned block_radix_sort primitive:

# Specialize radix sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_radix_sort = coop.block.make_radix_sort_keys(
    numba.int32, threads_per_block, items_per_thread
)

# Link the radix sort to a CUDA kernel
@cuda.jit(link=block_radix_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)

    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_radix_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]

Suppose the set of input thread_keys across the block of threads is { [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }. The corresponding output thread_keys in those threads will be { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }.

Parameters:
  • dtype – Data type of the keys to be sorted

  • threads_per_block – The number of threads in a block, either an integer or a tuple of 2 or 3 integers

  • items_per_thread – The number of items each thread owns

Returns:

A callable object that can be linked to and invoked from a CUDA kernel

cuda.coop.block.make_radix_sort_keys_descending(
dtype,
threads_per_block,
items_per_thread,
)#

Creates a descending block-wide radix sort primitive over a blocked arrangement of keys.

Example

The code snippet below illustrates a sort of 512 integer keys partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys.

numba.config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

The following snippet shows how to invoke the returned block_radix_sort primitive:

# Specialize radix sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_radix_sort = coop.block.make_radix_sort_keys_descending(
    numba.int32, threads_per_block, items_per_thread
)

# Link the radix sort to a CUDA kernel
@cuda.jit(link=block_radix_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)

    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_radix_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]

Suppose the set of input thread_keys across the block of threads is { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }. The corresponding output thread_keys in those threads will be { [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }.

Parameters:
  • dtype – Data type of the keys to be sorted

  • threads_per_block – The number of threads in a block, either an integer or a tuple of 2 or 3 integers

  • items_per_thread – The number of items each thread owns

Returns:

A callable object that can be linked to and invoked from a CUDA kernel

cuda.coop.block.make_reduce(
dtype,
threads_per_block,
binary_op,
items_per_thread=1,
algorithm='warp_reductions',
methods=None,
)#

Creates a block-wide reduction primitive for thread 0 using the specified binary reduction functor.

Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:

  • (item: dtype) -> dtype: Each thread contributes a single item to the reduction.

  • (items: numba.types.Array) -> dtype: Each thread contributes an array of items to the reduction. The array must contain at least items_per_thread items; only the first items_per_thread items are included in the reduction.

  • (item: dtype, num_valid: int) -> dtype: The first num_valid threads contribute a single item to the reduction. Items from all other threads are ignored.
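The three overloads can be modeled on the host as follows (a sketch of the semantics, not the device implementation); in particular, the num_valid variant reduces only the items contributed by the first num_valid threads:

```python
from functools import reduce

def block_reduce_ref(items, binary_op, num_valid=None):
    # Model of the (item, num_valid) overload: only the first num_valid
    # threads' items participate; items from all other threads are ignored.
    items = list(items)
    if num_valid is not None:
        items = items[:num_valid]
    return reduce(binary_op, items)

print(block_reduce_ref(range(128), max))                # 127
print(block_reduce_ref(range(128), max, num_valid=16))  # 15
```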

Parameters:
  • dtype – Data type being reduced

  • threads_per_block – Number of threads in a block. Can be an integer or a tuple of 2 or 3 integers.

  • binary_op – Binary reduction function

  • items_per_thread – The number of items each thread contributes to the reduction

  • algorithm – Algorithm to use for the reduction (one of “raking”, “raking_commutative_only”, “warp_reductions”)

  • methods – A dict of methods for user-defined types

Warning

The return value is undefined in threads other than thread 0.

Example

The code snippet below illustrates a max reduction of 128 integer items that are partitioned across 128 threads.

import numba
import numpy as np
from numba import cuda

from cuda import coop

numba.config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

The following snippet shows how to invoke the returned block_reduce primitive:

def op(a, b):
    return a if a > b else b

threads_per_block = 128
block_reduce = coop.block.make_reduce(numba.int32, threads_per_block, op)

@cuda.jit(link=block_reduce.files)
def kernel(input, output):
    block_output = block_reduce(input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = block_output

Suppose the set of inputs across the block of threads is { 0, 1, 2, 3, ..., 127 }. The output in thread 0 is { 127 }.

cuda.coop.block.make_scan(
dtype: str | type | np.number | np.dtype | numba.types.Type,
threads_per_block: dim3 | int | Tuple[int, int] | Tuple[int, int, int],
items_per_thread: int,
initial_value: Any = None,
mode: Literal['exclusive', 'inclusive'] = 'exclusive',
scan_op: Literal['add', 'plus', 'mul', 'multiplies', 'min', 'minimum', 'max', 'maximum', 'bit_and', 'bit_or', 'bit_xor'] | Literal['+', '*', '&', '|', '^'] | Callable[[numba.types.Number, numba.types.Number], numba.types.Number] | Callable[[np.ndarray, np.ndarray], np.ndarray] | Callable[[np.number, np.number], np.number] = '+',
block_prefix_callback_op: Callable = None,
algorithm: Literal['raking', 'raking_memoize', 'warp_scans'] = 'raking',
methods: dict = None,
) Callable#

Creates a block-wide prefix scan primitive based on CUB’s BlockScan APIs.

The returned primitive is callable from a Numba CUDA kernel and supports sum and generic scan operators in inclusive and exclusive modes.

Example

The snippet below creates a scan primitive and invokes the returned block_scan primitive inside a kernel.

block_scan = coop.block.make_scan(
    dtype=numba.int32,
    threads_per_block=128,
    items_per_thread=4,
    mode="exclusive",
    scan_op="+",
)

@cuda.jit(link=block_scan.files)
def kernel(input_arr, output_arr):
    tid = cuda.threadIdx.x
    thread_data = cuda.local.array(4, dtype=numba.int32)
    for i in range(4):
        thread_data[i] = input_arr[tid * 4 + i]
    block_scan(thread_data, thread_data)
    for i in range(4):
        output_arr[tid * 4 + i] = thread_data[i]
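The exclusive/inclusive distinction for a "+" scan can be modeled on the host over the block's logical sequence (a reference sketch, not the device implementation):

```python
import numpy as np

def scan_ref(values, mode="exclusive"):
    inclusive = np.cumsum(values)
    if mode == "inclusive":
        return inclusive                 # out[i] includes values[i]
    exclusive = np.zeros_like(values)
    exclusive[1:] = inclusive[:-1]       # out[i] = sum of values[0..i-1]
    return exclusive

vals = np.array([1, 2, 3, 4], dtype=np.int32)
print(scan_ref(vals, "exclusive").tolist())  # [0, 1, 3, 6]
print(scan_ref(vals, "inclusive").tolist())  # [1, 3, 6, 10]
```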

Parameters:
  • dtype (DtypeType) – Data type of the input and output values.

  • threads_per_block (DimType) – Number of threads in the block. Can be an integer for 1D blocks or a tuple of two or three integers for 2D and 3D blocks.

  • items_per_thread (int, optional) – Number of items owned by each thread. Must be greater than or equal to 1.

  • initial_value (Any, optional) – Optional initial value for scan variants that support it.

  • mode (Literal["exclusive", "inclusive"], optional) – Scan mode. Must be "exclusive" or "inclusive".

  • scan_op (ScanOpType, optional) – Scan operator. The default is "+".

  • block_prefix_callback_op (Callable, optional) – Optional block prefix callback operator invoked by the first warp.

  • algorithm (Literal["raking", "raking_memoize", "warp_scans"], optional) – Scan algorithm. Must be "raking", "raking_memoize", or "warp_scans".

  • methods (dict, optional) – Optional method dictionary used for user-defined types.

Raises:
  • ValueError – If algorithm is unsupported.

  • ValueError – If items_per_thread < 1.

  • ValueError – If mode is not "exclusive" or "inclusive".

  • ValueError – If scan_op is unsupported.

  • ValueError – If initial_value is provided for sum scans.

  • ValueError – If initial_value is used with inclusive scans and items_per_thread == 1.

  • ValueError – If initial_value is used with exclusive scans and items_per_thread == 1 while block_prefix_callback_op is provided.

  • ValueError – If an initial value is required but cannot be inferred from dtype.

Returns:

Callable primitive object that can be linked to and invoked from a CUDA kernel.

Return type:

Callable

cuda.coop.block.make_store(
dtype,
threads_per_block,
items_per_thread=1,
algorithm='direct',
)#

Creates a block-wide store primitive.

Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:

  • (dest: numba.types.Array, src: numba.types.Array) -> None: Each thread stores items_per_thread items from src into dest. src must contain at least items_per_thread items.

Different data movement strategies can be selected via the algorithm parameter:

  • algorithm=”direct” (default): Writes blocked data directly.

  • algorithm=”striped”: Writes striped data directly.

  • algorithm=”vectorize”: Writes blocked data directly using CUDA’s built-in vectorized stores as a coalescing optimization.

  • algorithm=”transpose”: Locally transposes blocked data into a striped arrangement before writing to memory.

  • algorithm=”warp_transpose”: Locally transposes blocked data into a warp-striped arrangement before writing to memory.

  • algorithm=”warp_transpose_timesliced”: Locally transposes blocked data into a warp-striped arrangement before writing to memory. To reduce shared memory requirements, only one warp’s worth of shared memory is provisioned and time-sliced across warps.

For more details, read the corresponding CUB C++ documentation: https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockStore.html

Parameters:
  • dtype – Data type being stored

  • threads_per_block – Number of threads in a block. Can be an integer or a tuple of 2 or 3 integers.

  • items_per_thread – The number of items each thread stores

  • algorithm – The data movement algorithm to use

Example

The code snippet below illustrates a striped load and store of 128 integer items by 32 threads, with each thread handling 4 integers.

import numba
import numpy as np
from numba import cuda

from cuda import coop

The following snippet shows how to invoke the returned block_load and block_store primitives:

threads_per_block = 32
items_per_thread = 4
block_load = coop.block.make_load(
    numba.int32, threads_per_block, items_per_thread, "striped"
)
block_store = coop.block.make_store(
    numba.int32, threads_per_block, items_per_thread, "striped"
)

@cuda.jit(link=block_load.files + block_store.files)
def kernel(input, output):
    tmp = cuda.local.array(items_per_thread, numba.int32)
    block_load(input, tmp)
    block_store(output, tmp)

cuda.coop.block.make_sum(
dtype,
threads_per_block,
items_per_thread=1,
algorithm='warp_reductions',
methods=None,
)#

Creates a block-wide reduction primitive for thread 0 using addition (+) as the reduction operator.

Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:

  • (item: dtype) -> dtype: Each thread contributes a single item to the reduction.

  • (items: numba.types.Array) -> dtype: Each thread contributes an array of items to the reduction. The array must contain at least items_per_thread items; only the first items_per_thread items are included in the reduction.

  • (item: dtype, num_valid: int) -> dtype: The first num_valid threads contribute a single item to the reduction. Items from all other threads are ignored.

Parameters:
  • dtype – Data type being reduced

  • threads_per_block – Number of threads in a block. Can be an integer or a tuple of 2 or 3 integers.

  • items_per_thread – The number of items each thread owns

  • algorithm – Algorithm to use for the reduction (one of “raking”, “raking_commutative_only”, “warp_reductions”)

  • methods – A dict of methods for user-defined types

Warning

The return value is undefined in threads other than thread 0.

Example

The code snippet below illustrates a sum of 128 integer items partitioned across 128 threads.

import numba
import numpy as np
from numba import cuda

from cuda import coop

numba.config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

The following snippet shows how to invoke the returned block_sum primitive:

threads_per_block = 128
block_sum = coop.block.make_sum(numba.int32, threads_per_block)

@cuda.jit(link=block_sum.files)
def kernel(input, output):
    block_output = block_sum(input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = block_output

Suppose the set of inputs across the block of threads is { 1, 1, 1, 1, ..., 1 }. The output in thread 0 is { 128 }.