cuda.coop API Reference#
Warning
cuda.coop is in public beta.
The API is subject to change without notice.
- cuda.coop.warp.make_exclusive_sum(dtype, threads_in_warp=32)#
Creates an exclusive warp-wide prefix sum primitive using addition (+) as the scan operator. The value of 0 is applied as the initial value and is assigned to the output in lane 0.
Example
The code snippet below illustrates an exclusive prefix sum of 32 integer items:
The following snippet shows how to invoke the returned `warp_exclusive_sum` primitive:

```python
import numba
from numba import cuda
from cuda import coop

numba.config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

# Specialize exclusive sum for a warp of threads
warp_exclusive_sum = coop.warp.make_exclusive_sum(numba.int32)

# Link the exclusive sum to a CUDA kernel
@cuda.jit(link=warp_exclusive_sum.files)
def kernel(data):
    # Collectively compute the warp-wide exclusive prefix sum
    data[cuda.threadIdx.x] = warp_exclusive_sum(data[cuda.threadIdx.x])
```
Suppose the set of inputs across the warp of threads is `{ 1, 1, 1, ..., 1 }` (one item per lane). The corresponding outputs in those lanes will be `{ 0, 1, 2, ..., 31 }`.

- Parameters:
dtype – Data type being scanned
threads_in_warp – The number of threads in a warp
- Returns:
A callable object that can be linked to and invoked from a CUDA kernel
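As a host-side reference, the exclusive prefix sum semantics can be sketched in plain NumPy. This is only a model of what the primitive computes, not the device implementation; the `exclusive_sum_model` helper is illustrative and not part of cuda.coop:

```python
import numpy as np

def exclusive_sum_model(values):
    """Reference model of a warp-wide exclusive prefix sum.

    Lane i receives the sum of the inputs of lanes 0..i-1,
    and lane 0 receives the initial value 0.
    """
    out = np.zeros_like(values)
    out[1:] = np.cumsum(values)[:-1]
    return out

# One input element per lane of a 32-thread warp
lanes = np.ones(32, dtype=np.int32)
result = exclusive_sum_model(lanes)  # lane i gets i: 0, 1, ..., 31
```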
- cuda.coop.warp.make_reduce(dtype, binary_op, threads_in_warp=32, methods=None)#
Creates a warp-wide reduction primitive for lane 0 using the specified binary reduction functor. Each thread contributes one input element.
Warning
The return value is undefined in lanes other than lane 0.
Example
The code snippet below illustrates a max reduction of 32 integer items that are partitioned across a warp of threads.
The following snippet shows how to invoke the returned `warp_reduce` primitive:

```python
import numba
from numba import cuda
from cuda import coop

numba.config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

# Define the binary reduction operator
def op(a, b):
    return a if a > b else b

# Specialize the max reduction for a warp of threads
warp_reduce = coop.warp.make_reduce(numba.int32, op)

@cuda.jit(link=warp_reduce.files)
def kernel(input, output):
    warp_output = warp_reduce(input[cuda.threadIdx.x])
    if cuda.threadIdx.x == 0:
        output[0] = warp_output
```
Suppose the set of inputs across the warp of threads is `{ 0, 1, 2, 3, ..., 31 }`. The corresponding output in lane 0 will be `{ 31 }`.

- Parameters:
dtype – Data type being reduced
threads_in_warp – The number of threads in a warp
binary_op – Binary reduction function
- Returns:
A callable object that can be linked to and invoked from a CUDA kernel
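The reduction semantics can be modelled on the host as a simple fold over one value per lane; only lane 0 observes the result on the device. The `warp_reduce_model` helper below is an illustrative reference, not part of the API:

```python
from functools import reduce

def warp_reduce_model(values, binary_op):
    """Reference model: fold binary_op over one input value per lane.

    On the device, only lane 0 is guaranteed to receive this result.
    """
    return reduce(binary_op, values)

# Max-reduction over lanes holding 0..31, matching the example above
result = warp_reduce_model(range(32), lambda a, b: a if a > b else b)  # 31
```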
- cuda.coop.warp.make_sum(dtype, threads_in_warp=32)#
Creates a warp-wide reduction primitive for lane 0 using addition (+) as the reduction operator. Each thread contributes one input element.
Warning
The return value is undefined in lanes other than lane 0.
Example
The code snippet below illustrates a reduction of 32 integer items that are partitioned across a warp of threads.
The following snippet shows how to invoke the returned `warp_sum` primitive:

```python
import numba
from numba import cuda
from cuda import coop

numba.config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

warp_sum = coop.warp.make_sum(numba.int32)

@cuda.jit(link=warp_sum.files)
def kernel(input, output):
    warp_output = warp_sum(input[cuda.threadIdx.x])
    if cuda.threadIdx.x == 0:
        output[0] = warp_output
```
Suppose the set of inputs across the warp of threads is `{ 1, 1, 1, 1, ..., 1 }`. The corresponding output in lane 0 will be `{ 32 }`.

- Parameters:
dtype – Data type being reduced
threads_in_warp – The number of threads in a warp
- Returns:
A callable object that can be linked to and invoked from a CUDA kernel
- cuda.coop.warp.make_merge_sort_keys(
- dtype,
- items_per_thread,
- compare_op,
- threads_in_warp=32,
- methods=None,
Creates a warp-wide merge sort primitive over a blocked arrangement of keys.
Example
The code snippet below illustrates a sort of 128 integer keys that are partitioned in a blocked arrangement across a warp of 32 threads where each thread owns 4 consecutive keys.
The following snippet shows how to invoke the returned `warp_merge_sort` primitive:

```python
# Define comparison operator
def compare_op(a, b):
    return a > b

# Specialize merge sort for a warp of threads owning 4 integer items each
items_per_thread = 4
warp_merge_sort = coop.warp.make_merge_sort_keys(
    numba.int32, items_per_thread, compare_op
)

# Link the merge sort to a CUDA kernel
@cuda.jit(link=warp_merge_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)
    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    warp_merge_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]
```
Suppose the set of input `thread_keys` across the warp of threads is `{ [0, 1, 2, 3], [4, 5, 6, 7], ..., [124, 125, 126, 127] }`. The corresponding output `thread_keys` in those threads will be `{ [127, 126, 125, 124], [123, 122, 121, 120], ..., [3, 2, 1, 0] }`.

- Parameters:
dtype – Numba data type of the keys to be sorted
threads_in_warp – The number of threads in a warp
items_per_thread – The number of items each thread owns
compare_op – Comparison function object. Returns true if the first argument is ordered before the second.
- Returns:
A callable object that can be linked to and invoked from a CUDA kernel
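The collective sort over a blocked arrangement can be modelled on the host: the flat key range is ordered by `compare_op`, after which each thread again holds a consecutive segment of the sorted output. The `merge_sort_keys_model` helper is an illustrative sketch, not part of the API:

```python
from functools import cmp_to_key

def merge_sort_keys_model(keys, compare_op):
    """Reference model of a collective merge sort over a blocked arrangement."""
    def cmp(a, b):
        if compare_op(a, b):
            return -1   # a is ordered before b
        if compare_op(b, a):
            return 1
        return 0
    return sorted(keys, key=cmp_to_key(cmp))

# The compare_op from the example (a > b) sorts in descending order
result = merge_sort_keys_model(list(range(128)), lambda a, b: a > b)
# thread 0 would then own result[0:4] == [127, 126, 125, 124]
```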
- class cuda.coop.block.BlockExchangeType(*values)#
Enum representing the type of block exchange operation. Currently only `StripedToBlocked` is supported.

- StripedToBlocked = 1#
- cuda.coop.block.make_exchange(
- block_exchange_type: BlockExchangeType,
- dtype: str | type | np.number | np.dtype | numba.types.Type,
- threads_per_block: dim3 | int | Tuple[int, int] | Tuple[int, int, int],
- items_per_thread: int,
- warp_time_slicing: bool = False,
- methods: dict = None,
Creates a block-wide exchange primitive for rearranging data partitioned across CUDA thread blocks.
Example
The snippet below shows how to create and invoke the returned `block_exchange` primitive for striped-to-blocked exchange.

```python
block_exchange = coop.block.make_exchange(
    coop.block.BlockExchangeType.StripedToBlocked,
    dtype=numba.int32,
    threads_per_block=128,
    items_per_thread=4,
)

temp_storage_bytes = block_exchange.temp_storage_bytes

@cuda.jit(link=block_exchange.files)
def kernel(thread_data):
    temp_storage = cuda.shared.array(
        shape=temp_storage_bytes,
        dtype=numba.uint8,
    )
    block_exchange(temp_storage, thread_data)
```
- Parameters:
block_exchange_type – Exchange mode to perform. Currently, only `StripedToBlocked` is supported.
dtype (`cuda.coop._typing.DtypeType`) – Data type of input and output values.
threads_per_block (`cuda.coop._typing.DimType`) – Number of threads in the block.
items_per_thread (int) – Number of items owned by each thread.
warp_time_slicing (bool, optional) – Whether to use warp time-slicing. If true, shared memory usage is reduced at the expense of parallelism.
methods (dict, optional) – Optional method dictionary for user-defined types.
- Raises:
ValueError – If `block_exchange_type` is not a valid enum value of `BlockExchangeType`.
ValueError – If `items_per_thread` is less than 1.
- Returns:
A `cuda.coop._types.Invocable` object representing the specialized kernel callable from a Numba JIT’d CUDA kernel.
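The effect of a striped-to-blocked exchange can be modelled with a NumPy reshape. Below, row t of `thread_items` stands for thread t's registers; the `striped_to_blocked_model` helper is an illustrative reference, not the cuda.coop API:

```python
import numpy as np

def striped_to_blocked_model(thread_items):
    """Reference model of BlockExchangeType.StripedToBlocked.

    In a striped arrangement, thread t's item i is logical element
    i * threads_per_block + t; after the exchange, thread t holds the
    consecutive logical elements t * items_per_thread + i.
    """
    threads, items = thread_items.shape
    logical = thread_items.T.reshape(-1)     # undo the striped partitioning
    return logical.reshape(threads, items)   # re-partition as blocked

# 4 threads, 2 items each: striped view of logical elements 0..7
striped = np.arange(8).reshape(2, 4).T   # thread 0 holds [0, 4], thread 1 holds [1, 5], ...
blocked = striped_to_blocked_model(striped)
# thread 0 now holds [0, 1] and thread 3 holds [6, 7]
```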
- cuda.coop.block.make_exclusive_scan(
- dtype: str | type | np.number | np.dtype | numba.types.Type,
- threads_per_block: dim3 | int | Tuple[int, int] | Tuple[int, int, int],
- scan_op: Literal['add', 'plus', 'mul', 'multiplies', 'min', 'minimum', 'max', 'maximum', 'bit_and', 'bit_or', 'bit_xor'] | Literal['+', '*', '&', '|', '^'] | Callable[[numba.types.Number, numba.types.Number], numba.types.Number] | Callable[[np.ndarray, np.ndarray], np.ndarray] | Callable[[np.number, np.number], np.number],
- items_per_thread: int,
- initial_value: Any = None,
- prefix_op: Callable = None,
- algorithm: Literal['raking', 'raking_memoize', 'warp_scans'] = 'raking',
- methods: dict = None,
Creates an exclusive block-wide prefix scan primitive with the specified scan operator.
Example
The snippet below shows how to create and invoke the returned `block_exclusive_scan` primitive.

```python
block_exclusive_scan = coop.block.make_exclusive_scan(
    dtype=numba.int32,
    threads_per_block=128,
    scan_op="max",
    items_per_thread=4,
)

@cuda.jit(link=block_exclusive_scan.files)
def kernel(thread_data):
    block_exclusive_scan(thread_data, thread_data)
```
- Parameters:
dtype (DtypeType) – Data type of the input and output values.
threads_per_block (DimType) – Number of threads in the block.
scan_op (ScanOpType) – Scan operator.
items_per_thread (int, optional) – Number of items owned by each thread.
initial_value (Any, optional) – Optional initial value when supported.
prefix_op (Callable, optional) – Optional block prefix callback operator.
algorithm (Literal["raking", "raking_memoize", "warp_scans"], optional) – Scan algorithm.
methods (dict, optional) – Optional method dictionary for user-defined types.
- Returns:
Callable primitive object for exclusive prefix scan.
- Return type:
Callable
- cuda.coop.block.make_exclusive_sum(
- dtype: str | type | np.number | np.dtype | numba.types.Type,
- threads_per_block: dim3 | int | Tuple[int, int] | Tuple[int, int, int],
- items_per_thread: int,
- prefix_op: Callable = None,
- algorithm: Literal['raking', 'raking_memoize', 'warp_scans'] = 'raking',
- methods: dict = None,
Creates an exclusive block-wide prefix sum primitive using addition (+) as the scan operator.
Example
The code snippet below illustrates an exclusive prefix sum of 512 integer items in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
The following snippet shows how to invoke the returned `block_exclusive_sum` primitive:

```python
import numba
from numba import cuda
from cuda import coop

numba.config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

items_per_thread = 4
threads_per_block = 128

# Specialize exclusive sum for a 1D block of 128 threads owning 4 integer items each
block_exclusive_sum = coop.block.make_exclusive_sum(
    numba.int32, threads_per_block, items_per_thread
)

# Link the exclusive sum to a CUDA kernel
@cuda.jit(link=block_exclusive_sum.files)
def kernel(data):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_data = cuda.local.array(shape=items_per_thread, dtype=numba.int32)
    for i in range(items_per_thread):
        thread_data[i] = data[cuda.threadIdx.x * items_per_thread + i]

    # Collectively compute the block-wide exclusive prefix sum
    block_exclusive_sum(thread_data, thread_data)

    # Copy the scanned keys back to the output
    for i in range(items_per_thread):
        data[cuda.threadIdx.x * items_per_thread + i] = thread_data[i]
```
Suppose the set of input `thread_data` across the block of threads is `{ [1, 1, 1, 1], [1, 1, 1, 1], ..., [1, 1, 1, 1] }`. The corresponding output `thread_data` in those threads will be `{ [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }`.

- Parameters:
dtype (DtypeType) – Data type of the input and output values.
threads_per_block (DimType) – Number of threads in the block.
items_per_thread (int, optional) – Number of items owned by each thread.
prefix_op (Callable, optional) – Optional block prefix callback operator.
algorithm (Literal["raking", "raking_memoize", "warp_scans"], optional) – Scan algorithm.
methods (dict, optional) – Optional method dictionary for user-defined types.
- Returns:
Callable primitive object for exclusive prefix sum.
- Return type:
Callable
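The blocked-arrangement scan semantics can be checked on the host with NumPy: the scan runs over the flattened blocked order, thread 0's items first. The `block_exclusive_sum_model` helper is an illustrative reference, not part of the API:

```python
import numpy as np

def block_exclusive_sum_model(thread_data):
    """Reference model of the block-wide exclusive prefix sum.

    thread_data has shape (threads_per_block, items_per_thread); the scan
    runs over the flattened blocked arrangement.
    """
    flat = thread_data.reshape(-1)
    out = np.zeros_like(flat)
    out[1:] = np.cumsum(flat)[:-1]
    return out.reshape(thread_data.shape)

# 128 threads owning 4 ones each, as in the example above
data = np.ones((128, 4), dtype=np.int32)
result = block_exclusive_sum_model(data)
# result[0] == [0, 1, 2, 3] and result[127] == [508, 509, 510, 511]
```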
- cuda.coop.block.make_inclusive_scan(
- dtype: str | type | np.number | np.dtype | numba.types.Type,
- threads_per_block: dim3 | int | Tuple[int, int] | Tuple[int, int, int],
- scan_op: Literal['add', 'plus', 'mul', 'multiplies', 'min', 'minimum', 'max', 'maximum', 'bit_and', 'bit_or', 'bit_xor'] | Literal['+', '*', '&', '|', '^'] | Callable[[numba.types.Number, numba.types.Number], numba.types.Number] | Callable[[np.ndarray, np.ndarray], np.ndarray] | Callable[[np.number, np.number], np.number],
- items_per_thread: int,
- initial_value: Any = None,
- prefix_op: Callable = None,
- algorithm: Literal['raking', 'raking_memoize', 'warp_scans'] = 'raking',
- methods: dict = None,
Creates an inclusive block-wide prefix scan primitive with the specified scan operator.
Example
The snippet below shows how to create and invoke the returned `block_inclusive_scan` primitive.

```python
block_inclusive_scan = coop.block.make_inclusive_scan(
    dtype=numba.int32,
    threads_per_block=128,
    scan_op="min",
    items_per_thread=4,
)

@cuda.jit(link=block_inclusive_scan.files)
def kernel(thread_data):
    block_inclusive_scan(thread_data, thread_data)
```
- Parameters:
dtype (DtypeType) – Data type of the input and output values.
threads_per_block (DimType) – Number of threads in the block.
scan_op (ScanOpType) – Scan operator.
items_per_thread (int, optional) – Number of items owned by each thread.
initial_value (Any, optional) – Optional initial value when supported.
prefix_op (Callable, optional) – Optional block prefix callback operator.
algorithm (Literal["raking", "raking_memoize", "warp_scans"], optional) – Scan algorithm.
methods (dict, optional) – Optional method dictionary for user-defined types.
- Returns:
Callable primitive object for inclusive prefix scan.
- Return type:
Callable
- cuda.coop.block.make_inclusive_sum(
- dtype: str | type | np.number | np.dtype | numba.types.Type,
- threads_per_block: dim3 | int | Tuple[int, int] | Tuple[int, int, int],
- items_per_thread: int,
- prefix_op: Callable = None,
- algorithm: Literal['raking', 'raking_memoize', 'warp_scans'] = 'raking',
- methods: dict = None,
Creates an inclusive block-wide prefix sum primitive using addition (+) as the scan operator.
Example
The snippet below shows how to create and invoke the returned `block_inclusive_sum` primitive.

```python
block_inclusive_sum = coop.block.make_inclusive_sum(
    dtype=numba.int32,
    threads_per_block=128,
    items_per_thread=4,
)

@cuda.jit(link=block_inclusive_sum.files)
def kernel(thread_data):
    block_inclusive_sum(thread_data, thread_data)
```
- Parameters:
dtype (DtypeType) – Data type of the input and output values.
threads_per_block (DimType) – Number of threads in the block.
items_per_thread (int, optional) – Number of items owned by each thread.
prefix_op (Callable, optional) – Optional block prefix callback operator.
algorithm (Literal["raking", "raking_memoize", "warp_scans"], optional) – Scan algorithm.
methods (dict, optional) – Optional method dictionary for user-defined types.
- Returns:
Callable primitive object for inclusive prefix sum.
- Return type:
Callable
- cuda.coop.block.make_load(
- dtype,
- threads_per_block,
- items_per_thread=1,
- algorithm='direct',
Creates a block-wide load primitive.
Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:
(src: numba.types.Array, dest: numba.types.Array) -> None: Each thread loads items_per_thread items from src into dest. dest must contain at least items_per_thread items.
Different data movement strategies can be selected via the algorithm parameter:
algorithm="direct" (default): Reads blocked data directly.
algorithm="striped": Reads striped data directly.
algorithm="vectorize": Reads blocked data directly using CUDA's built-in vectorized loads as a coalescing optimization.
algorithm="transpose": Reads striped data and then locally transposes it into a blocked arrangement.
algorithm="warp_transpose": Reads warp-striped data and then locally transposes it into a blocked arrangement.
algorithm="warp_transpose_timesliced": Reads warp-striped data and then locally transposes it into a blocked arrangement one warp at a time.
For more details, read the corresponding CUB C++ documentation: https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockLoad.html
- Parameters:
dtype – Data type being loaded
threads_per_block – Number of threads in a block. Can be an integer or a tuple of 2 or 3 integers.
items_per_thread – The number of items each thread loads
algorithm – The data movement algorithm to use
Example
The code snippet below illustrates a striped load and store of 128 integer items by 32 threads, with each thread handling 4 integers.
The following snippet shows how to invoke the returned `block_load` and `block_store` primitives:

```python
import numba
import numpy as np
from numba import cuda
from cuda import coop

threads_per_block = 32
items_per_thread = 4

block_load = coop.block.make_load(
    numba.int32, threads_per_block, items_per_thread, "striped"
)
block_store = coop.block.make_store(
    numba.int32, threads_per_block, items_per_thread, "striped"
)

@cuda.jit(link=block_load.files + block_store.files)
def kernel(input, output):
    tmp = cuda.local.array(items_per_thread, numba.int32)
    block_load(input, tmp)
    block_store(output, tmp)
```
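The striped indexing used here can be modelled per thread on the host: item i of thread tid comes from `src[i * threads_per_block + tid]`, and a matching striped store writes each item back to the same position, so the pair round-trips the data. The `striped_load_model` helper is illustrative, not part of the API:

```python
def striped_load_model(src, threads_per_block, items_per_thread, tid):
    """Reference model of make_load(..., "striped") for a single thread.

    Thread tid's item i is read from src[i * threads_per_block + tid].
    """
    return [src[i * threads_per_block + tid] for i in range(items_per_thread)]

src = list(range(128))
# Thread 0 of 32 threads loads elements 0, 32, 64, 96 into its local array
tmp = striped_load_model(src, threads_per_block=32, items_per_thread=4, tid=0)
```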
- cuda.coop.block.make_merge_sort_keys(
- dtype: str | type | np.dtype | numba.types.Type,
- threads_per_block: int,
- items_per_thread: int,
- compare_op: Callable,
- methods: Literal['construct', 'assign'] = None,
Creates a block-wide merge sort primitive over a blocked arrangement of keys.
Example
The code snippet below illustrates a sort of 512 integer keys partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys.
The following snippet shows how to invoke the returned `block_merge_sort` primitive:

```python
import numba
from numba import cuda
from cuda import coop

numba.config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

# Define comparison operator
def compare_op(a, b):
    return a > b

# Specialize merge sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_merge_sort = coop.block.make_merge_sort_keys(
    numba.int32, threads_per_block, items_per_thread, compare_op
)

# Link the merge sort to a CUDA kernel
@cuda.jit(link=block_merge_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)
    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_merge_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]
```
Suppose the set of input `thread_keys` across the block of threads is `{ [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }`. The corresponding output `thread_keys` in those threads will be `{ [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }`.

- Parameters:
dtype – Numba data type of the keys to be sorted
threads_per_block – The number of threads in a block, either an integer or a tuple of 2 or 3 integers
items_per_thread – The number of items each thread owns
compare_op – Comparison function object. Returns true if the first argument is ordered before the second.
- Returns:
A callable object that can be linked to and invoked from a CUDA kernel
- cuda.coop.block.make_radix_sort_keys(dtype, threads_per_block, items_per_thread)#
Creates an ascending block-wide radix sort primitive over a blocked arrangement of keys.
Example
The code snippet below illustrates a sort of 512 integer keys partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys.
The following snippet shows how to invoke the returned `block_radix_sort` primitive:

```python
import numba
from numba import cuda
from cuda import coop

numba.config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

# Specialize radix sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_radix_sort = coop.block.make_radix_sort_keys(
    numba.int32, threads_per_block, items_per_thread
)

# Link the radix sort to a CUDA kernel
@cuda.jit(link=block_radix_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)
    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_radix_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]
```
Suppose the set of input `thread_keys` across the block of threads is `{ [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }`. The corresponding output `thread_keys` in those threads will be `{ [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }`.

- Parameters:
dtype – Data type of the keys to be sorted
threads_per_block – The number of threads in a block, either an integer or a tuple of 2 or 3 integers
items_per_thread – The number of items each thread owns
- Returns:
A callable object that can be linked to and invoked from a CUDA kernel
- cuda.coop.block.make_radix_sort_keys_descending(
- dtype,
- threads_per_block,
- items_per_thread,
Creates a descending block-wide radix sort primitive over a blocked arrangement of keys.
Example
The code snippet below illustrates a sort of 512 integer keys partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys.
The following snippet shows how to invoke the returned `block_radix_sort` primitive:

```python
import numba
from numba import cuda
from cuda import coop

numba.config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

# Specialize radix sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_radix_sort = coop.block.make_radix_sort_keys_descending(
    numba.int32, threads_per_block, items_per_thread
)

# Link the radix sort to a CUDA kernel
@cuda.jit(link=block_radix_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)
    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_radix_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]
```
Suppose the set of input `thread_keys` across the block of threads is `{ [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }`. The corresponding output `thread_keys` in those threads will be `{ [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }`.

- Parameters:
dtype – Data type of the keys to be sorted
threads_per_block – The number of threads in a block, either an integer or a tuple of 2 or 3 integers
items_per_thread – The number of items each thread owns
- Returns:
A callable object that can be linked to and invoked from a CUDA kernel
- cuda.coop.block.make_reduce(
- dtype,
- threads_per_block,
- binary_op,
- items_per_thread=1,
- algorithm='warp_reductions',
- methods=None,
Creates a block-wide reduction primitive for thread 0 using the specified binary reduction functor.
Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:
(item: dtype) -> dtype: Each thread contributes a single item to the reduction.
(items: numba.types.Array) -> dtype: Each thread contributes an array of items to the reduction. The array must contain at least items_per_thread items; only the first items_per_thread items are included in the reduction.
(item: dtype, num_valid: int) -> dtype: The first num_valid threads contribute a single item to the reduction. Items from all other threads are ignored.
- Parameters:
dtype – Data type being reduced
threads_per_block – Number of threads in a block. Can be an integer or a tuple of 2 or 3 integers.
binary_op – Binary reduction function
items_per_thread – The number of items each thread contributes to the reduction
algorithm – Algorithm to use for the reduction (one of “raking”, “raking_commutative_only”, “warp_reductions”)
methods – A dict of methods for user-defined types
Warning
The return value is undefined in threads other than thread 0.
Example
The code snippet below illustrates a max reduction of 128 integer items that are partitioned across 128 threads.
The following snippet shows how to invoke the returned `block_reduce` primitive:

```python
import numba
import numpy as np
from numba import cuda
from cuda import coop

numba.config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

def op(a, b):
    return a if a > b else b

threads_per_block = 128
block_reduce = coop.block.make_reduce(numba.int32, threads_per_block, op)

@cuda.jit(link=block_reduce.files)
def kernel(input, output):
    block_output = block_reduce(input[cuda.threadIdx.x])
    if cuda.threadIdx.x == 0:
        output[0] = block_output
```
Suppose the set of inputs across the block of threads is `{ 0, 1, 2, 3, ..., 127 }`. The output in thread 0 is `{ 127 }`.
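The three invocation signatures above can be modelled on the host as a fold, including the `num_valid` overload where only the first `num_valid` threads contribute. The `block_reduce_model` helper is an illustrative reference, not part of the API:

```python
from functools import reduce

def block_reduce_model(items, binary_op, num_valid=None):
    """Reference model of the block-wide reduction overloads.

    items holds one contribution per thread; when num_valid is given,
    only the first num_valid threads contribute. On the device, only
    thread 0 is guaranteed to receive the result.
    """
    if num_valid is not None:
        items = items[:num_valid]
    return reduce(binary_op, items)

op = lambda a, b: a if a > b else b
full = block_reduce_model(list(range(128)), op)                   # 127
partial = block_reduce_model(list(range(128)), op, num_valid=16)  # 15
```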
- cuda.coop.block.make_scan(
- dtype: str | type | np.number | np.dtype | numba.types.Type,
- threads_per_block: dim3 | int | Tuple[int, int] | Tuple[int, int, int],
- items_per_thread: int,
- initial_value: Any = None,
- mode: Literal['exclusive', 'inclusive'] = 'exclusive',
- scan_op: Literal['add', 'plus', 'mul', 'multiplies', 'min', 'minimum', 'max', 'maximum', 'bit_and', 'bit_or', 'bit_xor'] | Literal['+', '*', '&', '|', '^'] | Callable[[numba.types.Number, numba.types.Number], numba.types.Number] | Callable[[np.ndarray, np.ndarray], np.ndarray] | Callable[[np.number, np.number], np.number] = '+',
- block_prefix_callback_op: Callable = None,
- algorithm: Literal['raking', 'raking_memoize', 'warp_scans'] = 'raking',
- methods: dict = None,
Creates a block-wide prefix scan primitive based on CUB’s BlockScan APIs.
The returned primitive is callable from a Numba CUDA kernel and supports sum and generic scan operators in inclusive and exclusive modes.
Example
The snippet below creates a scan primitive and invokes the returned `block_scan` primitive inside a kernel.

```python
block_scan = coop.block.make_scan(
    dtype=numba.int32,
    threads_per_block=128,
    items_per_thread=4,
    mode="exclusive",
    scan_op="+",
)

@cuda.jit(link=block_scan.files)
def kernel(input_arr, output_arr):
    tid = cuda.threadIdx.x
    thread_data = cuda.local.array(4, dtype=numba.int32)
    for i in range(4):
        thread_data[i] = input_arr[tid * 4 + i]
    block_scan(thread_data, thread_data)
    for i in range(4):
        output_arr[tid * 4 + i] = thread_data[i]
```
- Parameters:
dtype (DtypeType) – Data type of the input and output values.
threads_per_block (DimType) – Number of threads in the block. Can be an integer for 1D blocks or a tuple of two or three integers for 2D and 3D blocks.
items_per_thread (int, optional) – Number of items owned by each thread. Must be greater than or equal to 1.
initial_value (Any, optional) – Optional initial value for scan variants that support it.
mode (Literal["exclusive", "inclusive"], optional) – Scan mode. Must be `"exclusive"` or `"inclusive"`.
scan_op (ScanOpType, optional) – Scan operator. The default is `"+"`.
block_prefix_callback_op (Callable, optional) – Optional block prefix callback operator invoked by the first warp.
algorithm (Literal["raking", "raking_memoize", "warp_scans"], optional) – Scan algorithm. Must be `"raking"`, `"raking_memoize"`, or `"warp_scans"`.
methods (dict, optional) – Optional method dictionary used for user-defined types.
- Raises:
ValueError – If `algorithm` is unsupported.
ValueError – If `items_per_thread < 1`.
ValueError – If `mode` is not `"exclusive"` or `"inclusive"`.
ValueError – If `scan_op` is unsupported.
ValueError – If `initial_value` is provided for sum scans.
ValueError – If `initial_value` is used with inclusive scans and `items_per_thread == 1`.
ValueError – If `initial_value` is used with exclusive scans and `items_per_thread == 1` while `block_prefix_callback_op` is provided.
ValueError – If an initial value is required but cannot be inferred from `dtype`.
- Returns:
Callable primitive object that can be linked to and invoked from a CUDA kernel.
- Return type:
Callable
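The difference between the two modes, and the role of the prefix callback, can be sketched on the host for the `"+"` operator. This is a reference model only; in particular, the `prefix_op` contract modelled here (the callback receives the block-wide aggregate and returns the running prefix to fold in, mirroring CUB's BlockPrefixCallbackOp) is an assumption, and `block_scan_model` is not part of the API:

```python
import numpy as np

def block_scan_model(thread_data, mode="exclusive", prefix_op=None):
    """Reference model of make_scan with the "+" operator.

    The scan runs over the flattened blocked arrangement. prefix_op models
    the block prefix callback: handed the block aggregate, it returns the
    carry-in added to every output (assumed CUB-style contract).
    """
    flat = thread_data.reshape(-1)
    carry = prefix_op(int(flat.sum())) if prefix_op is not None else 0
    inclusive = np.cumsum(flat) + carry
    out = inclusive if mode == "inclusive" else inclusive - flat
    return out.reshape(thread_data.shape)

data = np.ones((128, 4), dtype=np.int64)
excl = block_scan_model(data, mode="exclusive")   # flattens to 0, 1, ..., 511
incl = block_scan_model(data, mode="inclusive")   # flattens to 1, 2, ..., 512
```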
- cuda.coop.block.make_store(
- dtype,
- threads_per_block,
- items_per_thread=1,
- algorithm='direct',
Creates a block-wide store primitive.
Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:
(dest: numba.types.Array, src: numba.types.Array) -> None: Each thread stores items_per_thread items from src into dest. src must contain at least items_per_thread items.
Different data movement strategies can be selected via the algorithm parameter:
algorithm="direct" (default): Writes blocked data directly.
algorithm="striped": Writes striped data directly.
algorithm="vectorize": Writes blocked data directly using CUDA's built-in vectorized stores as a coalescing optimization.
algorithm="transpose": Locally transposes blocked data into a striped arrangement before writing to memory.
algorithm="warp_transpose": Locally transposes blocked data into a warp-striped arrangement before writing to memory.
algorithm="warp_transpose_timesliced": Locally transposes blocked data into a warp-striped arrangement before writing to memory. To reduce shared memory requirements, only one warp's worth of shared memory is provisioned and time-sliced across warps.
For more details, read the corresponding CUB C++ documentation: https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockStore.html
- Parameters:
dtype – Data type being stored
threads_per_block – Number of threads in a block. Can be an integer or a tuple of 2 or 3 integers.
items_per_thread – The number of items each thread stores
algorithm – The data movement algorithm to use
Example
The code snippet below illustrates a striped load and store of 128 integer items by 32 threads, with each thread handling 4 integers.
The following snippet shows how to invoke the returned `block_load` and `block_store` primitives:

```python
import numba
import numpy as np
from numba import cuda
from cuda import coop

threads_per_block = 32
items_per_thread = 4

block_load = coop.block.make_load(
    numba.int32, threads_per_block, items_per_thread, "striped"
)
block_store = coop.block.make_store(
    numba.int32, threads_per_block, items_per_thread, "striped"
)

@cuda.jit(link=block_load.files + block_store.files)
def kernel(input, output):
    tmp = cuda.local.array(items_per_thread, numba.int32)
    block_load(input, tmp)
    block_store(output, tmp)
```
- cuda.coop.block.make_sum(
- dtype,
- threads_per_block,
- items_per_thread=1,
- algorithm='warp_reductions',
- methods=None,
Creates a block-wide reduction primitive for thread 0 using addition (+) as the reduction operator.
Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:
(item: dtype) -> dtype: Each thread contributes a single item to the reduction.
(items: numba.types.Array) -> dtype: Each thread contributes an array of items to the reduction. The array must contain at least items_per_thread items; only the first items_per_thread items are included in the reduction.
(item: dtype, num_valid: int) -> dtype: The first num_valid threads contribute a single item to the reduction. Items from all other threads are ignored.
- Parameters:
dtype – Data type being reduced
threads_per_block – Number of threads in a block. Can be an integer or a tuple of 2 or 3 integers.
items_per_thread – The number of items each thread owns
algorithm – Algorithm to use for the reduction (one of “raking”, “raking_commutative_only”, “warp_reductions”)
methods – A dict of methods for user-defined types
Warning
The return value is undefined in threads other than thread 0.
Example
The code snippet below illustrates a sum of 128 integer items partitioned across 128 threads.
The following snippet shows how to invoke the returned `block_sum` primitive:

```python
import numba
import numpy as np
from numba import cuda
from cuda import coop

numba.config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

threads_per_block = 128
block_sum = coop.block.make_sum(numba.int32, threads_per_block)

@cuda.jit(link=block_sum.files)
def kernel(input, output):
    block_output = block_sum(input[cuda.threadIdx.x])
    if cuda.threadIdx.x == 0:
        output[0] = block_output
```
Suppose the set of inputs across the block of threads is `{ 1, 1, 1, 1, ..., 1 }`. The output in thread 0 is `{ 128 }`.