CUDA Cooperative
Warning
Python exposure of cooperative algorithms is in public beta. The API is subject to change without notice.
- cuda.cooperative.experimental.warp.exclusive_sum(dtype, threads_in_warp=32)
Computes an exclusive warp-wide prefix sum using addition (+) as the scan operator. The value 0 is used as the initial value and is assigned to the output in lane0.
Example
The code snippet below illustrates an exclusive prefix sum of 32 integer items. We start by importing the necessary modules and patching the Numba linker:

import numba
import numpy as np
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the exclusive_sum API:

# Specialize exclusive sum for a warp of threads
warp_exclusive_sum = cudax.warp.exclusive_sum(numba.int32)

# Link the exclusive sum to a CUDA kernel
@cuda.jit(link=warp_exclusive_sum.files)
def kernel(data):
    # Collectively compute the warp-wide exclusive prefix sum
    data[cuda.threadIdx.x] = warp_exclusive_sum(data[cuda.threadIdx.x])
Suppose the input data across the warp of threads is { 1, 1, 1, ..., 1 } (one item per thread). The corresponding output in those threads will be { 0, 1, 2, ..., 31 }.
- Parameters
dtype – Data type being scanned
threads_in_warp – The number of threads in a warp
- Returns
A callable object that can be linked to and invoked from a CUDA kernel
- cuda.cooperative.experimental.warp.merge_sort_keys(dtype, items_per_thread, compare_op, threads_in_warp=32, methods=None)
Performs a warp-wide merge sort over a blocked arrangement of keys.
Example
The code snippet below illustrates a sort of 128 integer keys that are partitioned in a blocked arrangement across a warp of 32 threads where each thread owns 4 consecutive keys. We start by importing the necessary modules:

import numba
import numpy as np
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the merge_sort_keys API:

# Define comparison operator
def compare_op(a, b):
    return a > b

# Specialize merge sort for a warp of threads owning 4 integer items each
items_per_thread = 4
warp_merge_sort = cudax.warp.merge_sort_keys(
    numba.int32, items_per_thread, compare_op
)

# Link the merge sort to a CUDA kernel
@cuda.jit(link=warp_merge_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)
    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    warp_merge_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]
Suppose the set of input thread_keys across the warp of threads is { [0, 1, 2, 3], [4, 5, 6, 7], ..., [124, 125, 126, 127] }. The corresponding output thread_keys in those threads will be { [127, 126, 125, 124], [123, 122, 121, 120], ..., [3, 2, 1, 0] }.
- Parameters
dtype – Numba data type of the keys to be sorted
threads_in_warp – The number of threads in a warp
items_per_thread – The number of items each thread owns
compare_op – Comparison function object which returns true if the first argument is ordered before the second one
- Returns
A callable object that can be linked to and invoked from a CUDA kernel
- cuda.cooperative.experimental.warp.reduce(dtype, binary_op, threads_in_warp=32, methods=None)
Computes a warp-wide reduction for lane0 using the specified binary reduction functor. Each thread contributes one input element.
Warning
The return value is undefined in lanes other than lane0.
Example
The code snippet below illustrates a max reduction of 32 integer items that are partitioned across a warp of threads. We start by importing the necessary modules:

import numba
import numpy as np
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the reduce API:

# Define the binary reduction operator (maximum)
def op(a, b):
    return a if a > b else b

# Specialize the reduction for a warp of threads
warp_reduce = cudax.warp.reduce(numba.int32, op)

# Link the reduction to a CUDA kernel
@cuda.jit(link=warp_reduce.files)
def kernel(input, output):
    warp_output = warp_reduce(input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = warp_output
Suppose the set of inputs across the warp of threads is { 0, 1, 2, 3, ..., 31 }. The corresponding output in lane0 will be 31.
- Parameters
dtype – Data type being reduced
threads_in_warp – The number of threads in a warp
binary_op – Binary reduction function
- Returns
A callable object that can be linked to and invoked from a CUDA kernel
- cuda.cooperative.experimental.warp.sum(dtype, threads_in_warp=32)
Computes a warp-wide reduction for lane0 using addition (+) as the reduction operator. Each thread contributes one input element.
Warning
The return value is undefined in lanes other than lane0.
Example
The code snippet below illustrates a reduction of 32 integer items that are partitioned across a warp of threads. We start by importing the necessary modules:

import numba
import numpy as np
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the sum API:

# Specialize the sum reduction for a warp of threads
warp_sum = cudax.warp.sum(numba.int32)

# Link the reduction to a CUDA kernel
@cuda.jit(link=warp_sum.files)
def kernel(input, output):
    warp_output = warp_sum(input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = warp_output
Suppose the set of inputs across the warp of threads is { 1, 1, 1, 1, ..., 1 }. The corresponding output in lane0 will be 32.
- Parameters
dtype – Data type being reduced
threads_in_warp – The number of threads in a warp
- Returns
A callable object that can be linked to and invoked from a CUDA kernel
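For completeness, here is a minimal host-side sketch of how the warp sum kernel above might be launched with Numba. The array names and launch configuration are illustrative assumptions, not part of the cuda.cooperative API:

# Illustrative host-side launch of the warp sum kernel above (one warp of 32 threads)
h_input = np.ones(32, dtype=np.int32)
d_input = cuda.to_device(h_input)
d_output = cuda.device_array(1, dtype=np.int32)

kernel[1, 32](d_input, d_output)

print(d_output.copy_to_host()[0])  # expected: 32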
- cuda.cooperative.experimental.block.exclusive_sum(dtype: Type[numba.types.Number], threads_per_block: int, items_per_thread: int = 1, prefix_op: Optional[Callable] = None, algorithm: Literal['raking', 'raking_memoize', 'warp_scans'] = 'raking') -> Callable
Computes an exclusive block-wide prefix sum.
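No example is given for this API, so the following is a minimal sketch of one plausible usage, assuming the same imports as the other examples in this document and the single-item-per-thread call form shown for the warp-level exclusive_sum; the names are illustrative:

# Specialize exclusive sum for a 1D block of 128 threads (illustrative sketch)
threads_per_block = 128
block_exclusive_sum = cudax.block.exclusive_sum(numba.int32, threads_per_block)

# Link the exclusive sum to a CUDA kernel
@cuda.jit(link=block_exclusive_sum.files)
def kernel(data):
    # Collectively compute the block-wide exclusive prefix sum
    data[cuda.threadIdx.x] = block_exclusive_sum(data[cuda.threadIdx.x])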
- cuda.cooperative.experimental.block.inclusive_sum(dtype: Type[numba.types.Number], threads_per_block: int, items_per_thread: int = 1, prefix_op: Optional[Callable] = None, algorithm: Literal['raking', 'raking_memoize', 'warp_scans'] = 'raking') -> Callable
Computes an inclusive block-wide prefix sum.
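inclusive_sum is specialized and linked the same way as exclusive_sum; only the result differs. A sketch under the same assumptions as the exclusive_sum sketch above:

# Specialize inclusive sum for a 1D block of 128 threads (illustrative sketch)
threads_per_block = 128
block_inclusive_sum = cudax.block.inclusive_sum(numba.int32, threads_per_block)

@cuda.jit(link=block_inclusive_sum.files)
def kernel(data):
    # With every input equal to 1, thread i receives i + 1 here,
    # whereas the exclusive variant above would yield i.
    data[cuda.threadIdx.x] = block_inclusive_sum(data[cuda.threadIdx.x])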
- cuda.cooperative.experimental.block.load(dtype, threads_per_block, items_per_thread=1, algorithm='direct')
Creates an operation that performs a block-wide load.
Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:
- (src: numba.types.Array, dest: numba.types.Array) -> None: Each thread loads items_per_thread items from src into dest. dest must contain at least items_per_thread items.
Different data movement strategies can be selected via the algorithm parameter:
algorithm="direct" (default): A blocked arrangement of data is read directly from memory.
algorithm="striped": A striped arrangement of data is read directly from memory.
algorithm="vectorize": A blocked arrangement of data is read directly from memory using CUDA's built-in vectorized loads as a coalescing optimization.
algorithm="transpose": A striped arrangement of data is read directly from memory and is then locally transposed into a blocked arrangement.
algorithm="warp_transpose": A warp-striped arrangement of data is read directly from memory and is then locally transposed into a blocked arrangement.
algorithm="warp_transpose_timesliced": A warp-striped arrangement of data is read directly from memory and is then locally transposed into a blocked arrangement one warp at a time.
For more details, [read the corresponding CUB C++ documentation](https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockLoad.html).
- Parameters
dtype – Data type being loaded
threads_per_block – The number of threads in a block, either an integer or a tuple of 2 or 3 integers
items_per_thread – The number of items each thread loads
algorithm – The data movement algorithm to use
Example
The code snippet below illustrates a striped load and store of 128 integer items by 32 threads, with each thread handling 4 integers.
import numba
import numpy as np
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

threads_per_block = 32
items_per_thread = 4

block_load = cudax.block.load(
    numba.int32, threads_per_block, items_per_thread, "striped"
)
block_store = cudax.block.store(
    numba.int32, threads_per_block, items_per_thread, "striped"
)

@cuda.jit(link=block_load.files + block_store.files)
def kernel(input, output):
    tmp = cuda.local.array(items_per_thread, numba.int32)
    block_load(input, tmp)
    block_store(output, tmp)
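A host-side launch sketch for the kernel above; the array names and launch configuration are illustrative. Because the same striped arrangement is used for both the load and the store, the output ends up identical to the input:

# Illustrative host-side launch of the striped load/store kernel above
h_input = np.arange(threads_per_block * items_per_thread, dtype=np.int32)
d_input = cuda.to_device(h_input)
d_output = cuda.device_array_like(d_input)

kernel[1, threads_per_block](d_input, d_output)

assert np.array_equal(d_output.copy_to_host(), h_input)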
- cuda.cooperative.experimental.block.merge_sort_keys(dtype, threads_per_block, items_per_thread, compare_op, methods=None)
Performs a block-wide merge sort over a blocked arrangement of keys.
Example
The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys. We start by importing the necessary modules:

import numba
import numpy as np
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the merge_sort_keys API:

# Define comparison operator
def compare_op(a, b):
    return a > b

# Specialize merge sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_merge_sort = cudax.block.merge_sort_keys(
    numba.int32, threads_per_block, items_per_thread, compare_op
)

# Link the merge sort to a CUDA kernel
@cuda.jit(link=block_merge_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)
    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_merge_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]
Suppose the set of input thread_keys across the block of threads is { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }. The corresponding output thread_keys in those threads will be { [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }.
- Parameters
dtype – Numba data type of the keys to be sorted
threads_per_block – The number of threads in a block
items_per_thread – The number of items each thread owns
compare_op – Comparison function object which returns true if the first argument is ordered before the second one
- Returns
A callable object that can be linked to and invoked from a CUDA kernel
- cuda.cooperative.experimental.block.radix_sort_keys(dtype, threads_per_block, items_per_thread)
Performs an ascending block-wide radix sort over a blocked arrangement of keys.
Example
The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys. We start by importing the necessary modules:

import numba
import numpy as np
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the radix_sort_keys API:

# Specialize radix sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_radix_sort = cudax.block.radix_sort_keys(
    numba.int32, threads_per_block, items_per_thread
)

# Link the radix sort to a CUDA kernel
@cuda.jit(link=block_radix_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)
    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_radix_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]
Suppose the set of input thread_keys across the block of threads is { [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }. The corresponding output thread_keys in those threads will be { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }.
- Parameters
dtype – Numba data type of the keys to be sorted
threads_per_block – The number of threads in a block
items_per_thread – The number of items each thread owns
- Returns
A callable object that can be linked to and invoked from a CUDA kernel
- cuda.cooperative.experimental.block.radix_sort_keys_descending(dtype, threads_per_block, items_per_thread)
Performs a descending block-wide radix sort over a blocked arrangement of keys.
Example
The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys. We start by importing the necessary modules:

import numba
import numpy as np
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the radix_sort_keys_descending API:

# Specialize radix sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_radix_sort = cudax.block.radix_sort_keys_descending(
    numba.int32, threads_per_block, items_per_thread
)

# Link the radix sort to a CUDA kernel
@cuda.jit(link=block_radix_sort.files)
def kernel(keys):
    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)
    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_radix_sort(thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]
Suppose the set of input thread_keys across the block of threads is { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }. The corresponding output thread_keys in those threads will be { [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }.
- Parameters
dtype – Numba data type of the keys to be sorted
threads_per_block – The number of threads in a block
items_per_thread – The number of items each thread owns
- Returns
A callable object that can be linked to and invoked from a CUDA kernel
- cuda.cooperative.experimental.block.reduce(dtype, threads_per_block, binary_op, items_per_thread=1, methods=None)
Creates an operation that computes a block-wide reduction for thread0 using the specified binary reduction functor.
Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:
- (item: dtype) -> dtype: Each thread contributes a single item to the reduction.
- (items: numba.types.Array) -> dtype: Each thread contributes an array of items to the reduction. The array must contain at least items_per_thread items; only the first items_per_thread items will be included in the reduction.
- (item: dtype, num_valid: int) -> dtype: The first num_valid threads contribute a single item to the reduction. The items contributed by all other threads are ignored.
- Parameters
dtype – Data type being reduced
threads_per_block – The number of threads in a block
binary_op – Binary reduction function
items_per_thread – The number of items each thread contributes to the reduction
methods – A dict of methods for user-defined types
Warning
The return value is undefined in threads other than thread0.
Example
The code snippet below illustrates a max reduction of 128 integer items that are partitioned across 128 threads.
import numba
import numpy as np
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

def op(a, b):
    return a if a > b else b

threads_per_block = 128
block_reduce = cudax.block.reduce(numba.int32, threads_per_block, op)

@cuda.jit(link=block_reduce.files)
def kernel(input, output):
    block_output = block_reduce(input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = block_output
Suppose the set of inputs across the block of threads is { 0, 1, 2, 3, ..., 127 }. The corresponding output in thread0 will be 127.
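The multi-item overload documented above can be used in a similar way. The following is a sketch only, assuming items_per_thread=4 and a blocked arrangement of the input; the kernel name and layout are illustrative:

# Specialize a reduction where each thread contributes 4 items (illustrative sketch)
items_per_thread = 4
block_reduce4 = cudax.block.reduce(
    numba.int32, threads_per_block, op, items_per_thread=items_per_thread
)

@cuda.jit(link=block_reduce4.files)
def kernel4(input, output):
    # Each thread gathers 4 consecutive items and contributes them all
    # (input is expected to hold threads_per_block * items_per_thread elements)
    thread_items = cuda.local.array(items_per_thread, numba.int32)
    for i in range(items_per_thread):
        thread_items[i] = input[cuda.threadIdx.x * items_per_thread + i]

    block_output = block_reduce4(thread_items)

    if cuda.threadIdx.x == 0:
        output[0] = block_output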
- cuda.cooperative.experimental.block.store(dtype, threads_per_block, items_per_thread=1, algorithm='direct')
Creates an operation that performs a block-wide store.
Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:
- (dest: numba.types.Array, src: numba.types.Array) -> None: Each thread stores items_per_thread items from src into dest. src must contain at least items_per_thread items.
Different data movement strategies can be selected via the algorithm parameter:
algorithm="direct" (default): A blocked arrangement of data is written directly to memory.
algorithm="striped": A striped arrangement of data is written directly to memory.
algorithm="vectorize": A blocked arrangement of data is written directly to memory using CUDA's built-in vectorized stores as a coalescing optimization.
algorithm="transpose": A blocked arrangement is locally transposed into a striped arrangement which is then written to memory.
algorithm="warp_transpose": A blocked arrangement is locally transposed into a warp-striped arrangement which is then written to memory.
algorithm="warp_transpose_timesliced": A blocked arrangement is locally transposed into a warp-striped arrangement which is then written to memory. To reduce the shared memory requirement, only one warp's worth of shared memory is provisioned and is subsequently time-sliced among warps.
For more details, [read the corresponding CUB C++ documentation](https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockStore.html).
- Parameters
dtype – Data type being stored
threads_per_block – The number of threads in a block, either an integer or a tuple of 2 or 3 integers
items_per_thread – The number of items each thread stores
algorithm – The data movement algorithm to use
Example
The code snippet below illustrates a striped load and store of 128 integer items by 32 threads, with each thread handling 4 integers.
import numba
import numpy as np
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

threads_per_block = 32
items_per_thread = 4

block_load = cudax.block.load(
    numba.int32, threads_per_block, items_per_thread, "striped"
)
block_store = cudax.block.store(
    numba.int32, threads_per_block, items_per_thread, "striped"
)

@cuda.jit(link=block_load.files + block_store.files)
def kernel(input, output):
    tmp = cuda.local.array(items_per_thread, numba.int32)
    block_load(input, tmp)
    block_store(output, tmp)
- cuda.cooperative.experimental.block.sum(dtype, threads_per_block, items_per_thread=1, methods=None)
Creates an operation that computes a block-wide reduction for thread0 using addition (+) as the reduction operator.
Returns a callable object that can be linked to and invoked from device code. It can be invoked with the following signatures:
- (item: dtype) -> dtype: Each thread contributes a single item to the reduction.
- (items: numba.types.Array) -> dtype: Each thread contributes an array of items to the reduction. The array must contain at least items_per_thread items; only the first items_per_thread items will be included in the reduction.
- (item: dtype, num_valid: int) -> dtype: The first num_valid threads contribute a single item to the reduction. The items contributed by all other threads are ignored.
- Parameters
dtype – Data type being reduced
threads_per_block – The number of threads in a block
items_per_thread – The number of items each thread owns
methods – A dict of methods for user-defined types
Warning
The return value is undefined in threads other than thread0.
Example
The code snippet below illustrates a sum of 128 integer items that are partitioned across 128 threads.
import numba
import numpy as np
from numba import cuda
from pynvjitlink import patch

import cuda.cooperative.experimental as cudax

patch.patch_numba_linker(lto=True)

threads_per_block = 128
block_sum = cudax.block.sum(numba.int32, threads_per_block)

@cuda.jit(link=block_sum.files)
def kernel(input, output):
    block_output = block_sum(input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = block_output
Suppose the set of inputs across the block of threads is { 1, 1, 1, 1, ..., 1 }. The corresponding output in thread0 will be 128.
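The (item, num_valid) overload documented above restricts the reduction to the first num_valid threads. A sketch of how the block_sum specialization above might be invoked with that overload; the kernel name and the idea of passing num_valid as a kernel argument are illustrative assumptions:

@cuda.jit(link=block_sum.files)
def partial_kernel(input, output, num_valid):
    # Only the first num_valid threads contribute to the sum;
    # items from the remaining threads are ignored.
    block_output = block_sum(input[cuda.threadIdx.x], num_valid)

    if cuda.threadIdx.x == 0:
        output[0] = block_output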