CUDA Cooperative
Warning
Python exposure of cooperative algorithms is in public beta. The API is subject to change without notice.
- cuda.cooperative.experimental.warp.exclusive_sum(dtype, threads_in_warp=32)
Computes an exclusive warp-wide prefix sum using addition (+) as the scan operator. The value of 0 is applied as the initial value, and is assigned to the output in lane0.
Example
The code snippet below illustrates an exclusive prefix sum of 32 integer items:
import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch

patch.patch_numba_linker(lto=True)
Below is the code snippet that demonstrates the usage of the exclusive_sum API:

# Specialize exclusive sum for a warp of threads
warp_exclusive_sum = cudax.warp.exclusive_sum(numba.int32)
temp_storage_bytes = warp_exclusive_sum.temp_storage_bytes

# Link the exclusive sum to a CUDA kernel
@cuda.jit(link=warp_exclusive_sum.files)
def kernel(data):
    # Allocate shared memory for exclusive sum
    temp_storage = cuda.shared.array(shape=temp_storage_bytes, dtype=numba.uint8)

    # Collectively compute the warp-wide exclusive prefix sum
    data[cuda.threadIdx.x] = warp_exclusive_sum(temp_storage, data[cuda.threadIdx.x])
Suppose the set of input data across the warp of threads is { 1, 1, 1, ..., 1 }, with each thread contributing one item. The corresponding output in those threads will be { 0, 1, 2, ..., 31 }.
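As a usage sketch (hypothetical host-side driver code, not part of the documented example; NumPy arrays and the launch configuration are assumptions), the kernel above could be launched on a single warp as follows:

import numpy as np

# One block of 32 threads, i.e. a single warp, each contributing one item
d_data = cuda.to_device(np.ones(32, dtype=np.int32))
kernel[1, 32](d_data)

# Expected output: [0, 1, 2, ..., 31]
print(d_data.copy_to_host())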
- Parameters
dtype – Data type being scanned
threads_in_warp – The number of threads in a warp
- Returns
A callable object that can be linked to and invoked from a CUDA kernel
- cuda.cooperative.experimental.warp.merge_sort_keys(dtype, items_per_thread, compare_op, threads_in_warp=32, methods=None)
Performs a warp-wide merge sort over a blocked arrangement of keys.
Example
The code snippet below illustrates a sort of 128 integer keys that are partitioned in a blocked arrangement across a warp of 32 threads where each thread owns 4 consecutive keys. We start by importing necessary modules:
import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch

patch.patch_numba_linker(lto=True)
Below is the code snippet that demonstrates the usage of the merge_sort_keys API:

# Define comparison operator
def compare_op(a, b):
    return a > b

# Specialize merge sort for a warp of threads owning 4 integer items each
items_per_thread = 4
warp_merge_sort = cudax.warp.merge_sort_keys(numba.int32, items_per_thread, compare_op)
temp_storage_bytes = warp_merge_sort.temp_storage_bytes

# Link the merge sort to a CUDA kernel
@cuda.jit(link=warp_merge_sort.files)
def kernel(keys):
    # Allocate shared memory for merge sort
    temp_storage = cuda.shared.array(temp_storage_bytes, numba.uint8)

    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)
    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    warp_merge_sort(temp_storage, thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]
Suppose the set of input thread_keys across the warp of threads is { [0, 1, 2, 3], [4, 5, 6, 7], ..., [124, 125, 126, 127] }. The corresponding output thread_keys in those threads will be { [127, 126, 125, 124], [123, 122, 121, 120], ..., [3, 2, 1, 0] }.
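For reference, a minimal host-side launch sketch (assumed driver code, not part of the documented example) could look like this:

import numpy as np

# 128 keys for one warp of 32 threads owning 4 keys each
d_keys = cuda.to_device(np.arange(128, dtype=np.int32))
kernel[1, 32](d_keys)

# With the greater-than compare_op above, the keys come back in descending order
print(d_keys.copy_to_host())  # [127, 126, ..., 0]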
- Parameters
dtype – Numba data type of the keys to be sorted
threads_in_warp – The number of threads in a warp
items_per_thread – The number of items each thread owns
compare_op – Comparison function object which returns true if the first argument is ordered before the second one
- Returns
A callable object that can be linked to and invoked from a CUDA kernel
- cuda.cooperative.experimental.warp.reduce(dtype, binary_op, threads_in_warp=32, methods=None)
Computes a warp-wide reduction for lane0 using the specified binary reduction functor. Each thread contributes one input element.
Warning
The return value is undefined in lanes other than lane0.
Example
The code snippet below illustrates a max reduction of 32 integer items that are partitioned across a warp of threads.
import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch

patch.patch_numba_linker(lto=True)
Below is the code snippet that demonstrates the usage of the reduce API:

# Define the binary reduction operator (max)
def op(a, b):
    return a if a > b else b

# Specialize the reduction for a warp of threads
warp_reduce = cudax.warp.reduce(numba.int32, op)
temp_storage_bytes = warp_reduce.temp_storage_bytes

@cuda.jit(link=warp_reduce.files)
def kernel(input, output):
    temp_storage = cuda.shared.array(shape=temp_storage_bytes, dtype=numba.uint8)
    # Collectively compute the warp-wide max; only lane0 holds a defined result
    warp_output = warp_reduce(temp_storage, input[cuda.threadIdx.x])
    if cuda.threadIdx.x == 0:
        output[0] = warp_output
Suppose the set of inputs across the warp of threads is { 0, 1, 2, 3, ..., 31 }. The corresponding output in lane0 will be { 31 }.
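A host-side sketch of how this kernel might be driven (the array setup and launch configuration below are assumptions, not part of the documented example):

import numpy as np

d_input = cuda.to_device(np.arange(32, dtype=np.int32))
d_output = cuda.device_array(1, dtype=np.int32)
kernel[1, 32](d_input, d_output)

# Expected output: 31, the maximum of 0..31
print(d_output.copy_to_host()[0])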
- Parameters
dtype – Data type being reduced
threads_in_warp – The number of threads in a warp
binary_op – Binary reduction function
- Returns
A callable object that can be linked to and invoked from a CUDA kernel
- cuda.cooperative.experimental.warp.sum(dtype, threads_in_warp=32)
Computes a warp-wide reduction for lane0 using addition (+) as the reduction operator. Each thread contributes one input element.
Warning
The return value is undefined in lanes other than lane0.
Example
The code snippet below illustrates a reduction of 32 integer items that are partitioned across a warp of threads.
import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch

patch.patch_numba_linker(lto=True)
Below is the code snippet that demonstrates the usage of the sum API:

# Specialize the sum for a warp of threads
warp_sum = cudax.warp.sum(numba.int32)
temp_storage_bytes = warp_sum.temp_storage_bytes

@cuda.jit(link=warp_sum.files)
def kernel(input, output):
    temp_storage = cuda.shared.array(shape=temp_storage_bytes, dtype=numba.uint8)
    # Collectively compute the warp-wide sum; only lane0 holds a defined result
    warp_output = warp_sum(temp_storage, input[cuda.threadIdx.x])
    if cuda.threadIdx.x == 0:
        output[0] = warp_output
Suppose the set of inputs across the warp of threads is { 1, 1, 1, 1, ..., 1 }. The corresponding output in lane0 will be { 32 }.
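A minimal host-side sketch (assumed driver code; the NumPy setup and launch configuration are not part of the documented example) that reproduces this input/output:

import numpy as np

d_input = cuda.to_device(np.ones(32, dtype=np.int32))
d_output = cuda.device_array(1, dtype=np.int32)
kernel[1, 32](d_input, d_output)

# Expected output: 32
print(d_output.copy_to_host()[0])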
- Parameters
dtype – Data type being reduced
threads_in_warp – The number of threads in a warp
- Returns
A callable object that can be linked to and invoked from a CUDA kernel
- cuda.cooperative.experimental.block.exclusive_sum(dtype, threads_in_block, items_per_thread, prefix_op=None)
Computes an exclusive block-wide prefix sum using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. The value of 0 is applied as the initial value, and is assigned to the first output element in thread0.
Example
The code snippet below illustrates an exclusive prefix sum of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch

patch.patch_numba_linker(lto=True)
Below is the code snippet that demonstrates the usage of the exclusive_sum API:

items_per_thread = 4
threads_per_block = 128

# Specialize exclusive sum for a 1D block of 128 threads owning 4 integer items each
block_exclusive_sum = cudax.block.exclusive_sum(numba.int32, threads_per_block, items_per_thread)
temp_storage_bytes = block_exclusive_sum.temp_storage_bytes

# Link the exclusive sum to a CUDA kernel
@cuda.jit(link=block_exclusive_sum.files)
def kernel(data):
    # Allocate shared memory for exclusive sum
    temp_storage = cuda.shared.array(shape=temp_storage_bytes, dtype=numba.uint8)

    # Obtain a segment of consecutive items that are blocked across threads
    thread_data = cuda.local.array(shape=items_per_thread, dtype=numba.int32)
    for i in range(items_per_thread):
        thread_data[i] = data[cuda.threadIdx.x * items_per_thread + i]

    # Collectively compute the block-wide exclusive prefix sum
    block_exclusive_sum(temp_storage, thread_data, thread_data)

    # Copy the scanned items back to the output
    for i in range(items_per_thread):
        data[cuda.threadIdx.x * items_per_thread + i] = thread_data[i]
Suppose the set of input thread_data across the block of threads is { [1, 1, 1, 1], [1, 1, 1, 1], ..., [1, 1, 1, 1] }. The corresponding output thread_data in those threads will be { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }.
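As a usage sketch (host-side code assumed for illustration, not part of the documented example), the kernel above could be launched on one block of 128 threads:

import numpy as np

# 512 items, 128 threads owning 4 consecutive items each
d_data = cuda.to_device(np.ones(512, dtype=np.int32))
kernel[1, 128](d_data)

# Expected output: [0, 1, 2, ..., 511]
print(d_data.copy_to_host())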
- Parameters
dtype – Data type being scanned
threads_in_block – The number of threads in a block
items_per_thread – The number of items each thread owns
- Returns
A callable object that can be linked to and invoked from a CUDA kernel
- cuda.cooperative.experimental.block.merge_sort_keys(dtype, threads_in_block, items_per_thread, compare_op, methods=None)
Performs a block-wide merge sort over a blocked arrangement of keys.
Example
The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys. We start by importing necessary modules:
import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch

patch.patch_numba_linker(lto=True)
Below is the code snippet that demonstrates the usage of the merge_sort_keys API:

# Define comparison operator
def compare_op(a, b):
    return a > b

# Specialize merge sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_merge_sort = cudax.block.merge_sort_keys(numba.int32, threads_per_block, items_per_thread, compare_op)
temp_storage_bytes = block_merge_sort.temp_storage_bytes

# Link the merge sort to a CUDA kernel
@cuda.jit(link=block_merge_sort.files)
def kernel(keys):
    # Allocate shared memory for merge sort
    temp_storage = cuda.shared.array(temp_storage_bytes, numba.uint8)

    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)
    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_merge_sort(temp_storage, thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]
Suppose the set of input thread_keys across the block of threads is { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }. The corresponding output thread_keys in those threads will be { [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }.
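A minimal host-side launch sketch (assumed, not part of the documented example):

import numpy as np

# 512 keys for one block of 128 threads owning 4 keys each
d_keys = cuda.to_device(np.arange(512, dtype=np.int32))
kernel[1, 128](d_keys)

# With the greater-than compare_op above, the keys come back in descending order
print(d_keys.copy_to_host())  # [511, 510, ..., 0]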
- Parameters
dtype – Numba data type of the keys to be sorted
threads_in_block – The number of threads in a block
items_per_thread – The number of items each thread owns
compare_op – Comparison function object which returns true if the first argument is ordered before the second one
- Returns
A callable object that can be linked to and invoked from a CUDA kernel
- cuda.cooperative.experimental.block.radix_sort_keys(dtype, threads_in_block, items_per_thread)
Performs an ascending block-wide radix sort over a blocked arrangement of keys.
Example
The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys. We start by importing necessary modules:
import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch

patch.patch_numba_linker(lto=True)
Below is the code snippet that demonstrates the usage of the radix_sort_keys API:

# Specialize radix sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_radix_sort = cudax.block.radix_sort_keys(numba.int32, threads_per_block, items_per_thread)
temp_storage_bytes = block_radix_sort.temp_storage_bytes

# Link the radix sort to a CUDA kernel
@cuda.jit(link=block_radix_sort.files)
def kernel(keys):
    # Allocate shared memory for radix sort
    temp_storage = cuda.shared.array(temp_storage_bytes, numba.uint8)

    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)
    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_radix_sort(temp_storage, thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]
Suppose the set of input thread_keys across the block of threads is { [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }. The corresponding output thread_keys in those threads will be { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }.
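A host-side sketch matching this input/output (driver code assumed for illustration):

import numpy as np

# Keys initially in descending order
d_keys = cuda.to_device(np.arange(511, -1, -1, dtype=np.int32))
kernel[1, 128](d_keys)

# Expected output: keys sorted ascending, [0, 1, ..., 511]
print(d_keys.copy_to_host())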
- Parameters
dtype – Numba data type of the keys to be sorted
threads_in_block – The number of threads in a block
items_per_thread – The number of items each thread owns
- Returns
A callable object that can be linked to and invoked from a CUDA kernel
- cuda.cooperative.experimental.block.radix_sort_keys_descending(dtype, threads_in_block, items_per_thread)
Performs a descending block-wide radix sort over a blocked arrangement of keys.
Example
The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys. We start by importing necessary modules:
import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch

patch.patch_numba_linker(lto=True)
Below is the code snippet that demonstrates the usage of the radix_sort_keys_descending API:

# Specialize radix sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_radix_sort = cudax.block.radix_sort_keys_descending(numba.int32, threads_per_block, items_per_thread)
temp_storage_bytes = block_radix_sort.temp_storage_bytes

# Link the radix sort to a CUDA kernel
@cuda.jit(link=block_radix_sort.files)
def kernel(keys):
    # Allocate shared memory for radix sort
    temp_storage = cuda.shared.array(temp_storage_bytes, numba.uint8)

    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)
    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_radix_sort(temp_storage, thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]
Suppose the set of input thread_keys across the block of threads is { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }. The corresponding output thread_keys in those threads will be { [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }.
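The corresponding host-side launch might look like the following sketch (assumed driver code, not part of the documented example):

import numpy as np

# Keys initially in ascending order
d_keys = cuda.to_device(np.arange(512, dtype=np.int32))
kernel[1, 128](d_keys)

# Expected output: keys sorted descending, [511, 510, ..., 0]
print(d_keys.copy_to_host())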
- Parameters
dtype – Numba data type of the keys to be sorted
threads_in_block – The number of threads in a block
items_per_thread – The number of items each thread owns
- Returns
A callable object that can be linked to and invoked from a CUDA kernel
- cuda.cooperative.experimental.block.reduce(dtype, threads_in_block, binary_op, methods=None)
Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes one input element.
Warning
The return value is undefined in threads other than thread0.
Example
The code snippet below illustrates a max reduction of 128 integer items that are partitioned across 128 threads.
import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch

patch.patch_numba_linker(lto=True)
Below is the code snippet that demonstrates the usage of the reduce API:

# Define the binary reduction operator (max)
def op(a, b):
    return a if a > b else b

# Specialize the reduction for a 1D block of 128 threads
threads_in_block = 128
block_reduce = cudax.block.reduce(numba.int32, threads_in_block, op)
temp_storage_bytes = block_reduce.temp_storage_bytes

@cuda.jit(link=block_reduce.files)
def kernel(input, output):
    temp_storage = cuda.shared.array(shape=temp_storage_bytes, dtype=numba.uint8)
    # Collectively compute the block-wide max; only thread0 holds a defined result
    block_output = block_reduce(temp_storage, input[cuda.threadIdx.x])
    if cuda.threadIdx.x == 0:
        output[0] = block_output
Suppose the set of inputs across the block of threads is { 0, 1, 2, 3, ..., 127 }. The corresponding output in thread0 will be { 127 }.
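A host-side sketch of driving this kernel (the array setup and launch configuration are assumptions, not part of the documented example):

import numpy as np

d_input = cuda.to_device(np.arange(128, dtype=np.int32))
d_output = cuda.device_array(1, dtype=np.int32)
kernel[1, 128](d_input, d_output)

# Expected output: 127, the maximum of 0..127
print(d_output.copy_to_host()[0])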
- Parameters
dtype – Data type being reduced
threads_in_block – The number of threads in a block
binary_op – Binary reduction function
- Returns
A callable object that can be linked to and invoked from a CUDA kernel
- cuda.cooperative.experimental.block.sum(dtype, threads_in_block)
Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes one input element.
Warning
The return value is undefined in threads other than thread0.
Example
The code snippet below illustrates a reduction of 128 integer items that are partitioned across 128 threads.
import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch

patch.patch_numba_linker(lto=True)
Below is the code snippet that demonstrates the usage of the sum API:

# Specialize the sum for a 1D block of 128 threads
threads_in_block = 128
block_sum = cudax.block.sum(numba.int32, threads_in_block)
temp_storage_bytes = block_sum.temp_storage_bytes

@cuda.jit(link=block_sum.files)
def kernel(input, output):
    temp_storage = cuda.shared.array(shape=temp_storage_bytes, dtype=numba.uint8)
    # Collectively compute the block-wide sum; only thread0 holds a defined result
    block_output = block_sum(temp_storage, input[cuda.threadIdx.x])
    if cuda.threadIdx.x == 0:
        output[0] = block_output
Suppose the set of inputs across the block of threads is { 1, 1, 1, 1, ..., 1 }. The corresponding output in thread0 will be { 128 }.
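A minimal host-side sketch (assumed driver code, not part of the documented example) that reproduces this input/output:

import numpy as np

d_input = cuda.to_device(np.ones(128, dtype=np.int32))
d_output = cuda.device_array(1, dtype=np.int32)
kernel[1, 128](d_input, d_output)

# Expected output: 128
print(d_output.copy_to_host()[0])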
- Parameters
dtype – Data type being reduced
threads_in_block – The number of threads in a block
- Returns
A callable object that can be linked to and invoked from a CUDA kernel