CUDA Cooperative

Warning

Python exposure of cooperative algorithms is in public beta. The API is subject to change without notice.

cuda.cooperative.experimental.warp.exclusive_sum(dtype, threads_in_warp=32)

Computes an exclusive warp-wide prefix sum using addition (+) as the scan operator. The value of 0 is applied as the initial value, and is assigned to the output in lane0.

Example

The code snippet below illustrates an exclusive prefix sum of 32 integer items:

import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch
patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the exclusive_sum API:

# Specialize exclusive sum for a warp of threads
warp_exclusive_sum = cudax.warp.exclusive_sum(numba.int32)
temp_storage_bytes = warp_exclusive_sum.temp_storage_bytes

# Link the exclusive sum to a CUDA kernel
@cuda.jit(link=warp_exclusive_sum.files)
def kernel(data):
    # Allocate shared memory for exclusive sum
    temp_storage = cuda.shared.array(shape=temp_storage_bytes, dtype=numba.uint8)

    # Collectively compute the warp-wide exclusive prefix sum
    data[cuda.threadIdx.x] = warp_exclusive_sum(temp_storage, data[cuda.threadIdx.x])

Suppose the set of input data across the warp of threads is { 1, 1, 1, ..., 1 }. The corresponding output in those threads will be { 0, 1, 2, ..., 31 }.
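
The kernel above can be launched from the host as sketched below. The single-warp launch configuration and the NumPy input array are illustrative assumptions and not part of the exclusive_sum API:

import numpy as np

# Launch one block of 32 threads so the kernel body operates on a single warp
h_data = np.ones(32, dtype=np.int32)
d_data = cuda.to_device(h_data)
kernel[1, 32](d_data)

# Exclusive prefix sum of 32 ones -> [0, 1, 2, ..., 31]
assert (d_data.copy_to_host() == np.arange(32)).all()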

Parameters
  • dtype – Data type being scanned

  • threads_in_warp – The number of threads in a warp

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.warp.merge_sort_keys(dtype, items_per_thread, compare_op, threads_in_warp=32, methods=None)

Performs a warp-wide merge sort over a blocked arrangement of keys.

Example

The code snippet below illustrates a sort of 128 integer keys that are partitioned in a blocked arrangement across a warp of 32 threads where each thread owns 4 consecutive keys. We start by importing necessary modules:

import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch
patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the merge_sort_keys API:

# Define comparison operator (a > b orders larger keys first, i.e. a descending sort)
def compare_op(a, b):
    return a > b

# Specialize merge sort for a warp of threads owning 4 integer items each
items_per_thread = 4
warp_merge_sort = cudax.warp.merge_sort_keys(numba.int32, items_per_thread, compare_op)
temp_storage_bytes = warp_merge_sort.temp_storage_bytes

# Link the merge sort to a CUDA kernel
@cuda.jit(link=warp_merge_sort.files)
def kernel(keys):
    # Allocate shared memory for merge sort
    temp_storage = cuda.shared.array(temp_storage_bytes, numba.uint8)

    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)

    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    warp_merge_sort(temp_storage, thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]

Suppose the set of input thread_keys across the warp of threads is { [0, 1, 2, 3], [4, 5, 6, 7], ..., [124, 125, 126, 127] }. The corresponding output thread_keys in those threads will be { [127, 126, 125, 124], [123, 122, 121, 120], ..., [3, 2, 1, 0] }.
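
A possible host-side launch of this kernel is sketched below; the single-warp launch configuration and the NumPy-based check are illustrative assumptions:

import numpy as np

# 128 keys, sorted by one warp of 32 threads (4 keys per thread)
h_keys = np.arange(128, dtype=np.int32)
d_keys = cuda.to_device(h_keys)
kernel[1, 32](d_keys)

# With compare_op(a, b) = a > b the keys come back in descending order
assert (d_keys.copy_to_host() == h_keys[::-1]).all()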

Parameters
  • dtype – Numba data type of the keys to be sorted

  • threads_in_warp – The number of threads in a warp

  • items_per_thread – The number of items each thread owns

  • compare_op – Comparison function object which returns true if the first argument is ordered before the second one

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.warp.reduce(dtype, binary_op, threads_in_warp=32, methods=None)

Computes a warp-wide reduction for lane0 using the specified binary reduction functor. Each thread contributes one input element.

Warning

The return value is undefined in lanes other than lane0.

Example

The code snippet below illustrates a max reduction of 32 integer items that are partitioned across a warp of threads.

import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch
patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the reduce API:

# Define the reduction operation (maximum)
def op(a, b):
    return a if a > b else b

# Specialize warp-wide reduce using the custom operator
warp_reduce = cudax.warp.reduce(numba.int32, op)
temp_storage_bytes = warp_reduce.temp_storage_bytes

@cuda.jit(link=warp_reduce.files)
def kernel(input, output):
    temp_storage = cuda.shared.array(shape=temp_storage_bytes, dtype=numba.uint8)
    warp_output = warp_reduce(temp_storage, input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = warp_output

Suppose the set of inputs across the warp of threads is { 0, 1, 2, 3, ..., 31 }. The corresponding output in lane0 will be 31.
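
A host-side sketch of launching this kernel and checking the result follows; the launch configuration and the NumPy input are illustrative assumptions:

import numpy as np

# One warp of 32 threads, one input element per thread
h_input = np.arange(32, dtype=np.int32)
d_input = cuda.to_device(h_input)
d_output = cuda.device_array(1, dtype=np.int32)
kernel[1, 32](d_input, d_output)

# lane0 wrote the warp-wide maximum, 31, to output[0]
assert d_output.copy_to_host()[0] == 31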

Parameters
  • dtype – Data type being reduced

  • threads_in_warp – The number of threads in a warp

  • binary_op – Binary reduction function

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.warp.sum(dtype, threads_in_warp=32)

Computes a warp-wide reduction for lane0 using addition (+) as the reduction operator. Each thread contributes one input element.

Warning

The return value is undefined in lanes other than lane0.

Example

The code snippet below illustrates a reduction of 32 integer items that are partitioned across a warp of threads.

import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch
patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the sum API:

warp_sum = cudax.warp.sum(numba.int32)
temp_storage_bytes = warp_sum.temp_storage_bytes

@cuda.jit(link=warp_sum.files)
def kernel(input, output):
    temp_storage = cuda.shared.array(shape=temp_storage_bytes, dtype=numba.uint8)
    warp_output = warp_sum(temp_storage, input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = warp_output

Suppose the set of inputs across the warp of threads is { 1, 1, 1, 1, ..., 1 }. The corresponding output in lane0 will be 32.

Parameters
  • dtype – Data type being reduced

  • threads_in_warp – The number of threads in a warp

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.block.exclusive_sum(dtype, threads_in_block, items_per_thread, prefix_op=None)

Computes an exclusive block-wide prefix sum using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. The value of 0 is applied as the initial value and is assigned to the first output element in thread0.

Example

The code snippet below illustrates an exclusive prefix sum of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.

import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch
patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the exclusive_sum API:

items_per_thread = 4
threads_per_block = 128

# Specialize exclusive sum for a 1D block of 128 threads owning 4 integer items each
block_exclusive_sum = cudax.block.exclusive_sum(numba.int32, threads_per_block, items_per_thread)
temp_storage_bytes = block_exclusive_sum.temp_storage_bytes

# Link the exclusive sum to a CUDA kernel
@cuda.jit(link=block_exclusive_sum.files)
def kernel(data):
    # Allocate shared memory for exclusive sum
    temp_storage = cuda.shared.array(shape=temp_storage_bytes, dtype=numba.uint8)

    # Obtain a segment of consecutive items that are blocked across threads
    thread_data = cuda.local.array(shape=items_per_thread, dtype=numba.int32)
    for i in range(items_per_thread):
        thread_data[i] = data[cuda.threadIdx.x * items_per_thread + i]

    # Collectively compute the block-wide exclusive prefix sum
    block_exclusive_sum(temp_storage, thread_data, thread_data)

    # Copy the scanned items back to the output
    for i in range(items_per_thread):
        data[cuda.threadIdx.x * items_per_thread + i] = thread_data[i]

Suppose the set of input thread_data across the block of threads is { [1, 1, 1, 1], [1, 1, 1, 1], ..., [1, 1, 1, 1] }. The corresponding output thread_data in those threads will be { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }.
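
A host-side sketch of launching this kernel is shown below; the single-block launch configuration and the NumPy-based verification are illustrative assumptions:

import numpy as np

# 512 items, scanned by one block of 128 threads (4 items per thread)
h_data = np.ones(512, dtype=np.int32)
d_data = cuda.to_device(h_data)
kernel[1, threads_per_block](d_data)

# Exclusive prefix sum of 512 ones -> [0, 1, 2, ..., 511]
assert (d_data.copy_to_host() == np.arange(512)).all()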

Parameters
  • dtype – Data type being scanned

  • threads_in_block – The number of threads in a block

  • items_per_thread – The number of items each thread owns

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.block.merge_sort_keys(dtype, threads_in_block, items_per_thread, compare_op, methods=None)

Performs a block-wide merge sort over a blocked arrangement of keys.

Example

The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys. We start by importing necessary modules:

import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch
patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the merge_sort_keys API:

# Define comparison operator (a > b orders larger keys first, i.e. a descending sort)
def compare_op(a, b):
    return a > b

# Specialize merge sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_merge_sort = cudax.block.merge_sort_keys(numba.int32, threads_per_block, items_per_thread, compare_op)
temp_storage_bytes = block_merge_sort.temp_storage_bytes

# Link the merge sort to a CUDA kernel
@cuda.jit(link=block_merge_sort.files)
def kernel(keys):
    # Allocate shared memory for merge sort
    temp_storage = cuda.shared.array(temp_storage_bytes, numba.uint8)

    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)

    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_merge_sort(temp_storage, thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]

Suppose the set of input thread_keys across the block of threads is { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }. The corresponding output thread_keys in those threads will be { [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }.

Parameters
  • dtype – Numba data type of the keys to be sorted

  • threads_in_block – The number of threads in a block

  • items_per_thread – The number of items each thread owns

  • compare_op – Comparison function object which returns true if the first argument is ordered before the second one

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.block.radix_sort_keys(dtype, threads_in_block, items_per_thread)

Performs an ascending block-wide radix sort over a blocked arrangement of keys.

Example

The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys. We start by importing necessary modules:

import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch
patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the radix_sort_keys API:

# Specialize radix sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_radix_sort = cudax.block.radix_sort_keys(numba.int32, threads_per_block, items_per_thread)
temp_storage_bytes = block_radix_sort.temp_storage_bytes

# Link the radix sort to a CUDA kernel
@cuda.jit(link=block_radix_sort.files)
def kernel(keys):
    # Allocate shared memory for radix sort
    temp_storage = cuda.shared.array(temp_storage_bytes, numba.uint8)

    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)

    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_radix_sort(temp_storage, thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]

Suppose the set of input thread_keys across the block of threads is { [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }. The corresponding output thread_keys in those threads will be { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }.
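
A host-side launch of this kernel might look like the sketch below; the launch configuration and the NumPy-based check are illustrative assumptions:

import numpy as np

# 512 keys in reverse order, sorted by one block of 128 threads (4 keys per thread)
h_keys = np.arange(511, -1, -1, dtype=np.int32)
d_keys = cuda.to_device(h_keys)
kernel[1, threads_per_block](d_keys)

# The keys come back in ascending order: [0, 1, 2, ..., 511]
assert (d_keys.copy_to_host() == np.arange(512)).all()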

Parameters
  • dtype – Numba data type of the keys to be sorted

  • threads_in_block – The number of threads in a block

  • items_per_thread – The number of items each thread owns

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.block.radix_sort_keys_descending(dtype, threads_in_block, items_per_thread)

Performs a descending block-wide radix sort over a blocked arrangement of keys.

Example

The code snippet below illustrates a sort of 512 integer keys that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive keys. We start by importing necessary modules:

import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch
patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the radix_sort_keys_descending API:

# Specialize radix sort for a 1D block of 128 threads owning 4 integer items each
items_per_thread = 4
threads_per_block = 128
block_radix_sort = cudax.block.radix_sort_keys_descending(numba.int32, threads_per_block, items_per_thread)
temp_storage_bytes = block_radix_sort.temp_storage_bytes

# Link the radix sort to a CUDA kernel
@cuda.jit(link=block_radix_sort.files)
def kernel(keys):
    # Allocate shared memory for radix sort
    temp_storage = cuda.shared.array(temp_storage_bytes, numba.uint8)

    # Obtain a segment of consecutive items that are blocked across threads
    thread_keys = cuda.local.array(shape=items_per_thread, dtype=numba.int32)

    for i in range(items_per_thread):
        thread_keys[i] = keys[cuda.threadIdx.x * items_per_thread + i]

    # Collectively sort the keys
    block_radix_sort(temp_storage, thread_keys)

    # Copy the sorted keys back to the output
    for i in range(items_per_thread):
        keys[cuda.threadIdx.x * items_per_thread + i] = thread_keys[i]

Suppose the set of input thread_keys across the block of threads is { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }. The corresponding output thread_keys in those threads will be { [511, 510, 509, 508], [507, 506, 505, 504], ..., [3, 2, 1, 0] }.

Parameters
  • dtype – Numba data type of the keys to be sorted

  • threads_in_block – The number of threads in a block

  • items_per_thread – The number of items each thread owns

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.block.reduce(dtype, threads_in_block, binary_op, methods=None)

Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes one input element.

Warning

The return value is undefined in threads other than thread0.

Example

The code snippet below illustrates a max reduction of 128 integer items that are partitioned across 128 threads.

import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch
patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the reduce API:

# Define the reduction operation (maximum)
def op(a, b):
    return a if a > b else b

# Specialize block-wide reduce for a 1D block of 128 threads
threads_in_block = 128
block_reduce = cudax.block.reduce(numba.int32, threads_in_block, op)
temp_storage_bytes = block_reduce.temp_storage_bytes

@cuda.jit(link=block_reduce.files)
def kernel(input, output):
    temp_storage = cuda.shared.array(shape=temp_storage_bytes, dtype=numba.uint8)
    block_output = block_reduce(temp_storage, input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = block_output

Suppose the set of inputs across the block of threads is { 0, 1, 2, 3, ..., 127 }. The corresponding output in thread0 will be 127.

Parameters
  • dtype – Data type being reduced

  • threads_in_block – The number of threads in a block

  • binary_op – Binary reduction function

Returns

A callable object that can be linked to and invoked from a CUDA kernel

cuda.cooperative.experimental.block.sum(dtype, threads_in_block)

Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes one input element.

Warning

The return value is undefined in threads other than thread0.

Example

The code snippet below illustrates a reduction of 128 integer items that are partitioned across 128 threads.

import numba
from numba import cuda

import cuda.cooperative.experimental as cudax
from pynvjitlink import patch
patch.patch_numba_linker(lto=True)

Below is the code snippet that demonstrates the usage of the sum API:

threads_in_block = 128
block_sum = cudax.block.sum(numba.int32, threads_in_block)
temp_storage_bytes = block_sum.temp_storage_bytes

@cuda.jit(link=block_sum.files)
def kernel(input, output):
    temp_storage = cuda.shared.array(shape=temp_storage_bytes, dtype=numba.uint8)
    block_output = block_sum(temp_storage, input[cuda.threadIdx.x])

    if cuda.threadIdx.x == 0:
        output[0] = block_output

Suppose the set of inputs across the block of threads is { 1, 1, 1, 1, ..., 1 }. The corresponding output in thread0 will be 128.
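
A host-side sketch of launching this kernel follows; the single-block launch configuration is an illustrative assumption:

import numpy as np

# One block of 128 threads, one input element per thread
h_input = np.ones(threads_in_block, dtype=np.int32)
d_input = cuda.to_device(h_input)
d_output = cuda.device_array(1, dtype=np.int32)
kernel[1, threads_in_block](d_input, d_output)

# thread0 wrote the block-wide sum to output[0]
assert d_output.copy_to_host()[0] == 128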

Parameters
  • dtype – Data type being reduced

  • threads_in_block – The number of threads in a block

Returns

A callable object that can be linked to and invoked from a CUDA kernel