cuda.compute: Parallel Computing Primitives

The cuda.compute library provides parallel computing primitives that operate on entire arrays or ranges of data. These algorithms are designed to be easy to use from Python while delivering the performance of hand-optimized CUDA kernels and remaining portable across GPU architectures.

Algorithms

The core functionality provided by the cuda.compute library is a set of algorithms such as reductions, scans, sorts, and transforms.

Here’s a simple example showing how to use the reduce_into algorithm to reduce an array of integers.
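The following is a minimal sketch of that pattern. It assumes the single-phase reduce_into(d_in, d_out, op, num_items, h_init) call signature and uses CuPy arrays for device memory; check the API reference for the exact signature in your version.

```python
import cupy as cp
import numpy as np

import cuda.compute


# Binary operator for the reduction; compiled for the device by the library.
def add_op(x, y):
    return x + y


dtype = np.int32
h_init = np.asarray([0], dtype=dtype)  # initial value of the reduction
d_input = cp.array([1, 2, 3, 4, 5], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)  # receives the single reduced value

# Reduce the input into d_output; the temporary buffer is handled internally.
cuda.compute.reduce_into(d_input, d_output, add_op, len(d_input), h_init)

assert d_output.get()[0] == 15  # 1 + 2 + 3 + 4 + 5
```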

Many algorithms, including reduction, require a temporary memory buffer. By default, the library allocates this buffer for you, but you can also use the object-based API to size, allocate, and reuse the buffer yourself for greater control.
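The sketch below shows the two-phase pattern this implies, mirroring the underlying CUB C++ API: a first call with None for the temporary storage reports the required size, and a second call runs the algorithm. The make_reduce_into name and the exact call convention are assumptions; consult the API reference for your release.

```python
import cupy as cp
import numpy as np

import cuda.compute


def add_op(x, y):
    return x + y


dtype = np.int32
h_init = np.asarray([0], dtype=dtype)
d_input = cp.array([1, 2, 3, 4, 5], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)

# Build the reduction object once; it can be reused across many calls.
reducer = cuda.compute.make_reduce_into(d_input, d_output, add_op, h_init)

# Phase 1: passing None reports how many bytes of temporary storage are needed.
temp_storage_bytes = reducer(None, d_input, d_output, len(d_input), h_init)
d_temp_storage = cp.empty(temp_storage_bytes, dtype=np.uint8)

# Phase 2: run the reduction with the caller-owned buffer.
reducer(d_temp_storage, d_input, d_output, len(d_input), h_init)

assert d_output.get()[0] == 15
```

Owning the buffer lets you draw it from a memory pool or share a single allocation across repeated invocations.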

Iterators

Algorithms can be used not just on arrays, but also on iterators. Iterators provide a way to represent sequences of data without needing to allocate memory for them.

Here’s an example showing how to use reduction with a CountingIterator that generates a sequence of numbers starting from a specified value.
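A sketch of that example, assuming CountingIterator is importable from the top-level cuda.compute namespace and that the starting value fixes the iterator's value type:

```python
import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import CountingIterator


def add_op(x, y):
    return x + y


# Represents the virtual sequence 10, 11, 12, ... without allocating it.
first_item = CountingIterator(np.int32(10))
num_items = 3

d_output = cp.empty(1, dtype=np.int32)
h_init = np.asarray([0], dtype=np.int32)

cuda.compute.reduce_into(first_item, d_output, add_op, num_items, h_init)

assert d_output.get()[0] == 33  # 10 + 11 + 12
```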

Iterators also provide a way to compose operations. Here’s an example showing how to use reduce_into with a TransformIterator to compute the sum of squares of a sequence of numbers.
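In sketch form, a TransformIterator lazily applies a unary function to each element of the iterator it wraps, so the squares are never materialized in memory (iterator names assumed as above):

```python
import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import CountingIterator, TransformIterator


def square(x):
    return x * x


def add_op(x, y):
    return x + y


# Squares each element of the virtual sequence 1, 2, 3, 4 on the fly.
transform_it = TransformIterator(CountingIterator(np.int32(1)), square)
num_items = 4

d_output = cp.empty(1, dtype=np.int32)
h_init = np.asarray([0], dtype=np.int32)

cuda.compute.reduce_into(transform_it, d_output, add_op, num_items, h_init)

assert d_output.get()[0] == 30  # 1 + 4 + 9 + 16
```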

Iterators that wrap an array (or another output iterator) may be used as both input and output iterators. Here’s an example showing how to use a TransformIterator to transform the output of a reduction before writing to the underlying array.
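A sketch of that usage, assuming a TransformIterator wrapping a device array applies its function to each value before it is written through; the negation here is a hypothetical transform chosen for illustration:

```python
import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import TransformIterator


def add_op(x, y):
    return x + y


def negate(x):
    return -x


dtype = np.int32
d_input = cp.array([1, 2, 3, 4, 5], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)
h_init = np.asarray([0], dtype=dtype)

# The reduced value is negated before being stored into d_output.
out_it = TransformIterator(d_output, negate)

cuda.compute.reduce_into(d_input, out_it, add_op, len(d_input), h_init)

assert d_output.get()[0] == -15  # -(1 + 2 + 3 + 4 + 5)
```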

Custom Types

The cuda.compute library supports defining custom data types using the gpu_struct decorator. Here are some examples showing how to define and use custom types:
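For instance, a three-channel Pixel type can serve as the element type of a reduction. This sketch assumes gpu_struct classes expose a .dtype attribute and are constructible from field values, as in the examples shipped with the library:

```python
import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import gpu_struct


# A custom value type: each element carries three channels.
@gpu_struct
class Pixel:
    r: np.int32
    g: np.int32
    b: np.int32


def max_g_value(x, y):
    # Keep whichever pixel has the larger green channel.
    return x if x.g > y.g else y


# Ten random pixels, viewed as an array of Pixel structs.
d_rgb = cp.random.randint(0, 256, (10, 3), dtype=np.int32).view(Pixel.dtype)
d_out = cp.empty(1, Pixel.dtype)
h_init = Pixel(0, 0, 0)

cuda.compute.reduce_into(d_rgb, d_out, max_g_value, d_rgb.size, h_init)

# The result holds the pixel with the maximum green value.
assert d_out.get()["g"] == d_rgb.get()["g"].max()
```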

Example Collections

For complete runnable examples and more advanced usage patterns, see our full collection of examples.

External API References