cuda.compute: Parallel Computing Primitives#
The cuda.compute library provides parallel computing primitives that operate on entire arrays or ranges of data. These algorithms are designed to be easy to use from Python while delivering the performance of hand-optimized CUDA kernels, and they are portable across different GPU architectures.
Algorithms#
The core functionality provided by the cuda.compute
library are algorithms such
as reductions, scans, sorts, and transforms.
Here’s a simple example showing how to use the reduce_into
algorithm to
reduce an array of integers.
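Below is a minimal sketch, assuming the single-call signature reduce_into(d_in, d_out, op, num_items, h_init) and using CuPy for device arrays; check the API reference for the exact parameter order.

```python
import cupy as cp
import numpy as np

import cuda.compute


def add(a, b):
    return a + b


dtype = np.int32
d_in = cp.array([1, 2, 3, 4, 5], dtype=dtype)  # device input array
d_out = cp.empty(1, dtype=dtype)               # single-element device output
h_init = np.array([0], dtype=dtype)            # initial value on the host

# Reduce the input with `add`, starting from h_init; temporary
# storage is allocated internally by the library.
cuda.compute.reduce_into(d_in, d_out, add, len(d_in), h_init)

print(d_out)  # [15]
```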
Many algorithms, including reduction, require a temporary memory buffer. The library will allocate this buffer for you, but you can also use the object-based API for greater control.
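The object-based form is sketched below; the constructor name make_reduce_into and the two-phase temporary-storage protocol are assumptions carried over from the earlier cuda.parallel API, so verify them against the API reference.

```python
import cupy as cp
import numpy as np

import cuda.compute


def add(a, b):
    return a + b


d_in = cp.array([1, 2, 3, 4, 5], dtype=np.int32)
d_out = cp.empty(1, dtype=np.int32)
h_init = np.array([0], dtype=np.int32)

# Build the algorithm object once (constructor name assumed).
reducer = cuda.compute.make_reduce_into(d_in, d_out, add, h_init)

# Calling with no temp storage returns the required size in bytes.
temp_bytes = reducer(None, d_in, d_out, len(d_in), h_init)
d_temp = cp.empty(temp_bytes, dtype=np.uint8)  # caller-owned scratch buffer

# Second call performs the reduction using the caller-provided buffer.
reducer(d_temp, d_in, d_out, len(d_in), h_init)
```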
Iterators#
Algorithms can be used not just on arrays, but also on iterators. Iterators provide a way to represent sequences of data without needing to allocate memory for them.
Here’s an example showing how to use reduction with a CountingIterator that generates a sequence of numbers starting from a specified value.
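A minimal sketch, assuming CountingIterator is importable from cuda.compute and takes a typed start value:

```python
import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import CountingIterator  # import path assumed


def add(a, b):
    return a + b


first = CountingIterator(np.int32(10))  # yields 10, 11, 12, ...
num_items = 5

d_out = cp.empty(1, dtype=np.int32)
h_init = np.array([0], dtype=np.int32)

# Sums 10 + 11 + 12 + 13 + 14 without allocating an input array.
cuda.compute.reduce_into(first, d_out, add, num_items, h_init)
print(d_out)  # [60]
```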
Iterators also provide a way to compose operations. Here’s an example showing how to use reduce_into with a TransformIterator to compute the sum of squares of a sequence of numbers.
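A sketch under the same assumptions, with TransformIterator(it, op) applying op to each element as it is read:

```python
import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import CountingIterator, TransformIterator  # paths assumed


def square(x):
    return x * x


def add(a, b):
    return a + b


# Lazily produces 1, 4, 9, 16, 25: each count is squared as it is read.
d_in = TransformIterator(CountingIterator(np.int32(1)), square)

d_out = cp.empty(1, dtype=np.int32)
h_init = np.array([0], dtype=np.int32)

cuda.compute.reduce_into(d_in, d_out, add, 5, h_init)
print(d_out)  # [55]
```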
Iterators that wrap an array (or another output iterator) may be used as both input and output iterators.
Here’s an example showing how to use a TransformIterator to transform the output of a reduction before writing to the underlying array.
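A hedged sketch follows; that TransformIterator accepts a device array directly and applies its operation on write when used in output position is an assumption based on the description above.

```python
import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import TransformIterator  # path assumed


def add(a, b):
    return a + b


def negate(x):
    return -x


d_in = cp.array([1, 2, 3, 4], dtype=np.int32)
d_result = cp.empty(1, dtype=np.int32)

# Wrapping the output array: `negate` is applied to the reduction
# result before it is stored in d_result.
d_out = TransformIterator(d_result, negate)

h_init = np.array([0], dtype=np.int32)
cuda.compute.reduce_into(d_in, d_out, add, len(d_in), h_init)
print(d_result)  # [-10]
```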
Custom Types#
The cuda.compute library supports defining custom data types using the gpu_struct decorator.
Here are some examples showing how to define and use custom types:
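A minimal sketch: the gpu_struct import path, the Pixel.dtype attribute, and the Pixel(...) constructor used for the initial value are assumptions to verify against the API reference.

```python
import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import gpu_struct  # import path assumed


# Field annotations define the struct's layout on the device.
@gpu_struct
class Pixel:
    r: np.int32
    g: np.int32
    b: np.int32


def max_g(p1, p2):
    # Keep whichever pixel has the larger green channel.
    return p1 if p1.g > p2.g else p2


# Random RGB triples, viewed as an array of Pixel values.
d_rgb = cp.random.randint(0, 256, (16, 3), dtype=np.int32).view(Pixel.dtype)
d_out = cp.empty(1, Pixel.dtype)
h_init = Pixel(0, 0, 0)  # assumed constructor for the initial value

cuda.compute.reduce_into(d_rgb, d_out, max_g, len(d_rgb), h_init)
```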
Example Collections#
For complete runnable examples and more advanced usage patterns, see our full collection of examples.