cuda.compute: Parallel Computing Primitives

The cuda.compute library provides parallel computing primitives that operate on entire arrays or ranges of data. These algorithms are designed to be easy to use from Python while delivering the performance of hand-optimized CUDA kernels that are portable across different GPU architectures.

Algorithms

The core functionality provided by the cuda.compute library consists of algorithms such as reductions, scans, sorts, and transforms.

Here’s a simple example showing how to use the reduce_into algorithm to reduce an array of integers.

Basic reduction example.
"""
Sum all values in an array using reduction with PLUS operation.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import OpKind

# Prepare the input and output arrays.
dtype = np.int32
h_init = np.array([0], dtype=dtype)
d_input = cp.array([1, 2, 3, 4, 5], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)

# Perform the reduction.
cuda.compute.reduce_into(d_input, d_output, OpKind.PLUS, len(d_input), h_init)

# Verify the result.
expected_output = 15
assert (d_output == expected_output).all()
result = d_output[0]
print(f"Sum reduction result: {result}")

Many algorithms, including reduction, require a temporary memory buffer. The library will allocate this buffer for you, but you can also use the object-based API for greater control.

Reduction with object-based API.
"""
Reduction example using the object API.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    OpKind,
)

# Prepare the input and output arrays.
dtype = np.int32
init_value = 5
h_init = np.array([init_value], dtype=dtype)
h_input = np.array([1, 2, 3, 4], dtype=dtype)
d_input = cp.asarray(h_input)
d_output = cp.empty(1, dtype=dtype)

# Create a reducer object.
reducer = cuda.compute.make_reduce_into(d_input, d_output, OpKind.PLUS, h_init)

# Get the temporary storage size by passing None for the storage argument.
temp_storage_size = reducer(None, d_input, d_output, len(h_input), h_init)

# Allocate temporary storage using any user-defined allocator.
# The result must be an object exposing `__cuda_array_interface__`.
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Perform the reduction.
reducer(d_temp_storage, d_input, d_output, len(h_input), h_init)

expected_result = np.sum(h_input) + init_value
actual_result = d_output.get()[0]
assert actual_result == expected_result
print("Reduce object example completed successfully")
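
Other algorithms follow the same call pattern. As a minimal sketch (based on the unary_transform call shown later in the user-defined operations section; double_op is a hypothetical name), here is how an elementwise transform might look:

Unary transform example (sketch).
"""
Double each value in an array using unary_transform.
"""

import cupy as cp
import numpy as np

import cuda.compute


# Define the unary operation applied to each element.
def double_op(a):
    return 2 * a


d_input = cp.array([1, 2, 3, 4, 5], dtype=np.int32)
d_output = cp.empty_like(d_input)

# Apply double_op elementwise, writing the results to d_output.
cuda.compute.unary_transform(d_input, d_output, double_op, len(d_input))

assert (d_output.get() == [2, 4, 6, 8, 10]).all()
print(f"Unary transform result: {d_output}")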

Iterators

Algorithms can be used not just on arrays, but also on iterators. Iterators provide a way to represent sequences of data without needing to allocate memory for them.

Here’s an example showing how to use reduction with a CountingIterator that generates a sequence of numbers starting from a specified value.

Counting iterator example.
"""
Example showing how to use counting_iterator.
"""

import functools

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    CountingIterator,
    OpKind,
)

# Define the sequence parameters.
first_item = 10
num_items = 3

# Create the counting iterator.
first_it = CountingIterator(np.int32(first_item))

# Prepare the initial value for the reduction.
h_init = np.array([0], dtype=np.int32)

# Prepare the output array.
d_output = cp.empty(1, dtype=np.int32)

# Perform the reduction.
cuda.compute.reduce_into(first_it, d_output, OpKind.PLUS, num_items, h_init)

# Verify the result.
expected_output = functools.reduce(
    lambda a, b: a + b, range(first_item, first_item + num_items)
)
assert (d_output == expected_output).all()
print(f"Counting iterator result: {d_output[0]} (expected: {expected_output})")

Iterators also provide a way to compose operations. Here’s an example showing how to use reduce_into with a TransformIterator to compute the sum of squares of a sequence of numbers.

Transform iterator example.
"""
Demonstrate reduction with transform iterator.
"""

import functools

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    CountingIterator,
    OpKind,
    TransformIterator,
)


def transform_op(a):
    return -a if a % 2 == 0 else a


# Define the sequence parameters.
first_item = 10
num_items = 100

transform_it = TransformIterator(
    CountingIterator(np.int32(first_item)), transform_op
)  # Input sequence
h_init = np.array([0], dtype=np.int64)  # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int64)  # Storage for output

# Perform the reduction.
cuda.compute.reduce_into(transform_it, d_output, OpKind.PLUS, num_items, h_init)

# Verify the result.
expected_output = functools.reduce(
    lambda a, b: a + b,
    [-a if a % 2 == 0 else a for a in range(first_item, first_item + num_items)],
)

assert (d_output == expected_output).all()
print(f"Transform iterator result: {d_output[0]} (expected: {expected_output})")

Iterators that wrap an array (or another output iterator) may be used as both input and output iterators. Here’s an example showing how to use a TransformIterator to transform the output of a reduction before writing to the underlying array.
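
The following is a minimal sketch, assuming TransformIterator applies its operation to each value as it is written to the wrapped array (negate_op is a hypothetical name):

Output transform example (sketch).
"""
Negate the result of a reduction by wrapping the output array in a
TransformIterator.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import OpKind, TransformIterator


# Operation applied to each value written through the iterator.
def negate_op(a):
    return -a


dtype = np.int32
h_init = np.array([0], dtype=dtype)
d_input = cp.array([1, 2, 3, 4, 5], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)

# Wrap the output array so negate_op is applied on each write.
out_it = TransformIterator(d_output, negate_op)

# Perform the reduction; the sum (15) is negated before being stored.
cuda.compute.reduce_into(d_input, out_it, OpKind.PLUS, len(d_input), h_init)

assert (d_output == -15).all()
print(f"Output transform result: {d_output[0]}")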

Custom Types

The cuda.compute library supports defining custom data types using the gpu_struct decorator. Here is an example showing how to define and use a custom type:

Custom type reduction example.
"""
Finding the maximum green value in a sequence of pixels using `reduce_into`
with a custom data type.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import gpu_struct


# Define a custom data type to store the pixel values.
@gpu_struct
class Pixel:
    r: np.int32
    g: np.int32
    b: np.int32


# Define a reduction operation that returns the pixel with the maximum green value.
def max_g_value(x, y):
    return x if x.g > y.g else y


# Prepare the input and output arrays.
d_rgb = cp.random.randint(0, 256, (10, 3), dtype=np.int32).view(Pixel.dtype)
d_out = cp.empty(1, Pixel.dtype)

# Prepare the initial value for the reduction.
h_init = Pixel(0, 0, 0)

# Perform the reduction.
cuda.compute.reduce_into(d_rgb, d_out, max_g_value, d_rgb.size, h_init)

# Calculate the expected result.
h_rgb = d_rgb.get()
expected = h_rgb[h_rgb.view("int32")[:, 1].argmax()]

# Verify the result.
assert expected["g"] == d_out.get()["g"]
result = d_out.get()
print(f"Pixel reduction result: {result}")

User-defined operations

A powerful feature of cuda.compute is the ability to customize algorithms with user-defined operations. Below is an example of a custom reduction with a user-defined binary operation.

Reduction with user-defined binary operations.
"""
Sum only even values in an array using reduction with custom operation.
"""

import cupy as cp
import numpy as np

import cuda.compute

# Prepare the input and output arrays.
dtype = np.int32
h_init = np.array([0], dtype=dtype)
d_input = cp.array([1, 2, 3, 4, 5], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)

# Define the binary operation for the reduction: odd inputs contribute
# zero, so only the even values are summed.
def add_op(a, b):
    return (a if a % 2 == 0 else 0) + (b if b % 2 == 0 else 0)


# Perform the reduction.
cuda.compute.reduce_into(d_input, d_output, add_op, len(d_input), h_init)

# Verify the result.
expected_output = 6
assert (d_output == expected_output).all()
result = d_output[0]
print(f"Custom sum reduction result: {result}")

Note that user-defined operations are compiled into device code using numba-cuda, so many of the same features and restrictions of numba and numba-cuda apply. Here are some important gotchas to be aware of:

  • Lambda functions are not supported.

  • Functions may invoke other functions, but the inner functions must be decorated with @numba.cuda.jit (see the sketch after this list).

  • Functions capturing by global reference may not work as intended. Prefer using closures in these situations.

    Here is an example of a function that captures a global variable by reference and is then used in a loop with unary_transform.

    for i in range(10):
        def func(x):
            return x + i  # i is captured from global scope
    
        cuda.compute.unary_transform(d_in, d_out, func, num_items)
    

    Modifications to the global variable i may not be reflected in the function when the function is called multiple times. Thus, the different calls to unary_transform may not produce different results. This is true even though the function is redefined each time in the loop.

    To avoid this, capture the variable in a closure:

    def make_func(i):
        def func(x):
            return x + i  # i is captured as a closure variable
        return func
    
    for i in range(10):
        func = make_func(i)
        cuda.compute.unary_transform(d_in, d_out, func, num_items)
    
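As a minimal sketch illustrating the first two gotchas above (the names clamp and add_clamped are hypothetical), a user-defined operation that calls a helper function might look like this:

import numba.cuda


# The inner function must be decorated with @numba.cuda.jit so that it
# compiles as a device function callable from the user-defined operation.
@numba.cuda.jit(device=True)
def clamp(x):
    return 0 if x < 0 else x


# A named function (not a lambda) that invokes the decorated helper.
# It can be passed to algorithms such as cuda.compute.reduce_into.
def add_clamped(a, b):
    return clamp(a) + clamp(b)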

Example Collections

For complete runnable examples and more advanced usage patterns, see our full collection of examples.

External API References