CUDA Parallel

Warning

Python exposure of parallel algorithms is in public beta. The API is subject to change without notice.

Algorithms

cuda.parallel.experimental.algorithms.reduce_into(d_in: cuda.parallel.experimental.typing.DeviceArrayLike | cuda.parallel.experimental.iterators._iterators.IteratorBase, d_out: cuda.parallel.experimental.typing.DeviceArrayLike, op: Callable, h_init: numpy.ndarray)

Computes a device-wide reduction using the specified binary operator op and initial value h_init.

Example

The code snippet below demonstrates the usage of the reduce_into API:

import cupy as cp
import numpy as np

import cuda.parallel.experimental.algorithms as algorithms

def min_op(a, b):
    return a if a < b else b

dtype = np.int32
h_init = np.array([42], dtype=dtype)
d_input = cp.array([8, 6, 7, 5, 3, 0, 9], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)

# Instantiate reduction for the given operator and initial value
reduce_into = algorithms.reduce_into(d_input, d_output, min_op, h_init)

# Determine temporary device storage requirements
temp_storage_size = reduce_into(None, d_input, d_output, len(d_input), h_init)

# Allocate temporary storage
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Run reduction
reduce_into(d_temp_storage, d_input, d_output, len(d_input), h_init)

# Check the result is correct
expected_output = 0
assert (d_output == expected_output).all()
Parameters
  • d_in – CUDA device array (or iterator) containing the input sequence of data items

  • d_out – CUDA device array storing the output aggregate

  • op – Binary reduction operator

  • h_init – NumPy array storing the initial value of the reduction

Returns

A callable object that can be used to perform the reduction
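
A typical workflow calls the returned object twice: first with None for the temporary storage argument to query the required scratch size in bytes, then with an allocated buffer to perform the reduction. The sketch below wraps this two-phase pattern in a small convenience helper; the helper name run_reduction is illustrative and not part of the cuda.parallel API:

import cupy as cp
import numpy as np

import cuda.parallel.experimental.algorithms as algorithms

def min_op(a, b):
    return a if a < b else b

def run_reduction(d_in, d_out, op, h_init, num_items):
    # Build the reduction object for the given operator and initial value
    reducer = algorithms.reduce_into(d_in, d_out, op, h_init)
    # First call: query the temporary storage requirement in bytes
    temp_storage_size = reducer(None, d_in, d_out, num_items, h_init)
    d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)
    # Second call: run the reduction using the allocated scratch space
    reducer(d_temp_storage, d_in, d_out, num_items, h_init)
    return d_out

d_in = cp.array([8, 6, 7, 5, 3, 0, 9], dtype=np.int32)
d_out = cp.empty(1, dtype=np.int32)
h_init = np.array([42], dtype=np.int32)
run_reduction(d_in, d_out, min_op, h_init, len(d_in))
assert (d_out == 0).all()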

Iterators

cuda.parallel.experimental.iterators.CacheModifiedInputIterator(device_array, modifier)

Random Access Cache Modified Iterator that wraps a native device pointer.

Similar to https://nvidia.github.io/cccl/cub/api/classcub_1_1CacheModifiedInputIterator.html

Currently the only supported modifier is “stream” (LOAD_CS).

Example

The code snippet below demonstrates the usage of a CacheModifiedInputIterator:

import functools

import cupy as cp
import numpy as np

import cuda.parallel.experimental.algorithms as algorithms
import cuda.parallel.experimental.iterators as iterators

def add_op(a, b):
    return a + b

values = [8, 6, 7, 5, 3, 0, 9]
d_input = cp.array(values, dtype=np.int32)

iterator = iterators.CacheModifiedInputIterator(
    d_input, modifier="stream"
)  # Input sequence
h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int32)  # Storage for output

# Instantiate reduction, determine storage requirements, and allocate storage
reduce_into = algorithms.reduce_into(iterator, d_output, add_op, h_init)
temp_storage_size = reduce_into(None, iterator, d_output, len(values), h_init)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Run reduction
reduce_into(d_temp_storage, iterator, d_output, len(values), h_init)

expected_output = functools.reduce(lambda a, b: a + b, values)
assert (d_output == expected_output).all()
Parameters
  • device_array – CUDA device array storing the input sequence of data items

  • modifier – The PTX cache load modifier

Returns

A CacheModifiedInputIterator object initialized with device_array

cuda.parallel.experimental.iterators.ConstantIterator(value)

Returns an Iterator representing a sequence of constant values.

Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1constant__iterator.html

Example

The code snippet below demonstrates the usage of a ConstantIterator representing the sequence [10, 10, 10]:

import functools

import cupy as cp
import numpy as np

import cuda.parallel.experimental.algorithms as algorithms
import cuda.parallel.experimental.iterators as iterators

def add_op(a, b):
    return a + b

value = 10
num_items = 3

constant_it = iterators.ConstantIterator(np.int32(value))  # Input sequence
h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int32)  # Storage for output

# Instantiate reduction, determine storage requirements, and allocate storage
reduce_into = algorithms.reduce_into(constant_it, d_output, add_op, h_init)
temp_storage_size = reduce_into(None, constant_it, d_output, num_items, h_init)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Run reduction
reduce_into(d_temp_storage, constant_it, d_output, num_items, h_init)

expected_output = functools.reduce(lambda a, b: a + b, [value] * num_items)
assert (d_output == expected_output).all()
Parameters

value – The value of every item in the sequence

Returns

A ConstantIterator object initialized to value

cuda.parallel.experimental.iterators.CountingIterator(offset)

Returns an Iterator representing a sequence of incrementing values.

Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1counting__iterator.html

Example

The code snippet below demonstrates the usage of a CountingIterator representing the sequence [10, 11, 12]:

import functools

import cupy as cp
import numpy as np

import cuda.parallel.experimental.algorithms as algorithms
import cuda.parallel.experimental.iterators as iterators

def add_op(a, b):
    return a + b

first_item = 10
num_items = 3

first_it = iterators.CountingIterator(np.int32(first_item))  # Input sequence
h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int32)  # Storage for output

# Instantiate reduction, determine storage requirements, and allocate storage
reduce_into = algorithms.reduce_into(first_it, d_output, add_op, h_init)
temp_storage_size = reduce_into(None, first_it, d_output, num_items, h_init)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Run reduction
reduce_into(d_temp_storage, first_it, d_output, num_items, h_init)

expected_output = functools.reduce(
    lambda a, b: a + b, range(first_item, first_item + num_items)
)
assert (d_output == expected_output).all()
Parameters

offset – The initial value of the sequence

Returns

A CountingIterator object initialized to offset

cuda.parallel.experimental.iterators.TransformIterator(it, op)

Returns an Iterator representing a transformed sequence of values.

Similar to https://nvidia.github.io/cccl/cub/api/classcub_1_1TransformInputIterator.html

Example

The code snippet below demonstrates the usage of a TransformIterator composed with a CountingIterator, transforming the sequence [10, 11, 12] by squaring each item before the reduction:

import functools

import cupy as cp
import numpy as np

import cuda.parallel.experimental.algorithms as algorithms
import cuda.parallel.experimental.iterators as iterators

def add_op(a, b):
    return a + b

def square_op(a):
    return a**2

first_item = 10
num_items = 3

transform_it = iterators.TransformIterator(
    iterators.CountingIterator(np.int32(first_item)), square_op
)  # Input sequence
h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int32)  # Storage for output

# Instantiate reduction, determine storage requirements, and allocate storage
reduce_into = algorithms.reduce_into(transform_it, d_output, add_op, h_init)
temp_storage_size = reduce_into(None, transform_it, d_output, num_items, h_init)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Run reduction
reduce_into(d_temp_storage, transform_it, d_output, num_items, h_init)

expected_output = functools.reduce(
    lambda a, b: a + b, [a**2 for a in range(first_item, first_item + num_items)]
)
assert (d_output == expected_output).all()
Parameters
  • it – The iterator object to be transformed

  • op – The transform operation

Returns

A TransformIterator object to transform the items in it using op
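
Since TransformIterator accepts another iterator as its input, it can also be composed with, for example, a ConstantIterator. The snippet below is a minimal sketch of that composition, following the same reduction pattern as the examples above (the operator names add_op and double_op are illustrative):

import functools

import cupy as cp
import numpy as np

import cuda.parallel.experimental.algorithms as algorithms
import cuda.parallel.experimental.iterators as iterators

def add_op(a, b):
    return a + b

def double_op(a):
    return 2 * a

value = 7
num_items = 4

# Every item in the underlying sequence is 7; the transform doubles it to 14
doubled_it = iterators.TransformIterator(
    iterators.ConstantIterator(np.int32(value)), double_op
)  # Input sequence
h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int32)  # Storage for output

# Instantiate reduction, determine storage requirements, and allocate storage
reduce_into = algorithms.reduce_into(doubled_it, d_output, add_op, h_init)
temp_storage_size = reduce_into(None, doubled_it, d_output, num_items, h_init)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Run reduction
reduce_into(d_temp_storage, doubled_it, d_output, num_items, h_init)

expected_output = functools.reduce(lambda a, b: a + b, [2 * value] * num_items)
assert (d_output == expected_output).all()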

Utilities

cuda.parallel.experimental.struct.gpu_struct(this: type) → Type[Any]

Defines the given class as a GpuStruct.

A GpuStruct represents a value composed of one or more other values, and is defined as a class with annotated fields (similar to a dataclass). The type of each field must be a subclass of np.number, like np.int32 or np.float64.

Arrays of GpuStruct objects can be used as inputs to cuda.parallel algorithms.

Example

The code snippet below shows how to use gpu_struct to define a Pixel type (composed of r, g and b values), and perform a reduction on an array of Pixel objects to identify the one with the largest g component:

import cupy as cp
import numpy as np

from cuda.parallel.experimental import algorithms
from cuda.parallel.experimental.struct import gpu_struct

@gpu_struct
class Pixel:
    r: np.int32
    g: np.int32
    b: np.int32

def max_g_value(x, y):
    return x if x.g > y.g else y

d_rgb = cp.random.randint(0, 256, (10, 3), dtype=np.int32).view(Pixel.dtype)
d_out = cp.empty(1, Pixel.dtype)

h_init = Pixel(0, 0, 0)

reduce_into = algorithms.reduce_into(d_rgb, d_out, max_g_value, h_init)
temp_storage_bytes = reduce_into(None, d_rgb, d_out, len(d_rgb), h_init)

d_temp_storage = cp.empty(temp_storage_bytes, dtype=np.uint8)
_ = reduce_into(d_temp_storage, d_rgb, d_out, len(d_rgb), h_init)

h_rgb = d_rgb.get()
expected = h_rgb[h_rgb.view("int32")[:, 1].argmax()]

np.testing.assert_equal(expected["g"], d_out.get()["g"])