CUDA Parallel
Warning
Python exposure of parallel algorithms is in public beta. The API is subject to change without notice.
Algorithms
- cuda.parallel.experimental.algorithms.reduce_into(d_in: cuda.parallel.experimental.typing.DeviceArrayLike | cuda.parallel.experimental.iterators._iterators.IteratorBase, d_out: cuda.parallel.experimental.typing.DeviceArrayLike, op: Callable, h_init: numpy.ndarray)
Computes a device-wide reduction using the specified binary op functor and initial value init.
Example
The code snippet below demonstrates the usage of the reduce_into API:

import cupy as cp
import numpy as np

import cuda.parallel.experimental.algorithms as algorithms

def min_op(a, b):
    return a if a < b else b

dtype = np.int32
h_init = np.array([42], dtype=dtype)
d_input = cp.array([8, 6, 7, 5, 3, 0, 9], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)

# Instantiate reduction for the given operator and initial value
reduce_into = algorithms.reduce_into(d_input, d_output, min_op, h_init)

# Determine temporary device storage requirements
temp_storage_size = reduce_into(None, d_input, d_output, len(d_input), h_init)

# Allocate temporary storage
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Run reduction
reduce_into(d_temp_storage, d_input, d_output, len(d_input), h_init)

# Check the result is correct
expected_output = 0
assert (d_output == expected_output).all()
- Parameters
d_in – CUDA device array storing the input sequence of data items
d_out – CUDA device array storing the output aggregate
op – Binary reduction operator
init – Numpy array storing initial value of the reduction
- Returns
A callable object that can be used to perform the reduction
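Semantically, the device-wide reduction corresponds to folding op over the input, seeded with the initial value; a minimal host-side NumPy sketch of that equivalence (illustrative only, not part of the cuda.parallel API — the name host_reduce_into is hypothetical):

```python
import functools

import numpy as np

def host_reduce_into(h_in, op, h_init):
    # Sequential equivalent of the device-wide reduction:
    # fold op over the input, seeded with the initial value.
    return functools.reduce(op, h_in, h_init[0])

def min_op(a, b):
    return a if a < b else b

h_init = np.array([42], dtype=np.int32)
h_input = np.array([8, 6, 7, 5, 3, 0, 9], dtype=np.int32)

result = host_reduce_into(h_input, min_op, h_init)
assert result == 0  # the minimum of the input (0) is below the seed (42)
```

Note that the device implementation may apply op to items in any order, so op should be associative for the device and host results to agree.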
Iterators
- cuda.parallel.experimental.iterators.CacheModifiedInputIterator(device_array, modifier)
Random Access Cache Modified Iterator that wraps a native device pointer.
Similar to https://nvidia.github.io/cccl/cub/api/classcub_1_1CacheModifiedInputIterator.html
Currently the only supported modifier is “stream” (LOAD_CS).
Example
The code snippet below demonstrates the usage of a CacheModifiedInputIterator:

import functools

import cupy as cp
import numpy as np

import cuda.parallel.experimental.algorithms as algorithms
import cuda.parallel.experimental.iterators as iterators

def add_op(a, b):
    return a + b

values = [8, 6, 7, 5, 3, 0, 9]
d_input = cp.array(values, dtype=np.int32)

iterator = iterators.CacheModifiedInputIterator(d_input, modifier="stream")  # Input sequence
h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int32)  # Storage for output

# Instantiate reduction, determine storage requirements, and allocate storage
reduce_into = algorithms.reduce_into(iterator, d_output, add_op, h_init)
temp_storage_size = reduce_into(None, iterator, d_output, len(values), h_init)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Run reduction
reduce_into(d_temp_storage, iterator, d_output, len(values), h_init)

expected_output = functools.reduce(lambda a, b: a + b, values)
assert (d_output == expected_output).all()
- Parameters
device_array – CUDA device array storing the input sequence of data items
modifier – The PTX cache load modifier
- Returns
A CacheModifiedInputIterator object initialized with device_array
- cuda.parallel.experimental.iterators.ConstantIterator(value)
Returns an Iterator representing a sequence of constant values.
Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1constant__iterator.html
Example
The code snippet below demonstrates the usage of a ConstantIterator representing the sequence [10, 10, 10]:

import functools

import cupy as cp
import numpy as np

import cuda.parallel.experimental.algorithms as algorithms
import cuda.parallel.experimental.iterators as iterators

def add_op(a, b):
    return a + b

value = 10
num_items = 3

constant_it = iterators.ConstantIterator(np.int32(value))  # Input sequence
h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int32)  # Storage for output

# Instantiate reduction, determine storage requirements, and allocate storage
reduce_into = algorithms.reduce_into(constant_it, d_output, add_op, h_init)
temp_storage_size = reduce_into(None, constant_it, d_output, num_items, h_init)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Run reduction
reduce_into(d_temp_storage, constant_it, d_output, num_items, h_init)

expected_output = functools.reduce(lambda a, b: a + b, [value] * num_items)
assert (d_output == expected_output).all()
- Parameters
value – The value of every item in the sequence
- Returns
A ConstantIterator object initialized to value
- cuda.parallel.experimental.iterators.CountingIterator(offset)
Returns an Iterator representing a sequence of incrementing values.
Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1counting__iterator.html
Example
The code snippet below demonstrates the usage of a CountingIterator representing the sequence [10, 11, 12]:

import functools

import cupy as cp
import numpy as np

import cuda.parallel.experimental.algorithms as algorithms
import cuda.parallel.experimental.iterators as iterators

def add_op(a, b):
    return a + b

first_item = 10
num_items = 3

first_it = iterators.CountingIterator(np.int32(first_item))  # Input sequence
h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int32)  # Storage for output

# Instantiate reduction, determine storage requirements, and allocate storage
reduce_into = algorithms.reduce_into(first_it, d_output, add_op, h_init)
temp_storage_size = reduce_into(None, first_it, d_output, num_items, h_init)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Run reduction
reduce_into(d_temp_storage, first_it, d_output, num_items, h_init)

expected_output = functools.reduce(
    lambda a, b: a + b, range(first_item, first_item + num_items)
)
assert (d_output == expected_output).all()
- Parameters
offset – The initial value of the sequence
- Returns
A CountingIterator object initialized to offset
- cuda.parallel.experimental.iterators.TransformIterator(it, op)
Returns an Iterator representing a transformed sequence of values.
Similar to https://nvidia.github.io/cccl/cub/api/classcub_1_1TransformInputIterator.html
Example
The code snippet below demonstrates the usage of a TransformIterator composed with a CountingIterator, transforming the sequence [10, 11, 12] by squaring each item before reducing the output:

import functools

import cupy as cp
import numpy as np

import cuda.parallel.experimental.algorithms as algorithms
import cuda.parallel.experimental.iterators as iterators

def add_op(a, b):
    return a + b

def square_op(a):
    return a**2

first_item = 10
num_items = 3

transform_it = iterators.TransformIterator(
    iterators.CountingIterator(np.int32(first_item)), square_op
)  # Input sequence
h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int32)  # Storage for output

# Instantiate reduction, determine storage requirements, and allocate storage
reduce_into = algorithms.reduce_into(transform_it, d_output, add_op, h_init)
temp_storage_size = reduce_into(None, transform_it, d_output, num_items, h_init)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Run reduction
reduce_into(d_temp_storage, transform_it, d_output, num_items, h_init)

expected_output = functools.reduce(
    lambda a, b: a + b, [a**2 for a in range(first_item, first_item + num_items)]
)
assert (d_output == expected_output).all()
- Parameters
it – The iterator object to be transformed
op – The transform operation
- Returns
A TransformIterator object to transform the items in it using op
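Conceptually, a TransformIterator applies op lazily to each item of the underlying sequence as it is read. A host-side Python sketch of the same composition (illustrative only, not the cuda.parallel implementation — the helper names below are hypothetical):

```python
import itertools

def counting(offset):
    # Host-side stand-in for CountingIterator: offset, offset + 1, ...
    n = offset
    while True:
        yield n
        n += 1

def transform(it, op):
    # Host-side stand-in for TransformIterator: apply op lazily to each item.
    return (op(x) for x in it)

# Squaring the sequence 10, 11, 12, ... lazily
squares = transform(counting(10), lambda a: a ** 2)
first_three = list(itertools.islice(squares, 3))
assert first_three == [100, 121, 144]
```

Because the transformation is applied on the fly, the input sequence is never materialized in memory; the device-side iterators give the same memory-footprint benefit.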
Utilities
- cuda.parallel.experimental.struct.gpu_struct(this: type) → Type[Any]
Defines the given class as being a GpuStruct.
A GpuStruct represents a value composed of one or more other values, and is defined as a class with annotated fields (similar to a dataclass). The type of each field must be a subclass of np.number, like np.int32 or np.float64.
Arrays of GpuStruct objects can be used as inputs to cuda.parallel algorithms.
Example
The code snippet below shows how to use gpu_struct to define a Pixel type (composed of r, g and b values), and perform a reduction on an array of Pixel objects to identify the one with the largest g component:
import cupy as cp
import numpy as np

from cuda.parallel.experimental import algorithms
from cuda.parallel.experimental.struct import gpu_struct

@gpu_struct
class Pixel:
    r: np.int32
    g: np.int32
    b: np.int32

def max_g_value(x, y):
    return x if x.g > y.g else y

d_rgb = cp.random.randint(0, 256, (10, 3), dtype=np.int32).view(Pixel.dtype)
d_out = cp.empty(1, Pixel.dtype)

h_init = Pixel(0, 0, 0)

reduce_into = algorithms.reduce_into(d_rgb, d_out, max_g_value, h_init)
temp_storage_bytes = reduce_into(None, d_rgb, d_out, len(d_rgb), h_init)
d_temp_storage = cp.empty(temp_storage_bytes, dtype=np.uint8)

_ = reduce_into(d_temp_storage, d_rgb, d_out, len(d_rgb), h_init)

h_rgb = d_rgb.get()
expected = h_rgb[h_rgb.view("int32")[:, 1].argmax()]

np.testing.assert_equal(expected["g"], d_out.get()["g"])
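For comparison, the same max-by-g selection can be computed entirely on the host with a plain NumPy structured array; a minimal sketch (the structured dtype below is built by hand to mirror the Pixel fields, not via gpu_struct):

```python
import numpy as np

# Hand-built structured dtype mirroring the Pixel fields above
pixel_dtype = np.dtype([("r", np.int32), ("g", np.int32), ("b", np.int32)])

rng = np.random.default_rng(0)

# Reinterpret a (10, 3) int32 array as 10 pixel records
h_rgb = rng.integers(0, 256, (10, 3), dtype=np.int32).view(pixel_dtype).reshape(-1)

# The pixel with the largest g component, found sequentially
winner = h_rgb[h_rgb["g"].argmax()]
assert winner["g"] == h_rgb["g"].max()
```

This mirrors what the device-side reduction with max_g_value computes, which is useful for validating results in tests.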