cuda.cccl.parallel
API Reference
Warning
Python exposure of parallel algorithms is in public beta. The API is subject to change without notice.
Algorithms
- class cuda.cccl.parallel.experimental.algorithms.DoubleBuffer(d_current: cuda.cccl.parallel.experimental.typing.DeviceArrayLike, d_alternate: cuda.cccl.parallel.experimental.typing.DeviceArrayLike)
- alternate()
- current()
- class cuda.cccl.parallel.experimental.algorithms.SortOrder(value)
An enumeration.
- ASCENDING = 0
- DESCENDING = 1
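A `DoubleBuffer` pairs a current and an alternate array so that `radix_sort` can ping-pong between them, which requires less temporary storage than separate input and output arrays; `SortOrder` selects the sort direction. The sketch below uses illustrative data and the single-phase `radix_sort()` documented later in this section:
```python
import cupy as cp

import cuda.cccl.parallel.experimental as parallel

d_keys = cp.array([4, 2, 3, 1], dtype="int32")
d_temp = cp.empty_like(d_keys)

# radix_sort may use either buffer as scratch space and can
# overwrite both of them
keys_double_buffer = parallel.DoubleBuffer(d_keys, d_temp)

# Keys-only ascending sort: the values and output arrays are None
parallel.radix_sort(
    keys_double_buffer,
    None,
    None,
    None,
    parallel.SortOrder.ASCENDING,
    d_keys.size,
)

# current() is the buffer holding the sorted result;
# alternate() is the other buffer
sorted_keys = keys_double_buffer.current()
```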
- cuda.cccl.parallel.experimental.algorithms.binary_transform(d_in1: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_in2: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_out: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, op: Callable, num_items: int, stream=None)
Applies a transformation to the given pair of input sequences according to the binary operation `op`. This is the single-phase API, which handles temporary storage allocation and execution in a single call.
Example
```python
import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel


def op(a, b):
    return a + b


d_in1 = cp.array([1.0, 2.0, 3.0], dtype="float32")
d_in2 = cp.array([10.0, 20.0, 30.0], dtype="float32")
d_out = cp.empty_like(d_in1)

# Apply op elementwise to the pair of input sequences
parallel.binary_transform(d_in1, d_in2, d_out, op, len(d_in1))

expected = d_in1.get() + d_in2.get()
np.testing.assert_allclose(expected, d_out.get(), rtol=1e-5)
```
- Parameters
d_in1 – Device array or iterator containing the first input sequence of data items.
d_in2 – Device array or iterator containing the second input sequence of data items.
d_out – Device array or iterator to store the result of the transformation.
op – Binary operation to apply to each pair of items from the input sequences.
num_items – Number of items to transform.
stream – CUDA stream to use for the operation.
- cuda.cccl.parallel.experimental.algorithms.exclusive_scan(d_in: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_out: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, op: Callable, h_init: Union[numpy.ndarray, Any], num_items: int, stream=None)
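Performs a device-wide exclusive scan using the single-phase API, which handles temporary storage allocation and execution in a single call. See the example under `make_exclusive_scan()`.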
- cuda.cccl.parallel.experimental.algorithms.histogram_even(d_samples: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_histogram: cuda.cccl.parallel.experimental.typing.DeviceArrayLike, num_output_levels: int, lower_level: Union[numpy.floating, numpy.integer], upper_level: Union[numpy.floating, numpy.integer], num_samples: int, stream=None)
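Computes a device-wide histogram with evenly-spaced bins using the single-phase API, which handles temporary storage allocation and execution in a single call. See the example under `make_histogram_even()`.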
- cuda.cccl.parallel.experimental.algorithms.inclusive_scan(d_in: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_out: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, op: Callable, h_init: Union[numpy.ndarray, Any], num_items: int, stream=None)
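Performs a device-wide inclusive scan using the single-phase API, which handles temporary storage allocation and execution in a single call. See the example under `make_inclusive_scan()`.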
- cuda.cccl.parallel.experimental.algorithms.make_binary_transform(d_in1: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_in2: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_out: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, op: Callable)
Creates a binary transform object that can be called to apply a transformation to the given pair of input sequences according to the binary operation `op`. This is the object-oriented API that allows explicit control over temporary storage allocation; for simpler usage, consider using `binary_transform()`. A sketch of invoking the returned object directly follows this entry's Returns block.
Example
```python
import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel


def op(a, b):
    return a + b


d_in1 = cp.array([1.0, 2.0, 3.0], dtype="float32")
d_in2 = cp.array([10.0, 20.0, 30.0], dtype="float32")
d_out = cp.empty_like(d_in1)

# Apply op elementwise to the pair of input sequences
parallel.binary_transform(d_in1, d_in2, d_out, op, len(d_in1))

expected = d_in1.get() + d_in2.get()
np.testing.assert_allclose(expected, d_out.get(), rtol=1e-5)
```
- Parameters
d_in1 – Device array or iterator containing the first input sequence of data items.
d_in2 – Device array or iterator containing the second input sequence of data items.
d_out – Device array or iterator to store the result of the transformation.
op – Binary operation to apply to each pair of items from the input sequences.
- Returns
A callable object that performs the transformation.
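The example above uses the single-phase `binary_transform()`. Below is a minimal sketch of invoking the object returned by `make_binary_transform()`; the call signature of the returned object, the arrays plus an item count with no temporary-storage argument (transforms require none), is an assumption:
```python
import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel


def op(a, b):
    return a + b


d_in1 = cp.array([1, 2, 3], dtype="int32")
d_in2 = cp.array([10, 20, 30], dtype="int32")
d_out = cp.empty_like(d_in1)

# Build the transform object once; it can be reused across calls
transformer = parallel.make_binary_transform(d_in1, d_in2, d_out, op)

# Assumed invocation: the arrays plus num_items
transformer(d_in1, d_in2, d_out, len(d_in1))

np.testing.assert_array_equal(d_out.get(), np.asarray([11, 22, 33]))
```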
- cuda.cccl.parallel.experimental.algorithms.make_exclusive_scan(d_in: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_out: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, op: Callable, h_init: numpy.ndarray)
Computes a device-wide exclusive scan using the specified binary `op` and initial value `h_init`.
Example
Below, `exclusive_scan` is used to compute an exclusive scan of a sequence of integers.
```python
import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel


def max_op(a, b):
    return max(a, b)


h_init = np.array([1], dtype="int32")
d_input = cp.array([-5, 0, 2, -3, 2, 4, 0, -1, 2, 8], dtype="int32")
d_output = cp.empty_like(d_input, dtype="int32")

# Run exclusive scan with automatic temp storage allocation
parallel.exclusive_scan(d_input, d_output, max_op, h_init, d_input.size)

# Check the result is correct
expected = np.asarray([1, 1, 1, 2, 2, 2, 4, 4, 4, 4])
np.testing.assert_equal(d_output.get(), expected)
```
- Parameters
d_in – Device array or iterator containing the input sequence of data items
d_out – Device array that will store the result of the scan
op – Callable representing the binary operator to apply
h_init – Numpy array storing the initial value of the scan
- Returns
A callable object that can be used to perform the scan
- cuda.cccl.parallel.experimental.algorithms.make_histogram_even(d_samples: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_histogram: cuda.cccl.parallel.experimental.typing.DeviceArrayLike, h_num_output_levels: numpy.ndarray, h_lower_level: numpy.ndarray, h_upper_level: numpy.ndarray, num_samples: int)
Implements a device-wide histogram that places `d_samples` into evenly-spaced bins.
Example
Below, `histogram_even` is used to bin a sequence of samples.
```python
import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

num_samples = 10
h_samples = np.array(
    [2.2, 6.1, 7.1, 2.9, 3.5, 0.3, 2.9, 2.1, 6.1, 999.5], dtype="float32"
)
d_samples = cp.asarray(h_samples)

num_levels = 7
d_histogram = cp.empty(num_levels - 1, dtype="int32")
lower_level = np.float64(0)
upper_level = np.float64(12)

# Run histogram with automatic temp storage allocation
parallel.histogram_even(
    d_samples,
    d_histogram,
    num_levels,
    lower_level,
    upper_level,
    num_samples,
)

# Check the result is correct
h_actual_histogram = cp.asnumpy(d_histogram)

# Calculate expected histogram using numpy
h_expected_histogram, _ = np.histogram(
    h_samples, bins=num_levels - 1, range=(lower_level, upper_level)
)
h_expected_histogram = h_expected_histogram.astype("int32")

np.testing.assert_array_equal(h_actual_histogram, h_expected_histogram)
```
- Parameters
d_samples – Device array or iterator containing the input samples to be histogrammed
d_histogram – Device array to store the histogram
h_num_output_levels – Host array containing the number of output levels
h_lower_level – Host array containing the lower level
h_upper_level – Host array containing the upper level
num_samples – Number of samples to be histogrammed
- Returns
A callable object that can be used to perform the histogram
- cuda.cccl.parallel.experimental.algorithms.make_inclusive_scan(d_in: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_out: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, op: Callable, h_init: numpy.ndarray)
Computes a device-wide inclusive scan using the specified binary `op` and initial value `h_init`.
Example
Below, `inclusive_scan` is used to compute an inclusive scan of a sequence of integers.
```python
import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel


def add_op(a, b):
    return a + b


h_init = np.array([0], dtype="int32")
d_input = cp.array([-5, 0, 2, -3, 2, 4, 0, -1, 2, 8], dtype="int32")
d_output = cp.empty_like(d_input, dtype="int32")

# Run inclusive scan with automatic temp storage allocation
parallel.inclusive_scan(d_input, d_output, add_op, h_init, d_input.size)

# Check the result is correct
expected = np.asarray([-5, -5, -3, -6, -4, 0, 0, -1, 1, 9])
np.testing.assert_equal(d_output.get(), expected)
```
- Parameters
d_in – Device array or iterator containing the input sequence of data items
d_out – Device array that will store the result of the scan
op – Callable representing the binary operator to apply
h_init – Numpy array storing the initial value of the scan
- Returns
A callable object that can be used to perform the scan
- cuda.cccl.parallel.experimental.algorithms.make_merge_sort(d_in_keys: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_in_items: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase | None, d_out_keys: cuda.cccl.parallel.experimental.typing.DeviceArrayLike, d_out_items: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | None, op: Callable)
Implements a device-wide merge sort of `d_in_keys` using the comparison operator `op`.
Example
Below, `merge_sort` is used to sort a sequence of keys in place. It also rearranges the items according to the keys' order.
```python
import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel


def compare_op(lhs, rhs):
    return np.uint8(lhs < rhs)


h_in_keys = np.array([-5, 0, 2, -3, 2, 4, 0, -1, 2, 8], dtype="int32")
h_in_items = np.array(
    [-3.2, 2.2, 1.9, 4.0, -3.9, 2.7, 0, 8.3 - 1, 2.9, 5.4], dtype="float32"
)

d_in_keys = cp.asarray(h_in_keys)
d_in_items = cp.asarray(h_in_items)

# Run merge_sort with automatic temp storage allocation
parallel.merge_sort(
    d_in_keys, d_in_items, d_in_keys, d_in_items, compare_op, d_in_keys.size
)

# Check the result is correct
h_out_keys = cp.asnumpy(d_in_keys)
h_out_items = cp.asnumpy(d_in_items)

argsort = np.argsort(h_in_keys, stable=True)
h_in_keys = np.array(h_in_keys)[argsort]
h_in_items = np.array(h_in_items)[argsort]

np.testing.assert_array_equal(h_out_keys, h_in_keys)
np.testing.assert_array_equal(h_out_items, h_in_items)
```
- Parameters
d_in_keys – Device array or iterator containing the input keys to be sorted
d_in_items – Optional device array or iterator that contains each key’s corresponding item
d_out_keys – Device array to store the sorted keys
d_out_items – Optional device array to store the sorted items
op – Callable representing the comparison operator
- Returns
A callable object that can be used to perform the merge sort
- cuda.cccl.parallel.experimental.algorithms.make_radix_sort(d_in_keys: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.algorithms._radix_sort.DoubleBuffer, d_out_keys: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | None, d_in_values: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.algorithms._radix_sort.DoubleBuffer | None, d_out_values: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | None, order: cuda.cccl.parallel.experimental.algorithms._radix_sort.SortOrder)
Implements a device-wide radix sort of `d_in_keys` in the requested order.
Example
Below, `radix_sort` is used to sort a sequence of keys. It also rearranges the values according to the keys' order.
```python
import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

h_in_keys = np.array([-5, 0, 2, -3, 2, 4, 0, -1, 2, 8], dtype="int32")
h_in_values = np.array(
    [-3.2, 2.2, 1.9, 4.0, -3.9, 2.7, 0, 8.3 - 1, 2.9, 5.4], dtype="float32"
)

d_in_keys = cp.asarray(h_in_keys)
d_in_values = cp.asarray(h_in_values)
d_out_keys = cp.empty_like(d_in_keys)
d_out_values = cp.empty_like(d_in_values)

# Call single-phase API directly with num_items parameter
parallel.radix_sort(
    d_in_keys,
    d_out_keys,
    d_in_values,
    d_out_values,
    parallel.SortOrder.ASCENDING,
    d_in_keys.size,
)

# Check the result is correct
h_out_keys = cp.asnumpy(d_out_keys)
h_out_values = cp.asnumpy(d_out_values)

argsort = np.argsort(h_in_keys, stable=True)
h_in_keys = np.array(h_in_keys)[argsort]
h_in_values = np.array(h_in_values)[argsort]

np.testing.assert_array_equal(h_out_keys, h_in_keys)
np.testing.assert_array_equal(h_out_values, h_in_values)
```
Instead of passing in arrays directly, we can use a `DoubleBuffer`, which requires less temporary storage but may overwrite the input arrays:
```python
import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

h_in_keys = np.array([-5, 0, 2, -3, 2, 4, 0, -1, 2, 8], dtype="int32")
h_in_values = np.array(
    [-3.2, 2.2, 1.9, 4.0, -3.9, 2.7, 0, 8.3 - 1, 2.9, 5.4], dtype="float32"
)

d_in_keys = cp.asarray(h_in_keys)
d_in_values = cp.asarray(h_in_values)
d_out_keys = cp.empty_like(d_in_keys)
d_out_values = cp.empty_like(d_in_values)

keys_double_buffer = parallel.DoubleBuffer(d_in_keys, d_out_keys)
values_double_buffer = parallel.DoubleBuffer(d_in_values, d_out_values)

# Call single-phase API directly with num_items parameter
parallel.radix_sort(
    keys_double_buffer,
    None,
    values_double_buffer,
    None,
    parallel.SortOrder.ASCENDING,
    d_in_keys.size,
)

# Check the result is correct
h_out_keys = cp.asnumpy(keys_double_buffer.current())
h_out_values = cp.asnumpy(values_double_buffer.current())

argsort = np.argsort(h_in_keys, stable=True)
h_in_keys = np.array(h_in_keys)[argsort]
h_in_values = np.array(h_in_values)[argsort]

np.testing.assert_array_equal(h_out_keys, h_in_keys)
np.testing.assert_array_equal(h_out_values, h_in_values)
```
- Parameters
d_in_keys – Device array or DoubleBuffer containing the input keys to be sorted
d_out_keys – Device array to store the sorted keys
d_in_values – Optional device array or DoubleBuffer containing the input values to be sorted
d_out_values – Device array to store the sorted values
order – Order in which to sort (SortOrder.ASCENDING or SortOrder.DESCENDING)
- Returns
A callable object that can be used to perform the radix sort
- cuda.cccl.parallel.experimental.algorithms.make_reduce_into(d_in: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_out: cuda.cccl.parallel.experimental.typing.DeviceArrayLike, op: Callable, h_init: numpy.ndarray)
Computes a device-wide reduction using the specified binary `op` and initial value `h_init`. A sketch of the two-phase usage appears after this entry's Returns block.
Example
Below, `reduce_into` is used to compute the minimum value of a sequence of integers.
```python
import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel


def min_op(a, b):
    return a if a < b else b


dtype = np.int32
h_init = np.array([42], dtype=dtype)
d_input = cp.array([8, 6, 7, 5, 3, 0, 9], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)

# Run reduction
parallel.reduce_into(d_input, d_output, min_op, len(d_input), h_init)

# Check the result is correct
expected_output = 0
assert (d_output == expected_output).all()
```
- Parameters
d_in – Device array or iterator containing the input sequence of data items
d_out – Device array (of size 1) that will store the result of the reduction
op – Callable representing the binary operator to apply
h_init – Numpy array storing the initial value of the reduction
- Returns
A callable object that can be used to perform the reduction
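The example above drives the single-phase `reduce_into()`. Below is a minimal sketch of the two-phase pattern with the object returned by `make_reduce_into()`; it assumes the returned object takes a temporary-storage buffer as its first argument and, when that argument is `None`, returns the required size in bytes:
```python
import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel


def min_op(a, b):
    return a if a < b else b


dtype = np.int32
h_init = np.array([42], dtype=dtype)
d_input = cp.array([8, 6, 7, 5, 3, 0, 9], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)

# Phase 1: build the reducer and query the temporary storage size
reducer = parallel.make_reduce_into(d_input, d_output, min_op, h_init)
temp_storage_bytes = reducer(None, d_input, d_output, len(d_input), h_init)

# Phase 2: allocate scratch space and run the reduction
d_temp_storage = cp.empty(temp_storage_bytes, dtype=np.uint8)
reducer(d_temp_storage, d_input, d_output, len(d_input), h_init)

assert (d_output == 0).all()
```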
- cuda.cccl.parallel.experimental.algorithms.make_segmented_reduce(d_in: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_out: cuda.cccl.parallel.experimental.typing.DeviceArrayLike, start_offsets_in: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, end_offsets_in: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, op: Callable, h_init: numpy.ndarray)
Computes a device-wide segmented reduction using the specified binary `op` and initial value `h_init`.
Example
Below, `segmented_reduce` is used to compute the minimum value within each segment of a sequence of integers.
```python
import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel


def min_op(a, b):
    return a if a < b else b


dtype = np.dtype(np.int32)
max_val = np.iinfo(dtype).max
h_init = np.asarray(max_val, dtype=dtype)

offsets = cp.array([0, 7, 11, 16], dtype=np.int64)
first_segment = (8, 6, 7, 5, 3, 0, 9)
second_segment = (-4, 3, 0, 1)
third_segment = (3, 1, 11, 25, 8)
d_input = cp.array(
    [*first_segment, *second_segment, *third_segment],
    dtype=dtype,
)

start_o = offsets[:-1]
end_o = offsets[1:]
n_segments = start_o.size
d_output = cp.empty(n_segments, dtype=dtype)

# Run segmented reduction with automatic temp storage allocation
parallel.segmented_reduce(
    d_input, d_output, start_o, end_o, min_op, h_init, n_segments
)

# Check the result is correct
expected_output = cp.asarray([0, -4, 1], dtype=d_output.dtype)
assert (d_output == expected_output).all()
```
- Parameters
d_in – Device array or iterator containing the input sequence of data items
d_out – Device array that will store the result of the reduction
start_offsets_in – Device array or iterator containing offsets to start of segments
end_offsets_in – Device array or iterator containing offsets to end of segments
op – Callable representing the binary operator to apply
h_init – Numpy array storing the initial value of the reduction
- Returns
A callable object that can be used to perform the reduction
- cuda.cccl.parallel.experimental.algorithms.make_unary_transform(d_in: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_out: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, op: Callable)
Creates a unary transform object that can be called to apply a transformation to each element of the input according to the unary operation `op`. This is the object-oriented API that allows explicit control over temporary storage allocation; for simpler usage, consider using `unary_transform()`.
Example
```python
import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel


def op(a):
    return a + 1


d_in = cp.array([1.0, 2.0, 3.0], dtype="float32")
d_out = cp.empty_like(d_in)

# Apply op to each element of the input
parallel.unary_transform(d_in, d_out, op, len(d_in))

expected = d_in.get() + 1
np.testing.assert_allclose(expected, d_out.get(), rtol=1e-5)
```
- Parameters
d_in – Device array or iterator containing the input sequence of data items.
d_out – Device array or iterator to store the result of the transformation.
op – Unary operation to apply to each element of the input.
- Returns
A callable object that performs the transformation.
- cuda.cccl.parallel.experimental.algorithms.make_unique_by_key(d_in_keys: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_in_items: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_out_keys: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_out_items: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_out_num_selected: cuda.cccl.parallel.experimental.typing.DeviceArrayLike, op: Callable)
Implements a device-wide unique-by-key operation using `d_in_keys` and the equality operator `op`. Only the first key of each run of consecutive equal keys (together with its item) is selected, and the total number of selected items is reported.
Example
Below, `unique_by_key` is used to populate the arrays of output keys and items with the first key and its corresponding item from each run of equal keys. It also outputs the number of items selected.
```python
import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel


def compare_op(lhs, rhs):
    return np.uint8(lhs == rhs)


h_in_keys = np.array([0, 2, 2, 9, 5, 5, 5, 8], dtype="int32")
h_in_items = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype="float32")

d_in_keys = cp.asarray(h_in_keys)
d_in_items = cp.asarray(h_in_items)
d_out_keys = cp.empty_like(d_in_keys)
d_out_items = cp.empty_like(d_in_items)
d_out_num_selected = cp.empty(1, np.int32)

# Run unique_by_key with automatic temp storage allocation
parallel.unique_by_key(
    d_in_keys,
    d_in_items,
    d_out_keys,
    d_out_items,
    d_out_num_selected,
    compare_op,
    d_in_keys.size,
)

# Check the result is correct
num_selected = cp.asnumpy(d_out_num_selected)[0]
h_out_keys = cp.asnumpy(d_out_keys)[:num_selected]
h_out_items = cp.asnumpy(d_out_items)[:num_selected]

prev_key = h_in_keys[0]
expected_keys = [prev_key]
expected_items = [h_in_items[0]]
for idx, (previous, next) in enumerate(zip(h_in_keys, h_in_keys[1:])):
    if previous != next:
        expected_keys.append(next)
        # add 1 since we are enumerating over pairs
        expected_items.append(h_in_items[idx + 1])

np.testing.assert_array_equal(h_out_keys, np.array(expected_keys))
np.testing.assert_array_equal(h_out_items, np.array(expected_items))
```
- Parameters
d_in_keys – Device array or iterator containing the input sequence of keys
d_in_items – Device array or iterator that contains each key’s corresponding item
d_out_keys – Device array or iterator to store the outputted keys
d_out_items – Device array or iterator to store each outputted key’s item
d_out_num_selected – Device array to store how many items were selected
op – Callable representing the equality operator
- Returns
A callable object that can be used to perform unique by key
- cuda.cccl.parallel.experimental.algorithms.merge_sort(d_in_keys: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_in_items: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase | None, d_out_keys: cuda.cccl.parallel.experimental.typing.DeviceArrayLike, d_out_items: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | None, op: Callable, num_items: int, stream=None)
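Performs a device-wide merge sort using the single-phase API, which handles temporary storage allocation and execution in a single call. See the example under `make_merge_sort()`.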
- cuda.cccl.parallel.experimental.algorithms.radix_sort(d_in_keys: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.algorithms._radix_sort.DoubleBuffer, d_out_keys: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | None, d_in_values: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.algorithms._radix_sort.DoubleBuffer | None, d_out_values: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | None, order: cuda.cccl.parallel.experimental.algorithms._radix_sort.SortOrder, num_items: int, begin_bit: Optional[int] = None, end_bit: Optional[int] = None, stream=None)
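Performs a device-wide radix sort using the single-phase API, which handles temporary storage allocation and execution in a single call. The optional `begin_bit` and `end_bit` arguments restrict the comparison to a contiguous range of key bits. See the examples under `make_radix_sort()`.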
- cuda.cccl.parallel.experimental.algorithms.reduce_into(d_in: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_out: cuda.cccl.parallel.experimental.typing.DeviceArrayLike, op: Callable, num_items: int, h_init: Union[numpy.ndarray, Any], stream=None)
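Performs a device-wide reduction using the single-phase API, which handles temporary storage allocation and execution in a single call. See the example under `make_reduce_into()`.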
- cuda.cccl.parallel.experimental.algorithms.segmented_reduce(d_in: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_out: cuda.cccl.parallel.experimental.typing.DeviceArrayLike, start_offsets_in: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, end_offsets_in: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, op: Callable, h_init: Union[numpy.ndarray, Any], num_segments: int, stream=None)
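Performs a device-wide segmented reduction using the single-phase API, which handles temporary storage allocation and execution in a single call. See the example under `make_segmented_reduce()`.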
- cuda.cccl.parallel.experimental.algorithms.unary_transform(d_in: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_out: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, op: Callable, num_items: int, stream=None)
Applies a transformation to each element of the input according to the unary operation `op`. This is the single-phase API, which handles temporary storage allocation and execution in a single call.
Example
```python
import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel


def op(a):
    return a + 1


d_in = cp.array([1.0, 2.0, 3.0], dtype="float32")
d_out = cp.empty_like(d_in)

# Apply op to each element of the input
parallel.unary_transform(d_in, d_out, op, len(d_in))

expected = d_in.get() + 1
np.testing.assert_allclose(expected, d_out.get(), rtol=1e-5)
```
- Parameters
d_in – Device array or iterator containing the input sequence of data items.
d_out – Device array or iterator to store the result of the transformation.
op – Unary operation to apply to each element of the input.
num_items – Number of items to transform.
stream – CUDA stream to use for the operation.
- cuda.cccl.parallel.experimental.algorithms.unique_by_key(d_in_keys: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_in_items: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_out_keys: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_out_items: cuda.cccl.parallel.experimental.typing.DeviceArrayLike | cuda.cccl.parallel.experimental.iterators._iterators.IteratorBase, d_out_num_selected: cuda.cccl.parallel.experimental.typing.DeviceArrayLike, op: Callable, num_items: int, stream=None)
Performs a device-wide unique-by-key operation using the single-phase API.
This function automatically handles temporary storage allocation and execution.
- Parameters
d_in_keys – Device array or iterator containing the input sequence of keys
d_in_items – Device array or iterator that contains each key’s corresponding item
d_out_keys – Device array or iterator to store the outputted keys
d_out_items – Device array or iterator to store each outputted key’s item
d_out_num_selected – Device array to store how many items were selected
op – Callable representing the equality operator
num_items – Number of items to process
stream – CUDA stream for the operation (optional)
Iterators
- cuda.cccl.parallel.experimental.iterators.CacheModifiedInputIterator(device_array, modifier)
Random Access Cache Modified Iterator that wraps a native device pointer.
Similar to https://nvidia.github.io/cccl/cub/api/classcub_1_1CacheModifiedInputIterator.html
Currently the only supported modifier is “stream” (LOAD_CS).
Example
The code snippet below demonstrates the usage of a `CacheModifiedInputIterator`:
```python
import functools

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel


def add_op(a, b):
    return a + b


values = [8, 6, 7, 5, 3, 0, 9]
d_input = cp.array(values, dtype=np.int32)

iterator = parallel.CacheModifiedInputIterator(
    d_input, modifier="stream"
)  # Input sequence
h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int32)  # Storage for output

# Run reduction
parallel.reduce_into(iterator, d_output, add_op, len(values), h_init)

expected_output = functools.reduce(lambda a, b: a + b, values)
assert (d_output == expected_output).all()
```
- Parameters
device_array – CUDA device array storing the input sequence of data items
modifier – The PTX cache load modifier
- Returns
A `CacheModifiedInputIterator` object initialized with `device_array`
- cuda.cccl.parallel.experimental.iterators.ConstantIterator(value)
Returns an Iterator representing a sequence of constant values.
Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1constant__iterator.html
Example
The code snippet below demonstrates the usage of a `ConstantIterator` representing the sequence `[10, 10, 10]`:
```python
import functools

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel


def add_op(a, b):
    return a + b


value = 10
num_items = 3

constant_it = parallel.ConstantIterator(np.int32(value))  # Input sequence
h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int32)  # Storage for output

# Run reduction
parallel.reduce_into(constant_it, d_output, add_op, num_items, h_init)

expected_output = functools.reduce(lambda a, b: a + b, [value] * num_items)
assert (d_output == expected_output).all()
```
- Parameters
value – The value of every item in the sequence
- Returns
A `ConstantIterator` object initialized to `value`
- cuda.cccl.parallel.experimental.iterators.CountingIterator(offset)
Returns an Iterator representing a sequence of incrementing values.
Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1counting__iterator.html
Example
The code snippet below demonstrates the usage of a `CountingIterator` representing the sequence `[10, 11, 12]`:
```python
import functools

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel


def add_op(a, b):
    return a + b


first_item = 10
num_items = 3

first_it = parallel.CountingIterator(np.int32(first_item))  # Input sequence
h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int32)  # Storage for output

# Run reduction
parallel.reduce_into(first_it, d_output, add_op, num_items, h_init)

expected_output = functools.reduce(
    lambda a, b: a + b, range(first_item, first_item + num_items)
)
assert (d_output == expected_output).all()
```
- Parameters
offset – The initial value of the sequence
- Returns
A `CountingIterator` object initialized to `offset`
- cuda.cccl.parallel.experimental.iterators.ReverseInputIterator(sequence)
Returns an input Iterator over an array in reverse.
Similar to [std::reverse_iterator](https://en.cppreference.com/w/cpp/iterator/reverse_iterator)
Example
The code snippet below demonstrates the usage of a `ReverseInputIterator`:
```python
import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel


def add_op(a, b):
    return a + b


h_init = np.array([0], dtype="int32")
d_input = cp.array([-5, 0, 2, -3, 2, 4, 0, -1, 2, 8], dtype="int32")
d_output = cp.empty_like(d_input, dtype="int32")

reverse_it = parallel.ReverseInputIterator(d_input)

# Run scan with automatic temp storage allocation
parallel.inclusive_scan(reverse_it, d_output, add_op, h_init, len(d_input))

# Check the result is correct
expected = np.asarray([8, 10, 9, 9, 13, 15, 12, 14, 14, 9])
np.testing.assert_equal(d_output.get(), expected)
```
- Parameters
sequence – The iterator or CUDA device array to be reversed
- Returns
A `ReverseIterator` object initialized with `sequence` to use as an input
- cuda.cccl.parallel.experimental.iterators.ReverseOutputIterator(sequence)
Returns an output Iterator over an array in reverse.
Similar to [std::reverse_iterator](https://en.cppreference.com/w/cpp/iterator/reverse_iterator)
Example
The code snippet below demonstrates the usage of a `ReverseOutputIterator`:
```python
import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel


def add_op(a, b):
    return a + b


h_init = np.array([0], dtype="int32")
d_input = cp.array([-5, 0, 2, -3, 2, 4, 0, -1, 2, 8], dtype="int32")
d_output = cp.empty_like(d_input, dtype="int32")

reverse_it = parallel.ReverseOutputIterator(d_output)

# Run scan with automatic temp storage allocation
parallel.inclusive_scan(d_input, reverse_it, add_op, h_init, len(d_input))

# Check the result is correct
expected = np.asarray([9, 1, -1, 0, 0, -4, -6, -3, -5, -5])
np.testing.assert_equal(d_output.get(), expected)
```
- Parameters
sequence – The iterator or CUDA device array to be reversed to use as an output
- Returns
A `ReverseIterator` object initialized with `sequence` to use as an output
- cuda.cccl.parallel.experimental.iterators.TransformIterator(it, op)
Returns an Iterator representing a transformed sequence of values.
Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1transform__iterator.html
Example
The code snippet below demonstrates the usage of a `TransformIterator` composed with a `CountingIterator`, transforming the sequence `[10, 11, 12]` by squaring each item before reducing the output:
```python
import functools

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel


def add_op(a, b):
    return a + b


def square_op(a):
    return a**2


first_item = 10
num_items = 3

transform_it = parallel.TransformIterator(
    parallel.CountingIterator(np.int32(first_item)), square_op
)  # Input sequence
h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int32)  # Storage for output

# Run reduction
parallel.reduce_into(transform_it, d_output, add_op, num_items, h_init)

expected_output = functools.reduce(
    lambda a, b: a + b, [a**2 for a in range(first_item, first_item + num_items)]
)
assert (d_output == expected_output).all()
```
- Parameters
it – The iterator object to be transformed
op – The transform operation
- Returns
A `TransformIterator` object to transform the items in `it` using `op`
Utilities
- cuda.cccl.parallel.experimental.struct.gpu_struct(this: type) → Type[Any]
Defines the given class as being a GpuStruct.
A GpuStruct represents a value composed of one or more other values, and is defined as a class with annotated fields (similar to a dataclass). The type of each field must be a subclass of np.number, like np.int32 or np.float64.
Arrays of GpuStruct objects can be used as inputs to cuda.cccl.parallel algorithms.
Example
The code snippet below shows how to use gpu_struct to define a MinMax type (composed of min_val and max_val fields) and perform a reduction on an input array of floating-point values to compute its smallest and largest absolute values:
```python
import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel


@parallel.gpu_struct
class MinMax:
    min_val: np.float64
    max_val: np.float64


def minmax_op(v1: MinMax, v2: MinMax):
    c_min = min(v1.min_val, v2.min_val)
    c_max = max(v1.max_val, v2.max_val)
    return MinMax(c_min, c_max)


def transform_op(v):
    av = abs(v)
    return MinMax(av, av)


nelems = 4096
d_in = cp.random.randn(nelems)

# Input values are transformed to MinMax structures on the fly to map
# the computation onto a data-parallel reduction, which requires a
# commutative binary operation whose operands share a single type.
tr_it = parallel.TransformIterator(d_in, transform_op)

d_out = cp.empty(tuple(), dtype=MinMax.dtype)

# Initial value set to the identity elements of the
# minimum and maximum operators
h_init = MinMax(np.inf, -np.inf)

# Run the reduction algorithm
parallel.reduce_into(tr_it, d_out, minmax_op, nelems, h_init)

# Compare the values computed on the device against the host
actual = d_out.get()
h = np.abs(d_in.get())
expected = np.asarray([(h.min(), h.max())], dtype=MinMax.dtype)
assert actual == expected
```