cuda.cccl.parallel API Reference#

Warning

Python exposure of parallel algorithms is in public beta. The API is subject to change without notice.

Algorithms#

cuda.cccl.parallel.experimental.algorithms.make_binary_transform(d_in1, d_in2, d_out, op)#

Create a binary transform object that can be called to apply a transformation to the given pair of input sequences according to the binary operation op.

This is the object-oriented API that allows explicit control over temporary storage allocation. For simpler usage, consider using binary_transform().

Example

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

def op(a, b):
    return a + b

# Illustrative input data; any device arrays of matching length and type work here
d_in1 = cp.array([1.0, 2.0, 3.0, 4.0], dtype="float32")
d_in2 = cp.array([10.0, 20.0, 30.0, 40.0], dtype="float32")
d_out = cp.empty_like(d_in1)

# Apply the transformation (shown here with the single-phase binary_transform(),
# which handles temporary storage automatically)
parallel.binary_transform(d_in1, d_in2, d_out, op, d_in1.size)

got = d_out.get()
expected = d_in1.get() + d_in2.get()

np.testing.assert_allclose(expected, got, rtol=1e-5)
Parameters:
  • d_in1 (DeviceArrayLike | IteratorBase) – Device array or iterator containing the first input sequence of data items.

  • d_in2 (DeviceArrayLike | IteratorBase) – Device array or iterator containing the second input sequence of data items.

  • d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the transformation.

  • op (Callable | cuda.cccl.parallel.experimental._bindings.OpKind) – Callable or OpKind representing the binary operation to apply to each pair of items from the input sequences.

Returns:

A callable object that performs the transformation.

cuda.cccl.parallel.experimental.algorithms.make_exclusive_scan(d_in, d_out, op, h_init)#

Computes a device-wide exclusive scan using the specified binary op and initial value h_init.

Example

Below, exclusive_scan is used to compute an exclusive scan of a sequence of integers.


import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

def max_op(a, b):
    return max(a, b)

h_init = np.array([1], dtype="int32")
d_input = cp.array([-5, 0, 2, -3, 2, 4, 0, -1, 2, 8], dtype="int32")
d_output = cp.empty_like(d_input, dtype="int32")

# Run exclusive scan with automatic temp storage allocation
parallel.exclusive_scan(d_input, d_output, max_op, h_init, d_input.size)

# Check the result is correct
expected = np.asarray([1, 1, 1, 2, 2, 2, 4, 4, 4, 4])
np.testing.assert_equal(d_output.get(), expected)
Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items

  • d_out (DeviceArrayLike | IteratorBase) – Device array that will store the result of the scan

  • op (Callable | cuda.cccl.parallel.experimental._bindings.OpKind) – Callable or OpKind representing the binary operator to apply

  • h_init (ndarray) – Numpy array storing the initial value of the scan

Returns:

A callable object that can be used to perform the scan

cuda.cccl.parallel.experimental.algorithms.make_histogram_even(
d_samples,
d_histogram,
h_num_output_levels,
h_lower_level,
h_upper_level,
num_samples,
)#

Implements a device-wide histogram that places d_samples into evenly-spaced bins.

Example

Below, histogram_even is used to bin a sequence of samples into evenly-spaced bins.

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

num_samples = 10
h_samples = np.array(
    [2.2, 6.1, 7.1, 2.9, 3.5, 0.3, 2.9, 2.1, 6.1, 999.5], dtype="float32"
)
d_samples = cp.asarray(h_samples)
num_levels = 7
d_histogram = cp.empty(num_levels - 1, dtype="int32")
lower_level = np.float64(0)
upper_level = np.float64(12)

# Run histogram with automatic temp storage allocation
parallel.histogram_even(
    d_samples,
    d_histogram,
    num_levels,
    lower_level,
    upper_level,
    num_samples,
)

# Check the result is correct
h_actual_histogram = cp.asnumpy(d_histogram)
# Calculate expected histogram using numpy
h_expected_histogram, _ = np.histogram(
    h_samples, bins=num_levels - 1, range=(lower_level, upper_level)
)
h_expected_histogram = h_expected_histogram.astype("int32")

np.testing.assert_array_equal(h_actual_histogram, h_expected_histogram)
Parameters:
  • d_samples (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input samples to be histogrammed

  • d_histogram (DeviceArrayLike) – Device array to store the histogram

  • h_num_output_levels (ndarray) – Host array containing the number of output levels

  • h_lower_level (ndarray) – Host array containing the lower level

  • h_upper_level (ndarray) – Host array containing the upper level

  • num_samples (int) – Number of samples to be histogrammed

Returns:

A callable object that can be used to perform the histogram

cuda.cccl.parallel.experimental.algorithms.make_inclusive_scan(d_in, d_out, op, h_init)#

Computes a device-wide inclusive scan using the specified binary op and initial value h_init.

Example

Below, inclusive_scan is used to compute an inclusive scan of a sequence of integers.


import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

def add_op(a, b):
    return a + b

h_init = np.array([0], dtype="int32")
d_input = cp.array([-5, 0, 2, -3, 2, 4, 0, -1, 2, 8], dtype="int32")
d_output = cp.empty_like(d_input, dtype="int32")

# Run inclusive scan with automatic temp storage allocation
parallel.inclusive_scan(d_input, d_output, add_op, h_init, d_input.size)

# Check the result is correct
expected = np.asarray([-5, -5, -3, -6, -4, 0, 0, -1, 1, 9])
np.testing.assert_equal(d_output.get(), expected)
Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items

  • d_out (DeviceArrayLike | IteratorBase) – Device array that will store the result of the scan

  • op (Callable | cuda.cccl.parallel.experimental._bindings.OpKind) – Callable or OpKind representing the binary operator to apply

  • h_init (ndarray) – Numpy array storing the initial value of the scan

Returns:

A callable object that can be used to perform the scan

cuda.cccl.parallel.experimental.algorithms.make_merge_sort(d_in_keys, d_in_items, d_out_keys, d_out_items, op)#

Implements a device-wide merge sort using d_in_keys and the comparison operator op.

Example

Below, merge_sort is used to sort a sequence of keys in place. It also rearranges the items according to the keys’ order.

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

def compare_op(lhs, rhs):
    return np.uint8(lhs < rhs)

h_in_keys = np.array([-5, 0, 2, -3, 2, 4, 0, -1, 2, 8], dtype="int32")
h_in_items = np.array(
    [-3.2, 2.2, 1.9, 4.0, -3.9, 2.7, 0, 8.3 - 1, 2.9, 5.4], dtype="float32"
)

d_in_keys = cp.asarray(h_in_keys)
d_in_items = cp.asarray(h_in_items)

# Run merge_sort with automatic temp storage allocation
parallel.merge_sort(
    d_in_keys, d_in_items, d_in_keys, d_in_items, compare_op, d_in_keys.size
)

# Check the result is correct
h_out_keys = cp.asnumpy(d_in_keys)
h_out_items = cp.asnumpy(d_in_items)

argsort = np.argsort(h_in_keys, stable=True)
h_in_keys = np.array(h_in_keys)[argsort]
h_in_items = np.array(h_in_items)[argsort]

np.testing.assert_array_equal(h_out_keys, h_in_keys)
np.testing.assert_array_equal(h_out_items, h_in_items)
Parameters:
  • d_in_keys (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input keys to be sorted

  • d_in_items (DeviceArrayLike | IteratorBase | None) – Optional device array or iterator that contains each key’s corresponding item

  • d_out_keys (DeviceArrayLike) – Device array to store the sorted keys

  • d_out_items (DeviceArrayLike | None) – Device array to store the sorted items

  • op (Callable | cuda.cccl.parallel.experimental._bindings.OpKind) – Callable or OpKind representing the comparison operator

Returns:

A callable object that can be used to perform the merge sort

cuda.cccl.parallel.experimental.algorithms.make_radix_sort(
d_in_keys,
d_out_keys,
d_in_values,
d_out_values,
order,
)#

Implements a device-wide radix sort using d_in_keys in the requested order.

Example

Below, radix_sort is used to sort a sequence of keys. It also rearranges the values according to the keys’ order.

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

h_in_keys = np.array([-5, 0, 2, -3, 2, 4, 0, -1, 2, 8], dtype="int32")
h_in_values = np.array(
    [-3.2, 2.2, 1.9, 4.0, -3.9, 2.7, 0, 8.3 - 1, 2.9, 5.4], dtype="float32"
)

d_in_keys = cp.asarray(h_in_keys)
d_in_values = cp.asarray(h_in_values)

d_out_keys = cp.empty_like(d_in_keys)
d_out_values = cp.empty_like(d_in_values)

# Call single-phase API directly with num_items parameter
parallel.radix_sort(
    d_in_keys,
    d_out_keys,
    d_in_values,
    d_out_values,
    parallel.SortOrder.ASCENDING,
    d_in_keys.size,
)

# Check the result is correct
h_out_keys = cp.asnumpy(d_out_keys)
h_out_items = cp.asnumpy(d_out_values)

argsort = np.argsort(h_in_keys, stable=True)
h_in_keys = np.array(h_in_keys)[argsort]
h_in_values = np.array(h_in_values)[argsort]

np.testing.assert_array_equal(h_out_keys, h_in_keys)
np.testing.assert_array_equal(h_out_items, h_in_values)

Instead of passing in arrays directly, we can use a DoubleBuffer, which requires less temporary storage but may overwrite the input arrays:

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

h_in_keys = np.array([-5, 0, 2, -3, 2, 4, 0, -1, 2, 8], dtype="int32")
h_in_values = np.array(
    [-3.2, 2.2, 1.9, 4.0, -3.9, 2.7, 0, 8.3 - 1, 2.9, 5.4], dtype="float32"
)

d_in_keys = cp.asarray(h_in_keys)
d_in_values = cp.asarray(h_in_values)

d_out_keys = cp.empty_like(d_in_keys)
d_out_values = cp.empty_like(d_in_values)

keys_double_buffer = parallel.DoubleBuffer(d_in_keys, d_out_keys)
values_double_buffer = parallel.DoubleBuffer(d_in_values, d_out_values)

# Call single-phase API directly with num_items parameter
parallel.radix_sort(
    keys_double_buffer,
    None,
    values_double_buffer,
    None,
    parallel.SortOrder.ASCENDING,
    d_in_keys.size,
)

# Check the result is correct
h_out_keys = cp.asnumpy(keys_double_buffer.current())
h_out_values = cp.asnumpy(values_double_buffer.current())

argsort = np.argsort(h_in_keys, stable=True)
h_in_keys = np.array(h_in_keys)[argsort]
h_in_values = np.array(h_in_values)[argsort]

np.testing.assert_array_equal(h_out_keys, h_in_keys)
np.testing.assert_array_equal(h_out_values, h_in_values)
Parameters:
  • d_in_keys (DeviceArrayLike | DoubleBuffer) – Device array or DoubleBuffer containing the input keys to be sorted

  • d_out_keys (DeviceArrayLike | None) – Device array to store the sorted keys

  • d_in_values (DeviceArrayLike | DoubleBuffer | None) – Optional device array or DoubleBuffer containing the input values to be sorted

  • d_out_values (DeviceArrayLike | None) – Device array to store the sorted values

  • order (SortOrder) – Sort order (ascending or descending)

Returns:

A callable object that can be used to perform the radix sort

cuda.cccl.parallel.experimental.algorithms.make_reduce_into(d_in, d_out, op, h_init)#

Computes a device-wide reduction using the specified binary op and initial value h_init.

Example

Below, reduce_into is used to compute the minimum value of a sequence of integers.

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

def min_op(a, b):
    return a if a < b else b

dtype = np.int32
h_init = np.array([42], dtype=dtype)
d_input = cp.array([8, 6, 7, 5, 3, 0, 9], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)

# Run reduction
parallel.reduce_into(d_input, d_output, min_op, len(d_input), h_init)

# Check the result is correct
expected_output = 0
assert (d_output == expected_output).all()
Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items

  • d_out (DeviceArrayLike) – Device array (of size 1) that will store the result of the reduction

  • op (Callable | cuda.cccl.parallel.experimental._bindings.OpKind) – Callable or OpKind representing the binary operator to apply

  • h_init (ndarray) – Numpy array storing the initial value of the reduction

Returns:

A callable object that can be used to perform the reduction

cuda.cccl.parallel.experimental.algorithms.make_segmented_reduce(
d_in,
d_out,
start_offsets_in,
end_offsets_in,
op,
h_init,
)#

Computes a device-wide segmented reduction using the specified binary op and initial value h_init.

Example

Below, segmented_reduce is used to compute the minimum value within each segment of a sequence of integers.


import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

def min_op(a, b):
    return a if a < b else b

dtype = np.dtype(np.int32)
max_val = np.iinfo(dtype).max
h_init = np.asarray(max_val, dtype=dtype)

offsets = cp.array([0, 7, 11, 16], dtype=np.int64)
first_segment = (8, 6, 7, 5, 3, 0, 9)
second_segment = (-4, 3, 0, 1)
third_segment = (3, 1, 11, 25, 8)
d_input = cp.array(
    [*first_segment, *second_segment, *third_segment],
    dtype=dtype,
)

start_o = offsets[:-1]
end_o = offsets[1:]

n_segments = start_o.size
d_output = cp.empty(n_segments, dtype=dtype)

# Run segmented reduction with automatic temp storage allocation
parallel.segmented_reduce(
    d_input, d_output, start_o, end_o, min_op, h_init, n_segments
)

# Check the result is correct
expected_output = cp.asarray([0, -4, 1], dtype=d_output.dtype)
assert (d_output == expected_output).all()
Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items

  • d_out (DeviceArrayLike) – Device array that will store the result of the reduction

  • start_offsets_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing offsets to start of segments

  • end_offsets_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing offsets to end of segments

  • op (Callable | cuda.cccl.parallel.experimental._bindings.OpKind) – Callable or OpKind representing the binary operator to apply

  • h_init (ndarray) – Numpy array storing the initial value of the reduction

Returns:

A callable object that can be used to perform the reduction

cuda.cccl.parallel.experimental.algorithms.make_unique_by_key(
d_in_keys,
d_in_items,
d_out_keys,
d_out_items,
d_out_num_selected,
op,
)#

Implements a device-wide unique-by-key operation using d_in_keys and the equality operator op. Only the first key and its corresponding item from each run of equal keys are selected, and the total number of items selected is also reported.

Example

Below, unique_by_key is used to populate the arrays of output keys and items with the first key and its corresponding item from each sequence of equal keys. It also outputs the number of items selected.

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

def compare_op(lhs, rhs):
    return np.uint8(lhs == rhs)

h_in_keys = np.array([0, 2, 2, 9, 5, 5, 5, 8], dtype="int32")
h_in_items = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype="float32")

d_in_keys = cp.asarray(h_in_keys)
d_in_items = cp.asarray(h_in_items)
d_out_keys = cp.empty_like(d_in_keys)
d_out_items = cp.empty_like(d_in_items)
d_out_num_selected = cp.empty(1, np.int32)

# Run unique_by_key with automatic temp storage allocation
parallel.unique_by_key(
    d_in_keys,
    d_in_items,
    d_out_keys,
    d_out_items,
    d_out_num_selected,
    compare_op,
    d_in_keys.size,
)

# Check the result is correct
num_selected = cp.asnumpy(d_out_num_selected)[0]
h_out_keys = cp.asnumpy(d_out_keys)[:num_selected]
h_out_items = cp.asnumpy(d_out_items)[:num_selected]

prev_key = h_in_keys[0]
expected_keys = [prev_key]
expected_items = [h_in_items[0]]

for idx, (previous, next) in enumerate(zip(h_in_keys, h_in_keys[1:])):
    if previous != next:
        expected_keys.append(next)

        # add 1 since we are enumerating over pairs
        expected_items.append(h_in_items[idx + 1])

np.testing.assert_array_equal(h_out_keys, np.array(expected_keys))
np.testing.assert_array_equal(h_out_items, np.array(expected_items))
Parameters:
  • d_in_keys (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of keys

  • d_in_items (DeviceArrayLike | IteratorBase) – Device array or iterator that contains each key’s corresponding item

  • d_out_keys (DeviceArrayLike | IteratorBase) – Device array or iterator to store the outputted keys

  • d_out_items (DeviceArrayLike | IteratorBase) – Device array or iterator to store each outputted key’s item

  • d_out_num_selected (DeviceArrayLike) – Device array to store how many items were selected

  • op (Callable | cuda.cccl.parallel.experimental._bindings.OpKind) – Callable or OpKind representing the equality operator

Returns:

A callable object that can be used to perform unique by key

cuda.cccl.parallel.experimental.algorithms.make_unary_transform(d_in, d_out, op)#

Create a unary transform object that can be called to apply a transformation to each element of the input according to the unary operation op.

This is the object-oriented API that allows explicit control over temporary storage allocation. For simpler usage, consider using unary_transform().

Example

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

def op(a):
    return a + 1

# Illustrative input data; any device array works here
d_in = cp.array([1.0, 2.0, 3.0, 4.0], dtype="float32")
d_out = cp.empty_like(d_in)

# Apply the transformation (shown here with the single-phase unary_transform(),
# which handles temporary storage automatically)
parallel.unary_transform(d_in, d_out, op, d_in.size)

got = d_out.get()
expected = d_in.get() + 1

np.testing.assert_allclose(expected, got, rtol=1e-5)
Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items.

  • d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the transformation.

  • op (Callable | cuda.cccl.parallel.experimental._bindings.OpKind) – Callable or OpKind representing the unary operation to apply to each element of the input.

Returns:

A callable object that performs the transformation.

cuda.cccl.parallel.experimental.algorithms.histogram_even(
d_samples,
d_histogram,
num_output_levels,
lower_level,
upper_level,
num_samples,
stream=None,
)#

Performs device-wide histogram computation with evenly-spaced bins.

This function automatically handles temporary storage allocation and execution.

Parameters:
  • d_samples (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data samples

  • d_histogram (DeviceArrayLike) – Device array to store the computed histogram

  • num_output_levels (int) – Number of histogram bin levels (num_bins = num_output_levels - 1)

  • lower_level (floating | integer) – Lower sample value bound (inclusive)

  • upper_level (floating | integer) – Upper sample value bound (exclusive)

  • num_samples (int) – Number of input samples

  • stream – CUDA stream for the operation (optional)

cuda.cccl.parallel.experimental.algorithms.merge_sort(
d_in_keys,
d_in_items,
d_out_keys,
d_out_items,
op,
num_items,
stream=None,
)#

Performs device-wide merge sort.

This function automatically handles temporary storage allocation and execution.

Parameters:
  • d_in_keys (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of keys

  • d_in_items (DeviceArrayLike | IteratorBase | None) – Device array or iterator containing the input sequence of items (optional)

  • d_out_keys (DeviceArrayLike) – Device array to store the sorted keys

  • d_out_items (DeviceArrayLike | None) – Device array to store the sorted items (optional)

  • op (Callable | cuda.cccl.parallel.experimental._bindings.OpKind) – Comparison operator for sorting

  • num_items (int) – Number of items to sort

  • stream – CUDA stream for the operation (optional)

cuda.cccl.parallel.experimental.algorithms.reduce_into(d_in, d_out, op, num_items, h_init, stream=None)#

Performs device-wide reduction.

This function automatically handles temporary storage allocation and execution.

Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items

  • d_out (DeviceArrayLike) – Device array to store the result of the reduction

  • op (Callable | cuda.cccl.parallel.experimental._bindings.OpKind) – Binary reduction operator

  • num_items (int) – Number of items to reduce

  • h_init (ndarray | Any) – Initial value for the reduction

  • stream – CUDA stream for the operation (optional)

cuda.cccl.parallel.experimental.algorithms.exclusive_scan(d_in, d_out, op, h_init, num_items, stream=None)#

Performs device-wide exclusive scan.

This function automatically handles temporary storage allocation and execution.

Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items

  • d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the scan

  • op (Callable | cuda.cccl.parallel.experimental._bindings.OpKind) – Binary scan operator

  • h_init (ndarray | Any) – Initial value for the scan

  • num_items (int) – Number of items to scan

  • stream – CUDA stream for the operation (optional)

cuda.cccl.parallel.experimental.algorithms.inclusive_scan(d_in, d_out, op, h_init, num_items, stream=None)#

Performs device-wide inclusive scan.

This function automatically handles temporary storage allocation and execution.

Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items

  • d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the scan

  • op (Callable | cuda.cccl.parallel.experimental._bindings.OpKind) – Binary scan operator

  • h_init (ndarray | Any) – Initial value for the scan

  • num_items (int) – Number of items to scan

  • stream – CUDA stream for the operation (optional)

cuda.cccl.parallel.experimental.algorithms.segmented_reduce(
d_in,
d_out,
start_offsets_in,
end_offsets_in,
op,
h_init,
num_segments,
stream=None,
)#

Performs device-wide segmented reduction.

This function automatically handles temporary storage allocation and execution.

Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items

  • d_out (DeviceArrayLike) – Device array to store the result of the reduction for each segment

  • start_offsets_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the sequence of beginning offsets

  • end_offsets_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the sequence of ending offsets

  • op (Callable | cuda.cccl.parallel.experimental._bindings.OpKind) – Binary reduction operator

  • h_init (ndarray | Any) – Initial value for the reduction

  • num_segments (int) – Number of segments to reduce

  • stream – CUDA stream for the operation (optional)

cuda.cccl.parallel.experimental.algorithms.unique_by_key(
d_in_keys,
d_in_items,
d_out_keys,
d_out_items,
d_out_num_selected,
op,
num_items,
stream=None,
)#

Performs device-wide unique by key operation using the single-phase API.

This function automatically handles temporary storage allocation and execution.

Parameters:
  • d_in_keys (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of keys

  • d_in_items (DeviceArrayLike | IteratorBase) – Device array or iterator that contains each key’s corresponding item

  • d_out_keys (DeviceArrayLike | IteratorBase) – Device array or iterator to store the outputted keys

  • d_out_items (DeviceArrayLike | IteratorBase) – Device array or iterator to store each outputted key’s item

  • d_out_num_selected (DeviceArrayLike) – Device array to store how many items were selected

  • op (Callable | cuda.cccl.parallel.experimental._bindings.OpKind) – Callable or OpKind representing the equality operator

  • num_items (int) – Number of items to process

  • stream – CUDA stream for the operation (optional)

cuda.cccl.parallel.experimental.algorithms.radix_sort(
d_in_keys,
d_out_keys,
d_in_values,
d_out_values,
order,
num_items,
begin_bit=None,
end_bit=None,
stream=None,
)#

Performs device-wide radix sort.

This function automatically handles temporary storage allocation and execution.
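
Example

The begin_bit and end_bit parameters restrict the sort to a subrange of key bits. The sketch below is a minimal, hedged illustration assuming CUB DeviceRadixSort semantics for partial-bit sorting; the input keys and the chosen bit range are illustrative, not taken from the library's documentation.

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

# Hypothetical data: sort 32-bit keys considering only their low 16 bits
h_in_keys = np.array([0x00020003, 0x00010001, 0x00030002], dtype=np.uint32)
d_in_keys = cp.asarray(h_in_keys)
d_out_keys = cp.empty_like(d_in_keys)

# Keys-only sort: pass None for the value arrays
parallel.radix_sort(
    d_in_keys,
    d_out_keys,
    None,
    None,
    parallel.SortOrder.ASCENDING,
    d_in_keys.size,
    begin_bit=0,
    end_bit=16,
)

# Keys are ordered by bits [0, 16): the low half-words are 1, 2, 3
expected = np.array([0x00010001, 0x00030002, 0x00020003], dtype=np.uint32)
np.testing.assert_array_equal(cp.asnumpy(d_out_keys), expected)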

Parameters:
  • d_in_keys (DeviceArrayLike | DoubleBuffer) – Device array or DoubleBuffer containing the input sequence of keys

  • d_out_keys (DeviceArrayLike | None) – Device array to store the sorted keys (optional)

  • d_in_values (DeviceArrayLike | DoubleBuffer | None) – Device array or DoubleBuffer containing the input sequence of values (optional)

  • d_out_values (DeviceArrayLike | None) – Device array to store the sorted values (optional)

  • order (SortOrder) – Sort order (ascending or descending)

  • num_items (int) – Number of items to sort

  • begin_bit (int | None) – Beginning bit position for comparison (optional)

  • end_bit (int | None) – Ending bit position for comparison (optional)

  • stream – CUDA stream for the operation (optional)

class cuda.cccl.parallel.experimental.algorithms.DoubleBuffer(d_current, d_alternate)#
Parameters:
  • d_current (DeviceArrayLike)

  • d_alternate (DeviceArrayLike)

__init__(d_current, d_alternate)#
Parameters:
  • d_current (DeviceArrayLike)

  • d_alternate (DeviceArrayLike)

current()#

Returns the buffer currently holding the valid data (after a sort, the sorted results).

alternate()#

Returns the other buffer of the pair, used as alternate storage.

class cuda.cccl.parallel.experimental.algorithms.SortOrder#

Enumeration specifying the sort order (ascending or descending) for radix_sort.

ASCENDING = 0#
DESCENDING = 1#
cuda.cccl.parallel.experimental.algorithms.binary_transform(d_in1, d_in2, d_out, op, num_items, stream=None)#

Applies a binary transformation to each pair of items from the two input sequences according to the binary operation op.

This is the single-phase API: it automatically handles temporary storage allocation and execution. For explicit control over temporary storage, use make_binary_transform().

Example

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

def op(a, b):
    return a + b

# Illustrative input data; any device arrays of matching length and type work here
d_in1 = cp.array([1.0, 2.0, 3.0, 4.0], dtype="float32")
d_in2 = cp.array([10.0, 20.0, 30.0, 40.0], dtype="float32")
d_out = cp.empty_like(d_in1)

# Apply the transformation with automatic temp storage allocation
parallel.binary_transform(d_in1, d_in2, d_out, op, d_in1.size)

got = d_out.get()
expected = d_in1.get() + d_in2.get()

np.testing.assert_allclose(expected, got, rtol=1e-5)
Parameters:
  • d_in1 (DeviceArrayLike | IteratorBase) – Device array or iterator containing the first input sequence of data items.

  • d_in2 (DeviceArrayLike | IteratorBase) – Device array or iterator containing the second input sequence of data items.

  • d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the transformation.

  • op (Callable | cuda.cccl.parallel.experimental._bindings.OpKind) – Callable or OpKind representing the binary operation to apply to each pair of items from the input sequences.

  • num_items (int) – Number of items to transform.

  • stream – CUDA stream to use for the operation.

cuda.cccl.parallel.experimental.algorithms.unary_transform(d_in, d_out, op, num_items, stream=None)#

Applies a unary transformation to each element of the input sequence according to the unary operation op.

This is the single-phase API: it automatically handles temporary storage allocation and execution. For explicit control over temporary storage, use make_unary_transform().

Example

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

def op(a):
    return a + 1

# Illustrative input data; any device array works here
d_in = cp.array([1.0, 2.0, 3.0, 4.0], dtype="float32")
d_out = cp.empty_like(d_in)

# Apply the transformation with automatic temp storage allocation
parallel.unary_transform(d_in, d_out, op, d_in.size)

got = d_out.get()
expected = d_in.get() + 1

np.testing.assert_allclose(expected, got, rtol=1e-5)
Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items.

  • d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the transformation.

  • op (Callable | cuda.cccl.parallel.experimental._bindings.OpKind) – Callable or OpKind representing the unary operation to apply to each element of the input.

  • num_items (int) – Number of items to transform.

  • stream – CUDA stream to use for the operation.

Iterators#

cuda.cccl.parallel.experimental.iterators.CacheModifiedInputIterator(device_array, modifier)#

Random Access Cache Modified Iterator that wraps a native device pointer.

Similar to https://nvidia.github.io/cccl/cub/api/classcub_1_1CacheModifiedInputIterator.html

Currently the only supported modifier is “stream” (LOAD_CS).

Example

The code snippet below demonstrates the usage of a CacheModifiedInputIterator:

import functools

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

def add_op(a, b):
    return a + b

values = [8, 6, 7, 5, 3, 0, 9]
d_input = cp.array(values, dtype=np.int32)

iterator = parallel.CacheModifiedInputIterator(
    d_input, modifier="stream"
)  # Input sequence
h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int32)  # Storage for output

# Run reduction
parallel.reduce_into(iterator, d_output, add_op, len(values), h_init)

expected_output = functools.reduce(lambda a, b: a + b, values)
assert (d_output == expected_output).all()
Parameters:
  • device_array – CUDA device array storing the input sequence of data items

  • modifier – The PTX cache load modifier

  • prefix – An optional prefix added to the iterator’s methods to prevent name collisions.

Returns:

A CacheModifiedInputIterator object initialized with device_array

cuda.cccl.parallel.experimental.iterators.ConstantIterator(value)#

Returns an Iterator representing a sequence of constant values.

Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1constant__iterator.html

Example

The code snippet below demonstrates the usage of a ConstantIterator representing the sequence [10, 10, 10]:

import functools

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

def add_op(a, b):
    return a + b

value = 10
num_items = 3

constant_it = parallel.ConstantIterator(np.int32(value))  # Input sequence
h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int32)  # Storage for output

# Run reduction
parallel.reduce_into(constant_it, d_output, add_op, num_items, h_init)

expected_output = functools.reduce(lambda a, b: a + b, [value] * num_items)
assert (d_output == expected_output).all()
Parameters:

value – The value of every item in the sequence

Returns:

A ConstantIterator object initialized to value

cuda.cccl.parallel.experimental.iterators.CountingIterator(offset)#

Returns an Iterator representing a sequence of incrementing values.

Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1counting__iterator.html

Example

The code snippet below demonstrates the usage of a CountingIterator representing the sequence [10, 11, 12]:

import functools

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

def add_op(a, b):
    return a + b

first_item = 10
num_items = 3

first_it = parallel.CountingIterator(np.int32(first_item))  # Input sequence
h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int32)  # Storage for output

# Run reduction
parallel.reduce_into(first_it, d_output, add_op, num_items, h_init)

expected_output = functools.reduce(
    lambda a, b: a + b, range(first_item, first_item + num_items)
)
assert (d_output == expected_output).all()
Parameters:

offset – The initial value of the sequence

Returns:

A CountingIterator object initialized to offset

cuda.cccl.parallel.experimental.iterators.ReverseInputIterator(sequence)#

Returns an input Iterator over an array in reverse.

Similar to std::reverse_iterator: https://en.cppreference.com/w/cpp/iterator/reverse_iterator

Example

The code snippet below demonstrates the usage of a ReverseInputIterator:


import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

def add_op(a, b):
    return a + b

h_init = np.array([0], dtype="int32")
d_input = cp.array([-5, 0, 2, -3, 2, 4, 0, -1, 2, 8], dtype="int32")
d_output = cp.empty_like(d_input, dtype="int32")
reverse_it = parallel.ReverseInputIterator(d_input)

# Run scan with automatic temp storage allocation
parallel.inclusive_scan(reverse_it, d_output, add_op, h_init, len(d_input))

# Check the result is correct
expected = np.asarray([8, 10, 9, 9, 13, 15, 12, 14, 14, 9])
np.testing.assert_equal(d_output.get(), expected)
Parameters:

sequence – The iterator or CUDA device array to be reversed

Returns:

A ReverseIterator object initialized with sequence to use as an input

cuda.cccl.parallel.experimental.iterators.ReverseOutputIterator(sequence)#

Returns an output Iterator over an array in reverse.

Similar to std::reverse_iterator: https://en.cppreference.com/w/cpp/iterator/reverse_iterator

Example

The code snippet below demonstrates the usage of a ReverseOutputIterator:


import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

def add_op(a, b):
    return a + b

h_init = np.array([0], dtype="int32")
d_input = cp.array([-5, 0, 2, -3, 2, 4, 0, -1, 2, 8], dtype="int32")
d_output = cp.empty_like(d_input, dtype="int32")
reverse_it = parallel.ReverseOutputIterator(d_output)

# Run scan with automatic temp storage allocation
parallel.inclusive_scan(d_input, reverse_it, add_op, h_init, len(d_input))

# Check the result is correct
expected = np.asarray([9, 1, -1, 0, 0, -4, -6, -3, -5, -5])
np.testing.assert_equal(d_output.get(), expected)
Parameters:

sequence – The iterator or CUDA device array to be reversed to use as an output

Returns:

A ReverseIterator object initialized with sequence to use as an output

cuda.cccl.parallel.experimental.iterators.TransformIterator(it, op)#

Returns an Iterator representing a transformed sequence of values.

Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1transform__iterator.html

Example

The code snippet below demonstrates the usage of a TransformIterator composed with a CountingIterator, transforming the sequence [10, 11, 12] by squaring each item before reducing the output:

import functools

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

def add_op(a, b):
    return a + b

def square_op(a):
    return a**2

first_item = 10
num_items = 3

transform_it = parallel.TransformIterator(
    parallel.CountingIterator(np.int32(first_item)), square_op
)  # Input sequence
h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
d_output = cp.empty(1, dtype=np.int32)  # Storage for output

# Run reduction
parallel.reduce_into(transform_it, d_output, add_op, num_items, h_init)

expected_output = functools.reduce(
    lambda a, b: a + b, [a**2 for a in range(first_item, first_item + num_items)]
)
assert (d_output == expected_output).all()
Parameters:
  • it – The iterator object to be transformed

  • op – The transform operation

Returns:

A TransformIterator object to transform the items in it using op

cuda.cccl.parallel.experimental.iterators.ZipIterator(*iterators)#

Returns an Iterator representing a zipped sequence of values from N iterators.

Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1zip__iterator.html

The resulting iterator yields gpu_struct objects with fields corresponding to each input iterator. For 2 iterators, fields are named ‘first’ and ‘second’. For N iterators, fields are indexed as field_0, field_1, …, field_N-1.

Example

The code snippet below demonstrates the usage of a ZipIterator combining two device arrays:
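
(A minimal sketch; the arrays and the sum_fields operator are illustrative assumptions, not taken from the library's documentation. The zipped values are read through their documented 'first' and 'second' fields.)

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

def sum_fields(pair):
    # Each zipped value exposes the documented 'first' and 'second' fields
    return pair.first + pair.second

d_in1 = cp.array([1.0, 2.0, 3.0], dtype=np.float32)
d_in2 = cp.array([10.0, 20.0, 30.0], dtype=np.float32)
d_out = cp.empty(3, dtype=np.float32)

zip_it = parallel.ZipIterator(d_in1, d_in2)

# Transform the zipped sequence element-wise into the sum of the two fields
parallel.unary_transform(zip_it, d_out, sum_fields, d_in1.size)

np.testing.assert_allclose(d_out.get(), d_in1.get() + d_in2.get(), rtol=1e-5)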

Parameters:

*iterators – Variable number of iterators to zip (at least 1)

Returns:

A ZipIterator object that yields combined values from all input iterators

Utilities#

cuda.cccl.parallel.experimental.struct.gpu_struct_from_numba_types(name, field_names, field_types)#

Create a struct type from tuples of field names and numba types.

Parameters:
  • name (str) – The name of the struct class

  • field_names (tuple) – Tuple of field names

  • field_types (tuple) – Tuple of corresponding numba types

Returns:

A dynamically created struct class with the specified fields

Return type:

Type
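
A minimal sketch, assuming numba's standard scalar types (the Pair2D name and its fields are illustrative):

from numba import types

from cuda.cccl.parallel.experimental.struct import gpu_struct_from_numba_types

# Hypothetical struct with an int32 field and a float32 field
Pair2D = gpu_struct_from_numba_types(
    "Pair2D", ("x", "y"), (types.int32, types.float32)
)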

cuda.cccl.parallel.experimental.struct.gpu_struct(this)#

Decorate a class as a GPU struct.

A GpuStruct represents a value composed of one or more other values, and is defined as a class with annotated fields (similar to a dataclass). The type of each field must be a subclass of np.number, like np.int32 or np.float64.

Arrays of GpuStruct objects can be used as inputs to cuda.cccl.parallel algorithms.

Example

The code snippet below shows how to use gpu_struct to define a MinMax type (composed of min_val and max_val fields), and perform a reduction on an input array of floating-point values to compute its smallest and largest absolute values:

import cupy as cp
import numpy as np

import cuda.cccl.parallel.experimental as parallel

@parallel.gpu_struct
class MinMax:
    min_val: np.float64
    max_val: np.float64

def minmax_op(v1: MinMax, v2: MinMax):
    c_min = min(v1.min_val, v2.min_val)
    c_max = max(v1.max_val, v2.max_val)
    return MinMax(c_min, c_max)

def transform_op(v):
    av = abs(v)
    return MinMax(av, av)

nelems = 4096

d_in = cp.random.randn(nelems)
# Input values are transformed into MinMax structures on the fly
# (via the TransformIterator) to map the computation onto the
# data-parallel reduction, which requires a commutative binary
# operation with both operands of the same type.
tr_it = parallel.TransformIterator(d_in, transform_op)

d_out = cp.empty(tuple(), dtype=MinMax.dtype)

# initial value set with identity elements of
# minimum and maximum operators
h_init = MinMax(np.inf, -np.inf)

# run the reduction algorithm
parallel.reduce_into(tr_it, d_out, minmax_op, nelems, h_init)

# display values computed on the device
actual = d_out.get()

h = np.abs(d_in.get())
expected = np.asarray([(h.min(), h.max())], dtype=MinMax.dtype)

assert actual == expected
Parameters:

this (type)

Return type:

Type[Any]

cuda.cccl.parallel.experimental.struct.gpu_struct_from_numpy_dtype(name, np_dtype)#

Create a GPU struct from a numpy record dtype.
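
A minimal sketch, assuming the record dtype's field names and scalar types carry over to the generated struct (the MinMax2 name and dtype are illustrative):

import numpy as np

from cuda.cccl.parallel.experimental.struct import gpu_struct_from_numpy_dtype

# Hypothetical record dtype with two float64 fields
minmax_dtype = np.dtype([("min_val", np.float64), ("max_val", np.float64)])
MinMax2 = gpu_struct_from_numpy_dtype("MinMax2", minmax_dtype)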