cuda.compute API Reference#
Warning
cuda.compute is in public beta. The API is subject to change without notice.
Algorithms#
- cuda.compute.algorithms.reduce_into(d_in, d_out, op, num_items, h_init, stream=None)#
Performs device-wide reduction.
This function automatically handles temporary storage allocation and execution.
Example
Below, reduce_into is used to compute the sum of a sequence of integers; a sketch follows the parameter list.
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_out (DeviceArrayLike | IteratorBase) – Device array to store the result of the reduction
op (Callable | cuda.compute._bindings.OpKind) – Binary reduction operator
num_items (int) – Number of items to reduce
h_init (ndarray) – NumPy array storing the initial value of the reduction
stream – CUDA stream for the operation (optional)
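A minimal sketch of single-phase usage, assuming CuPy device arrays and a plain Python callable as the reduction operator (not the original documented example):

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import reduce_into

    def add(a, b):
        return a + b

    d_in = cp.array([1, 2, 3, 4, 5], dtype=np.int32)
    d_out = cp.empty(1, dtype=np.int32)      # holds the single reduction result
    h_init = np.array([0], dtype=np.int32)   # host-side initial value

    reduce_into(d_in, d_out, add, len(d_in), h_init)
    # d_out -> [15]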
- cuda.compute.algorithms.make_reduce_into(d_in, d_out, op, h_init)#
Computes a device-wide reduction using the specified binary op and initial value h_init.
Example
Below, make_reduce_into is used to create a reduction object that can be reused; a sketch follows below.
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_out (DeviceArrayLike | IteratorBase) – Device array (of size 1) that will store the result of the reduction
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the binary operator to apply
h_init (ndarray) – NumPy array storing the initial value of the reduction
- Returns:
A callable object that can be used to perform the reduction
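A sketch of the object-oriented pattern. The call protocol of the returned object (first call with None to query the temporary-storage size, second call with an allocated buffer) is an assumption modeled on the CUB two-phase convention, not confirmed by this page:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import make_reduce_into

    def add(a, b):
        return a + b

    d_in = cp.array([1, 2, 3, 4, 5], dtype=np.int32)
    d_out = cp.empty(1, dtype=np.int32)
    h_init = np.array([0], dtype=np.int32)

    reducer = make_reduce_into(d_in, d_out, add, h_init)

    # Assumed two-call protocol: query size, allocate, then execute.
    temp_bytes = reducer(None, d_in, d_out, len(d_in), h_init)
    d_temp = cp.empty(temp_bytes, dtype=np.uint8)
    reducer(d_temp, d_in, d_out, len(d_in), h_init)
    # d_out -> [15]

The other make_* functions on this page follow the same build-once, call-many pattern.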
- cuda.compute.algorithms.inclusive_scan(d_in, d_out, op, h_init, num_items, stream=None)#
Performs device-wide inclusive scan.
This function automatically handles temporary storage allocation and execution.
Example
Below, inclusive_scan is used to compute an inclusive scan (prefix sum); a sketch follows the parameter list.
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the scan
op (Callable | cuda.compute._bindings.OpKind) – Binary scan operator
h_init (ndarray) – NumPy array storing the initial value of the scan
num_items (int) – Number of items to scan
stream – CUDA stream for the operation (optional)
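A sketch, under the same CuPy assumptions as the reduce_into example above:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import inclusive_scan

    def add(a, b):
        return a + b

    d_in = cp.array([1, 2, 3, 4], dtype=np.int32)
    d_out = cp.empty_like(d_in)
    h_init = np.array([0], dtype=np.int32)

    inclusive_scan(d_in, d_out, add, h_init, len(d_in))
    # d_out -> [1, 3, 6, 10]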
- cuda.compute.algorithms.make_inclusive_scan(d_in, d_out, op, h_init)#
Computes a device-wide scan using the specified binary op and initial value h_init.
Example
Below, make_inclusive_scan is used to create an inclusive scan object that can be reused (see the make_reduce_into sketch above for the general calling pattern).
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_out (DeviceArrayLike | IteratorBase) – Device array that will store the result of the scan
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the binary operator to apply
h_init (ndarray) – NumPy array storing the initial value of the scan
- Returns:
A callable object that can be used to perform the scan
- cuda.compute.algorithms.exclusive_scan(d_in, d_out, op, h_init, num_items, stream=None)#
Performs device-wide exclusive scan.
This function automatically handles temporary storage allocation and execution.
Example
Below, exclusive_scan is used to compute an exclusive scan with the max operation; a sketch follows the parameter list.
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the scan
op (Callable | cuda.compute._bindings.OpKind) – Binary scan operator
h_init (ndarray) – NumPy array storing the initial value of the scan
num_items (int) – Number of items to scan
stream – CUDA stream for the operation (optional)
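A sketch of an exclusive max-scan, assuming CuPy device arrays (not the original documented example):

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import exclusive_scan

    def max_op(a, b):
        return a if a > b else b

    d_in = cp.array([3, 1, 4, 1, 5], dtype=np.int32)
    d_out = cp.empty_like(d_in)
    # Identity for max: the smallest representable value.
    h_init = np.array([np.iinfo(np.int32).min], dtype=np.int32)

    exclusive_scan(d_in, d_out, max_op, h_init, len(d_in))
    # d_out -> [INT32_MIN, 3, 3, 4, 4]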
- cuda.compute.algorithms.make_exclusive_scan(d_in, d_out, op, h_init)#
Computes a device-wide scan using the specified binary op and initial value h_init.
Example
Below, make_exclusive_scan is used to create an exclusive scan object that can be reused (see the make_reduce_into sketch above for the general calling pattern).
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_out (DeviceArrayLike | IteratorBase) – Device array that will store the result of the scan
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the binary operator to apply
h_init (ndarray) – NumPy array storing the initial value of the scan
- Returns:
A callable object that can be used to perform the scan
- cuda.compute.algorithms.unary_transform(d_in, d_out, op, num_items, stream=None)#
Performs device-wide unary transform.
This function automatically handles temporary storage allocation and execution.
Example
Below, unary_transform is used to apply a transformation to each element of the input; a sketch follows the parameter list.
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items.
d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the transformation.
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the unary operation to apply to each element of the input.
num_items (int) – Number of items to transform.
stream – CUDA stream to use for the operation.
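A minimal sketch, assuming CuPy device arrays:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import unary_transform

    def square(x):
        return x * x

    d_in = cp.array([1, 2, 3, 4], dtype=np.int32)
    d_out = cp.empty_like(d_in)

    unary_transform(d_in, d_out, square, len(d_in))
    # d_out -> [1, 4, 9, 16]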
- cuda.compute.algorithms.make_unary_transform(d_in, d_out, op)#
Create a unary transform object that can be called to apply a transformation to each element of the input according to the unary operation op.
This is the object-oriented API that allows explicit control over temporary storage allocation. For simpler usage, consider using unary_transform().
Example
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items.
d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the transformation.
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the unary operation to apply to each element of the input.
- Returns:
A callable object that performs the transformation.
- cuda.compute.algorithms.binary_transform(d_in1, d_in2, d_out, op, num_items, stream=None)#
Performs device-wide binary transform.
This function automatically handles temporary storage allocation and execution.
Example
Below, binary_transform is used to apply a transformation to pairs of elements from two input sequences; a sketch follows the parameter list.
- Parameters:
d_in1 (DeviceArrayLike | IteratorBase) – Device array or iterator containing the first input sequence of data items.
d_in2 (DeviceArrayLike | IteratorBase) – Device array or iterator containing the second input sequence of data items.
d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the transformation.
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the binary operation to apply to each pair of items from the input sequences.
num_items (int) – Number of items to transform.
stream – CUDA stream to use for the operation.
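A minimal sketch, assuming CuPy device arrays:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import binary_transform

    def add(a, b):
        return a + b

    d_in1 = cp.array([1, 2, 3], dtype=np.int32)
    d_in2 = cp.array([10, 20, 30], dtype=np.int32)
    d_out = cp.empty_like(d_in1)

    binary_transform(d_in1, d_in2, d_out, add, len(d_in1))
    # d_out -> [11, 22, 33]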
- cuda.compute.algorithms.make_binary_transform(d_in1, d_in2, d_out, op)#
Create a binary transform object that can be called to apply a transformation to the given pair of input sequences according to the binary operation op.
This is the object-oriented API that allows explicit control over temporary storage allocation. For simpler usage, consider using binary_transform().
Example
- Parameters:
d_in1 (DeviceArrayLike | IteratorBase) – Device array or iterator containing the first input sequence of data items.
d_in2 (DeviceArrayLike | IteratorBase) – Device array or iterator containing the second input sequence of data items.
d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the transformation.
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the binary operation to apply to each pair of items from the input sequences.
- Returns:
A callable object that performs the transformation.
- cuda.compute.algorithms.histogram_even(d_samples, d_histogram, num_output_levels, lower_level, upper_level, num_samples, stream=None)#
Performs device-wide histogram computation with evenly-spaced bins.
This function automatically handles temporary storage allocation and execution.
Example
Below, histogram_even is used to compute a histogram with evenly-spaced bins; a sketch follows the parameter list.
- Parameters:
d_samples (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data samples
d_histogram (DeviceArrayLike) – Device array to store the computed histogram
num_output_levels (int) – Number of histogram bin levels (num_bins = num_output_levels - 1)
lower_level (floating | integer) – Lower sample value bound (inclusive)
upper_level (floating | integer) – Upper sample value bound (exclusive)
num_samples (int) – Number of input samples
stream – CUDA stream for the operation (optional)
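A sketch, assuming CuPy device arrays and that plain Python numbers are accepted for the level bounds:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import histogram_even

    d_samples = cp.array([0.1, 2.5, 3.0, 6.2, 7.1, 9.9], dtype=np.float32)
    num_levels = 6                                   # 5 bins of width 2 over [0, 10)
    d_histogram = cp.zeros(num_levels - 1, dtype=np.int32)

    histogram_even(d_samples, d_histogram, num_levels, 0.0, 10.0, len(d_samples))
    # d_histogram -> [1, 2, 0, 2, 1]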
- cuda.compute.algorithms.make_histogram_even(d_samples, d_histogram, h_num_output_levels, h_lower_level, h_upper_level, num_samples)#
Implements a device-wide histogram that places d_samples into evenly-spaced bins.
Example
Below, make_histogram_even is used to create a histogram object that can be reused (see the make_reduce_into sketch above for the general calling pattern).
- Parameters:
d_samples (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input samples to be histogrammed
d_histogram (DeviceArrayLike) – Device array to store the histogram
h_num_output_levels (ndarray) – Host array containing the number of output levels
h_lower_level (ndarray) – Host array containing the lower level
h_upper_level (ndarray) – Host array containing the upper level
num_samples (int) – Number of samples to be histogrammed
- Returns:
A callable object that can be used to perform the histogram
- cuda.compute.algorithms.merge_sort(d_in_keys, d_in_items, d_out_keys, d_out_items, op, num_items, stream=None)#
Performs device-wide merge sort.
This function automatically handles temporary storage allocation and execution.
Example
Below, merge_sort is used to sort a sequence of keys in place; it also rearranges the items according to the keys’ order. A sketch follows the parameter list.
- Parameters:
d_in_keys (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of keys
d_in_items (DeviceArrayLike | IteratorBase | None) – Device array or iterator containing the input sequence of items (optional)
d_out_keys (DeviceArrayLike) – Device array to store the sorted keys
d_out_items (DeviceArrayLike | None) – Device array to store the sorted items (optional)
op (Callable | cuda.compute._bindings.OpKind) – Comparison operator for sorting
num_items (int) – Number of items to sort
stream – CUDA stream for the operation (optional)
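A sketch, assuming CuPy device arrays; the comparator convention (a less-than predicate returning a small integer) is an assumption, not confirmed by this page:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import merge_sort

    def less_than(a, b):
        return np.uint8(a < b)   # assumed less-than comparator for ascending order

    d_keys = cp.array([3, 1, 2], dtype=np.int32)
    d_items = cp.array([30, 10, 20], dtype=np.int32)

    # Passing the same arrays for input and output sorts in place.
    merge_sort(d_keys, d_items, d_keys, d_items, less_than, len(d_keys))
    # d_keys -> [1, 2, 3], d_items -> [10, 20, 30]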
- cuda.compute.algorithms.make_merge_sort(d_in_keys, d_in_items, d_out_keys, d_out_items, op)#
Implements a device-wide merge sort using d_in_keys and the comparison operator op.
Example
Below, make_merge_sort is used to create a merge sort object that can be reused (see the make_reduce_into sketch above for the general calling pattern).
- Parameters:
d_in_keys (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input keys to be sorted
d_in_items (DeviceArrayLike | IteratorBase | None) – Optional device array or iterator that contains each key’s corresponding item
d_out_keys (DeviceArrayLike) – Device array to store the sorted keys
d_out_items (DeviceArrayLike | None) – Device array to store the sorted items
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the comparison operator
- Returns:
A callable object that can be used to perform the merge sort
- cuda.compute.algorithms.radix_sort(d_in_keys, d_out_keys, d_in_values, d_out_values, order, num_items, begin_bit=None, end_bit=None, stream=None)#
Performs device-wide radix sort.
This function automatically handles temporary storage allocation and execution.
Example
Below,
radix_sort
is used to sort a sequence of keys. It also rearranges the values according to the keys’ order.In the following example,
radix_sort
is used to sort a sequence of keys with a ``DoubleBuffer` for reduced temporary storage.- Parameters:
d_in_keys (DeviceArrayLike | DoubleBuffer) – Device array or DoubleBuffer containing the input sequence of keys
d_out_keys (DeviceArrayLike | None) – Device array to store the sorted keys (optional)
d_in_values (DeviceArrayLike | DoubleBuffer | None) – Device array or DoubleBuffer containing the input sequence of values (optional)
d_out_values (DeviceArrayLike | None) – Device array to store the sorted values (optional)
order (SortOrder) – Sort order (ascending or descending)
num_items (int) – Number of items to sort
begin_bit (int | None) – Beginning bit position for comparison (optional)
end_bit (int | None) – Ending bit position for comparison (optional)
stream – CUDA stream for the operation (optional)
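A keys-only sketch, assuming CuPy device arrays; the import location and member name of SortOrder are assumptions:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import radix_sort
    from cuda.compute import SortOrder   # assumed import location

    d_in_keys = cp.array([3, 1, 2], dtype=np.int32)
    d_out_keys = cp.empty_like(d_in_keys)

    # Keys-only ascending sort; pass None for the optional value arrays.
    radix_sort(d_in_keys, d_out_keys, None, None,
               SortOrder.ASCENDING,      # assumed member name
               len(d_in_keys))
    # d_out_keys -> [1, 2, 3]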
- cuda.compute.algorithms.make_radix_sort(d_in_keys, d_out_keys, d_in_values, d_out_values, order)#
Implements a device-wide radix sort using d_in_keys in the requested order.
Example
Below, make_radix_sort is used to create a radix sort object that can be reused (see the make_reduce_into sketch above for the general calling pattern).
- Parameters:
d_in_keys (DeviceArrayLike | DoubleBuffer) – Device array or DoubleBuffer containing the input keys to be sorted
d_out_keys (DeviceArrayLike | None) – Device array to store the sorted keys
d_in_values (DeviceArrayLike | DoubleBuffer | None) – Optional device array or DoubleBuffer containing the input values to be sorted
d_out_values (DeviceArrayLike | None) – Device array to store the sorted values
order (SortOrder) – Sort order (ascending or descending)
- Returns:
A callable object that can be used to perform the radix sort
- cuda.compute.algorithms.segmented_reduce(d_in, d_out, start_offsets_in, end_offsets_in, op, h_init, num_segments, stream=None)#
Performs device-wide segmented reduction.
This function automatically handles temporary storage allocation and execution.
Example
Below, segmented_reduce is used to compute the minimum value of segments in a sequence of integers; a sketch follows the parameter list.
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_out (DeviceArrayLike | IteratorBase) – Device array to store the result of the reduction for each segment
start_offsets_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the sequence of beginning offsets
end_offsets_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the sequence of ending offsets
op (Callable | cuda.compute._bindings.OpKind) – Binary reduction operator
h_init (ndarray) – NumPy array storing the initial value of the reduction
num_segments (int) – Number of segments to reduce
stream – CUDA stream for the operation (optional)
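A sketch of a segmented minimum, assuming CuPy device arrays:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import segmented_reduce

    def min_op(a, b):
        return a if a < b else b

    d_in = cp.array([8, 6, 7, 5, 3, 0, 9], dtype=np.int32)
    # Two segments: [8, 6, 7] and [5, 3, 0, 9]
    start_offsets = cp.array([0, 3], dtype=np.int64)
    end_offsets = cp.array([3, 7], dtype=np.int64)
    d_out = cp.empty(2, dtype=np.int32)
    h_init = np.array([np.iinfo(np.int32).max], dtype=np.int32)  # identity for min

    segmented_reduce(d_in, d_out, start_offsets, end_offsets, min_op, h_init, 2)
    # d_out -> [6, 0]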
- cuda.compute.algorithms.make_segmented_reduce(d_in, d_out, start_offsets_in, end_offsets_in, op, h_init)#
Computes a device-wide segmented reduction using the specified binary op and initial value h_init.
Example
Below, make_segmented_reduce is used to create a segmented reduction object that can be reused (see the make_reduce_into sketch above for the general calling pattern).
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_out (DeviceArrayLike | IteratorBase) – Device array that will store the result of the reduction
start_offsets_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing offsets to start of segments
end_offsets_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing offsets to end of segments
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the binary operator to apply
h_init (ndarray) – NumPy array storing the initial value of the reduction
- Returns:
A callable object that can be used to perform the reduction
- cuda.compute.algorithms.unique_by_key(d_in_keys, d_in_items, d_out_keys, d_out_items, d_out_num_selected, op, num_items, stream=None)#
Performs device-wide unique by key operation using the single-phase API.
This function automatically handles temporary storage allocation and execution.
Example
Below, unique_by_key is used to populate the arrays of output keys and items with the first key and its corresponding item from each sequence of equal keys. It also outputs the number of items selected. A sketch follows the parameter list.
- Parameters:
d_in_keys (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of keys
d_in_items (DeviceArrayLike | IteratorBase) – Device array or iterator that contains each key’s corresponding item
d_out_keys (DeviceArrayLike | IteratorBase) – Device array or iterator to store the outputted keys
d_out_items (DeviceArrayLike | IteratorBase) – Device array or iterator to store each outputted key’s item
d_out_num_selected (DeviceArrayLike) – Device array to store how many items were selected
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the equality operator
num_items (int) – Number of items to process
stream – CUDA stream for the operation (optional)
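A sketch, assuming CuPy device arrays and a plain Python equality predicate:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import unique_by_key

    def keys_equal(a, b):
        return a == b

    d_in_keys = cp.array([1, 1, 2, 2, 2, 3], dtype=np.int32)
    d_in_items = cp.array([10, 11, 20, 21, 22, 30], dtype=np.int32)
    d_out_keys = cp.empty_like(d_in_keys)
    d_out_items = cp.empty_like(d_in_items)
    d_num_selected = cp.empty(1, dtype=np.int32)

    unique_by_key(d_in_keys, d_in_items, d_out_keys, d_out_items,
                  d_num_selected, keys_equal, len(d_in_keys))
    # d_num_selected -> [3]; d_out_keys[:3] -> [1, 2, 3]; d_out_items[:3] -> [10, 20, 30]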
- cuda.compute.algorithms.make_unique_by_key(d_in_keys, d_in_items, d_out_keys, d_out_items, d_out_num_selected, op)#
Implements a device-wide unique by key operation using d_in_keys and the equality operator op. Only the first key and its value from each run of equal keys are selected, and the total number of selected items is also reported.
Example
Below, make_unique_by_key is used to create a unique by key object that can be reused (see the make_reduce_into sketch above for the general calling pattern).
- Parameters:
d_in_keys (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of keys
d_in_items (DeviceArrayLike | IteratorBase) – Device array or iterator that contains each key’s corresponding item
d_out_keys (DeviceArrayLike | IteratorBase) – Device array or iterator to store the outputted keys
d_out_items (DeviceArrayLike | IteratorBase) – Device array or iterator to store each outputted key’s item
d_out_num_selected (DeviceArrayLike) – Device array to store how many items were selected
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the equality operator
- Returns:
A callable object that can be used to perform unique by key
- cuda.compute.algorithms.three_way_partition(d_in, d_first_part_out, d_second_part_out, d_unselected_out, d_num_selected_out, select_first_part_op, select_second_part_op, num_items, stream=None)#
Performs device-wide three-way partition. Given an input sequence of data items, it partitions the items into three parts:
- The first part is selected by the select_first_part_op operator.
- The second part is selected by the select_second_part_op operator.
- The unselected items are not selected by either operator.
This function automatically handles temporary storage allocation and execution.
Example
Below, three_way_partition is used to partition a sequence of integers into three parts; a sketch follows the parameter list.
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_first_part_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the first part of the output
d_second_part_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the second part of the output
d_unselected_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the unselected items
d_num_selected_out (DeviceArrayLike | IteratorBase) – Device array to store the number of items selected. The number of items selected by select_first_part_op and by select_second_part_op is stored in d_num_selected_out[0] and d_num_selected_out[1], respectively.
select_first_part_op (Callable) – Callable representing the unary operator to select the first part
select_second_part_op (Callable) – Callable representing the unary operator to select the second part
num_items (int) – Number of items to partition
stream – CUDA stream for the operation (optional)
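A sketch, assuming CuPy device arrays and plain Python predicates:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import three_way_partition

    def is_small(x):
        return x < 4     # selects the first part

    def is_large(x):
        return x > 6     # selects the second part

    d_in = cp.array([0, 5, 9, 2, 7, 4], dtype=np.int32)
    d_first = cp.empty_like(d_in)
    d_second = cp.empty_like(d_in)
    d_unselected = cp.empty_like(d_in)
    d_num_selected = cp.empty(2, dtype=np.int32)

    three_way_partition(d_in, d_first, d_second, d_unselected,
                        d_num_selected, is_small, is_large, len(d_in))
    # d_num_selected -> [2, 2]; d_first[:2] -> [0, 2]; d_second[:2] -> [9, 7]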
- cuda.compute.algorithms.make_three_way_partition(d_in, d_first_part_out, d_second_part_out, d_unselected_out, d_num_selected_out, select_first_part_op, select_second_part_op)#
Computes a device-wide three-way partition using the specified unary select_first_part_op and select_second_part_op operators.
Example
Below, make_three_way_partition is used to create a three-way partition object that can be reused (see the make_reduce_into sketch above for the general calling pattern).
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_first_part_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the first part of the output
d_second_part_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the second part of the output
d_unselected_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the unselected items
d_num_selected_out (DeviceArrayLike | IteratorBase) – Device array to store the number of items selected. The number of items selected by select_first_part_op and by select_second_part_op is stored in d_num_selected_out[0] and d_num_selected_out[1], respectively.
select_first_part_op (Callable) – Callable representing the unary operator to select the first part
select_second_part_op (Callable) – Callable representing the unary operator to select the second part
- Returns:
A callable object that can be used to perform the three-way partition
Iterators#
- cuda.compute.iterators.CacheModifiedInputIterator(device_array, modifier)#
Random Access Cache Modified Iterator that wraps a native device pointer.
Similar to https://nvidia.github.io/cccl/cub/api/classcub_1_1CacheModifiedInputIterator.html
Currently the only supported modifier is “stream” (LOAD_CS).
Example
The code snippet below demonstrates the usage of a CacheModifiedInputIterator.
- Parameters:
device_array – Array storing the input sequence of data items
modifier – The PTX cache load modifier
- Returns:
A CacheModifiedInputIterator object initialized with device_array
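A sketch (not the original snippet), assuming CuPy device arrays and feeding the iterator into reduce_into:

    import cupy as cp
    import numpy as np
    from cuda.compute.iterators import CacheModifiedInputIterator
    from cuda.compute.algorithms import reduce_into

    def add(a, b):
        return a + b

    d_in = cp.array([1, 2, 3], dtype=np.int32)
    streamed_in = CacheModifiedInputIterator(d_in, modifier="stream")  # LOAD_CS loads
    d_out = cp.empty(1, dtype=np.int32)
    h_init = np.array([0], dtype=np.int32)

    reduce_into(streamed_in, d_out, add, len(d_in), h_init)
    # d_out -> [6]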
- cuda.compute.iterators.ConstantIterator(value)#
Returns an Iterator representing a sequence of constant values.
Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1constant__iterator.html
Example
The code snippet below demonstrates the usage of a ConstantIterator representing a sequence of constant values.
- Parameters:
value – The value of every item in the sequence
- Returns:
A ConstantIterator object initialized to value
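A sketch (not the original snippet); passing a NumPy scalar to fix the value type is an assumption:

    import cupy as cp
    import numpy as np
    from cuda.compute.iterators import ConstantIterator
    from cuda.compute.algorithms import reduce_into

    def add(a, b):
        return a + b

    ones = ConstantIterator(np.int32(1))     # 1, 1, 1, ...
    d_out = cp.empty(1, dtype=np.int32)
    h_init = np.array([0], dtype=np.int32)

    # Iterators have no length, so num_items must be given explicitly.
    reduce_into(ones, d_out, add, 10, h_init)
    # d_out -> [10]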
- cuda.compute.iterators.CountingIterator(offset)#
Returns an Iterator representing a sequence of incrementing values.
Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1counting__iterator.html
Example
The code snippet below demonstrates the usage of a CountingIterator representing the sequence [10, 11, 12].
- Parameters:
offset – The initial value of the sequence
- Returns:
A CountingIterator object initialized to offset
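A sketch (not the original snippet); passing a NumPy scalar to fix the value type is an assumption:

    import cupy as cp
    import numpy as np
    from cuda.compute.iterators import CountingIterator
    from cuda.compute.algorithms import reduce_into

    def add(a, b):
        return a + b

    first = CountingIterator(np.int32(10))   # 10, 11, 12, ...
    d_out = cp.empty(1, dtype=np.int32)
    h_init = np.array([0], dtype=np.int32)

    reduce_into(first, d_out, add, 3, h_init)
    # sums [10, 11, 12] -> d_out is [33]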
- cuda.compute.iterators.ReverseIterator(sequence)#
Returns an Iterator over an array or another iterator in reverse.
Similar to std::reverse_iterator (https://en.cppreference.com/w/cpp/iterator/reverse_iterator).
Examples
The code snippet below demonstrates the usage of a ReverseIterator as an input iterator; it can be used as an output iterator in the same way.
- Parameters:
sequence – The iterator or array to be reversed
- Returns:
A ReverseIterator object
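A sketch (not the original snippet), copying a sequence back-to-front through a ReverseIterator input:

    import cupy as cp
    import numpy as np
    from cuda.compute.iterators import ReverseIterator
    from cuda.compute.algorithms import unary_transform

    def identity(x):
        return x

    d_in = cp.array([1, 2, 3, 4], dtype=np.int32)
    d_out = cp.empty_like(d_in)

    unary_transform(ReverseIterator(d_in), d_out, identity, len(d_in))
    # d_out -> [4, 3, 2, 1]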
- cuda.compute.iterators.TransformIterator(it, op)#
An iterator that applies a user-defined unary function to the elements of an underlying iterator as they are read.
Similar to thrust::transform_iterator (https://nvidia.github.io/cccl/thrust/api/classthrust_1_1transform__iterator.html).
Example
The code snippet below demonstrates the usage of a TransformIterator composed with a CountingIterator to transform the input before performing a reduction.
- Parameters:
it – The underlying iterator
op – The unary operation to be applied to values as they are read from it
- Returns:
A TransformIterator object to transform the items in it using op
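A sketch of that composition (not the original snippet):

    import cupy as cp
    import numpy as np
    from cuda.compute.iterators import CountingIterator, TransformIterator
    from cuda.compute.algorithms import reduce_into

    def square(x):
        return x * x

    def add(a, b):
        return a + b

    it = TransformIterator(CountingIterator(np.int32(1)), square)  # 1, 4, 9, ...
    d_out = cp.empty(1, dtype=np.int32)
    h_init = np.array([0], dtype=np.int32)

    reduce_into(it, d_out, add, 3, h_init)
    # 1 + 4 + 9 -> d_out is [14]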
- cuda.compute.iterators.TransformOutputIterator(it, op)#
An iterator that applies a user-defined unary function to values before writing them to an underlying iterator.
Similar to thrust::transform_output_iterator (https://nvidia.github.io/cccl/thrust/api/classthrust_1_1transform__output__iterator.html).
Example
The code snippet below demonstrates the usage of a TransformOutputIterator to transform the output of a reduction before writing to an output array.
- Parameters:
it – The underlying iterator
op – The operation to be applied to values before they are written to it
- Returns:
A TransformOutputIterator object that applies op to transform values before writing them to it
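A sketch of that usage (not the original snippet), assuming CuPy device arrays:

    import cupy as cp
    import numpy as np
    from cuda.compute.iterators import TransformOutputIterator
    from cuda.compute.algorithms import reduce_into

    def add(a, b):
        return a + b

    def halve(x):
        return x // 2            # applied to the result as it is written

    d_in = cp.array([1, 2, 3, 4], dtype=np.int32)
    d_out = cp.empty(1, dtype=np.int32)
    h_init = np.array([0], dtype=np.int32)

    reduce_into(d_in, TransformOutputIterator(d_out, halve), add, len(d_in), h_init)
    # reduction yields 10; halve is applied on write -> d_out is [5]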
- cuda.compute.iterators.ZipIterator(*iterators)#
Returns an Iterator representing a zipped sequence of values from N iterators.
Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1zip__iterator.html
The resulting iterator yields gpu_struct objects with fields corresponding to each input iterator. For 2 iterators, fields are named ‘first’ and ‘second’. For N iterators, fields are indexed as field_0, field_1, …, field_N-1.
Example
The code snippet below demonstrates the usage of a ZipIterator combining two device arrays.
- Parameters:
*iterators – Variable number of iterators to zip (at least 1)
- Returns:
A ZipIterator object that yields combined values from all input iterators
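A sketch (not the original snippet): with two inputs, each zipped element exposes 'first' and 'second' fields, as documented above.

    import cupy as cp
    import numpy as np
    from cuda.compute.iterators import ZipIterator
    from cuda.compute.algorithms import unary_transform

    def sum_pair(pair):
        return pair.first + pair.second   # two-iterator case: 'first'/'second'

    d_a = cp.array([1, 2, 3], dtype=np.int32)
    d_b = cp.array([10, 20, 30], dtype=np.int32)
    d_out = cp.empty_like(d_a)

    unary_transform(ZipIterator(d_a, d_b), d_out, sum_pair, len(d_a))
    # d_out -> [11, 22, 33]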
Operators#
- class cuda.compute.op.OpKind#
Enumeration of operator kinds for CUDA parallel algorithms.
This enum defines the types of operations that can be performed in parallel algorithms, including arithmetic, logical, and bitwise operations.
- STATELESS#
- STATEFUL#
- PLUS#
- MINUS#
- MULTIPLIES#
- DIVIDES#
- MODULUS#
- EQUAL_TO#
- NOT_EQUAL_TO#
- GREATER#
- LESS#
- GREATER_EQUAL#
- LESS_EQUAL#
- LOGICAL_AND#
- LOGICAL_OR#
- LOGICAL_NOT#
- BIT_AND#
- BIT_OR#
- BIT_XOR#
- BIT_NOT#
- NEGATE#
Utilities#
- cuda.compute.struct.gpu_struct_from_numba_types(name, field_names, field_types)#
Create a struct type from tuples of field names and numba types.
- cuda.compute.struct.gpu_struct(this)#
Decorate a class as a GPU struct.
A GpuStruct represents a value composed of one or more other values, and is defined as a class with annotated fields (similar to a dataclass). The type of each field must be a subclass of np.number, like np.int32 or np.float64.
Arrays of GpuStruct objects can be used as inputs to cuda.compute algorithms.
Example
The code snippet below shows how to use gpu_struct to define a MinMax type (composed of min_val and max_val fields), and perform a reduction on an input array of floating-point values to compute the smallest and largest absolute values of its elements.
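A sketch of that example; the MinMax(...) constructor and MinMax.dtype attribute are assumed from the dataclass-like behavior described above, not confirmed by this page:

    import cupy as cp
    import numpy as np
    from cuda.compute.struct import gpu_struct
    from cuda.compute.iterators import TransformIterator
    from cuda.compute.algorithms import reduce_into

    @gpu_struct
    class MinMax:
        min_val: np.float64
        max_val: np.float64

    def to_minmax(x):
        ax = abs(x)
        return MinMax(ax, ax)    # view each element as a (|x|, |x|) pair

    def minmax_op(a, b):
        return MinMax(min(a.min_val, b.min_val), max(a.max_val, b.max_val))

    d_in = cp.array([-3.0, 1.5, -0.25, 2.0], dtype=np.float64)
    it = TransformIterator(d_in, to_minmax)
    d_out = cp.empty(1, dtype=MinMax.dtype)   # assumed: the struct exposes a numpy dtype
    h_init = MinMax(np.inf, -np.inf)

    reduce_into(it, d_out, minmax_op, len(d_in), h_init)
    # d_out[0] -> (0.25, 3.0)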
- cuda.compute.struct.gpu_struct_from_numpy_dtype(name, np_dtype)#
Create a GPU struct from a numpy record dtype.