cuda.compute API Reference#

Warning

cuda.compute is in public beta. The API is subject to change without notice.

Algorithms#

cuda.compute.algorithms.reduce_into(d_in, d_out, op, num_items, h_init, stream=None)#

Performs device-wide reduction.

This function automatically handles temporary storage allocation and execution.

Example

Below, reduce_into is used to compute the sum of a sequence of integers.

Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items

  • d_out (DeviceArrayLike | IteratorBase) – Device array to store the result of the reduction

  • op (Callable | cuda.compute._bindings.OpKind) – Binary reduction operator

  • num_items (int) – Number of items to reduce

  • h_init (ndarray | Any) – Initial value for the reduction

  • stream – CUDA stream for the operation (optional)

cuda.compute.algorithms.make_reduce_into(d_in, d_out, op, h_init)#

Creates a callable object that computes a device-wide reduction using the specified binary op and initial value h_init.

Example

Below, make_reduce_into is used to create a reduction object that can be reused.

Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items

  • d_out (DeviceArrayLike | IteratorBase) – Device array (of size 1) that will store the result of the reduction

  • op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the binary operator to apply

  • h_init (ndarray) – NumPy array storing the initial value of the reduction

Returns:

A callable object that can be used to perform the reduction

cuda.compute.algorithms.inclusive_scan(d_in, d_out, op, h_init, num_items, stream=None)#

Performs device-wide inclusive scan.

This function automatically handles temporary storage allocation and execution.

Example

Below, inclusive_scan is used to compute an inclusive scan (prefix sum).

Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items

  • d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the scan

  • op (Callable | cuda.compute._bindings.OpKind) – Binary scan operator

  • h_init (ndarray | Any) – Initial value for the scan

  • num_items (int) – Number of items to scan

  • stream – CUDA stream for the operation (optional)

cuda.compute.algorithms.make_inclusive_scan(d_in, d_out, op, h_init)#

Creates a callable object that computes a device-wide inclusive scan using the specified binary op and initial value h_init.

Example

Below, make_inclusive_scan is used to create an inclusive scan object that can be reused.

Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items

  • d_out (DeviceArrayLike | IteratorBase) – Device array that will store the result of the scan

  • op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the binary operator to apply

  • h_init (ndarray) – NumPy array storing the initial value of the scan

Returns:

A callable object that can be used to perform the scan

cuda.compute.algorithms.exclusive_scan(d_in, d_out, op, h_init, num_items, stream=None)#

Performs device-wide exclusive scan.

This function automatically handles temporary storage allocation and execution.

Example

Below, exclusive_scan is used to compute an exclusive scan with max operation.

Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items

  • d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the scan

  • op (Callable | cuda.compute._bindings.OpKind) – Binary scan operator

  • h_init (ndarray | Any) – Initial value for the scan

  • num_items (int) – Number of items to scan

  • stream – CUDA stream for the operation (optional)

cuda.compute.algorithms.make_exclusive_scan(d_in, d_out, op, h_init)#

Creates a callable object that computes a device-wide exclusive scan using the specified binary op and initial value h_init.

Example

Below, make_exclusive_scan is used to create an exclusive scan object that can be reused.

Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items

  • d_out (DeviceArrayLike | IteratorBase) – Device array that will store the result of the scan

  • op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the binary operator to apply

  • h_init (ndarray) – NumPy array storing the initial value of the scan

Returns:

A callable object that can be used to perform the scan

cuda.compute.algorithms.unary_transform(d_in, d_out, op, num_items, stream=None)#

Performs device-wide unary transform.

This function automatically handles temporary storage allocation and execution.

Example

Below, unary_transform is used to apply a transformation to each element of the input.

Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items.

  • d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the transformation.

  • op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the unary operation to apply to each element of the input.

  • num_items (int) – Number of items to transform.

  • stream – CUDA stream to use for the operation.

cuda.compute.algorithms.make_unary_transform(d_in, d_out, op)#

Create a unary transform object that can be called to apply a transformation to each element of the input according to the unary operation op.

This is the object-oriented API that allows explicit control over temporary storage allocation. For simpler usage, consider using unary_transform().

Example

Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items.

  • d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the transformation.

  • op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the unary operation to apply to each element of the input.

Returns:

A callable object that performs the transformation.

cuda.compute.algorithms.binary_transform(d_in1, d_in2, d_out, op, num_items, stream=None)#

Performs device-wide binary transform.

This function automatically handles temporary storage allocation and execution.

Example

Below, binary_transform is used to apply a transformation to pairs of elements from two input sequences.

Parameters:
  • d_in1 (DeviceArrayLike | IteratorBase) – Device array or iterator containing the first input sequence of data items.

  • d_in2 (DeviceArrayLike | IteratorBase) – Device array or iterator containing the second input sequence of data items.

  • d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the transformation.

  • op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the binary operation to apply to each pair of items from the input sequences.

  • num_items (int) – Number of items to transform.

  • stream – CUDA stream to use for the operation.

cuda.compute.algorithms.make_binary_transform(d_in1, d_in2, d_out, op)#

Create a binary transform object that can be called to apply a transformation to the given pair of input sequences according to the binary operation op.

This is the object-oriented API that allows explicit control over temporary storage allocation. For simpler usage, consider using binary_transform().

Example

Parameters:
  • d_in1 (DeviceArrayLike | IteratorBase) – Device array or iterator containing the first input sequence of data items.

  • d_in2 (DeviceArrayLike | IteratorBase) – Device array or iterator containing the second input sequence of data items.

  • d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the transformation.

  • op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the binary operation to apply to each pair of items from the input sequences.

Returns:

A callable object that performs the transformation.

cuda.compute.algorithms.histogram_even(
d_samples,
d_histogram,
num_output_levels,
lower_level,
upper_level,
num_samples,
stream=None,
)#

Performs device-wide histogram computation with evenly-spaced bins.

This function automatically handles temporary storage allocation and execution.

Example

Below, histogram_even is used to compute a histogram with evenly-spaced bins.

Parameters:
  • d_samples (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data samples

  • d_histogram (DeviceArrayLike) – Device array to store the computed histogram

  • num_output_levels (int) – Number of histogram bin levels (num_bins = num_output_levels - 1)

  • lower_level (floating | integer) – Lower sample value bound (inclusive)

  • upper_level (floating | integer) – Upper sample value bound (exclusive)

  • num_samples (int) – Number of input samples

  • stream – CUDA stream for the operation (optional)

cuda.compute.algorithms.make_histogram_even(
d_samples,
d_histogram,
h_num_output_levels,
h_lower_level,
h_upper_level,
num_samples,
)#

Implements a device-wide histogram that places d_samples into evenly-spaced bins.

Example

Below, make_histogram_even is used to create a histogram object that can be reused.

Parameters:
  • d_samples (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input samples to be histogrammed

  • d_histogram (DeviceArrayLike) – Device array to store the histogram

  • h_num_output_levels (ndarray) – Host array containing the number of output levels

  • h_lower_level (ndarray) – Host array containing the lower level

  • h_upper_level (ndarray) – Host array containing the upper level

  • num_samples (int) – Number of samples to be histogrammed

Returns:

A callable object that can be used to perform the histogram

cuda.compute.algorithms.merge_sort(
d_in_keys,
d_in_items,
d_out_keys,
d_out_items,
op,
num_items,
stream=None,
)#

Performs device-wide merge sort.

This function automatically handles temporary storage allocation and execution.

Example

Below, merge_sort is used to sort a sequence of keys in place. It also rearranges the items according to the keys’ order.

Parameters:
  • d_in_keys (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of keys

  • d_in_items (DeviceArrayLike | IteratorBase | None) – Device array or iterator containing the input sequence of items (optional)

  • d_out_keys (DeviceArrayLike) – Device array to store the sorted keys

  • d_out_items (DeviceArrayLike | None) – Device array to store the sorted items (optional)

  • op (Callable | cuda.compute._bindings.OpKind) – Comparison operator for sorting

  • num_items (int) – Number of items to sort

  • stream – CUDA stream for the operation (optional)

cuda.compute.algorithms.make_merge_sort(d_in_keys, d_in_items, d_out_keys, d_out_items, op)#

Implements a device-wide merge sort using d_in_keys and the comparison operator op.

Example

Below, make_merge_sort is used to create a merge sort object that can be reused.

Parameters:
  • d_in_keys (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input keys to be sorted

  • d_in_items (DeviceArrayLike | IteratorBase | None) – Optional device array or iterator that contains each key’s corresponding item

  • d_out_keys (DeviceArrayLike) – Device array to store the sorted keys

  • d_out_items (DeviceArrayLike | None) – Device array to store the sorted items

  • op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the comparison operator

Returns:

A callable object that can be used to perform the merge sort

cuda.compute.algorithms.radix_sort(
d_in_keys,
d_out_keys,
d_in_values,
d_out_values,
order,
num_items,
begin_bit=None,
end_bit=None,
stream=None,
)#

Performs device-wide radix sort.

This function automatically handles temporary storage allocation and execution.

Example

Below, radix_sort is used to sort a sequence of keys. It also rearranges the values according to the keys’ order.

In the following example, radix_sort is used to sort a sequence of keys with a DoubleBuffer for reduced temporary storage.

Parameters:
  • d_in_keys (DeviceArrayLike | DoubleBuffer) – Device array or DoubleBuffer containing the input sequence of keys

  • d_out_keys (DeviceArrayLike | None) – Device array to store the sorted keys (optional)

  • d_in_values (DeviceArrayLike | DoubleBuffer | None) – Device array or DoubleBuffer containing the input sequence of values (optional)

  • d_out_values (DeviceArrayLike | None) – Device array to store the sorted values (optional)

  • order (SortOrder) – Sort order (ascending or descending)

  • num_items (int) – Number of items to sort

  • begin_bit (int | None) – Beginning bit position for comparison (optional)

  • end_bit (int | None) – Ending bit position for comparison (optional)

  • stream – CUDA stream for the operation (optional)

cuda.compute.algorithms.make_radix_sort(
d_in_keys,
d_out_keys,
d_in_values,
d_out_values,
order,
)#

Implements a device-wide radix sort using d_in_keys in the requested order.

Example

Below, make_radix_sort is used to create a radix sort object that can be reused.

Parameters:
  • d_in_keys (DeviceArrayLike | DoubleBuffer) – Device array or DoubleBuffer containing the input keys to be sorted

  • d_out_keys (DeviceArrayLike | None) – Device array to store the sorted keys

  • d_in_values (DeviceArrayLike | DoubleBuffer | None) – Optional device array or DoubleBuffer containing the input values to be sorted

  • d_out_values (DeviceArrayLike | None) – Device array to store the sorted values

  • order (SortOrder) – Sort order (ascending or descending)

Returns:

A callable object that can be used to perform the radix sort

cuda.compute.algorithms.segmented_reduce(
d_in,
d_out,
start_offsets_in,
end_offsets_in,
op,
h_init,
num_segments,
stream=None,
)#

Performs device-wide segmented reduction.

This function automatically handles temporary storage allocation and execution.

Example

Below, segmented_reduce is used to compute the minimum value of segments in a sequence of integers.

Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items

  • d_out (DeviceArrayLike | IteratorBase) – Device array to store the result of the reduction for each segment

  • start_offsets_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the sequence of beginning offsets

  • end_offsets_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the sequence of ending offsets

  • op (Callable | cuda.compute._bindings.OpKind) – Binary reduction operator

  • h_init (ndarray | Any) – Initial value for the reduction

  • num_segments (int) – Number of segments to reduce

  • stream – CUDA stream for the operation (optional)

cuda.compute.algorithms.make_segmented_reduce(
d_in,
d_out,
start_offsets_in,
end_offsets_in,
op,
h_init,
)#

Creates a callable object that computes a device-wide segmented reduction using the specified binary op and initial value h_init.

Example

Below, make_segmented_reduce is used to create a segmented reduction object that can be reused.

Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items

  • d_out (DeviceArrayLike | IteratorBase) – Device array that will store the result of the reduction

  • start_offsets_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing offsets to start of segments

  • end_offsets_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing offsets to end of segments

  • op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the binary operator to apply

  • h_init (ndarray) – NumPy array storing the initial value of the reduction

Returns:

A callable object that can be used to perform the reduction

cuda.compute.algorithms.unique_by_key(
d_in_keys,
d_in_items,
d_out_keys,
d_out_items,
d_out_num_selected,
op,
num_items,
stream=None,
)#

Performs device-wide unique by key operation using the single-phase API.

This function automatically handles temporary storage allocation and execution.

Example

Below, unique_by_key is used to populate the arrays of output keys and items with the first key and its corresponding item from each sequence of equal keys. It also outputs the number of items selected.

Parameters:
  • d_in_keys (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of keys

  • d_in_items (DeviceArrayLike | IteratorBase) – Device array or iterator that contains each key’s corresponding item

  • d_out_keys (DeviceArrayLike | IteratorBase) – Device array or iterator to store the outputted keys

  • d_out_items (DeviceArrayLike | IteratorBase) – Device array or iterator to store each outputted key’s item

  • d_out_num_selected (DeviceArrayLike) – Device array to store how many items were selected

  • op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the equality operator

  • num_items (int) – Number of items to process

  • stream – CUDA stream for the operation (optional)

cuda.compute.algorithms.make_unique_by_key(
d_in_keys,
d_in_items,
d_out_keys,
d_out_items,
d_out_num_selected,
op,
)#

Implements a device-wide unique by key operation using d_in_keys and the equality operator op. Only the first key and its item from each run of equal keys are selected, and the total number of selected items is also reported.

Example

Below, make_unique_by_key is used to create a unique by key object that can be reused.

Parameters:
  • d_in_keys (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of keys

  • d_in_items (DeviceArrayLike | IteratorBase) – Device array or iterator that contains each key’s corresponding item

  • d_out_keys (DeviceArrayLike | IteratorBase) – Device array or iterator to store the outputted keys

  • d_out_items (DeviceArrayLike | IteratorBase) – Device array or iterator to store each outputted key’s item

  • d_out_num_selected (DeviceArrayLike) – Device array to store how many items were selected

  • op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the equality operator

Returns:

A callable object that can be used to perform unique by key

cuda.compute.algorithms.three_way_partition(
d_in,
d_first_part_out,
d_second_part_out,
d_unselected_out,
d_num_selected_out,
select_first_part_op,
select_second_part_op,
num_items,
stream=None,
)#

Performs device-wide three-way partition. Given an input sequence of data items, it partitions the items into three parts:

  • The first part contains the items selected by the select_first_part_op operator.

  • The second part contains the items selected by the select_second_part_op operator.

  • The unselected part contains the items selected by neither operator.

This function automatically handles temporary storage allocation and execution.

Example

Below, three_way_partition is used to partition a sequence of integers into three parts.

Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items

  • d_first_part_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the first part of the output

  • d_second_part_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the second part of the output

  • d_unselected_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the unselected items

  • d_num_selected_out (DeviceArrayLike | IteratorBase) – Device array to store the number of items selected. The number of items selected by select_first_part_op is stored in d_num_selected_out[0], and the number selected by select_second_part_op in d_num_selected_out[1].

  • select_first_part_op (Callable) – Callable representing the unary operator to select the first part

  • select_second_part_op (Callable) – Callable representing the unary operator to select the second part

  • num_items (int) – Number of items to partition

  • stream – CUDA stream for the operation (optional)

cuda.compute.algorithms.make_three_way_partition(
d_in,
d_first_part_out,
d_second_part_out,
d_unselected_out,
d_num_selected_out,
select_first_part_op,
select_second_part_op,
)#

Computes a device-wide three-way partition using the specified unary select_first_part_op and select_second_part_op operators.

Example

Below, make_three_way_partition is used to create a three-way partition object that can be reused.

Parameters:
  • d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items

  • d_first_part_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the first part of the output

  • d_second_part_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the second part of the output

  • d_unselected_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the unselected items

  • d_num_selected_out (DeviceArrayLike | IteratorBase) – Device array to store the number of items selected. The number of items selected by select_first_part_op is stored in d_num_selected_out[0], and the number selected by select_second_part_op in d_num_selected_out[1].

  • select_first_part_op (Callable) – Callable representing the unary operator to select the first part

  • select_second_part_op (Callable) – Callable representing the unary operator to select the second part

Returns:

A callable object that can be used to perform the three-way partition

class cuda.compute.algorithms.DoubleBuffer(d_current, d_alternate)#
Parameters:
  • d_current (DeviceArrayLike)

  • d_alternate (DeviceArrayLike)

__init__(d_current, d_alternate)#
Parameters:
  • d_current (DeviceArrayLike)

  • d_alternate (DeviceArrayLike)

current()#
alternate()#
class cuda.compute.algorithms.SortOrder(
value,
names=<not given>,
*values,
module=None,
qualname=None,
type=None,
start=1,
boundary=None,
)#
ASCENDING = 0#
DESCENDING = 1#

Iterators#

cuda.compute.iterators.CacheModifiedInputIterator(device_array, modifier)#

Random Access Cache Modified Iterator that wraps a native device pointer.

Similar to https://nvidia.github.io/cccl/cub/api/classcub_1_1CacheModifiedInputIterator.html

Currently the only supported modifier is “stream” (LOAD_CS).

Example

The code snippet below demonstrates the usage of a CacheModifiedInputIterator:

Parameters:
  • device_array – Array storing the input sequence of data items

  • modifier – The PTX cache load modifier

Returns:

A CacheModifiedInputIterator object initialized with device_array

cuda.compute.iterators.ConstantIterator(value)#

Returns an Iterator representing a sequence of constant values.

Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1constant__iterator.html

Example

The code snippet below demonstrates the usage of a ConstantIterator representing a sequence of constant values:

Parameters:

value – The value of every item in the sequence

Returns:

A ConstantIterator object initialized to value

cuda.compute.iterators.CountingIterator(offset)#

Returns an Iterator representing a sequence of incrementing values.

Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1counting__iterator.html

Example

The code snippet below demonstrates the usage of a CountingIterator representing the sequence [10, 11, 12]:

Parameters:

offset – The initial value of the sequence

Returns:

A CountingIterator object initialized to offset

cuda.compute.iterators.ReverseIterator(sequence)#

Returns an Iterator over an array or another iterator in reverse.

Similar to [std::reverse_iterator](https://en.cppreference.com/w/cpp/iterator/reverse_iterator).

Examples

The code snippet below demonstrates the usage of a ReverseIterator as an input iterator:

The code snippet below demonstrates the usage of a ReverseIterator as an output iterator:

Parameters:

sequence – The iterator or array to be reversed

Returns:

A ReverseIterator object

cuda.compute.iterators.TransformIterator(it, op)#

An iterator that applies a user-defined unary function to the elements of an underlying iterator as they are read.

Similar to [thrust::transform_iterator](https://nvidia.github.io/cccl/thrust/api/classthrust_1_1transform__iterator.html)

Example

The code snippet below demonstrates the usage of a TransformIterator composed with a CountingIterator to transform the input before performing a reduction.

Parameters:
  • it – The underlying iterator

  • op – The unary operation to be applied to values as they are read from it

Returns:

A TransformIterator object to transform the items in it using op

cuda.compute.iterators.TransformOutputIterator(it, op)#

An iterator that applies a user-defined unary function to values before writing them to an underlying iterator.

Similar to [thrust::transform_output_iterator](https://nvidia.github.io/cccl/thrust/api/classthrust_1_1transform__output__iterator.html).

Example

The code snippet below demonstrates the usage of a TransformOutputIterator to transform the output of a reduction before writing to an output array.

Parameters:
  • it – The underlying iterator

  • op – The operation to be applied to values before they are written to it

Returns:

A TransformOutputIterator object that applies op to transform values before writing them to it

cuda.compute.iterators.ZipIterator(*iterators)#

Returns an Iterator representing a zipped sequence of values from N iterators.

Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1zip__iterator.html

The resulting iterator yields gpu_struct objects with fields corresponding to each input iterator. For 2 iterators, fields are named ‘first’ and ‘second’. For N iterators, fields are indexed as field_0, field_1, …, field_N-1.

Example

The code snippet below demonstrates the usage of a ZipIterator combining two device arrays:

Parameters:

*iterators – Variable number of iterators to zip (at least 1)

Returns:

A ZipIterator object that yields combined values from all input iterators

Operators#

class cuda.compute.op.OpKind#

Enumeration of operator kinds for CUDA parallel algorithms.

This enum defines the types of operations that can be performed in parallel algorithms, including arithmetic, logical, and bitwise operations.

STATELESS#
STATEFUL#
PLUS#
MINUS#
MULTIPLIES#
DIVIDES#
MODULUS#
EQUAL_TO#
NOT_EQUAL_TO#
GREATER#
LESS#
GREATER_EQUAL#
LESS_EQUAL#
LOGICAL_AND#
LOGICAL_OR#
LOGICAL_NOT#
BIT_AND#
BIT_OR#
BIT_XOR#
BIT_NOT#
NEGATE#

Utilities#

cuda.compute.struct.gpu_struct_from_numba_types(name, field_names, field_types)#

Create a struct type from tuples of field names and numba types.

Parameters:
  • name (str) – The name of the struct class

  • field_names (tuple) – Tuple of field names

  • field_types (tuple) – Tuple of corresponding numba types

Returns:

A dynamically created struct class with the specified fields

Return type:

Type

cuda.compute.struct.gpu_struct(this)#

Decorate a class as a GPU struct.

A GpuStruct represents a value composed of one or more other values, and is defined as a class with annotated fields (similar to a dataclass). The type of each field must be a subclass of np.number, like np.int32 or np.float64.

Arrays of GpuStruct objects can be used as inputs to cuda.compute algorithms.

Example

The code snippet below shows how to use gpu_struct to define a MinMax type (composed of min_val and max_val fields), and perform a reduction on an input array of floating-point values to compute its smallest and largest absolute values:

Parameters:

this (type)

Return type:

Type[Any]

cuda.compute.struct.gpu_struct_from_numpy_dtype(name, np_dtype)#

Create a GPU struct from a numpy record dtype.