cuda.compute API Reference#
Warning
cuda.compute is in public beta. The API is subject to change without notice.
Algorithms#
- cuda.compute.algorithms.reduce_into(d_in, d_out, op, num_items, h_init, stream=None)#
Performs device-wide reduction.
This function automatically handles temporary storage allocation and execution.
Example
Below, reduce_into is used to compute the sum of a sequence of integers; a sketch follows the parameter list.
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_out (DeviceArrayLike | IteratorBase) – Device array to store the result of the reduction
op (Callable | cuda.compute._bindings.OpKind) – Binary reduction operator
num_items (int) – Number of items to reduce
h_init (ndarray) – NumPy array storing the initial value of the reduction
stream – CUDA stream for the operation (optional)
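A minimal sketch of single-phase usage, assuming CuPy device arrays and a plain Python callable as the reduction operator (not the original documented example):

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import reduce_into

    def add(a, b):
        return a + b

    d_in = cp.array([1, 2, 3, 4, 5], dtype=np.int32)
    d_out = cp.empty(1, dtype=np.int32)      # holds the single reduction result
    h_init = np.array([0], dtype=np.int32)   # host-side initial value

    reduce_into(d_in, d_out, add, len(d_in), h_init)
    # d_out -> [15]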
- cuda.compute.algorithms.make_reduce_into(d_in, d_out, op, h_init)#
Computes a device-wide reduction using the specified binary op and initial value h_init.
Example
Below, make_reduce_into is used to create a reduction object that can be reused; a sketch follows below.
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_out (DeviceArrayLike | IteratorBase) – Device array (of size 1) that will store the result of the reduction
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the binary operator to apply
h_init (ndarray) – NumPy array storing the initial value of the reduction
- Returns:
A callable object that can be used to perform the reduction
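A sketch of the object-oriented pattern. The call protocol of the returned object (first call with None to query the temporary-storage size, second call with an allocated buffer) is an assumption modeled on the CUB two-phase convention, not confirmed by this page:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import make_reduce_into

    def add(a, b):
        return a + b

    d_in = cp.array([1, 2, 3, 4, 5], dtype=np.int32)
    d_out = cp.empty(1, dtype=np.int32)
    h_init = np.array([0], dtype=np.int32)

    reducer = make_reduce_into(d_in, d_out, add, h_init)

    # Assumed two-call protocol: query size, allocate, then execute.
    temp_bytes = reducer(None, d_in, d_out, len(d_in), h_init)
    d_temp = cp.empty(temp_bytes, dtype=np.uint8)
    reducer(d_temp, d_in, d_out, len(d_in), h_init)
    # d_out -> [15]

The other make_* functions on this page follow the same build-once, call-many pattern.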
- cuda.compute.algorithms.inclusive_scan(d_in, d_out, op, h_init, num_items, stream=None)#
Performs device-wide inclusive scan.
This function automatically handles temporary storage allocation and execution.
Example
Below, inclusive_scan is used to compute an inclusive scan (prefix sum); a sketch follows the parameter list.
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the scan
op (Callable | cuda.compute._bindings.OpKind) – Binary scan operator
h_init (ndarray) – NumPy array storing the initial value of the scan
num_items (int) – Number of items to scan
stream – CUDA stream for the operation (optional)
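A sketch, under the same CuPy assumptions as the reduce_into example above:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import inclusive_scan

    def add(a, b):
        return a + b

    d_in = cp.array([1, 2, 3, 4], dtype=np.int32)
    d_out = cp.empty_like(d_in)
    h_init = np.array([0], dtype=np.int32)

    inclusive_scan(d_in, d_out, add, h_init, len(d_in))
    # d_out -> [1, 3, 6, 10]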
- cuda.compute.algorithms.make_inclusive_scan(d_in, d_out, op, h_init)#
Computes a device-wide scan using the specified binary op and initial value h_init.
Example
Below, make_inclusive_scan is used to create an inclusive scan object that can be reused (see the make_reduce_into sketch above for the general calling pattern).
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_out (DeviceArrayLike | IteratorBase) – Device array that will store the result of the scan
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the binary operator to apply
h_init (ndarray) – NumPy array storing the initial value of the scan
- Returns:
A callable object that can be used to perform the scan
- cuda.compute.algorithms.exclusive_scan(d_in, d_out, op, h_init, num_items, stream=None)#
Performs device-wide exclusive scan.
This function automatically handles temporary storage allocation and execution.
Example
Below, exclusive_scan is used to compute an exclusive scan with the max operation; a sketch follows the parameter list.
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the scan
op (Callable | cuda.compute._bindings.OpKind) – Binary scan operator
h_init (ndarray) – NumPy array storing the initial value of the scan
num_items (int) – Number of items to scan
stream – CUDA stream for the operation (optional)
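A sketch of an exclusive max-scan, assuming CuPy device arrays (not the original documented example):

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import exclusive_scan

    def max_op(a, b):
        return a if a > b else b

    d_in = cp.array([3, 1, 4, 1, 5], dtype=np.int32)
    d_out = cp.empty_like(d_in)
    # Identity for max: the smallest representable value.
    h_init = np.array([np.iinfo(np.int32).min], dtype=np.int32)

    exclusive_scan(d_in, d_out, max_op, h_init, len(d_in))
    # d_out -> [INT32_MIN, 3, 3, 4, 4]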
- cuda.compute.algorithms.make_exclusive_scan(d_in, d_out, op, h_init)#
Computes a device-wide scan using the specified binary op and initial value h_init.
Example
Below, make_exclusive_scan is used to create an exclusive scan object that can be reused (see the make_reduce_into sketch above for the general calling pattern).
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_out (DeviceArrayLike | IteratorBase) – Device array that will store the result of the scan
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the binary operator to apply
h_init (ndarray) – NumPy array storing the initial value of the scan
- Returns:
A callable object that can be used to perform the scan
- cuda.compute.algorithms.unary_transform(d_in, d_out, op, num_items, stream=None)#
Performs device-wide unary transform.
This function automatically handles temporary storage allocation and execution.
Example
Below, unary_transform is used to apply a transformation to each element of the input; a sketch follows the parameter list.
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items.
d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the transformation.
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the unary operation to apply to each element of the input.
num_items (int) – Number of items to transform.
stream – CUDA stream to use for the operation.
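A minimal sketch, assuming CuPy device arrays:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import unary_transform

    def square(x):
        return x * x

    d_in = cp.array([1, 2, 3, 4], dtype=np.int32)
    d_out = cp.empty_like(d_in)

    unary_transform(d_in, d_out, square, len(d_in))
    # d_out -> [1, 4, 9, 16]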
- cuda.compute.algorithms.make_unary_transform(d_in, d_out, op)#
Create a unary transform object that can be called to apply a transformation to each element of the input according to the unary operation op.
This is the object-oriented API that allows explicit control over temporary storage allocation. For simpler usage, consider using unary_transform().
Example
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items.
d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the transformation.
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the unary operation to apply to each element of the input.
- Returns:
A callable object that performs the transformation.
- cuda.compute.algorithms.binary_transform(d_in1, d_in2, d_out, op, num_items, stream=None)#
Performs device-wide binary transform.
This function automatically handles temporary storage allocation and execution.
Example
Below, binary_transform is used to apply a transformation to pairs of elements from two input sequences; a sketch follows the parameter list.
- Parameters:
d_in1 (DeviceArrayLike | IteratorBase) – Device array or iterator containing the first input sequence of data items.
d_in2 (DeviceArrayLike | IteratorBase) – Device array or iterator containing the second input sequence of data items.
d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the transformation.
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the binary operation to apply to each pair of items from the input sequences.
num_items (int) – Number of items to transform.
stream – CUDA stream to use for the operation.
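A minimal sketch, assuming CuPy device arrays:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import binary_transform

    def add(a, b):
        return a + b

    d_in1 = cp.array([1, 2, 3], dtype=np.int32)
    d_in2 = cp.array([10, 20, 30], dtype=np.int32)
    d_out = cp.empty_like(d_in1)

    binary_transform(d_in1, d_in2, d_out, add, len(d_in1))
    # d_out -> [11, 22, 33]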
- cuda.compute.algorithms.make_binary_transform(d_in1, d_in2, d_out, op)#
Create a binary transform object that can be called to apply a transformation to the given pair of input sequences according to the binary operation op.
This is the object-oriented API that allows explicit control over temporary storage allocation. For simpler usage, consider using binary_transform().
Example
- Parameters:
d_in1 (DeviceArrayLike | IteratorBase) – Device array or iterator containing the first input sequence of data items.
d_in2 (DeviceArrayLike | IteratorBase) – Device array or iterator containing the second input sequence of data items.
d_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the result of the transformation.
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the binary operation to apply to each pair of items from the input sequences.
- Returns:
A callable object that performs the transformation.
- cuda.compute.algorithms.histogram_even(d_samples, d_histogram, num_output_levels, lower_level, upper_level, num_samples, stream=None)#
Performs device-wide histogram computation with evenly-spaced bins.
This function automatically handles temporary storage allocation and execution.
Example
Below, histogram_even is used to compute a histogram with evenly-spaced bins; a sketch follows the parameter list.
- Parameters:
d_samples (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data samples
d_histogram (DeviceArrayLike) – Device array to store the computed histogram
num_output_levels (int) – Number of histogram bin levels (num_bins = num_output_levels - 1)
lower_level (floating | integer) – Lower sample value bound (inclusive)
upper_level (floating | integer) – Upper sample value bound (exclusive)
num_samples (int) – Number of input samples
stream – CUDA stream for the operation (optional)
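A sketch, assuming CuPy device arrays and that plain Python numbers are accepted for the level bounds:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import histogram_even

    d_samples = cp.array([0.1, 2.5, 3.0, 6.2, 7.1, 9.9], dtype=np.float32)
    num_levels = 6                                   # 5 bins of width 2 over [0, 10)
    d_histogram = cp.zeros(num_levels - 1, dtype=np.int32)

    histogram_even(d_samples, d_histogram, num_levels, 0.0, 10.0, len(d_samples))
    # d_histogram -> [1, 2, 0, 2, 1]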
- cuda.compute.algorithms.make_histogram_even(d_samples, d_histogram, h_num_output_levels, h_lower_level, h_upper_level, num_samples)#
Implements a device-wide histogram that places d_samples into evenly-spaced bins.
Example
Below, make_histogram_even is used to create a histogram object that can be reused (see the make_reduce_into sketch above for the general calling pattern).
- Parameters:
d_samples (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input samples to be histogrammed
d_histogram (DeviceArrayLike) – Device array to store the histogram
h_num_output_levels (ndarray) – Host array containing the number of output levels
h_lower_level (ndarray) – Host array containing the lower level
h_upper_level (ndarray) – Host array containing the upper level
num_samples (int) – Number of samples to be histogrammed
- Returns:
A callable object that can be used to perform the histogram
- cuda.compute.algorithms.merge_sort(d_in_keys, d_in_items, d_out_keys, d_out_items, op, num_items, stream=None)#
Performs device-wide merge sort.
This function automatically handles temporary storage allocation and execution.
Example
Below, merge_sort is used to sort a sequence of keys in place; it also rearranges the items according to the keys’ order. A sketch follows the parameter list.
- Parameters:
d_in_keys (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of keys
d_in_items (DeviceArrayLike | IteratorBase | None) – Device array or iterator containing the input sequence of items (optional)
d_out_keys (DeviceArrayLike) – Device array to store the sorted keys
d_out_items (DeviceArrayLike | None) – Device array to store the sorted items (optional)
op (Callable | cuda.compute._bindings.OpKind) – Comparison operator for sorting
num_items (int) – Number of items to sort
stream – CUDA stream for the operation (optional)
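A sketch, assuming CuPy device arrays; the comparator convention (a less-than predicate returning a small integer) is an assumption, not confirmed by this page:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import merge_sort

    def less_than(a, b):
        return np.uint8(a < b)   # assumed less-than comparator for ascending order

    d_keys = cp.array([3, 1, 2], dtype=np.int32)
    d_items = cp.array([30, 10, 20], dtype=np.int32)

    # Passing the same arrays for input and output sorts in place.
    merge_sort(d_keys, d_items, d_keys, d_items, less_than, len(d_keys))
    # d_keys -> [1, 2, 3], d_items -> [10, 20, 30]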
- cuda.compute.algorithms.make_merge_sort(d_in_keys, d_in_items, d_out_keys, d_out_items, op)#
Implements a device-wide merge sort using d_in_keys and the comparison operator op.
Example
Below, make_merge_sort is used to create a merge sort object that can be reused (see the make_reduce_into sketch above for the general calling pattern).
- Parameters:
d_in_keys (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input keys to be sorted
d_in_items (DeviceArrayLike | IteratorBase | None) – Optional device array or iterator that contains each key’s corresponding item
d_out_keys (DeviceArrayLike) – Device array to store the sorted keys
d_out_items (DeviceArrayLike | None) – Device array to store the sorted items
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the comparison operator
- Returns:
A callable object that can be used to perform the merge sort
- cuda.compute.algorithms.radix_sort(d_in_keys, d_out_keys, d_in_values, d_out_values, order, num_items, begin_bit=None, end_bit=None, stream=None)#
Performs device-wide radix sort.
This function automatically handles temporary storage allocation and execution.
Example
Below,
radix_sort
is used to sort a sequence of keys. It also rearranges the values according to the keys’ order.In the following example,
radix_sort
is used to sort a sequence of keys with a ``DoubleBuffer` for reduced temporary storage.- Parameters:
d_in_keys (DeviceArrayLike | DoubleBuffer) – Device array or DoubleBuffer containing the input sequence of keys
d_out_keys (DeviceArrayLike | None) – Device array to store the sorted keys (optional)
d_in_values (DeviceArrayLike | DoubleBuffer | None) – Device array or DoubleBuffer containing the input sequence of values (optional)
d_out_values (DeviceArrayLike | None) – Device array to store the sorted values (optional)
order (SortOrder) – Sort order (ascending or descending)
num_items (int) – Number of items to sort
begin_bit (int | None) – Beginning bit position for comparison (optional)
end_bit (int | None) – Ending bit position for comparison (optional)
stream – CUDA stream for the operation (optional)
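A keys-only sketch, assuming CuPy device arrays; the import location and member name of SortOrder are assumptions:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import radix_sort
    from cuda.compute import SortOrder   # assumed import location

    d_in_keys = cp.array([3, 1, 2], dtype=np.int32)
    d_out_keys = cp.empty_like(d_in_keys)

    # Keys-only ascending sort; pass None for the optional value arrays.
    radix_sort(d_in_keys, d_out_keys, None, None,
               SortOrder.ASCENDING,      # assumed member name
               len(d_in_keys))
    # d_out_keys -> [1, 2, 3]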
- cuda.compute.algorithms.make_radix_sort(d_in_keys, d_out_keys, d_in_values, d_out_values, order)#
Implements a device-wide radix sort using d_in_keys in the requested order.
Example
Below, make_radix_sort is used to create a radix sort object that can be reused (see the make_reduce_into sketch above for the general calling pattern).
- Parameters:
d_in_keys (DeviceArrayLike | DoubleBuffer) – Device array or DoubleBuffer containing the input keys to be sorted
d_out_keys (DeviceArrayLike | None) – Device array to store the sorted keys
d_in_values (DeviceArrayLike | DoubleBuffer | None) – Optional device array or DoubleBuffer containing the input values to be sorted
d_out_values (DeviceArrayLike | None) – Device array to store the sorted values
order (SortOrder) – Sort order (ascending or descending)
- Returns:
A callable object that can be used to perform the radix sort
- cuda.compute.algorithms.segmented_reduce(d_in, d_out, start_offsets_in, end_offsets_in, op, h_init, num_segments, stream=None)#
Performs device-wide segmented reduction.
This function automatically handles temporary storage allocation and execution.
Example
Below, segmented_reduce is used to compute the minimum value of segments in a sequence of integers; a sketch follows the parameter list.
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_out (DeviceArrayLike | IteratorBase) – Device array to store the result of the reduction for each segment
start_offsets_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the sequence of beginning offsets
end_offsets_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the sequence of ending offsets
op (Callable | cuda.compute._bindings.OpKind) – Binary reduction operator
h_init (ndarray) – NumPy array storing the initial value of the reduction
num_segments (int) – Number of segments to reduce
stream – CUDA stream for the operation (optional)
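A sketch of a segmented minimum, assuming CuPy device arrays:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import segmented_reduce

    def min_op(a, b):
        return a if a < b else b

    d_in = cp.array([8, 6, 7, 5, 3, 0, 9], dtype=np.int32)
    # Two segments: [8, 6, 7] and [5, 3, 0, 9]
    start_offsets = cp.array([0, 3], dtype=np.int64)
    end_offsets = cp.array([3, 7], dtype=np.int64)
    d_out = cp.empty(2, dtype=np.int32)
    h_init = np.array([np.iinfo(np.int32).max], dtype=np.int32)  # identity for min

    segmented_reduce(d_in, d_out, start_offsets, end_offsets, min_op, h_init, 2)
    # d_out -> [6, 0]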
- cuda.compute.algorithms.make_segmented_reduce(d_in, d_out, start_offsets_in, end_offsets_in, op, h_init)#
Computes a device-wide segmented reduction using the specified binary op and initial value h_init.
Example
Below, make_segmented_reduce is used to create a segmented reduction object that can be reused (see the make_reduce_into sketch above for the general calling pattern).
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_out (DeviceArrayLike | IteratorBase) – Device array that will store the result of the reduction
start_offsets_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing offsets to start of segments
end_offsets_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing offsets to end of segments
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the binary operator to apply
h_init (ndarray) – NumPy array storing the initial value of the reduction
- Returns:
A callable object that can be used to perform the reduction
- cuda.compute.algorithms.unique_by_key(d_in_keys, d_in_items, d_out_keys, d_out_items, d_out_num_selected, op, num_items, stream=None)#
Performs device-wide unique by key operation using the single-phase API.
This function automatically handles temporary storage allocation and execution.
Example
Below, unique_by_key is used to populate the arrays of output keys and items with the first key and its corresponding item from each sequence of equal keys. It also outputs the number of items selected. A sketch follows the parameter list.
- Parameters:
d_in_keys (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of keys
d_in_items (DeviceArrayLike | IteratorBase) – Device array or iterator that contains each key’s corresponding item
d_out_keys (DeviceArrayLike | IteratorBase) – Device array or iterator to store the outputted keys
d_out_items (DeviceArrayLike | IteratorBase) – Device array or iterator to store each outputted key’s item
d_out_num_selected (DeviceArrayLike) – Device array to store how many items were selected
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the equality operator
num_items (int) – Number of items to process
stream – CUDA stream for the operation (optional)
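A sketch, assuming CuPy device arrays and a plain Python equality predicate:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import unique_by_key

    def keys_equal(a, b):
        return a == b

    d_in_keys = cp.array([1, 1, 2, 2, 2, 3], dtype=np.int32)
    d_in_items = cp.array([10, 11, 20, 21, 22, 30], dtype=np.int32)
    d_out_keys = cp.empty_like(d_in_keys)
    d_out_items = cp.empty_like(d_in_items)
    d_num_selected = cp.empty(1, dtype=np.int32)

    unique_by_key(d_in_keys, d_in_items, d_out_keys, d_out_items,
                  d_num_selected, keys_equal, len(d_in_keys))
    # d_num_selected -> [3]; d_out_keys[:3] -> [1, 2, 3]; d_out_items[:3] -> [10, 20, 30]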
- cuda.compute.algorithms.make_unique_by_key(d_in_keys, d_in_items, d_out_keys, d_out_items, d_out_num_selected, op)#
Implements a device-wide unique by key operation using d_in_keys and the equality operator op. Only the first key and its value from each run of equal keys are selected, and the total number of selected items is also reported.
Example
Below, make_unique_by_key is used to create a unique by key object that can be reused (see the make_reduce_into sketch above for the general calling pattern).
- Parameters:
d_in_keys (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of keys
d_in_items (DeviceArrayLike | IteratorBase) – Device array or iterator that contains each key’s corresponding item
d_out_keys (DeviceArrayLike | IteratorBase) – Device array or iterator to store the outputted keys
d_out_items (DeviceArrayLike | IteratorBase) – Device array or iterator to store each outputted key’s item
d_out_num_selected (DeviceArrayLike) – Device array to store how many items were selected
op (Callable | cuda.compute._bindings.OpKind) – Callable or OpKind representing the equality operator
- Returns:
A callable object that can be used to perform unique by key
- cuda.compute.algorithms.three_way_partition(d_in, d_first_part_out, d_second_part_out, d_unselected_out, d_num_selected_out, select_first_part_op, select_second_part_op, num_items, stream=None)#
Performs device-wide three-way partition. Given an input sequence of data items, it partitions the items into three parts:
- The first part is selected by the select_first_part_op operator.
- The second part is selected by the select_second_part_op operator.
- The unselected items are not selected by either operator.
This function automatically handles temporary storage allocation and execution.
Example
Below, three_way_partition is used to partition a sequence of integers into three parts; a sketch follows the parameter list.
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_first_part_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the first part of the output
d_second_part_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the second part of the output
d_unselected_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the unselected items
d_num_selected_out (DeviceArrayLike | IteratorBase) – Device array to store the number of items selected. The number of items selected by select_first_part_op and by select_second_part_op is stored in d_num_selected_out[0] and d_num_selected_out[1], respectively.
select_first_part_op (Callable) – Callable representing the unary operator to select the first part
select_second_part_op (Callable) – Callable representing the unary operator to select the second part
num_items (int) – Number of items to partition
stream – CUDA stream for the operation (optional)
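A sketch, assuming CuPy device arrays and plain Python predicates:

    import cupy as cp
    import numpy as np
    from cuda.compute.algorithms import three_way_partition

    def is_small(x):
        return x < 4     # selects the first part

    def is_large(x):
        return x > 6     # selects the second part

    d_in = cp.array([0, 5, 9, 2, 7, 4], dtype=np.int32)
    d_first = cp.empty_like(d_in)
    d_second = cp.empty_like(d_in)
    d_unselected = cp.empty_like(d_in)
    d_num_selected = cp.empty(2, dtype=np.int32)

    three_way_partition(d_in, d_first, d_second, d_unselected,
                        d_num_selected, is_small, is_large, len(d_in))
    # d_num_selected -> [2, 2]; d_first[:2] -> [0, 2]; d_second[:2] -> [9, 7]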
- cuda.compute.algorithms.make_three_way_partition(d_in, d_first_part_out, d_second_part_out, d_unselected_out, d_num_selected_out, select_first_part_op, select_second_part_op)#
Computes a device-wide three-way partition using the specified unary select_first_part_op and select_second_part_op operators.
Example
Below, make_three_way_partition is used to create a three-way partition object that can be reused (see the make_reduce_into sketch above for the general calling pattern).
- Parameters:
d_in (DeviceArrayLike | IteratorBase) – Device array or iterator containing the input sequence of data items
d_first_part_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the first part of the output
d_second_part_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the second part of the output
d_unselected_out (DeviceArrayLike | IteratorBase) – Device array or iterator to store the unselected items
d_num_selected_out (DeviceArrayLike | IteratorBase) – Device array to store the number of items selected. The number of items selected by select_first_part_op and by select_second_part_op is stored in d_num_selected_out[0] and d_num_selected_out[1], respectively.
select_first_part_op (Callable) – Callable representing the unary operator to select the first part
select_second_part_op (Callable) – Callable representing the unary operator to select the second part
- Returns:
A callable object that can be used to perform the three-way partition
Iterators#
- cuda.compute.iterators.CacheModifiedInputIterator(device_array, modifier)#
Random Access Cache Modified Iterator that wraps a native device pointer.
Similar to https://nvidia.github.io/cccl/cub/api/classcub_1_1CacheModifiedInputIterator.html
Currently the only supported modifier is “stream” (LOAD_CS).
Example
The code snippet below demonstrates the usage of a CacheModifiedInputIterator.
- Parameters:
device_array – Array storing the input sequence of data items
modifier – The PTX cache load modifier
- Returns:
A CacheModifiedInputIterator object initialized with device_array
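A sketch (not the original snippet), assuming CuPy device arrays and feeding the iterator into reduce_into:

    import cupy as cp
    import numpy as np
    from cuda.compute.iterators import CacheModifiedInputIterator
    from cuda.compute.algorithms import reduce_into

    def add(a, b):
        return a + b

    d_in = cp.array([1, 2, 3], dtype=np.int32)
    streamed_in = CacheModifiedInputIterator(d_in, modifier="stream")  # LOAD_CS loads
    d_out = cp.empty(1, dtype=np.int32)
    h_init = np.array([0], dtype=np.int32)

    reduce_into(streamed_in, d_out, add, len(d_in), h_init)
    # d_out -> [6]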
- cuda.compute.iterators.ConstantIterator(value)#
Returns an Iterator representing a sequence of constant values.
Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1constant__iterator.html
Example
The code snippet below demonstrates the usage of a ConstantIterator representing a sequence of constant values.
- Parameters:
value – The value of every item in the sequence
- Returns:
A ConstantIterator object initialized to value
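A sketch (not the original snippet); passing a NumPy scalar to fix the value type is an assumption:

    import cupy as cp
    import numpy as np
    from cuda.compute.iterators import ConstantIterator
    from cuda.compute.algorithms import reduce_into

    def add(a, b):
        return a + b

    ones = ConstantIterator(np.int32(1))     # 1, 1, 1, ...
    d_out = cp.empty(1, dtype=np.int32)
    h_init = np.array([0], dtype=np.int32)

    # Iterators have no length, so num_items must be given explicitly.
    reduce_into(ones, d_out, add, 10, h_init)
    # d_out -> [10]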
- cuda.compute.iterators.CountingIterator(offset)#
Returns an Iterator representing a sequence of incrementing values.
Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1counting__iterator.html
Example
The code snippet below demonstrates the usage of a CountingIterator representing the sequence [10, 11, 12].
- Parameters:
offset – The initial value of the sequence
- Returns:
A CountingIterator object initialized to offset
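A sketch (not the original snippet); passing a NumPy scalar to fix the value type is an assumption:

    import cupy as cp
    import numpy as np
    from cuda.compute.iterators import CountingIterator
    from cuda.compute.algorithms import reduce_into

    def add(a, b):
        return a + b

    first = CountingIterator(np.int32(10))   # 10, 11, 12, ...
    d_out = cp.empty(1, dtype=np.int32)
    h_init = np.array([0], dtype=np.int32)

    reduce_into(first, d_out, add, 3, h_init)
    # sums [10, 11, 12] -> d_out is [33]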
- cuda.compute.iterators.ReverseIterator(sequence)#
Returns an Iterator over an array or another iterator in reverse.
Similar to std::reverse_iterator (https://en.cppreference.com/w/cpp/iterator/reverse_iterator).
Examples
The code snippet below demonstrates the usage of a ReverseIterator as an input iterator; it can be used as an output iterator in the same way.
- Parameters:
sequence – The iterator or array to be reversed
- Returns:
A ReverseIterator object
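A sketch (not the original snippet), copying a sequence back-to-front through a ReverseIterator input:

    import cupy as cp
    import numpy as np
    from cuda.compute.iterators import ReverseIterator
    from cuda.compute.algorithms import unary_transform

    def identity(x):
        return x

    d_in = cp.array([1, 2, 3, 4], dtype=np.int32)
    d_out = cp.empty_like(d_in)

    unary_transform(ReverseIterator(d_in), d_out, identity, len(d_in))
    # d_out -> [4, 3, 2, 1]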
- cuda.compute.iterators.TransformIterator(it, op)#
An iterator that applies a user-defined unary function to the elements of an underlying iterator as they are read.
Similar to thrust::transform_iterator (https://nvidia.github.io/cccl/thrust/api/classthrust_1_1transform__iterator.html).
Example
The code snippet below demonstrates the usage of a TransformIterator composed with a CountingIterator to transform the input before performing a reduction.
- Parameters:
it – The underlying iterator
op – The unary operation to be applied to values as they are read from it
- Returns:
A TransformIterator object to transform the items in it using op
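A sketch of that composition (not the original snippet):

    import cupy as cp
    import numpy as np
    from cuda.compute.iterators import CountingIterator, TransformIterator
    from cuda.compute.algorithms import reduce_into

    def square(x):
        return x * x

    def add(a, b):
        return a + b

    it = TransformIterator(CountingIterator(np.int32(1)), square)  # 1, 4, 9, ...
    d_out = cp.empty(1, dtype=np.int32)
    h_init = np.array([0], dtype=np.int32)

    reduce_into(it, d_out, add, 3, h_init)
    # 1 + 4 + 9 -> d_out is [14]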
- cuda.compute.iterators.TransformOutputIterator(it, op)#
An iterator that applies a user-defined unary function to values before writing them to an underlying iterator.
Similar to thrust::transform_output_iterator (https://nvidia.github.io/cccl/thrust/api/classthrust_1_1transform__output__iterator.html).
Example
The code snippet below demonstrates the usage of a TransformOutputIterator to transform the output of a reduction before writing to an output array.
- Parameters:
it – The underlying iterator
op – The operation to be applied to values before they are written to it
- Returns:
A TransformOutputIterator object that applies op to transform values before writing them to it
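A sketch of that usage (not the original snippet), assuming CuPy device arrays:

    import cupy as cp
    import numpy as np
    from cuda.compute.iterators import TransformOutputIterator
    from cuda.compute.algorithms import reduce_into

    def add(a, b):
        return a + b

    def halve(x):
        return x // 2            # applied to the result as it is written

    d_in = cp.array([1, 2, 3, 4], dtype=np.int32)
    d_out = cp.empty(1, dtype=np.int32)
    h_init = np.array([0], dtype=np.int32)

    reduce_into(d_in, TransformOutputIterator(d_out, halve), add, len(d_in), h_init)
    # reduction yields 10; halve is applied on write -> d_out is [5]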
- cuda.compute.iterators.ZipIterator(*iterators)#
Returns an Iterator representing a zipped sequence of values from N iterators.
Similar to https://nvidia.github.io/cccl/thrust/api/classthrust_1_1zip__iterator.html
The resulting iterator yields gpu_struct objects with fields corresponding to each input iterator. For 2 iterators, fields are named ‘first’ and ‘second’. For N iterators, fields are indexed as field_0, field_1, …, field_N-1.
Example
The code snippet below demonstrates the usage of a ZipIterator combining two device arrays.
- Parameters:
*iterators – Variable number of iterators to zip (at least 1)
- Returns:
A ZipIterator object that yields combined values from all input iterators
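A sketch (not the original snippet): with two inputs, each zipped element exposes 'first' and 'second' fields, as documented above.

    import cupy as cp
    import numpy as np
    from cuda.compute.iterators import ZipIterator
    from cuda.compute.algorithms import unary_transform

    def sum_pair(pair):
        return pair.first + pair.second   # two-iterator case: 'first'/'second'

    d_a = cp.array([1, 2, 3], dtype=np.int32)
    d_b = cp.array([10, 20, 30], dtype=np.int32)
    d_out = cp.empty_like(d_a)

    unary_transform(ZipIterator(d_a, d_b), d_out, sum_pair, len(d_a))
    # d_out -> [11, 22, 33]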
Operators#
- class cuda.compute.op.OpKind#
Enumeration of operator kinds for CUDA parallel algorithms.
This enum defines the types of operations that can be performed in parallel algorithms, including arithmetic, logical, and bitwise operations.
- STATELESS#
- STATEFUL#
- PLUS#
- MINUS#
- MULTIPLIES#
- DIVIDES#
- MODULUS#
- EQUAL_TO#
- NOT_EQUAL_TO#
- GREATER#
- LESS#
- GREATER_EQUAL#
- LESS_EQUAL#
- LOGICAL_AND#
- LOGICAL_OR#
- LOGICAL_NOT#
- BIT_AND#
- BIT_OR#
- BIT_XOR#
- BIT_NOT#
- NEGATE#
Utilities#
- cuda.compute.struct.gpu_struct_from_numba_types(name, field_names, field_types)#
Create a struct type from tuples of field names and numba types.
- cuda.compute.struct.gpu_struct(this)#
Decorate a class as a GPU struct.
A GpuStruct represents a value composed of one or more other values, and is defined as a class with annotated fields (similar to a dataclass). The type of each field must be a subclass of np.number, like np.int32 or np.float64.
Arrays of GpuStruct objects can be used as inputs to cuda.compute algorithms.
Example
The code snippet below shows how to use gpu_struct to define a MinMax type (composed of min_val and max_val fields), and perform a reduction on an input array of floating-point values to compute the smallest and largest absolute values of its elements.
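A sketch of that example; the MinMax(...) constructor and MinMax.dtype attribute are assumed from the dataclass-like behavior described above, not confirmed by this page:

    import cupy as cp
    import numpy as np
    from cuda.compute.struct import gpu_struct
    from cuda.compute.iterators import TransformIterator
    from cuda.compute.algorithms import reduce_into

    @gpu_struct
    class MinMax:
        min_val: np.float64
        max_val: np.float64

    def to_minmax(x):
        ax = abs(x)
        return MinMax(ax, ax)    # view each element as a (|x|, |x|) pair

    def minmax_op(a, b):
        return MinMax(min(a.min_val, b.min_val), max(a.max_val, b.max_val))

    d_in = cp.array([-3.0, 1.5, -0.25, 2.0], dtype=np.float64)
    it = TransformIterator(d_in, to_minmax)
    d_out = cp.empty(1, dtype=MinMax.dtype)   # assumed: the struct exposes a numpy dtype
    h_init = MinMax(np.inf, -np.inf)

    reduce_into(it, d_out, minmax_op, len(d_in), h_init)
    # d_out[0] -> (0.25, 3.0)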
- cuda.compute.struct.gpu_struct_from_numpy_dtype(name, np_dtype)#
Create a GPU struct from a numpy record dtype.