cuda.compute API Reference#

Warning

cuda.compute is in public beta. The API is subject to change without notice.

Algorithms#

cuda.compute.algorithms.reduce_into(
d_in: DeviceArrayLike | IteratorT,
d_out: DeviceArrayLike | IteratorT,
op: Operator,
num_items: int,
h_init: ndarray | GpuStruct,
stream=None,
**kwargs,
)#

Performs device-wide reduction.

This function automatically handles temporary storage allocation and execution.

Example

Below, reduce_into is used to compute the sum of a sequence of integers.

"""
Sum all values in an array using reduction with PLUS operation.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import OpKind

# Prepare the input and output arrays.
dtype = np.int32
h_init = np.array([0], dtype=dtype)
d_input = cp.array([1, 2, 3, 4, 5], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)

# Perform the reduction.
cuda.compute.reduce_into(d_input, d_output, OpKind.PLUS, len(d_input), h_init)

# Verify the result.
expected_output = 15
assert (d_output == expected_output).all()
result = d_output[0]
print(f"Sum reduction result: {result}")
Parameters:
  • d_in – Device array or iterator containing the input sequence of data items

  • d_out – Device array to store the result of the reduction

  • op – Binary operator to apply. The signature is (T, T) -> T, where T is the data type of the initial value h_init.

  • num_items – Number of items to reduce

  • h_init – Initial value for the reduction

  • stream – CUDA stream for the operation (optional)
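
Custom Python callables can also be passed as op. Below is a minimal sketch, following the pattern of the segmented_reduce example later on this page, that computes a maximum with a user-defined operator; max_op and the sample values are illustrative, not part of the library API.

"""
Sketch: maximum reduction with a user-defined operator (illustrative).
"""

import cupy as cp
import numpy as np

import cuda.compute


# User-defined binary operator; assumed to follow the (T, T) -> T signature.
def max_op(a, b):
    return a if a > b else b


dtype = np.int32
# Seed the reduction with the smallest representable value so any input wins.
h_init = np.array([np.iinfo(dtype).min], dtype=dtype)
d_input = cp.array([3, -1, 7, 2], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)

cuda.compute.reduce_into(d_input, d_output, max_op, len(d_input), h_init)

assert d_output.get()[0] == 7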

cuda.compute.algorithms.make_reduce_into(
d_in: DeviceArrayLike | IteratorT,
d_out: DeviceArrayLike | IteratorT,
op: Operator,
h_init: ndarray | GpuStruct,
**kwargs,
)#

Computes a device-wide reduction using the specified binary op and initial value h_init.

Example

Below, make_reduce_into is used to create a reduction object that can be reused.

"""
Reduction example using the object API.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    OpKind,
)

# Prepare the input and output arrays.
dtype = np.int32
init_value = 5
h_init = np.array([init_value], dtype=dtype)
h_input = np.array([1, 2, 3, 4], dtype=dtype)
d_input = cp.asarray(h_input)
d_output = cp.empty(1, dtype=dtype)

# Create a reducer object.
reducer = cuda.compute.make_reduce_into(d_input, d_output, OpKind.PLUS, h_init)

# Get the temporary storage size.
temp_storage_size = reducer(None, d_input, d_output, OpKind.PLUS, len(h_input), h_init)

# Allocate temporary storage using any user-defined allocator.
# The result must be an object exposing `__cuda_array_interface__`.
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Perform the reduction.
reducer(d_temp_storage, d_input, d_output, OpKind.PLUS, len(h_input), h_init)

expected_result = np.sum(h_input) + init_value
actual_result = d_output.get()[0]
assert actual_result == expected_result
print("Reduce object example completed successfully")
Parameters:
  • d_in – Device array or iterator containing the input sequence of data items

  • d_out – Device array (of size 1) that will store the result of the reduction

  • op – Binary operator to apply. The signature is (T, T) -> T, where T is the data type of the initial value h_init.

  • h_init – NumPy array storing the initial value of the reduction

Returns:

A callable object that can be used to perform the reduction
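
None of the code above actually reuses the object, so here is a hedged sketch continuing directly from the example above (it reuses reducer, temp_storage_size, d_temp_storage, d_output, h_init, dtype, and init_value). Whether the required temporary storage size varies with num_items is not specified here, so the sketch re-queries it defensively.

# Sketch: reusing the reducer from the example above on new data.
d_input2 = cp.array([10, 20, 30, 40], dtype=dtype)

# Re-query the temporary storage size in case it depends on num_items.
new_size = reducer(None, d_input2, d_output, OpKind.PLUS, len(d_input2), h_init)
if new_size > temp_storage_size:
    d_temp_storage = cp.empty(new_size, dtype=np.uint8)

reducer(d_temp_storage, d_input2, d_output, OpKind.PLUS, len(d_input2), h_init)
assert d_output.get()[0] == 10 + 20 + 30 + 40 + init_value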

cuda.compute.algorithms.lower_bound(
d_data: DeviceArrayLike,
d_values: DeviceArrayLike | IteratorT,
d_out: DeviceArrayLike,
num_items: int,
num_values: int,
comp: Operator | None = None,
stream=None,
)#

Find the first position at which each value in d_values could be inserted into d_data while maintaining sorted order.

Example

import cupy as cp
import numpy as np

import cuda.compute

h_data = np.array([1, 3, 3, 5, 7, 9], dtype=np.int32)
h_values = np.array([0, 3, 4, 10], dtype=np.int32)

d_data = cp.asarray(h_data)
d_values = cp.asarray(h_values)
d_out = cp.empty(len(h_values), dtype=np.uintp)

cuda.compute.lower_bound(d_data, d_values, d_out, len(d_data), len(d_values))

expected = np.searchsorted(h_data, h_values, side="left").astype(np.uintp)
got = cp.asnumpy(d_out)

assert np.array_equal(got, expected)
Parameters:
  • d_data – Device array containing the sorted input range.

  • d_values – Device array or iterator containing the search values.

  • d_out – Device array to store the index results.

  • num_items – Number of items in d_data.

  • num_values – Number of items in d_values.

  • comp – Optional comparison operator (default: OpKind.LESS).

  • stream – CUDA stream for the operation (optional).
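
comp can be replaced with a user-defined comparator, for example to search data sorted in descending order. The sketch below assumes comp follows the same boolean/int8 comparator convention as the sort operators; the greater function is illustrative.

"""
Sketch: lower_bound on descending data with a user-defined comparator.
"""

import cupy as cp
import numpy as np

import cuda.compute


# Illustrative comparator for descending order; assumed (T, T) -> bool.
def greater(a, b):
    return a > b


h_data = np.array([9, 7, 5, 3, 1], dtype=np.int32)
h_values = np.array([8, 3, 0], dtype=np.int32)

d_data = cp.asarray(h_data)
d_values = cp.asarray(h_values)
d_out = cp.empty(len(h_values), dtype=np.uintp)

cuda.compute.lower_bound(d_data, d_values, d_out, len(h_data), len(h_values), greater)

# Equivalent to searchsorted on the ascending (reversed) view.
expected = (len(h_data) - np.searchsorted(h_data[::-1], h_values, side="right")).astype(
    np.uintp
)
assert np.array_equal(cp.asnumpy(d_out), expected)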

cuda.compute.algorithms.make_lower_bound(
d_data: DeviceArrayLike,
d_values: DeviceArrayLike | IteratorT,
d_out: DeviceArrayLike,
comp: Operator | None = None,
)#

Create a lower_bound object that can be called to find insertion positions.

Example

import cupy as cp
import numpy as np

import cuda.compute

h_data = np.array([1, 3, 3, 5, 7, 9], dtype=np.int32)
h_values = np.array([0, 3, 4, 10], dtype=np.int32)

d_data = cp.asarray(h_data)
d_values = cp.asarray(h_values)
d_out = cp.empty(len(h_values), dtype=np.uintp)

searcher = cuda.compute.make_lower_bound(d_data, d_values, d_out)
searcher(d_data, d_values, d_out, None, len(d_data), len(d_values))

expected = np.searchsorted(h_data, h_values, side="left").astype(np.uintp)
got = cp.asnumpy(d_out)

assert np.array_equal(got, expected)
Parameters:
  • d_data – Device array containing the sorted input range.

  • d_values – Device array or iterator containing the search values.

  • d_out – Device array to store the index results.

  • comp – Optional comparison operator (default: OpKind.LESS).

Returns:

A callable object that performs lower_bound.

See also

lower_bound()

cuda.compute.algorithms.upper_bound(
d_data: DeviceArrayLike,
d_values: DeviceArrayLike | IteratorT,
d_out: DeviceArrayLike,
num_items: int,
num_values: int,
comp: Operator | None = None,
stream=None,
)#

Find the last position at which each value in d_values could be inserted into d_data while maintaining sorted order.

Example

import cupy as cp
import numpy as np

import cuda.compute

h_data = np.array([1, 3, 3, 5, 7, 9], dtype=np.int32)
h_values = np.array([0, 3, 4, 10], dtype=np.int32)

d_data = cp.asarray(h_data)
d_values = cp.asarray(h_values)
d_out = cp.empty(len(h_values), dtype=np.uintp)

cuda.compute.upper_bound(d_data, d_values, d_out, len(d_data), len(d_values))

expected = np.searchsorted(h_data, h_values, side="right").astype(np.uintp)
got = cp.asnumpy(d_out)

assert np.array_equal(got, expected)
Parameters:
  • d_data – Device array containing the sorted input range.

  • d_values – Device array or iterator containing the search values.

  • d_out – Device array to store the index results.

  • num_items – Number of items in d_data.

  • num_values – Number of items in d_values.

  • comp – Optional comparison operator (default: OpKind.LESS).

  • stream – CUDA stream for the operation (optional).

cuda.compute.algorithms.make_upper_bound(
d_data: DeviceArrayLike,
d_values: DeviceArrayLike | IteratorT,
d_out: DeviceArrayLike,
comp: Operator | None = None,
)#

Create an upper_bound object that can be called to find insertion positions.

Example

import cupy as cp
import numpy as np

import cuda.compute

h_data = np.array([1, 3, 3, 5, 7, 9], dtype=np.int32)
h_values = np.array([0, 3, 4, 10], dtype=np.int32)

d_data = cp.asarray(h_data)
d_values = cp.asarray(h_values)
d_out = cp.empty(len(h_values), dtype=np.uintp)

searcher = cuda.compute.make_upper_bound(d_data, d_values, d_out)
searcher(d_data, d_values, d_out, None, len(d_data), len(d_values))

expected = np.searchsorted(h_data, h_values, side="right").astype(np.uintp)
got = cp.asnumpy(d_out)

assert np.array_equal(got, expected)
Parameters:
  • d_data – Device array containing the sorted input range.

  • d_values – Device array or iterator containing the search values.

  • d_out – Device array to store the index results.

  • comp – Optional comparison operator (default: OpKind.LESS).

Returns:

A callable object that performs upper_bound.

See also

upper_bound()

cuda.compute.algorithms.inclusive_scan(
d_in: DeviceArrayLike | IteratorT,
d_out: DeviceArrayLike | IteratorT,
op: Operator,
init_value: ndarray | DeviceArrayLike | GpuStruct | None,
num_items: int,
stream=None,
)#

Performs device-wide inclusive scan.

This function automatically handles temporary storage allocation and execution.

Example

Below, inclusive_scan is used to compute an inclusive scan (prefix sum).

"""
Inclusive scan with custom operation (prefix sum of even values).
"""

import cupy as cp
import numpy as np

import cuda.compute

# Prepare the input and output arrays.
h_init = np.array([0], dtype="int32")
d_input = cp.array([1, 2, 3, 4, 5], dtype="int32")
d_output = cp.empty_like(d_input, dtype="int32")

# Define the binary operation for the scan.


def add_op(a, b):
    return (a if a % 2 == 0 else 0) + (b if b % 2 == 0 else 0)


# Perform the inclusive scan.
cuda.compute.inclusive_scan(d_input, d_output, add_op, h_init, d_input.size)

# Verify the result.
expected = np.asarray([0, 2, 2, 6, 6])
assert np.array_equal(d_output.get(), expected)
result = d_output.get()
print(f"Inclusive scan custom result: {result}")
Parameters:
  • d_in – Device array or iterator containing the input sequence of data items

  • d_out – Device array or iterator to store the result of the scan

  • op – Binary scan operator. The signature is (T, T) -> T, where T is the data type of the initial value init_value.

  • init_value – Initial value for the scan

  • num_items – Number of items to scan

  • stream – CUDA stream for the operation (optional)
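
For a plain prefix sum, the built-in OpKind.PLUS operator can be used instead of a custom callable; a minimal sketch:

"""
Sketch: plain prefix sum with OpKind.PLUS.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import OpKind

h_init = np.array([0], dtype=np.int32)
d_input = cp.array([1, 2, 3, 4, 5], dtype=np.int32)
d_output = cp.empty_like(d_input)

cuda.compute.inclusive_scan(d_input, d_output, OpKind.PLUS, h_init, d_input.size)

assert np.array_equal(d_output.get(), np.array([1, 3, 6, 10, 15], dtype=np.int32))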

cuda.compute.algorithms.make_inclusive_scan(
d_in: DeviceArrayLike | IteratorT,
d_out: DeviceArrayLike | IteratorT,
op: Operator,
init_value: ndarray | DeviceArrayLike | GpuStruct | None,
)#

Computes a device-wide scan using the specified binary op and initial value init_value.

Example

Below, make_inclusive_scan is used to create an inclusive scan object that can be reused.

"""
Inclusive scan example demonstrating the object API.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    OpKind,
)

# Prepare the input and output arrays.
dtype = np.int32
h_init = np.array([0], dtype=dtype)
h_input = np.array([1, 2, 3, 4], dtype=dtype)
d_input = cp.asarray(h_input)
d_output = cp.empty(len(h_input), dtype=dtype)

# Create the scanner object and allocate temporary storage.
scanner = cuda.compute.make_inclusive_scan(d_input, d_output, OpKind.PLUS, h_init)
temp_storage_size = scanner(None, d_input, d_output, OpKind.PLUS, len(h_input), h_init)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Perform the inclusive scan.
scanner(d_temp_storage, d_input, d_output, OpKind.PLUS, len(h_input), h_init)

# Verify the result.
expected_result = np.array([1, 3, 6, 10], dtype=dtype)
actual_result = d_output.get()
np.testing.assert_array_equal(actual_result, expected_result)
print("Inclusive scan object example completed successfully")
Parameters:
  • d_in – Device array or iterator containing the input sequence of data items

  • d_out – Device array that will store the result of the scan

  • op – Binary scan operator. The signature is (T, T) -> T, where T is the data type of the initial value init_value.

  • init_value – NumPy array, device array, or GPU struct storing the initial value of the scan, or None for no initial value

Returns:

A callable object that can be used to perform the scan

cuda.compute.algorithms.exclusive_scan(
d_in: DeviceArrayLike | IteratorT,
d_out: DeviceArrayLike | IteratorT,
op: Operator,
init_value: ndarray | DeviceArrayLike | GpuStruct | None,
num_items: int,
stream=None,
)#

Performs device-wide exclusive scan.

This function automatically handles temporary storage allocation and execution.

Example

Below, exclusive_scan is used to compute an exclusive scan with max operation.

"""
Exclusive scan using custom maximum operation.
"""

import cupy as cp
import numpy as np

import cuda.compute

# Define the binary operation for the scan.


def max_op(a, b):
    return max(a, b)


# Prepare the input and output arrays.
h_init = np.array([1], dtype="int32")
d_input = cp.array([-5, 0, 2, -3, 2, 4, 0, -1, 2, 8], dtype="int32")
d_output = cp.empty_like(d_input, dtype="int32")

# Perform the exclusive scan.
cuda.compute.exclusive_scan(d_input, d_output, max_op, h_init, d_input.size)

# Verify the result.
expected = np.asarray([1, 1, 1, 2, 2, 2, 4, 4, 4, 4])
result = d_output.get()

np.testing.assert_equal(result, expected)
print(f"Exclusive scan max result: {result}")
Parameters:
  • d_in – Device array or iterator containing the input sequence of data items

  • d_out – Device array or iterator to store the result of the scan

  • op – Binary scan operator. The signature is (T, T) -> T, where T is the data type of the initial value init_value.

  • init_value – Initial value for the scan

  • num_items – Number of items to scan

  • stream – CUDA stream for the operation (optional)

cuda.compute.algorithms.make_exclusive_scan(
d_in: DeviceArrayLike | IteratorT,
d_out: DeviceArrayLike | IteratorT,
op: Operator,
init_value: ndarray | DeviceArrayLike | GpuStruct | None,
)#

Computes a device-wide scan using the specified binary op and initial value init_value.

Example

Below, make_exclusive_scan is used to create an exclusive scan object that can be reused.

"""
Exclusive scan example demonstrating the object API.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    OpKind,
)

# Prepare the input and output arrays.
dtype = np.int32
h_init = np.array([0], dtype=dtype)
h_input = np.array([1, 2, 3, 4], dtype=dtype)
d_input = cp.asarray(h_input)
d_output = cp.empty(len(h_input), dtype=dtype)

# Create the scanner object and allocate temporary storage.
scanner = cuda.compute.make_exclusive_scan(d_input, d_output, OpKind.PLUS, h_init)
temp_storage_size = scanner(None, d_input, d_output, OpKind.PLUS, len(h_input), h_init)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Perform the exclusive scan.
scanner(d_temp_storage, d_input, d_output, OpKind.PLUS, len(h_input), h_init)

# Verify the result.
expected_result = np.array([0, 1, 3, 6], dtype=dtype)
actual_result = d_output.get()
np.testing.assert_array_equal(actual_result, expected_result)
print("Exclusive scan object example completed successfully")
Parameters:
  • d_in – Device array or iterator containing the input sequence of data items

  • d_out – Device array that will store the result of the scan

  • op – Binary scan operator. The signature is (T, T) -> T, where T is the data type of the initial value init_value.

  • init_value – NumPy array, device array, or GPU struct storing the initial value of the scan, or None for no initial value

Returns:

A callable object that can be used to perform the scan

cuda.compute.algorithms.unary_transform(
d_in: DeviceArrayLike | IteratorT,
d_out: DeviceArrayLike | IteratorT,
op: Operator,
num_items: int,
stream=None,
)#

Performs device-wide unary transform.

This function automatically handles temporary storage allocation and execution.

The op function can reference device arrays as globals or closures; they are automatically captured as state arrays, enabling stateful operations such as counting.

Example

Below, unary_transform is used to apply a transformation to each element of the input.

"""
Example showing how to use unary_transform to apply a unary operation to each element.
"""

import cupy as cp
import numpy as np

import cuda.compute

# Prepare the input and output arrays.
input_data = np.array([1, 2, 3, 4, 5], dtype=np.int32)
d_in = cp.asarray(input_data)
d_out = cp.empty_like(d_in)


# Define the unary operation.
def op(a):
    return a + 1


# Perform the unary transform.
cuda.compute.unary_transform(d_in, d_out, op, len(d_in))

# Verify the result.
result = d_out.get()
expected = input_data + 1

np.testing.assert_array_equal(result, expected)
print(f"Unary transform result: {result}")

When working with custom struct types, you need to provide type annotations to help with type inference. See the binary transform struct example under binary_transform() for reference.
Parameters:
  • d_in – Device array or iterator containing the input sequence of data items.

  • d_out – Device array or iterator to store the result of the transformation.

  • op – Unary operation to apply to each element. The signature is (T) -> U, where T is the input data type and U is the output data type. Can reference device arrays as globals or closures; they are automatically captured.

  • num_items – Number of items to transform.

  • stream – CUDA stream to use for the operation.
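
As noted above, op can capture device arrays as globals or closures. The lookup-table sketch below illustrates that capture; it assumes read-only indexing of the captured array is supported inside op, and the names are illustrative.

"""
Sketch: closure capture of a device array inside op (illustrative).
"""

import cupy as cp
import numpy as np

import cuda.compute

# Device array captured by the closure; per the note above, it is
# turned into a state array automatically.
d_lut = cp.asarray(np.array([100, 200, 300, 400], dtype=np.int32))


def lookup(i):
    return d_lut[i]


d_in = cp.array([2, 0, 3, 1], dtype=np.int32)
d_out = cp.empty_like(d_in)

cuda.compute.unary_transform(d_in, d_out, lookup, len(d_in))

assert np.array_equal(d_out.get(), np.array([300, 100, 400, 200], dtype=np.int32))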

cuda.compute.algorithms.make_unary_transform(
d_in: DeviceArrayLike | IteratorT,
d_out: DeviceArrayLike | IteratorT,
op: Operator,
)#

Create a unary transform object that can be called to apply a transformation to each element of the input according to the unary operation op.

This is the object-oriented API that allows explicit control over temporary storage allocation. For simpler usage, consider using unary_transform().

Example

"""
Unary transform examples demonstrating the object API and well-known operations.
"""

import cupy as cp
import numpy as np

import cuda.compute

# Prepare the input and output arrays.
dtype = np.int32
h_input = np.array([1, 2, 3, 4], dtype=dtype)
d_input = cp.asarray(h_input)
d_output = cp.empty_like(d_input)


# Define the unary operation.
def add_one_op(a):
    return a + 1


# Create the unary transform object.
transformer = cuda.compute.make_unary_transform(d_input, d_output, add_one_op)

# Perform the unary transform.
transformer(d_input, d_output, add_one_op, len(h_input))

# Verify the result.
expected_result = np.array([2, 3, 4, 5], dtype=dtype)
actual_result = d_output.get()
np.testing.assert_array_equal(actual_result, expected_result)
print("Unary transform object example completed successfully")
Parameters:
  • d_in – Device array or iterator containing the input sequence of data items.

  • d_out – Device array or iterator to store the result of the transformation.

  • op – Unary operation to apply to each element. The signature is (T) -> U, where T is the input data type and U is the output data type.

Returns:

A callable object that performs the transformation.

cuda.compute.algorithms.binary_transform(
d_in1: DeviceArrayLike | IteratorT,
d_in2: DeviceArrayLike | IteratorT,
d_out: DeviceArrayLike | IteratorT,
op: Operator,
num_items: int,
stream=None,
)#

Performs device-wide binary transform.

This function automatically handles temporary storage allocation and execution.

Example

Below, binary_transform is used to apply a transformation to pairs of elements from two input sequences.

"""
Example showing how to use binary_transform to perform elementwise addition.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    OpKind,
)

# Prepare the input and output arrays.
input1_data = np.array([1, 2, 3, 4], dtype=np.int32)
input2_data = np.array([10, 20, 30, 40], dtype=np.int32)
d_in1 = cp.asarray(input1_data)
d_in2 = cp.asarray(input2_data)
d_out = cp.empty_like(d_in1)

# Perform the binary transform.
cuda.compute.binary_transform(d_in1, d_in2, d_out, OpKind.PLUS, len(d_in1))

# Verify the result.
result = d_out.get()
expected = input1_data + input2_data

np.testing.assert_array_equal(result, expected)
print(f"Binary transform result: {result}")

When working with custom struct types, you need to provide type annotations to help with type inference. See the following example:

"""
Example demonstrating binary_transform with custom struct types.

When working with struct inputs in transform operations, you need to provide
type annotations to help Numba infer the correct types. Unlike reduce_into
which can infer types from h_init, transform operations require explicit
annotations when using struct inputs.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import gpu_struct


@gpu_struct
class Point2D:
    x: np.float32
    y: np.float32


def add_points(p1: Point2D, p2: Point2D) -> Point2D:
    return Point2D(p1.x + p2.x, p1.y + p2.y)


num_items = 1000

h_in1 = np.empty(num_items, dtype=Point2D.dtype)
h_in1["x"] = np.random.rand(num_items).astype(np.float32)
h_in1["y"] = np.random.rand(num_items).astype(np.float32)

h_in2 = np.empty(num_items, dtype=Point2D.dtype)
h_in2["x"] = np.random.rand(num_items).astype(np.float32)
h_in2["y"] = np.random.rand(num_items).astype(np.float32)

d_in1 = cp.empty_like(h_in1)
d_in1.set(h_in1)

d_in2 = cp.empty_like(h_in2)
d_in2.set(h_in2)

d_out = cp.empty_like(d_in1)

cuda.compute.binary_transform(d_in1, d_in2, d_out, add_points, num_items)

result = d_out.get()

np.testing.assert_allclose(result["x"], h_in1["x"] + h_in2["x"], rtol=1e-5)
np.testing.assert_allclose(result["y"], h_in1["y"] + h_in2["y"], rtol=1e-5)

print("Binary transform with structs completed successfully")
print(f"First result point: x={result[0]['x']:.4f}, y={result[0]['y']:.4f}")
Parameters:
  • d_in1 – Device array or iterator containing the first input sequence of data items.

  • d_in2 – Device array or iterator containing the second input sequence of data items.

  • d_out – Device array or iterator to store the result of the transformation.

  • op – Binary operation. The signature is (T1, T2) -> U, where T1 and T2 are the input data types and U is the output data type. Can reference device arrays as globals or closures; they are automatically captured.

  • num_items – Number of items to transform.

  • stream – CUDA stream to use for the operation.

cuda.compute.algorithms.make_binary_transform(
d_in1: DeviceArrayLike | IteratorT,
d_in2: DeviceArrayLike | IteratorT,
d_out: DeviceArrayLike | IteratorT,
op: Operator,
)#

Create a binary transform object that can be called to apply a transformation to the given pair of input sequences according to the binary operation op.

This is the object-oriented API that allows explicit control over temporary storage allocation. For simpler usage, consider using binary_transform().

Example

"""
Binary transform examples demonstrating the transform object API.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    OpKind,
)

# Prepare the input and output arrays.
dtype = np.int32
h_input1 = np.array([1, 2, 3, 4], dtype=dtype)
h_input2 = np.array([10, 20, 30, 40], dtype=dtype)
d_input1 = cp.asarray(h_input1)
d_input2 = cp.asarray(h_input2)
d_output = cp.empty_like(d_input1)

# Create the binary transform object.
transformer = cuda.compute.make_binary_transform(
    d_input1, d_input2, d_output, OpKind.PLUS
)

# Perform the binary transform.
transformer(d_input1, d_input2, d_output, OpKind.PLUS, len(h_input1))

# Verify the result.
expected_result = np.array([11, 22, 33, 44], dtype=dtype)
actual_result = d_output.get()
np.testing.assert_array_equal(actual_result, expected_result)
print("Binary transform object example completed successfully")
Parameters:
  • d_in1 – Device array or iterator containing the first input sequence of data items.

  • d_in2 – Device array or iterator containing the second input sequence of data items.

  • d_out – Device array or iterator to store the result of the transformation.

  • op – Binary operation. The signature is (T1, T2) -> U, where T1 and T2 are the input data types and U is the output data type.

Returns:

A callable object that performs the transformation.

cuda.compute.algorithms.histogram_even(
d_samples: DeviceArrayLike | IteratorT,
d_histogram: DeviceArrayLike,
num_output_levels: int,
lower_level: floating | integer,
upper_level: floating | integer,
num_samples: int,
stream=None,
)#

Performs device-wide histogram computation with evenly-spaced bins.

This function automatically handles temporary storage allocation and execution.

Example

Below, histogram_even is used to compute a histogram with evenly-spaced bins.

"""
Example showing how to use histogram_even to bin a sequence of samples.
"""

import cupy as cp
import numpy as np

import cuda.compute

# Prepare the input and output arrays.
num_samples = 10
h_samples = np.array(
    [2.2, 6.1, 7.1, 2.9, 3.5, 0.3, 2.9, 2.1, 6.1, 999.5], dtype="float32"
)
d_samples = cp.asarray(h_samples)
num_levels = 7
d_histogram = cp.empty(num_levels - 1, dtype="int32")
lower_level = np.float32(0)
upper_level = np.float32(12)

# Perform the histogram even.
cuda.compute.histogram_even(
    d_samples,
    d_histogram,
    num_levels,
    lower_level,
    upper_level,
    num_samples,
)

# Verify the result.
h_actual_histogram = cp.asnumpy(d_histogram)
h_expected_histogram, _ = np.histogram(
    h_samples, bins=num_levels - 1, range=(lower_level, upper_level)
)
h_expected_histogram = h_expected_histogram.astype("int32")

np.testing.assert_array_equal(h_actual_histogram, h_expected_histogram)
print(f"Histogram even basic result: {h_actual_histogram}")
Parameters:
  • d_samples – Device array or iterator containing the input sequence of data samples

  • d_histogram – Device array to store the computed histogram

  • num_output_levels – Number of histogram bin levels (num_bins = num_output_levels - 1)

  • lower_level – Lower sample value bound (inclusive)

  • upper_level – Upper sample value bound (exclusive)

  • num_samples – Number of input samples

  • stream – CUDA stream for the operation (optional)

cuda.compute.algorithms.make_histogram_even(
d_samples: DeviceArrayLike | IteratorT,
d_histogram: DeviceArrayLike,
h_num_output_levels: ndarray,
h_lower_level: ndarray,
h_upper_level: ndarray,
num_samples: int,
)#

Implements a device-wide histogram that places d_samples into evenly-spaced bins.

Example

Below, make_histogram_even is used to create a histogram object that can be reused.

"""
Example showing how to use histogram object API to bin a sequence of samples.
"""

import cupy as cp
import numpy as np

import cuda.compute

# Prepare the input and output arrays.
h_samples = np.array(
    [1.5, 2.3, 4.7, 6.2, 7.8, 3.1, 5.5, 8.9, 2.7, 6.4], dtype="float32"
)
d_samples = cp.asarray(h_samples)

num_levels = 6

# note that the object API requires passing numpy arrays
# rather than scalars:
h_num_output_levels = np.array([num_levels], dtype=np.int32)
h_lower_level = np.array([0.0], dtype=np.float32)
h_upper_level = np.array([10.0], dtype=np.float32)

d_histogram = cp.zeros(num_levels - 1, dtype="int32")

# Create the histogram object.
histogrammer = cuda.compute.make_histogram_even(
    d_samples,
    d_histogram,
    h_num_output_levels,
    h_lower_level,
    h_upper_level,
    len(h_samples),
)

# Get the temporary storage size.
temp_storage_size = histogrammer(
    None,
    d_samples,
    d_histogram,
    h_num_output_levels,
    h_lower_level,
    h_upper_level,
    len(h_samples),
)

# Allocate the temporary storage.
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Perform the histogram.
histogrammer(
    d_temp_storage,
    d_samples,
    d_histogram,
    h_num_output_levels,
    h_lower_level,
    h_upper_level,
    len(h_samples),
)

# Verify the result.
h_result = cp.asnumpy(d_histogram)
expected_histogram = np.array([1, 3, 2, 3, 1], dtype="int32")

np.testing.assert_array_equal(h_result, expected_histogram)
print("Histogram object example completed successfully")
Parameters:
  • d_samples – Device array or iterator containing the input samples to be histogrammed

  • d_histogram – Device array to store the histogram

  • h_num_output_levels – Host array containing the number of output levels

  • h_lower_level – Host array containing the lower level

  • h_upper_level – Host array containing the upper level

  • num_samples – Number of samples to be histogrammed

Returns:

A callable object that can be used to perform the histogram

cuda.compute.algorithms.merge_sort(
d_in_keys: DeviceArrayLike | IteratorT,
d_in_items: DeviceArrayLike | IteratorT | None,
d_out_keys: DeviceArrayLike,
d_out_items: DeviceArrayLike | None,
op: Operator,
num_items: int,
stream=None,
)#

Performs device-wide merge sort.

This function automatically handles temporary storage allocation and execution.

Example

Below, merge_sort is used to sort a sequence of keys in place. It also rearranges the items according to the keys’ order.

"""
Demonstrate basic merge sort with keys and values.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    OpKind,
)

# Prepare the input and output arrays.
h_in_keys = np.array([-5, 0, 2, -3, 2, 4, 0, -1, 2, 8], dtype="int32")
h_in_values = np.array(
    [-3.2, 2.2, 1.9, 4.0, -3.9, 2.7, 0, 7.3, 2.9, 5.4], dtype="float32"
)

d_in_keys = cp.asarray(h_in_keys)
d_in_values = cp.asarray(h_in_values)

# Perform the merge sort.
cuda.compute.merge_sort(
    d_in_keys,
    d_in_values,
    d_in_keys,
    d_in_values,
    OpKind.LESS,
    d_in_keys.size,
)

# Verify the result.
h_out_keys = cp.asnumpy(d_in_keys)
h_out_values = cp.asnumpy(d_in_values)

argsort = np.argsort(h_in_keys, stable=True)
expected_keys = np.array(h_in_keys)[argsort]
expected_values = np.array(h_in_values)[argsort]

assert np.array_equal(h_out_keys, expected_keys)
assert np.array_equal(h_out_values, expected_values)
print(f"Merge sort basic result - keys: {h_out_keys}, values: {h_out_values}")
Parameters:
  • d_in_keys – Device array or iterator containing the input sequence of keys

  • d_in_items – Device array or iterator containing the input sequence of items (optional)

  • d_out_keys – Device array to store the sorted keys

  • d_out_items – Device array to store the sorted items (optional)

  • op – The comparison operator for sorting. The signature is (T, T) -> int8, where T is the input data type.

  • num_items – Number of items to sort

  • stream – CUDA stream for the operation (optional)
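
A custom comparator can replace OpKind.LESS; per the signature above it should return an int8. A sketch of a keys-only descending sort (the greater comparator is illustrative):

"""
Sketch: keys-only descending merge sort with a user-defined comparator.
"""

import cupy as cp
import numpy as np

import cuda.compute


# Comparator: nonzero when a should be ordered before b.
def greater(a, b):
    return np.int8(a > b)


h_in_keys = np.array([3, 1, 4, 1, 5], dtype=np.int32)
d_in_keys = cp.asarray(h_in_keys)
d_out_keys = cp.empty_like(d_in_keys)

# Items are optional; pass None to sort keys only.
cuda.compute.merge_sort(d_in_keys, None, d_out_keys, None, greater, d_in_keys.size)

assert np.array_equal(d_out_keys.get(), np.sort(h_in_keys)[::-1])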

cuda.compute.algorithms.make_merge_sort(
d_in_keys: DeviceArrayLike | IteratorT,
d_in_items: DeviceArrayLike | IteratorT | None,
d_out_keys: DeviceArrayLike,
d_out_items: DeviceArrayLike | None,
op: Operator,
)#

Implements a device-wide merge sort using d_in_keys and the comparison operator op.

Example

Below, make_merge_sort is used to create a merge sort object that can be reused.

"""
Merge sort example demonstrating the object API.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    OpKind,
)

# Prepare the input and output arrays.
dtype = np.int32
h_input_keys = np.array([4, 2, 3, 1], dtype=dtype)
h_input_values = np.array([40, 20, 30, 10], dtype=dtype)
d_input_keys = cp.asarray(h_input_keys)
d_input_values = cp.asarray(h_input_values)
d_output_keys = cp.empty_like(d_input_keys)
d_output_values = cp.empty_like(d_input_values)

# Create the merge sort object.
sorter = cuda.compute.make_merge_sort(
    d_input_keys,
    d_input_values,
    d_output_keys,
    d_output_values,
    OpKind.LESS,
)

# Get the temporary storage size.
temp_storage_size = sorter(
    None,
    d_input_keys,
    d_input_values,
    d_output_keys,
    d_output_values,
    OpKind.LESS,
    len(h_input_keys),
)

# Allocate the temporary storage.
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Perform the merge sort.
sorter(
    d_temp_storage,
    d_input_keys,
    d_input_values,
    d_output_keys,
    d_output_values,
    OpKind.LESS,
    len(h_input_keys),
)

# Verify the result.
expected_keys = np.array([1, 2, 3, 4], dtype=dtype)
expected_values = np.array([10, 20, 30, 40], dtype=dtype)
actual_keys = d_output_keys.get()
actual_values = d_output_values.get()
np.testing.assert_array_equal(actual_keys, expected_keys)
np.testing.assert_array_equal(actual_values, expected_values)
print("Merge sort object example completed successfully")
Parameters:
  • d_in_keys – Device array or iterator containing the input keys to be sorted

  • d_in_items – Optional device array or iterator that contains each key’s corresponding item

  • d_out_keys – Device array to store the sorted keys

  • d_out_items – Device array to store the sorted items

  • op – The comparison operator for sorting. The signature is (T, T) -> int8, where T is the input data type.

Returns:

A callable object that can be used to perform the merge sort

cuda.compute.algorithms.radix_sort(
d_in_keys: DeviceArrayLike | DoubleBuffer,
d_out_keys: DeviceArrayLike | None,
d_in_values: DeviceArrayLike | DoubleBuffer | None,
d_out_values: DeviceArrayLike | None,
order: SortOrder,
num_items: int,
begin_bit: int | None = None,
end_bit: int | None = None,
stream=None,
)#

Performs device-wide radix sort.

This function automatically handles temporary storage allocation and execution.

Example

Below, radix_sort is used to sort a sequence of keys. It also rearranges the values according to the keys’ order.

"""
Example showing how to use radix_sort to sort keys and values.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    SortOrder,
)

# Prepare the input and output arrays.
h_in_keys = np.array([-5, 0, 2, -3, 2, 4, 0, -1, 2, 8], dtype="int32")
h_in_values = np.array(
    [-3.2, 2.2, 1.9, 4.0, -3.9, 2.7, 0, 7.3, 2.9, 5.4], dtype="float32"
)

d_in_keys = cp.asarray(h_in_keys)
d_in_values = cp.asarray(h_in_values)

# Prepare the output arrays.
d_out_keys = cp.empty_like(d_in_keys)
d_out_values = cp.empty_like(d_in_values)

# Perform the radix sort.
cuda.compute.radix_sort(
    d_in_keys,
    d_out_keys,
    d_in_values,
    d_out_values,
    SortOrder.ASCENDING,
    d_in_keys.size,
)

# Verify the result.
h_out_keys = cp.asnumpy(d_out_keys)
h_out_values = cp.asnumpy(d_out_values)

argsort = np.argsort(h_in_keys, stable=True)
expected_keys = np.array(h_in_keys)[argsort]
expected_values = np.array(h_in_values)[argsort]

assert np.array_equal(h_out_keys, expected_keys)
assert np.array_equal(h_out_values, expected_values)
print(f"Radix sort basic result - keys: {h_out_keys}, values: {h_out_values}")

In the following example, radix_sort is used to sort a sequence of keys with a DoubleBuffer for reduced temporary storage.

"""
Example showing how to use radix_sort with DoubleBuffer for reduced temporary storage.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    DoubleBuffer,
    SortOrder,
)

# Prepare the input and output arrays.
h_in_keys = np.array([-5, 0, 2, -3, 2, 4, 0, -1, 2, 8], dtype="int32")
h_in_values = np.array(
    [-3.2, 2.2, 1.9, 4.0, -3.9, 2.7, 0, 7.3, 2.9, 5.4], dtype="float32"
)

d_in_keys = cp.asarray(h_in_keys)
d_in_values = cp.asarray(h_in_values)

d_out_keys = cp.empty_like(d_in_keys)
d_out_values = cp.empty_like(d_in_values)

# Create the double buffer.
keys_double_buffer = DoubleBuffer(d_in_keys, d_out_keys)
values_double_buffer = DoubleBuffer(d_in_values, d_out_values)

# Perform the radix sort.
cuda.compute.radix_sort(
    keys_double_buffer,
    None,
    values_double_buffer,
    None,
    SortOrder.ASCENDING,
    d_in_keys.size,
)

# Verify the result.
h_out_keys = cp.asnumpy(keys_double_buffer.current())
h_out_values = cp.asnumpy(values_double_buffer.current())

argsort = np.argsort(h_in_keys, stable=True)
h_expected_keys = np.array(h_in_keys)[argsort]
h_expected_values = np.array(h_in_values)[argsort]

assert np.array_equal(h_out_keys, h_expected_keys)
assert np.array_equal(h_out_values, h_expected_values)
print(f"Radix sort buffer result - keys: {h_out_keys}, values: {h_out_values}")
Parameters:
  • d_in_keys – Device array or DoubleBuffer containing the input sequence of keys

  • d_out_keys – Device array to store the sorted keys (optional)

  • d_in_values – Device array or DoubleBuffer containing the input sequence of values (optional)

  • d_out_values – Device array to store the sorted values (optional)

  • order – Sort order (ascending or descending)

  • num_items – Number of items to sort

  • begin_bit – Beginning bit position for comparison (optional)

  • end_bit – Ending bit position for comparison (optional)

  • stream – CUDA stream for the operation (optional)
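
begin_bit and end_bit restrict the comparison to a bit range. The sketch below sorts unsigned keys by their low byte only; the key values are illustrative.

"""
Sketch: radix_sort restricted to bits [0, 8) of the keys.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import SortOrder

# High bytes differ, but only the low byte participates in the sort.
h_in_keys = np.array([0x0103, 0x0201, 0x0302, 0x0100], dtype=np.uint32)
d_in_keys = cp.asarray(h_in_keys)
d_out_keys = cp.empty_like(d_in_keys)

cuda.compute.radix_sort(
    d_in_keys,
    d_out_keys,
    None,
    None,
    SortOrder.ASCENDING,
    d_in_keys.size,
    begin_bit=0,
    end_bit=8,
)

# Output keys are ordered by their low byte: 0x00, 0x01, 0x02, 0x03.
assert np.array_equal(d_out_keys.get() & 0xFF, np.array([0, 1, 2, 3], dtype=np.uint32))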

cuda.compute.algorithms.make_radix_sort(
d_in_keys: DeviceArrayLike | DoubleBuffer,
d_out_keys: DeviceArrayLike | None,
d_in_values: DeviceArrayLike | DoubleBuffer | None,
d_out_values: DeviceArrayLike | None,
order: SortOrder,
)#

Implements a device-wide radix sort using d_in_keys in the requested order.

Example

Below, make_radix_sort is used to create a radix sort object that can be reused.

"""
Example showing how to use radix_sort with the object API.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    SortOrder,
)

# Prepare the input and output arrays.
dtype = np.int32
h_input_keys = np.array([4, 2, 3, 1], dtype=dtype)
h_input_values = np.array([40, 20, 30, 10], dtype=dtype)
d_input_keys = cp.asarray(h_input_keys)
d_input_values = cp.asarray(h_input_values)
d_output_keys = cp.empty_like(d_input_keys)
d_output_values = cp.empty_like(d_input_values)

# Create the radix sort object.
sorter = cuda.compute.make_radix_sort(
    d_input_keys,
    d_output_keys,
    d_input_values,
    d_output_values,
    SortOrder.ASCENDING,
)

# Get the temporary storage size.
temp_storage_size = sorter(
    None,
    d_input_keys,
    d_output_keys,
    d_input_values,
    d_output_values,
    len(h_input_keys),
)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Perform the radix sort.
sorter(
    d_temp_storage,
    d_input_keys,
    d_output_keys,
    d_input_values,
    d_output_values,
    len(h_input_keys),
)

# Verify the result.
expected_keys = np.array([1, 2, 3, 4], dtype=dtype)
expected_values = np.array([10, 20, 30, 40], dtype=dtype)
actual_keys = d_output_keys.get()
actual_values = d_output_values.get()
np.testing.assert_array_equal(actual_keys, expected_keys)
np.testing.assert_array_equal(actual_values, expected_values)
print("Radix sort object example completed successfully")
Parameters:
  • d_in_keys – Device array or DoubleBuffer containing the input keys to be sorted

  • d_out_keys – Device array to store the sorted keys

  • d_in_values – Optional device array or DoubleBuffer containing the input values to be sorted

  • d_out_values – Device array to store the sorted values

  • order – Sort order (ascending or descending)

Returns:

A callable object that can be used to perform the radix sort

cuda.compute.algorithms.segmented_reduce(
d_in: DeviceArrayLike | IteratorT,
d_out: DeviceArrayLike | IteratorT,
start_offsets_in: DeviceArrayLike | IteratorT,
end_offsets_in: DeviceArrayLike | IteratorT,
op: Operator,
h_init: ndarray | GpuStruct,
num_segments: int,
stream=None,
)#

Performs device-wide segmented reduction.

This function automatically handles temporary storage allocation and execution.

Example

Below, segmented_reduce is used to compute the minimum value of each segment in a sequence of integers.

"""
Example showing how to use segmented_reduce to find the minimum in each segment.
"""

import cupy as cp
import numpy as np

import cuda.compute


def min_op(a, b):
    return a if a < b else b


dtype = np.dtype(np.int32)
max_val = np.iinfo(dtype).max
h_init = np.asarray(max_val, dtype=dtype)

# Prepare the offsets.
offsets = cp.array([0, 7, 11, 16], dtype=np.int64)
first_segment = (8, 6, 7, 5, 3, 0, 9)
second_segment = (-4, 3, 0, 1)
third_segment = (3, 1, 11, 25, 8)

# Prepare the input array.
d_input = cp.array(
    [*first_segment, *second_segment, *third_segment],
    dtype=dtype,
)

# Prepare the start and end offsets.
start_o = offsets[:-1]
end_o = offsets[1:]

# Prepare the output array.
n_segments = start_o.size
d_output = cp.empty(n_segments, dtype=dtype)

# Perform the segmented reduce.
cuda.compute.segmented_reduce(
    d_input, d_output, start_o, end_o, min_op, h_init, n_segments
)

# Verify the result.
expected_output = cp.asarray([0, -4, 1], dtype=d_output.dtype)
assert (d_output == expected_output).all()
print(f"Segmented reduce basic result: {d_output.get()}")
Parameters:
  • d_in – Device array or iterator containing the input sequence of data items

  • d_out – Device array to store the result of the reduction for each segment

  • start_offsets_in – Device array or iterator containing the sequence of beginning offsets

  • end_offsets_in – Device array or iterator containing the sequence of ending offsets

  • op – Binary operator to apply. The signature is (T, T) -> T, where T is the data type of the initial value h_init.

  • h_init – Initial value for the reduction

  • num_segments – Number of segments to reduce

  • stream – CUDA stream for the operation (optional)
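
A common pattern is reducing each row of a flattened 2D array, deriving the offsets from the row width. A sketch (the matrix and offsets are illustrative):

"""
Sketch: per-row sums of a flattened 2D array via segmented_reduce.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import OpKind

rows, cols = 3, 4
h_matrix = np.arange(rows * cols, dtype=np.int32).reshape(rows, cols)

d_input = cp.asarray(h_matrix.ravel())
d_output = cp.empty(rows, dtype=np.int32)

# Row r spans [r * cols, (r + 1) * cols).
start_o = cp.arange(rows, dtype=np.int64) * cols
end_o = start_o + cols

h_init = np.array([0], dtype=np.int32)
cuda.compute.segmented_reduce(
    d_input, d_output, start_o, end_o, OpKind.PLUS, h_init, rows
)

assert np.array_equal(d_output.get(), h_matrix.sum(axis=1, dtype=np.int32))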

cuda.compute.algorithms.make_segmented_reduce(
d_in: DeviceArrayLike | IteratorT,
d_out: DeviceArrayLike | IteratorT,
start_offsets_in: DeviceArrayLike | IteratorT,
end_offsets_in: DeviceArrayLike | IteratorT,
op: Operator,
h_init: ndarray | GpuStruct,
)#

Computes a device-wide segmented reduction using the specified binary op and initial value h_init.

Example

Below, make_segmented_reduce is used to create a segmented reduction object that can be reused.

"""
Segmented reduction using the object API.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    OpKind,
)

# Prepare the input and output arrays.
dtype = np.int32
h_init = np.array([0], dtype=dtype)
h_input = np.array([1, 2, 3, 4, 5, 6], dtype=dtype)
d_input = cp.asarray(h_input)
d_output = cp.empty(2, dtype=dtype)

start_offsets = cp.array([0, 3], dtype=np.int64)
end_offsets = cp.array([3, 6], dtype=np.int64)

# Create the segmented reduce object.
reducer = cuda.compute.make_segmented_reduce(
    d_input, d_output, start_offsets, end_offsets, OpKind.PLUS, h_init
)

# Get the temporary storage size.
temp_storage_size = reducer(
    None, d_input, d_output, OpKind.PLUS, 2, start_offsets, end_offsets, h_init
)

# Allocate the temporary storage.
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Perform the segmented reduce.
reducer(
    d_temp_storage,
    d_input,
    d_output,
    OpKind.PLUS,
    2,
    start_offsets,
    end_offsets,
    h_init,
)

# Verify the result.
expected_result = np.array([6, 15], dtype=dtype)
actual_result = d_output.get()
np.testing.assert_array_equal(actual_result, expected_result)
print("Segmented reduce object example completed successfully")
Parameters:
  • d_in – Device array or iterator containing the input sequence of data items

  • d_out – Device array that will store the result of the reduction

  • start_offsets_in – Device array or iterator containing offsets to start of segments

  • end_offsets_in – Device array or iterator containing offsets to end of segments

  • op – Binary operator to apply. The signature is (T, T) -> T, where T is the data type of the initial value h_init.

  • h_init – NumPy array storing the initial value of the reduction

Returns:

A callable object that can be used to perform the reduction

cuda.compute.algorithms.unique_by_key(
d_in_keys: DeviceArrayLike | IteratorT,
d_in_items: DeviceArrayLike | IteratorT,
d_out_keys: DeviceArrayLike | IteratorT,
d_out_items: DeviceArrayLike | IteratorT,
d_out_num_selected: DeviceArrayLike,
op: Operator,
num_items: int,
stream=None,
)#

Performs a device-wide unique by key operation using the single-phase API.

This function automatically handles temporary storage allocation and execution.

Example

Below, unique_by_key is used to populate the arrays of output keys and items with the first key and its corresponding item from each sequence of equal keys. It also outputs the number of items selected.

"""
Example showing how to use unique_by_key to remove all
but the first value for each group of consecutive keys.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    OpKind,
)

# Prepare the input and output arrays.
h_in_keys = np.array([0, 2, 2, 9, 5, 5, 5, 8], dtype="int32")
h_in_values = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype="float32")

d_in_keys = cp.asarray(h_in_keys)
d_in_values = cp.asarray(h_in_values)
d_out_keys = cp.empty_like(d_in_keys)
d_out_values = cp.empty_like(d_in_values)
d_out_num_selected = cp.empty(1, np.int32)

# Perform the unique by key operation.
cuda.compute.unique_by_key(
    d_in_keys,
    d_in_values,
    d_out_keys,
    d_out_values,
    d_out_num_selected,
    OpKind.EQUAL_TO,
    d_in_keys.size,
)

# Verify the result.
num_selected = cp.asnumpy(d_out_num_selected)[0]
h_out_keys = cp.asnumpy(d_out_keys)[:num_selected]
h_out_values = cp.asnumpy(d_out_values)[:num_selected]

expected_keys = np.array([0, 2, 9, 5, 8])
expected_values = np.array([1, 2, 4, 5, 8])

assert np.array_equal(h_out_keys, expected_keys)
assert np.array_equal(h_out_values, expected_values)
print(
    f"Unique by key basic result - keys: {h_out_keys}, values: {h_out_values}, count: {num_selected}"
)
Parameters:
  • d_in_keys – Device array or iterator containing the input sequence of keys

  • d_in_items – Device array or iterator that contains each key’s corresponding item

  • d_out_keys – Device array or iterator to store the outputted keys

  • d_out_items – Device array or iterator to store each outputted key’s item

  • d_out_num_selected – Device array to store how many items were selected

  • op – Callable or OpKind representing the equality operator

  • num_items – Number of items to process

  • stream – CUDA stream for the operation (optional)
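
A user-defined equality operator can replace OpKind.EQUAL_TO. The sketch below keeps one entry per "decade" of keys; the same_decade grouping rule is illustrative.

"""
Sketch: unique_by_key with a user-defined equality operator.
"""

import cupy as cp
import numpy as np

import cuda.compute


# Illustrative equality operator: keys are "equal" when they fall in the
# same decade (e.g. 13 and 17 both map to decade 1).
def same_decade(a, b):
    return (a // 10) == (b // 10)


h_in_keys = np.array([3, 7, 13, 17, 25], dtype=np.int32)
h_in_values = np.array([1, 2, 3, 4, 5], dtype=np.int32)

d_in_keys = cp.asarray(h_in_keys)
d_in_values = cp.asarray(h_in_values)
d_out_keys = cp.empty_like(d_in_keys)
d_out_values = cp.empty_like(d_in_values)
d_out_num_selected = cp.empty(1, np.int32)

cuda.compute.unique_by_key(
    d_in_keys,
    d_in_values,
    d_out_keys,
    d_out_values,
    d_out_num_selected,
    same_decade,
    d_in_keys.size,
)

n = int(d_out_num_selected.get()[0])
assert np.array_equal(d_out_keys.get()[:n], np.array([3, 13, 25], dtype=np.int32))
assert np.array_equal(d_out_values.get()[:n], np.array([1, 3, 5], dtype=np.int32))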

cuda.compute.algorithms.make_unique_by_key(
d_in_keys: DeviceArrayLike | IteratorT,
d_in_items: DeviceArrayLike | IteratorT,
d_out_keys: DeviceArrayLike | IteratorT,
d_out_items: DeviceArrayLike | IteratorT,
d_out_num_selected: DeviceArrayLike,
op: Operator,
)#

Implements a device-wide unique by key operation using d_in_keys and the equality operator op. Only the first key and its corresponding item from each run of equal keys are selected, and the total number of items selected is also reported.

Example

Below, make_unique_by_key is used to create a unique by key object that can be reused.

"""
Example showing how to use unique_by_key with the object API.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    OpKind,
)

# Prepare the input and output arrays.
dtype = np.int32
h_input_keys = np.array([1, 1, 2, 3, 3], dtype=dtype)
h_input_values = np.array([10, 20, 30, 40, 50], dtype=dtype)
d_input_keys = cp.asarray(h_input_keys)
d_input_values = cp.asarray(h_input_values)
d_output_keys = cp.empty_like(d_input_keys)
d_output_values = cp.empty_like(d_input_values)
d_num_selected = cp.empty(1, dtype=np.int32)

# Create the unique by key object.
uniquer = cuda.compute.make_unique_by_key(
    d_input_keys,
    d_input_values,
    d_output_keys,
    d_output_values,
    d_num_selected,
    OpKind.EQUAL_TO,
)

# Get the temporary storage size.
temp_storage_size = uniquer(
    None,
    d_input_keys,
    d_input_values,
    d_output_keys,
    d_output_values,
    d_num_selected,
    OpKind.EQUAL_TO,
    len(h_input_keys),
)

# Allocate the temporary storage.
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Perform the unique by key operation.
uniquer(
    d_temp_storage,
    d_input_keys,
    d_input_values,
    d_output_keys,
    d_output_values,
    d_num_selected,
    OpKind.EQUAL_TO,
    len(h_input_keys),
)

# Verify the result.
num_selected = d_num_selected.get()[0]
expected_keys = np.array([1, 2, 3], dtype=dtype)
expected_values = np.array([10, 30, 40], dtype=dtype)
actual_keys = d_output_keys.get()[:num_selected]
actual_values = d_output_values.get()[:num_selected]
np.testing.assert_array_equal(actual_keys, expected_keys)
np.testing.assert_array_equal(actual_values, expected_values)
print("Unique by key object example completed successfully")
Parameters:
  • d_in_keys – Device array or iterator containing the input sequence of keys

  • d_in_items – Device array or iterator that contains each key’s corresponding item

  • d_out_keys – Device array or iterator to store the outputted keys

  • d_out_items – Device array or iterator to store each outputted key’s item

  • d_out_num_selected – Device array to store how many items were selected

  • op – Callable or OpKind representing the equality operator

Returns:

A callable object that can be used to perform unique by key

cuda.compute.algorithms.segmented_sort(
d_in_keys: DeviceArrayLike | DoubleBuffer,
d_out_keys: DeviceArrayLike | None,
d_in_values: DeviceArrayLike | DoubleBuffer | None,
d_out_values: DeviceArrayLike | None,
num_items: int,
num_segments: int,
start_offsets_in: DeviceArrayLike,
end_offsets_in: DeviceArrayLike,
order: SortOrder,
stream=None,
)#

Performs device-wide segmented sort.

This function automatically handles temporary storage allocation and execution.

Example

Below, segmented_sort is used to perform a segmented sort. It also rearranges the values according to the keys’ order.

"""
Example showing how to use segmented_sort to sort keys and values within segments.
"""

import cupy as cp
import numpy as np

import cuda.compute

# Prepare input keys and values, and segment offsets.
h_in_keys = np.array([9, 1, 5, 4, 2, 8, 7, 3, 6], dtype="int32")
h_in_vals = np.array([90, 10, 50, 40, 20, 80, 70, 30, 60], dtype="int32")

# 3 segments: [0,3), [3,5), [5,9)
start_offsets = np.array([0, 3, 5], dtype=np.int64)
end_offsets = np.array([3, 5, 9], dtype=np.int64)

d_in_keys = cp.asarray(h_in_keys)
d_in_vals = cp.asarray(h_in_vals)
d_out_keys = cp.empty_like(d_in_keys)
d_out_vals = cp.empty_like(d_in_vals)

# Perform the segmented sort (ascending within each segment).
cuda.compute.segmented_sort(
    d_in_keys,
    d_out_keys,
    d_in_vals,
    d_out_vals,
    d_in_keys.size,
    start_offsets.size,
    cp.asarray(start_offsets),
    cp.asarray(end_offsets),
    cuda.compute.SortOrder.ASCENDING,
)

# Verify the result.
h_out_keys = cp.asnumpy(d_out_keys)
h_out_vals = cp.asnumpy(d_out_vals)

expected_pairs = []
for s, e in zip(start_offsets, end_offsets):
    seg_pairs = sorted(zip(h_in_keys[s:e], h_in_vals[s:e]), key=lambda kv: kv[0])
    expected_pairs.extend(seg_pairs)

expected_keys = np.array([k for k, _ in expected_pairs], dtype=h_in_keys.dtype)
expected_vals = np.array([v for _, v in expected_pairs], dtype=h_in_vals.dtype)

assert np.array_equal(h_out_keys, expected_keys)
assert np.array_equal(h_out_vals, expected_vals)
print(f"Segmented sort basic result - keys: {h_out_keys}, values: {h_out_vals}")

In the following example, segmented_sort is used to perform a segmented sort with a DoubleBuffer for reduced temporary storage.

"""
Example showing how to use segmented_sort with DoubleBuffer for reduced temporary storage.
"""

import cupy as cp
import numpy as np

import cuda.compute

# Prepare input keys and values, and segment offsets.
h_in_keys = np.array([9, 1, 5, 4, 2, 8, 7, 3, 6], dtype="int32")
h_in_vals = np.array([90, 10, 50, 40, 20, 80, 70, 30, 60], dtype="int32")

# 3 segments: [0,3), [3,5), [5,9)
start_offsets = np.array([0, 3, 5], dtype=np.int64)
end_offsets = np.array([3, 5, 9], dtype=np.int64)

d_in_keys = cp.asarray(h_in_keys)
d_in_vals = cp.asarray(h_in_vals)
d_tmp_keys = cp.empty_like(d_in_keys)
d_tmp_vals = cp.empty_like(d_in_vals)

# Create double buffers for keys and values.
keys_db = cuda.compute.DoubleBuffer(d_in_keys, d_tmp_keys)
vals_db = cuda.compute.DoubleBuffer(d_in_vals, d_tmp_vals)

# Perform the segmented sort (descending within each segment).
cuda.compute.segmented_sort(
    keys_db,
    None,
    vals_db,
    None,
    d_in_keys.size,
    start_offsets.size,
    cp.asarray(start_offsets),
    cp.asarray(end_offsets),
    cuda.compute.SortOrder.DESCENDING,
)

# Verify the result.
h_out_keys = cp.asnumpy(keys_db.current())
h_out_vals = cp.asnumpy(vals_db.current())

expected_pairs = []
for s, e in zip(start_offsets, end_offsets):
    seg_pairs = sorted(
        zip(h_in_keys[s:e], h_in_vals[s:e]), key=lambda kv: kv[0], reverse=True
    )
    expected_pairs.extend(seg_pairs)

expected_keys = np.array([k for k, _ in expected_pairs], dtype=h_in_keys.dtype)
expected_vals = np.array([v for _, v in expected_pairs], dtype=h_in_vals.dtype)

assert np.array_equal(h_out_keys, expected_keys)
assert np.array_equal(h_out_vals, expected_vals)
print(f"Segmented sort buffer result - keys: {h_out_keys}, values: {h_out_vals}")
Parameters:
  • d_in_keys – Device array or DoubleBuffer containing the input keys to be sorted

  • d_out_keys – Device array to store the sorted keys (optional)

  • d_in_values – Device array or DoubleBuffer containing the input values to be sorted (optional)

  • d_out_values – Device array to store the sorted values (optional)

  • num_items – Total number of items to sort

  • num_segments – Number of segments to sort

  • start_offsets_in – Device array or iterator containing the sequence of beginning offsets

  • end_offsets_in – Device array or iterator containing the sequence of ending offsets

  • order – Sort order (ascending or descending)

  • stream – CUDA stream for the operation (optional)
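
Values are optional: pass None for both value arguments to sort keys only. A minimal sketch:

"""
Sketch: keys-only segmented sort.
"""

import cupy as cp
import numpy as np

import cuda.compute

h_in_keys = np.array([9, 1, 5, 4, 2], dtype=np.int32)

# 2 segments: [0,3), [3,5)
start_offsets = cp.array([0, 3], dtype=np.int64)
end_offsets = cp.array([3, 5], dtype=np.int64)

d_in_keys = cp.asarray(h_in_keys)
d_out_keys = cp.empty_like(d_in_keys)

cuda.compute.segmented_sort(
    d_in_keys,
    d_out_keys,
    None,  # no input values
    None,  # no output values
    d_in_keys.size,
    start_offsets.size,
    start_offsets,
    end_offsets,
    cuda.compute.SortOrder.ASCENDING,
)

# Each segment is sorted independently.
assert np.array_equal(d_out_keys.get(), np.array([1, 5, 9, 2, 4], dtype=np.int32))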

cuda.compute.algorithms.make_segmented_sort(
d_in_keys: DeviceArrayLike | DoubleBuffer,
d_out_keys: DeviceArrayLike | None,
d_in_values: DeviceArrayLike | DoubleBuffer | None,
d_out_values: DeviceArrayLike | None,
start_offsets_in: DeviceArrayLike,
end_offsets_in: DeviceArrayLike,
order: SortOrder,
)#

Implements a device-wide segmented sort using the specified keys and values.

Example

Below, make_segmented_sort is used to create a segmented sort object that can be reused.

"""
Example showing how to use segmented_sort with the object API.
"""

import cupy as cp
import numpy as np

import cuda.compute

# Prepare the input and segment offsets.
dtype = np.int32
h_input_keys = np.array([9, 1, 5, 4, 2, 8, 7, 3, 6], dtype=dtype)
h_input_vals = np.array([90, 10, 50, 40, 20, 80, 70, 30, 60], dtype=dtype)
start_offsets = np.array([0, 3, 5], dtype=np.int64)
end_offsets = np.array([3, 5, 9], dtype=np.int64)

d_input_keys = cp.asarray(h_input_keys)
d_input_vals = cp.asarray(h_input_vals)
d_output_keys = cp.empty_like(d_input_keys)
d_output_vals = cp.empty_like(d_input_vals)

# Create the segmented sort object.
sorter = cuda.compute.make_segmented_sort(
    d_input_keys,
    d_output_keys,
    d_input_vals,
    d_output_vals,
    cp.asarray(start_offsets),
    cp.asarray(end_offsets),
    cuda.compute.SortOrder.ASCENDING,
)

# Get the temporary storage size.
temp_storage_size = sorter(
    None,
    d_input_keys,
    d_output_keys,
    d_input_vals,
    d_output_vals,
    h_input_keys.size,
    start_offsets.size,
    cp.asarray(start_offsets),
    cp.asarray(end_offsets),
)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Perform the segmented sort.
sorter(
    d_temp_storage,
    d_input_keys,
    d_output_keys,
    d_input_vals,
    d_output_vals,
    h_input_keys.size,
    start_offsets.size,
    cp.asarray(start_offsets),
    cp.asarray(end_offsets),
)

# Verify the result.
expected_pairs = []
for s, e in zip(start_offsets, end_offsets):
    seg_pairs = sorted(zip(h_input_keys[s:e], h_input_vals[s:e]), key=lambda kv: kv[0])
    expected_pairs.extend(seg_pairs)

expected_keys = np.array([k for k, _ in expected_pairs], dtype=dtype)
expected_values = np.array([v for _, v in expected_pairs], dtype=dtype)

actual_keys = d_output_keys.get()
actual_values = d_output_vals.get()
np.testing.assert_array_equal(actual_keys, expected_keys)
np.testing.assert_array_equal(actual_values, expected_values)
print("Segmented sort object example completed successfully")
Parameters:
  • d_in_keys – Device array or DoubleBuffer containing the input keys to be sorted

  • d_out_keys – Device array to store the sorted keys

  • d_in_values – Device array or DoubleBuffer containing the input values to be sorted (optional)

  • d_out_values – Device array to store the sorted values (optional)

  • start_offsets_in – Device array or iterator containing the sequence of beginning offsets

  • end_offsets_in – Device array or iterator containing the sequence of ending offsets

  • order – SortOrder specifying the order of the sort

Returns:

A callable object that can be used to perform the segmented sort

cuda.compute.algorithms.three_way_partition(
d_in: DeviceArrayLike | IteratorT,
d_first_part_out: DeviceArrayLike | IteratorT,
d_second_part_out: DeviceArrayLike | IteratorT,
d_unselected_out: DeviceArrayLike | IteratorT,
d_num_selected_out: DeviceArrayLike | IteratorT,
select_first_part_op: Operator,
select_second_part_op: Operator,
num_items: int,
stream=None,
)#

Performs a device-wide three-way partition. Given an input sequence of data items, it partitions the items into three parts:

  • The first part contains the items selected by the select_first_part_op operator.

  • The second part contains the items selected by the select_second_part_op operator.

  • The unselected part contains the items selected by neither operator.

This function automatically handles temporary storage allocation and execution.

Example

Below, three_way_partition is used to partition a sequence of integers into three parts.

"""
Example showing how to use three_way_partition to partition a sequence of integers into three parts.
"""

import cupy as cp
import numpy as np

import cuda.compute

# Prepare the input and output arrays.
dtype = np.int32
h_input = np.array([0, 2, 9, 1, 5, 6, 7, -3, 17, 10], dtype=dtype)
d_input = cp.asarray(h_input)
d_first_part = cp.empty_like(d_input)
d_second_part = cp.empty_like(d_input)
d_unselected = cp.empty_like(d_input)
d_num_selected = cp.empty(2, dtype=np.int64)


# Select non-negative values less than 8 for the first part.
def less_than_op(x):
    return x < 8 and x >= 0


# Select values greater than or equal to 8 for the second part.
def greater_than_equal_op(x):
    return x >= 8


# Perform the three_way_partition.
cuda.compute.three_way_partition(
    d_input,
    d_first_part,
    d_second_part,
    d_unselected,
    d_num_selected,
    less_than_op,
    greater_than_equal_op,
    len(h_input),
)

# Verify the result.
expected_first_part = np.array([0, 2, 1, 5, 6, 7], dtype=dtype)
expected_second_part = np.array([9, 17, 10], dtype=dtype)
expected_unselected = np.array([-3], dtype=dtype)
expected_num_selected = np.array([6, 3], dtype=np.int64)

actual_num_selected = d_num_selected.get()
num_selected_first_part = int(actual_num_selected[0])
num_selected_second_part = int(actual_num_selected[1])
actual_first_part = d_first_part.get()[:num_selected_first_part]
actual_second_part = d_second_part.get()[:num_selected_second_part]
actual_unselected = d_unselected.get()[
    : d_input.size - num_selected_first_part - num_selected_second_part
]

np.testing.assert_array_equal(actual_first_part, expected_first_part)
np.testing.assert_array_equal(actual_second_part, expected_second_part)
np.testing.assert_array_equal(actual_unselected, expected_unselected)
np.testing.assert_array_equal(actual_num_selected, expected_num_selected)

print("Three way partition basic example completed successfully")
Parameters:
  • d_in – Device array or iterator containing the input sequence of data items

  • d_first_part_out – Device array or iterator to store the first part of the output

  • d_second_part_out – Device array or iterator to store the second part of the output

  • d_unselected_out – Device array or iterator to store the unselected items

  • d_num_selected_out – Device array to store the number of items selected. The number of items selected by select_first_part_op is stored in d_num_selected_out[0], and the number selected by select_second_part_op in d_num_selected_out[1].

  • select_first_part_op – Unary operator to select the first part. The signature is (T) -> uint8, where T is the input data type. Returns 1 (selected) or 0 (not selected).

  • select_second_part_op – Unary operator to select the second part. The signature is (T) -> uint8, where T is the input data type. Returns 1 (selected) or 0 (not selected).

  • num_items – Number of items to partition

  • stream – CUDA stream for the operation (optional)

cuda.compute.algorithms.make_three_way_partition(
d_in: DeviceArrayLike | IteratorT,
d_first_part_out: DeviceArrayLike | IteratorT,
d_second_part_out: DeviceArrayLike | IteratorT,
d_unselected_out: DeviceArrayLike | IteratorT,
d_num_selected_out: DeviceArrayLike | IteratorT,
select_first_part_op: Operator,
select_second_part_op: Operator,
)#

Computes a device-wide three-way partition using the specified unary select_first_part_op and select_second_part_op operators.

Example

Below, make_three_way_partition is used to create a three-way partition object that can be reused.

"""
Example showing how to use three_way_partition with the object API.
"""

import cupy as cp
import numpy as np

import cuda.compute

# Prepare the input and output arrays.
dtype = np.int32
h_input = np.array([0, 2, 9, 1, 5, 6, 7, -3, 17, 10], dtype=dtype)
d_input = cp.asarray(h_input)
d_first_part = cp.empty_like(d_input)
d_second_part = cp.empty_like(d_input)
d_unselected = cp.empty_like(d_input)
d_num_selected = cp.empty(2, dtype=np.int64)


# Select non-negative values less than 8 for the first part.
def less_than_op(x):
    return x < 8 and x >= 0


# Select values greater than or equal to 8 for the second part.
def greater_than_equal_op(x):
    return x >= 8


# Create the three_way_partition object.
partitioner = cuda.compute.make_three_way_partition(
    d_input,
    d_first_part,
    d_second_part,
    d_unselected,
    d_num_selected,
    less_than_op,
    greater_than_equal_op,
)

# Get the temporary storage size.
temp_storage_size = partitioner(
    None,
    d_input,
    d_first_part,
    d_second_part,
    d_unselected,
    d_num_selected,
    less_than_op,
    greater_than_equal_op,
    len(h_input),
)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Perform the three_way_partition.
partitioner(
    d_temp_storage,
    d_input,
    d_first_part,
    d_second_part,
    d_unselected,
    d_num_selected,
    less_than_op,
    greater_than_equal_op,
    len(h_input),
)

# Verify the result.
expected_first_part = np.array([0, 2, 1, 5, 6, 7], dtype=dtype)
expected_second_part = np.array([9, 17, 10], dtype=dtype)
expected_unselected = np.array([-3], dtype=dtype)
expected_num_selected = np.array([6, 3], dtype=np.int64)

actual_num_selected = d_num_selected.get()
num_selected_first_part = int(actual_num_selected[0])
num_selected_second_part = int(actual_num_selected[1])
actual_first_part = d_first_part.get()[:num_selected_first_part]
actual_second_part = d_second_part.get()[:num_selected_second_part]
actual_unselected = d_unselected.get()[
    : d_input.size - num_selected_first_part - num_selected_second_part
]

np.testing.assert_array_equal(actual_first_part, expected_first_part)
np.testing.assert_array_equal(actual_second_part, expected_second_part)
np.testing.assert_array_equal(actual_unselected, expected_unselected)
np.testing.assert_array_equal(actual_num_selected, expected_num_selected)

print("Three way partition object example completed successfully")
Parameters:
  • d_in – Device array or iterator containing the input sequence of data items

  • d_first_part_out – Device array or iterator to store the first part of the output

  • d_second_part_out – Device array or iterator to store the second part of the output

  • d_unselected_out – Device array or iterator to store the unselected items

  • d_num_selected_out – Device array to store the number of items selected. The number of items selected by select_first_part_op is stored in d_num_selected_out[0], and the number selected by select_second_part_op in d_num_selected_out[1].

  • select_first_part_op – Unary operator to select the first part. The signature is (T) -> uint8, where T is the input data type. Returns 1 (selected) or 0 (not selected). Can reference device arrays as globals/closures - they will be automatically captured.

  • select_second_part_op – Unary operator to select the second part. The signature is (T) -> uint8, where T is the input data type. Returns 1 (selected) or 0 (not selected). Can reference device arrays as globals/closures - they will be automatically captured.

Returns:

A callable object that can be used to perform the three-way partition

cuda.compute.algorithms.select(
d_in: DeviceArrayLike | IteratorT,
d_out: DeviceArrayLike | IteratorT,
d_num_selected_out: DeviceArrayLike,
cond: Operator,
num_items: int,
stream=None,
)#

Performs device-wide selection of elements based on a condition.

Given an input sequence, this function selects all elements for which the condition function cond returns true (non-zero) and writes them to the output in a compacted form. The number of selected elements is written to d_num_selected_out[0].

This function automatically handles temporary storage allocation and execution.

The cond function can reference device arrays as globals or closures - they will be automatically captured as state arrays, enabling stateful operations like counting.
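
For instance, a predicate can read a device array it closes over. The sketch below (the d_threshold array and above_threshold predicate are illustrative, relying on the closure-capture behavior described above) selects items larger than a device-resident threshold:

import cupy as cp

from cuda.compute.algorithms import select

d_in = cp.array([3, 7, 1, 9, 4], dtype=cp.int32)
d_out = cp.empty_like(d_in)
d_num_selected = cp.zeros(2, dtype=cp.uint64)

# Device array captured by the predicate as a closure; per the note above,
# it is automatically passed along as a state array.
d_threshold = cp.array([5], dtype=cp.int32)


def above_threshold(x):
    return x > d_threshold[0]


select(d_in, d_out, d_num_selected, above_threshold, len(d_in))
# Expected survivors: 7 and 9.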

Example

Below, select is used to select even numbers from an input array:

import cupy as cp

from cuda.compute.algorithms import select

# Create input data
d_in = cp.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=cp.int32)
d_out = cp.empty_like(d_in)
d_num_selected = cp.zeros(2, dtype=cp.uint64)


# Define select condition (keep even numbers)
def is_even(x):
    return x % 2 == 0


# Execute select
select(d_in, d_out, d_num_selected, is_even, len(d_in))

# Get results
num_selected = int(d_num_selected[0])
result = d_out[:num_selected].get()
print(f"Selected {num_selected} items: {result}")
# Output: Selected 4 items: [2 4 6 8]

assert num_selected == 4
assert (result == [2, 4, 6, 8]).all()

You can also use iterators for more complex selection patterns:

import cupy as cp

from cuda.compute.algorithms import select
from cuda.compute.iterators import TransformIterator

# Create input data
d_in = cp.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=cp.int32)
d_out = cp.empty_like(d_in)
d_num_selected = cp.zeros(2, dtype=cp.uint64)


# Create iterator that squares each value
def square(x):
    return x * x


squared_iter = TransformIterator(d_in, square)


# Select squared values that are greater than 20
def greater_than_20(x):
    return x > 20


select(squared_iter, d_out, d_num_selected, greater_than_20, len(d_in))

# Get results
num_selected = int(d_num_selected[0])
result = d_out[:num_selected].get()
print(f"Selected {num_selected} items: {result}")
# Output: Selected 4 items: [25 36 49 64]
# (5^2=25, 6^2=36, 7^2=49, 8^2=64, all > 20)

assert num_selected == 4
assert (result == [25, 36, 49, 64]).all()
Parameters:
  • d_in – Device array or iterator containing the input sequence of data items.

  • d_out – Device array or iterator to store the selected output items.

  • d_num_selected_out – Device array to store the number of items that passed the selection. The count is stored in d_num_selected_out[0].

  • cond – Selection condition (predicate). The signature is (T) -> uint8, where T is the input data type. Returns 1 (selected) or 0 (not selected). Can reference device arrays as globals/closures - they will be automatically captured.

  • num_items – Number of items in the input sequence.

  • stream – CUDA stream to use for the operation (optional).

cuda.compute.algorithms.make_select(
d_in: DeviceArrayLike | IteratorT,
d_out: DeviceArrayLike | IteratorT,
d_num_selected_out: DeviceArrayLike,
cond: Operator,
)#

Create a select object that can be called to select elements matching a condition.

This is the object-oriented API that allows explicit control over temporary storage allocation. For simpler usage, consider using select().

Example

Below, make_select is used to create a select object that can be reused.

import cupy as cp

from cuda.compute.algorithms import make_select

# Create input data
d_in = cp.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=cp.int32)
d_out = cp.empty_like(d_in)
d_num_selected = cp.zeros(2, dtype=cp.uint64)


# Define select condition (keep values > 5)
def greater_than_5(x):
    return x > 5


# Create select object (can be reused)
selector = make_select(d_in, d_out, d_num_selected, greater_than_5)

# Get required temp storage
temp_storage_bytes = selector(
    None, d_in, d_out, d_num_selected, greater_than_5, len(d_in)
)
d_temp_storage = cp.empty(temp_storage_bytes, dtype=cp.uint8)

# Execute select
selector(d_temp_storage, d_in, d_out, d_num_selected, greater_than_5, len(d_in))

# Get results
num_selected = int(d_num_selected[0])
result = d_out[:num_selected].get()
print(f"Selected {num_selected} items: {result}")
# Output: Selected 5 items: [ 6  7  8  9 10]

# Reuse the same select object with different input
d_in2 = cp.array([10, 20, 3, 15, 2, 8, 30], dtype=cp.int32)
d_out2 = cp.empty_like(d_in2)
d_num_selected2 = cp.zeros(2, dtype=cp.uint64)

selector(d_temp_storage, d_in2, d_out2, d_num_selected2, greater_than_5, len(d_in2))

num_selected2 = int(d_num_selected2[0])
result2 = d_out2[:num_selected2].get()
print(f"Second select: {num_selected2} items: {result2}")
# Output: Second select: 5 items: [10 20 15  8 30]

assert num_selected == 5
assert (result == [6, 7, 8, 9, 10]).all()
Parameters:
  • d_in – Device array or iterator containing the input sequence of data items.

  • d_out – Device array or iterator to store the selected output items.

  • d_num_selected_out – Device array to store the number of items that passed the selection. The count is stored in d_num_selected_out[0].

  • cond – Selection condition (predicate). The signature is (T) -> uint8, where T is the input data type. Returns 1 (selected) or 0 (not selected).

Returns:

A callable object that performs the selection operation.

class cuda.compute.algorithms.DoubleBuffer(
d_current: DeviceArrayLike,
d_alternate: DeviceArrayLike,
)#
__init__(
d_current: DeviceArrayLike,
d_alternate: DeviceArrayLike,
)#
current()#
alternate()#
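
A DoubleBuffer pairs a primary device buffer with an alternate scratch buffer for sorting algorithms that ping-pong between them; after sorting, current() returns the buffer holding the sorted data, as in the segmented sort example above. A minimal construction sketch (array contents are illustrative):

import cupy as cp

from cuda.compute.algorithms import DoubleBuffer

# Primary buffer holding the unsorted keys, plus alternate scratch storage.
d_keys = cp.asarray([3, 1, 2], dtype=cp.int32)
d_keys_alt = cp.empty_like(d_keys)
keys_db = DoubleBuffer(d_keys, d_keys_alt)

# Pass keys_db as d_in_keys to a sort that accepts a DoubleBuffer; afterwards,
# keys_db.current() refers to the buffer containing the sorted keys.
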
class cuda.compute.algorithms.SortOrder(*values)#
ASCENDING = 0#
DESCENDING = 1#

Iterators#

class cuda.compute.iterators.CacheModifiedInputIterator(
array,
modifier: Literal['stream', 'global', 'volatile'] = 'stream',
)#

Iterator that wraps a device pointer with cache-modified loads.

This iterator uses PTX cache modifiers to control how data is loaded:

  • “stream”: Uses streaming loads (ld.global.cs) - hints that data will not be reused

  • “global”: Uses global cache loads (ld.global.cg) - caches only at L2

  • “volatile”: Uses volatile loads (ld.global.cv) - always fetches from memory

Supports element types of size 1, 2, 4, 8, or 16 bytes.
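
As a sketch of typical usage (assuming this iterator composes with reduce_into like the other input iterators documented here):

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import OpKind
from cuda.compute.iterators import CacheModifiedInputIterator

d_input = cp.array([1, 2, 3, 4], dtype=np.int32)
d_output = cp.empty(1, dtype=np.int32)
h_init = np.array([0], dtype=np.int32)

# Wrap the array with streaming loads, hinting that the data is read once.
stream_it = CacheModifiedInputIterator(d_input, modifier="stream")

cuda.compute.reduce_into(stream_it, d_output, OpKind.PLUS, len(d_input), h_init)
assert d_output[0] == 10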

__init__(
array,
modifier: Literal['stream', 'global', 'volatile'] = 'stream',
)#

Create a cache-modified input iterator.

Parameters:
  • array – Device array to wrap (must support __cuda_array_interface__)

  • modifier – Cache modifier - “stream”, “global”, or “volatile”

__add__(
offset: int,
) → CacheModifiedInputIterator#

Advance the iterator by offset elements.

property kind#

Return a hashable kind for caching purposes.

class cuda.compute.iterators.ConstantIterator(value: number)#

Iterator representing a sequence of constant values.

Similar to thrust::constant_iterator.

Every dereference returns the same constant value.

Example

The code snippet below demonstrates the usage of a ConstantIterator representing a sequence of constant values:

"""
Example showing how to use constant_iterator.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    ConstantIterator,
    OpKind,
)

# Prepare the input and output arrays.
constant_value = 42
num_items = 5

# Create the constant iterator.
constant_it = ConstantIterator(np.int32(constant_value))

# Prepare the initial value for the reduction.
h_init = np.array([0], dtype=np.int32)

# Prepare the output array.
d_output = cp.empty(1, dtype=np.int32)

# Perform the reduction.
cuda.compute.reduce_into(constant_it, d_output, OpKind.PLUS, num_items, h_init)

# Verify the result.
expected_output = constant_value * num_items
assert (d_output == expected_output).all()
print(f"Constant iterator result: {d_output[0]} (expected: {expected_output})")
Parameters:

value – The value of every item in the sequence

__init__(value: number)#

Create a constant iterator with the given value.

Parameters:

value – The constant value (must be a numpy scalar)

__add__(
offset: int,
) → ConstantIterator#

Return a new ConstantIterator (value doesn’t change with position).

class cuda.compute.iterators.CountingIterator(start: number)#

Iterator representing a sequence of incrementing values.

Similar to thrust::counting_iterator.

The iterator starts at start and increments by 1 for each advance.

Example

The code snippet below demonstrates the usage of a CountingIterator to compute the sum of the integers 1 through 100:

"""
Example showing how to use counting_iterator.
"""

import functools

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    CountingIterator,
    OpKind,
)

# Prepare the input and output arrays.
first_item = 1
num_items = 100

# Create the counting iterator.
first_it = CountingIterator(np.int32(first_item))

# Prepare the initial value for the reduction.
h_init = np.array([0], dtype=np.int32)

# Prepare the output array.
d_output = cp.empty(1, dtype=np.int32)

# Perform the reduction.
cuda.compute.reduce_into(first_it, d_output, OpKind.PLUS, num_items, h_init)

# Verify the result.
expected_output = functools.reduce(
    lambda a, b: a + b, range(first_item, first_item + num_items)
)
assert (d_output == expected_output).all()
print(f"Counting iterator result: {d_output[0]} (expected: {expected_output})")
Parameters:

start – The initial value of the sequence

__init__(start: number)#

Create a counting iterator starting at start.

Parameters:

start – The initial value (must be a numpy scalar)

__add__(
offset: int,
) → CountingIterator#

Return a new CountingIterator advanced by offset elements.

class cuda.compute.iterators.DiscardIterator(reference_iterator=None)#

Iterator that discards all reads and writes.

__init__(reference_iterator=None)#

Create a discard iterator.

Parameters:

reference_iterator – Optional iterator or device array used to infer value_type/state_type. Defaults to a temporary byte buffer.
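
When only a side result is needed, a DiscardIterator can stand in for an output that would otherwise have to be allocated. A minimal sketch (assuming select accepts it as d_out; here only the count of matches is kept):

import cupy as cp

from cuda.compute.algorithms import select
from cuda.compute.iterators import DiscardIterator

d_in = cp.array([1, 2, 3, 4, 5, 6], dtype=cp.int32)
d_num_selected = cp.zeros(2, dtype=cp.uint64)

# Discard the selected values themselves; infer the value type from d_in.
discard_out = DiscardIterator(d_in)


def is_even(x):
    return x % 2 == 0


select(d_in, discard_out, d_num_selected, is_even, len(d_in))
assert int(d_num_selected[0]) == 3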

class cuda.compute.iterators.IteratorBase(
state_bytes: bytes,
state_alignment: int,
value_type: TypeDescriptor,
)#

Iterators represent streams of data computed on the fly.

See cuda.compute.iterators for available iterators.

__init__(
state_bytes: bytes,
state_alignment: int,
value_type: TypeDescriptor,
)#
Parameters:
  • state_bytes – bytes object representing iterator’s state

  • state_alignment – Alignment of the state

  • value_type – Type of dereferenced values

property state: cuda.compute._bindings.IteratorState#

Return the iterator state for CCCL interop.

property state_alignment: int#

Return the alignment of the iterator state.

property value_type: TypeDescriptor#

Return the TypeDescriptor for dereferenced values.

property children: tuple[IteratorBase, ...]#

Return child iterators for automatic dependency tracking. Override in subclasses.

get_advance_op() → cuda.compute._bindings.Op#

Get the cached Op for the advance operation.

get_input_deref_op() → Op | None#

Get the cached Op for input dereference operation, or None if not supported.

get_output_deref_op() → Op | None#

Get the cached Op for output dereference operation, or None if not supported.

property is_input_iterator: bool#

Return True if this iterator supports input dereference.

property is_output_iterator: bool#

Return True if this iterator supports output dereference.

to_cccl_iter(
is_output: bool = False,
) → cuda.compute._bindings.Iterator#

Convert this iterator to a CCCL Iterator for algorithm interop.

Parameters:

is_output – If True, use output_dereference; otherwise use input_dereference

Returns:

CCCL Iterator object

property kind: Hashable#

Return a hashable kind for caching purposes.

Note: state_bytes is intentionally excluded - iterators with the same type structure but different runtime state should share cached reducers.

class cuda.compute.iterators.PermutationIterator(values, indices)#

Iterator that accesses values through an index mapping.

At position i, yields values[indices[i]].

Similar to thrust::permutation_iterator.

Example

The code snippet below demonstrates accessing values through an index mapping.

"""
Demonstrate reduction with permutation iterator as input.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    OpKind,
    PermutationIterator,
)

# Create a permutation iterator which selects values at the given indices:
d_values = cp.asarray([10, 20, 30, 40, 50], dtype=np.int32)
d_indices = cp.asarray([2, 0, 4, 1], dtype=np.int32)  # permutation indices
perm_it = PermutationIterator(d_values, d_indices)

# Prepare the initial value and output for the reduction.
h_init = np.array([0], dtype=np.int32)
d_output = cp.empty(1, dtype=np.int32)

# Perform the reduction on the permuted values.
num_items = len(d_indices)
cuda.compute.reduce_into(perm_it, d_output, OpKind.PLUS, num_items, h_init)

# Verify the result:
expected_output = d_values[d_indices].sum()
assert d_output[0] == expected_output
print(f"Permutation iterator result: {d_output[0]} (expected: {expected_output})")
__init__(values, indices)#

Create a permutation iterator.

Parameters:
  • values – Iterator or array providing the values to be permuted

  • indices – Iterator or array providing the indices for permutation

property children#

Return child iterators for automatic dependency tracking. Override in subclasses.

__add__(
offset: int,
) → PermutationIterator#

Advance the indices iterator by offset, keeping values at base.

property kind#

Return a hashable kind for caching purposes.

class cuda.compute.iterators.ReverseIterator(underlying)#

Iterator that reverses the direction of an underlying iterator.

Advancing by a positive offset moves backward in the underlying iterator.

__init__(underlying)#

Create a reverse iterator.

Parameters:

underlying – The underlying iterator or array to reverse

property children#

Return child iterators for automatic dependency tracking. Override in subclasses.

property kind#

Return a hashable kind for caching purposes.
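
A minimal sketch (assuming the reversed view yields the underlying elements last-to-first and is accepted as an input iterator by select; the always-true predicate simply copies every element):

import cupy as cp

from cuda.compute.algorithms import select
from cuda.compute.iterators import ReverseIterator

d_in = cp.array([1, 2, 3, 4], dtype=cp.int32)
d_out = cp.empty_like(d_in)
d_num_selected = cp.zeros(2, dtype=cp.uint64)

rev_it = ReverseIterator(d_in)


def keep_all(x):
    return True


# Selecting every element through the reversed view copies the array backwards.
select(rev_it, d_out, d_num_selected, keep_all, len(d_in))
assert (d_out.get() == [4, 3, 2, 1]).all()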

class cuda.compute.iterators.TransformIterator(
underlying,
transform_op,
value_type: TypeDescriptor | None = None,
is_input: bool = True,
)#

An iterator that applies a unary function to elements as they are read from an underlying iterator.

Similar to thrust::transform_iterator.

For input iteration (default): reads from underlying, applies transform, returns result. For output iteration: applies transform to input values, writes to underlying.

Example

The code snippet below demonstrates the usage of a TransformIterator composed with a CountingIterator to transform the input before performing a reduction:

"""
Using ``reduce_into`` with a ``TransformIterator`` to compute the
sum of squares of a sequence of numbers.
"""

import cupy as cp
import numpy as np

from cuda.compute import (
    OpKind,
    TransformIterator,
    reduce_into,
)

# Prepare the input and output arrays.
d_input = cp.arange(10, dtype=np.int32)
d_output = cp.empty(1, dtype=np.int32)
h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction

# Create a TransformIterator to (lazily) apply the square
it_input = TransformIterator(d_input, lambda a: a**2)

# Use `reduce_into` to compute the sum of the squares of the input.
reduce_into(it_input, d_output, OpKind.PLUS, len(d_input), h_init)

# Verify the result.
expected_output = cp.sum(d_input**2).get()
assert d_output[0] == expected_output
print(f"Transform iterator result: {d_output[0]} (expected: {expected_output})")
Parameters:
  • underlying – The underlying iterator or device array

  • transform_op – The unary operation to apply

  • value_type – TypeDescriptor for the transformed value type (optional; inferred for input iterators if not provided)

  • is_input – True for input iterator (default), False for output iterator

__init__(
underlying,
transform_op,
value_type: TypeDescriptor | None = None,
is_input: bool = True,
)#

Create a transform iterator.

Parameters:
  • underlying – The underlying iterator or device array to transform

  • transform_op – The unary transform operation (callable or OpKind)

  • value_type – TypeDescriptor for the transformed value type. For input iterators: inferred if None. For output iterators: must be provided or have annotations.

  • is_input – True for input iterator, False for output iterator

advance(
offset: int,
) → TransformIterator#

Return a new iterator advanced by offset elements.

property children#

Return child iterators for automatic dependency tracking. Override in subclasses.

property kind#

Return a hashable kind for caching purposes.

class cuda.compute.iterators.TransformOutputIterator(
underlying,
transform_op,
output_value_type=None,
)#

An iterator that applies a unary function to values before writing them to an underlying iterator.

Similar to thrust::transform_output_iterator.

This is a convenience subclass of TransformIterator configured for output mode.

Example

The code snippet below demonstrates the usage of a TransformOutputIterator to transform the output of a reduction before writing to an output array:

"""
TransformOutputIterator example demonstrating reduction with transform output iterator.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    OpKind,
    TransformOutputIterator,
)

# Create input and output arrays
d_input = cp.array([1, 2, 3, 4, 5.0], dtype=np.float32)
d_output = cp.empty(shape=1, dtype=np.float32)


# Define the transform operation to be applied
# to the result of the sum reduction.
# TransformOutputIterator requires type annotations:
def sqrt(x: np.float32) -> np.float32:
    return x**0.5


# Create transform output iterator
d_out_it = TransformOutputIterator(d_output, sqrt)


# Apply a sum reduction into the transform output iterator
cuda.compute.reduce_into(
    d_input,
    d_out_it,
    OpKind.PLUS,
    len(d_input),
    np.asarray([0], dtype=np.float32),
)

assert cp.allclose(d_output, cp.sqrt(cp.sum(d_input)), atol=1e-6)
Parameters:
  • underlying – The underlying iterator or device array

  • transform_op – The operation to be applied to values before they are written

  • output_value_type – TypeDescriptor for the values written through the iterator, i.e. the input type of transform_op (optional; extracted from annotations on transform_op if not provided)

__init__(
underlying,
transform_op,
output_value_type=None,
)#

Create a transform output iterator.

Parameters:
  • underlying – The underlying iterator or device array to write to

  • transform_op – The unary transform operation (callable or OpKind) applied to values before they are written

  • output_value_type – TypeDescriptor for the values written through the iterator (optional; extracted from annotations on transform_op if not provided)

class cuda.compute.iterators.ZipIterator(*args)#

Iterator that zips multiple iterators together.

At each position, yields a tuple of values from all underlying iterators.

Similar to thrust::zip_iterator.

Example

The code snippet below demonstrates how to zip together an array and a CountingIterator to find the index of the maximum value of the array.

"""
Example showing how to use zip_iterator with counting iterator to
find the index with maximum value in an array.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    CountingIterator,
    ZipIterator,
)


def max_by_value(p1, p2):
    """Reduction operation that returns the pair with the larger value."""
    return p1 if p1[1] > p2[1] else p2


# Create the counting iterator.
counting_it = CountingIterator(np.int32(0))

# Prepare the input array.
arr = cp.asarray([0, 1, 2, 4, 7, 3, 5, 6], dtype=np.int32)

# Create the zip iterator.
zip_it = ZipIterator(counting_it, arr)

num_items = 8

# Note: initial value passed as a numpy struct
dtype = np.dtype([("index", np.int32), ("value", np.int32)], align=True)
h_init = np.asarray([(-1, -1)], dtype=dtype)

d_output = cp.empty(1, dtype=dtype)

# Perform the reduction.
cuda.compute.reduce_into(zip_it, d_output, max_by_value, num_items, h_init)

result = d_output.get()[0]
expected_index = 4
expected_value = 7

assert result["index"] == expected_index
assert result["value"] == expected_value

print(
    f"Zip iterator with counting result: index={result['index']} "
    f"(expected: {expected_index}), value={result['value']} (expected: {expected_value})"
)
__init__(*args)#

Create a zip iterator.

Parameters:

*args – Iterators or arrays to zip together. Can be:

  • Multiple iterators/arrays: ZipIterator(it1, it2, it3)

  • A single sequence of iterators: ZipIterator([it1, it2, it3])

property children#

Return child iterators for automatic dependency tracking. Override in subclasses.

__add__(
offset: int,
) → ZipIterator#

Advance all child iterators by offset.

property kind#

Return a hashable kind for caching purposes.

Operators#

class cuda.compute.op.OpKind#

Enumeration of operator kinds for CUDA parallel algorithms.

This enum defines the types of operations that can be performed in parallel algorithms, including arithmetic, logical, and bitwise operations.

PLUS#
MINUS#
MULTIPLIES#
DIVIDES#
MODULUS#
EQUAL_TO#
NOT_EQUAL_TO#
GREATER#
LESS#
GREATER_EQUAL#
LESS_EQUAL#
LOGICAL_AND#
LOGICAL_OR#
LOGICAL_NOT#
BIT_AND#
BIT_OR#
BIT_XOR#
BIT_NOT#
IDENTITY#
NEGATE#
MINIMUM#
MAXIMUM#
class cuda.compute.op.RawOp(
*,
ltoir: bytes,
name: str,
state: bytes = b'',
state_alignment: int = 1,
extra_ltoirs: list[bytes] | None = None,
)#

RawOp can be used to directly pass compiled device code (LTO-IR) implementing custom operators.

This is useful for users who wish to implement custom operators in C++ or another language, or wish to use a different compilation pipeline than the default (JIT compilation from Python callables using Numba CUDA).

Example

The example below shows how to compile C++ device code to LTOIR and use it with reduce_into:

"""
Create a custom C++ operator from LTOIR bytecode using RawOp.

This example demonstrates how to compile C++ device code to LTOIR and use it
as a custom operator.
"""

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute.op import RawOp
from cuda.core import Device, Program, ProgramOptions


def get_arch():
    """Get the SM architecture string for the current device."""
    device = Device()
    device.set_current()
    cc_major, cc_minor = device.compute_capability
    return f"sm_{cc_major}{cc_minor}"


def compile_cpp_to_ltoir(source: str, arch: str) -> bytes:
    """Compile C++ source to LTOIR using cuda.core."""
    opts = ProgramOptions(
        arch=arch,
        relocatable_device_code=True,
        link_time_optimization=True,
    )
    prog = Program(source, "c++", options=opts)
    return prog.compile("ltoir").code


# Define a C++ custom multiply operator
cpp_source = """
extern "C" __device__ void multiply_op(void* a, void* b, void* result) {
    *static_cast<int*>(result) = *static_cast<int*>(a) * *static_cast<int*>(b);
}
"""

# Compile C++ to LTOIR
arch = get_arch()
ltoir_bytes = compile_cpp_to_ltoir(cpp_source, arch)

# Create a RawOp from the LTOIR bytecode
multiply_op = RawOp(ltoir=ltoir_bytes, name="multiply_op")

# Prepare test data
h_input = np.array([1, 2, 3, 4, 5], dtype=np.int32)
d_input = cp.array(h_input)
d_output = cp.empty(1, dtype=np.int32)
h_init = np.array(1, dtype=np.int32)

# Use the custom operator with reduce_into
cuda.compute.reduce_into(d_input, d_output, multiply_op, len(d_input), h_init)

# Verify the result
result = d_output.get()[0]
expected = np.prod(h_input)  # 1 * 2 * 3 * 4 * 5 = 120
assert result == expected, f"Expected {expected}, got {result}"

print(f"Custom multiply reduction result: {result}")
print("RawOp stateless example completed successfully!")
Parameters:
  • name – The ABI name of the operator

  • ltoir – bytes object containing the LTO-IR of the compiled operator

  • state – Optional bytes representing the operator’s state

  • state_alignment – Alignment requirement for the state bytes (default: 1)

  • extra_ltoirs – Optional list of additional LTO-IRs to include during linking

Notes

  • The provided LTO-IR must define a function with the specified name and the correct signature.

  • The function must take untyped (void*) pointers for all parameters and write its result through a pointer rather than returning a value. The function body is responsible for correctly interpreting the pointer arguments based on the expected input and output types. For stateless operators, the signature is

    void func(void* arg1, void* arg2, …, void* result)

    For stateful operators, the first parameter must be a pointer to the state:

    void func(void* state, void* arg1, void* arg2, …)
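
A minimal sketch of constructing a stateful RawOp (the biased_plus operator is illustrative, and compile_cpp_to_ltoir and get_arch are the helpers defined in the example above):

import numpy as np

from cuda.compute.op import RawOp

# Hypothetical stateful operator: the first parameter points to the state,
# here a single int32 bias added to every combination.
cpp_stateful = """
extern "C" __device__ void biased_plus(void* state, void* a, void* b, void* result) {
    int bias = *static_cast<int*>(state);
    *static_cast<int*>(result) = *static_cast<int*>(a) + *static_cast<int*>(b) + bias;
}
"""

ltoir_bytes = compile_cpp_to_ltoir(cpp_stateful, get_arch())
biased_op = RawOp(
    ltoir=ltoir_bytes,
    name="biased_plus",
    state=np.int32(1).tobytes(),  # 4-byte state: bias = 1
    state_alignment=4,
)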

__init__(
*,
ltoir: bytes,
name: str,
state: bytes = b'',
state_alignment: int = 1,
extra_ltoirs: list[bytes] | None = None,
)#
compile(
input_types,
output_type=None,
) → cuda.compute._bindings.Op#

Compile this operator to an Op for CCCL interop.

Parameters:
  • input_types – Tuple of TypeDescriptors for input arguments

  • output_type – Optional TypeDescriptor for return value (inferred if None)

Returns:

Compiled Op object for C++ interop

get_state() → bytes#

Return the op’s state bytes.

Utilities#

cuda.compute.struct.gpu_struct(
field_dict: dict | dtype | type,
name: str = 'AnonymousStruct',
)#

A factory for creating struct types.

Parameters:
  • field_dict – A dictionary, numpy dtype, or annotated class providing the mapping of field names to data types.

  • name – The name of the struct type that will be returned.

Returns:

A struct class whose instances can be used where a GpuStruct is expected, for example as the initial value (h_init) of a reduction.
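
A minimal sketch (assuming instances of the returned class are constructed from field values in declaration order; the MinMax type is illustrative):

import numpy as np

from cuda.compute.struct import gpu_struct

# Define a struct type from a mapping of field names to dtypes.
MinMax = gpu_struct({"min": np.int32, "max": np.int32}, name="MinMax")

# Build an initial value suitable for a combined min/max-style reduction.
h_init = MinMax(np.iinfo(np.int32).max, np.iinfo(np.int32).min)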

Typing#

class cuda.compute.typing.DeviceArrayLike(*args, **kwargs)#

Protocol for array-like objects that expose device memory via CUDA Array Interface.

Any object implementing the __cuda_array_interface__ attribute can be used where a DeviceArrayLike is expected. This includes CuPy arrays, Numba device arrays, PyTorch CUDA tensors, and other GPU array types.

See CUDA Array Interface specification for details.

__init__(*args, **kwargs)#
class cuda.compute.typing.GpuStruct#

Instance of types created with cuda.compute.struct.gpu_struct.

alias of TypeVar(‘GpuStruct’, bound=_Struct)

class cuda.compute.typing.IteratorT#

Type variable for iterator objects.

Represents any subclass of IteratorBase. See cuda.compute.iterators for all available iterators.

alias of TypeVar(‘IteratorT’, bound=IteratorBase)

cuda.compute.typing.Operator#

Type alias for operator objects passed to algorithm functions.

Algorithms accept the following objects as operators:

  • Python functions or lambdas implementing the operator. These are JIT-compiled into device code using numba.cuda.

  • OpKind enumerators, which are pre-defined constants for common operations.

  • RawOp objects containing pre-compiled device code.

alias of Callable | OpKind | RawOp | _OpAdapter
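
For instance, the same reduction can be expressed with an OpKind constant or with a Python callable (a minimal sketch):

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import OpKind

d_in = cp.array([1, 2, 3], dtype=np.int32)
d_out = cp.empty(1, dtype=np.int32)
h_init = np.array([0], dtype=np.int32)

# Pre-defined OpKind constant:
cuda.compute.reduce_into(d_in, d_out, OpKind.PLUS, len(d_in), h_init)

# Equivalent Python callable, JIT-compiled into device code:
cuda.compute.reduce_into(d_in, d_out, lambda a, b: a + b, len(d_in), h_init)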