parallel: Device-Level Parallel Algorithms
The cuda.cccl.parallel library provides device-level algorithms that operate on entire arrays or ranges of data. These algorithms are designed to be easy to use from Python while delivering the performance of hand-optimized CUDA kernels and remaining portable across different GPU architectures.
Algorithms
The core functionality provided by the parallel library is a set of algorithms such as reductions, scans, sorts, and transforms.
Here’s a simple example showing how to use the reduce_into algorithm to reduce an array of integers.
import cupy as cp
import numpy as np

# The exact import path may vary across cuda-cccl releases; this assumes
# the experimental module layout.
import cuda.cccl.parallel.experimental as parallel


def sum_reduction_example():
    """Sum all values in an array using reduction."""

    def add_op(a, b):
        return a + b

    dtype = np.int32
    h_init = np.array([0], dtype=dtype)  # Initial value for the reduction
    d_input = cp.array([1, 2, 3, 4, 5], dtype=dtype)
    d_output = cp.empty(1, dtype=dtype)  # Storage for the result

    # Run reduction
    parallel.reduce_into(d_input, d_output, add_op, len(d_input), h_init)

    expected_output = 15  # 1+2+3+4+5
    assert (d_output == expected_output).all()
    print(f"Sum: {d_output[0]}")
    return d_output[0]
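Reduction is just one of the algorithms listed above. As a sketch of another, the example below assumes the library’s inclusive scan takes the same (input, output, operator, num_items, initial value) arguments as reduce_into; consult the API reference for the exact signature.

def inclusive_scan_example():
    """Compute a running sum (inclusive prefix scan) of an array."""

    def add_op(a, b):
        return a + b

    dtype = np.int32
    h_init = np.array([0], dtype=dtype)
    d_input = cp.array([1, 2, 3, 4, 5], dtype=dtype)
    d_output = cp.empty(len(d_input), dtype=dtype)  # One running total per element

    # Run the scan (signature assumed to mirror reduce_into)
    parallel.inclusive_scan(d_input, d_output, add_op, len(d_input), h_init)

    expected = np.array([1, 3, 6, 10, 15], dtype=dtype)
    assert (d_output.get() == expected).all()
    return d_output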
Iterators
Algorithms can be used not just on arrays, but also on iterators. Iterators provide a way to represent sequences of data without needing to allocate memory for them.
Here’s an example showing how to use reduction with a CountingIterator that generates a sequence of numbers starting from a specified value.
import functools


def counting_iterator_example():
    """Demonstrate reduction with a counting iterator."""

    def add_op(a, b):
        return a + b

    first_item = 10
    num_items = 3

    first_it = parallel.CountingIterator(np.int32(first_item))  # Input sequence 10, 11, 12
    h_init = np.array([0], dtype=np.int32)  # Initial value for the reduction
    d_output = cp.empty(1, dtype=np.int32)  # Storage for output

    # Run reduction
    parallel.reduce_into(first_it, d_output, add_op, num_items, h_init)

    # Compute the expected result on the host for comparison
    expected_output = functools.reduce(
        lambda a, b: a + b, range(first_item, first_item + num_items)
    )
    assert (d_output == expected_output).all()
    print(f"Counting iterator result: {d_output[0]} (expected: {expected_output})")
    return d_output[0]
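Counting sequences are not the only memory-less inputs. Assuming the library also provides a ConstantIterator constructed in the same style (an assumption here, not confirmed by this page), a repeated value can be reduced without ever materializing it:

def constant_iterator_example():
    """Sum a repeated constant without allocating it in memory."""

    def add_op(a, b):
        return a + b

    value = 7
    num_items = 4

    # Represents the sequence 7, 7, 7, 7, ... (ConstantIterator assumed)
    constant_it = parallel.ConstantIterator(np.int32(value))
    h_init = np.array([0], dtype=np.int32)
    d_output = cp.empty(1, dtype=np.int32)

    parallel.reduce_into(constant_it, d_output, add_op, num_items, h_init)

    assert d_output[0] == value * num_items  # 7 * 4 = 28
    return d_output[0]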
Iterators also provide a way to compose operations. Here’s an example showing how to use reduce_into with a TransformIterator that negates the even values of a counting sequence before summing them.
def transform_iterator_example():
    """Demonstrate reduction with a transform iterator."""

    def add_op(a, b):
        return a + b

    def transform_op(a):
        # Negate even values, keep odd values unchanged
        return -a if a % 2 == 0 else a

    first_item = 10
    num_items = 100

    transform_it = parallel.TransformIterator(
        parallel.CountingIterator(np.int32(first_item)), transform_op
    )  # Input sequence
    h_init = np.array([0], dtype=np.int64)  # Initial value for the reduction
    d_output = cp.empty(1, dtype=np.int64)  # Storage for output

    # Run reduction
    parallel.reduce_into(transform_it, d_output, add_op, num_items, h_init)

    # Compute the expected result on the host for comparison
    expected_output = functools.reduce(
        lambda a, b: a + b,
        [-a if a % 2 == 0 else a for a in range(first_item, first_item + num_items)],
    )
    assert d_output[0] == expected_output
    print(f"Transform iterator result: {d_output[0]} (expected: {expected_output})")
    return d_output[0]
Custom Types
The parallel library supports defining custom data types using the gpu_struct decorator.
Here are some examples showing how to define and use custom types:
def pixel_reduction_example():
    """Demonstrate reduction with a custom Pixel struct to find the maximum green value."""

    @parallel.gpu_struct
    class Pixel:
        r: np.int32
        g: np.int32
        b: np.int32

    def max_g_value(x, y):
        return x if x.g > y.g else y

    # Create random RGB data and view it as an array of Pixel structs
    d_rgb = cp.random.randint(0, 256, (10, 3), dtype=np.int32).view(Pixel.dtype)
    d_out = cp.empty(1, Pixel.dtype)

    h_init = Pixel(0, 0, 0)

    # Run reduction
    parallel.reduce_into(d_rgb, d_out, max_g_value, d_rgb.size, h_init)

    # Verify result against the host: column 1 holds the green channel
    h_rgb = d_rgb.get()
    expected = h_rgb[h_rgb.view("int32")[:, 1].argmax()]
    assert expected["g"] == d_out.get()["g"]
    print(f"Maximum green value: {d_out.get()['g']}")
    return d_out.get()
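A gpu_struct can also carry multiple accumulators through a single pass. The following sketch, which assumes the same reduce_into interface used above, tracks a minimum and a maximum simultaneously:

def minmax_reduction_example():
    """Find the minimum and maximum of an array in one reduction pass."""

    @parallel.gpu_struct
    class MinMax:
        min_val: np.float64
        max_val: np.float64

    def minmax_op(a, b):
        return MinMax(min(a.min_val, b.min_val), max(a.max_val, b.max_val))

    nelems = 4096
    d_values = cp.random.rand(nelems)

    # Duplicate each value into a (min, max) pair so it matches MinMax
    d_in = cp.stack([d_values, d_values], axis=-1).view(MinMax.dtype)
    d_out = cp.empty(1, MinMax.dtype)

    h_init = MinMax(np.inf, -np.inf)  # Identity element for (min, max)

    parallel.reduce_into(d_in, d_out, minmax_op, nelems, h_init)

    h_out = d_out.get()[0]
    assert h_out["min_val"] == float(d_values.min())
    assert h_out["max_val"] == float(d_values.max())
    return h_out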
Example Collections
For complete runnable examples and more advanced usage patterns, see our full collection of examples.