CUDA Parallel
Warning
Python exposure of parallel algorithms is in public beta. The API is subject to change without notice.
- cuda.parallel.experimental.reduce_into(d_in, d_out, op, init)
Computes a device-wide reduction using the specified binary op functor and initial value init.

Example
The code snippet below illustrates a user-defined min-reduction of a device vector of int data elements.

import cupy as cp
import numpy as np
import cuda.parallel.experimental as cudax
The snippet below demonstrates the usage of the reduce_into API:

def min_op(a, b):
    return a if a < b else b

dtype = np.int32
h_init = np.array([42], dtype=dtype)
d_input = cp.array([8, 6, 7, 5, 3, 0, 9], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)

# Instantiate reduction for the given operator and initial value
reduce_into = cudax.reduce_into(d_input, d_output, min_op, h_init)

# Determine temporary device storage requirements
temp_storage_size = reduce_into(None, d_input, d_output, h_init)

# Allocate temporary storage
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Run reduction
reduce_into(d_temp_storage, d_input, d_output, h_init)

# Check the result is correct
expected_output = 0
assert (d_output == expected_output).all()
- Parameters
d_in – CUDA device array storing the input sequence of data items
d_out – CUDA device array storing the output aggregate
op – Binary reduction operator
init – NumPy array storing the initial value of the reduction
- Returns
A callable object that can be used to perform the reduction
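Because the returned object is a callable, it can be built once and then invoked for both the temporary-storage query and the actual reduction. As a further illustration, the sketch below applies the same build / query temporary storage / run pattern to a sum reduction over float32 data; since the API is experimental, treat this as a minimal sketch assuming the call pattern shown in the example above, and note that add_op and the sample values are illustrative, not part of the documented API.

import cupy as cp
import numpy as np
import cuda.parallel.experimental as cudax

# Illustrative binary operator (assumption: any commutative Python binary function works)
def add_op(a, b):
    return a + b

dtype = np.float32
h_init = np.array([0.0], dtype=dtype)
d_input = cp.array([1.5, 2.5, 3.0], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)

# Build the reducer once for this operator, dtype, and initial value
sum_into = cudax.reduce_into(d_input, d_output, add_op, h_init)

# First call with None for the temporary storage reports the required size in bytes
temp_storage_size = sum_into(None, d_input, d_output, h_init)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Second call performs the reduction on the device
sum_into(d_temp_storage, d_input, d_output, h_init)

# 0.0 + 1.5 + 2.5 + 3.0 == 7.0
assert np.isclose(float(d_output.get()[0]), 7.0)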