CUDA Parallel

Warning

Python exposure of parallel algorithms is in public beta. The API is subject to change without notice.

cuda.parallel.experimental.reduce_into(d_in, d_out, op, init)

Computes a device-wide reduction using the specified binary operator op and initial value init.

Example

The code snippet below illustrates a user-defined min-reduction of a device vector of int data elements.

import cupy as cp
import numpy as np
import cuda.parallel.experimental as cudax


def min_op(a, b):
    return a if a < b else b

dtype = np.int32
h_init = np.array([42], dtype=dtype)
d_input = cp.array([8, 6, 7, 5, 3, 0, 9], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)

# Instantiate reduction for the given operator and initial value
reduce_into = cudax.reduce_into(d_input, d_output, min_op, h_init)

# Determine temporary device storage requirements
temp_storage_size = reduce_into(None, d_input, d_output, h_init)

# Allocate temporary storage
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# Run reduction
reduce_into(d_temp_storage, d_input, d_output, h_init)

# Check the result is correct
expected_output = 0
assert (d_output == expected_output).all()
Parameters
  • d_in – CUDA device array storing the input sequence of data items

  • d_out – CUDA device array storing the output aggregate

  • op – Binary reduction operator

  • init – Numpy array storing initial value of the reduction

Returns

A callable object that can be used to perform the reduction
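Semantically, the reduction folds op over the input sequence, seeded with the initial value. A minimal host-side sketch of the same computation, using only the Python standard library (the operator and input values mirror the device example above; this is an illustration of the semantics, not the device implementation):

```python
from functools import reduce

def min_op(a, b):
    # Same binary operator as in the device example
    return a if a < b else b

h_init = 42
values = [8, 6, 7, 5, 3, 0, 9]

# reduce_into computes the equivalent of folding the
# operator over the input, seeded with the initial value
result = reduce(min_op, values, h_init)
print(result)  # 0, since 0 is smaller than both the inputs and the seed
```

Note that the initial value participates in the reduction: if every input element were greater than 42, the result would be 42.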