cub::DeviceReduce
Defined in cub/device/device_reduce.cuh
-
struct DeviceReduce
DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within device-accessible memory.
Overview
A reduction (or fold) uses a binary combining operator to compute a single aggregate from a sequence of input elements.
Usage Considerations
Dynamic parallelism. DeviceReduce methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
Performance
The work-complexity of reduction, reduce-by-key, and run-length encode as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.
Public Static Functions
-
template<typename InputIteratorT, typename OutputIteratorT, typename ReductionOpT, typename T, typename NumItemsT>
static inline cudaError_t Reduce(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumItemsT num_items, ReductionOpT reduction_op, T init, cudaStream_t stream = 0) Computes a device-wide reduction using the specified binary
reduction_op
functor and initial valueinit
.Does not support binary reduction operators that are non-commutative.
Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
The range
[d_in, d_in + num_items)
shall not overlapd_out
.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates a user-defined min-reduction of a device vector of
int
data elements.#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh> // CustomMin functor struct CustomMin { template <typename T> __host__ __forceinline__ T operator()(const T &a, const T &b) const { return (b < a) ? b : a; } }; // Declare, allocate, and initialize device-accessible pointers for // input and output int num_items; // e.g., 7 int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_out; // e.g., [-] CustomMin min_op; int init; // e.g., INT_MAX ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceReduce::Reduce( d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, min_op, init); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run reduction cub::DeviceReduce::Reduce( d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, min_op, init); // d_out <-- [0]
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
ReductionOpT – [inferred] Binary reduction functor type having member
T operator()(const T &a, const T &b)
T – [inferred] Data element type that is convertible to the
value
type ofInputIteratorT
NumItemsT – [inferred] Type of num_items
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_items – [in] Total number of input items (i.e., length of
d_in
)reduction_op – [in] Binary reduction functor
init – [in] Initial value of the reduction
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename NumItemsT>
static inline cudaError_t Sum(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumItemsT num_items, cudaStream_t stream = 0) Computes a device-wide sum using the addition (
+
) operator.Uses
0
as the initial value of the reduction.Does not support
+
operators that are non-commutative..Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
The range
[d_in, d_in + num_items)
shall not overlapd_out
.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the sum-reduction of a device vector of
int
data elements.#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh> // Declare, allocate, and initialize device-accessible pointers // for input and output int num_items; // e.g., 7 int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_out; // e.g., [-] ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceReduce::Sum( d_temp_storage, temp_storage_bytes, d_in, d_out, num_items); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run sum-reduction cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items); // d_out <-- [38]
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
NumItemsT – [inferred] Type of num_items
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_items – [in] Total number of input items (i.e., length of
d_in
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename NumItemsT>
static inline cudaError_t Min(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumItemsT num_items, cudaStream_t stream = 0) Computes a device-wide minimum using the less-than (
<
) operator.Uses
std::numeric_limits<T>::max()
as the initial value of the reduction.Does not support
<
operators that are non-commutative.Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
The range
[d_in, d_in + num_items)
shall not overlapd_out
.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the min-reduction of a device vector of
int
data elements.#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh> // Declare, allocate, and initialize device-accessible pointers // for input and output int num_items; // e.g., 7 int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_out; // e.g., [-] ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceReduce::Min( d_temp_storage, temp_storage_bytes, d_in, d_out, num_items); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run min-reduction cub::DeviceReduce::Min( d_temp_storage, temp_storage_bytes, d_in, d_out, num_items); // d_out <-- [0]
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
NumItemsT – [inferred] Type of num_items
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_items – [in] Total number of input items (i.e., length of
d_in
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT>
static inline cudaError_t ArgMin(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream = 0) Finds the first device-wide minimum using the less-than (
<
) operator, also returning the index of that item.The output value type of
d_out
iscub::KeyValuePair<int, T>
(assuming the value type ofd_in
isT
)The minimum is written to
d_out.value
and its offset in the input array is written tod_out.key
.The
{1, std::numeric_limits<T>::max()}
tuple is produced for zero-length inputs
Does not support
<
operators that are non-commutative.Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
The range
[d_in, d_in + num_items)
shall not overlap d_out.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the argmin-reduction of a device vector of
int
data elements.#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh> // Declare, allocate, and initialize device-accessible pointers // for input and output int num_items; // e.g., 7 int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9] KeyValuePair<int, int> *d_argmin; // e.g., [{-,-}] ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceReduce::ArgMin(d_temp_storage, temp_storage_bytes, d_in, d_argmin, num_items); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run argmin-reduction cub::DeviceReduce::ArgMin(d_temp_storage, temp_storage_bytes, d_in, d_argmin, num_items); // d_argmin <-- [{5, 0}]
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type
T
) (may be a simple pointer type)OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type
cub::KeyValuePair<int, T>
) (may be a simple pointer type)
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_items – [in] Total number of input items (i.e., length of
d_in
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename NumItemsT>
static inline cudaError_t Max(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumItemsT num_items, cudaStream_t stream = 0) Computes a device-wide maximum using the greater-than (
>
) operator.Uses
std::numeric_limits<T>::lowest()
as the initial value of the reduction.Does not support
>
operators that are non-commutative.Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
The range
[d_in, d_in + num_items)
shall not overlapd_out
.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the max-reduction of a device vector of
int
data elements.#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh> // Declare, allocate, and initialize device-accessible pointers // for input and output int num_items; // e.g., 7 int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_max; // e.g., [-] ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_max, num_items); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run max-reduction cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_max, num_items); // d_max <-- [9]
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
NumItemsT – [inferred] Type of num_items
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_items – [in] Total number of input items (i.e., length of
d_in
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT>
static inline cudaError_t ArgMax(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream = 0) Finds the first device-wide maximum using the greater-than (
>
) operator, also returning the index of that itemThe output value type of
d_out
iscub::KeyValuePair<int, T>
(assuming the value type ofd_in
isT
)The maximum is written to
d_out.value
and its offset in the input array is written tod_out.key
.The
{1, std::numeric_limits<T>::lowest()}
tuple is produced for zero-length inputs
Does not support
>
operators that are non-commutative.Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
The range
[d_in, d_in + num_items)
shall not overlapd_out
.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the argmax-reduction of a device vector of int data elements.
#include <cub/cub.cuh> // or equivalently <cub/device/device_reduce.cuh> // Declare, allocate, and initialize device-accessible pointers // for input and output int num_items; // e.g., 7 int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9] KeyValuePair<int, int> *d_argmax; // e.g., [{-,-}] ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceReduce::ArgMax( d_temp_storage, temp_storage_bytes, d_in, d_argmax, num_items); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run argmax-reduction cub::DeviceReduce::ArgMax( d_temp_storage, temp_storage_bytes, d_in, d_argmax, num_items); // d_argmax <-- [{6, 9}]
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type
T
) (may be a simple pointer type)OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type
cub::KeyValuePair<int, T>
) (may be a simple pointer type)
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_items – [in] Total number of input items (i.e., length of
d_in
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename ReductionOpT, typename TransformOpT, typename T, typename NumItemsT>
static inline cudaError_t TransformReduce(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumItemsT num_items, ReductionOpT reduction_op, TransformOpT transform_op, T init, cudaStream_t stream = 0) Fuses transform and reduce operations
Does not support binary reduction operators that are non-commutative.
Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
The range
[d_in, d_in + num_items)
shall not overlapd_out
.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates a user-defined min-reduction of a device vector of int data elements.
#include <cub/cub.cuh> // or equivalently <cub/device/device_reduce.cuh> thrust::device_vector<int> in = { 1, 2, 3, 4 }; thrust::device_vector<int> out(1); std::size_t temp_storage_bytes = 0; std::uint8_t *d_temp_storage = nullptr; const int init = 42; cub::DeviceReduce::TransformReduce( d_temp_storage, temp_storage_bytes, in.begin(), out.begin(), in.size(), cuda::std::plus<>{}, square_t{}, init); thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes); d_temp_storage = temp_storage.data().get(); cub::DeviceReduce::TransformReduce( d_temp_storage, temp_storage_bytes, in.begin(), out.begin(), in.size(), cuda::std::plus<>{}, square_t{}, init); // out[0] <-- 72
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
ReductionOpT – [inferred] Binary reduction functor type having member
T operator()(const T &a, const T &b)
TransformOpT – [inferred] Unary reduction functor type having member
auto operator()(const T &a)
T – [inferred] Data element type that is convertible to the
value
type ofInputIteratorT
NumItemsT – [inferred] Type of num_items
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_items – [in] Total number of input items (i.e., length of
d_in
)reduction_op – [in] Binary reduction functor
transform_op – [in] Unary transform functor
init – [in] Initial value of the reduction
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename KeysInputIteratorT, typename UniqueOutputIteratorT, typename ValuesInputIteratorT, typename AggregatesOutputIteratorT, typename NumRunsOutputIteratorT, typename ReductionOpT, typename NumItemsT>
static inline cudaError_t ReduceByKey(void *d_temp_storage, size_t &temp_storage_bytes, KeysInputIteratorT d_keys_in, UniqueOutputIteratorT d_unique_out, ValuesInputIteratorT d_values_in, AggregatesOutputIteratorT d_aggregates_out, NumRunsOutputIteratorT d_num_runs_out, ReductionOpT reduction_op, NumItemsT num_items, cudaStream_t stream = 0) Reduces segments of values, where segments are demarcated by corresponding runs of identical keys.
This operation computes segmented reductions within
d_values_in
using the specified binaryreduction_op
functor. The segments are identified by “runs” of corresponding keys in d_keys_in, where runs are maximal ranges of consecutive, identical keys. For the ith run encountered, the first key of the run and the corresponding value aggregate of that run are written tod_unique_out[i]
andd_aggregates_out[i]
, respectively. The total number of runs encountered is written tod_num_runs_out
.The
==
equality operator is used to determine whether keys are equivalentProvides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
Let
out
be any of[d_unique_out, d_unique_out + *d_num_runs_out)
[d_aggregates_out, d_aggregates_out + *d_num_runs_out)
d_num_runs_out
. The ranges represented byout
shall not overlap[d_keys_in, d_keys_in + num_items)
,[d_values_in, d_values_in + num_items)
norout
in any way.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the segmented reduction of
int
values grouped by runs of associatedint
keys.#include <cub/cub.cuh> // or equivalently <cub/device/device_reduce.cuh> // CustomMin functor struct CustomMin { template <typename T> __host__ __device__ __forceinline__ T operator()(const T &a, const T &b) const { return (b < a) ? b : a; } }; // Declare, allocate, and initialize device-accessible pointers // for input and output int num_items; // e.g., 8 int *d_keys_in; // e.g., [0, 2, 2, 9, 5, 5, 5, 8] int *d_values_in; // e.g., [0, 7, 1, 6, 2, 5, 3, 4] int *d_unique_out; // e.g., [-, -, -, -, -, -, -, -] int *d_aggregates_out; // e.g., [-, -, -, -, -, -, -, -] int *d_num_runs_out; // e.g., [-] CustomMin reduction_op; ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceReduce::ReduceByKey( d_temp_storage, temp_storage_bytes, d_keys_in, d_unique_out, d_values_in, d_aggregates_out, d_num_runs_out, reduction_op, num_items); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run reduce-by-key cub::DeviceReduce::ReduceByKey( d_temp_storage, temp_storage_bytes, d_keys_in, d_unique_out, d_values_in, d_aggregates_out, d_num_runs_out, reduction_op, num_items); // d_unique_out <-- [0, 2, 9, 5, 8] // d_aggregates_out <-- [0, 1, 6, 2, 4] // d_num_runs_out <-- [5]
- Template Parameters
KeysInputIteratorT – [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
UniqueOutputIteratorT – [inferred] Random-access output iterator type for writing unique output keys (may be a simple pointer type)
ValuesInputIteratorT – [inferred] Random-access input iterator type for reading input values (may be a simple pointer type)
AggregatesOutputIterator – [inferred] Random-access output iterator type for writing output value aggregates (may be a simple pointer type)
NumRunsOutputIteratorT – [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)
ReductionOpT – [inferred] Binary reduction functor type having member
T operator()(const T &a, const T &b)
NumItemsT – [inferred] Type of num_items
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_keys_in – [in] Pointer to the input sequence of keys
d_unique_out – [out] Pointer to the output sequence of unique keys (one key per run)
d_values_in – [in] Pointer to the input sequence of corresponding values
d_aggregates_out – [out] Pointer to the output sequence of value aggregates (one aggregate per run)
d_num_runs_out – [out] Pointer to total number of runs encountered (i.e., the length of
d_unique_out
)reduction_op – [in] Binary reduction functor
num_items – [in] Total number of associated key+value pairs (i.e., the length of
d_in_keys
andd_in_values
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.