cub::DeviceReduce

Defined in cub/device/device_reduce.cuh

struct DeviceReduce

DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within device-accessible memory.

../_images/reduce_logo.png

Overview

A reduction (or fold) uses a binary combining operator to compute a single aggregate from a sequence of input elements.

Usage Considerations

  • Dynamic parallelism. DeviceReduce methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

Performance

The work-complexity of reduction, reduce-by-key, and run-length encode as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.

Public Static Functions

template<typename InputIteratorT, typename OutputIteratorT, typename ReductionOpT, typename T, typename NumItemsT>
static inline cudaError_t Reduce(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumItemsT num_items, ReductionOpT reduction_op, T init, cudaStream_t stream = 0)

Computes a device-wide reduction using the specified binary reduction_op functor and initial value init.

  • Does not support binary reduction operators that are non-commutative.

  • Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.

  • The range [d_in, d_in + num_items) shall not overlap d_out.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates a user-defined min-reduction of a device vector of int data elements.

#include <cub/cub.cuh>
// or equivalently <cub/device/device_radix_sort.cuh>

// CustomMin functor
struct CustomMin
{
    template <typename T>
    __host__ __forceinline__
    T operator()(const T &a, const T &b) const {
        return (b < a) ? b : a;
    }
};

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int          num_items;  // e.g., 7
int          *d_in;      // e.g., [8, 6, 7, 5, 3, 0, 9]
int          *d_out;     // e.g., [-]
CustomMin    min_op;
int          init;       // e.g., INT_MAX
...

// Determine temporary device storage requirements
void     *d_temp_storage = NULL;
size_t   temp_storage_bytes = 0;
cub::DeviceReduce::Reduce(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, num_items, min_op, init);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run reduction
cub::DeviceReduce::Reduce(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, num_items, min_op, init);

// d_out <-- [0]

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)

  • OutputIteratorT[inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)

  • ReductionOpT[inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)

  • T[inferred] Data element type that is convertible to the value type of InputIteratorT

  • NumItemsT[inferred] Type of num_items

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to the input sequence of data items

  • d_out[out] Pointer to the output aggregate

  • num_items[in] Total number of input items (i.e., length of d_in)

  • reduction_op[in] Binary reduction functor

  • init[in] Initial value of the reduction

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename ReductionOpT, typename T>
static inline cudaError_t Reduce(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, ReductionOpT reduction_op, T init, cudaStream_t stream, bool debug_synchronous)
template<typename InputIteratorT, typename OutputIteratorT, typename NumItemsT>
static inline cudaError_t Sum(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumItemsT num_items, cudaStream_t stream = 0)

Computes a device-wide sum using the addition (+) operator.

  • Uses 0 as the initial value of the reduction.

  • Does not support + operators that are non-commutative..

  • Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.

  • The range [d_in, d_in + num_items) shall not overlap d_out.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the sum-reduction of a device vector of int data elements.

#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for input and output
int  num_items;      // e.g., 7
int  *d_in;          // e.g., [8, 6, 7, 5, 3, 0, 9]
int  *d_out;         // e.g., [-]
...

// Determine temporary device storage requirements
void     *d_temp_storage = NULL;
size_t   temp_storage_bytes = 0;
cub::DeviceReduce::Sum(
  d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run sum-reduction
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// d_out <-- [38]

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)

  • OutputIteratorT[inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)

  • NumItemsT[inferred] Type of num_items

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to the input sequence of data items

  • d_out[out] Pointer to the output aggregate

  • num_items[in] Total number of input items (i.e., length of d_in)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT>
static inline cudaError_t Sum(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream, bool debug_synchronous)
template<typename InputIteratorT, typename OutputIteratorT, typename NumItemsT>
static inline cudaError_t Min(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumItemsT num_items, cudaStream_t stream = 0)

Computes a device-wide minimum using the less-than (<) operator.

  • Uses std::numeric_limits<T>::max() as the initial value of the reduction.

  • Does not support < operators that are non-commutative.

  • Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.

  • The range [d_in, d_in + num_items) shall not overlap d_out.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the min-reduction of a device vector of int data elements.

#include <cub/cub.cuh>
// or equivalently <cub/device/device_radix_sort.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for input and output
int  num_items;      // e.g., 7
int  *d_in;          // e.g., [8, 6, 7, 5, 3, 0, 9]
int  *d_out;         // e.g., [-]
...

// Determine temporary device storage requirements
void     *d_temp_storage = NULL;
size_t   temp_storage_bytes = 0;
cub::DeviceReduce::Min(
  d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run min-reduction
cub::DeviceReduce::Min(
  d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// d_out <-- [0]

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)

  • OutputIteratorT[inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)

  • NumItemsT[inferred] Type of num_items

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to the input sequence of data items

  • d_out[out] Pointer to the output aggregate

  • num_items[in] Total number of input items (i.e., length of d_in)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT>
static inline cudaError_t Min(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream, bool debug_synchronous)
template<typename InputIteratorT, typename OutputIteratorT>
static inline cudaError_t ArgMin(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream = 0)

Finds the first device-wide minimum using the less-than (<) operator, also returning the index of that item.

  • The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T)

    • The minimum is written to d_out.value and its offset in the input array is written to d_out.key.

    • The {1, std::numeric_limits<T>::max()} tuple is produced for zero-length inputs

  • Does not support < operators that are non-commutative.

  • Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.

  • The range [d_in, d_in + num_items) shall not overlap d_out.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the argmin-reduction of a device vector of int data elements.

#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for input and output
int                      num_items;      // e.g., 7
int                      *d_in;          // e.g., [8, 6, 7, 5, 3, 0, 9]
KeyValuePair<int, int>   *d_argmin;      // e.g., [{-,-}]
...

// Determine temporary device storage requirements
void     *d_temp_storage = NULL;
size_t   temp_storage_bytes = 0;
cub::DeviceReduce::ArgMin(d_temp_storage, temp_storage_bytes, d_in, d_argmin, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run argmin-reduction
cub::DeviceReduce::ArgMin(d_temp_storage, temp_storage_bytes, d_in, d_argmin, num_items);

// d_argmin <-- [{5, 0}]

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)

  • OutputIteratorT[inferred] Output iterator type for recording the reduced aggregate (having value type cub::KeyValuePair<int, T>) (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to the input sequence of data items

  • d_out[out] Pointer to the output aggregate

  • num_items[in] Total number of input items (i.e., length of d_in)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT>
static inline cudaError_t ArgMin(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream, bool debug_synchronous)
template<typename InputIteratorT, typename OutputIteratorT, typename NumItemsT>
static inline cudaError_t Max(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumItemsT num_items, cudaStream_t stream = 0)

Computes a device-wide maximum using the greater-than (>) operator.

  • Uses std::numeric_limits<T>::lowest() as the initial value of the reduction.

  • Does not support > operators that are non-commutative.

  • Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.

  • The range [d_in, d_in + num_items) shall not overlap d_out.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the max-reduction of a device vector of int data elements.

#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for input and output
int  num_items;      // e.g., 7
int  *d_in;          // e.g., [8, 6, 7, 5, 3, 0, 9]
int  *d_max;         // e.g., [-]
...

// Determine temporary device storage requirements
void     *d_temp_storage = NULL;
size_t   temp_storage_bytes = 0;
cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_max, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run max-reduction
cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_max, num_items);

// d_max <-- [9]

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)

  • OutputIteratorT[inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)

  • NumItemsT[inferred] Type of num_items

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to the input sequence of data items

  • d_out[out] Pointer to the output aggregate

  • num_items[in] Total number of input items (i.e., length of d_in)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT>
static inline cudaError_t Max(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream, bool debug_synchronous)
template<typename InputIteratorT, typename OutputIteratorT>
static inline cudaError_t ArgMax(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream = 0)

Finds the first device-wide maximum using the greater-than (>) operator, also returning the index of that item

  • The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T)

    • The maximum is written to d_out.value and its offset in the input array is written to d_out.key.

    • The {1, std::numeric_limits<T>::lowest()} tuple is produced for zero-length inputs

  • Does not support > operators that are non-commutative.

  • Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.

  • The range [d_in, d_in + num_items) shall not overlap d_out.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the argmax-reduction of a device vector of int data elements.

#include <cub/cub.cuh>
// or equivalently <cub/device/device_reduce.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for input and output
int                      num_items;      // e.g., 7
int                      *d_in;          // e.g., [8, 6, 7, 5, 3, 0, 9]
KeyValuePair<int, int>   *d_argmax;      // e.g., [{-,-}]
...

// Determine temporary device storage requirements
void     *d_temp_storage = NULL;
size_t   temp_storage_bytes = 0;
cub::DeviceReduce::ArgMax(
  d_temp_storage, temp_storage_bytes, d_in, d_argmax, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run argmax-reduction
cub::DeviceReduce::ArgMax(
  d_temp_storage, temp_storage_bytes, d_in, d_argmax, num_items);

// d_argmax <-- [{6, 9}]

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)

  • OutputIteratorT[inferred] Output iterator type for recording the reduced aggregate (having value type cub::KeyValuePair<int, T>) (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to the input sequence of data items

  • d_out[out] Pointer to the output aggregate

  • num_items[in] Total number of input items (i.e., length of d_in)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT>
static inline cudaError_t ArgMax(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, cudaStream_t stream, bool debug_synchronous)
template<typename InputIteratorT, typename OutputIteratorT, typename ReductionOpT, typename TransformOpT, typename T, typename NumItemsT>
static inline cudaError_t TransformReduce(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumItemsT num_items, ReductionOpT reduction_op, TransformOpT transform_op, T init, cudaStream_t stream = 0)

Fuses transform and reduce operations

  • Does not support binary reduction operators that are non-commutative.

  • Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.

  • The range [d_in, d_in + num_items) shall not overlap d_out.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates a user-defined min-reduction of a device vector of int data elements.

#include <cub/cub.cuh>
// or equivalently <cub/device/device_reduce.cuh>

thrust::device_vector<int> in = { 1, 2, 3, 4 };
thrust::device_vector<int> out(1);

std::size_t temp_storage_bytes = 0;
std::uint8_t *d_temp_storage = nullptr;

const int init = 42;

cub::DeviceReduce::TransformReduce(
  d_temp_storage,
  temp_storage_bytes,
  in.begin(),
  out.begin(),
  in.size(),
  cub::Sum{},
  square_t{},
  init);

thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = temp_storage.data().get();

cub::DeviceReduce::TransformReduce(
  d_temp_storage,
  temp_storage_bytes,
  in.begin(),
  out.begin(),
  in.size(),
  cub::Sum{},
  square_t{},
  init);

// out[0] <-- 72

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)

  • OutputIteratorT[inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)

  • ReductionOpT[inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)

  • TransformOpT[inferred] Unary reduction functor type having member auto operator()(const T &a)

  • T[inferred] Data element type that is convertible to the value type of InputIteratorT

  • NumItemsT[inferred] Type of num_items

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to the input sequence of data items

  • d_out[out] Pointer to the output aggregate

  • num_items[in] Total number of input items (i.e., length of d_in)

  • reduction_op[in] Binary reduction functor

  • transform_op[in] Unary transform functor

  • init[in] Initial value of the reduction

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename KeysInputIteratorT, typename UniqueOutputIteratorT, typename ValuesInputIteratorT, typename AggregatesOutputIteratorT, typename NumRunsOutputIteratorT, typename ReductionOpT, typename NumItemsT>
static inline cudaError_t ReduceByKey(void *d_temp_storage, size_t &temp_storage_bytes, KeysInputIteratorT d_keys_in, UniqueOutputIteratorT d_unique_out, ValuesInputIteratorT d_values_in, AggregatesOutputIteratorT d_aggregates_out, NumRunsOutputIteratorT d_num_runs_out, ReductionOpT reduction_op, NumItemsT num_items, cudaStream_t stream = 0)

Reduces segments of values, where segments are demarcated by corresponding runs of identical keys.

This operation computes segmented reductions within d_values_in using the specified binary reduction_op functor. The segments are identified by “runs” of corresponding keys in d_keys_in, where runs are maximal ranges of consecutive, identical keys. For the ith run encountered, the first key of the run and the corresponding value aggregate of that run are written to d_unique_out[i] and d_aggregates_out[i], respectively. The total number of runs encountered is written to d_num_runs_out.

  • The == equality operator is used to determine whether keys are equivalent

  • Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.

  • Let out be any of [d_unique_out, d_unique_out + *d_num_runs_out) [d_aggregates_out, d_aggregates_out + *d_num_runs_out) d_num_runs_out. The ranges represented by out shall not overlap [d_keys_in, d_keys_in + num_items), [d_values_in, d_values_in + num_items) nor out in any way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the segmented reduction of int values grouped by runs of associated int keys.

#include <cub/cub.cuh>
// or equivalently <cub/device/device_reduce.cuh>

// CustomMin functor
struct CustomMin
{
    template <typename T>
    CUB_RUNTIME_FUNCTION __forceinline__
    T operator()(const T &a, const T &b) const {
        return (b < a) ? b : a;
    }
};

// Declare, allocate, and initialize device-accessible pointers
// for input and output
int          num_items;          // e.g., 8
int          *d_keys_in;         // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int          *d_values_in;       // e.g., [0, 7, 1, 6, 2, 5, 3, 4]
int          *d_unique_out;      // e.g., [-, -, -, -, -, -, -, -]
int          *d_aggregates_out;  // e.g., [-, -, -, -, -, -, -, -]
int          *d_num_runs_out;    // e.g., [-]
CustomMin    reduction_op;
...

// Determine temporary device storage requirements
void     *d_temp_storage = NULL;
size_t   temp_storage_bytes = 0;
cub::DeviceReduce::ReduceByKey(
  d_temp_storage, temp_storage_bytes,
  d_keys_in, d_unique_out, d_values_in,
  d_aggregates_out, d_num_runs_out, reduction_op, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run reduce-by-key
cub::DeviceReduce::ReduceByKey(
  d_temp_storage, temp_storage_bytes,
  d_keys_in, d_unique_out, d_values_in,
  d_aggregates_out, d_num_runs_out, reduction_op, num_items);

// d_unique_out      <-- [0, 2, 9, 5, 8]
// d_aggregates_out  <-- [0, 1, 6, 2, 4]
// d_num_runs_out    <-- [5]

Template Parameters
  • KeysInputIteratorT[inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)

  • UniqueOutputIteratorT[inferred] Random-access output iterator type for writing unique output keys (may be a simple pointer type)

  • ValuesInputIteratorT[inferred] Random-access input iterator type for reading input values (may be a simple pointer type)

  • AggregatesOutputIterator[inferred] Random-access output iterator type for writing output value aggregates (may be a simple pointer type)

  • NumRunsOutputIteratorT[inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)

  • ReductionOpT[inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)

  • NumItemsT[inferred] Type of num_items

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_keys_in[in] Pointer to the input sequence of keys

  • d_unique_out[out] Pointer to the output sequence of unique keys (one key per run)

  • d_values_in[in] Pointer to the input sequence of corresponding values

  • d_aggregates_out[out] Pointer to the output sequence of value aggregates (one aggregate per run)

  • d_num_runs_out[out] Pointer to total number of runs encountered (i.e., the length of d_unique_out)

  • reduction_op[in] Binary reduction functor

  • num_items[in] Total number of associated key+value pairs (i.e., the length of d_in_keys and d_in_values)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename KeysInputIteratorT, typename UniqueOutputIteratorT, typename ValuesInputIteratorT, typename AggregatesOutputIteratorT, typename NumRunsOutputIteratorT, typename ReductionOpT>
static inline cudaError_t ReduceByKey(void *d_temp_storage, size_t &temp_storage_bytes, KeysInputIteratorT d_keys_in, UniqueOutputIteratorT d_unique_out, ValuesInputIteratorT d_values_in, AggregatesOutputIteratorT d_aggregates_out, NumRunsOutputIteratorT d_num_runs_out, ReductionOpT reduction_op, int num_items, cudaStream_t stream, bool debug_synchronous)