cub::DeviceSegmentedReduce

Defined in cub/device/device_segmented_reduce.cuh

struct DeviceSegmentedReduce

DeviceSegmentedReduce provides device-wide, parallel operations for computing a reduction across multiple sequences of data items residing within device-accessible memory.

Overview

A reduction (or fold) uses a binary combining operator to compute a single aggregate from a sequence of input elements.

Usage Considerations

Dynamic parallelism. DeviceSegmentedReduce methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

Public Static Functions

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename ReductionOpT, typename T> static inline cudaError_t Reduce(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, ReductionOpT reduction_op, T initial_value, cudaStream_t stream = 0)

Computes a device-wide segmented reduction using the specified binary reduction_op functor.

Does not support binary reduction operators that are non-commutative.
Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates a custom min-reduction of a device vector of int data elements.

int num_segments                  = 3;
c2h::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                 = thrust::raw_pointer_cast(d_offsets.data());
c2h::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
c2h::device_vector<int> d_out(3);
CustomMin min_op;
int initial_value{INT_MAX};

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::Reduce(
  d_temp_storage,
  temp_storage_bytes,
  d_in.begin(),
  d_out.begin(),
  num_segments,
  d_offsets_it,
  d_offsets_it + 1,
  min_op,
  initial_value);

c2h::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::Reduce(
  d_temp_storage,
  temp_storage_bytes,
  d_in.begin(),
  d_out.begin(),
  num_segments,
  d_offsets_it,
  d_offsets_it + 1,
  min_op,
  initial_value);

c2h::device_vector<int> expected{6, INT_MAX, 0};

Template Parameters

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
ReductionOpT – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
T – [inferred] Data element type that is convertible to the value type of InputIteratorT

Parameters

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the i^th data segment in d_keys_* and d_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the i^th data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the i^th is considered empty.
reduction_op – [in] Binary reduction functor
initial_value – [in] Initial value of the reduction for each segment
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename ReductionOpT, typename T> static inline cudaError_t Reduce(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, int segment_size, ReductionOpT reduction_op, T initial_value, cudaStream_t stream = 0)

Computes a device-wide segmented reduction using the specified binary reduction_op functor and a fixed segment size.

Does not support binary reduction operators that are non-commutative.
Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates a custom min-reduction of a device vector of int data elements.

int num_segments = 3;
int segment_size = 2;
c2h::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
c2h::device_vector<int> d_out(3);
CustomMin min_op;
int initial_value{INT_MAX};

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::Reduce(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, segment_size, min_op, initial_value);

c2h::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::Reduce(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, segment_size, min_op, initial_value);

c2h::device_vector<int> expected{6, 5, 0};

Template Parameters

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
ReductionOpT – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
T – [inferred] Data element type that is convertible to the value type of InputIteratorT

Parameters

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregates
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
reduction_op – [in] Binary reduction functor
initial_value – [in] Initial value of the reduction for each segment
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT> static inline cudaError_t Sum(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = 0)

Computes a device-wide segmented sum using the addition (+) operator.

Uses 0 as the initial value of the reduction for each segment.
When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support + operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the sum reduction of a device vector of int data elements.

int num_segments                  = 3;
c2h::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                 = thrust::raw_pointer_cast(d_offsets.data());
c2h::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
c2h::device_vector<int> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::Sum(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::Sum(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<int> expected{21, 0, 17};

Template Parameters

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length num_segments`, such that ``d_begin_offsets[i] is the first element of the i^th data segment in d_keys_* and d_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the i^th data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the i^th is considered empty.
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT> static inline cudaError_t Min(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = 0)

Computes a device-wide segmented minimum using the less-than (<) operator.

Uses ::cuda::std::numeric_limits<T>::max() as the initial value of the reduction for each segment.
When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support < operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the min-reduction of a device vector of int data elements.

struct CustomMin
{
  template <typename T>
  __device__ __forceinline__ T operator()(const T& a, const T& b) const
  {
    return (b < a) ? b : a;
  }
};

int num_segments                  = 3;
c2h::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                 = thrust::raw_pointer_cast(d_offsets.data());
c2h::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
c2h::device_vector<int> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::Min(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::Min(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<int> expected{6, INT_MAX, 0};

Template Parameters

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the i^th data segment in d_keys_* and d_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the i^th data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the i^th is considered empty.
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT> static inline cudaError_t ArgMin(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = 0)

Finds the first device-wide minimum in each segment using the less-than (<) operator, also returning the in-segment index of that item.

The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T)
- The minimum of the i^th segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.
- The {1, ::cuda::std::numeric_limits<T>::max()} tuple is produced for zero-length inputs
When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support < operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the argmin-reduction of a device vector of int data elements.

int num_segments                  = 3;
c2h::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                 = thrust::raw_pointer_cast(d_offsets.data());
c2h::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
c2h::device_vector<cub::KeyValuePair<int, int>> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::ArgMin(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::ArgMin(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<cub::KeyValuePair<int, int>> expected{{1, 6}, {1, INT_MAX}, {2, 0}};

Template Parameters

InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type KeyValuePair<int, T>) (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the i^th data segment in d_keys_* and d_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the i^th data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the i^th is considered empty.
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT> static inline cudaError_t Max(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = 0)

Computes a device-wide segmented maximum using the greater-than (>) operator.

Uses ::cuda::std::numeric_limits<T>::lowest() as the initial value of the reduction.
When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support > operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the max-reduction of a device vector of int data elements.

int num_segments                  = 3;
c2h::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                 = thrust::raw_pointer_cast(d_offsets.data());
c2h::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
c2h::device_vector<int> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::Max(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::Max(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<int> expected{8, INT_MIN, 9};

Template Parameters

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the i^th data segment in d_keys_* and d_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the i^th data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the i^th is considered empty.
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT> static inline cudaError_t ArgMax(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = 0)

Finds the first device-wide maximum in each segment using the greater-than (>) operator, also returning the in-segment index of that item

The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T)
- The maximum of the i^th segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.
- The {1, ::cuda::std::numeric_limits<T>::lowest()} tuple is produced for zero-length inputs
When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support > operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the argmax-reduction of a device vector of int data elements.

int num_segments                  = 3;
c2h::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                 = thrust::raw_pointer_cast(d_offsets.data());
c2h::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
c2h::device_vector<cub::KeyValuePair<int, int>> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::ArgMax(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::ArgMax(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<cub::KeyValuePair<int, int>> expected{{0, 8}, {1, INT_MIN}, {3, 9}};

Template Parameters

InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type KeyValuePair<int, T>) (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the i^th data segment in d_keys_* and d_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the i^th data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the i^th is considered empty.
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.