cub::DeviceSegmentedReduce
struct DeviceSegmentedReduce
DeviceSegmentedReduce provides device-wide, parallel operations for computing a reduction across multiple sequences of data items residing within device-accessible memory.
Overview
A reduction (or fold) uses a binary combining operator to compute a single aggregate from a sequence of input elements.
Usage Considerations
Dynamic parallelism. DeviceSegmentedReduce methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
Public Static Functions
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename ReductionOpT, typename T>
static inline cudaError_t Reduce(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    BeginOffsetIteratorT d_begin_offsets,
    EndOffsetIteratorT d_end_offsets,
    ReductionOpT reduction_op,
    T initial_value,
    cudaStream_t stream = nullptr)
Computes a device-wide segmented reduction using the specified binary reduction_op functor.
Added in version 2.2.0: First appears in CUDA Toolkit 12.3.
Does not support binary reduction operators that are non-commutative.
Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating-point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet
The code snippet below illustrates a custom min-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
ReductionOpT – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
T – [inferred] Data element type that is convertible to the value type of InputIteratorT
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the i-th data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the i-th data segment in d_in. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the i-th segment is considered empty.
reduction_op – [in] Binary reduction functor
initial_value – [in] Initial value of the reduction for each segment
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
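The offset parameters and the segment_offsets aliasing described for this overload can be modelled on the host. The sketch below is an assumption-laden reference in plain C++, not CUB's implementation: segmented_reduce_ref, CustomMin, and example_custom_min are hypothetical names, and the loop only mirrors the documented semantics of one aggregate per segment.

```cpp
#include <climits>
#include <cstdint>
#include <vector>

// Hypothetical custom binary functor for a min-reduction.
struct CustomMin
{
  template <typename T>
  T operator()(const T& a, const T& b) const
  {
    return (b < a) ? b : a;
  }
};

// Hypothetical host-side reference (NOT a CUB function): folds the items
// in[begin_offsets[s]] .. in[end_offsets[s] - 1] of every segment s,
// starting from initial_value, and writes one aggregate to out[s].
template <typename T, typename OffsetT, typename ReductionOpT>
void segmented_reduce_ref(const T* in, T* out, std::int64_t num_segments,
                          const OffsetT* begin_offsets, const OffsetT* end_offsets,
                          ReductionOpT reduction_op, T initial_value)
{
  for (std::int64_t s = 0; s < num_segments; ++s)
  {
    T aggregate = initial_value;
    for (OffsetT i = begin_offsets[s]; i < end_offsets[s]; ++i)
    {
      aggregate = reduction_op(aggregate, in[i]);
    }
    out[s] = aggregate;
  }
}

// Three contiguous segments described by ONE offsets array of length
// num_segments + 1, aliased as begin (offsets) and end (offsets + 1).
inline std::vector<int> example_custom_min()
{
  const std::vector<int> in      = {8, 6, 7, 5, 3, 0, 9};
  const std::vector<int> offsets = {0, 3, 3, 7}; // the middle segment is empty
  std::vector<int> out(3);
  segmented_reduce_ref(in.data(), out.data(), 3, offsets.data(), offsets.data() + 1,
                       CustomMin{}, INT_MAX);
  return out;
}
```

In this sketch the empty middle segment simply returns the initial value, so example_custom_min() yields 6, INT_MAX, and 0.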
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename ReductionOpT, typename T, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t Reduce(
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    BeginOffsetIteratorT d_begin_offsets,
    EndOffsetIteratorT d_end_offsets,
    ReductionOpT reduction_op,
    T initial_value,
    EnvT env = {})
Computes a device-wide segmented reduction using the specified binary reduction_op functor.
Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
Does not support binary reduction operators that are non-commutative.
Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating-point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
A specific stream or CUDA memory resource can be supplied through the env parameter.
Snippet
The code snippet below illustrates a custom min-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
ReductionOpT – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
T – [inferred] Data element type that is convertible to the value type of InputIteratorT
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.
- Parameters:
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the i-th data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the i-th data segment in d_in. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the i-th segment is considered empty.
reduction_op – [in] Binary reduction functor
initial_value – [in] Initial value of the reduction for each segment
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.
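The determinism note for this overload rests on floating-point addition being only pseudo-associative. The host-side sketch below (plain C++, assuming IEEE-754 single precision evaluated as float) shows that changing only the grouping of a sum can change its result. The function sum_in_tiles and its tile_size parameter are hypothetical stand-ins for "different tile-sizing" and say nothing about how CUB actually tiles its work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sums x in consecutive tiles of tile_size items,
// then adds the per-tile partial sums in order.
inline float sum_in_tiles(const std::vector<float>& x, std::size_t tile_size)
{
  assert(tile_size > 0);
  float total = 0.0f;
  for (std::size_t start = 0; start < x.size(); start += tile_size)
  {
    const std::size_t stop = std::min(start + tile_size, x.size());
    float partial = 0.0f;
    for (std::size_t i = start; i < stop; ++i)
    {
      partial += x[i];
    }
    total += partial;
  }
  return total;
}
```

For the input {1.0e8f, 1.0f, -1.0e8f, 1.0f}, one tile of four items gives 1.0f, while two tiles of two items give 0.0f, because each 1.0f is absorbed when added to a value of magnitude 1.0e8f.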
template<typename InputIteratorT, typename OutputIteratorT, typename ReductionOpT, typename T>
static inline cudaError_t Reduce(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    int segment_size,
    ReductionOpT reduction_op,
    T initial_value,
    cudaStream_t stream = nullptr)
Computes a device-wide segmented reduction using the specified binary reduction_op functor and a fixed segment size.
Added in version 3.2.0: First appears in CUDA Toolkit 13.2.
Does not support binary reduction operators that are non-commutative.
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet
The code snippet below illustrates a custom min-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
ReductionOpT – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
T – [inferred] Data element type that is convertible to the value type of InputIteratorT
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregates
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
reduction_op – [in] Binary reduction functor
initial_value – [in] Initial value of the reduction for each segment
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
template<typename InputIteratorT, typename OutputIteratorT, typename ReductionOpT, typename T, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t Reduce(
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    int segment_size,
    ReductionOpT reduction_op,
    T initial_value,
    EnvT env = {})
Computes a device-wide segmented reduction using the specified binary reduction_op functor and a fixed segment size.
Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
Does not support binary reduction operators that are non-commutative.
Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating-point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
A specific stream or CUDA memory resource can be supplied through the env parameter.
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet
    int num_segments = 2;
    int segment_size = 3;
    thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0};
    thrust::device_vector<int> d_out(2);

    cuda::stream stream{cuda::devices[0]};
    cuda::stream_ref stream_ref{stream};
    auto env = ::cuda::std::execution::env{stream_ref};

    auto error = cub::DeviceSegmentedReduce::Reduce(
      d_in.begin(), d_out.begin(), num_segments, segment_size, ::cuda::std::plus<>{}, 0, env);
    thrust::device_vector<int> expected{21, 8};

    if (error != cudaSuccess)
    {
      std::cerr << "cub::DeviceSegmentedReduce::Reduce (fixed-size) failed with status: " << error << '\n';
    }
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
ReductionOpT – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
T – [inferred] Data element type that is convertible to the value type of InputIteratorT
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.
- Parameters:
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregates
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
reduction_op – [in] Binary reduction functor
initial_value – [in] Initial value of the reduction for each segment
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename = ::cuda::std::void_t<typename ::cuda::std::iterator_traits<BeginOffsetIteratorT>::value_type, typename ::cuda::std::iterator_traits<EndOffsetIteratorT>::value_type>>
static inline cudaError_t Sum(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    BeginOffsetIteratorT d_begin_offsets,
    EndOffsetIteratorT d_end_offsets,
    cudaStream_t stream = nullptr)
Computes a device-wide segmented sum using the addition (+) operator.
Added in version 2.2.0: First appears in CUDA Toolkit 12.3.
Uses 0 as the initial value of the reduction for each segment.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support + operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet
The code snippet below illustrates the sum reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the i-th data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the i-th data segment in d_in. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the i-th segment is considered empty.
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
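The d_temp_storage / temp_storage_bytes pair implies a two-call protocol: first a size query with nullptr, then the real call with an allocation of at least that size. The host-side mock below sketches only that calling convention. mock_segmented_sum is a hypothetical stand-in, not a CUB function: its int return code and its "one int of scratch per segment" requirement are invented for the example, and a real call would return cudaError_t and use device-accessible memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mock of the calling convention (NOT a CUB function).
// Returns 0 on success and 1 if the supplied allocation is too small.
inline int mock_segmented_sum(void* d_temp_storage, std::size_t& temp_storage_bytes,
                              const int* d_in, int* d_out, std::int64_t num_segments,
                              const int* d_begin_offsets, const int* d_end_offsets)
{
  // Invented requirement for this mock: one int of scratch per segment.
  const std::size_t required = static_cast<std::size_t>(num_segments) * sizeof(int);
  if (d_temp_storage == nullptr)
  {
    temp_storage_bytes = required; // size query only, no work is done
    return 0;
  }
  if (temp_storage_bytes < required)
  {
    return 1;
  }
  int* scratch = static_cast<int*>(d_temp_storage);
  for (std::int64_t s = 0; s < num_segments; ++s)
  {
    scratch[s] = 0; // the sum starts from 0 for every segment
    for (int i = d_begin_offsets[s]; i < d_end_offsets[s]; ++i)
    {
      scratch[s] += d_in[i];
    }
    d_out[s] = scratch[s];
  }
  return 0;
}

inline std::vector<int> example_two_phase_sum()
{
  const std::vector<int> in      = {8, 6, 7, 5, 3, 0, 9};
  const std::vector<int> offsets = {0, 3, 3, 7};
  std::vector<int> out(3, -1);

  // Call 1: query the required temporary storage size.
  std::size_t temp_storage_bytes = 0;
  mock_segmented_sum(nullptr, temp_storage_bytes, in.data(), out.data(), 3,
                     offsets.data(), offsets.data() + 1);

  // Allocate at least temp_storage_bytes (rounded up to whole ints here).
  std::vector<int> temp((temp_storage_bytes + sizeof(int) - 1) / sizeof(int));

  // Call 2: same arguments, now with storage, performs the reduction.
  mock_segmented_sum(temp.data(), temp_storage_bytes, in.data(), out.data(), 3,
                     offsets.data(), offsets.data() + 1);
  return out;
}
```

In this mock, example_two_phase_sum() returns {21, 0, 17}; the empty middle segment yields the initial value 0.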
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t Sum(
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    BeginOffsetIteratorT d_begin_offsets,
    EndOffsetIteratorT d_end_offsets,
    EnvT env = {})
Computes a device-wide segmented sum using the addition (+) operator.
Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
Uses 0 as the initial value of the reduction for each segment.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support + operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
A specific stream or CUDA memory resource can be supplied through the env parameter.
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet
The code snippet below illustrates the sum reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.
- Parameters:
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the i-th data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the i-th data segment in d_in. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the i-th segment is considered empty.
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.
template<typename InputIteratorT, typename OutputIteratorT, ::cuda::std::enable_if_t<!::cuda::std::is_same_v<InputIteratorT, void*>, int> = 0>
static inline cudaError_t Sum(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    int segment_size,
    cudaStream_t stream = nullptr)
Computes a device-wide segmented sum using the addition (+) operator.
Added in version 3.2.0: First appears in CUDA Toolkit 13.2.
Uses 0 as the initial value of the reduction for each segment.
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet
The code snippet below illustrates the sum reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
template<typename InputIteratorT, typename OutputIteratorT, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t Sum(
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    int segment_size,
    EnvT env = {})
Computes a device-wide segmented sum using the addition (+) operator and a fixed segment size.
Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
Uses 0 as the initial value of the reduction for each segment.
A specific stream or CUDA memory resource can be supplied through the env parameter.
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet
    int num_segments = 2;
    int segment_size = 3;
    thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0};
    thrust::device_vector<int> d_out(2);

    cuda::stream stream{cuda::devices[0]};
    cuda::stream_ref stream_ref{stream};
    auto env = ::cuda::std::execution::env{stream_ref};

    auto error = cub::DeviceSegmentedReduce::Sum(d_in.begin(), d_out.begin(), num_segments, segment_size, env);
    thrust::device_vector<int> expected{21, 8};

    if (error != cudaSuccess)
    {
      std::cerr << "cub::DeviceSegmentedReduce::Sum (fixed-size) failed with status: " << error << '\n';
    }
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.
- Parameters:
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename = ::cuda::std::void_t<typename ::cuda::std::iterator_traits<BeginOffsetIteratorT>::value_type, typename ::cuda::std::iterator_traits<EndOffsetIteratorT>::value_type>>
static inline cudaError_t Min(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    BeginOffsetIteratorT d_begin_offsets,
    EndOffsetIteratorT d_end_offsets,
    cudaStream_t stream = nullptr)
Computes a device-wide segmented minimum using the less-than (<) operator.
Added in version 2.2.0: First appears in CUDA Toolkit 12.3.
Uses ::cuda::std::numeric_limits<T>::max() as the initial value of the reduction for each segment.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support < operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet
The code snippet below illustrates the min-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the i-th data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the i-th data segment in d_in. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the i-th segment is considered empty.
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
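Because the initial value of Min is the largest value of the element type, it depends on T, and a segment with no items produces that largest value. The host-side sketch below illustrates this with std::numeric_limits. segmented_min_ref is a hypothetical reference function, not part of CUB, and it takes a single offsets vector of length num_segments + 1 for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical host-side reference (NOT a CUB function): segment s covers
// in[offsets[s]] .. in[offsets[s + 1] - 1].
template <typename T>
std::vector<T> segmented_min_ref(const std::vector<T>& in, const std::vector<int>& offsets)
{
  assert(!offsets.empty());
  const std::size_t num_segments = offsets.size() - 1;
  std::vector<T> out(num_segments);
  for (std::size_t s = 0; s < num_segments; ++s)
  {
    // Initial value: the largest finite value representable in T.
    T result = std::numeric_limits<T>::max();
    for (int i = offsets[s]; i < offsets[s + 1]; ++i)
    {
      const T item = in[static_cast<std::size_t>(i)];
      if (item < result)
      {
        result = item;
      }
    }
    out[s] = result;
  }
  return out;
}
```

For int input an empty segment yields INT_MAX in this sketch; for double input it yields the largest finite double rather than infinity.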
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t Min(
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    BeginOffsetIteratorT d_begin_offsets,
    EndOffsetIteratorT d_end_offsets,
    EnvT env = {})
Computes a device-wide segmented minimum using the less-than (<) operator.
Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
Uses ::cuda::std::numeric_limits<T>::max() as the initial value of the reduction for each segment.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support < operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
A specific stream or CUDA memory resource can be supplied through the env parameter.
Snippet
The code snippet below illustrates the min-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.
- Parameters:
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the i-th data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the i-th data segment in d_in. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the i-th segment is considered empty.
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.
template<typename InputIteratorT, typename OutputIteratorT>
static inline cudaError_t Min(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    int segment_size,
    cudaStream_t stream = nullptr)
Computes a device-wide segmented minimum using the less-than (<) operator.
Added in version 3.2.0: First appears in CUDA Toolkit 13.2.
Uses ::cuda::std::numeric_limits<T>::max() as the initial value of the reduction for each segment.
Snippet
The code snippet below illustrates the min-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
template<typename InputIteratorT, typename OutputIteratorT, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t Min(
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    int segment_size,
    EnvT env = {})
Computes a device-wide segmented minimum using the less-than (<) operator and a fixed segment size.
Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
Uses ::cuda::std::numeric_limits<T>::max() as the initial value of the reduction for each segment.
A specific stream or CUDA memory resource can be supplied through the env parameter.
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet
    int num_segments = 2;
    int segment_size = 3;
    thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0};
    thrust::device_vector<int> d_out(2);

    cuda::stream stream{cuda::devices[0]};
    cuda::stream_ref stream_ref{stream};
    auto env = ::cuda::std::execution::env{stream_ref};

    auto error = cub::DeviceSegmentedReduce::Min(d_in.begin(), d_out.begin(), num_segments, segment_size, env);
    thrust::device_vector<int> expected{6, 0};

    if (error != cudaSuccess)
    {
      std::cerr << "cub::DeviceSegmentedReduce::Min (fixed-size) failed with status: " << error << '\n';
    }
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.
- Parameters:
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename = ::cuda::std::void_t<typename ::cuda::std::iterator_traits<BeginOffsetIteratorT>::value_type, typename ::cuda::std::iterator_traits<EndOffsetIteratorT>::value_type>>
static inline cudaError_t ArgMin(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    BeginOffsetIteratorT d_begin_offsets,
    EndOffsetIteratorT d_end_offsets,
    cudaStream_t stream = nullptr)
Finds the first device-wide minimum in each segment using the less-than (<) operator, also returning the in-segment index of that item.
Added in version 2.2.0: First appears in CUDA Toolkit 12.3.
The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T).
The minimum of the i-th segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.
The {1, ::cuda::std::numeric_limits<T>::max()} tuple is produced for zero-length inputs.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support < operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet#
The code snippet below illustrates the argmin-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type KeyValuePair<int, T>) (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_end_offsets[i] - 1 < d_begin_offsets[i], the ith segment is considered empty.
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
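As a rough host-side illustration of the semantics described above (not CUB's implementation), the sketch below computes a segmented argmin sequentially on the CPU. KeyValue and reference_segmented_argmin are hypothetical names introduced here; KeyValue merely stands in for cub::KeyValuePair<int, T>, and a segment is assumed to be empty when its end offset does not exceed its begin offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for cub::KeyValuePair<int, T>:
// key = in-segment index of the minimum, value = the minimum itself.
template <typename T>
struct KeyValue
{
  int key;
  T value;
};

// Sequential host model of segmented ArgMin (assumption: illustrative only).
template <typename T>
std::vector<KeyValue<T>> reference_segmented_argmin(
  const std::vector<T>& in, const int* begin_offsets, const int* end_offsets, std::int64_t num_segments)
{
  assert(num_segments >= 0);
  std::vector<KeyValue<T>> out;
  out.reserve(static_cast<std::size_t>(num_segments));
  for (std::int64_t s = 0; s < num_segments; ++s)
  {
    // Result for a zero-length segment: the documented {1, max()} tuple.
    KeyValue<T> best{1, std::numeric_limits<T>::max()};
    for (int i = begin_offsets[s]; i < end_offsets[s]; ++i)
    {
      const T& item = in[static_cast<std::size_t>(i)];
      // The first element always replaces the sentinel; afterwards a strict '<'
      // keeps the first minimum when several items compare equal.
      if (i == begin_offsets[s] || item < best.value)
      {
        best = KeyValue<T>{i - begin_offsets[s], item};
      }
    }
    out.push_back(best);
  }
  return out;
}
```

For example, with the input {8, 6, 7, 5, 3, 0, 9} and a single offsets array {0, 3, 3, 7} aliased for both offset parameters (its data pointer as the begin offsets, the same pointer plus one as the end offsets), this model yields {1, 6}, {1, INT_MAX}, {2, 0}. The middle segment is empty and receives the sentinel, and the last key is relative to the start of its segment rather than to d_in.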
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t ArgMin( - InputIteratorT d_in,
- OutputIteratorT d_out,
- ::cuda::std::int64_t num_segments,
- BeginOffsetIteratorT d_begin_offsets,
- EndOffsetIteratorT d_end_offsets,
- EnvT env = {}
Finds the first device-wide minimum in each segment using the less-than (<) operator, also returning the in-segment index of that item.
Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T).
The minimum of the ith segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.
The {1, std::numeric_limits<T>::max()} tuple is produced for zero-length inputs.
Does not support < operators that are non-commutative.
Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Can use a specific stream or CUDA memory resource through the env parameter.
Snippet#
The code snippet below illustrates the argmin-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type cub::KeyValuePair<int, T>) (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.
- Parameters:
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_end_offsets[i] - 1 < d_begin_offsets[i], the ith segment is considered empty.
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.
-
template<typename InputIteratorT, typename OutputIteratorT>
static inline cudaError_t ArgMin( - void *d_temp_storage,
- size_t &temp_storage_bytes,
- InputIteratorT d_in,
- OutputIteratorT d_out,
- ::cuda::std::int64_t num_segments,
- int segment_size,
- cudaStream_t stream = nullptr
Finds the first device-wide minimum in each segment using the less-than (<) operator, also returning the in-segment index of that item.
Added in version 3.2.0: First appears in CUDA Toolkit 13.2.
The output value type of d_out is ::cuda::std::pair<int, T> (assuming the value type of d_in is T).
The minimum of the ith segment is written to d_out[i].second and its offset in that segment is written to d_out[i].first.
The {1, ::cuda::std::numeric_limits<T>::max()} tuple is produced for zero-length inputs.
Snippet#
The code snippet below illustrates the argmin-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type cuda::std::pair<int, T>) (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t ArgMin( - InputIteratorT d_in,
- OutputIteratorT d_out,
- ::cuda::std::int64_t num_segments,
- int segment_size,
- EnvT env = {}
Finds the first device-wide minimum in each segment using the less-than (<) operator, also returning the in-segment index of that item, with a fixed segment size.
Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
The output value type of d_out is ::cuda::std::pair<int, T> (assuming the value type of d_in is T).
The minimum of the ith segment is written to d_out[i].second and its offset in that segment is written to d_out[i].first.
The {1, ::cuda::std::numeric_limits<T>::max()} tuple is produced for zero-length inputs.
Can use a specific stream or CUDA memory resource through the env parameter.
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet#
int num_segments = 2;
int segment_size = 3;
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0};
thrust::device_vector<cuda::std::pair<int, int>> d_out(2);

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};
auto env = ::cuda::std::execution::env{stream_ref};

auto error = cub::DeviceSegmentedReduce::ArgMin(d_in.begin(), d_out.begin(), num_segments, segment_size, env);
if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceSegmentedReduce::ArgMin (fixed-size) failed with status: " << error << '\n';
}
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type cuda::std::pair<int, T>) (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.
- Parameters:
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.
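The fixed-size ArgMin snippet for this overload does not state the values it expects in d_out. A minimal host-side sketch of the documented semantics can be used to work them out by hand. Here reference_fixed_size_argmin is a hypothetical helper, std::pair stands in for cuda::std::pair, and the code is a sequential model rather than the CUB implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Sequential host model of fixed-size segmented ArgMin (assumption: illustrative only).
// Segment s covers in[s * segment_size, (s + 1) * segment_size).
template <typename T>
std::vector<std::pair<int, T>> reference_fixed_size_argmin(
  const std::vector<T>& in, std::int64_t num_segments, int segment_size)
{
  assert(num_segments >= 0 && segment_size >= 0);
  assert(in.size() >= static_cast<std::size_t>(num_segments) * static_cast<std::size_t>(segment_size));
  std::vector<std::pair<int, T>> out;
  out.reserve(static_cast<std::size_t>(num_segments));
  for (std::int64_t s = 0; s < num_segments; ++s)
  {
    const std::size_t base = static_cast<std::size_t>(s) * static_cast<std::size_t>(segment_size);
    // first = in-segment index, second = minimum; {1, max()} if segment_size is 0.
    std::pair<int, T> best{1, std::numeric_limits<T>::max()};
    for (int i = 0; i < segment_size; ++i)
    {
      const T& item = in[base + static_cast<std::size_t>(i)];
      if (i == 0 || item < best.second)
      {
        best = std::make_pair(i, item);
      }
    }
    out.push_back(best);
  }
  return out;
}
```

Applied to the snippet's input {8, 6, 7, 5, 3, 0} with num_segments = 2 and segment_size = 3, this model gives {1, 6} for the first segment and {2, 0} for the second.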
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename = ::cuda::std::void_t<typename ::cuda::std::iterator_traits<BeginOffsetIteratorT>::value_type, typename ::cuda::std::iterator_traits<EndOffsetIteratorT>::value_type>>
static inline cudaError_t Max( - void *d_temp_storage,
- size_t &temp_storage_bytes,
- InputIteratorT d_in,
- OutputIteratorT d_out,
- ::cuda::std::int64_t num_segments,
- BeginOffsetIteratorT d_begin_offsets,
- EndOffsetIteratorT d_end_offsets,
- cudaStream_t stream = nullptr
Computes a device-wide segmented maximum using the greater-than (>) operator.
Added in version 2.2.0: First appears in CUDA Toolkit 12.3.
Uses ::cuda::std::numeric_limits<T>::lowest() as the initial value of the reduction.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support > operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet#
The code snippet below illustrates the max-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_end_offsets[i] - 1 < d_begin_offsets[i], the ith segment is considered empty.
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
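To make the role of the initial value concrete, here is a small host-side sketch of a segmented maximum, assuming separate begin and end offset arrays. reference_segmented_max is a hypothetical name, and the loop is a sequential model of the documented behaviour, not the device algorithm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Sequential host model of segmented Max (assumption: illustrative only).
template <typename T>
std::vector<T> reference_segmented_max(
  const std::vector<T>& in, const std::vector<int>& begin_offsets, const std::vector<int>& end_offsets)
{
  assert(begin_offsets.size() == end_offsets.size());
  std::vector<T> out;
  out.reserve(begin_offsets.size());
  for (std::size_t s = 0; s < begin_offsets.size(); ++s)
  {
    // Every segment starts from lowest(), so an empty segment reports lowest().
    T acc = std::numeric_limits<T>::lowest();
    for (int i = begin_offsets[s]; i < end_offsets[s]; ++i)
    {
      acc = std::max(acc, in[static_cast<std::size_t>(i)]);
    }
    out.push_back(acc);
  }
  return out;
}
```

With the input {8, 6, 7, 5, 3, 0, 9}, begin offsets {0, 3, 5} and end offsets {3, 3, 7}, the model returns 8, the lowest int value, and 9. The second segment is empty, and the items at positions 3 and 4 belong to no segment because the two offset arrays are not aliased here.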
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t Max( - InputIteratorT d_in,
- OutputIteratorT d_out,
- ::cuda::std::int64_t num_segments,
- BeginOffsetIteratorT d_begin_offsets,
- EndOffsetIteratorT d_end_offsets,
- EnvT env = {}
Computes a device-wide segmented maximum using the greater-than (>) operator.
Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
Uses ::cuda::std::numeric_limits<T>::lowest() as the initial value of the reduction.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support > operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Can use a specific stream or CUDA memory resource through the env parameter.
Snippet#
The code snippet below illustrates the max-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.
- Parameters:
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_end_offsets[i] - 1 < d_begin_offsets[i], the ith segment is considered empty.
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.
-
template<typename InputIteratorT, typename OutputIteratorT>
static inline cudaError_t Max( - void *d_temp_storage,
- size_t &temp_storage_bytes,
- InputIteratorT d_in,
- OutputIteratorT d_out,
- ::cuda::std::int64_t num_segments,
- int segment_size,
- cudaStream_t stream = nullptr
Computes a device-wide segmented maximum using the greater-than (>) operator.
Added in version 3.2.0: First appears in CUDA Toolkit 13.2.
Uses ::cuda::std::numeric_limits<T>::lowest() as the initial value of the reduction.
Snippet#
The code snippet below illustrates the max-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t Max( - InputIteratorT d_in,
- OutputIteratorT d_out,
- ::cuda::std::int64_t num_segments,
- int segment_size,
- EnvT env = {}
Computes a device-wide segmented maximum using the greater-than (>) operator and a fixed segment size.
Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
Uses ::cuda::std::numeric_limits<T>::lowest() as the initial value of the reduction.
Can use a specific stream or CUDA memory resource through the env parameter.
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet#
int num_segments = 2;
int segment_size = 3;
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0};
thrust::device_vector<int> d_out(2);

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};
auto env = ::cuda::std::execution::env{stream_ref};

auto error = cub::DeviceSegmentedReduce::Max(d_in.begin(), d_out.begin(), num_segments, segment_size, env);
thrust::device_vector<int> expected{8, 5};
if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceSegmentedReduce::Max (fixed-size) failed with status: " << error << '\n';
}
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.
- Parameters:
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename = ::cuda::std::void_t<typename ::cuda::std::iterator_traits<BeginOffsetIteratorT>::value_type, typename ::cuda::std::iterator_traits<EndOffsetIteratorT>::value_type>>
static inline cudaError_t ArgMax( - void *d_temp_storage,
- size_t &temp_storage_bytes,
- InputIteratorT d_in,
- OutputIteratorT d_out,
- ::cuda::std::int64_t num_segments,
- BeginOffsetIteratorT d_begin_offsets,
- EndOffsetIteratorT d_end_offsets,
- cudaStream_t stream = nullptr
Finds the first device-wide maximum in each segment using the greater-than (>) operator, also returning the in-segment index of that item.
Added in version 2.2.0: First appears in CUDA Toolkit 12.3.
The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T).
The maximum of the ith segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.
The {1, ::cuda::std::numeric_limits<T>::lowest()} tuple is produced for zero-length inputs.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support > operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet#
The code snippet below illustrates the argmax-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type KeyValuePair<int, T>) (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_end_offsets[i] - 1 < d_begin_offsets[i], the ith segment is considered empty.
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
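The phrase "first device-wide maximum" matters when a segment contains repeated maxima. The host-side sketch below models that tie-breaking rule under the assumption that a later item replaces the current best only if it compares strictly greater. KeyValue and reference_segmented_argmax are hypothetical names, with KeyValue standing in for cub::KeyValuePair<int, T>; this is not the CUB implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for cub::KeyValuePair<int, T>.
template <typename T>
struct KeyValue
{
  int key;   // in-segment index of the maximum
  T value;   // the maximum itself
};

// Sequential host model of segmented ArgMax over contiguous segments described by
// one offsets array of length num_segments + 1 (assumption: illustrative only).
template <typename T>
std::vector<KeyValue<T>> reference_segmented_argmax(const std::vector<T>& in, const std::vector<int>& segment_offsets)
{
  assert(!segment_offsets.empty());
  std::vector<KeyValue<T>> out;
  for (std::size_t s = 0; s + 1 < segment_offsets.size(); ++s)
  {
    const int begin = segment_offsets[s];
    const int end   = segment_offsets[s + 1];
    // Result for a zero-length segment: the documented {1, lowest()} tuple.
    KeyValue<T> best{1, std::numeric_limits<T>::lowest()};
    for (int i = begin; i < end; ++i)
    {
      const T& item = in[static_cast<std::size_t>(i)];
      // Strict '>' means an equal later item never displaces the earlier one.
      if (i == begin || item > best.value)
      {
        best = KeyValue<T>{i - begin, item};
      }
    }
    out.push_back(best);
  }
  return out;
}
```

For the input {1, 9, 9, -4, -4} with offsets {0, 3, 5, 5}, the model reports {1, 9} for the first segment (the earlier of the two 9s), {0, -4} for the second, and {1, lowest} for the empty third segment.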
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t ArgMax( - InputIteratorT d_in,
- OutputIteratorT d_out,
- ::cuda::std::int64_t num_segments,
- BeginOffsetIteratorT d_begin_offsets,
- EndOffsetIteratorT d_end_offsets,
- EnvT env = {}
Finds the first device-wide maximum in each segment using the greater-than (>) operator, also returning the in-segment index of that item.
Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T).
The maximum of the ith segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.
The {1, ::cuda::std::numeric_limits<T>::lowest()} tuple is produced for zero-length inputs.
Does not support > operators that are non-commutative.
Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Can use a specific stream or CUDA memory resource through the env parameter.
Snippet#
The code snippet below illustrates the argmax-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type cub::KeyValuePair<int, T>) (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.
- Parameters:
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_end_offsets[i] - 1 < d_begin_offsets[i], the ith segment is considered empty.
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.
-
template<typename InputIteratorT, typename OutputIteratorT>
static inline cudaError_t ArgMax( - void *d_temp_storage,
- size_t &temp_storage_bytes,
- InputIteratorT d_in,
- OutputIteratorT d_out,
- ::cuda::std::int64_t num_segments,
- int segment_size,
- cudaStream_t stream = nullptr
Finds the first device-wide maximum in each segment using the greater-than (>) operator, also returning the in-segment index of that item.
Added in version 3.2.0: First appears in CUDA Toolkit 13.2.
The output value type of d_out is ::cuda::std::pair<int, T> (assuming the value type of d_in is T).
The maximum of the ith segment is written to d_out[i].second and its offset in that segment is written to d_out[i].first.
The {1, ::cuda::std::numeric_limits<T>::lowest()} tuple is produced for zero-length inputs.
Snippet#
The code snippet below illustrates the argmax-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type cuda::std::pair<int, T>) (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t ArgMax( - InputIteratorT d_in,
- OutputIteratorT d_out,
- ::cuda::std::int64_t num_segments,
- int segment_size,
- EnvT env = {}
Finds the first device-wide maximum in each segment using the greater-than (>) operator, also returning the in-segment index of that item, with a fixed segment size.
Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
The output value type of d_out is cuda::std::pair<int, T> (assuming the value type of d_in is T).
The maximum of the ith segment is written to d_out[i].second and its offset in that segment is written to d_out[i].first.
The {1, ::cuda::std::numeric_limits<T>::lowest()} tuple is produced for zero-length inputs.
Can use a specific stream or CUDA memory resource through the env parameter.
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet#
int num_segments = 2;
int segment_size = 3;
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0};
thrust::device_vector<cuda::std::pair<int, int>> d_out(2);

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};
auto env = ::cuda::std::execution::env{stream_ref};

auto error = cub::DeviceSegmentedReduce::ArgMax(d_in.begin(), d_out.begin(), num_segments, segment_size, env);
if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceSegmentedReduce::ArgMax (fixed-size) failed with status: " << error << '\n';
}
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type cuda::std::pair<int, T>) (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.
- Parameters:
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.