cub::DeviceSegmentedReduce
struct DeviceSegmentedReduce
DeviceSegmentedReduce provides device-wide, parallel operations for computing a reduction across multiple sequences of data items residing within device-accessible memory.
Overview
A reduction (or fold) uses a binary combining operator to compute a single aggregate from a sequence of input elements.
Usage Considerations
Dynamic parallelism. DeviceSegmentedReduce methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
Public Static Functions
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename ReductionOpT, typename T>
static inline cudaError_t Reduce(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    BeginOffsetIteratorT d_begin_offsets,
    EndOffsetIteratorT d_end_offsets,
    ReductionOpT reduction_op,
    T initial_value,
    cudaStream_t stream = 0)
Computes a device-wide segmented reduction using the specified binary reduction_op functor.
Added in version 2.2.0: First appears in CUDA Toolkit 12.3.
Does not support binary reduction operators that are non-commutative.
Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating-point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet
The code snippet below illustrates a custom min-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
ReductionOpT – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
T – [inferred] Data element type that is convertible to the value type of InputIteratorT
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_end_offsets[i] - 1 < d_begin_offsets[i], the ith segment is considered empty.
reduction_op – [in] Binary reduction functor
initial_value – [in] Initial value of the reduction for each segment
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
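As a host-side illustration of the semantics described for this overload, the sketch below reduces each segment [d_begin_offsets[s], d_end_offsets[s]) with a user-supplied functor, starting from initial_value. It is a sequential CPU reference model, not CUB's implementation and not a CUB call; the names segmented_reduce_ref and CustomMin are hypothetical, and the sample data are made up for illustration.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical functor for illustration: returns the smaller of two values.
struct CustomMin
{
  template <typename T>
  T operator()(const T& a, const T& b) const
  {
    return (b < a) ? b : a;
  }
};

// Sequential CPU model of a segmented reduction (NOT the CUB implementation).
// Segment s covers in[begin_offsets[s]] .. in[end_offsets[s] - 1].
template <typename T, typename OffsetIt, typename ReductionOpT>
std::vector<T> segmented_reduce_ref(const std::vector<T>& in,
                                    std::int64_t num_segments,
                                    OffsetIt begin_offsets,
                                    OffsetIt end_offsets,
                                    ReductionOpT reduction_op,
                                    T initial_value)
{
  assert(num_segments >= 0);
  std::vector<T> out(static_cast<std::size_t>(num_segments), initial_value);
  for (std::int64_t s = 0; s < num_segments; ++s)
  {
    const auto idx = static_cast<std::size_t>(s);
    // An empty segment (end <= begin) leaves out[idx] at initial_value.
    for (auto i = begin_offsets[s]; i < end_offsets[s]; ++i)
    {
      out[idx] = reduction_op(out[idx], in[static_cast<std::size_t>(i)]);
    }
  }
  return out;
}
```

With in = {8, 6, 7, 5, 3, 0, 9} and segment_offsets = {0, 3, 3, 7}, passing segment_offsets as the begin offsets and segment_offsets + 1 as the end offsets describes three contiguous segments. The middle one is empty, so with initial_value = INT_MAX the model yields {6, INT_MAX, 0}.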
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename ReductionOpT, typename T, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t Reduce(
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    BeginOffsetIteratorT d_begin_offsets,
    EndOffsetIteratorT d_end_offsets,
    ReductionOpT reduction_op,
    T initial_value,
    EnvT env = {})
Computes a device-wide segmented reduction using the specified binary reduction_op functor.
Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
Does not support binary reduction operators that are non-commutative.
Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating-point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Can use a specific stream or CUDA memory resource through the env parameter.
Snippet
The code snippet below illustrates a custom min-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
ReductionOpT – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
T – [inferred] Data element type that is convertible to the value type of InputIteratorT
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.
- Parameters:
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_end_offsets[i] - 1 < d_begin_offsets[i], the ith segment is considered empty.
reduction_op – [in] Binary reduction functor
initial_value – [in] Initial value of the reduction for each segment
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.
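The determinism note for Reduce hinges on floating-point addition being only pseudo-associative: regrouping the same operands can change the rounded result, which is why a different tile size on another architecture may produce a different sum. The host-side sketch below shows the effect with three float values. It is plain C++ and says nothing about how CUB actually groups operands; the function names are hypothetical.

```cpp
#include <cassert>

// Computes (a + b) + c. The volatile temporary forces the partial sum
// to be rounded to float before the second addition.
inline float sum_left_to_right(float a, float b, float c)
{
  volatile float partial = a + b;
  return partial + c;
}

// Computes a + (b + c) with the same rounding of the partial sum.
inline float sum_right_to_left(float a, float b, float c)
{
  volatile float partial = b + c;
  return a + partial;
}
```

For a = 1e8f, b = -1e8f, c = 1.0f, the first grouping returns 1.0f, while the second returns 0.0f because -1e8f + 1.0f rounds back to -1e8f in single precision.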
template<typename InputIteratorT, typename OutputIteratorT, typename ReductionOpT, typename T>
static inline cudaError_t Reduce(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    int segment_size,
    ReductionOpT reduction_op,
    T initial_value,
    cudaStream_t stream = 0)
Computes a device-wide segmented reduction using the specified binary reduction_op functor and a fixed segment size.
Added in version 3.2.0: First appears in CUDA Toolkit 13.2.
Does not support binary reduction operators that are non-commutative.
Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating-point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet
The code snippet below illustrates a custom min-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
ReductionOpT – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
T – [inferred] Data element type that is convertible to the value type of InputIteratorT
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregates
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
reduction_op – [in] Binary reduction functor
initial_value – [in] Initial value of the reduction for each segment
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
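For the fixed-size overload, the offsets are implied by segment_size rather than passed explicitly. The CPU sketch below models that, under the assumption (not stated on this page) that segment s occupies the contiguous range [s * segment_size, (s + 1) * segment_size) of the input. The function name fixed_size_segmented_reduce_ref is hypothetical and the code is a reference model, not CUB's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sequential CPU model of a fixed-size segmented reduction (NOT CUB itself).
// Assumption: segment s is stored at in[s * segment_size] ..
// in[(s + 1) * segment_size - 1].
template <typename T, typename ReductionOpT>
std::vector<T> fixed_size_segmented_reduce_ref(const std::vector<T>& in,
                                               std::int64_t num_segments,
                                               int segment_size,
                                               ReductionOpT reduction_op,
                                               T initial_value)
{
  assert(num_segments >= 0 && segment_size >= 0);
  const auto n   = static_cast<std::size_t>(num_segments);
  const auto len = static_cast<std::size_t>(segment_size);
  assert(in.size() >= n * len);

  std::vector<T> out(n, initial_value);
  for (std::size_t s = 0; s < n; ++s)
  {
    for (std::size_t i = 0; i < len; ++i)
    {
      out[s] = reduction_op(out[s], in[s * len + i]);
    }
  }
  return out;
}
```

For example, reducing in = {8, 6, 7, 5, 3, 0} as two segments of size 3 with a min functor and an initial value of 100 gives {6, 0} in this model.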
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename = ::cuda::std::void_t<typename ::cuda::std::iterator_traits<BeginOffsetIteratorT>::value_type, typename ::cuda::std::iterator_traits<EndOffsetIteratorT>::value_type>>
static inline cudaError_t Sum(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    BeginOffsetIteratorT d_begin_offsets,
    EndOffsetIteratorT d_end_offsets,
    cudaStream_t stream = 0)
Computes a device-wide segmented sum using the addition (+) operator.
Added in version 2.2.0: First appears in CUDA Toolkit 12.3.
Uses 0 as the initial value of the reduction for each segment.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support + operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet
The code snippet below illustrates the sum reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_end_offsets[i] - 1 < d_begin_offsets[i], the ith segment is considered empty.
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
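To make the aliasing of a single segment_offsets sequence concrete, the CPU sketch below sums contiguous segments described by one offsets array of length num_segments + 1, using 0 as the initial value. It is a host-side reference model, not the CUB call; segmented_sum_ref is a hypothetical name and the data are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential CPU model of a segmented sum over contiguous segments
// (NOT the CUB implementation). segment_offsets has num_segments + 1
// entries: segment s spans [segment_offsets[s], segment_offsets[s + 1]).
template <typename T>
std::vector<T> segmented_sum_ref(const std::vector<T>& in,
                                 const std::vector<int>& segment_offsets)
{
  assert(!segment_offsets.empty());
  const std::size_t num_segments = segment_offsets.size() - 1;

  // Every segment starts from 0, so an empty segment sums to 0.
  std::vector<T> out(num_segments, T{0});
  for (std::size_t s = 0; s < num_segments; ++s)
  {
    for (int i = segment_offsets[s]; i < segment_offsets[s + 1]; ++i)
    {
      out[s] = out[s] + in[static_cast<std::size_t>(i)];
    }
  }
  return out;
}
```

With in = {8, 6, 7, 5, 3, 0, 9} and segment_offsets = {0, 3, 3, 7}, the model returns {21, 0, 17}; the empty middle segment contributes the initial value 0.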
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename = ::cuda::std::void_t<typename ::cuda::std::iterator_traits<BeginOffsetIteratorT>::value_type, typename ::cuda::std::iterator_traits<EndOffsetIteratorT>::value_type>, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t Sum(
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    BeginOffsetIteratorT d_begin_offsets,
    EndOffsetIteratorT d_end_offsets,
    EnvT env = {})
Computes a device-wide segmented sum using the addition (+) operator.
Added in version 2.2.0: First appears in CUDA Toolkit 12.3.
Uses 0 as the initial value of the reduction for each segment.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support + operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Can use a specific stream or CUDA memory resource through the env parameter.
This overload takes no d_temp_storage or temp_storage_bytes arguments, so no separate size query is needed.
Snippet
The code snippet below illustrates the sum reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.
- Parameters:
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_end_offsets[i] - 1 < d_begin_offsets[i], the ith segment is considered empty.
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.
template<typename InputIteratorT, typename OutputIteratorT>
static inline cudaError_t Sum(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    int segment_size,
    cudaStream_t stream = 0)
Computes a device-wide segmented sum using the addition (+) operator.
Added in version 3.2.0: First appears in CUDA Toolkit 13.2.
Uses 0 as the initial value of the reduction for each segment.
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet
The code snippet below illustrates the sum reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename = ::cuda::std::void_t<typename ::cuda::std::iterator_traits<BeginOffsetIteratorT>::value_type, typename ::cuda::std::iterator_traits<EndOffsetIteratorT>::value_type>>
static inline cudaError_t Min(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    BeginOffsetIteratorT d_begin_offsets,
    EndOffsetIteratorT d_end_offsets,
    cudaStream_t stream = 0)
Computes a device-wide segmented minimum using the less-than (<) operator.
Added in version 2.2.0: First appears in CUDA Toolkit 12.3.
Uses ::cuda::std::numeric_limits<T>::max() as the initial value of the reduction for each segment.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support < operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet
The code snippet below illustrates the min-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_end_offsets[i] - 1 < d_begin_offsets[i], the ith segment is considered empty.
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
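Because the begin and end offsets are supplied as two separate sequences, segments do not have to tile the input back to back. The CPU sketch below models the segmented minimum with independent offset arrays and the numeric_limits<T>::max() initial value. It is a host-side reference model with the hypothetical name segmented_min_ref, using std::numeric_limits as a stand-in for ::cuda::std::numeric_limits; it is not CUB's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Sequential CPU model of a segmented minimum (NOT the CUB implementation).
// Segment s covers in[begin_offsets[s]] .. in[end_offsets[s] - 1].
template <typename T>
std::vector<T> segmented_min_ref(const std::vector<T>& in,
                                 const std::vector<int>& begin_offsets,
                                 const std::vector<int>& end_offsets)
{
  assert(begin_offsets.size() == end_offsets.size());

  // Empty segments keep the initial value, numeric_limits<T>::max().
  std::vector<T> out(begin_offsets.size(), std::numeric_limits<T>::max());
  for (std::size_t s = 0; s < out.size(); ++s)
  {
    for (int i = begin_offsets[s]; i < end_offsets[s]; ++i)
    {
      const T item = in[static_cast<std::size_t>(i)];
      if (item < out[s])
      {
        out[s] = item;
      }
    }
  }
  return out;
}
```

With in = {8, 6, 7, 5, 3, 0, 9}, begin offsets {0, 4, 2} and end offsets {3, 7, 2}, the model skips element 3 entirely, treats the third segment as empty, and returns {6, 0, numeric_limits<int>::max()}.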
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t Min(
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    BeginOffsetIteratorT d_begin_offsets,
    EndOffsetIteratorT d_end_offsets,
    EnvT env = {})
Computes a device-wide segmented minimum using the less-than (<) operator.
Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
Uses ::cuda::std::numeric_limits<T>::max() as the initial value of the reduction for each segment.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support < operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Can use a specific stream or CUDA memory resource through the env parameter.
Snippet
The code snippet below illustrates the min-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.
- Parameters:
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_end_offsets[i] - 1 < d_begin_offsets[i], the ith segment is considered empty.
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.
template<typename InputIteratorT, typename OutputIteratorT>
static inline cudaError_t Min(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    int segment_size,
    cudaStream_t stream = 0)
Computes a device-wide segmented minimum using the less-than (<) operator.
Added in version 3.2.0: First appears in CUDA Toolkit 13.2.
Uses ::cuda::std::numeric_limits<T>::max() as the initial value of the reduction for each segment.
Snippet
The code snippet below illustrates the min-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename = ::cuda::std::void_t<typename ::cuda::std::iterator_traits<BeginOffsetIteratorT>::value_type, typename ::cuda::std::iterator_traits<EndOffsetIteratorT>::value_type>>
static inline cudaError_t ArgMin(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    BeginOffsetIteratorT d_begin_offsets,
    EndOffsetIteratorT d_end_offsets,
    cudaStream_t stream = 0)
Finds the first device-wide minimum in each segment using the less-than (<) operator, also returning the in-segment index of that item.
Added in version 2.2.0: First appears in CUDA Toolkit 12.3.
The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T).
The minimum of the ith segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.
The {1, ::cuda::std::numeric_limits<T>::max()} tuple is produced for zero-length inputs.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support < operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet
The code snippet below illustrates the argmin-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type KeyValuePair<int, T>) (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_end_offsets[i] - 1 < d_begin_offsets[i], the ith segment is considered empty.
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
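The ArgMin output convention (key holds the in-segment offset, value holds the minimum, ties resolve to the first occurrence, and empty segments produce {1, max}) can be sketched on the host as follows. KeyValue is a hypothetical stand-in for cub::KeyValuePair<int, T>, segmented_argmin_ref is a hypothetical name, and the code is a sequential reference model rather than CUB's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for cub::KeyValuePair<int, T>.
template <typename T>
struct KeyValue
{
  int key; // offset of the minimum within its segment
  T value; // the minimum itself
};

// Sequential CPU model of a segmented argmin (NOT the CUB implementation).
template <typename T>
std::vector<KeyValue<T>> segmented_argmin_ref(const std::vector<T>& in,
                                              const std::vector<int>& begin_offsets,
                                              const std::vector<int>& end_offsets)
{
  assert(begin_offsets.size() == end_offsets.size());

  // {1, max} is the documented result for zero-length segments.
  std::vector<KeyValue<T>> out(begin_offsets.size(),
                               KeyValue<T>{1, std::numeric_limits<T>::max()});
  for (std::size_t s = 0; s < out.size(); ++s)
  {
    const int begin = begin_offsets[s];
    for (int i = begin; i < end_offsets[s]; ++i)
    {
      const T item = in[static_cast<std::size_t>(i)];
      // Always take the first item; afterwards strict < keeps the
      // earliest occurrence when several items tie for the minimum.
      if (i == begin || item < out[s].value)
      {
        out[s] = KeyValue<T>{i - begin, item};
      }
    }
  }
  return out;
}
```

With in = {8, 6, 7, 5, 3, 0, 9}, begin offsets {0, 3, 3} and end offsets {3, 3, 7}, the model yields {1, 6}, {1, numeric_limits<int>::max()} and {2, 0}: the keys are offsets relative to each segment's start, not indices into the whole input.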
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t ArgMin(
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    BeginOffsetIteratorT d_begin_offsets,
    EndOffsetIteratorT d_end_offsets,
    EnvT env = {})
Finds the first device-wide minimum in each segment using the less-than (<) operator, also returning the in-segment index of that item.
Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T).
The minimum of the ith segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.
The {1, std::numeric_limits<T>::max()} tuple is produced for zero-length inputs.
Does not support < operators that are non-commutative.
Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating-point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Can use a specific stream or CUDA memory resource through the env parameter.
Snippet
The code snippet below illustrates the argmin-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type cub::KeyValuePair<int, T>) (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.
- Parameters:
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_end_offsets[i] - 1 < d_begin_offsets[i], the ith segment is considered empty.
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.
template<typename InputIteratorT, typename OutputIteratorT>
static inline cudaError_t ArgMin(
    void *d_temp_storage,
    size_t &temp_storage_bytes,
    InputIteratorT d_in,
    OutputIteratorT d_out,
    ::cuda::std::int64_t num_segments,
    int segment_size,
    cudaStream_t stream = 0)
Finds the first device-wide minimum in each segment using the less-than (<) operator, also returning the in-segment index of that item.
Added in version 3.2.0: First appears in CUDA Toolkit 13.2.
The output value type of d_out is ::cuda::std::pair<int, T> (assuming the value type of d_in is T).
The minimum of the ith segment is written to d_out[i].second and its offset in that segment is written to d_out[i].first.
The {1, ::cuda::std::numeric_limits<T>::max()} tuple is produced for zero-length inputs.
Snippet#
The code snippet below illustrates the argmin-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type cuda::std::pair<int, T>) (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename = ::cuda::std::void_t<typename ::cuda::std::iterator_traits<BeginOffsetIteratorT>::value_type, typename ::cuda::std::iterator_traits<EndOffsetIteratorT>::value_type>>
static inline cudaError_t Max(
  void *d_temp_storage,
  size_t &temp_storage_bytes,
  InputIteratorT d_in,
  OutputIteratorT d_out,
  ::cuda::std::int64_t num_segments,
  BeginOffsetIteratorT d_begin_offsets,
  EndOffsetIteratorT d_end_offsets,
  cudaStream_t stream = 0)
Computes a device-wide segmented maximum using the greater-than (>) operator.
Added in version 2.2.0: First appears in CUDA Toolkit 12.3.
Uses ::cuda::std::numeric_limits<T>::lowest() as the initial value of the reduction.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support > operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet#
The code snippet below illustrates the max-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the ith segment is considered empty.
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
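The aliasing of a single offsets array described for this overload can be sketched on the host as follows. This is a plain C++ reference for the documented semantics (lowest() as the initial value, end offsets taken as the offsets array shifted by one), not CUB's implementation. The names segmented_max_reference and segmented_max_contiguous are hypothetical, and treating a segment as empty when its end offset is not greater than its begin offset is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Host-side reference of a segmented maximum (not CUB's implementation).
// begin_offsets / end_offsets play the role of d_begin_offsets / d_end_offsets.
template <typename T>
std::vector<T> segmented_max_reference(const std::vector<T>& in,
                                       std::int64_t num_segments,
                                       const std::int64_t* begin_offsets,
                                       const std::int64_t* end_offsets) {
  assert(num_segments >= 0);
  std::vector<T> out(static_cast<std::size_t>(num_segments));
  for (std::int64_t s = 0; s < num_segments; ++s) {
    // Documented initial value; it is also the result for an empty segment.
    T best = std::numeric_limits<T>::lowest();
    for (std::int64_t i = begin_offsets[s]; i < end_offsets[s]; ++i) {
      const T x = in[static_cast<std::size_t>(i)];
      if (x > best) best = x;
    }
    out[static_cast<std::size_t>(s)] = best;
  }
  return out;
}

// One array of num_segments + 1 offsets is aliased as the begin offsets and,
// shifted by one element, as the end offsets.
template <typename T>
std::vector<T> segmented_max_contiguous(
    const std::vector<T>& in,
    const std::vector<std::int64_t>& segment_offsets) {
  assert(!segment_offsets.empty());
  const std::int64_t num_segments =
      static_cast<std::int64_t>(segment_offsets.size()) - 1;
  return segmented_max_reference(in, num_segments, segment_offsets.data(),
                                 segment_offsets.data() + 1);
}
```

With offsets {0, 3, 3, 7}, the middle segment is empty because its begin and end offsets coincide, so its result is simply the initial value.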
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t Max(
  InputIteratorT d_in,
  OutputIteratorT d_out,
  ::cuda::std::int64_t num_segments,
  BeginOffsetIteratorT d_begin_offsets,
  EndOffsetIteratorT d_end_offsets,
  EnvT env = {})
Computes a device-wide segmented maximum using the greater-than (>) operator.
Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
Uses ::cuda::std::numeric_limits<T>::lowest() as the initial value of the reduction.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support > operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Can use a specific stream or CUDA memory resource through the env parameter.
Snippet#
The code snippet below illustrates the max-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.
- Parameters:
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the ith segment is considered empty.
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.
-
template<typename InputIteratorT, typename OutputIteratorT>
static inline cudaError_t Max(
  void *d_temp_storage,
  size_t &temp_storage_bytes,
  InputIteratorT d_in,
  OutputIteratorT d_out,
  ::cuda::std::int64_t num_segments,
  int segment_size,
  cudaStream_t stream = 0)
Computes a device-wide segmented maximum using the greater-than (>) operator.
Added in version 3.2.0: First appears in CUDA Toolkit 13.2.
Uses ::cuda::std::numeric_limits<T>::lowest() as the initial value of the reduction.
Snippet#
The code snippet below illustrates the max-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
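For the fixed-size overload, the segment boundaries are implied by segment_size rather than read from offset arrays. The host-side sketch below spells that out. It assumes the segments are laid out back to back, so that segment s covers the elements [s * segment_size, (s + 1) * segment_size); the page states only that every segment has the same size, so this layout is an assumption. The function name fixed_size_segmented_max_reference is hypothetical, and the code is a reference for the semantics, not CUB's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Host-side reference of a fixed-size segmented maximum (not CUB's implementation).
// Assumption: segment s covers [s * segment_size, (s + 1) * segment_size).
template <typename T>
std::vector<T> fixed_size_segmented_max_reference(const std::vector<T>& in,
                                                  std::int64_t num_segments,
                                                  int segment_size) {
  assert(num_segments >= 0 && segment_size >= 0);
  assert(static_cast<std::int64_t>(in.size()) >= num_segments * segment_size);
  std::vector<T> out(static_cast<std::size_t>(num_segments));
  for (std::int64_t s = 0; s < num_segments; ++s) {
    const std::int64_t begin = s * segment_size;  // implied begin offset
    const std::int64_t end = begin + segment_size;  // implied end offset
    T best = std::numeric_limits<T>::lowest();  // documented initial value
    for (std::int64_t i = begin; i < end; ++i) {
      const T x = in[static_cast<std::size_t>(i)];
      if (x > best) best = x;
    }
    out[static_cast<std::size_t>(s)] = best;
  }
  return out;
}
```

The same eight inputs can be reduced as two segments of four or four segments of two just by changing num_segments and segment_size, with no offset arrays to build.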
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename = ::cuda::std::void_t<typename ::cuda::std::iterator_traits<BeginOffsetIteratorT>::value_type, typename ::cuda::std::iterator_traits<EndOffsetIteratorT>::value_type>>
static inline cudaError_t ArgMax(
  void *d_temp_storage,
  size_t &temp_storage_bytes,
  InputIteratorT d_in,
  OutputIteratorT d_out,
  ::cuda::std::int64_t num_segments,
  BeginOffsetIteratorT d_begin_offsets,
  EndOffsetIteratorT d_end_offsets,
  cudaStream_t stream = 0)
Finds the first device-wide maximum in each segment using the greater-than (>) operator, also returning the in-segment index of that item.
Added in version 2.2.0: First appears in CUDA Toolkit 12.3.
The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T).
The maximum of the ith segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.
The {1, ::cuda::std::numeric_limits<T>::lowest()} tuple is produced for zero-length inputs.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support > operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet#
The code snippet below illustrates the argmax-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type KeyValuePair<int, T>) (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the ith segment is considered empty.
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t ArgMax(
  InputIteratorT d_in,
  OutputIteratorT d_out,
  ::cuda::std::int64_t num_segments,
  BeginOffsetIteratorT d_begin_offsets,
  EndOffsetIteratorT d_end_offsets,
  EnvT env = {})
Finds the first device-wide maximum in each segment using the greater-than (>) operator, also returning the in-segment index of that item.
Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T).
The maximum of the ith segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.
The {1, ::cuda::std::numeric_limits<T>::lowest()} tuple is produced for zero-length inputs.
Does not support > operators that are non-commutative.
Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Can use a specific stream or CUDA memory resource through the env parameter.
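The determinism remark above concerns pseudo-associative operators such as floating-point addition, where the grouping of operands can change the rounded result. The host-side sketch below illustrates the effect with a tiled sum. The function tiled_sum and its tile_size parameter are hypothetical stand-ins for an architecture-dependent tiling choice; this is not how CUB tiles its reductions, only a small demonstration of why a different tile size can change a floating-point total.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sums `values` tile by tile, then adds up the per-tile partial sums.
// tile_size is a hypothetical stand-in for an architecture-specific tiling choice.
inline float tiled_sum(const std::vector<float>& values, std::size_t tile_size) {
  assert(tile_size > 0);
  float total = 0.0f;
  for (std::size_t start = 0; start < values.size(); start += tile_size) {
    const std::size_t stop = std::min(values.size(), start + tile_size);
    float partial = 0.0f;
    for (std::size_t i = start; i < stop; ++i) {
      partial += values[i];  // each addition rounds to single precision
    }
    total += partial;
  }
  return total;
}
```

For the input {1e8f, 1.0f, -1e8f, 1.0f} evaluated in IEEE-754 single precision, a single tile of four keeps the final 1.0f, while tiles of two absorb both small terms into the large ones before they cancel. The same tile size always reproduces the same total, which matches the run-to-run guarantee on a single device.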
Snippet#
The code snippet below illustrates the argmax-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type cub::KeyValuePair<int, T>) (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.
- Parameters:
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the ith segment is considered empty.
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.
-
template<typename InputIteratorT, typename OutputIteratorT>
static inline cudaError_t ArgMax(
  void *d_temp_storage,
  size_t &temp_storage_bytes,
  InputIteratorT d_in,
  OutputIteratorT d_out,
  ::cuda::std::int64_t num_segments,
  int segment_size,
  cudaStream_t stream = 0)
Finds the first device-wide maximum in each segment using the greater-than (>) operator, also returning the in-segment index of that item.
Added in version 3.2.0: First appears in CUDA Toolkit 13.2.
The output value type of d_out is ::cuda::std::pair<int, T> (assuming the value type of d_in is T).
The maximum of the ith segment is written to d_out[i].second and its offset in that segment is written to d_out[i].first.
The {1, ::cuda::std::numeric_limits<T>::lowest()} tuple is produced for zero-length inputs.
Snippet#
The code snippet below illustrates the argmax-reduction of a device vector of int data elements.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type cuda::std::pair<int, T>) (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream 0.
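The result layout of this fixed-size overload (index in first, value in second, first occurrence on ties, {1, lowest()} for zero-length segments) can be checked against a small host-side reference. The sketch below uses std::pair in place of ::cuda::std::pair and a hypothetical function name, fixed_size_segmented_argmax_reference. It assumes the segments are stored back to back in the input, and it is not CUB's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Host-side reference of a fixed-size segmented argmax (not CUB's implementation).
// Result layout follows the description: first = in-segment index, second = maximum.
template <typename T>
std::vector<std::pair<int, T>> fixed_size_segmented_argmax_reference(
    const std::vector<T>& in, std::int64_t num_segments, int segment_size) {
  assert(num_segments >= 0 && segment_size >= 0);
  assert(static_cast<std::int64_t>(in.size()) >= num_segments * segment_size);
  std::vector<std::pair<int, T>> out(static_cast<std::size_t>(num_segments));
  for (std::int64_t s = 0; s < num_segments; ++s) {
    // Documented result for zero-length segments.
    std::pair<int, T> best{1, std::numeric_limits<T>::lowest()};
    for (int j = 0; j < segment_size; ++j) {
      // Assumption: segment s starts at element s * segment_size.
      const T x = in[static_cast<std::size_t>(s * segment_size + j)];
      // The first element is always taken; strict '>' keeps the first
      // occurrence when the maximum appears more than once.
      if (j == 0 || x > best.second) {
        best = {j, x};
      }
    }
    out[static_cast<std::size_t>(s)] = best;
  }
  return out;
}
```

Note that the member order differs from the cub::KeyValuePair-based overloads only in naming: first corresponds to key and second to value.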