cub::DeviceSegmentedReduce#

struct DeviceSegmentedReduce#

DeviceSegmentedReduce provides device-wide, parallel operations for computing a reduction across multiple sequences of data items residing within device-accessible memory.

Overview#

A reduction (or fold) uses a binary combining operator to compute a single aggregate from a sequence of input elements.
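
To make the segmented variant of this definition concrete, the following host-only sketch computes one aggregate per segment from begin and end offsets. It is a sequential reference written for illustration, not the GPU implementation used by CUB; the name `segmented_reduce_reference` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference only: this is NOT how CUB computes the result on the GPU.
// Segment s covers in[begin_offsets[s]] .. in[end_offsets[s] - 1].
template <typename T, typename OffsetT, typename ReductionOp>
std::vector<T> segmented_reduce_reference(
  const std::vector<T>& in,
  const std::vector<OffsetT>& begin_offsets,
  const std::vector<OffsetT>& end_offsets,
  ReductionOp op,
  T initial_value)
{
  assert(begin_offsets.size() == end_offsets.size());
  std::vector<T> out(begin_offsets.size(), initial_value);
  for (std::size_t s = 0; s < begin_offsets.size(); ++s)
  {
    // An empty segment (begin >= end) leaves out[s] at initial_value.
    for (OffsetT i = begin_offsets[s]; i < end_offsets[s]; ++i)
    {
      out[s] = op(out[s], in[static_cast<std::size_t>(i)]);
    }
  }
  return out;
}
```

With the input {8, 6, 7, 5, 3, 0, 9}, begin offsets {0, 3, 3} and end offsets {3, 3, 7}, an addition operator with initial value 0 yields {21, 0, 17}, matching the snippets on this page.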

Usage Considerations#

Dynamic parallelism. DeviceSegmentedReduce methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

Determinism. The default reproducibility guarantee is run_to_run. A different guarantee can be requested through the execution environment with cuda::execution::require. See Determinism in CUB for the supported guarantees.

Determinism#

cub::DeviceSegmentedReduce supports not_guaranteed and run_to_run (default run_to_run). gpu_to_gpu is not supported and is rejected at compile time. See the determinism guarantees for what each level means.
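
Determinism levels matter mainly for pseudo-associative operators such as floating-point addition, where the order of evaluation can change the result. The host-only sketch below illustrates that effect with two hypothetical helpers that sum the same values in opposite orders; it does not model how CUB schedules work.

```cpp
#include <vector>

// Sums the values left to right: ((v0 + v1) + v2) + ...
inline float sum_left_to_right(const std::vector<float>& v)
{
  float acc = 0.0f;
  for (float x : v)
  {
    acc += x;
  }
  return acc;
}

// Sums the same values right to left: ((vN + vN-1) + vN-2) + ...
inline float sum_right_to_left(const std::vector<float>& v)
{
  float acc = 0.0f;
  for (auto it = v.rbegin(); it != v.rend(); ++it)
  {
    acc += *it;
  }
  return acc;
}
```

For the values {1e20f, -1e20f, 1.0f}, the left-to-right sum is 1 while the right-to-left sum is 0, because 1.0f is absorbed when it is added to -1e20f first.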

Tuning#

All algorithms in DeviceSegmentedReduce that accept an environment can be tuned by passing a custom policy selector that returns a cub::SegmentedReducePolicy, as shown in the example below:

struct SegmentedReducePolicySelector
{
  __host__ __device__ constexpr auto operator()(cuda::compute_capability cc) const -> cub::SegmentedReducePolicy
  {
    auto rp = cub::ReducePassPolicy{
      .threads_per_block = 128,
      .items_per_thread  = cc > cuda::compute_capability{9, 0} ? 11 : 7,
      .vec_size          = 4,
      .reduce_algorithm  = cub::BLOCK_REDUCE_WARP_REDUCTIONS,
      .load_modifier     = cub::LOAD_LDG};
    return {
      .large_reduce  = rp,
      .medium_reduce = {.threads_per_block = rp.threads_per_block,
                        .threads_per_warp  = 32,
                        .items_per_thread  = rp.items_per_thread,
                        .vec_size          = rp.vec_size,
                        .load_modifier     = rp.load_modifier},
      .small_reduce  = {.threads_per_block = rp.threads_per_block,
                        .threads_per_warp  = 1,
                        .items_per_thread  = rp.items_per_thread,
                        .vec_size          = rp.vec_size,
                        .load_modifier     = rp.load_modifier}};
  }
};

int num_segments                     = 3;
thrust::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                    = thrust::raw_pointer_cast(d_offsets.data());
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
thrust::device_vector<int> d_out(3, thrust::no_init);

const auto error = cub::DeviceSegmentedReduce::Sum(
  d_in.begin(),
  d_out.begin(),
  num_segments,
  d_offsets_it,
  d_offsets_it + 1,
  cuda::execution::tune(SegmentedReducePolicySelector{}));
if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceSegmentedReduce::Sum failed with status: " << error << '\n';
}

thrust::device_vector<int> expected{21, 0, 17};

Public Static Functions

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename ReductionOpT, typename T> static inline cudaError_t Reduce( void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, ReductionOpT reduction_op, T initial_value, cudaStream_t stream = nullptr )#

Computes a device-wide segmented reduction using the specified binary reduction_op functor.

Added in version 2.2.0: First appears in CUDA Toolkit 12.3.

Does not support binary reduction operators that are non-commutative.
Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.
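
The aliased segment_offsets layout described above can be built from per-segment lengths. The sketch below is a host-only illustration with a hypothetical helper name; in practice the offsets would typically be computed or copied into device-accessible memory.

```cpp
#include <cstddef>
#include <vector>

// Builds num_segments + 1 offsets from per-segment lengths (an exclusive
// prefix sum followed by the total). offsets[s] is the begin of segment s
// and offsets[s + 1] is its end, so the same array serves both roles.
inline std::vector<int> make_segment_offsets(const std::vector<int>& segment_lengths)
{
  std::vector<int> offsets(segment_lengths.size() + 1, 0);
  for (std::size_t s = 0; s < segment_lengths.size(); ++s)
  {
    offsets[s + 1] = offsets[s] + segment_lengths[s];
  }
  return offsets;
}
```

Segment lengths {3, 0, 4} produce the offsets {0, 3, 3, 7} used in the snippets on this page; a pointer to the first element plays the role of d_begin_offsets and the same pointer plus one plays the role of d_end_offsets.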

Snippet#

The code snippet below illustrates a custom min-reduction of a device vector of int data elements.

struct CustomMin
{
  template <typename T>
  __device__ __forceinline__ T operator()(const T& a, const T& b) const
  {
    return (b < a) ? b : a;
  }
};

int num_segments                  = 3;
c2h::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                 = thrust::raw_pointer_cast(d_offsets.data());
c2h::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
c2h::device_vector<int> d_out(3);
CustomMin min_op;
int initial_value{INT_MAX};

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::Reduce(
  d_temp_storage,
  temp_storage_bytes,
  d_in.begin(),
  d_out.begin(),
  num_segments,
  d_offsets_it,
  d_offsets_it + 1,
  min_op,
  initial_value);

c2h::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::Reduce(
  d_temp_storage,
  temp_storage_bytes,
  d_in.begin(),
  d_out.begin(),
  num_segments,
  d_offsets_it,
  d_offsets_it + 1,
  min_op,
  initial_value);

c2h::device_vector<int> expected{6, INT_MAX, 0};

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
ReductionOpT – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
T – [inferred] Data element type that is convertible to the value type of InputIteratorT

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the i^th data segment in d_in
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the i^th data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the i^th segment is considered empty.
reduction_op – [in] Binary reduction functor
initial_value – [in] Initial value of the reduction for each segment
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename ReductionOpT, typename T, typename EnvT = ::cuda::std::execution::env<>> static inline cudaError_t Reduce( InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, ReductionOpT reduction_op, T initial_value, EnvT env = {} )#

Computes a device-wide segmented reduction using the specified binary reduction_op functor.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

Does not support binary reduction operators that are non-commutative.
Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Can use a specific stream or CUDA memory resource through the env parameter.

Snippet#

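The code snippets below illustrate a segmented sum of a device vector of int data elements using ::cuda::std::plus<>, first on a user-provided stream and then with an explicit determinism requirement.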

int num_segments                     = 3;
thrust::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                    = thrust::raw_pointer_cast(d_offsets.data());
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
thrust::device_vector<int> d_out(3);

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};

auto error = cub::DeviceSegmentedReduce::Reduce(
  d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1, ::cuda::std::plus<>{}, 0, stream_ref);
thrust::device_vector<int> expected{21, 0, 17};

if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceSegmentedReduce::Reduce failed with status: " << error << '\n';
}

The following variant passes a determinism requirement through the environment instead of a stream:

int num_segments                     = 3;
thrust::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                    = thrust::raw_pointer_cast(d_offsets.data());
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
thrust::device_vector<int> d_out(3);

auto env = cuda::execution::require(cuda::execution::determinism::run_to_run);

auto error = cub::DeviceSegmentedReduce::Reduce(
  d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1, ::cuda::std::plus<>{}, 0, env);
thrust::device_vector<int> expected{21, 0, 17};

if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceSegmentedReduce::Reduce failed with status: " << error << '\n';
}

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
ReductionOpT – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
T – [inferred] Data element type that is convertible to the value type of InputIteratorT
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the i^th data segment in d_in
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the i^th data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the i^th segment is considered empty.
reduction_op – [in] Binary reduction functor
initial_value – [in] Initial value of the reduction for each segment
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename InputIteratorT, typename OutputIteratorT, typename ReductionOpT, typename T> static inline cudaError_t Reduce( void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, int segment_size, ReductionOpT reduction_op, T initial_value, cudaStream_t stream = nullptr )#

Computes a device-wide segmented reduction using the specified binary reduction_op functor and a fixed segment size.

Added in version 3.2.0: First appears in CUDA Toolkit 13.2.

Does not support binary reduction operators that are non-commutative.
Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.
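
For the fixed-size overload, the snippets on this page suggest that segment s covers the input items at indices s * segment_size up to (s + 1) * segment_size - 1. The host-only sketch below encodes that assumption in a hypothetical reference function; it is not the CUB implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference under the stated assumption: segments are laid out
// back to back, each with exactly segment_size items.
template <typename T, typename ReductionOp>
std::vector<T> fixed_size_segmented_reduce_reference(
  const std::vector<T>& in, std::size_t num_segments, std::size_t segment_size, ReductionOp op, T initial_value)
{
  // Trailing input items beyond num_segments * segment_size are ignored.
  assert(num_segments * segment_size <= in.size());
  std::vector<T> out(num_segments, initial_value);
  for (std::size_t s = 0; s < num_segments; ++s)
  {
    for (std::size_t i = 0; i < segment_size; ++i)
    {
      out[s] = op(out[s], in[s * segment_size + i]);
    }
  }
  return out;
}
```

With the input {8, 6, 7, 5, 3, 0, 9}, three segments of size 2 and a minimum operator, the result is {6, 5, 0}; the trailing 9 is not part of any segment.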

Snippet#

The code snippet below illustrates a custom min-reduction of a device vector of int data elements.

struct CustomMin
{
  template <typename T>
  __device__ __forceinline__ T operator()(const T& a, const T& b) const
  {
    return (b < a) ? b : a;
  }
};

int num_segments = 3;
int segment_size = 2;
c2h::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
c2h::device_vector<int> d_out(3);
CustomMin min_op;
int initial_value{INT_MAX};

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::Reduce(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, segment_size, min_op, initial_value);

c2h::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::Reduce(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, segment_size, min_op, initial_value);

c2h::device_vector<int> expected{6, 5, 0};

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
ReductionOpT – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
T – [inferred] Data element type that is convertible to the value type of InputIteratorT

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregates
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
reduction_op – [in] Binary reduction functor
initial_value – [in] Initial value of the reduction for each segment
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename ReductionOpT, typename T, typename EnvT = ::cuda::std::execution::env<>> static inline cudaError_t Reduce( InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, int segment_size, ReductionOpT reduction_op, T initial_value, EnvT env = {} )#

Computes a device-wide segmented reduction using the specified binary reduction_op functor and a fixed segment size.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

Does not support binary reduction operators that are non-commutative.
Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to another device of a different compute capability because CUB can employ different tile-sizing for different architectures.
Can use a specific stream or CUDA memory resource through the env parameter.

Snippet#

int num_segments = 2;
int segment_size = 3;
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0};
thrust::device_vector<int> d_out(2);

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};

auto error = cub::DeviceSegmentedReduce::Reduce(
  d_in.begin(), d_out.begin(), num_segments, segment_size, ::cuda::std::plus<>{}, 0, stream_ref);
thrust::device_vector<int> expected{21, 8};

if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceSegmentedReduce::Reduce (fixed-size) failed with status: " << error << '\n';
}

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
ReductionOpT – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
T – [inferred] Data element type that is convertible to the value type of InputIteratorT
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregates
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
reduction_op – [in] Binary reduction functor
initial_value – [in] Initial value of the reduction for each segment
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename = ::cuda::std::void_t<typename ::cuda::std::iterator_traits<BeginOffsetIteratorT>::value_type, typename ::cuda::std::iterator_traits<EndOffsetIteratorT>::value_type>> static inline cudaError_t Sum( void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = nullptr )#

Computes a device-wide segmented sum using the addition (+) operator.

Added in version 2.2.0: First appears in CUDA Toolkit 12.3.

Uses 0 as the initial value of the reduction for each segment.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support + operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.
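
The two-phase calling convention described above (query the size with a null pointer, allocate, then call again) can be sketched on the host with a stand-in function. Everything below is hypothetical: `mock_two_phase_op` is not a CUB function, its size formula is arbitrary, and plain int status codes replace cudaError_t. Unlike the shorter snippets on this page, the caller checks the status of the size query before allocating.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in that mimics the two-phase contract. Returns 0 on success.
inline int mock_two_phase_op(void* d_temp_storage, std::size_t& temp_storage_bytes, int num_segments)
{
  const std::size_t required = static_cast<std::size_t>(num_segments) * sizeof(int);
  if (d_temp_storage == nullptr)
  {
    temp_storage_bytes = required; // phase 1: report the size, do no work
    return 0;
  }
  if (temp_storage_bytes < required)
  {
    return 1; // the caller provided too little storage
  }
  // phase 2: the actual work would run here using d_temp_storage
  return 0;
}

// Caller-side pattern: query, allocate, run.
inline int run_two_phase(int num_segments)
{
  void* d_temp_storage           = nullptr;
  std::size_t temp_storage_bytes = 0;
  const int query_status = mock_two_phase_op(d_temp_storage, temp_storage_bytes, num_segments);
  if (query_status != 0)
  {
    return query_status;
  }
  std::vector<unsigned char> temp(temp_storage_bytes);
  d_temp_storage = temp.data();
  return mock_two_phase_op(d_temp_storage, temp_storage_bytes, num_segments);
}
```

Both calls must receive the same problem arguments so that the size reported by the first call matches what the second call needs.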

Snippet#

The code snippet below illustrates the sum reduction of a device vector of int data elements.

int num_segments                  = 3;
c2h::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                 = thrust::raw_pointer_cast(d_offsets.data());
c2h::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
c2h::device_vector<int> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::Sum(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::Sum(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<int> expected{21, 0, 17};

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the i^th data segment in d_in
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the i^th data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the i^th segment is considered empty.
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename EnvT = ::cuda::std::execution::env<>> static inline cudaError_t Sum( InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, EnvT env = {} )#

Computes a device-wide segmented sum using the addition (+) operator.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

Uses 0 as the initial value of the reduction for each segment.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support + operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Can use a specific stream or CUDA memory resource through the env parameter.

Snippet#

The code snippet below illustrates the sum reduction of a device vector of int data elements.

int num_segments                     = 3;
thrust::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                    = thrust::raw_pointer_cast(d_offsets.data());
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
thrust::device_vector<int> d_out(3);

auto req_env = cuda::execution::require(cuda::execution::determinism::not_guaranteed);
cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};
auto env = ::cuda::std::execution::env{req_env, stream_ref};

auto error =
  cub::DeviceSegmentedReduce::Sum(d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1, env);
thrust::device_vector<int> expected{21, 0, 17};

if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceSegmentedReduce::Sum failed with status: " << error << '\n';
}

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the i^th data segment in d_in
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the i^th data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the i^th segment is considered empty.
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename InputIteratorT, typename OutputIteratorT> static inline cudaError_t Sum( void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, int segment_size, cudaStream_t stream = nullptr )#

Computes a device-wide segmented sum using the addition (+) operator and a fixed segment size.

Added in version 3.2.0: First appears in CUDA Toolkit 13.2.

Uses 0 as the initial value of the reduction for each segment.
Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.

Snippet#

The code snippet below illustrates the sum reduction of a device vector of int data elements.

int num_segments = 3;
int segment_size = 2;
c2h::device_vector<int> d_in{6, 8, 7, 5, 3, 0};
c2h::device_vector<int> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::Sum(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, segment_size);

c2h::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::Sum(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, segment_size);

c2h::device_vector<int> d_expected{14, 12, 3};

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename EnvT = ::cuda::std::execution::env<>> static inline cudaError_t Sum( InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, int segment_size, EnvT env = {} )#

Computes a device-wide segmented sum using the addition (+) operator and a fixed segment size.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

Uses 0 as the initial value of the reduction for each segment.
Can use a specific stream or CUDA memory resource through the env parameter.

Snippet#

int num_segments = 2;
int segment_size = 3;
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0};
thrust::device_vector<int> d_out(2);

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};

auto error = cub::DeviceSegmentedReduce::Sum(d_in.begin(), d_out.begin(), num_segments, segment_size, stream_ref);
thrust::device_vector<int> expected{21, 8};

if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceSegmentedReduce::Sum (fixed-size) failed with status: " << error << '\n';
}

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename = ::cuda::std::void_t<typename ::cuda::std::iterator_traits<BeginOffsetIteratorT>::value_type, typename ::cuda::std::iterator_traits<EndOffsetIteratorT>::value_type>> static inline cudaError_t Min( void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = nullptr )#

Computes a device-wide segmented minimum using the less-than (<) operator.

Added in version 2.2.0: First appears in CUDA Toolkit 12.3.

Uses ::cuda::std::numeric_limits<T>::max() as the initial value of the reduction for each segment.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support < operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.

Snippet#

The code snippet below illustrates the min-reduction of a device vector of int data elements.

int num_segments                  = 3;
c2h::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                 = thrust::raw_pointer_cast(d_offsets.data());
c2h::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
c2h::device_vector<int> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::Min(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::Min(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<int> expected{6, INT_MAX, 0};
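
The expected result above, including the INT_MAX produced for the empty middle segment, can be reproduced with a host-side sketch. It assumes a single segment_offsets sequence aliased as begin and end offsets, as described in the notes. The name reference_segmented_min is hypothetical, and the loop only mirrors the documented semantics, not CUB's device implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Host-side reference only (hypothetical helper, not a CUB API).
std::vector<int> reference_segmented_min(
  const std::vector<int>& in, const std::vector<int>& segment_offsets)
{
  assert(!segment_offsets.empty());
  const std::size_t num_segments = segment_offsets.size() - 1;

  // One sequence of length num_segments + 1 serves as both offset ranges.
  const int* begin_offsets = segment_offsets.data();     // plays the role of d_begin_offsets
  const int* end_offsets   = segment_offsets.data() + 1; // plays the role of d_end_offsets

  std::vector<int> out(num_segments);
  for (std::size_t s = 0; s < num_segments; ++s)
  {
    // Documented initial value; an empty segment keeps it unchanged.
    int result = std::numeric_limits<int>::max();
    for (int i = begin_offsets[s]; i < end_offsets[s]; ++i)
    {
      const int item = in[static_cast<std::size_t>(i)];
      if (item < result)
      {
        result = item;
      }
    }
    out[s] = result;
  }
  return out;
}
```

With the input {8, 6, 7, 5, 3, 0, 9} and offsets {0, 3, 3, 7}, segment 1 spans [3, 3) and is empty, so its output stays at the initial value.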

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the index of the first element of the i^th data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the index of the last element of the i^th data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the i^th segment is considered empty.
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename EnvT = ::cuda::std::execution::env<>> static inline cudaError_t Min( InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, EnvT env = {} )#

Computes a device-wide segmented minimum using the less-than (<) operator.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

Uses ::cuda::std::numeric_limits<T>::max() as the initial value of the reduction for each segment.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support < operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Can use a specific stream or CUDA memory resource through the env parameter.

Snippet#

The code snippet below illustrates the min-reduction of a device vector of int data elements.

int num_segments                     = 3;
thrust::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                    = thrust::raw_pointer_cast(d_offsets.data());
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
thrust::device_vector<int> d_out(3);

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};

auto error = cub::DeviceSegmentedReduce::Min(
  d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1, stream_ref);
thrust::device_vector<int> expected{6, std::numeric_limits<int>::max(), 0};

if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceSegmentedReduce::Min failed with status: " << error << '\n';
}
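
The offsets {0, 3, 3, 7} used in the snippet describe contiguous segments of lengths 3, 0, and 4. A minimal host-side sketch of how such a segment_offsets sequence could be built from per-segment lengths (an exclusive prefix sum with the total appended) is shown below. The helper name offsets_from_lengths is hypothetical and is not part of CUB.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: builds num_segments + 1 offsets from per-segment lengths.
// offsets[s] is the begin offset of segment s and offsets[s + 1] is its end offset.
std::vector<int> offsets_from_lengths(const std::vector<int>& segment_lengths)
{
  std::vector<int> offsets(segment_lengths.size() + 1, 0);
  for (std::size_t s = 0; s < segment_lengths.size(); ++s)
  {
    assert(segment_lengths[s] >= 0); // a zero length yields an empty segment
    offsets[s + 1] = offsets[s] + segment_lengths[s];
  }
  return offsets;
}
```

Given the result in a vector named offsets, offsets.data() can stand in for d_begin_offsets and offsets.data() + 1 for d_end_offsets, which is the aliasing described in the notes above.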

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the index of the first element of the i^th data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the index of the last element of the i^th data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the i^th segment is considered empty.
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename InputIteratorT, typename OutputIteratorT> static inline cudaError_t Min( void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, int segment_size, cudaStream_t stream = nullptr )#

Computes a device-wide segmented minimum using the less-than (<) operator.

Added in version 3.2.0: First appears in CUDA Toolkit 13.2.

Uses ::cuda::std::numeric_limits<T>::max() as the initial value of the reduction for each segment.

Snippet#

The code snippet below illustrates the min-reduction of a device vector of int data elements.

int num_segments = 3;
int segment_size = 2;
c2h::device_vector<int> d_in{6, 8, 7, 5, 3, 0};
c2h::device_vector<int> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::Min(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, segment_size);

c2h::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::Min(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, segment_size);

c2h::device_vector<int> d_expected{6, 5, 0};
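
The fixed-size overloads differ from one another only in the combining operator and the initial value. The host-side sketch below makes that explicit with a generic helper and then instantiates it for Min. Both function names are hypothetical, and the sketch assumes segment s starts at index s * segment_size. It is a reference for the semantics, not CUB's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical generic reference: reduces each fixed-size segment with `op`, starting from `init`.
template <typename T, typename BinaryOp>
std::vector<T> reference_fixed_size_reduce(
  const std::vector<T>& in, std::size_t num_segments, std::size_t segment_size, BinaryOp op, T init)
{
  assert(in.size() >= num_segments * segment_size);
  std::vector<T> out(num_segments, init);
  for (std::size_t s = 0; s < num_segments; ++s)
  {
    for (std::size_t i = 0; i < segment_size; ++i)
    {
      out[s] = op(out[s], in[s * segment_size + i]);
    }
  }
  return out;
}

// Min: less-than comparison with numeric_limits<int>::max() as the initial value.
std::vector<int> reference_fixed_size_min(
  const std::vector<int>& in, std::size_t num_segments, std::size_t segment_size)
{
  return reference_fixed_size_reduce(
    in, num_segments, segment_size, [](int a, int b) { return b < a ? b : a; }, std::numeric_limits<int>::max());
}
```

For the input {6, 8, 7, 5, 3, 0} with three segments of size 2, reference_fixed_size_min returns {6, 5, 0}, in agreement with d_expected.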

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename EnvT = ::cuda::std::execution::env<>> static inline cudaError_t Min( InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, int segment_size, EnvT env = {} )#

Computes a device-wide segmented minimum using the less-than (<) operator and a fixed segment size.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

Uses ::cuda::std::numeric_limits<T>::max() as the initial value of the reduction for each segment.
Can use a specific stream or CUDA memory resource through the env parameter.

Snippet#

int num_segments = 2;
int segment_size = 3;
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0};
thrust::device_vector<int> d_out(2);

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};

auto error = cub::DeviceSegmentedReduce::Min(d_in.begin(), d_out.begin(), num_segments, segment_size, stream_ref);
thrust::device_vector<int> expected{6, 0};

if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceSegmentedReduce::Min (fixed-size) failed with status: " << error << '\n';
}

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename = ::cuda::std::void_t<typename ::cuda::std::iterator_traits<BeginOffsetIteratorT>::value_type, typename ::cuda::std::iterator_traits<EndOffsetIteratorT>::value_type>> static inline cudaError_t ArgMin( void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = nullptr )#

Finds the first device-wide minimum in each segment using the less-than (<) operator, also returning the in-segment index of that item.

Added in version 2.2.0: First appears in CUDA Toolkit 12.3.

The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T)
- The minimum of the i^th segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.
- The {1, ::cuda::std::numeric_limits<T>::max()} tuple is produced for zero-length inputs
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support < operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.

Snippet#

The code snippet below illustrates the argmin-reduction of a device vector of int data elements.

int num_segments                  = 3;
c2h::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                 = thrust::raw_pointer_cast(d_offsets.data());
c2h::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
c2h::device_vector<cub::KeyValuePair<int, int>> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::ArgMin(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::ArgMin(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<cub::KeyValuePair<int, int>> expected{{1, 6}, {1, INT_MAX}, {2, 0}};
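
To see where each {key, value} pair in the expected vector comes from, the following host-side sketch walks the segments sequentially. IndexValue is a hypothetical stand-in for cub::KeyValuePair<int, int>, and reference_segmented_argmin is likewise not a CUB function. Accepting the first element of a segment unconditionally is a choice of this sketch, made so that the returned index is meaningful even when every element equals the initial value.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for cub::KeyValuePair<int, int>.
struct IndexValue
{
  int key;   // index of the minimum, relative to the start of its segment
  int value; // the minimum itself
};

// Host-side reference only (not a CUB API).
std::vector<IndexValue> reference_segmented_argmin(
  const std::vector<int>& in, const std::vector<int>& segment_offsets)
{
  assert(!segment_offsets.empty());
  const std::size_t num_segments = segment_offsets.size() - 1;

  std::vector<IndexValue> out(num_segments);
  for (std::size_t s = 0; s < num_segments; ++s)
  {
    const int begin = segment_offsets[s];
    const int end   = segment_offsets[s + 1];

    // Documented result for a zero-length segment.
    IndexValue best{1, std::numeric_limits<int>::max()};
    for (int i = begin; i < end; ++i)
    {
      const int item = in[static_cast<std::size_t>(i)];
      // The strict comparison keeps the first occurrence when values tie.
      if (i == begin || item < best.value)
      {
        best = IndexValue{i - begin, item};
      }
    }
    out[s] = best;
  }
  return out;
}
```

For the input {8, 6, 7, 5, 3, 0, 9} with offsets {0, 3, 3, 7}, the keys are in-segment indices: the 0 in the last segment sits at global index 5 but is reported with key 2.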

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type KeyValuePair<int, T>) (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the index of the first element of the i^th data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the index of the last element of the i^th data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the i^th segment is considered empty.
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename EnvT = ::cuda::std::execution::env<>> static inline cudaError_t ArgMin( InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, EnvT env = {} )#

Finds the first device-wide minimum in each segment using the less-than (<) operator, also returning the in-segment index of that item.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T)
- The minimum of the i^th segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.
- The {1, ::cuda::std::numeric_limits<T>::max()} tuple is produced for zero-length inputs
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support < operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Can use a specific stream or CUDA memory resource through the env parameter.

Snippet#

The code snippet below illustrates the argmin-reduction of a device vector of int data elements.

int num_segments                     = 3;
thrust::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                    = thrust::raw_pointer_cast(d_offsets.data());
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
thrust::device_vector<cub::KeyValuePair<int, int>> d_out(3);

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};

auto error = cub::DeviceSegmentedReduce::ArgMin(
  d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1, stream_ref);

if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceSegmentedReduce::ArgMin failed with status: " << error << '\n';
}

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type cub::KeyValuePair<int, T>) (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the index of the first element of the i^th data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the index of the last element of the i^th data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the i^th segment is considered empty.
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename InputIteratorT, typename OutputIteratorT> static inline cudaError_t ArgMin( void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, int segment_size, cudaStream_t stream = nullptr )#

Finds the first device-wide minimum in each segment using the less-than (<) operator, also returning the in-segment index of that item.

Added in version 3.2.0: First appears in CUDA Toolkit 13.2.

The output value type of d_out is ::cuda::std::pair<int, T> (assuming the value type of d_in is T)
- The minimum of the i^th segment is written to d_out[i].second and its offset in that segment is written to d_out[i].first.
- The {1, ::cuda::std::numeric_limits<T>::max()} tuple is produced for zero-length inputs

Snippet#

The code snippet below illustrates the argmin-reduction of a device vector of int data elements.

int num_segments = 3;
int segment_size = 2;
c2h::device_vector<int> d_in{6, 8, 7, 5, 3, 0};
c2h::device_vector<cuda::std::pair<int, int>> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::ArgMin(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, segment_size);

c2h::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::ArgMin(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, segment_size);

c2h::host_vector<cuda::std::pair<int, int>> h_expected{{0, 6}, {1, 5}, {1, 0}};
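
In the fixed-size overloads the result is a pair whose first member is the in-segment index and whose second member is the minimum. The host-side sketch below uses std::pair in place of cuda::std::pair to show that layout. The name reference_fixed_size_argmin is hypothetical, and taking the first element of each segment unconditionally is a choice of this sketch rather than documented CUB behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Host-side reference only (hypothetical helper, not a CUB API).
// Returns {in-segment index, minimum} for each fixed-size segment.
std::vector<std::pair<int, int>> reference_fixed_size_argmin(
  const std::vector<int>& in, std::size_t num_segments, std::size_t segment_size)
{
  assert(in.size() >= num_segments * segment_size);

  std::vector<std::pair<int, int>> out(num_segments);
  for (std::size_t s = 0; s < num_segments; ++s)
  {
    // Documented result when the segment has zero length.
    std::pair<int, int> best{1, std::numeric_limits<int>::max()};
    for (std::size_t i = 0; i < segment_size; ++i)
    {
      const int item = in[s * segment_size + i];
      // The strict comparison keeps the first occurrence when values tie.
      if (i == 0 || item < best.second)
      {
        best = std::make_pair(static_cast<int>(i), item);
      }
    }
    out[s] = best;
  }
  return out;
}
```

For {6, 8, 7, 5, 3, 0} with three segments of size 2 this reference returns {0, 6}, {1, 5}, {1, 0}, matching h_expected. For {8, 6, 7, 5, 3, 0} with two segments of size 3 it returns {1, 6} and {2, 0}.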

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type cuda::std::pair<int, T>) (may be a simple pointer type)

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename EnvT = ::cuda::std::execution::env<>> static inline cudaError_t ArgMin( InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, int segment_size, EnvT env = {} )#

Finds the first device-wide minimum in each segment using the less-than (<) operator, also returning the in-segment index of that item, with a fixed segment size.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

The output value type of d_out is ::cuda::std::pair<int, T> (assuming the value type of d_in is T)
- The minimum of the i^th segment is written to d_out[i].second and its offset in that segment is written to d_out[i].first.
- The {1, ::cuda::std::numeric_limits<T>::max()} tuple is produced for zero-length inputs
Can use a specific stream or CUDA memory resource through the env parameter.

Snippet#

int num_segments = 2;
int segment_size = 3;
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0};
thrust::device_vector<cuda::std::pair<int, int>> d_out(2);

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};

auto error = cub::DeviceSegmentedReduce::ArgMin(d_in.begin(), d_out.begin(), num_segments, segment_size, stream_ref);

if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceSegmentedReduce::ArgMin (fixed-size) failed with status: " << error << '\n';
}

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type cuda::std::pair<int, T>) (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename = ::cuda::std::void_t<typename ::cuda::std::iterator_traits<BeginOffsetIteratorT>::value_type, typename ::cuda::std::iterator_traits<EndOffsetIteratorT>::value_type>> static inline cudaError_t Max( void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = nullptr )#

Computes a device-wide segmented maximum using the greater-than (>) operator.

Added in version 2.2.0: First appears in CUDA Toolkit 12.3.

Uses ::cuda::std::numeric_limits<T>::lowest() as the initial value of the reduction.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support > operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.

Snippet#

The code snippet below illustrates the max-reduction of a device vector of int data elements.

int num_segments                  = 3;
c2h::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                 = thrust::raw_pointer_cast(d_offsets.data());
c2h::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
c2h::device_vector<int> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::Max(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::Max(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<int> expected{8, INT_MIN, 9};
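
The initial value for Max is numeric_limits<T>::lowest() rather than min(). For int the two coincide (hence INT_MIN above), but for floating-point types min() is the smallest positive normal value and would be wrong for all-negative segments. The host-side sketch below illustrates this with float. The name reference_segmented_max is hypothetical, and the code is a reference for the semantics only.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Host-side reference only (hypothetical helper, not a CUB API).
std::vector<float> reference_segmented_max(
  const std::vector<float>& in, const std::vector<int>& segment_offsets)
{
  assert(!segment_offsets.empty());
  const std::size_t num_segments = segment_offsets.size() - 1;

  std::vector<float> out(num_segments);
  for (std::size_t s = 0; s < num_segments; ++s)
  {
    // lowest() is the most negative finite float, so any finite input can replace it.
    float result = std::numeric_limits<float>::lowest();
    for (int i = segment_offsets[s]; i < segment_offsets[s + 1]; ++i)
    {
      const float item = in[static_cast<std::size_t>(i)];
      if (item > result)
      {
        result = item;
      }
    }
    out[s] = result;
  }
  return out;
}
```

With the input {-8, -6, -7} and offsets {0, 3, 3}, the first segment reduces to -6 and the empty second segment keeps lowest(). Starting from numeric_limits<float>::min() instead would give a small positive number for the first segment.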

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the index of the first element of the i^th data segment in d_in
d_end_offsets – [in] Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the index of the last element of the i^th data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the i^th segment is considered empty.
stream – [in] [optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename EnvT = ::cuda::std::execution::env<>> static inline cudaError_t Max( InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, EnvT env = {} )#

Computes a device-wide segmented maximum using the greater-than (>) operator.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

Uses ::cuda::std::numeric_limits<T>::lowest() as the initial value of the reduction.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support > operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Can use a specific stream or CUDA memory resource through the env parameter.

Snippet#

The code snippet below illustrates the max-reduction of a device vector of int data elements.

int num_segments                     = 3;
thrust::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                    = thrust::raw_pointer_cast(d_offsets.data());
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
thrust::device_vector<int> d_out(3);

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};

auto error = cub::DeviceSegmentedReduce::Max(
  d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1, stream_ref);
thrust::device_vector<int> expected{8, std::numeric_limits<int>::lowest(), 9};

if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceSegmentedReduce::Max failed with status: " << error << '\n';
}

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the index of the first element of the i^th data segment in d_in
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the index of the last element of the i^th data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the i^th segment is considered empty.
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename InputIteratorT, typename OutputIteratorT> static inline cudaError_t Max( void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, int segment_size, cudaStream_t stream = nullptr )#

Computes a device-wide segmented maximum using the greater-than (>) operator and a fixed segment size.

Added in version 3.2.0: First appears in CUDA Toolkit 13.2.

Uses ::cuda::std::numeric_limits<T>::lowest() as the initial value of the reduction.

Snippet#

The code snippet below illustrates the max-reduction of a device vector of int data elements.

int num_segments = 3;
int segment_size = 2;

c2h::device_vector<int> d_in{6, 8, 7, 5, 3, 0};
c2h::device_vector<int> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::Max(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, segment_size);

c2h::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::Max(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, segment_size);

c2h::device_vector<int> d_expected{8, 7, 3};
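
The fixed-size overload takes no offset arrays. The host-side sketch below is not CUB code. It assumes that segment s covers items s * segment_size through (s + 1) * segment_size - 1 of the input, which is consistent with d_expected above.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Host-side reference (not part of CUB). Assumption: segment s covers
// in[s * segment_size] through in[(s + 1) * segment_size - 1].
std::vector<int> reference_fixed_size_max(const std::vector<int>& in, std::size_t num_segments, int segment_size)
{
  assert(segment_size >= 0);
  const std::size_t size = static_cast<std::size_t>(segment_size);
  assert(in.size() >= num_segments * size);

  std::vector<int> out(num_segments);
  for (std::size_t s = 0; s < num_segments; ++s)
  {
    int result = std::numeric_limits<int>::lowest(); // documented initial value
    for (std::size_t i = 0; i < size; ++i)
    {
      const int item = in[s * size + i];
      if (item > result)
      {
        result = item;
      }
    }
    out[s] = result;
  }
  return out;
}
```

For the input {6, 8, 7, 5, 3, 0} with three segments of size 2, the reference returns {8, 7, 3}.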

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename EnvT = ::cuda::std::execution::env<>> static inline cudaError_t Max( InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, int segment_size, EnvT env = {} )#

Computes a device-wide segmented maximum using the greater-than (>) operator and a fixed segment size.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

Uses ::cuda::std::numeric_limits<T>::lowest() as the initial value of the reduction.
Can use a specific stream or CUDA memory resource through the env parameter.

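Snippet#

The code snippet below illustrates the max-reduction of a device vector of int data elements with a fixed segment size.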

int num_segments = 2;
int segment_size = 3;
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0};
thrust::device_vector<int> d_out(2);

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};

auto error = cub::DeviceSegmentedReduce::Max(d_in.begin(), d_out.begin(), num_segments, segment_size, stream_ref);
thrust::device_vector<int> expected{8, 5};

if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceSegmentedReduce::Max (fixed-size) failed with status: " << error << '\n';
}

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename = ::cuda::std::void_t<typename ::cuda::std::iterator_traits<BeginOffsetIteratorT>::value_type, typename ::cuda::std::iterator_traits<EndOffsetIteratorT>::value_type>> static inline cudaError_t ArgMax( void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = nullptr )#

Finds the first device-wide maximum in each segment using the greater-than (>) operator, also returning the in-segment index of that item.

Added in version 2.2.0: First appears in CUDA Toolkit 12.3.

The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T).
- The maximum of the i^th segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.
- The {1, ::cuda::std::numeric_limits<T>::lowest()} tuple is produced for zero-length inputs.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support > operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.

Snippet#

The code snippet below illustrates the argmax-reduction of a device vector of int data elements.

int num_segments                  = 3;
c2h::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                 = thrust::raw_pointer_cast(d_offsets.data());
c2h::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
c2h::device_vector<cub::KeyValuePair<int, int>> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::ArgMax(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::ArgMax(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

c2h::device_vector<cub::KeyValuePair<int, int>> expected{{0, 8}, {1, INT_MIN}, {3, 9}};
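
The values in expected can be reproduced with a sequential host-side reference. The sketch below is not CUB code. KeyValue is a stand-in for cub::KeyValuePair<int, int>, on the assumption that only its key and value members matter here, and the offsets are passed as a single aliased array.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Host-side stand-in for cub::KeyValuePair<int, int> (assumption: only the
// .key and .value members are needed for this illustration).
struct KeyValue
{
  int key;   // in-segment index of the maximum
  int value; // the maximum itself
};

// Host-side reference (not part of CUB): segment s is [offsets[s], offsets[s + 1]).
std::vector<KeyValue> reference_segmented_argmax(const std::vector<int>& in, const std::vector<int>& offsets)
{
  assert(!offsets.empty());
  const std::size_t num_segments = offsets.size() - 1;
  std::vector<KeyValue> out(num_segments);
  for (std::size_t s = 0; s < num_segments; ++s)
  {
    // Result documented for zero-length segments.
    KeyValue best{1, std::numeric_limits<int>::lowest()};
    for (int i = offsets[s]; i < offsets[s + 1]; ++i)
    {
      // The first item always replaces the placeholder; afterwards only a
      // strictly greater item does, so the first maximum wins on ties.
      if (i == offsets[s] || in[i] > best.value)
      {
        best = KeyValue{i - offsets[s], in[i]};
      }
    }
    out[s] = best;
  }
  return out;
}
```

Called with the input {8, 6, 7, 5, 3, 0, 9} and offsets {0, 3, 3, 7}, it returns {0, 8}, {1, INT_MIN} and {3, 9}.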

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type cub::KeyValuePair<int, T>) (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the index of the first element of the i^th data segment in d_in
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the index of the last element of the i^th data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the i^th segment is considered empty.
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename EnvT = ::cuda::std::execution::env<>> static inline cudaError_t ArgMax( InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, EnvT env = {} )#

Finds the first device-wide maximum in each segment using the greater-than (>) operator, also returning the in-segment index of that item.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T).
- The maximum of the i^th segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.
- The {1, ::cuda::std::numeric_limits<T>::lowest()} tuple is produced for zero-length inputs.
When the input is a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).
Does not support > operators that are non-commutative.
Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).
Can use a specific stream or CUDA memory resource through the env parameter.

Snippet#

The code snippet below illustrates the argmax-reduction of a device vector of int data elements.

int num_segments                     = 3;
thrust::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                    = thrust::raw_pointer_cast(d_offsets.data());
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
thrust::device_vector<cub::KeyValuePair<int, int>> d_out(3);

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};

auto error = cub::DeviceSegmentedReduce::ArgMax(
  d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1, stream_ref);

if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceSegmentedReduce::ArgMax failed with status: " << error << '\n';
}
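
Because d_out[i].key is an offset within segment i rather than an index into d_in, callers that need the absolute position have to add the segment's begin offset. The helper below is a hypothetical host-side sketch, not part of CUB. It assumes that a segment is empty exactly when its end offset does not exceed its begin offset, in which case the reported key is only a placeholder.

```cpp
#include <cassert>

// Hypothetical helper (not part of CUB): maps an in-segment index reported by
// ArgMax to an index into d_in. Returns -1 for an empty segment (assumption:
// empty means end_offset <= begin_offset).
int to_input_index(int key, int begin_offset, int end_offset)
{
  if (end_offset <= begin_offset)
  {
    return -1; // empty segment: no element of d_in was selected
  }
  assert(key >= 0 && key < end_offset - begin_offset);
  return begin_offset + key;
}
```

For the data in the snippet above, the third segment spans offsets 3 to 7 and its maximum 9 sits at in-segment index 3, so the helper returns 6, the position of 9 in d_in.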

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type cub::KeyValuePair<int, T>) (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the index of the first element of the i^th data segment in d_in
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the index of the last element of the i^th data segment in d_in. If d_end_offsets[i] <= d_begin_offsets[i], the i^th segment is considered empty.
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename InputIteratorT, typename OutputIteratorT> static inline cudaError_t ArgMax( void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, int segment_size, cudaStream_t stream = nullptr )#

Finds the first device-wide maximum in each segment using the greater-than (>) operator, also returning the in-segment index of that item, with a fixed segment size.

Added in version 3.2.0: First appears in CUDA Toolkit 13.2.

The output value type of d_out is ::cuda::std::pair<int, T> (assuming the value type of d_in is T).
- The maximum of the i^th segment is written to d_out[i].second and its offset in that segment is written to d_out[i].first.
- The {1, ::cuda::std::numeric_limits<T>::lowest()} tuple is produced for zero-length inputs.

Snippet#

The code snippet below illustrates the argmax-reduction of a device vector of int data elements.

int num_segments = 3;
int segment_size = 2;
c2h::device_vector<int> d_in{6, 8, 7, 5, 3, 0};
c2h::device_vector<cuda::std::pair<int, int>> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::ArgMax(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, segment_size);

c2h::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::ArgMax(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, segment_size);

c2h::host_vector<cuda::std::pair<int, int>> h_expected{{1, 8}, {0, 7}, {0, 3}};
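
A sequential host-side reference reproduces h_expected. The sketch below is not CUB code. It uses std::pair in place of cuda::std::pair and assumes segment s starts at item s * segment_size of the input.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Host-side reference (not part of CUB): returns {in-segment index, maximum}
// for each fixed-size segment.
std::vector<std::pair<int, int>>
reference_fixed_size_argmax(const std::vector<int>& in, std::size_t num_segments, int segment_size)
{
  assert(segment_size >= 0);
  const std::size_t size = static_cast<std::size_t>(segment_size);
  assert(in.size() >= num_segments * size);

  std::vector<std::pair<int, int>> out(num_segments);
  for (std::size_t s = 0; s < num_segments; ++s)
  {
    // Result documented for zero-length inputs.
    std::pair<int, int> best{1, std::numeric_limits<int>::lowest()};
    for (std::size_t i = 0; i < size; ++i)
    {
      const int item = in[s * size + i];
      if (i == 0 || item > best.second)
      {
        best = {static_cast<int>(i), item};
      }
    }
    out[s] = best;
  }
  return out;
}
```

For {6, 8, 7, 5, 3, 0} with three segments of size 2 it returns {1, 8}, {0, 7} and {0, 3}.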

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type cuda::std::pair<int, T>) (may be a simple pointer type)

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename EnvT = ::cuda::std::execution::env<>> static inline cudaError_t ArgMax( InputIteratorT d_in, OutputIteratorT d_out, ::cuda::std::int64_t num_segments, int segment_size, EnvT env = {} )#

Finds the first device-wide maximum in each segment using the greater-than (>) operator, also returning the in-segment index of that item, with a fixed segment size.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

The output value type of d_out is ::cuda::std::pair<int, T> (assuming the value type of d_in is T).
- The maximum of the i^th segment is written to d_out[i].second and its offset in that segment is written to d_out[i].first.
- The {1, ::cuda::std::numeric_limits<T>::lowest()} tuple is produced for zero-length inputs.
Can use a specific stream or CUDA memory resource through the env parameter.

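Snippet#

The code snippet below illustrates the argmax-reduction of a device vector of int data elements with a fixed segment size.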

int num_segments = 2;
int segment_size = 3;
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0};
thrust::device_vector<cuda::std::pair<int, int>> d_out(2);

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};

auto error = cub::DeviceSegmentedReduce::ArgMax(d_in.begin(), d_out.begin(), num_segments, segment_size, stream_ref);

if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceSegmentedReduce::ArgMax (fixed-size) failed with status: " << error << '\n';
}
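
The snippet above does not list the expected output. Under the documented semantics (first maximum, in-segment index), a sequential scan with a strict greater-than comparison gives {0, 8} and {0, 5} for its two segments. The helper below is a hypothetical host-side sketch of that scan for a single non-empty segment, not part of CUB, and also shows how ties resolve to the earliest item.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical host-side helper (not part of CUB): scans one non-empty segment
// left to right and returns {in-segment index, value} of its first maximum.
std::pair<int, int> first_argmax(const int* segment, std::size_t segment_size)
{
  assert(segment_size > 0);
  std::pair<int, int> best{0, segment[0]};
  for (std::size_t i = 1; i < segment_size; ++i)
  {
    // Strict comparison: a later item equal to the current maximum does not replace it.
    if (segment[i] > best.second)
    {
      best = {static_cast<int>(i), segment[i]};
    }
  }
  return best;
}
```

For a segment such as {4, 9, 9} the scan reports index 1, the first of the two equal maxima.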

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type cuda::std::pair<int, T>) (may be a simple pointer type)
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the segmented reduction data
segment_size – [in] The fixed segment size of each segment
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.