cub::DeviceSegmentedReduce

Defined in cub/device/device_segmented_reduce.cuh

struct DeviceSegmentedReduce

DeviceSegmentedReduce provides device-wide, parallel operations for computing a reduction across multiple sequences of data items residing within device-accessible memory.

Overview

A reduction (or fold) uses a binary combining operator to compute a single aggregate from a sequence of input elements.

Usage Considerations

  • Dynamic parallelism. DeviceSegmentedReduce methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

Public Static Functions

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename ReductionOpT, typename T>
static inline cudaError_t Reduce(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, ReductionOpT reduction_op, T initial_value, cudaStream_t stream = 0)

Computes a device-wide segmented reduction using the specified binary reduction_op functor.

  • Does not support binary reduction operators that are non-commutative.

  • Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.

  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).

  • Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates a custom min-reduction of a device vector of int data elements.

int num_segments                     = 3;
thrust::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                    = thrust::raw_pointer_cast(d_offsets.data());
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
thrust::device_vector<int> d_out(3);
CustomMin min_op;
int initial_value{INT_MAX};

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::Reduce(
  d_temp_storage,
  temp_storage_bytes,
  d_in.begin(),
  d_out.begin(),
  num_segments,
  d_offsets_it,
  d_offsets_it + 1,
  min_op,
  initial_value);

thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::Reduce(
  d_temp_storage,
  temp_storage_bytes,
  d_in.begin(),
  d_out.begin(),
  num_segments,
  d_offsets_it,
  d_offsets_it + 1,
  min_op,
  initial_value);

thrust::device_vector<int> expected{6, INT_MAX, 0};

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)

  • OutputIteratorT[inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)

  • BeginOffsetIteratorT[inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)

  • EndOffsetIteratorT[inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

  • ReductionOpT[inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)

  • T[inferred] Data element type that is convertible to the value type of InputIteratorT

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to the input sequence of data items

  • d_out[out] Pointer to the output aggregate

  • num_segments[in] The number of segments that comprise the sorting data

  • d_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*

  • d_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the ith is considered empty.

  • reduction_op[in] Binary reduction functor

  • initial_value[in] Initial value of the reduction for each segment

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t Sum(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = 0)

Computes a device-wide segmented sum using the addition (+) operator.

  • Uses 0 as the initial value of the reduction for each segment.

  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).

  • Does not support + operators that are non-commutative.

  • Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the sum reduction of a device vector of int data elements.

int num_segments                     = 3;
thrust::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                    = thrust::raw_pointer_cast(d_offsets.data());
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
thrust::device_vector<int> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::Sum(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::Sum(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

thrust::device_vector<int> expected{21, 0, 17};

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)

  • OutputIteratorT[inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)

  • BeginOffsetIteratorT[inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)

  • EndOffsetIteratorT[inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to the input sequence of data items

  • d_out[out] Pointer to the output aggregate

  • num_segments[in] The number of segments that comprise the sorting data

  • d_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments`, such that ``d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*

  • d_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the ith is considered empty.

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t Min(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = 0)

Computes a device-wide segmented minimum using the less-than (<) operator.

  • Uses std::numeric_limits<T>::max() as the initial value of the reduction for each segment.

  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).

  • Does not support < operators that are non-commutative.

  • Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the min-reduction of a device vector of int data elements.

struct CustomMin
{
  template <typename T>
  __device__ __forceinline__ T operator()(const T& a, const T& b) const
  {
    return (b < a) ? b : a;
  }
};
int num_segments                     = 3;
thrust::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                    = thrust::raw_pointer_cast(d_offsets.data());
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
thrust::device_vector<int> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::Min(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::Min(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

thrust::device_vector<int> expected{6, INT_MAX, 0};

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)

  • OutputIteratorT[inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)

  • BeginOffsetIteratorT[inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)

  • EndOffsetIteratorT[inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to the input sequence of data items

  • d_out[out] Pointer to the output aggregate

  • num_segments[in] The number of segments that comprise the sorting data

  • d_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*

  • d_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the ith is considered empty.

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t ArgMin(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = 0)

Finds the first device-wide minimum in each segment using the less-than (<) operator, also returning the in-segment index of that item.

  • The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T)

    • The minimum of the ith segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.

    • The {1, std::numeric_limits<T>::max()} tuple is produced for zero-length inputs

  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).

  • Does not support < operators that are non-commutative.

  • Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the argmin-reduction of a device vector of int data elements.

int num_segments                     = 3;
thrust::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                    = thrust::raw_pointer_cast(d_offsets.data());
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
thrust::device_vector<cub::KeyValuePair<int, int>> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::ArgMin(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::ArgMin(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

thrust::device_vector<cub::KeyValuePair<int, int>> expected{{1, 6}, {1, INT_MAX}, {2, 0}};

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)

  • OutputIteratorT[inferred] Output iterator type for recording the reduced aggregate (having value type KeyValuePair<int, T>) (may be a simple pointer type)

  • BeginOffsetIteratorT[inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)

  • EndOffsetIteratorT[inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to the input sequence of data items

  • d_out[out] Pointer to the output aggregate

  • num_segments[in] The number of segments that comprise the sorting data

  • d_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*

  • d_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the ith is considered empty.

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t Max(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = 0)

Computes a device-wide segmented maximum using the greater-than (>) operator.

  • Uses std::numeric_limits<T>::lowest() as the initial value of the reduction.

  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).

  • Does not support > operators that are non-commutative.

  • Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the max-reduction of a device vector of int data elements.

int num_segments                     = 3;
thrust::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                    = thrust::raw_pointer_cast(d_offsets.data());
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
thrust::device_vector<int> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::Max(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::Max(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

thrust::device_vector<int> expected{8, INT_MIN, 9};

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)

  • OutputIteratorT[inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)

  • BeginOffsetIteratorT[inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)

  • EndOffsetIteratorT[inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to the input sequence of data items

  • d_out[out] Pointer to the output aggregate

  • num_segments[in] The number of segments that comprise the sorting data

  • d_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*

  • d_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the ith is considered empty.

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t ArgMax(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = 0)

Finds the first device-wide maximum in each segment using the greater-than (>) operator, also returning the in-segment index of that item

  • The output value type of d_out is cub::KeyValuePair<int, T> (assuming the value type of d_in is T)

    • The maximum of the ith segment is written to d_out[i].value and its offset in that segment is written to d_out[i].key.

    • The {1, std::numeric_limits<T>::lowest()} tuple is produced for zero-length inputs

  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).

  • Does not support > operators that are non-commutative.

  • Let s be in [0, num_segments). The range [d_out + d_begin_offsets[s], d_out + d_end_offsets[s]) shall not overlap [d_in + d_begin_offsets[s], d_in + d_end_offsets[s]), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments).

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the argmax-reduction of a device vector of int data elements.

int num_segments                     = 3;
thrust::device_vector<int> d_offsets = {0, 3, 3, 7};
auto d_offsets_it                    = thrust::raw_pointer_cast(d_offsets.data());
thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9};
thrust::device_vector<cub::KeyValuePair<int, int>> d_out(3);

// Determine temporary device storage requirements
void* d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceSegmentedReduce::ArgMax(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// Run reduction
cub::DeviceSegmentedReduce::ArgMax(
  d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1);

thrust::device_vector<cub::KeyValuePair<int, int>> expected{{0, 8}, {1, INT_MIN}, {3, 9}};

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading input items (of some type T) (may be a simple pointer type)

  • OutputIteratorT[inferred] Output iterator type for recording the reduced aggregate (having value type KeyValuePair<int, T>) (may be a simple pointer type)

  • BeginOffsetIteratorT[inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)

  • EndOffsetIteratorT[inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to the input sequence of data items

  • d_out[out] Pointer to the output aggregate

  • num_segments[in] The number of segments that comprise the sorting data

  • d_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*

  • d_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the ith is considered empty.

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.