cub::DeviceSegmentedReduce
Defined in cub/device/device_segmented_reduce.cuh
-
struct DeviceSegmentedReduce
DeviceSegmentedReduce provides device-wide, parallel operations for computing a reduction across multiple sequences of data items residing within device-accessible memory.
Overview
A reduction (or fold) uses a binary combining operator to compute a single aggregate from a sequence of input elements.
Usage Considerations
Dynamic parallelism. DeviceSegmentedReduce methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
Public Static Functions
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename ReductionOpT, typename T>
static inline cudaError_t Reduce(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, ReductionOpT reduction_op, T initial_value, cudaStream_t stream = 0) Computes a device-wide segmented reduction using the specified binary
reduction_op
functor.Does not support binary reduction operators that are non-commutative.
Provides “run-to-run” determinism for pseudo-associative reduction (e.g., addition of floating point types) on the same GPU device. However, results for pseudo-associative reduction may be inconsistent from one device to a another device of a different compute-capability because CUB can employ different tile-sizing for different architectures.
When input a contiguous sequence of segments, a single sequence
segment_offsets
(of lengthnum_segments + 1
) can be aliased for both thed_begin_offsets
andd_end_offsets
parameters (where the latter is specified assegment_offsets + 1
).Let
s
be in[0, num_segments)
. The range[d_out + d_begin_offsets[s], d_out + d_end_offsets[s])
shall not overlap[d_in + d_begin_offsets[s], d_in + d_end_offsets[s])
,[d_begin_offsets, d_begin_offsets + num_segments)
nor[d_end_offsets, d_end_offsets + num_segments)
.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates a custom min-reduction of a device vector of
int
data elements.int num_segments = 3; thrust::device_vector<int> d_offsets = {0, 3, 3, 7}; auto d_offsets_it = thrust::raw_pointer_cast(d_offsets.data()); thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9}; thrust::device_vector<int> d_out(3); CustomMin min_op; int initial_value{INT_MAX}; // Determine temporary device storage requirements void* d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceSegmentedReduce::Reduce( d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1, min_op, initial_value); thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes); d_temp_storage = thrust::raw_pointer_cast(temp_storage.data()); // Run reduction cub::DeviceSegmentedReduce::Reduce( d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1, min_op, initial_value); thrust::device_vector<int> expected{6, INT_MAX, 0};
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
ReductionOpT – [inferred] Binary reduction functor type having member
T operator()(const T &a, const T &b)
T – [inferred] Data element type that is convertible to the
value
type ofInputIteratorT
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the sorting data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments
, such thatd_begin_offsets[i]
is the first element of the ith data segment ind_keys_*
andd_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments
, such thatd_end_offsets[i] - 1
is the last element of the ith data segment ind_keys_*
andd_values_*
. Ifd_end_offsets[i] - 1 <= d_begin_offsets[i]
, the ith is considered empty.reduction_op – [in] Binary reduction functor
initial_value – [in] Initial value of the reduction for each segment
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t Sum(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = 0) Computes a device-wide segmented sum using the addition (
+
) operator.Uses
0
as the initial value of the reduction for each segment.When input a contiguous sequence of segments, a single sequence
segment_offsets
(of lengthnum_segments + 1
) can be aliased for both thed_begin_offsets
andd_end_offsets
parameters (where the latter is specified assegment_offsets + 1
).Does not support
+
operators that are non-commutative.Let
s
be in[0, num_segments)
. The range[d_out + d_begin_offsets[s], d_out + d_end_offsets[s])
shall not overlap[d_in + d_begin_offsets[s], d_in + d_end_offsets[s])
,[d_begin_offsets, d_begin_offsets + num_segments)
nor[d_end_offsets, d_end_offsets + num_segments)
.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the sum reduction of a device vector of
int
data elements.int num_segments = 3; thrust::device_vector<int> d_offsets = {0, 3, 3, 7}; auto d_offsets_it = thrust::raw_pointer_cast(d_offsets.data()); thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9}; thrust::device_vector<int> d_out(3); // Determine temporary device storage requirements void* d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceSegmentedReduce::Sum( d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1); thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes); d_temp_storage = thrust::raw_pointer_cast(temp_storage.data()); // Run reduction cub::DeviceSegmentedReduce::Sum( d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1); thrust::device_vector<int> expected{21, 0, 17};
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the sorting data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments`, such that ``d_begin_offsets[i]
is the first element of the ith data segment ind_keys_*
andd_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments
, such thatd_end_offsets[i] - 1
is the last element of the ith data segment ind_keys_*
andd_values_*
. Ifd_end_offsets[i] - 1 <= d_begin_offsets[i]
, the ith is considered empty.stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t Min(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = 0) Computes a device-wide segmented minimum using the less-than (
<
) operator.Uses
std::numeric_limits<T>::max()
as the initial value of the reduction for each segment.When input a contiguous sequence of segments, a single sequence
segment_offsets
(of lengthnum_segments + 1
) can be aliased for both thed_begin_offsets
andd_end_offsets
parameters (where the latter is specified assegment_offsets + 1
).Does not support
<
operators that are non-commutative.Let
s
be in[0, num_segments)
. The range[d_out + d_begin_offsets[s], d_out + d_end_offsets[s])
shall not overlap[d_in + d_begin_offsets[s], d_in + d_end_offsets[s])
,[d_begin_offsets, d_begin_offsets + num_segments)
nor[d_end_offsets, d_end_offsets + num_segments)
.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the min-reduction of a device vector of
int
data elements.struct CustomMin { template <typename T> __device__ __forceinline__ T operator()(const T& a, const T& b) const { return (b < a) ? b : a; } };
int num_segments = 3; thrust::device_vector<int> d_offsets = {0, 3, 3, 7}; auto d_offsets_it = thrust::raw_pointer_cast(d_offsets.data()); thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9}; thrust::device_vector<int> d_out(3); // Determine temporary device storage requirements void* d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceSegmentedReduce::Min( d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1); thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes); d_temp_storage = thrust::raw_pointer_cast(temp_storage.data()); // Run reduction cub::DeviceSegmentedReduce::Min( d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1); thrust::device_vector<int> expected{6, INT_MAX, 0};
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the sorting data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments
, such thatd_begin_offsets[i]
is the first element of the ith data segment ind_keys_*
andd_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments
, such thatd_end_offsets[i] - 1
is the last element of the ith data segment ind_keys_*
andd_values_*
. Ifd_end_offsets[i] - 1 <= d_begin_offsets[i]
, the ith is considered empty.stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t ArgMin(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = 0) Finds the first device-wide minimum in each segment using the less-than (
<
) operator, also returning the in-segment index of that item.The output value type of
d_out
iscub::KeyValuePair<int, T>
(assuming the value type ofd_in
isT
)The minimum of the ith segment is written to
d_out[i].value
and its offset in that segment is written tod_out[i].key
.The
{1, std::numeric_limits<T>::max()}
tuple is produced for zero-length inputs
When input a contiguous sequence of segments, a single sequence
segment_offsets
(of lengthnum_segments + 1
) can be aliased for both thed_begin_offsets
andd_end_offsets
parameters (where the latter is specified assegment_offsets + 1
).Does not support
<
operators that are non-commutative.Let
s
be in[0, num_segments)
. The range[d_out + d_begin_offsets[s], d_out + d_end_offsets[s])
shall not overlap[d_in + d_begin_offsets[s], d_in + d_end_offsets[s])
,[d_begin_offsets, d_begin_offsets + num_segments)
nor[d_end_offsets, d_end_offsets + num_segments)
.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the argmin-reduction of a device vector of
int
data elements.int num_segments = 3; thrust::device_vector<int> d_offsets = {0, 3, 3, 7}; auto d_offsets_it = thrust::raw_pointer_cast(d_offsets.data()); thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9}; thrust::device_vector<cub::KeyValuePair<int, int>> d_out(3); // Determine temporary device storage requirements void* d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceSegmentedReduce::ArgMin( d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1); thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes); d_temp_storage = thrust::raw_pointer_cast(temp_storage.data()); // Run reduction cub::DeviceSegmentedReduce::ArgMin( d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1); thrust::device_vector<cub::KeyValuePair<int, int>> expected{{1, 6}, {1, INT_MAX}, {2, 0}};
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type
T
) (may be a simple pointer type)OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type
KeyValuePair<int, T>
) (may be a simple pointer type)BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the sorting data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments
, such thatd_begin_offsets[i]
is the first element of the ith data segment ind_keys_*
andd_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments
, such thatd_end_offsets[i] - 1
is the last element of the ith data segment ind_keys_*
andd_values_*
. Ifd_end_offsets[i] - 1 <= d_begin_offsets[i]
, the ith is considered empty.stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t Max(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = 0) Computes a device-wide segmented maximum using the greater-than (
>
) operator.Uses
std::numeric_limits<T>::lowest()
as the initial value of the reduction.When input a contiguous sequence of segments, a single sequence
segment_offsets
(of lengthnum_segments + 1
) can be aliased for both thed_begin_offsets
andd_end_offsets
parameters (where the latter is specified assegment_offsets + 1
).Does not support
>
operators that are non-commutative.Let
s
be in[0, num_segments)
. The range[d_out + d_begin_offsets[s], d_out + d_end_offsets[s])
shall not overlap[d_in + d_begin_offsets[s], d_in + d_end_offsets[s])
,[d_begin_offsets, d_begin_offsets + num_segments)
nor[d_end_offsets, d_end_offsets + num_segments)
.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the max-reduction of a device vector of
int
data elements.int num_segments = 3; thrust::device_vector<int> d_offsets = {0, 3, 3, 7}; auto d_offsets_it = thrust::raw_pointer_cast(d_offsets.data()); thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9}; thrust::device_vector<int> d_out(3); // Determine temporary device storage requirements void* d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceSegmentedReduce::Max( d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1); thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes); d_temp_storage = thrust::raw_pointer_cast(temp_storage.data()); // Run reduction cub::DeviceSegmentedReduce::Max( d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1); thrust::device_vector<int> expected{8, INT_MIN, 9};
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (may be a simple pointer type)
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the sorting data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments
, such thatd_begin_offsets[i]
is the first element of the ith data segment ind_keys_*
andd_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments
, such thatd_end_offsets[i] - 1
is the last element of the ith data segment ind_keys_*
andd_values_*
. Ifd_end_offsets[i] - 1 <= d_begin_offsets[i]
, the ith is considered empty.stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t ArgMax(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, cudaStream_t stream = 0) Finds the first device-wide maximum in each segment using the greater-than (
>
) operator, also returning the in-segment index of that itemThe output value type of
d_out
iscub::KeyValuePair<int, T>
(assuming the value type ofd_in
isT
)The maximum of the ith segment is written to
d_out[i].value
and its offset in that segment is written tod_out[i].key
.The
{1, std::numeric_limits<T>::lowest()}
tuple is produced for zero-length inputs
When input a contiguous sequence of segments, a single sequence
segment_offsets
(of lengthnum_segments + 1
) can be aliased for both thed_begin_offsets
andd_end_offsets
parameters (where the latter is specified assegment_offsets + 1
).Does not support
>
operators that are non-commutative.Let
s
be in[0, num_segments)
. The range[d_out + d_begin_offsets[s], d_out + d_end_offsets[s])
shall not overlap[d_in + d_begin_offsets[s], d_in + d_end_offsets[s])
,[d_begin_offsets, d_begin_offsets + num_segments)
nor[d_end_offsets, d_end_offsets + num_segments)
.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the argmax-reduction of a device vector of int data elements.
int num_segments = 3; thrust::device_vector<int> d_offsets = {0, 3, 3, 7}; auto d_offsets_it = thrust::raw_pointer_cast(d_offsets.data()); thrust::device_vector<int> d_in{8, 6, 7, 5, 3, 0, 9}; thrust::device_vector<cub::KeyValuePair<int, int>> d_out(3); // Determine temporary device storage requirements void* d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceSegmentedReduce::ArgMax( d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1); thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes); d_temp_storage = thrust::raw_pointer_cast(temp_storage.data()); // Run reduction cub::DeviceSegmentedReduce::ArgMax( d_temp_storage, temp_storage_bytes, d_in.begin(), d_out.begin(), num_segments, d_offsets_it, d_offsets_it + 1); thrust::device_vector<cub::KeyValuePair<int, int>> expected{{0, 8}, {1, INT_MIN}, {3, 9}};
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading input items (of some type
T
) (may be a simple pointer type)OutputIteratorT – [inferred] Output iterator type for recording the reduced aggregate (having value type
KeyValuePair<int, T>
) (may be a simple pointer type)BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output aggregate
num_segments – [in] The number of segments that comprise the sorting data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length num_segments, such that
d_begin_offsets[i]
is the first element of the ith data segment ind_keys_*
andd_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments
, such thatd_end_offsets[i] - 1
is the last element of the ith data segment ind_keys_*
andd_values_*
. Ifd_end_offsets[i] - 1 <= d_begin_offsets[i]
, the ith is considered empty.stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.