cub::DeviceSegmentedScan#
-
struct DeviceSegmentedScan#
DeviceSegmentedScan provides device-wide, parallel operations for computing a batched prefix scan across multiple sequences of data items residing within device-accessible memory.
Overview#
Given a sequence of input elements and a binary reduction operator, a prefix scan produces an output sequence where each element is computed to be the reduction of the elements occurring earlier in the input sequence. Prefix sum connotes a prefix scan with the addition operator. The term inclusive indicates that the ith output reduction incorporates the ith input. The term exclusive indicates the ith input is not incorporated into the ith output reduction. When the input and output sequences are the same, the scan is performed in-place.
In order to provide an efficient parallel implementation, the binary reduction operator must be associative. That is,
op(op(a, b), c)must be equivalent toop(a, op(b, c))for any input valuesa,b, andc.Usage Considerations#
Dynamic parallelism. DeviceSegmentedScan methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
Public Static Functions
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT>
static inline cudaError_t ExclusiveSegmentedSum( - void *d_temp_storage,
- size_t &temp_storage_bytes,
- InputIteratorT d_in,
- OutputIteratorT d_out,
- BeginOffsetIteratorInputT d_in_begin_offsets,
- EndOffsetIteratorInputT d_in_end_offsets,
- ::cuda::std::int64_t num_segments,
- cudaStream_t stream = 0,
Computes a device-wide segmented exclusive prefix sum.
Results are not deterministic for computation of prefix sum on floating-point types and may vary from run to run.
When
d_inandd_outare equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.When
d_temp_storageisnullptr, no work is done and the required allocation size is returned intemp_storage_bytes.
Preconditions#
When
d_inandd_outare equal, the segmented scan is performed in-place. The range[d_in, d_in + num_items_in)and[d_out, d_out + num_items_out)shall not overlap in any other way.d_inandd_outmust not be null pointers
Snippet#
The code snippet below illustrates the exclusive segmented prefix sum of an
intdevice vector.#include <cub/cub.cuh> // or, equivalently // #include <cub/device/device_segmented_scan.cuh> // Declare, allocate, and initialize device-accessible pointers for // input and output int num_segments; // e.g., 3 int *d_in; // e.g., [8, 6, 7, 5, 3, -2, 9] int *d_offsets; // e.g., [0, 2, 5, 7] int *d_out; // e.g., [ , , , , , , ] ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceScan::ExclusiveSegmentedSum( d_temp_storage, temp_storage_bytes, d_in, d_out, d_offsets, d_offsets + 1, num_segments); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run exclusive prefix sum cub::DeviceScan::ExclusiveSegmentedSum( d_temp_storage, temp_storage_bytes, d_in, d_out, d_offsets, d_offsets + 1, num_segments); // d_out <-- [0, 8, 0, 7, 12, 0, -2]
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)
BeginOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)
EndOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr, the required allocation size is written totemp_storage_bytesand no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storageallocationd_in – [in] Random-access iterator to the input sequence of data items
d_out – [out] Random-access iterator to the output sequence of data items
d_in_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments, such thatd_in_begin_offsets[i]is the first element of the ith data segment ind_inand ind_out.d_in_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments, such thatd_in_end_offsets[i] - 1is the last element of the ith data segment ind_in. Ifd_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.num_segments – [in] The number of segments that comprise the segmented prefix scan data.
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename BeginOffsetIteratorOutputT>
static inline cudaError_t ExclusiveSegmentedSum( - void *d_temp_storage,
- size_t &temp_storage_bytes,
- InputIteratorT d_in,
- OutputIteratorT d_out,
- BeginOffsetIteratorInputT d_in_begin_offsets,
- EndOffsetIteratorInputT d_in_end_offsets,
- BeginOffsetIteratorOutputT d_out_begin_offsets,
- ::cuda::std::int64_t num_segments,
- cudaStream_t stream = 0,
Computes a device-wide segmented exclusive prefix sum.
Results are not deterministic for computation of prefix sum on floating-point types and may vary from run to run.
When
d_inandd_outare equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.When
d_temp_storageisnullptr, no work is done and the required allocation size is returned intemp_storage_bytes.
Snippet#
The code snippet below illustrates the exclusive segmented prefix sum of an
intdevice vector.// Sequence of 16 values, representing 4x4 matrix in row-major layout auto input = thrust::device_vector<int>{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}; // offsets to starts of each of 4 rows size_t row_size = 4; auto in_begin_offsets = thrust::device_vector<size_t>{0, row_size, 2 * row_size}; auto num_segments = in_begin_offsets.size(); // Perform row-wise sum for 3-by-3 principal sub-matrix size_t segment_size = 3; auto in_end_offsets = thrust::device_vector<size_t>{ 0 * row_size + segment_size, 1 * row_size + segment_size, 2 * row_size + segment_size}; auto output = thrust::device_vector<int>(num_segments * segment_size, thrust::no_init); auto out_begin_offsets = thrust::device_vector<size_t>{0, segment_size, 2 * segment_size}; void* temp_storage = nullptr; size_t temp_storage_bytes = 0; auto d_in_beg_offsets = in_begin_offsets.begin(); auto d_in_end_offsets = in_end_offsets.begin(); auto d_out_beg_offsets = out_begin_offsets.begin(); auto d_in = input.begin(); auto d_out = output.begin(); // get size of required storage and allocate auto status = cub::DeviceSegmentedScan::ExclusiveSegmentedSum( temp_storage, temp_storage_bytes, d_in, d_out, d_in_beg_offsets, d_in_end_offsets, d_out_beg_offsets, num_segments); check_execution_status(status, algo_name); status = cudaMalloc(&temp_storage, temp_storage_bytes); check_execution_status(status, "cudaMalloc"); // run the algorithm status = cub::DeviceSegmentedScan::ExclusiveSegmentedSum( temp_storage, temp_storage_bytes, d_in, d_out, d_in_beg_offsets, d_in_end_offsets, d_out_beg_offsets, num_segments); check_execution_status(status, algo_name); thrust::device_vector<int> expected{0, 1, 3, 0, 5, 11, 0, 9, 19};
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)
BeginOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)
EndOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)
BeginOffsetIteratorOutputT – [inferred] Random-access input iterator type for reading segment beginning offsets in the output sequence (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr, the required allocation size is written totemp_storage_bytesand no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storageallocationd_in – [in] Random-access iterator to the input sequence of data items
d_out – [out] Random-access iterator to the output sequence of data items
d_in_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments, such thatd_in_begin_offsets[i]is the first element of the ith data segment ind_ind_in_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments, such thatd_in_end_offsets[i] - 1is the last element of the ith data segment ind_in. Ifd_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.d_out_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments, such thatd_out_begin_offsets[i]is the first element of the ith data segment ind_outnum_segments – [in] The number of segments that comprise the segmented prefix scan data.
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename ScanOpT, typename InitValueT>
static inline cudaError_t ExclusiveSegmentedScan( - void *d_temp_storage,
- size_t &temp_storage_bytes,
- InputIteratorT d_in,
- OutputIteratorT d_out,
- BeginOffsetIteratorInputT d_in_begin_offsets,
- EndOffsetIteratorInputT d_in_end_offsets,
- ::cuda::std::int64_t num_segments,
- ScanOpT scan_op,
- InitValueT init_value,
- cudaStream_t stream = 0,
Computes a device-wide segmented exclusive prefix scan using the specified binary associative
scan_opfunctor. Theinit_valuevalue is applied as the initial value, and is assigned to the first element in each output segment.Supports non-commutative scan operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.
When
d_inandd_outare equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.When
d_temp_storageisnullptr, no work is done and the required allocation size is returned intemp_storage_bytes.
Snippet#
The code snippet below illustrates the exclusive segmented prefix scan of an
intdevice vector./* Compute exclusive scan using addition of GF(2) field represented * over boolean values stored as bits in unsigned integer, where addition is bitwise XOR. * Each unsigned integer represents 32-long tuple of GF(2) values */ auto scan_op = [] __host__ __device__(unsigned v1, unsigned v2) -> unsigned { return v1 ^ v2; }; unsigned init_value = 0u; // 128 input elements // auto input = thrust::device_vector<unsigned>{0x64b40b1b, 0x7bf23c0c, 0xaa982e07, ... }; // 4 segments auto offsets = thrust::device_vector<unsigned>{0, 40, 77, 101, 128}; auto output = thrust::device_vector<unsigned>(input.size(), thrust::no_init); void* temp_storage = nullptr; size_t temp_storage_bytes = 0; auto d_in = input.begin(); auto d_out = output.begin(); auto begin_offsets = offsets.begin(); auto end_offsets = offsets.begin() + 1; size_t num_segments = offsets.size() - 1; // inquire size of needed temporary storage and allocate auto status = cub::DeviceSegmentedScan::ExclusiveSegmentedScan( temp_storage, temp_storage_bytes, d_in, d_out, begin_offsets, end_offsets, num_segments, scan_op, init_value); check_execution_status(status, algo_name); status = cudaMalloc(&temp_storage, temp_storage_bytes); check_execution_status(status, "cudaMalloc"); // run the algorithm status = cub::DeviceSegmentedScan::ExclusiveSegmentedScan( temp_storage, temp_storage_bytes, d_in, d_out, begin_offsets, end_offsets, num_segments, scan_op, init_value); check_execution_status(status, algo_name);
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)
BeginOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)
EndOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)
ScanOpT – [inferred] Binary associative scan functor type having member
T operator()(const T &a, const T &b)InitValueT – [inferred] Type of the
init_value
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr, the required allocation size is written totemp_storage_bytesand no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storageallocationd_in – [in] Random-access iterator to the input sequence of data items
d_out – [out] Random-access iterator to the output sequence of data items
d_in_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments, such thatd_in_begin_offsets[i]is the first element of the ith data segment ind_inand ind_outd_in_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments, such thatd_in_end_offsets[i] - 1is the last element of the ith data segment ind_in. Ifd_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.num_segments – [in] The number of segments that comprise the segmented prefix scan data.
scan_op – [in] Binary associative scan functor
init_value – [in] Initial value to seed the exclusive scan for each segment in the output sequence
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename BeginOffsetIteratorOutputT, typename ScanOpT, typename InitValueT>
static inline cudaError_t ExclusiveSegmentedScan( - void *d_temp_storage,
- size_t &temp_storage_bytes,
- InputIteratorT d_in,
- OutputIteratorT d_out,
- BeginOffsetIteratorInputT d_in_begin_offsets,
- EndOffsetIteratorInputT d_in_end_offsets,
- BeginOffsetIteratorOutputT d_out_begin_offsets,
- ::cuda::std::int64_t num_segments,
- ScanOpT scan_op,
- InitValueT init_value,
- cudaStream_t stream = 0,
Computes a device-wide segmented exclusive prefix scan using the specified binary associative
scan_opfunctor. Theinit_valuevalue is applied as the initial value, and is assigned to the first element in each output segment.Supports non-commutative scan operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.
When
d_inandd_outare equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.When
d_temp_storageisnullptr, no work is done and the required allocation size is returned intemp_storage_bytes.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)
BeginOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)
EndOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)
BeginOffsetIteratorOutputT – [inferred] Random-access input iterator type for reading segment beginning offsets in the output sequence (may be a simple pointer type)
ScanOpT – [inferred] Binary associative scan functor type having member
T operator()(const T &a, const T &b)InitValueT – [inferred] Type of the
init_value
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr, the required allocation size is written totemp_storage_bytesand no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storageallocationd_in – [in] Random-access iterator to the input sequence of data items
d_out – [out] Random-access iterator to the output sequence of data items
d_in_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments, such thatd_in_begin_offsets[i]is the first element of the ith data segment ind_ind_in_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments, such thatd_in_end_offsets[i] - 1is the last element of the ith data segment ind_in. Ifd_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.d_out_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments, such thatd_out_begin_offsets[i]is the first element of the ith data segment ind_outnum_segments – [in] The number of segments that comprise the segmented prefix scan data.
scan_op – [in] Binary associative scan functor
init_value – [in] Initial value to seed the exclusive scan for each segment in the output sequence
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT>
static inline cudaError_t InclusiveSegmentedSum( - void *d_temp_storage,
- size_t &temp_storage_bytes,
- InputIteratorT d_in,
- OutputIteratorT d_out,
- BeginOffsetIteratorInputT d_in_begin_offsets,
- EndOffsetIteratorInputT d_in_end_offsets,
- ::cuda::std::int64_t num_segments,
- cudaStream_t stream = 0,
Computes a device-wide segmented inclusive prefix sum.
Results are not deterministic for computation of prefix sum on floating-point types and may vary from run to run.
When
d_inandd_outare equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.When
d_temp_storageisnullptr, no work is done and the required allocation size is returned intemp_storage_bytes.
Snippet#
The code snippet below illustrates the inclusive segmented prefix sum of an
intdevice vector.auto input = thrust::device_vector<int>{2, 1, 1, 2, 1, 2, 1, 1}; auto offsets = thrust::device_vector<size_t>{0, 3, 5, 8}; void* temp_storage = nullptr; size_t temp_storage_bytes = 0; auto begin_offsets = offsets.begin(); auto end_offsets = begin_offsets + 1; auto num_segments = offsets.size() - 1; auto d_in = input.begin(); // get size of requires storage and allocate auto status = cub::DeviceSegmentedScan::InclusiveSegmentedSum( temp_storage, temp_storage_bytes, d_in, d_in, begin_offsets, end_offsets, num_segments); check_execution_status(status, algo_name); status = cudaMalloc(&temp_storage, temp_storage_bytes); check_execution_status(status, "cudaMalloc"); // execute the algorithm status = cub::DeviceSegmentedScan::InclusiveSegmentedSum( temp_storage, temp_storage_bytes, d_in, d_in, begin_offsets, end_offsets, num_segments); check_execution_status(status, algo_name); thrust::device_vector<int> expected{2, 3, 4, 2, 3, 2, 3, 4};
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)
BeginOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)
EndOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)
ScanOpT – [inferred] Binary associative scan functor type having member
T operator()(const T &a, const T &b)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr, the required allocation size is written totemp_storage_bytesand no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storageallocationd_in – [in] Random-access iterator to the input sequence of data items
d_out – [out] Random-access iterator to the output sequence of data items
d_in_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments, such thatd_in_begin_offsets[i]is the first element of the ith data segment ind_inand ind_outd_in_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments, such thatd_in_end_offsets[i] - 1is the last element of the ith data segment ind_in. Ifd_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.num_segments – [in] The number of segments that comprise the segmented prefix scan data.
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename BeginOffsetIteratorOutputT>
static inline cudaError_t InclusiveSegmentedSum( - void *d_temp_storage,
- size_t &temp_storage_bytes,
- InputIteratorT d_in,
- OutputIteratorT d_out,
- BeginOffsetIteratorInputT d_in_begin_offsets,
- EndOffsetIteratorInputT d_in_end_offsets,
- BeginOffsetIteratorOutputT d_out_begin_offsets,
- ::cuda::std::int64_t num_segments,
- cudaStream_t stream = 0,
Computes a device-wide segmented inclusive prefix sum.
Results are not deterministic for computation of prefix sum on floating-point types and may vary from run to run.
When
d_inandd_outare equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.When
d_temp_storageisnullptr, no work is done and the required allocation size is returned intemp_storage_bytes.
Snippet#
The code snippet below illustrates the inclusive segmented prefix sum of an
intdevice vector.// Sequence of 16 values, representing 4x4 matrix in row-major layout auto input = thrust::device_vector<int>{1, 1, 1, 1, -1, -1, -1, -1, 2, 2, 2, 2, -2, -2, -2, -2}; // begin offsets for each of 4 rows size_t m = 4; auto row_offsets = thrust::device_vector<size_t>{0, m, 2 * m, 3 * m, 4 * m}; auto num_segments = row_offsets.size() - 1; // Allocate m rows of m + 1 filled with zero-initialized values auto output = thrust::device_vector<int>((m + 1) * m, 0); // begin offsets to second element of each row size_t lda = m + 1; auto out_begin_offsets = thrust::device_vector<size_t>{1, lda + 1, 2 * lda + 1, 3 * lda + 1}; void* temp_storage = nullptr; size_t temp_storage_bytes = 0; auto d_in_beg_offsets = row_offsets.begin(); auto d_in_end_offsets = row_offsets.begin() + 1; auto d_out_beg_offsets = out_begin_offsets.begin(); auto d_in = input.begin(); auto d_out = output.begin(); // get size of temporary storage and allocate auto status = cub::DeviceSegmentedScan::InclusiveSegmentedSum( temp_storage, temp_storage_bytes, d_in, d_out, d_in_beg_offsets, d_in_end_offsets, d_out_beg_offsets, num_segments); check_execution_status(status, algo_name); status = cudaMalloc(&temp_storage, temp_storage_bytes); check_execution_status(status, algo_name); // Compute inclusive sum for each row prepended with 0 status = cub::DeviceSegmentedScan::InclusiveSegmentedSum( temp_storage, temp_storage_bytes, d_in, d_out, d_in_beg_offsets, d_in_end_offsets, d_out_beg_offsets, num_segments); check_execution_status(status, algo_name); std::vector<int> h_expected{}; h_expected.reserve(output.size()); std::vector<std::vector<int>> expected_rows{ {0, 1, 2, 3, 4}, {0, -1, -2, -3, -4}, {0, 2, 4, 6, 8}, {0, -2, -4, -6, -8}}; for (const auto& row : expected_rows) { h_expected.insert(h_expected.end(), row.begin(), row.end()); } auto expected = thrust::device_vector<int>{h_expected};
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)
BeginOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)
EndOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)
BeginOffsetIteratorOutputT – [inferred] Random-access input iterator type for reading segment beginning offsets in the output sequence (may be a simple pointer type)
ScanOpT – [inferred] Binary associative scan functor type having member
T operator()(const T &a, const T &b)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr, the required allocation size is written totemp_storage_bytesand no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storageallocationd_in – [in] Random-access iterator to the input sequence of data items
d_out – [out] Random-access iterator to the output sequence of data items
d_in_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments, such thatd_in_begin_offsets[i]is the first element of the ith data segment ind_ind_in_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments, such thatd_in_end_offsets[i] - 1is the last element of the ith data segment ind_in. Ifd_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.d_out_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments, such thatd_out_begin_offsets[i]is the first element of the ith data segment ind_outnum_segments – [in] The number of segments that comprise the segmented prefix scan data.
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename ScanOpT>
static inline cudaError_t InclusiveSegmentedScan( - void *d_temp_storage,
- size_t &temp_storage_bytes,
- InputIteratorT d_in,
- OutputIteratorT d_out,
- BeginOffsetIteratorInputT d_in_begin_offsets,
- EndOffsetIteratorInputT d_in_end_offsets,
- ::cuda::std::int64_t num_segments,
- ScanOpT scan_op,
- cudaStream_t stream = 0,
Computes a device-wide segmented inclusive prefix scan using the specified binary associative
scan_opfunctor.Supports non-commutative scan operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.
When
d_inandd_outare equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.When
d_temp_storageisnullptr, no work is done and the required allocation size is returned intemp_storage_bytes.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)
BeginOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)
EndOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)
ScanOpT – [inferred] Binary associative scan functor type having member
T operator()(const T &a, const T &b)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr, the required allocation size is written totemp_storage_bytesand no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storageallocationd_in – [in] Random-access iterator to the input sequence of data items
d_out – [out] Random-access iterator to the output sequence of data items
d_in_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments, such thatd_in_begin_offsets[i]is the first element of the ith data segment ind_inand ind_outd_in_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments, such thatd_in_end_offsets[i] - 1is the last element of the ith data segment ind_in. Ifd_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.num_segments – [in] The number of segments that comprise the segmented prefix scan data.
scan_op – [in] Binary associative scan functor
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename BeginOffsetIteratorOutputT, typename ScanOpT>
static inline cudaError_t InclusiveSegmentedScan( - void *d_temp_storage,
- size_t &temp_storage_bytes,
- InputIteratorT d_in,
- OutputIteratorT d_out,
- BeginOffsetIteratorInputT d_in_begin_offsets,
- EndOffsetIteratorInputT d_in_end_offsets,
- BeginOffsetIteratorOutputT d_out_begin_offsets,
- ::cuda::std::int64_t num_segments,
- ScanOpT scan_op,
- cudaStream_t stream = 0,
Computes a device-wide segmented inclusive prefix scan using the specified binary associative
scan_opfunctor.Supports non-commutative scan operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.
When
d_inandd_outare equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.When
d_temp_storageisnullptr, no work is done and the required allocation size is returned intemp_storage_bytes.
Snippet#
The code snippet below illustrates the exclusive segmented prefix sum of an
intdevice vector.size_t n = 8; thrust::device_vector<float> input{0.21f, 0.33f, 0.17f, 0.56f, 0.31f, 0.25f, 1.0f, 0.72f}; constexpr unsigned _zero{0}; auto _n = static_cast<unsigned>(n); auto counting_it = cuda::counting_iterator(_zero); auto in_begin_offsets = counting_it; auto in_end_offsets = cuda::constant_iterator(_n); // use stride n + 1 is the distance between consecutive diagonal elements in C-contiguous layout auto out_begin_offsets = cuda::strided_iterator(counting_it, _n + 1); // allocate and zero-initialize output matrix in C-contiguous layout auto output = thrust::device_vector<float>(n * n, 0.0f); auto d_in = input.begin(); auto d_out = output.begin(); auto scan_op = [] __host__ __device__(float v1, float v2) noexcept -> float { return cuda::maximum<>{}(v1, v2); }; void* temp_storage = nullptr; size_t temp_storage_bytes; // determine size of required temporary storage and allocate auto status = cub::DeviceSegmentedScan::InclusiveSegmentedScan( temp_storage, temp_storage_bytes, d_in, d_out, in_begin_offsets, in_end_offsets, out_begin_offsets, n, scan_op); check_execution_status(status, algo_name); status = cudaMalloc(&temp_storage, temp_storage_bytes); check_execution_status(status, "cudaMalloc"); // run the algorithm status = cub::DeviceSegmentedScan::InclusiveSegmentedScan( temp_storage, temp_storage_bytes, d_in, d_out, in_begin_offsets, in_end_offsets, out_begin_offsets, n, scan_op); check_execution_status(status, algo_name); thrust::device_vector<float> expected{ 0.21f, 0.33f, 0.33f, 0.56f, 0.56f, 0.56f, 1.00f, 1.00f, // row 0 0.00f, 0.33f, 0.33f, 0.56f, 0.56f, 0.56f, 1.00f, 1.00f, // row 1 0.00f, 0.00f, 0.17f, 0.56f, 0.56f, 0.56f, 1.00f, 1.00f, // row 2 0.00f, 0.00f, 0.00f, 0.56f, 0.56f, 0.56f, 1.00f, 1.00f, // row 3 0.00f, 0.00f, 0.00f, 0.00f, 0.31f, 0.31f, 1.00f, 1.00f, // row 4 0.00f, 0.00f, 0.00f, 0.00f, 0.00f, 0.25f, 1.00f, 1.00f, // row 5 0.00f, 0.00f, 0.00f, 0.00f, 0.00f, 0.00f, 1.00f, 1.00f, // row 6 0.00f, 0.00f, 0.00f, 0.00f, 0.00f, 0.00f, 0.00f, 0.72f // row 7 };
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)
BeginOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)
EndOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)
BeginOffsetIteratorOutputT – [inferred] Random-access input iterator type for reading segment beginning offsets in the output sequence (may be a simple pointer type)
ScanOpT – [inferred] Binary associative scan functor type having member
T operator()(const T &a, const T &b)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr, the required allocation size is written totemp_storage_bytesand no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storageallocationd_in – [in] Random-access iterator to the input sequence of data items
d_out – [out] Random-access iterator to the output sequence of data items
d_in_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments, such thatd_in_begin_offsets[i]is the first element of the ith data segment ind_ind_in_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments, such thatd_in_end_offsets[i] - 1is the last element of the ith data segment ind_in. Ifd_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.d_out_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments, such thatd_out_begin_offsets[i]is the first element of the ith data segment ind_outnum_segments – [in] The number of segments that comprise the segmented prefix scan data.
scan_op – [in] Binary associative scan functor
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename ScanOpT, typename InitValueT>
static inline cudaError_t InclusiveSegmentedScanInit( - void *d_temp_storage,
- size_t &temp_storage_bytes,
- InputIteratorT d_in,
- OutputIteratorT d_out,
- BeginOffsetIteratorInputT d_in_begin_offsets,
- EndOffsetIteratorInputT d_in_end_offsets,
- ::cuda::std::int64_t num_segments,
- ScanOpT scan_op,
- InitValueT init_value,
- cudaStream_t stream = 0,
Computes a device-wide segmented inclusive prefix scan using the specified binary associative
scan_opfunctor. The result of applying thescan_opbinary operator toinit_valuevalue and the first value in each input segment is assigned to the first value of the corresponding output segment.Supports non-commutative scan operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.
When
d_inandd_outare equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.When
d_temp_storageisnullptr, no work is done and the required allocation size is returned intemp_storage_bytes.
Snippet#
The code snippet below illustrates the exclusive segmented prefix scan of an
intdevice vector.int prime = 7; auto input = thrust::device_vector<int>{ 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6}; auto row_size = static_cast<size_t>(prime); auto row_offsets = thrust::device_vector<size_t>{0, row_size, 2 * row_size, 3 * row_size, 4 * row_size, 5 * row_size}; size_t num_segments = row_offsets.size() - 1; thrust::device_vector<int> output(input.size(), thrust::no_init); auto m_p = cuda::fast_mod_div(prime); // Binary operator to multiply arguments using modular arithmetic auto scan_op = [=] __host__ __device__(int v1, int v2) -> int { const auto proj_v1 = (v1 % m_p); const auto proj_v2 = (v2 % m_p); return (proj_v1 * proj_v2) % m_p; }; int init_value = 1; auto d_in = input.begin(); auto d_out = output.begin(); auto d_in_beg_offsets = row_offsets.begin(); auto d_in_end_offsets = row_offsets.begin() + 1; void* temp_storage = nullptr; size_t temp_storage_bytes = 0; // get size of temporary storage and allocate auto status = cub::DeviceSegmentedScan::InclusiveSegmentedScanInit( temp_storage, temp_storage_bytes, d_in, d_out, d_in_beg_offsets, d_in_end_offsets, num_segments, scan_op, init_value); check_execution_status(status, algo_name); status = cudaMalloc(&temp_storage, temp_storage_bytes); check_execution_status(status, "cudaMalloc"); // run the algorithm status = cub::DeviceSegmentedScan::InclusiveSegmentedScanInit( temp_storage, temp_storage_bytes, d_in, d_out, d_in_beg_offsets, d_in_end_offsets, num_segments, scan_op, init_value); check_execution_status(status, algo_name); std::vector<int> h_expected{}; h_expected.reserve(output.size()); std::vector<std::vector<int>> expected_rows{ {2, 4, 1, 2, 4, 1, 2}, {3, 2, 6, 4, 5, 1, 3}, {4, 2, 1, 4, 2, 1, 4}, {5, 4, 6, 2, 3, 1, 5}, {6, 1, 6, 1, 6, 1, 6}}; for (const auto& row : expected_rows) { h_expected.insert(h_expected.end(), row.begin(), row.end()); } auto expected = thrust::device_vector<int>{h_expected};
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)
BeginOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)
EndOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)
ScanOpT – [inferred] Binary associative scan functor type having member
T operator()(const T &a, const T &b)InitValueT – [inferred] Type of the
init_value
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr, the required allocation size is written totemp_storage_bytesand no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storageallocationd_in – [in] Random-access iterator to the input sequence of data items
d_out – [out] Random-access iterator to the output sequence of data items
d_in_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments, such thatd_in_begin_offsets[i]is the first element of the ith data segment ind_inand ind_outd_in_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments, such thatd_in_end_offsets[i] - 1is the last element of the ith data segment ind_in. Ifd_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.num_segments – [in] The number of segments that comprise the segmented prefix scan data.
scan_op – [in] Binary associative scan functor
init_value – [in] Initial value to seed the exclusive scan for each segment in the output sequence
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename BeginOffsetIteratorOutputT, typename ScanOpT, typename InitValueT>
static inline cudaError_t InclusiveSegmentedScanInit( - void *d_temp_storage,
- size_t &temp_storage_bytes,
- InputIteratorT d_in,
- OutputIteratorT d_out,
- BeginOffsetIteratorInputT d_in_begin_offsets,
- EndOffsetIteratorInputT d_in_end_offsets,
- BeginOffsetIteratorOutputT d_out_begin_offsets,
- ::cuda::std::int64_t num_segments,
- ScanOpT scan_op,
- InitValueT init_value,
- cudaStream_t stream = 0,
Computes a device-wide segmented inclusive prefix scan using the specified binary associative
scan_opfunctor. The result of applying thescan_opbinary operator toinit_valuevalue and the first value in each input segment is assigned to the first value of the corresponding output segment.Supports non-commutative scan operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.
When
d_inandd_outare equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.When
d_temp_storageisnullptr, no work is done and the required allocation size is returned intemp_storage_bytes.
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)
BeginOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)
EndOffsetIteratorInputT – [inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)
BeginOffsetIteratorOutputT – [inferred] Random-access input iterator type for reading segment beginning offsets in the output sequence (may be a simple pointer type)
ScanOpT – [inferred] Binary associative scan functor type having member
T operator()(const T &a, const T &b)InitValueT – [inferred] Type of the
init_value
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr, the required allocation size is written totemp_storage_bytesand no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storageallocationd_in – [in] Random-access iterator to the input sequence of data items
d_out – [out] Random-access iterator to the output sequence of data items
d_in_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments, such thatd_in_begin_offsets[i]is the first element of the ith data segment ind_ind_in_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments, such thatd_in_end_offsets[i] - 1is the last element of the ith data segment ind_in. Ifd_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.d_out_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments, such thatd_out_begin_offsets[i]is the first element of the ith data segment ind_outnum_segments – [in] The number of segments that comprise the segmented prefix scan data.
scan_op – [in] Binary associative scan functor
init_value – [in] Initial value to seed the exclusive scan for each segment in the output sequence
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.