cub::DeviceSegmentedScan

struct DeviceSegmentedScan

DeviceSegmentedScan provides device-wide, parallel operations for computing a batched prefix scan across multiple sequences of data items residing within device-accessible memory.

Overview

Given a sequence of input elements and a binary reduction operator, a prefix scan produces an output sequence in which each element is the reduction of the elements that occur earlier in the input sequence. A prefix sum is a prefix scan with the addition operator. In an inclusive scan, the ith output reduction incorporates the ith input; in an exclusive scan, it does not. A segmented scan applies this independently to each segment. When the input and output sequences are the same, the scan is performed in-place.

In order to provide an efficient parallel implementation, the binary reduction operator must be associative. That is, op(op(a, b), c) must be equivalent to op(a, op(b, c)) for any input values a, b, and c.
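To make the inclusive/exclusive distinction concrete, here is a minimal sequential sketch in host-side C++. It is a reference model of the definitions above, not how CUB computes a scan, and the names inclusive_sum_ref and exclusive_sum_ref are illustrative assumptions rather than part of the library.

```cpp
#include <cassert>
#include <cstddef>

// Sequential reference model (not the CUB implementation).
// Inclusive prefix sum: out[i] = in[0] + ... + in[i].
void inclusive_sum_ref(const int* in, int* out, std::size_t n) {
  assert(n == 0 || (in != nullptr && out != nullptr));
  int running = 0;
  for (std::size_t i = 0; i < n; ++i) {
    running += in[i];  // in[i] is read before out[i] is written, so in == out is safe
    out[i] = running;
  }
}

// Exclusive prefix sum: out[0] = 0 and out[i] = in[0] + ... + in[i - 1].
void exclusive_sum_ref(const int* in, int* out, std::size_t n) {
  assert(n == 0 || (in != nullptr && out != nullptr));
  int running = 0;
  for (std::size_t i = 0; i < n; ++i) {
    const int x = in[i];  // read first so the in-place case (in == out) works
    out[i] = running;
    running += x;
  }
}
```

For the input [8, 6, 7], the inclusive sum is [8, 14, 21] and the exclusive sum is [0, 8, 14]. Both functions tolerate in == out, which mirrors the in-place case described above.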

Usage Considerations

  • Dynamic parallelism. DeviceSegmentedScan methods can be called within kernel code on devices that support CUDA dynamic parallelism.

Public Static Functions

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT>
static inline cudaError_t ExclusiveSegmentedSum(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
::cuda::std::int64_t num_segments,
cudaStream_t stream = 0
)

Computes a device-wide segmented exclusive prefix sum.

  • Results are not deterministic for computation of prefix sum on floating-point types and may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.

Preconditions

  • When d_in and d_out are equal, the segmented scan is performed in-place. The input and output ranges spanned by the segments shall not overlap in any other way.

  • d_in and d_out must not be null pointers.

Snippet

The code snippet below illustrates the exclusive segmented prefix sum of an int device vector.

#include <cub/cub.cuh>
// or, equivalently
// #include <cub/device/device_segmented_scan.cuh>

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int  num_segments;   // e.g., 3
int  *d_in;          // e.g., [8, 6, 7, 5, 3, -2, 9]
int  *d_offsets;     // e.g., [0, 2, 5, 7]
int  *d_out;         // e.g., [ ,  ,  ,  ,  ,  ,  ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSegmentedScan::ExclusiveSegmentedSum(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, d_offsets, d_offsets + 1, num_segments);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run exclusive segmented prefix sum
cub::DeviceSegmentedScan::ExclusiveSegmentedSum(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, d_offsets, d_offsets + 1, num_segments);

// d_out <-- [0, 8, 0, 7, 12, 0, -2]

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in and in d_out.

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] <= d_in_begin_offsets[i], the ith segment is considered empty.

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream 0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename BeginOffsetIteratorOutputT>
static inline cudaError_t ExclusiveSegmentedSum(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
BeginOffsetIteratorOutputT d_out_begin_offsets,
::cuda::std::int64_t num_segments,
cudaStream_t stream = 0
)

Computes a device-wide segmented exclusive prefix sum.

  • Results are not deterministic for computation of prefix sum on floating-point types and may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.

Snippet

The code snippet below illustrates the exclusive segmented prefix sum of an int device vector.
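As a hedged stand-in for the device call, the sketch below is a sequential host-side C++ reference model of the result this overload is expected to produce when the output segments start at different offsets from the input segments. The function name exclusive_segmented_sum_ref and the use of std::vector are assumptions made for illustration; the CUB call itself takes device-accessible iterators as listed in the signature above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference model (not the CUB implementation).
// Segment s reads in[in_begin[s] .. in_end[s]) and writes the same number of
// elements to out, starting at out_begin[s].
void exclusive_segmented_sum_ref(const std::vector<int>& in,
                                 std::vector<int>& out,
                                 const std::vector<int>& in_begin,
                                 const std::vector<int>& in_end,
                                 const std::vector<int>& out_begin) {
  assert(in_begin.size() == in_end.size());
  assert(in_begin.size() == out_begin.size());
  for (std::size_t s = 0; s < in_begin.size(); ++s) {
    int running = 0;  // each segment restarts from zero
    // A segment with in_end[s] <= in_begin[s] is empty: the loop body never runs.
    for (int i = in_begin[s]; i < in_end[s]; ++i) {
      out[out_begin[s] + (i - in_begin[s])] = running;
      running += in[i];
    }
  }
}
```

With in = [8, 6, 7, 5, 3, -2, 9], input begin offsets [0, 2, 5], input end offsets [2, 5, 7], output begin offsets [0, 3, 7], and a nine-element output pre-filled with -1, the model yields [0, 8, -1, 0, 7, 12, -1, 0, -2]. The two -1 entries are gaps that no segment writes to.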

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

  • BeginOffsetIteratorOutputT[inferred] Random-access input iterator type for reading segment beginning offsets in the output sequence (may be a simple pointer type)

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] <= d_in_begin_offsets[i], the ith segment is considered empty.

  • d_out_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_out_begin_offsets[i] is the first element of the ith data segment in d_out

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream 0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename ScanOpT, typename InitValueT>
static inline cudaError_t ExclusiveSegmentedScan(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
::cuda::std::int64_t num_segments,
ScanOpT scan_op,
InitValueT init_value,
cudaStream_t stream = 0
)

Computes a device-wide segmented exclusive prefix scan using the specified binary associative scan_op functor. init_value is applied as the initial value and is assigned to the first element of each output segment.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.

Snippet

The code snippet below illustrates the exclusive segmented prefix scan of an int device vector.
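As a hedged stand-in for the device call, the sketch below models the documented semantics on the host: every segment starts from init_value, which becomes the first output of that segment, and scan_op folds in each input in order. The names exclusive_segmented_scan_ref and MaxOp are illustrative assumptions, not CUB types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative binary associative functor: maximum of two ints.
struct MaxOp {
  int operator()(const int& a, const int& b) const { return a > b ? a : b; }
};

// Host-side reference model (not the CUB implementation).
// out[begin[s]] = init, then out[i] = op(...op(op(init, in[begin[s]]), ...), in[i - 1]).
template <typename T, typename ScanOp>
std::vector<T> exclusive_segmented_scan_ref(const std::vector<T>& in,
                                            const std::vector<int>& begin,
                                            const std::vector<int>& end,
                                            ScanOp op,
                                            T init) {
  assert(begin.size() == end.size());
  std::vector<T> out(in.size());
  for (std::size_t s = 0; s < begin.size(); ++s) {
    T running = init;  // the seed is re-applied at the start of every segment
    for (int i = begin[s]; i < end[s]; ++i) {
      out[i] = running;               // exclusive: in[i] is not yet included
      running = op(running, in[i]);   // left operand is the running prefix
    }
  }
  return out;
}
```

With in = [8, 6, 7, 5, 3, -2, 9], segments [0, 2), [2, 5), [5, 7), MaxOp, and init_value 0, the model yields [0, 8, 0, 7, 7, 0, 0]. The last element is 0 rather than -2 because init_value takes part in every reduction.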

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

  • ScanOpT[inferred] Binary associative scan functor type having member T operator()(const T &a, const T &b)

  • InitValueT[inferred] Type of the init_value

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in and in d_out

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] <= d_in_begin_offsets[i], the ith segment is considered empty.

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • scan_op[in] Binary associative scan functor

  • init_value[in] Initial value to seed the exclusive scan for each segment in the output sequence

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream 0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename BeginOffsetIteratorOutputT, typename ScanOpT, typename InitValueT>
static inline cudaError_t ExclusiveSegmentedScan(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
BeginOffsetIteratorOutputT d_out_begin_offsets,
::cuda::std::int64_t num_segments,
ScanOpT scan_op,
InitValueT init_value,
cudaStream_t stream = 0
)

Computes a device-wide segmented exclusive prefix scan using the specified binary associative scan_op functor. init_value is applied as the initial value and is assigned to the first element of each output segment.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

  • BeginOffsetIteratorOutputT[inferred] Random-access input iterator type for reading segment beginning offsets in the output sequence (may be a simple pointer type)

  • ScanOpT[inferred] Binary associative scan functor type having member T operator()(const T &a, const T &b)

  • InitValueT[inferred] Type of the init_value

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] <= d_in_begin_offsets[i], the ith segment is considered empty.

  • d_out_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_out_begin_offsets[i] is the first element of the ith data segment in d_out

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • scan_op[in] Binary associative scan functor

  • init_value[in] Initial value to seed the exclusive scan for each segment in the output sequence

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream 0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT>
static inline cudaError_t InclusiveSegmentedSum(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
::cuda::std::int64_t num_segments,
cudaStream_t stream = 0
)

Computes a device-wide segmented inclusive prefix sum.

  • Results are not deterministic for computation of prefix sum on floating-point types and may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.

Snippet

The code snippet below illustrates the inclusive segmented prefix sum of an int device vector.
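As a hedged stand-in for the device call, the sketch below is a sequential host-side model of the inclusive segmented sum. It takes raw pointers so that one offsets array of length num_segments + 1 can serve as both the begin and the end offsets, the same aliasing pattern (d_offsets and d_offsets + 1) used in the ExclusiveSegmentedSum snippet. The name inclusive_segmented_sum_ref is an illustrative assumption.

```cpp
#include <cassert>
#include <cstdint>

// Host-side reference model (not the CUB implementation).
// For each segment s, out[i] = in[begin_offsets[s]] + ... + in[i].
void inclusive_segmented_sum_ref(const int* in,
                                 int* out,
                                 const int* begin_offsets,
                                 const int* end_offsets,
                                 std::int64_t num_segments) {
  assert(num_segments >= 0);
  for (std::int64_t s = 0; s < num_segments; ++s) {
    int running = 0;  // each segment restarts from zero
    for (int i = begin_offsets[s]; i < end_offsets[s]; ++i) {
      running += in[i];  // inclusive: in[i] is added before the write
      out[i] = running;
    }
  }
}
```

With in = [8, 6, 7, 5, 3, -2, 9] and offsets = [0, 2, 5, 7], calling the model with offsets as the begin offsets and offsets + 1 as the end offsets for three segments yields [8, 14, 7, 12, 15, -2, 7].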

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in and in d_out

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] <= d_in_begin_offsets[i], the ith segment is considered empty.

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream 0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename BeginOffsetIteratorOutputT>
static inline cudaError_t InclusiveSegmentedSum(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
BeginOffsetIteratorOutputT d_out_begin_offsets,
::cuda::std::int64_t num_segments,
cudaStream_t stream = 0
)

Computes a device-wide segmented inclusive prefix sum.

  • Results are not deterministic for computation of prefix sum on floating-point types and may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.

Snippet

The code snippet below illustrates the inclusive segmented prefix sum of an int device vector.
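As a hedged stand-in for the device call, the sketch below models one plausible use of separate output offsets: the input rows are padded, and the scanned rows are written densely packed. It is a sequential host-side reference model under that assumption; the name inclusive_segmented_sum_packed_ref and the padding value 99 are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference model (not the CUB implementation).
// Segment s reads in[in_begin[s] .. in_end[s]) and writes its inclusive sums
// to out, starting at out_begin[s].
void inclusive_segmented_sum_packed_ref(const std::vector<int>& in,
                                        std::vector<int>& out,
                                        const std::vector<int>& in_begin,
                                        const std::vector<int>& in_end,
                                        const std::vector<int>& out_begin) {
  assert(in_begin.size() == in_end.size());
  assert(in_begin.size() == out_begin.size());
  for (std::size_t s = 0; s < in_begin.size(); ++s) {
    int running = 0;
    for (int i = in_begin[s]; i < in_end[s]; ++i) {
      running += in[i];
      out[out_begin[s] + (i - in_begin[s])] = running;
    }
  }
}
```

With in = [1, 2, 99, 3, 4, 99, 5, 6], input begin offsets [0, 3, 6], input end offsets [2, 5, 8], and output begin offsets [0, 2, 4], the model yields the six-element output [1, 3, 3, 7, 5, 11]. The padding entries at input positions 2 and 5 belong to no segment and are never read.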

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

  • BeginOffsetIteratorOutputT[inferred] Random-access input iterator type for reading segment beginning offsets in the output sequence (may be a simple pointer type)

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] <= d_in_begin_offsets[i], the ith segment is considered empty.

  • d_out_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_out_begin_offsets[i] is the first element of the ith data segment in d_out

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream 0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename ScanOpT>
static inline cudaError_t InclusiveSegmentedScan(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
::cuda::std::int64_t num_segments,
ScanOpT scan_op,
cudaStream_t stream = 0
)

Computes a device-wide segmented inclusive prefix scan using the specified binary associative scan_op functor.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

  • ScanOpT[inferred] Binary associative scan functor type having member T operator()(const T &a, const T &b)

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in and in d_out

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] <= d_in_begin_offsets[i], the ith segment is considered empty.

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • scan_op[in] Binary associative scan functor

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream 0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename BeginOffsetIteratorOutputT, typename ScanOpT>
static inline cudaError_t InclusiveSegmentedScan(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
BeginOffsetIteratorOutputT d_out_begin_offsets,
::cuda::std::int64_t num_segments,
ScanOpT scan_op,
cudaStream_t stream = 0
)

Computes a device-wide segmented inclusive prefix scan using the specified binary associative scan_op functor.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.

Snippet

The code snippet below illustrates the inclusive segmented prefix scan of an int device vector.
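As a hedged stand-in for the device call, the sketch below models an inclusive segmented scan with a scan operator that is associative but not commutative. The operator TakeFirst returns its left operand, so the scan broadcasts the first element of each segment across that segment. The names TakeFirst and inclusive_segmented_scan_ref are illustrative assumptions, and the model is sequential host-side C++, not the CUB implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Associative, non-commutative functor: op(a, b) = a.
// op(op(a, b), c) == a == op(a, op(b, c)), but op(a, b) != op(b, a) in general.
struct TakeFirst {
  int operator()(const int& a, const int&) const { return a; }
};

// Host-side reference model (not the CUB implementation).
template <typename T, typename ScanOp>
void inclusive_segmented_scan_ref(const std::vector<T>& in,
                                  std::vector<T>& out,
                                  const std::vector<int>& in_begin,
                                  const std::vector<int>& in_end,
                                  const std::vector<int>& out_begin,
                                  ScanOp op) {
  assert(in_begin.size() == in_end.size());
  assert(in_begin.size() == out_begin.size());
  for (std::size_t s = 0; s < in_begin.size(); ++s) {
    if (in_end[s] <= in_begin[s]) continue;  // empty segment: nothing is written
    T running = in[in_begin[s]];             // no initial value: start from the first input
    out[out_begin[s]] = running;
    for (int i = in_begin[s] + 1; i < in_end[s]; ++i) {
      running = op(running, in[i]);          // operand order matters for non-commutative ops
      out[out_begin[s] + (i - in_begin[s])] = running;
    }
  }
}
```

With in = [8, 6, 7, 5, 3, -2, 9], segments [0, 2), [2, 5), [5, 7), and output begin offsets equal to the input begin offsets, the model yields [8, 8, 7, 7, 7, -2, -2]. Swapping the operands inside the loop would give a different result, which is why support for non-commutative operators is called out.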

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

  • BeginOffsetIteratorOutputT[inferred] Random-access input iterator type for reading segment beginning offsets in the output sequence (may be a simple pointer type)

  • ScanOpT[inferred] Binary associative scan functor type having member T operator()(const T &a, const T &b)

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] <= d_in_begin_offsets[i], the ith segment is considered empty.

  • d_out_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_out_begin_offsets[i] is the first element of the ith data segment in d_out

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • scan_op[in] Binary associative scan functor

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream 0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename ScanOpT, typename InitValueT>
static inline cudaError_t InclusiveSegmentedScanInit(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
::cuda::std::int64_t num_segments,
ScanOpT scan_op,
InitValueT init_value,
cudaStream_t stream = 0
)

Computes a device-wide segmented inclusive prefix scan using the specified binary associative scan_op functor. The result of applying scan_op to init_value and the first value of each input segment is assigned to the first value of the corresponding output segment.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.

Snippet

The code snippet below illustrates the inclusive segmented prefix scan of an int device vector with an initial value.
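As a hedged stand-in for the device call, the sketch below models the documented rule for the seeded inclusive scan: the first output of each segment is scan_op(init_value, first input), and every later output folds in one more input. The name inclusive_segmented_scan_init_ref is an illustrative assumption, and the model is sequential host-side C++, not the CUB implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Host-side reference model (not the CUB implementation).
// For each segment s, out[i] = op(...op(op(init, in[begin[s]]), ...), in[i]).
template <typename T, typename ScanOp>
std::vector<T> inclusive_segmented_scan_init_ref(const std::vector<T>& in,
                                                 const std::vector<int>& begin,
                                                 const std::vector<int>& end,
                                                 ScanOp op,
                                                 T init) {
  assert(begin.size() == end.size());
  std::vector<T> out(in.size());
  for (std::size_t s = 0; s < begin.size(); ++s) {
    T running = init;  // the seed is re-applied at the start of every segment
    for (int i = begin[s]; i < end[s]; ++i) {
      running = op(running, in[i]);  // inclusive: fold in[i] before writing
      out[i] = running;
    }
  }
  return out;
}
```

With in = [8, 6, 7, 5, 3, -2, 9], segments [0, 2), [2, 5), [5, 7), std::plus<int> as the operator, and init_value 10, the model yields [18, 24, 17, 22, 25, 8, 17]. An exclusive scan with the same seed would instead begin each segment with 10 itself.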

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

  • ScanOpT[inferred] Binary associative scan functor type having member T operator()(const T &a, const T &b)

  • InitValueT[inferred] Type of the init_value

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in and in d_out

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] <= d_in_begin_offsets[i], the ith segment is considered empty.

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • scan_op[in] Binary associative scan functor

  • init_value[in] Initial value to seed the inclusive scan for each segment in the output sequence

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream 0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename BeginOffsetIteratorOutputT, typename ScanOpT, typename InitValueT>
static inline cudaError_t InclusiveSegmentedScanInit(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
BeginOffsetIteratorOutputT d_out_begin_offsets,
::cuda::std::int64_t num_segments,
ScanOpT scan_op,
InitValueT init_value,
cudaStream_t stream = 0
)

Computes a device-wide segmented inclusive prefix scan using the specified binary associative scan_op functor. The result of applying scan_op to init_value and the first value of each input segment is assigned to the first value of the corresponding output segment.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

  • BeginOffsetIteratorOutputT[inferred] Random-access input iterator type for reading segment beginning offsets in the output sequence (may be a simple pointer type)

  • ScanOpT[inferred] Binary associative scan functor type having member T operator()(const T &a, const T &b)

  • InitValueT[inferred] Type of the init_value

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] <= d_in_begin_offsets[i], the ith segment is considered empty.

  • d_out_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_out_begin_offsets[i] is the first element of the ith data segment in d_out

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • scan_op[in] Binary associative scan functor

  • init_value[in] Initial value to seed the inclusive scan for each segment in the output sequence

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream 0.