cub::DeviceSegmentedScan#

struct DeviceSegmentedScan#

DeviceSegmentedScan provides device-wide, parallel operations for computing a batched prefix scan across multiple sequences of data items residing within device-accessible memory.

Overview#

Given a sequence of input elements and a binary reduction operator, a prefix scan produces an output sequence where each element is computed to be the reduction of the elements occurring earlier in the input sequence. Prefix sum connotes a prefix scan with the addition operator. The term inclusive indicates that the ith output reduction incorporates the ith input. The term exclusive indicates the ith input is not incorporated into the ith output reduction. When the input and output sequences are the same, the scan is performed in-place.

In order to provide an efficient parallel implementation, the binary reduction operator must be associative. That is, op(op(a, b), c) must be equivalent to op(a, op(b, c)) for any input values a, b, and c.

Usage Considerations#

  • Dynamic parallelism. DeviceSegmentedScan methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

Public Static Functions

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT>
static inline cudaError_t ExclusiveSegmentedSum(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
::cuda::std::int64_t num_segments,
cudaStream_t stream = 0,
)#

Computes a device-wide segmented exclusive prefix sum.

  • Results are not deterministic for computation of prefix sum on floating-point types and may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Preconditions#

  • When d_in and d_out are equal, the segmented scan is performed in-place. The range [d_in, d_in + num_items_in) and [d_out, d_out + num_items_out) shall not overlap in any other way.

  • d_in and d_out must not be null pointers

Snippet#

The code snippet below illustrates the exclusive segmented prefix sum of an int device vector.

#include <cub/cub.cuh>
// or, equivalently
// #include <cub/device/device_segmented_scan.cuh>

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int  num_segments;   // e.g., 3
int  *d_in;          // e.g., [8, 6, 7, 5, 3, -2, 9]
int  *d_offsets;     // e.g., [0, 2, 5, 7]
int  *d_out;         // e.g., [ ,  ,  ,  ,  ,  ,  ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveSegmentedSum(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, d_offsets, d_offsets + 1, num_segments);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run exclusive prefix sum
cub::DeviceScan::ExclusiveSegmentedSum(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, d_offsets, d_offsets + 1, num_segments);

// d_out <-- [0, 8, 0, 7, 12, 0, -2]

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in and in d_out.

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename BeginOffsetIteratorOutputT>
static inline cudaError_t ExclusiveSegmentedSum(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
BeginOffsetIteratorOutputT d_out_begin_offsets,
::cuda::std::int64_t num_segments,
cudaStream_t stream = 0,
)#

Computes a device-wide segmented exclusive prefix sum.

  • Results are not deterministic for computation of prefix sum on floating-point types and may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet#

The code snippet below illustrates the exclusive segmented prefix sum of an int device vector.

// Sequence of 16 values, representing 4x4 matrix in row-major layout
auto input = thrust::device_vector<int>{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};

// offsets to starts of each of 4 rows
size_t row_size       = 4;
auto in_begin_offsets = thrust::device_vector<size_t>{0, row_size, 2 * row_size};
auto num_segments     = in_begin_offsets.size();
// Perform row-wise sum for 3-by-3 principal sub-matrix
size_t segment_size = 3;

auto in_end_offsets = thrust::device_vector<size_t>{
  0 * row_size + segment_size, 1 * row_size + segment_size, 2 * row_size + segment_size};

auto output            = thrust::device_vector<int>(num_segments * segment_size, thrust::no_init);
auto out_begin_offsets = thrust::device_vector<size_t>{0, segment_size, 2 * segment_size};

void* temp_storage        = nullptr;
size_t temp_storage_bytes = 0;

auto d_in_beg_offsets  = in_begin_offsets.begin();
auto d_in_end_offsets  = in_end_offsets.begin();
auto d_out_beg_offsets = out_begin_offsets.begin();

auto d_in  = input.begin();
auto d_out = output.begin();

// get size of required storage and allocate
auto status = cub::DeviceSegmentedScan::ExclusiveSegmentedSum(
  temp_storage, temp_storage_bytes, d_in, d_out, d_in_beg_offsets, d_in_end_offsets, d_out_beg_offsets, num_segments);
check_execution_status(status, algo_name);

status = cudaMalloc(&temp_storage, temp_storage_bytes);
check_execution_status(status, "cudaMalloc");

// run the algorithm
status = cub::DeviceSegmentedScan::ExclusiveSegmentedSum(
  temp_storage, temp_storage_bytes, d_in, d_out, d_in_beg_offsets, d_in_end_offsets, d_out_beg_offsets, num_segments);
check_execution_status(status, algo_name);

thrust::device_vector<int> expected{0, 1, 3, 0, 5, 11, 0, 9, 19};

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

  • BeginOffsetIteratorOutputT[inferred] Random-access input iterator type for reading segment beginning offsets in the output sequence (may be a simple pointer type)

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.

  • d_out_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_out_begin_offsets[i] is the first element of the ith data segment in d_out

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename ScanOpT, typename InitValueT>
static inline cudaError_t ExclusiveSegmentedScan(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
::cuda::std::int64_t num_segments,
ScanOpT scan_op,
InitValueT init_value,
cudaStream_t stream = 0,
)#

Computes a device-wide segmented exclusive prefix scan using the specified binary associative scan_op functor. The init_value value is applied as the initial value, and is assigned to the first element in each output segment.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet#

The code snippet below illustrates the exclusive segmented prefix scan of an int device vector.

/* Compute exclusive scan using addition of GF(2) field represented
 * over boolean values stored as bits in unsigned integer, where addition is bitwise XOR.
 * Each unsigned integer represents 32-long tuple of GF(2) values
 */
auto scan_op        = [] __host__ __device__(unsigned v1, unsigned v2) -> unsigned { return v1 ^ v2; };
unsigned init_value = 0u;

// 128 input elements
// auto input = thrust::device_vector<unsigned>{0x64b40b1b, 0x7bf23c0c, 0xaa982e07, ... };

// 4 segments
auto offsets = thrust::device_vector<unsigned>{0, 40, 77, 101, 128};
auto output  = thrust::device_vector<unsigned>(input.size(), thrust::no_init);

void* temp_storage        = nullptr;
size_t temp_storage_bytes = 0;

auto d_in           = input.begin();
auto d_out          = output.begin();
auto begin_offsets  = offsets.begin();
auto end_offsets    = offsets.begin() + 1;
size_t num_segments = offsets.size() - 1;

// inquire size of needed temporary storage and allocate
auto status = cub::DeviceSegmentedScan::ExclusiveSegmentedScan(
  temp_storage, temp_storage_bytes, d_in, d_out, begin_offsets, end_offsets, num_segments, scan_op, init_value);
check_execution_status(status, algo_name);

status = cudaMalloc(&temp_storage, temp_storage_bytes);
check_execution_status(status, "cudaMalloc");

// run the algorithm
status = cub::DeviceSegmentedScan::ExclusiveSegmentedScan(
  temp_storage, temp_storage_bytes, d_in, d_out, begin_offsets, end_offsets, num_segments, scan_op, init_value);
check_execution_status(status, algo_name);

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

  • ScanOpT[inferred] Binary associative scan functor type having member T operator()(const T &a, const T &b)

  • InitValueT[inferred] Type of the init_value

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in and in d_out

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • scan_op[in] Binary associative scan functor

  • init_value[in] Initial value to seed the exclusive scan for each segment in the output sequence

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename BeginOffsetIteratorOutputT, typename ScanOpT, typename InitValueT>
static inline cudaError_t ExclusiveSegmentedScan(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
BeginOffsetIteratorOutputT d_out_begin_offsets,
::cuda::std::int64_t num_segments,
ScanOpT scan_op,
InitValueT init_value,
cudaStream_t stream = 0,
)#

Computes a device-wide segmented exclusive prefix scan using the specified binary associative scan_op functor. The init_value value is applied as the initial value, and is assigned to the first element in each output segment.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

  • BeginOffsetIteratorOutputT[inferred] Random-access input iterator type for reading segment beginning offsets in the output sequence (may be a simple pointer type)

  • ScanOpT[inferred] Binary associative scan functor type having member T operator()(const T &a, const T &b)

  • InitValueT[inferred] Type of the init_value

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.

  • d_out_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_out_begin_offsets[i] is the first element of the ith data segment in d_out

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • scan_op[in] Binary associative scan functor

  • init_value[in] Initial value to seed the exclusive scan for each segment in the output sequence

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT>
static inline cudaError_t InclusiveSegmentedSum(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
::cuda::std::int64_t num_segments,
cudaStream_t stream = 0,
)#

Computes a device-wide segmented inclusive prefix sum.

  • Results are not deterministic for computation of prefix sum on floating-point types and may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet#

The code snippet below illustrates the inclusive segmented prefix sum of an int device vector.

auto input   = thrust::device_vector<int>{2, 1, 1, 2, 1, 2, 1, 1};
auto offsets = thrust::device_vector<size_t>{0, 3, 5, 8};

void* temp_storage        = nullptr;
size_t temp_storage_bytes = 0;

auto begin_offsets = offsets.begin();
auto end_offsets   = begin_offsets + 1;
auto num_segments  = offsets.size() - 1;

auto d_in = input.begin();

// get size of requires storage and allocate
auto status = cub::DeviceSegmentedScan::InclusiveSegmentedSum(
  temp_storage, temp_storage_bytes, d_in, d_in, begin_offsets, end_offsets, num_segments);
check_execution_status(status, algo_name);

status = cudaMalloc(&temp_storage, temp_storage_bytes);
check_execution_status(status, "cudaMalloc");

// execute the algorithm
status = cub::DeviceSegmentedScan::InclusiveSegmentedSum(
  temp_storage, temp_storage_bytes, d_in, d_in, begin_offsets, end_offsets, num_segments);
check_execution_status(status, algo_name);

thrust::device_vector<int> expected{2, 3, 4, 2, 3, 2, 3, 4};

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

  • ScanOpT[inferred] Binary associative scan functor type having member T operator()(const T &a, const T &b)

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in and in d_out

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename BeginOffsetIteratorOutputT>
static inline cudaError_t InclusiveSegmentedSum(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
BeginOffsetIteratorOutputT d_out_begin_offsets,
::cuda::std::int64_t num_segments,
cudaStream_t stream = 0,
)#

Computes a device-wide segmented inclusive prefix sum.

  • Results are not deterministic for computation of prefix sum on floating-point types and may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet#

The code snippet below illustrates the inclusive segmented prefix sum of an int device vector.

// Sequence of 16 values, representing 4x4 matrix in row-major layout
auto input = thrust::device_vector<int>{1, 1, 1, 1, -1, -1, -1, -1, 2, 2, 2, 2, -2, -2, -2, -2};

// begin offsets for each of 4 rows
size_t m          = 4;
auto row_offsets  = thrust::device_vector<size_t>{0, m, 2 * m, 3 * m, 4 * m};
auto num_segments = row_offsets.size() - 1;

// Allocate m rows of m + 1 filled with zero-initialized values
auto output = thrust::device_vector<int>((m + 1) * m, 0);
// begin offsets to second element of each row
size_t lda             = m + 1;
auto out_begin_offsets = thrust::device_vector<size_t>{1, lda + 1, 2 * lda + 1, 3 * lda + 1};

void* temp_storage        = nullptr;
size_t temp_storage_bytes = 0;

auto d_in_beg_offsets  = row_offsets.begin();
auto d_in_end_offsets  = row_offsets.begin() + 1;
auto d_out_beg_offsets = out_begin_offsets.begin();

auto d_in  = input.begin();
auto d_out = output.begin();

// get size of temporary storage and allocate
auto status = cub::DeviceSegmentedScan::InclusiveSegmentedSum(
  temp_storage, temp_storage_bytes, d_in, d_out, d_in_beg_offsets, d_in_end_offsets, d_out_beg_offsets, num_segments);
check_execution_status(status, algo_name);

status = cudaMalloc(&temp_storage, temp_storage_bytes);
check_execution_status(status, algo_name);

// Compute inclusive sum for each row prepended with 0
status = cub::DeviceSegmentedScan::InclusiveSegmentedSum(
  temp_storage, temp_storage_bytes, d_in, d_out, d_in_beg_offsets, d_in_end_offsets, d_out_beg_offsets, num_segments);
check_execution_status(status, algo_name);

std::vector<int> h_expected{};
h_expected.reserve(output.size());
std::vector<std::vector<int>> expected_rows{
  {0, 1, 2, 3, 4}, {0, -1, -2, -3, -4}, {0, 2, 4, 6, 8}, {0, -2, -4, -6, -8}};
for (const auto& row : expected_rows)
{
  h_expected.insert(h_expected.end(), row.begin(), row.end());
}

auto expected = thrust::device_vector<int>{h_expected};

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

  • BeginOffsetIteratorOutputT[inferred] Random-access input iterator type for reading segment beginning offsets in the output sequence (may be a simple pointer type)

  • ScanOpT[inferred] Binary associative scan functor type having member T operator()(const T &a, const T &b)

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.

  • d_out_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_out_begin_offsets[i] is the first element of the ith data segment in d_out

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename ScanOpT>
static inline cudaError_t InclusiveSegmentedScan(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
::cuda::std::int64_t num_segments,
ScanOpT scan_op,
cudaStream_t stream = 0,
)#

Computes a device-wide segmented inclusive prefix scan using the specified binary associative scan_op functor.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

  • ScanOpT[inferred] Binary associative scan functor type having member T operator()(const T &a, const T &b)

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in and in d_out

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • scan_op[in] Binary associative scan functor

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename BeginOffsetIteratorOutputT, typename ScanOpT>
static inline cudaError_t InclusiveSegmentedScan(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
BeginOffsetIteratorOutputT d_out_begin_offsets,
::cuda::std::int64_t num_segments,
ScanOpT scan_op,
cudaStream_t stream = 0,
)#

Computes a device-wide segmented inclusive prefix scan using the specified binary associative scan_op functor.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet#

The code snippet below illustrates the exclusive segmented prefix sum of an int device vector.

size_t n = 8;
thrust::device_vector<float> input{0.21f, 0.33f, 0.17f, 0.56f, 0.31f, 0.25f, 1.0f, 0.72f};

constexpr unsigned _zero{0};
auto _n               = static_cast<unsigned>(n);
auto counting_it      = cuda::counting_iterator(_zero);
auto in_begin_offsets = counting_it;
auto in_end_offsets   = cuda::constant_iterator(_n);

// use stride n + 1 is the distance between consecutive diagonal elements in C-contiguous layout
auto out_begin_offsets = cuda::strided_iterator(counting_it, _n + 1);

// allocate and zero-initialize output matrix in C-contiguous layout
auto output = thrust::device_vector<float>(n * n, 0.0f);

auto d_in  = input.begin();
auto d_out = output.begin();

auto scan_op = [] __host__ __device__(float v1, float v2) noexcept -> float { return cuda::maximum<>{}(v1, v2); };

void* temp_storage = nullptr;
size_t temp_storage_bytes;

// determine size of required temporary storage and allocate
auto status = cub::DeviceSegmentedScan::InclusiveSegmentedScan(
  temp_storage, temp_storage_bytes, d_in, d_out, in_begin_offsets, in_end_offsets, out_begin_offsets, n, scan_op);
check_execution_status(status, algo_name);

status = cudaMalloc(&temp_storage, temp_storage_bytes);
check_execution_status(status, "cudaMalloc");

// run the algorithm
status = cub::DeviceSegmentedScan::InclusiveSegmentedScan(
  temp_storage, temp_storage_bytes, d_in, d_out, in_begin_offsets, in_end_offsets, out_begin_offsets, n, scan_op);
check_execution_status(status, algo_name);

thrust::device_vector<float> expected{
  0.21f, 0.33f, 0.33f, 0.56f, 0.56f, 0.56f, 1.00f, 1.00f, // row 0
  0.00f, 0.33f, 0.33f, 0.56f, 0.56f, 0.56f, 1.00f, 1.00f, // row 1
  0.00f, 0.00f, 0.17f, 0.56f, 0.56f, 0.56f, 1.00f, 1.00f, // row 2
  0.00f, 0.00f, 0.00f, 0.56f, 0.56f, 0.56f, 1.00f, 1.00f, // row 3
  0.00f, 0.00f, 0.00f, 0.00f, 0.31f, 0.31f, 1.00f, 1.00f, // row 4
  0.00f, 0.00f, 0.00f, 0.00f, 0.00f, 0.25f, 1.00f, 1.00f, // row 5
  0.00f, 0.00f, 0.00f, 0.00f, 0.00f, 0.00f, 1.00f, 1.00f, // row 6
  0.00f, 0.00f, 0.00f, 0.00f, 0.00f, 0.00f, 0.00f, 0.72f // row 7
};

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

  • BeginOffsetIteratorOutputT[inferred] Random-access input iterator type for reading segment beginning offsets in the output sequence (may be a simple pointer type)

  • ScanOpT[inferred] Binary associative scan functor type having member T operator()(const T &a, const T &b)

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.

  • d_out_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_out_begin_offsets[i] is the first element of the ith data segment in d_out

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • scan_op[in] Binary associative scan functor

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename ScanOpT, typename InitValueT>
static inline cudaError_t InclusiveSegmentedScanInit(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
::cuda::std::int64_t num_segments,
ScanOpT scan_op,
InitValueT init_value,
cudaStream_t stream = 0,
)#

Computes a device-wide segmented inclusive prefix scan using the specified binary associative scan_op functor. The result of applying the scan_op binary operator to init_value value and the first value in each input segment is assigned to the first value of the corresponding output segment.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet#

The code snippet below illustrates the exclusive segmented prefix scan of an int device vector.

int prime  = 7;
auto input = thrust::device_vector<int>{
  2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6};

auto row_size    = static_cast<size_t>(prime);
auto row_offsets = thrust::device_vector<size_t>{0, row_size, 2 * row_size, 3 * row_size, 4 * row_size, 5 * row_size};
size_t num_segments = row_offsets.size() - 1;

thrust::device_vector<int> output(input.size(), thrust::no_init);

auto m_p = cuda::fast_mod_div(prime);
// Binary operator to multiply arguments using modular arithmetic
auto scan_op = [=] __host__ __device__(int v1, int v2) -> int {
  const auto proj_v1 = (v1 % m_p);
  const auto proj_v2 = (v2 % m_p);
  return (proj_v1 * proj_v2) % m_p;
};
int init_value = 1;

auto d_in  = input.begin();
auto d_out = output.begin();

auto d_in_beg_offsets = row_offsets.begin();
auto d_in_end_offsets = row_offsets.begin() + 1;

void* temp_storage        = nullptr;
size_t temp_storage_bytes = 0;

// get size of temporary storage and allocate
auto status = cub::DeviceSegmentedScan::InclusiveSegmentedScanInit(
  temp_storage, temp_storage_bytes, d_in, d_out, d_in_beg_offsets, d_in_end_offsets, num_segments, scan_op, init_value);
check_execution_status(status, algo_name);

status = cudaMalloc(&temp_storage, temp_storage_bytes);
check_execution_status(status, "cudaMalloc");

// run the algorithm
status = cub::DeviceSegmentedScan::InclusiveSegmentedScanInit(
  temp_storage, temp_storage_bytes, d_in, d_out, d_in_beg_offsets, d_in_end_offsets, num_segments, scan_op, init_value);
check_execution_status(status, algo_name);

std::vector<int> h_expected{};
h_expected.reserve(output.size());
std::vector<std::vector<int>> expected_rows{
  {2, 4, 1, 2, 4, 1, 2}, {3, 2, 6, 4, 5, 1, 3}, {4, 2, 1, 4, 2, 1, 4}, {5, 4, 6, 2, 3, 1, 5}, {6, 1, 6, 1, 6, 1, 6}};
for (const auto& row : expected_rows)
{
  h_expected.insert(h_expected.end(), row.begin(), row.end());
}

auto expected = thrust::device_vector<int>{h_expected};

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

  • ScanOpT[inferred] Binary associative scan functor type having member T operator()(const T &a, const T &b)

  • InitValueT[inferred] Type of the init_value

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in and in d_out

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • scan_op[in] Binary associative scan functor

  • init_value[in] Initial value to seed the exclusive scan for each segment in the output sequence

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename BeginOffsetIteratorInputT, typename EndOffsetIteratorInputT, typename BeginOffsetIteratorOutputT, typename ScanOpT, typename InitValueT>
static inline cudaError_t InclusiveSegmentedScanInit(
void *d_temp_storage,
size_t &temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
BeginOffsetIteratorInputT d_in_begin_offsets,
EndOffsetIteratorInputT d_in_end_offsets,
BeginOffsetIteratorOutputT d_out_begin_offsets,
::cuda::std::int64_t num_segments,
ScanOpT scan_op,
InitValueT init_value,
cudaStream_t stream = 0,
)#

Computes a device-wide segmented inclusive prefix scan using the specified binary associative scan_op functor. The result of applying the scan_op binary operator to init_value value and the first value in each input segment is assigned to the first value of the corresponding output segment.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.

  • When d_in and d_out are equal, the scan is performed in-place. The input and output sequences shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Template Parameters:
  • InputIteratorT[inferred] Random-access input iterator type for reading segmented scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing segmented scan outputs (may be a simple pointer type)

  • BeginOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment beginning offsets in the input data sequence (may be a simple pointer type)

  • EndOffsetIteratorInputT[inferred] Random-access input iterator type for reading segment ending offsets in the input data sequence (may be a simple pointer type)

  • BeginOffsetIteratorOutputT[inferred] Random-access input iterator type for reading segment beginning offsets in the output sequence (may be a simple pointer type)

  • ScanOpT[inferred] Binary associative scan functor type having member T operator()(const T &a, const T &b)

  • InitValueT[inferred] Type of the init_value

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • d_in_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_in_begin_offsets[i] is the first element of the ith data segment in d_in

  • d_in_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_in_end_offsets[i] - 1 is the last element of the ith data segment in d_in. If d_in_end_offsets[i] - 1 <= d_in_begin_offsets[i], the ith is considered empty.

  • d_out_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_out_begin_offsets[i] is the first element of the ith data segment in d_out

  • num_segments[in] The number of segments that comprise the segmented prefix scan data.

  • scan_op[in] Binary associative scan functor

  • init_value[in] Initial value to seed the exclusive scan for each segment in the output sequence

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.