cub::BlockScan

Defined in cub/block/block_scan.cuh

template<typename T, int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int LEGACY_PTX_ARCH = 0>
class BlockScan

The BlockScan class provides collective methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block.

Overview

  • Given a list of input elements and a binary reduction operator, a prefix scan produces an output list where each element is computed to be the reduction of the elements occurring earlier in the input list. Prefix sum connotes a prefix scan with the addition operator. The term inclusive indicates that the ith output reduction incorporates the ith input. The term exclusive indicates the ith input is not incorporated into the ith output reduction.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • BlockScan can be optionally specialized by algorithm to accommodate different workload profiles:

    1. cub::BLOCK_SCAN_RAKING: An efficient (high throughput) “raking reduce-then-scan” prefix scan algorithm.

    2. cub::BLOCK_SCAN_RAKING_MEMOIZE: Similar to cub::BLOCK_SCAN_RAKING, but having higher throughput at the expense of additional register pressure for intermediate storage.

    3. cub::BLOCK_SCAN_WARP_SCANS: A quick (low latency) “tiled warpscans” prefix scan algorithm.

Performance Considerations

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • Uses special instructions when applicable (e.g., warp SHFL)

  • Uses synchronization-free communication between warp lanes when applicable

  • Invokes a minimal number of block-wide synchronization barriers (only one or two, depending on algorithm selection)

  • Incurs zero bank conflicts for most types

  • Computation is slightly more efficient (i.e., having lower instruction overhead) for:

    • Prefix sum variants (vs. generic scan)

    • Blocks whose thread count is a multiple of the architecture’s warp size

  • See cub::BlockScanAlgorithm for performance details regarding algorithmic alternatives

A Simple Example

Every thread in the block uses the BlockScan class by first specializing the BlockScan type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.

The code snippet below illustrates an exclusive prefix sum of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockScan for a 1D block of 128 threads of type int
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute the block-wide exclusive prefix sum
    BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);
}

Suppose the set of input thread_data across the block of threads is {[1,1,1,1], [1,1,1,1], ..., [1,1,1,1]}. The corresponding output thread_data in those threads will be {[0,1,2,3], [4,5,6,7], ..., [508,509,510,511]}.

Re-using dynamically allocated shared memory

The block/example_block_reduce_dyn_smem.cu example illustrates usage of dynamically shared memory with BlockReduce and how to re-purpose the same memory region. This example can be easily adapted to the storage required by BlockScan.

Template Parameters
  • T – Data type being scanned

  • BLOCK_DIM_X – The thread block length in threads along the X dimension

  • ALGORITHM[optional] cub::BlockScanAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_SCAN_RAKING)

  • BLOCK_DIM_Y[optional] The thread block length in threads along the Y dimension (default: 1)

  • BLOCK_DIM_Z[optional] The thread block length in threads along the Z dimension (default: 1)

  • LEGACY_PTX_ARCH[optional] Unused.

Collective constructors

inline BlockScan()

Collective constructor using a private static allocation of shared memory as temporary storage.

inline BlockScan(TempStorage &temp_storage)

Collective constructor using the specified memory allocation as temporary storage.

Parameters

temp_storage[in] Reference to memory allocation having layout type TempStorage

Exclusive prefix sum operations

inline void ExclusiveSum(T input, T &output)

Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. The value of 0 is applied as the initial value, and is assigned to output in thread0.

  • This operation assumes the value obtained by T’s default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value “zero” for addition.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates an exclusive prefix sum of 128 integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>  // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockScan for a 1D block of 128 threads of type int
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Obtain input item for each thread
    int thread_data;
    ...

    // Collectively compute the block-wide exclusive prefix sum
    BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);
}

Suppose the set of input thread_data across the block of threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 0, 1, ..., 127.

Parameters
  • input[in] Calling thread’s input item

  • output[out] Calling thread’s output item (may be aliased to input)

inline void ExclusiveSum(T input, T &output, T &block_aggregate)

Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. The value of 0 is applied as the initial value, and is assigned to output in thread0. Also provides every thread with the block-wide block_aggregate of all inputs.

  • This operation assumes the value obtained by T’s default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value “zero” for addition.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates an exclusive prefix sum of 128 integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockScan for a 1D block of 128 threads of type int
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Obtain input item for each thread
    int thread_data;
    ...

    // Collectively compute the block-wide exclusive prefix sum
    int block_aggregate;
    BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data, block_aggregate);
}

Suppose the set of input thread_data across the block of threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 0, 1, ..., 127. Furthermore the value 128 will be stored in block_aggregate for all threads.

Parameters
  • input[in] Calling thread’s input item

  • output[out] Calling thread’s output item (may be aliased to input)

  • block_aggregate[out] block-wide aggregate reduction of input items

template<typename BlockPrefixCallbackOp>
inline void ExclusiveSum(T input, T &output, BlockPrefixCallbackOp &block_prefix_callback_op)

Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the “seed” value that logically prefixes the thread block’s scan inputs. The block-wide aggregate of all inputs is supplied as the argument to block_prefix_callback_op.

  • This operation assumes the value obtained by T’s default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value “zero” for addition.

  • The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor’s input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block, however only the return value from lane0 is applied as the block-wide prefix. Can be stateful.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a single thread block that progressively computes an exclusive prefix sum over multiple “tiles” of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 128 integer items that are partitioned across 128 threads.

#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>

// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
struct BlockPrefixCallbackOp
{
    // Running prefix
    int running_total;

    // Constructor
    __device__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {}

    // Callback operator to be entered by the first warp of threads in the block.
    // Thread-0 is responsible for returning a value for seeding the block-wide scan.
    __device__ int operator()(int block_aggregate)
    {
        int old_prefix = running_total;
        running_total += block_aggregate;
        return old_prefix;
    }
};

__global__ void ExampleKernel(int *d_data, int num_items, ...)
{
    // Specialize BlockScan for a 1D block of 128 threads
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Initialize running total
    BlockPrefixCallbackOp prefix_op(0);

    // Have the block iterate over segments of items
    for (int block_offset = 0; block_offset < num_items; block_offset += 128)
    {
        // Load one input item per thread
        int thread_data = d_data[block_offset + threadIdx.x];

        // Collectively compute the block-wide exclusive prefix sum
        BlockScan(temp_storage).ExclusiveSum(
            thread_data, thread_data, prefix_op);
        __syncthreads();

        // Store scanned items to output segment
        d_data[block_offset + threadIdx.x] = thread_data;
    }
}

Suppose the input d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... The corresponding output for the first segment will be 0, 1, ..., 127. The output for the second segment will be 128, 129, ..., 255.

Template Parameters

BlockPrefixCallbackOp[inferred] Call-back functor type having member T operator()(T block_aggregate)

Parameters
  • input[in] Calling thread’s input item

  • output[out] Calling thread’s output item (may be aliased to input)

  • block_prefix_callback_op[inout]

    warp0 only call-back functor for specifying a block-wide prefix to be applied to the logical input sequence.

Exclusive prefix sum operations (multiple data per thread)

template<int ITEMS_PER_THREAD>
inline void ExclusiveSum(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD])

Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. The value of 0 is applied as the initial value, and is assigned to output[0] in thread0.

  • This operation assumes the value obtained by T’s default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value “zero” for addition.

  • Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates an exclusive prefix sum of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockScan for a 1D block of 128 threads of type int
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute the block-wide exclusive prefix sum
    BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);
}

Suppose the set of input thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }.

Template Parameters

ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.

Parameters
  • input[in] Calling thread’s input items

  • output[out] Calling thread’s output items (may be aliased to input)

template<int ITEMS_PER_THREAD>
inline void ExclusiveSum(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], T &block_aggregate)

Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. The value of 0 is applied as the initial value, and is assigned to output[0] in thread0. Also provides every thread with the block-wide block_aggregate of all inputs.

  • This operation assumes the value obtained by T’s default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value “zero” for addition.

  • Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates an exclusive prefix sum of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.

#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockScan for a 1D block of 128 threads of type int
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute the block-wide exclusive prefix sum
    int block_aggregate;
    BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data, block_aggregate);
}

Suppose the set of input thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. Furthermore the value 512 will be stored in block_aggregate for all threads.

Template Parameters

ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.

Parameters
  • input[in] Calling thread’s input items

  • output[out] Calling thread’s output items (may be aliased to input)

  • block_aggregate[out] block-wide aggregate reduction of input items

template<int ITEMS_PER_THREAD, typename BlockPrefixCallbackOp>
inline void ExclusiveSum(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], BlockPrefixCallbackOp &block_prefix_callback_op)

Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the “seed” value that logically prefixes the thread block’s scan inputs. The block-wide aggregate of all inputs is supplied as the argument to block_prefix_callback_op.

  • This operation assumes the value obtained by T’s default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value “zero” for addition.

  • The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor’s input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block, however only the return value from lane0 is applied as the block-wide prefix. Can be stateful.

  • Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a single thread block that progressively computes an exclusive prefix sum over multiple “tiles” of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
struct BlockPrefixCallbackOp
{
    // Running prefix
    int running_total;

    // Constructor
    __device__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {}

    // Callback operator to be entered by the first warp of threads in the block.
    // Thread-0 is responsible for returning a value for seeding the block-wide scan.
    __device__ int operator()(int block_aggregate)
    {
        int old_prefix = running_total;
        running_total += block_aggregate;
        return old_prefix;
    }
};

__global__ void ExampleKernel(int *d_data, int num_items, ...)
{
    // Specialize BlockLoad, BlockStore, and BlockScan for a 1D block of 128 threads, 4 ints per thread
    using BlockLoad  = cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_TRANSPOSE>;
    using BlockStore = cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_TRANSPOSE>;
    using BlockScan  = cub::BlockScan<int, 128>;

    // Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan
    __shared__ union {
        typename BlockLoad::TempStorage     load;
        typename BlockScan::TempStorage     scan;
        typename BlockStore::TempStorage    store;
    } temp_storage;

    // Initialize running total
    BlockPrefixCallbackOp prefix_op(0);

    // Have the block iterate over segments of items
    for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4)
    {
        // Load a segment of consecutive items that are blocked across threads
        int thread_data[4];
        BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data);
        __syncthreads();

        // Collectively compute the block-wide exclusive prefix sum
        BlockScan(temp_storage.scan).ExclusiveSum(
            thread_data, thread_data, prefix_op);
        __syncthreads();

        // Store scanned items to output segment
        BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data);
        __syncthreads();
    }
}

Suppose the input d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... The corresponding output for the first segment will be 0, 1, 2, 3, ..., 510, 511. The output for the second segment will be 512, 513, 514, 515, ..., 1022, 1023.

Template Parameters
  • ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.

  • BlockPrefixCallbackOp[inferred] Call-back functor type having member T operator()(T block_aggregate)

Parameters
  • input[in] Calling thread’s input items

  • output[out] Calling thread’s output items (may be aliased to input)

  • block_prefix_callback_op[inout]

    warp0 only call-back functor for specifying a block-wide prefix to be applied to the logical input sequence.

Exclusive prefix scan operations

template<typename ScanOp>
inline void ExclusiveScan(T input, T &output, T initial_value, ScanOp scan_op)

Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element.

  • Supports non-commutative scan operators.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates an exclusive prefix max scan of 128 integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockScan for a 1D block of 128 threads of type int
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Obtain input item for each thread
    int thread_data;
    ...

    // Collectively compute the block-wide exclusive prefix max scan
    BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cuda::maximum<>{});
}

Suppose the set of input thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The corresponding output thread_data in those threads will be INT_MIN, 0, 0, 2, ..., 124, 126.

Template Parameters

ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

Parameters
  • input[in] Calling thread’s input item

  • output[out] Calling thread’s output item (may be aliased to input)

  • initial_value[in]

    Initial value to seed the exclusive scan (and is assigned to output[0] in thread0)

  • scan_op[in] Binary scan functor

template<typename ScanOp>
inline void ExclusiveScan(T input, T &output, T initial_value, ScanOp scan_op, T &block_aggregate)

Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs.

  • Supports non-commutative scan operators.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates an exclusive prefix max scan of 128 integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockScan for a 1D block of 128 threads of type int
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Obtain input item for each thread
    int thread_data;
    ...

    // Collectively compute the block-wide exclusive prefix max scan
    int block_aggregate;
    BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cuda::maximum<>{},
                                          block_aggregate);
}

Suppose the set of input thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The corresponding output thread_data in those threads will be INT_MIN, 0, 0, 2, ..., 124, 126. Furthermore the value 126 will be stored in block_aggregate for all threads.

Template Parameters

ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

Parameters
  • input[in] Calling thread’s input item

  • output[out] Calling thread’s output item (may be aliased to input)

  • initial_value[in]

    Initial value to seed the exclusive scan (and is assigned to output[0] in thread0)

  • scan_op[in] Binary scan functor

  • block_aggregate[out] block-wide aggregate reduction of input items

template<typename ScanOp, typename BlockPrefixCallbackOp>
inline void ExclusiveScan(T input, T &output, ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op)

Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the “seed” value that logically prefixes the thread block’s scan inputs. The block-wide aggregate of all inputs is supplied as the argument to block_prefix_callback_op.

  • The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor’s input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block, however only the return value from lane0 is applied as the block-wide prefix. Can be stateful.

  • Supports non-commutative scan operators.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a single thread block that progressively computes an exclusive prefix max scan over multiple “tiles” of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 128 integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
struct BlockPrefixCallbackOp
{
    // Running prefix
    int running_total;

    // Constructor
    __device__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {}

    // Callback operator to be entered by the first warp of threads in the block.
    // Thread-0 is responsible for returning a value for seeding the block-wide scan.
    __device__ int operator()(int block_aggregate)
    {
        int old_prefix = running_total;
        running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix;
        return old_prefix;
    }
};

__global__ void ExampleKernel(int *d_data, int num_items, ...)
{
    // Specialize BlockScan for a 1D block of 128 threads
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Initialize running total
    BlockPrefixCallbackOp prefix_op(INT_MIN);

    // Have the block iterate over segments of items
    for (int block_offset = 0; block_offset < num_items; block_offset += 128)
    {
        // Load one input item per thread
        int thread_data = d_data[block_offset + threadIdx.x];

        // Collectively compute the block-wide exclusive prefix max scan
        BlockScan(temp_storage).ExclusiveScan(
            thread_data, thread_data, cuda::maximum<>{}, prefix_op);
        __syncthreads();

        // Store scanned items to output segment
        d_data[block_offset + threadIdx.x] = thread_data;
    }
}

Suppose the input d_data is 0, -1, 2, -3, 4, -5, .... The corresponding output for the first segment will be INT_MIN, 0, 0, 2, ..., 124, 126. The output for the second segment will be 126, 128, 128, 130, ..., 252, 254.

Template Parameters
  • ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

  • BlockPrefixCallbackOp[inferred] Call-back functor type having member T operator()(T block_aggregate)

Parameters
  • input[in] Calling thread’s input item

  • output[out] Calling thread’s output item (may be aliased to input)

  • scan_op[in] Binary scan functor

  • block_prefix_callback_op[inout]

    warp0 only call-back functor for specifying a block-wide prefix to be applied to the logical input sequence.

Exclusive prefix scan operations (multiple data per thread)

template<int ITEMS_PER_THREAD, typename ScanOp>
inline void ExclusiveScan(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], T initial_value, ScanOp scan_op)

Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements.

  • Supports non-commutative scan operators.

  • Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates an exclusive prefix max scan of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockScan for a 1D block of 128 threads of type int
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute the block-wide exclusive prefix max scan
    BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cuda::maximum<>{});

Suppose the set of input thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The corresponding output thread_data in those threads will be { [INT_MIN,0,0,2], [2,4,4,6], ..., [506,508,508,510] }.
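The exclusive semantics can be stated precisely with a host-side sequential model (illustrative C++, not CUB code; the function name is hypothetical): output[i] is the reduction of initial_value and all inputs preceding position i in the flattened blocked arrangement, where thread t's 4-item array occupies positions 4t through 4t+3.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Sequential model of an exclusive max scan with an initial value:
// item i never contributes to its own output.
std::vector<int> ExclusiveMaxScan(const std::vector<int>& in, int initial_value)
{
    std::vector<int> out(in.size());
    int acc = initial_value;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        out[i] = acc;               // reduction of initial_value and items 0..i-1
        acc = std::max(acc, in[i]);
    }
    return out;
}
```

Flattening the per-thread arrays, the model yields thread0's [INT_MIN, 0, 0, 2] and thread1's [2, 4, 4, 6], matching the output quoted above.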

Template Parameters
  • ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.

  • ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

Parameters
  • input[in] Calling thread’s input items

  • output[out] Calling thread’s output items (may be aliased to input)

  • initial_value[in]

    Initial value to seed the exclusive scan (and is assigned to output[0] in thread 0)

  • scan_op[in] Binary scan functor

template<int ITEMS_PER_THREAD, typename ScanOp>
inline void ExclusiveScan(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], T initial_value, ScanOp scan_op, T &block_aggregate)

Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs.

  • Supports non-commutative scan operators.

  • Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates an exclusive prefix max scan of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockScan for a 1D block of 128 threads of type int
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute the block-wide exclusive prefix max scan
    int block_aggregate;
    BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cuda::maximum<>{},
    block_aggregate);

Suppose the set of input thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The corresponding output thread_data in those threads will be { [INT_MIN,0,0,2], [2,4,4,6], ..., [506,508,508,510] }. Furthermore the value 510 will be stored in block_aggregate for all threads.
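A short host-side model (illustrative C++ only; the function name is hypothetical) makes the block_aggregate contract concrete: it is the reduction of all input items under the scan operator, here their maximum.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <vector>

// Model of block_aggregate for a max scan: the reduction of all input items.
// In the snippet above this is the single value every thread receives.
int MaxAggregate(const std::vector<int>& in)
{
    int agg = INT_MIN;
    for (int v : in)
        agg = std::max(agg, v);
    return agg;
}
```

For the 512-item input above the model gives 510, matching the stated block_aggregate.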

Template Parameters
  • ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.

  • ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

Parameters
  • input[in] Calling thread’s input items

  • output[out] Calling thread’s output items (may be aliased to input)

  • initial_value[in]

    Initial value to seed the exclusive scan (and is assigned to output[0] in thread 0)

  • scan_op[in] Binary scan functor

  • block_aggregate[out] block-wide aggregate reduction of input items

template<int ITEMS_PER_THREAD, typename ScanOp, typename BlockPrefixCallbackOp>
inline void ExclusiveScan(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op)

Computes an exclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the “seed” value that logically prefixes the thread block’s scan inputs.

  • The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor’s input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.

  • Supports non-commutative scan operators.

  • Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a single thread block that progressively computes an exclusive prefix max scan over multiple “tiles” of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 128 integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
struct BlockPrefixCallbackOp
{
    // Running prefix
    int running_total;

    // Constructor
    __device__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {}

    // Callback operator to be entered by the first warp of threads in the block.
    // Thread-0 is responsible for returning a value for seeding the block-wide scan.
    __device__ int operator()(int block_aggregate)
    {
        int old_prefix = running_total;
        running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix;
        return old_prefix;
    }
};

__global__ void ExampleKernel(int *d_data, int num_items, ...)
{
    // Specialize BlockLoad, BlockStore, and BlockScan for a 1D block of 128 threads, 4 ints per thread
    using BlockLoad  = cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_TRANSPOSE>;
    using BlockStore = cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_TRANSPOSE>;
    using BlockScan  = cub::BlockScan<int, 128>;

    // Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan
    __shared__ union {
        typename BlockLoad::TempStorage     load;
        typename BlockScan::TempStorage     scan;
        typename BlockStore::TempStorage    store;
    } temp_storage;

    // Initialize running total
    BlockPrefixCallbackOp prefix_op(0);

    // Have the block iterate over segments of items
    for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4)
    {
        // Load a segment of consecutive items that are blocked across threads
        int thread_data[4];
        BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data);
        __syncthreads();

        // Collectively compute the block-wide exclusive prefix max scan
        BlockScan(temp_storage.scan).ExclusiveScan(
            thread_data, thread_data, INT_MIN, cuda::maximum<>{}, prefix_op);
        __syncthreads();

        // Store scanned items to output segment
        BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data);
        __syncthreads();
    }

Suppose the input d_data is 0, -1, 2, -3, 4, -5, .... The corresponding output for the first segment will be INT_MIN, 0, 0, 2, 2, 4, ..., 508, 510. The output for the second segment will be 510, 512, 512, 514, 514, 516, ..., 1020, 1022.

Template Parameters
  • ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.

  • ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

  • BlockPrefixCallbackOp[inferred] Call-back functor type having member T operator()(T block_aggregate)

Parameters
  • input[in] Calling thread’s input items

  • output[out] Calling thread’s output items (may be aliased to input)

  • scan_op[in] Binary scan functor

  • block_prefix_callback_op[inout]

    Call-back functor (invoked by warp0 only) for specifying a block-wide prefix to be applied to the logical input sequence.

Inclusive prefix sum operations

inline void InclusiveSum(T input, T &output)

Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates an inclusive prefix sum of 128 integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockScan for a 1D block of 128 threads of type int
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Obtain input item for each thread
    int thread_data;
    ...

    // Collectively compute the block-wide inclusive prefix sum
    BlockScan(temp_storage).InclusiveSum(thread_data, thread_data);

Suppose the set of input thread_data across the block of threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 1, 2, ..., 128.
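The inclusive sum computed above matches an ordinary sequential prefix sum over the thread-ranked inputs. The host-side C++ model below (illustrative only; the function name is hypothetical) states the contract:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Sequential model of InclusiveSum: output[i] is the sum of inputs 0..i,
// so each thread's output includes its own input.
std::vector<int> InclusivePrefixSum(const std::vector<int>& in)
{
    std::vector<int> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin()); // inclusive by definition
    return out;
}
```

For 128 ones this yields 1, 2, ..., 128, matching the output quoted above.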

Parameters
  • input[in] Calling thread’s input item

  • output[out] Calling thread’s output item (may be aliased to input)

inline void InclusiveSum(T input, T &output, T &block_aggregate)

Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates an inclusive prefix sum of 128 integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockScan for a 1D block of 128 threads of type int
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Obtain input item for each thread
    int thread_data;
    ...

    // Collectively compute the block-wide inclusive prefix sum
    int block_aggregate;
    BlockScan(temp_storage).InclusiveSum(thread_data, thread_data, block_aggregate);

Suppose the set of input thread_data across the block of threads is 1, 1, ..., 1. The corresponding output thread_data in those threads will be 1, 2, ..., 128. Furthermore the value 128 will be stored in block_aggregate for all threads.

Parameters
  • input[in] Calling thread’s input item

  • output[out] Calling thread’s output item (may be aliased to input)

  • block_aggregate[out] block-wide aggregate reduction of input items

template<typename BlockPrefixCallbackOp>
inline void InclusiveSum(T input, T &output, BlockPrefixCallbackOp &block_prefix_callback_op)

Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the “seed” value that logically prefixes the thread block’s scan inputs.

  • The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor’s input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a single thread block that progressively computes an inclusive prefix sum over multiple “tiles” of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 128 integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
struct BlockPrefixCallbackOp
{
    // Running prefix
    int running_total;

    // Constructor
    __device__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {}

    // Callback operator to be entered by the first warp of threads in the block.
    // Thread-0 is responsible for returning a value for seeding the block-wide scan.
    __device__ int operator()(int block_aggregate)
    {
        int old_prefix = running_total;
        running_total += block_aggregate;
        return old_prefix;
    }
};

__global__ void ExampleKernel(int *d_data, int num_items, ...)
{
    // Specialize BlockScan for a 1D block of 128 threads
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Initialize running total
    BlockPrefixCallbackOp prefix_op(0);

    // Have the block iterate over segments of items
    for (int block_offset = 0; block_offset < num_items; block_offset += 128)
    {
        // Load a segment of consecutive items that are blocked across threads
        int thread_data = d_data[block_offset + threadIdx.x];

        // Collectively compute the block-wide inclusive prefix sum
        BlockScan(temp_storage).InclusiveSum(
            thread_data, thread_data, prefix_op);
        __syncthreads();

        // Store scanned items to output segment
        d_data[block_offset + threadIdx.x] = thread_data;
    }

Suppose the input d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... The corresponding output for the first segment will be 1, 2, ..., 128. The output for the second segment will be 129, 130, ..., 256.
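The carry-over across segments can be modeled sequentially on the host (illustrative C++ only; the function name is hypothetical): each segment is scanned inclusively, seeded with the running total of all prior segments, which is exactly the role played by BlockPrefixCallbackOp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential model of tiled InclusiveSum with a running prefix: sums
// continue across segment boundaries because each segment's scan is
// seeded with the running total of all previous segments.
std::vector<int> SegmentedInclusiveSum(const std::vector<int>& in, int segment)
{
    std::vector<int> out(in.size());
    int running_total = 0;
    for (std::size_t base = 0; base < in.size(); base += segment)
    {
        for (std::size_t i = base; i < base + segment && i < in.size(); ++i)
        {
            running_total += in[i];   // inclusive: item i contributes to out[i]
            out[i] = running_total;
        }
    }
    return out;
}
```

For all-ones input with 128-item segments, the first segment yields 1, ..., 128 and the second 129, ..., 256, as quoted above.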

Template Parameters

BlockPrefixCallbackOp[inferred] Call-back functor type having member T operator()(T block_aggregate)

Parameters
  • input[in] Calling thread’s input item

  • output[out] Calling thread’s output item (may be aliased to input)

  • block_prefix_callback_op[inout]

    Call-back functor (invoked by warp0 only) for specifying a block-wide prefix to be applied to the logical input sequence.

Inclusive prefix sum operations (multiple data per thread)

template<int ITEMS_PER_THREAD>
inline void InclusiveSum(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD])

Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements.

  • Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates an inclusive prefix sum of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockScan for a 1D block of 128 threads of type int
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute the block-wide inclusive prefix sum
    BlockScan(temp_storage).InclusiveSum(thread_data, thread_data);

Suppose the set of input thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [1,2,3,4], [5,6,7,8], ..., [509,510,511,512] }.
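The blocked-arrangement indexing can be made explicit with a small host-side model (illustrative C++; names are hypothetical): thread t's 4-item array holds global positions 4t through 4t+3 of the logical sequence, and the inclusive sum runs over that flat concatenation in thread-rank order.

```cpp
#include <cassert>
#include <vector>

// Model of InclusiveSum over a blocked arrangement (4 items per thread):
// per-thread arrays concatenate in thread-rank order into one flat sequence.
constexpr int kItemsPerThread = 4;

int BlockedInclusiveSumItem(const std::vector<int>& in, int thread, int item)
{
    int flat = thread * kItemsPerThread + item; // global position of (thread, item)
    int sum = 0;
    for (int i = 0; i <= flat; ++i)             // inclusive: position flat included
        sum += in[i];
    return sum;
}
```

With all-ones input, thread 0 holds [1,2,3,4], thread 1 holds [5,6,7,8], and thread 127's last item is 512, matching the output quoted above.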

Template Parameters

ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.

Parameters
  • input[in] Calling thread’s input items

  • output[out] Calling thread’s output items (may be aliased to input)

template<int ITEMS_PER_THREAD>
inline void InclusiveSum(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], T &block_aggregate)

Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs.

  • Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates an inclusive prefix sum of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockScan for a 1D block of 128 threads of type int
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute the block-wide inclusive prefix sum
    int block_aggregate;
    BlockScan(temp_storage).InclusiveSum(thread_data, thread_data, block_aggregate);

Suppose the set of input thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The corresponding output thread_data in those threads will be { [1,2,3,4], [5,6,7,8], ..., [509,510,511,512] }. Furthermore the value 512 will be stored in block_aggregate for all threads.

Template Parameters

ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.

Parameters
  • input[in] Calling thread’s input items

  • output[out] Calling thread’s output items (may be aliased to input)

  • block_aggregate[out] block-wide aggregate reduction of input items

template<int ITEMS_PER_THREAD, typename BlockPrefixCallbackOp>
inline void InclusiveSum(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], BlockPrefixCallbackOp &block_prefix_callback_op)

Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the “seed” value that logically prefixes the thread block’s scan inputs.

  • The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor’s input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.

  • Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a single thread block that progressively computes an inclusive prefix sum over multiple “tiles” of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
struct BlockPrefixCallbackOp
{
    // Running prefix
    int running_total;

    // Constructor
    __device__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {}

    // Callback operator to be entered by the first warp of threads in the block.
    // Thread-0 is responsible for returning a value for seeding the block-wide scan.
    __device__ int operator()(int block_aggregate)
    {
        int old_prefix = running_total;
        running_total += block_aggregate;
        return old_prefix;
    }
};

__global__ void ExampleKernel(int *d_data, int num_items, ...)
{
    // Specialize BlockLoad, BlockStore, and BlockScan for a 1D block of 128 threads, 4 ints per thread
    using BlockLoad  = cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_TRANSPOSE>;
    using BlockStore = cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_TRANSPOSE>;
    using BlockScan  = cub::BlockScan<int, 128>;

    // Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan
    __shared__ union {
        typename BlockLoad::TempStorage     load;
        typename BlockScan::TempStorage     scan;
        typename BlockStore::TempStorage    store;
    } temp_storage;

    // Initialize running total
    BlockPrefixCallbackOp prefix_op(0);

    // Have the block iterate over segments of items
    for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4)
    {
        // Load a segment of consecutive items that are blocked across threads
        int thread_data[4];
        BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data);
        __syncthreads();

        // Collectively compute the block-wide inclusive prefix sum
        BlockScan(temp_storage.scan).InclusiveSum(
            thread_data, thread_data, prefix_op);
        __syncthreads();

        // Store scanned items to output segment
        BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data);
        __syncthreads();
    }

Suppose the input d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... The corresponding output for the first segment will be 1, 2, 3, 4, ..., 511, 512. The output for the second segment will be 513, 514, 515, 516, ..., 1023, 1024.

Template Parameters
  • ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.

  • BlockPrefixCallbackOp[inferred] Call-back functor type having member T operator()(T block_aggregate)

Parameters
  • input[in] Calling thread’s input items

  • output[out] Calling thread’s output items (may be aliased to input)

  • block_prefix_callback_op[inout]

    Call-back functor (invoked by warp0 only) for specifying a block-wide prefix to be applied to the logical input sequence.

Inclusive prefix scan operations

template<typename ScanOp>
inline void InclusiveScan(T input, T &output, ScanOp scan_op)

Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element.

  • Supports non-commutative scan operators.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates an inclusive prefix max scan of 128 integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockScan for a 1D block of 128 threads of type int
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Obtain input item for each thread
    int thread_data;
    ...

    // Collectively compute the block-wide inclusive prefix max scan
    BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cuda::maximum<>{});

Suppose the set of input thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The corresponding output thread_data in those threads will be 0, 0, 2, 2, ..., 126, 126.
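The inclusive variant of the generic scan can likewise be modeled sequentially on the host (illustrative C++ only; the function name is hypothetical), here with the max operator: unlike the exclusive form, each item is included in its own result.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Sequential model of an inclusive max scan: output[i] is the maximum
// over inputs 0..i.
std::vector<int> InclusiveMaxScan(const std::vector<int>& in)
{
    std::vector<int> out(in.size());
    int acc = INT_MIN;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        acc = std::max(acc, in[i]);  // item i contributes to its own output
        out[i] = acc;
    }
    return out;
}
```

For the input 0, -1, 2, -3, ... this yields 0, 0, 2, 2, ..., matching the output quoted above.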

Template Parameters

ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

Parameters
  • input[in] Calling thread’s input item

  • output[out] Calling thread’s output item (may be aliased to input)

  • scan_op[in] Binary scan functor

template<typename ScanOp>
inline void InclusiveScan(T input, T &output, ScanOp scan_op, T &block_aggregate)

Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide block_aggregate of all inputs.

  • Supports non-commutative scan operators.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates an inclusive prefix max scan of 128 integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockScan for a 1D block of 128 threads of type int
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Obtain input item for each thread
    int thread_data;
    ...

    // Collectively compute the block-wide inclusive prefix max scan
    int block_aggregate;
    BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cuda::maximum<>{}, block_aggregate);

Suppose the set of input thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The corresponding output thread_data in those threads will be 0, 0, 2, 2, ..., 126, 126. Furthermore the value 126 will be stored in block_aggregate for all threads.

Template Parameters

ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

Parameters
  • input[in] Calling thread’s input item

  • output[out] Calling thread’s output item (may be aliased to input)

  • scan_op[in] Binary scan functor

  • block_aggregate[out] Block-wide aggregate reduction of input items

template<typename ScanOp, typename BlockPrefixCallbackOp>
inline void InclusiveScan(T input, T &output, ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op)

Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes one input element. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the “seed” value that logically prefixes the thread block’s scan inputs.

  • The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor’s input parameter block_aggregate is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane0 is applied as the block-wide prefix. The functor can be stateful.

  • Supports non-commutative scan operators.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a single thread block that progressively computes an inclusive prefix max scan over multiple “tiles” of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 128 integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
struct BlockPrefixCallbackOp
{
    // Running prefix
    int running_total;

    // Constructor
    __device__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {}

    // Callback operator to be entered by the first warp of threads in the block.
    // Thread-0 is responsible for returning a value for seeding the block-wide scan.
    __device__ int operator()(int block_aggregate)
    {
        int old_prefix = running_total;
        running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix;
        return old_prefix;
    }
};

__global__ void ExampleKernel(int *d_data, int num_items, ...)
{
    // Specialize BlockScan for a 1D block of 128 threads
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Initialize running total
    BlockPrefixCallbackOp prefix_op(INT_MIN);

    // Have the block iterate over segments of items
    for (int block_offset = 0; block_offset < num_items; block_offset += 128)
    {
        // Load a segment of consecutive items that are blocked across threads
        int thread_data = d_data[block_offset + threadIdx.x];

        // Collectively compute the block-wide inclusive prefix max scan
        BlockScan(temp_storage).InclusiveScan(
            thread_data, thread_data, cuda::maximum<>{}, prefix_op);
        __syncthreads();

        // Store scanned items to output segment
        d_data[block_offset + threadIdx.x] = thread_data;
    }
}

Suppose the input d_data is 0, -1, 2, -3, 4, -5, .... The corresponding output for the first segment will be 0, 0, 2, 2, ..., 126, 126. The output for the second segment will be 128, 128, 130, 130, ..., 254, 254.

Template Parameters
  • ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

  • BlockPrefixCallbackOp[inferred] Call-back functor type having member T operator()(T block_aggregate)

Parameters
  • input[in] Calling thread’s input item

  • output[out] Calling thread’s output item (may be aliased to input)

  • scan_op[in] Binary scan functor

  • block_prefix_callback_op[inout] warp0-only call-back functor for specifying a block-wide prefix to be applied to the logical input sequence

Inclusive prefix scan operations (multiple data per thread)

template<int ITEMS_PER_THREAD, typename ScanOp>
inline void InclusiveScan(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], ScanOp scan_op)

Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements.

  • Supports non-commutative scan operators.

  • Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread i owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates an inclusive prefix max scan of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockScan for a 1D block of 128 threads of type int
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute the block-wide inclusive prefix max scan
    BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cuda::maximum<>{});

Suppose the set of input thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The corresponding output thread_data in those threads will be { [0,0,2,2], [4,4,6,6], ..., [508,508,510,510] }.

Template Parameters
  • ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.

  • ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

Parameters
  • input[in] Calling thread’s input items

  • output[out] Calling thread’s output items (may be aliased to input)

  • scan_op[in] Binary scan functor

template<int ITEMS_PER_THREAD, typename ScanOp>
inline void InclusiveScan(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], T initial_value, ScanOp scan_op)

Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements.

  • Supports non-commutative scan operators.

  • Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread i owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates an inclusive prefix max scan of 128 integer items that are partitioned in a blocked arrangement across 64 threads where each thread owns 2 consecutive items.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

__global__ void InclusiveBlockScanKernel(int* output)
{
  // Specialize BlockScan for a 1D block of 64 threads of type int
  using block_scan_t   = cub::BlockScan<int, 64>;
  using temp_storage_t = block_scan_t::TempStorage;

  // Allocate shared memory for BlockScan
  __shared__ temp_storage_t temp_storage;

  constexpr int num_items_per_thread = 2;
  int initial_value = 1;
  int thread_data[] = {
    +1 * ((int) threadIdx.x * num_items_per_thread),    // item 0
    -1 * ((int) threadIdx.x * num_items_per_thread + 1) // item 1
  };
  // input: {[0, -1], [2, -3], [4, -5], ... [126, -127]}

  // Collectively compute the block-wide inclusive max scan
  block_scan_t(temp_storage).InclusiveScan(thread_data, thread_data, initial_value, ::cuda::maximum<>{});

  // output: {[1, 1], [2, 2], [4, 4], ... [126, 126]}
  // ...

Template Parameters
  • ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.

  • ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

Parameters
  • input[in] Calling thread’s input items

  • output[out] Calling thread’s output items (may be aliased to input)

  • initial_value[in] Initial value to seed the inclusive scan (uniform across block)

  • scan_op[in] Binary scan functor

template<int ITEMS_PER_THREAD, typename ScanOp>
inline void InclusiveScan(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], ScanOp scan_op, T &block_aggregate)

Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs.

  • Supports non-commutative scan operators.

  • Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread i owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates an inclusive prefix max scan of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockScan for a 1D block of 128 threads of type int
    using BlockScan = cub::BlockScan<int, 128>;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute the block-wide inclusive prefix max scan
    int block_aggregate;
    BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cuda::maximum<>{}, block_aggregate);

Suppose the set of input thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The corresponding output thread_data in those threads will be { [0,0,2,2], [4,4,6,6], ..., [508,508,510,510] }. Furthermore the value 510 will be stored in block_aggregate for all threads.

Template Parameters
  • ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.

  • ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

Parameters
  • input[in] Calling thread’s input items

  • output[out] Calling thread’s output items (may be aliased to input)

  • scan_op[in] Binary scan functor

  • block_aggregate[out] Block-wide aggregate reduction of input items

template<int ITEMS_PER_THREAD, typename ScanOp>
inline void InclusiveScan(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], T initial_value, ScanOp scan_op, T &block_aggregate)

Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide block_aggregate of all inputs.

  • Supports non-commutative scan operators.

  • Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread i owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates an inclusive prefix max scan of 128 integer items that are partitioned in a blocked arrangement across 64 threads where each thread owns 2 consecutive items.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

__global__ void InclusiveBlockScanKernelAggregate(int* output, int* d_block_aggregate)
{
  // Specialize BlockScan for a 1D block of 64 threads of type int
  using block_scan_t   = cub::BlockScan<int, 64>;
  using temp_storage_t = block_scan_t::TempStorage;

  // Allocate shared memory for BlockScan
  __shared__ temp_storage_t temp_storage;

  constexpr int num_items_per_thread = 2;
  int initial_value = 1;
  int thread_data[] = {
    +1 * ((int) threadIdx.x * num_items_per_thread),    // item 0
    -1 * ((int) threadIdx.x * num_items_per_thread + 1) // item 1
  };
  // input: {[0, -1], [2, -3], [4, -5], ... [126, -127]}

  // Collectively compute the block-wide inclusive max scan
  int block_aggregate;
  block_scan_t(temp_storage)
    .InclusiveScan(thread_data, thread_data, initial_value, ::cuda::maximum<>{}, block_aggregate);

  // output: {[1, 1], [2, 2], [4, 4], ... [126, 126]}
  // block_aggregate = 126
  // ...

The value 126 will be stored in block_aggregate for all threads.

Template Parameters
  • ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.

  • ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

Parameters
  • input[in] Calling thread’s input items

  • output[out] Calling thread’s output items (may be aliased to input)

  • initial_value[in] Initial value to seed the inclusive scan (uniform across block). It is not taken into account for block_aggregate.

  • scan_op[in] Binary scan functor

  • block_aggregate[out] Block-wide aggregate reduction of input items

template<int ITEMS_PER_THREAD, typename ScanOp, typename BlockPrefixCallbackOp>
inline void InclusiveScan(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op)

Computes an inclusive block-wide prefix scan using the specified binary scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor block_prefix_callback_op is invoked by the first warp in the block, and the value returned by lane 0 in that warp is used as the “seed” value that logically prefixes the thread block’s scan inputs. The block-wide aggregate of all inputs is supplied to the call-back functor rather than returned to each thread.

  • The block_prefix_callback_op functor must implement a member function T operator()(T block_aggregate). The functor’s input parameter block_aggregate is the block-wide aggregate of the scan inputs. The functor will be invoked by the first warp of threads in the block; however, only the return value from lane 0 is applied as the block-wide prefix. The functor can be stateful.

  • Supports non-commutative scan operators.

  • Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread i owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a single thread block that progressively computes an inclusive prefix max scan over multiple “tiles” of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 128 integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>

// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
struct BlockPrefixCallbackOp
{
    // Running prefix
    int running_total;

    // Constructor
    __device__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {}

    // Callback operator to be entered by the first warp of threads in the block.
    // Thread-0 is responsible for returning a value for seeding the block-wide scan.
    __device__ int operator()(int block_aggregate)
    {
        int old_prefix = running_total;
        running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix;
        return old_prefix;
    }
};

__global__ void ExampleKernel(int *d_data, int num_items, ...)
{
    // Specialize BlockLoad, BlockStore, and BlockScan for a 1D block of 128 threads, 4 ints per thread
    using BlockLoad  = cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_TRANSPOSE>;
    using BlockStore = cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_TRANSPOSE>;
    using BlockScan  = cub::BlockScan<int, 128>;

    // Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan
    __shared__ union {
        typename BlockLoad::TempStorage     load;
        typename BlockScan::TempStorage     scan;
        typename BlockStore::TempStorage    store;
    } temp_storage;

    // Initialize running total
    BlockPrefixCallbackOp prefix_op(0);

    // Have the block iterate over segments of items
    for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4)
    {
        // Load a segment of consecutive items that are blocked across threads
        int thread_data[4];
        BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data);
        __syncthreads();

        // Collectively compute the block-wide inclusive prefix max scan
        BlockScan(temp_storage.scan).InclusiveScan(
            thread_data, thread_data, cuda::maximum<>{}, prefix_op);
        __syncthreads();

        // Store scanned items to output segment
        BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data);
        __syncthreads();
    }
}

Suppose the input d_data is 0, -1, 2, -3, 4, -5, .... The corresponding output for the first segment will be 0, 0, 2, 2, 4, 4, ..., 510, 510. The output for the second segment will be 512, 512, 514, 514, 516, 516, ..., 1022, 1022.

Template Parameters
  • ITEMS_PER_THREAD[inferred] The number of consecutive items partitioned onto each thread.

  • ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

  • BlockPrefixCallbackOp[inferred] Call-back functor type having member T operator()(T block_aggregate)

Parameters
  • input[in] Calling thread’s input items

  • output[out] Calling thread’s output items (may be aliased to input)

  • scan_op[in] Binary scan functor

  • block_prefix_callback_op[inout] warp0-only call-back functor for specifying a block-wide prefix to be applied to the logical input sequence

struct TempStorage : public Uninitialized<_TempStorage>

The operations exposed by BlockScan require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union’d with other storage allocation types to facilitate memory reuse.