cub::BlockScan
Defined in cub/block/block_scan.cuh
-
template<typename T, int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int LEGACY_PTX_ARCH = 0>
class BlockScan The BlockScan class provides collective methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block.
Overview
Given a list of input elements and a binary reduction operator, a prefix scan produces an output list where each element is computed to be the reduction of the elements occurring earlier in the input list. Prefix sum connotes a prefix scan with the addition operator. The term inclusive indicates that the ith output reduction incorporates the ith input. The term exclusive indicates the ith input is not incorporated into the ith output reduction.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
BlockScan can be optionally specialized by algorithm to accommodate different workload profiles:
cub::BLOCK_SCAN_RAKING
: An efficient (high throughput) “raking reduce-then-scan” prefix scan algorithm.cub::BLOCK_SCAN_RAKING_MEMOIZE
: Similar to cub::BLOCK_SCAN_RAKING, but having higher throughput at the expense of additional register pressure for intermediate storage.cub::BLOCK_SCAN_WARP_SCANS
: A quick (low latency) “tiled warpscans” prefix scan algorithm.
Performance Considerations
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.Uses special instructions when applicable (e.g., warp
SHFL
)Uses synchronization-free communication between warp lanes when applicable
Invokes a minimal number of minimal block-wide synchronization barriers (only one or two depending on algorithm selection)
Incurs zero bank conflicts for most types
Computation is slightly more efficient (i.e., having lower instruction overhead) for:
Prefix sum variants (vs. generic scan)
The number of threads in the block is a multiple of the architecture’s warp size
See cub::BlockScanAlgorithm for performance details regarding algorithmic alternatives
A Simple Example
Every thread in the block uses the BlockScan class by first specializing the BlockScan type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.
The code snippet below illustrates an exclusive prefix sum of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> __global__ void ExampleKernel(...) { // Specialize BlockScan for a 1D block of 128 threads of type int using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Obtain a segment of consecutive items that are blocked across threads int thread_data[4]; ... // Collectively compute the block-wide exclusive prefix sum BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);
Suppose the set of input
thread_data
across the block of threads is{[1,1,1,1], [1,1,1,1], ..., [1,1,1,1]}
. The corresponding outputthread_data
in those threads will be{[0,1,2,3], [4,5,6,7], ..., [508,509,510,511]}
.- Template Parameters
T – Data type being scanned
BLOCK_DIM_X – The thread block length in threads along the X dimension
ALGORITHM – [optional] cub::BlockScanAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_SCAN_RAKING)
BLOCK_DIM_Y – [optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z – [optional] The thread block length in threads along the Z dimension (default: 1)
LEGACY_PTX_ARCH – [optional] Unused.
Collective constructors
-
inline BlockScan()
Collective constructor using a private static allocation of shared memory as temporary storage.
-
inline BlockScan(TempStorage &temp_storage)
Collective constructor using the specified memory allocation as temporary storage.
- Parameters
temp_storage – [in] Reference to memory allocation having layout type TempStorage
Exclusive prefix sum operations
-
inline void ExclusiveSum(T input, T &output)
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. The value of 0 is applied as the initial value, and is assigned to
output
in thread0.This operation assumes the value of obtained by the
T
’s default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value zero for addition.For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates an exclusive prefix sum of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> __global__ void ExampleKernel(...) { // Specialize BlockScan for a 1D block of 128 threads of type int using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Obtain input item for each thread int thread_data; ... // Collectively compute the block-wide exclusive prefix sum BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);
Suppose the set of input
thread_data
across the block of threads is1, 1, ..., 1
. The corresponding outputthread_data
in those threads will be0, 1, ..., 127
.- Parameters
input – [in] Calling thread’s input item
output – [out] Calling thread’s output item (may be aliased to
input
)
-
inline void ExclusiveSum(T input, T &output, T &block_aggregate)
Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. The value of 0 is applied as the initial value, and is assigned to
output
in thread0. Also provides every thread with the block-wideblock_aggregate
of all inputs.This operation assumes the value of obtained by the
T
’s default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value zero for addition.For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates an exclusive prefix sum of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> __global__ void ExampleKernel(...) { // Specialize BlockScan for a 1D block of 128 threads of type int using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Obtain input item for each thread int thread_data; ... // Collectively compute the block-wide exclusive prefix sum int block_aggregate; BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data, block_aggregate);
Suppose the set of input
thread_data
across the block of threads is1, 1, ..., 1
. The corresponding outputthread_data
in those threads will be0, 1, ..., 127
. Furthermore the value128
will be stored inblock_aggregate
for all threads.- Parameters
input – [in] Calling thread’s input item
output – [out] Calling thread’s output item (may be aliased to
input
)block_aggregate – [out] block-wide aggregate reduction of input items
-
template<typename BlockPrefixCallbackOp>
inline void ExclusiveSum(T input, T &output, BlockPrefixCallbackOp &block_prefix_callback_op) Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor
block_prefix_callback_op
is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the “seed” value that logically prefixes the thread block’s scan inputs. Also provides every thread with the block-wideblock_aggregate
of all inputs.This operation assumes the value of obtained by the
T
’s default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value zero for addition.The
block_prefix_callback_op
functor must implement a member functionT operator()(T block_aggregate)
. The functor’s input parameterblock_aggregate
is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block, however only the return value from lane0 is applied as the block-wide prefix. Can be stateful.For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates a single thread block that progressively computes an exclusive prefix sum over multiple “tiles” of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> // A stateful callback functor that maintains a running prefix to be applied // during consecutive scan operations. struct BlockPrefixCallbackOp { // Running prefix int running_total; // Constructor __host__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {} // Callback operator to be entered by the first warp of threads in the block. // Thread-0 is responsible for returning a value for seeding the block-wide scan. __host__ int operator()(int block_aggregate) { int old_prefix = running_total; running_total += block_aggregate; return old_prefix; } }; __global__ void ExampleKernel(int *d_data, int num_items, ...) { // Specialize BlockScan for a 1D block of 128 threads using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Initialize running total BlockPrefixCallbackOp prefix_op(0); // Have the block iterate over segments of items for (int block_offset = 0; block_offset < num_items; block_offset += 128) { // Load a segment of consecutive items that are blocked across threads int thread_data = d_data[block_offset]; // Collectively compute the block-wide exclusive prefix sum BlockScan(temp_storage).ExclusiveSum( thread_data, thread_data, prefix_op); __syncthreads(); // Store scanned items to output segment d_data[block_offset] = thread_data; }
Suppose the input
d_data
is1, 1, 1, 1, 1, 1, 1, 1, ...
. The corresponding output for the first segment will be0, 1, ..., 127
. The output for the second segment will be128, 129, ..., 255
.- Template Parameters
BlockPrefixCallbackOp – [inferred] Call-back functor type having member
T operator()(T block_aggregate)
- Parameters
input – [in] Calling thread’s input item
output – [out] Calling thread’s output item (may be aliased to
input
)block_prefix_callback_op – [inout]
warp0 only call-back functor for specifying a block-wide prefix to be applied to the logical input sequence.
Exclusive prefix sum operations (multiple data per thread)
-
template<int ITEMS_PER_THREAD>
inline void ExclusiveSum(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD]) Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. The value of 0 is applied as the initial value, and is assigned to
output[0]
in thread0.This operation assumes the value of obtained by the
T
’s default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value zero for addition.Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates an exclusive prefix sum of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> __global__ void ExampleKernel(...) { // Specialize BlockScan for a 1D block of 128 threads of type int using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Obtain a segment of consecutive items that are blocked across threads int thread_data[4]; ... // Collectively compute the block-wide exclusive prefix sum BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);
Suppose the set of input
thread_data
across the block of threads is{ [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }
. The corresponding outputthread_data
in those threads will be{ [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }
.- Template Parameters
ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.
- Parameters
input – [in] Calling thread’s input items
output – [out] Calling thread’s output items (may be aliased to
input
)
-
template<int ITEMS_PER_THREAD>
inline void ExclusiveSum(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], T &block_aggregate) Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. The value of 0 is applied as the initial value, and is assigned to
output[0]
in thread0. Also provides every thread with the block-wideblock_aggregate
of all inputs.This operation assumes the value of obtained by the
T
’s default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value zero for addition.Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates an exclusive prefix sum of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> __global__ void ExampleKernel(...) { // Specialize BlockScan for a 1D block of 128 threads of type int using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Obtain a segment of consecutive items that are blocked across threads int thread_data[4]; ... // Collectively compute the block-wide exclusive prefix sum int block_aggregate; BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data, block_aggregate);
Suppose the set of input
thread_data
across the block of threads is{ [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }
. The corresponding outputthread_data
in those threads will be{ [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }
. Furthermore the value512
will be stored inblock_aggregate
for all threads.- Template Parameters
ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.
- Parameters
input – [in] Calling thread’s input items
output – [out] Calling thread’s output items (may be aliased to
input
)block_aggregate – [out] block-wide aggregate reduction of input items
-
template<int ITEMS_PER_THREAD, typename BlockPrefixCallbackOp>
inline void ExclusiveSum(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], BlockPrefixCallbackOp &block_prefix_callback_op) Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor
block_prefix_callback_op
is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the “seed” value that logically prefixes the thread block’s scan inputs. Also provides every thread with the block-wideblock_aggregate
of all inputs.This operation assumes the value of obtained by the
T
’s default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value zero for addition.The
block_prefix_callback_op
functor must implement a member functionT operator()(T block_aggregate)
. The functor’s input parameterblock_aggregate
is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block, however only the return value from lane0 is applied as the block-wide prefix. Can be stateful.Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates a single thread block that progressively computes an exclusive prefix sum over multiple “tiles” of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> // A stateful callback functor that maintains a running prefix to be applied // during consecutive scan operations. struct BlockPrefixCallbackOp { // Running prefix int running_total; // Constructor __host__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {} // Callback operator to be entered by the first warp of threads in the block. // Thread-0 is responsible for returning a value for seeding the block-wide scan. __host__ int operator()(int block_aggregate) { int old_prefix = running_total; running_total += block_aggregate; return old_prefix; } }; __global__ void ExampleKernel(int *d_data, int num_items, ...) { // Specialize BlockLoad, BlockStore, and BlockScan for a 1D block of 128 threads, 4 ints per thread using BlockLoad = cub::BlockLoad<int*, 128, 4, BLOCK_LOAD_TRANSPOSE>; using BlockStore = cub::BlockStore<int, 128, 4, BLOCK_STORE_TRANSPOSE>; using BlockScan = cub::BlockScan<int, 128>; // Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan __shared__ union { typename BlockLoad::TempStorage load; typename BlockScan::TempStorage scan; typename BlockStore::TempStorage store; } temp_storage; // Initialize running total BlockPrefixCallbackOp prefix_op(0); // Have the block iterate over segments of items for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4) { // Load a segment of consecutive items that are blocked across threads int thread_data[4]; BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data); __syncthreads(); // Collectively compute the block-wide exclusive prefix sum int block_aggregate; BlockScan(temp_storage.scan).ExclusiveSum( thread_data, thread_data, prefix_op); __syncthreads(); // Store scanned items to output segment BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data); __syncthreads(); }
Suppose the input
d_data
is1, 1, 1, 1, 1, 1, 1, 1, ...
. The corresponding output for the first segment will be0, 1, 2, 3, ..., 510, 511
. The output for the second segment will be512, 513, 514, 515, ..., 1022, 1023
.- Template Parameters
ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.
BlockPrefixCallbackOp – [inferred] Call-back functor type having member
T operator()(T block_aggregate)
- Parameters
input – [in] Calling thread’s input items
output – [out] Calling thread’s output items (may be aliased to
input
)block_prefix_callback_op – [inout]
warp0 only call-back functor for specifying a block-wide prefix to be applied to the logical input sequence.
Exclusive prefix scan operations
-
template<typename ScanOp>
inline void ExclusiveScan(T input, T &output, T initial_value, ScanOp scan_op) Computes an exclusive block-wide prefix scan using the specified binary
scan_op
functor. Each thread contributes one input element.Supports non-commutative scan operators.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates an exclusive prefix max scan of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> __global__ void ExampleKernel(...) { // Specialize BlockScan for a 1D block of 128 threads of type int using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Obtain input item for each thread int thread_data; ... // Collectively compute the block-wide exclusive prefix max scan BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cuda::maximum<>{});
Suppose the set of input
thread_data
across the block of threads is0, -1, 2, -3, ..., 126, -127
. The corresponding outputthread_data
in those threads will beINT_MIN, 0, 0, 2, ..., 124, 126
.- Template Parameters
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
- Parameters
input – [in] Calling thread’s input item
output – [out] Calling thread’s output item (may be aliased to
input
)initial_value – [in]
Initial value to seed the exclusive scan (and is assigned to output[0] in thread0)
scan_op – [in] Binary scan functor
-
template<typename ScanOp>
inline void ExclusiveScan(T input, T &output, T initial_value, ScanOp scan_op, T &block_aggregate) Computes an exclusive block-wide prefix scan using the specified binary
scan_op
functor. Each thread contributes one input element. Also provides every thread with the block-wideblock_aggregate
of all inputs.Supports non-commutative scan operators.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates an exclusive prefix max scan of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> __global__ void ExampleKernel(...) { // Specialize BlockScan for a 1D block of 128 threads of type int using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Obtain input item for each thread int thread_data; ... // Collectively compute the block-wide exclusive prefix max scan int block_aggregate; BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cuda::maximum<>{}, block_aggregate);
Suppose the set of input
thread_data
across the block of threads is0, -1, 2, -3, ..., 126, -127
. The corresponding outputthread_data
in those threads will beINT_MIN, 0, 0, 2, ..., 124, 126
. Furthermore the value126
will be stored inblock_aggregate
for all threads.- Template Parameters
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
- Parameters
input – [in] Calling thread’s input items
output – [out] Calling thread’s output items (may be aliased to
input
)initial_value – [in]
Initial value to seed the exclusive scan (and is assigned to
output[0]
in thread0)scan_op – [in] Binary scan functor
block_aggregate – [out] block-wide aggregate reduction of input items
-
template<typename ScanOp, typename BlockPrefixCallbackOp>
inline void ExclusiveScan(T input, T &output, ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op) Computes an exclusive block-wide prefix scan using the specified binary
scan_op
functor. Each thread contributes one input element. The call-back functorblock_prefix_callback_op
is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the “seed” value that logically prefixes the thread block’s scan inputs. Also provides every thread with the block-wideblock_aggregate
of all inputs.The
block_prefix_callback_op
functor must implement a member functionT operator()(T block_aggregate)
. The functor’s input parameterblock_aggregate
is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block, however only the return value from lane0 is applied as the block-wide prefix. Can be stateful.Supports non-commutative scan operators.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates a single thread block that progressively computes an exclusive prefix max scan over multiple “tiles” of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> // A stateful callback functor that maintains a running prefix to be applied // during consecutive scan operations. struct BlockPrefixCallbackOp { // Running prefix int running_total; // Constructor __host__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {} // Callback operator to be entered by the first warp of threads in the block. // Thread-0 is responsible for returning a value for seeding the block-wide scan. __host__ int operator()(int block_aggregate) { int old_prefix = running_total; running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix; return old_prefix; } }; __global__ void ExampleKernel(int *d_data, int num_items, ...) { // Specialize BlockScan for a 1D block of 128 threads using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Initialize running total BlockPrefixCallbackOp prefix_op(INT_MIN); // Have the block iterate over segments of items for (int block_offset = 0; block_offset < num_items; block_offset += 128) { // Load a segment of consecutive items that are blocked across threads int thread_data = d_data[block_offset]; // Collectively compute the block-wide exclusive prefix max scan BlockScan(temp_storage).ExclusiveScan( thread_data, thread_data, INT_MIN, cuda::maximum<>{}, prefix_op); __syncthreads(); // Store scanned items to output segment d_data[block_offset] = thread_data; }
Suppose the input
d_data
is0, -1, 2, -3, 4, -5, ...
. The corresponding output for the first segment will beINT_MIN, 0, 0, 2, ..., 124, 126
. The output for the second segment will be126, 128, 128, 130, ..., 252, 254
.- Template Parameters
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
BlockPrefixCallbackOp – [inferred] Call-back functor type having member
T operator()(T block_aggregate)
- Parameters
input – [in] Calling thread’s input item
output – [out] Calling thread’s output item (may be aliased to
input
)scan_op – [in] Binary scan functor
block_prefix_callback_op – [inout]
warp0 only call-back functor for specifying a block-wide prefix to be applied to the logical input sequence.
Exclusive prefix scan operations (multiple data per thread)
-
template<int ITEMS_PER_THREAD, typename ScanOp>
inline void ExclusiveScan(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], T initial_value, ScanOp scan_op) Computes an exclusive block-wide prefix scan using the specified binary
scan_op
functor. Each thread contributes an array of consecutive input elements.Supports non-commutative scan operators.
Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates an exclusive prefix max scan of 512 integer items that are partitioned in a [<em>blocked arrangement</em>](../index.html#sec5sec3) across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> __global__ void ExampleKernel(...) { // Specialize BlockScan for a 1D block of 128 threads of type int using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Obtain a segment of consecutive items that are blocked across threads int thread_data[4]; ... // Collectively compute the block-wide exclusive prefix max scan BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cuda::maximum<>{});
Suppose the set of input
thread_data
across the block of threads is{ [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }
. The corresponding outputthread_data
in those threads will be{ [INT_MIN,0,0,2], [2,4,4,6], ..., [506,508,508,510] }
.- Template Parameters
ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
- Parameters
input – [in] Calling thread’s input items
output – [out] Calling thread’s output items (may be aliased to
input
)initial_value – [in]
Initial value to seed the exclusive scan (and is assigned to output[0] in thread0)
scan_op – [in] Binary scan functor
-
template<int ITEMS_PER_THREAD, typename ScanOp>
inline void ExclusiveScan(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], T initial_value, ScanOp scan_op, T &block_aggregate) Computes an exclusive block-wide prefix scan using the specified binary
scan_op
functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wideblock_aggregate
of all inputs.Supports non-commutative scan operators.
Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates an exclusive prefix max scan of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> __global__ void ExampleKernel(...) { // Specialize BlockScan for a 1D block of 128 threads of type int using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Obtain a segment of consecutive items that are blocked across threads int thread_data[4]; ... // Collectively compute the block-wide exclusive prefix max scan int block_aggregate; BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cuda::maximum<>{}, block_aggregate);
Suppose the set of input
thread_data
across the block of threads is{ [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }
. The corresponding outputthread_data
in those threads will be{ [INT_MIN,0,0,2], [2,4,4,6], ..., [506,508,508,510] }
. Furthermore the value510
will be stored inblock_aggregate
for all threads.- Template Parameters
ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
- Parameters
input – [in] Calling thread’s input items
output – [out] Calling thread’s output items (may be aliased to
input
)initial_value – [in]
Initial value to seed the exclusive scan (and is assigned to output[0] in thread0)
scan_op – [in] Binary scan functor
block_aggregate – [out] block-wide aggregate reduction of input items
-
template<int ITEMS_PER_THREAD, typename ScanOp, typename BlockPrefixCallbackOp>
inline void ExclusiveScan(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op) Computes an exclusive block-wide prefix scan using the specified binary
scan_op
functor. Each thread contributes an array of consecutive input elements. The call-back functorblock_prefix_callback_op
is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the “seed” value that logically prefixes the thread block’s scan inputs. Also provides every thread with the block-wideblock_aggregate
of all inputs.The
block_prefix_callback_op
functor must implement a member functionT operator()(T block_aggregate)
. The functor’s input parameterblock_aggregate
is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block, however only the return value from lane0 is applied as the block-wide prefix. Can be stateful.Supports non-commutative scan operators.
Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates a single thread block that progressively computes an exclusive prefix max scan over multiple “tiles” of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> // A stateful callback functor that maintains a running prefix to be applied // during consecutive scan operations. struct BlockPrefixCallbackOp { // Running prefix int running_total; // Constructor __host__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {} // Callback operator to be entered by the first warp of threads in the block. // Thread-0 is responsible for returning a value for seeding the block-wide scan. __host__ int operator()(int block_aggregate) { int old_prefix = running_total; running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix; return old_prefix; } }; __global__ void ExampleKernel(int *d_data, int num_items, ...) { // Specialize BlockLoad, BlockStore, and BlockScan for a 1D block of 128 threads, 4 ints per thread using BlockLoad = cub::BlockLoad<int*, 128, 4, BLOCK_LOAD_TRANSPOSE> ; using BlockStore = cub::BlockStore<int, 128, 4, BLOCK_STORE_TRANSPOSE> ; using BlockScan = cub::BlockScan<int, 128> ; // Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan __shared__ union { typename BlockLoad::TempStorage load; typename BlockScan::TempStorage scan; typename BlockStore::TempStorage store; } temp_storage; // Initialize running total BlockPrefixCallbackOp prefix_op(0); // Have the block iterate over segments of items for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4) { // Load a segment of consecutive items that are blocked across threads int thread_data[4]; BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data); __syncthreads(); // Collectively compute the block-wide exclusive prefix max scan BlockScan(temp_storage.scan).ExclusiveScan( thread_data, thread_data, INT_MIN, cuda::maximum<>{}, prefix_op); __syncthreads(); // Store scanned items to output segment BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data); __syncthreads(); }
Suppose the input
d_data
is0, -1, 2, -3, 4, -5, ...
. The corresponding output for the first segment will beINT_MIN, 0, 0, 2, 2, 4, ..., 508, 510
. The output for the second segment will be510, 512, 512, 514, 514, 516, ..., 1020, 1022
.- Template Parameters
ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
BlockPrefixCallbackOp – [inferred] Call-back functor type having member
T operator()(T block_aggregate)
- Parameters
input – [in] Calling thread’s input items
output – [out] Calling thread’s output items (may be aliased to
input
)scan_op – [in] Binary scan functor
block_prefix_callback_op – [inout]
warp0 only call-back functor for specifying a block-wide prefix to be applied to the logical input sequence.
Inclusive prefix sum operations
-
inline void InclusiveSum(T input, T &output)
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates an inclusive prefix sum of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> __global__ void ExampleKernel(...) { // Specialize BlockScan for a 1D block of 128 threads of type int using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Obtain input item for each thread int thread_data; ... // Collectively compute the block-wide inclusive prefix sum BlockScan(temp_storage).InclusiveSum(thread_data, thread_data);
Suppose the set of input
thread_data
across the block of threads is1, 1, ..., 1
. The corresponding outputthread_data
in those threads will be1, 2, ..., 128
.- Parameters
input – [in] Calling thread’s input item
output – [out] Calling thread’s output item (may be aliased to
input
)
-
inline void InclusiveSum(T input, T &output, T &block_aggregate)
Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide
block_aggregate
of all inputs.For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates an inclusive prefix sum of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> __global__ void ExampleKernel(...) { // Specialize BlockScan for a 1D block of 128 threads of type int using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Obtain input item for each thread int thread_data; ... // Collectively compute the block-wide inclusive prefix sum int block_aggregate; BlockScan(temp_storage).InclusiveSum(thread_data, thread_data, block_aggregate);
Suppose the set of input
thread_data
across the block of threads is1, 1, ..., 1
. The corresponding outputthread_data
in those threads will be1, 2, ..., 128
. Furthermore the value128
will be stored inblock_aggregate
for all threads.- Parameters
input – [in] Calling thread’s input item
output – [out] Calling thread’s output item (may be aliased to
input
)block_aggregate – [out] block-wide aggregate reduction of input items
-
template<typename BlockPrefixCallbackOp>
inline void InclusiveSum(T input, T &output, BlockPrefixCallbackOp &block_prefix_callback_op) Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor
block_prefix_callback_op
is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the “seed” value that logically prefixes the thread block’s scan inputs. Also provides every thread with the block-wideblock_aggregate
of all inputs.The
block_prefix_callback_op
functor must implement a member functionT operator()(T block_aggregate)
. The functor’s input parameterblock_aggregate
is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block, however only the return value from lane0 is applied as the block-wide prefix. Can be stateful.For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates a single thread block that progressively computes an inclusive prefix sum over multiple “tiles” of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> // A stateful callback functor that maintains a running prefix to be applied // during consecutive scan operations. struct BlockPrefixCallbackOp { // Running prefix int running_total; // Constructor __host__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {} // Callback operator to be entered by the first warp of threads in the block. // Thread-0 is responsible for returning a value for seeding the block-wide scan. __host__ int operator()(int block_aggregate) { int old_prefix = running_total; running_total += block_aggregate; return old_prefix; } }; __global__ void ExampleKernel(int *d_data, int num_items, ...) { // Specialize BlockScan for a 1D block of 128 threads using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Initialize running total BlockPrefixCallbackOp prefix_op(0); // Have the block iterate over segments of items for (int block_offset = 0; block_offset < num_items; block_offset += 128) { // Load a segment of consecutive items that are blocked across threads int thread_data = d_data[block_offset]; // Collectively compute the block-wide inclusive prefix sum BlockScan(temp_storage).InclusiveSum( thread_data, thread_data, prefix_op); __syncthreads(); // Store scanned items to output segment d_data[block_offset] = thread_data; }
Suppose the input
d_data
is1, 1, 1, 1, 1, 1, 1, 1, ...
. The corresponding output for the first segment will be1, 2, ..., 128
. The output for the second segment will be129, 130, ..., 256
.- Template Parameters
BlockPrefixCallbackOp – [inferred] Call-back functor type having member
T operator()(T block_aggregate)
- Parameters
input – [in] Calling thread’s input item
output – [out] Calling thread’s output item (may be aliased to
input
)block_prefix_callback_op – [inout]
warp0 only call-back functor for specifying a block-wide prefix to be applied to the logical input sequence.
Inclusive prefix sum operations (multiple data per thread)
-
template<int ITEMS_PER_THREAD>
inline void InclusiveSum(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD]) Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements.
Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates an inclusive prefix sum of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> __global__ void ExampleKernel(...) { // Specialize BlockScan for a 1D block of 128 threads of type int using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Obtain a segment of consecutive items that are blocked across threads int thread_data[4]; ... // Collectively compute the block-wide inclusive prefix sum BlockScan(temp_storage).InclusiveSum(thread_data, thread_data);
Suppose the set of input
thread_data
across the block of threads is{ [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }
. The corresponding outputthread_data
in those threads will be{ [1,2,3,4], [5,6,7,8], ..., [509,510,511,512] }
.- Template Parameters
ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.
- Parameters
input – [in] Calling thread’s input items
output – [out] Calling thread’s output items (may be aliased to
input
)
-
template<int ITEMS_PER_THREAD>
inline void InclusiveSum(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], T &block_aggregate) Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide
block_aggregate
of all inputs.Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates an inclusive prefix sum of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> __global__ void ExampleKernel(...) { // Specialize BlockScan for a 1D block of 128 threads of type int using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Obtain a segment of consecutive items that are blocked across threads int thread_data[4]; ... // Collectively compute the block-wide inclusive prefix sum int block_aggregate; BlockScan(temp_storage).InclusiveSum(thread_data, thread_data, block_aggregate);
Suppose the set of input
thread_data
across the block of threads is{ [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }
. The corresponding outputthread_data
in those threads will be{ [1,2,3,4], [5,6,7,8], ..., [509,510,511,512] }
. Furthermore the value512
will be stored inblock_aggregate
for all threads.- Template Parameters
ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.
- Parameters
input – [in] Calling thread’s input items
output – [out] Calling thread’s output items (may be aliased to
input
)block_aggregate – [out] block-wide aggregate reduction of input items
-
template<int ITEMS_PER_THREAD, typename BlockPrefixCallbackOp>
inline void InclusiveSum(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], BlockPrefixCallbackOp &block_prefix_callback_op) Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor
block_prefix_callback_op
is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the “seed” value that logically prefixes the thread block’s scan inputs. Also provides every thread with the block-wideblock_aggregate
of all inputs.The
block_prefix_callback_op
functor must implement a member functionT operator()(T block_aggregate)
. The functor’s input parameterblock_aggregate
is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block, however only the return value from lane0 is applied as the block-wide prefix. Can be stateful.Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates a single thread block that progressively computes an inclusive prefix sum over multiple “tiles” of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> // A stateful callback functor that maintains a running prefix to be applied // during consecutive scan operations. struct BlockPrefixCallbackOp { // Running prefix int running_total; // Constructor __host__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {} // Callback operator to be entered by the first warp of threads in the block. // Thread-0 is responsible for returning a value for seeding the block-wide scan. __host__ int operator()(int block_aggregate) { int old_prefix = running_total; running_total += block_aggregate; return old_prefix; } }; __global__ void ExampleKernel(int *d_data, int num_items, ...) { // Specialize BlockLoad, BlockStore, and BlockScan for a 1D block of 128 threads, 4 ints per thread using BlockLoad = cub::BlockLoad<int*, 128, 4, BLOCK_LOAD_TRANSPOSE> ; using BlockStore = cub::BlockStore<int, 128, 4, BLOCK_STORE_TRANSPOSE> ; using BlockScan = cub::BlockScan<int, 128> ; // Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan __shared__ union { typename BlockLoad::TempStorage load; typename BlockScan::TempStorage scan; typename BlockStore::TempStorage store; } temp_storage; // Initialize running total BlockPrefixCallbackOp prefix_op(0); // Have the block iterate over segments of items for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4) { // Load a segment of consecutive items that are blocked across threads int thread_data[4]; BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data); __syncthreads(); // Collectively compute the block-wide inclusive prefix sum BlockScan(temp_storage.scan).IncluisveSum( thread_data, thread_data, prefix_op); __syncthreads(); // Store scanned items to output segment BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data); __syncthreads(); }
Suppose the input
d_data
is1, 1, 1, 1, 1, 1, 1, 1, ...
. The corresponding output for the first segment will be1, 2, 3, 4, ..., 511, 512
. The output for the second segment will be513, 514, 515, 516, ..., 1023, 1024
.- Template Parameters
ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.
BlockPrefixCallbackOp – [inferred] Call-back functor type having member
T operator()(T block_aggregate)
- Parameters
input – [in] Calling thread’s input items
output – [out] Calling thread’s output items (may be aliased to
input
)block_prefix_callback_op – [inout]
warp0 only call-back functor for specifying a block-wide prefix to be applied to the logical input sequence.
Inclusive prefix scan operations
-
template<typename ScanOp>
inline void InclusiveScan(T input, T &output, ScanOp scan_op) Computes an inclusive block-wide prefix scan using the specified binary
scan_op
functor. Each thread contributes one input element.Supports non-commutative scan operators.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates an inclusive prefix max scan of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> __global__ void ExampleKernel(...) { // Specialize BlockScan for a 1D block of 128 threads of type int using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Obtain input item for each thread int thread_data; ... // Collectively compute the block-wide inclusive prefix max scan BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cuda::maximum<>{});
Suppose the set of input
thread_data
across the block of threads is0, -1, 2, -3, ..., 126, -127
. The corresponding outputthread_data
in those threads will be0, 0, 2, 2, ..., 126, 126
.- Template Parameters
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
- Parameters
input – [in] Calling thread’s input item
output – [out] Calling thread’s output item (may be aliased to
input
)scan_op – [in] Binary scan functor
-
template<typename ScanOp>
inline void InclusiveScan(T input, T &output, ScanOp scan_op, T &block_aggregate) Computes an inclusive block-wide prefix scan using the specified binary
scan_op
functor. Each thread contributes one input element. Also provides every thread with the block-wideblock_aggregate
of all inputs.Supports non-commutative scan operators.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates an inclusive prefix max scan of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> __global__ void ExampleKernel(...) { // Specialize BlockScan for a 1D block of 128 threads of type int using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Obtain input item for each thread int thread_data; ... // Collectively compute the block-wide inclusive prefix max scan int block_aggregate; BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cuda::maximum<>{}, block_aggregate);
Suppose the set of input
thread_data
across the block of threads is0, -1, 2, -3, ..., 126, -127
. The corresponding outputthread_data
in those threads will be0, 0, 2, 2, ..., 126, 126
. Furthermore the value126
will be stored inblock_aggregate
for all threads.- Template Parameters
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
- Parameters
input – [in] Calling thread’s input item
output – [out] Calling thread’s output item (may be aliased to
input
)scan_op – [in] Binary scan functor
block_aggregate – [out] Block-wide aggregate reduction of input items
-
template<typename ScanOp, typename BlockPrefixCallbackOp>
inline void InclusiveScan(T input, T &output, ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op) Computes an inclusive block-wide prefix scan using the specified binary
scan_op
functor. Each thread contributes one input element. The call-back functorblock_prefix_callback_op
is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the “seed” value that logically prefixes the thread block’s scan inputs. Also provides every thread with the block-wideblock_aggregate
of all inputs.The
block_prefix_callback_op
functor must implement a member functionT operator()(T block_aggregate)
. The functor’s input parameterblock_aggregate
is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block, however only the return value from lane0 is applied as the block-wide prefix. Can be stateful.Supports non-commutative scan operators.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates a single thread block that progressively computes an inclusive prefix max scan over multiple “tiles” of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> // A stateful callback functor that maintains a running prefix to be applied // during consecutive scan operations. struct BlockPrefixCallbackOp { // Running prefix int running_total; // Constructor __host__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {} // Callback operator to be entered by the first warp of threads in the block. // Thread-0 is responsible for returning a value for seeding the block-wide scan. __host__ int operator()(int block_aggregate) { int old_prefix = running_total; running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix; return old_prefix; } }; __global__ void ExampleKernel(int *d_data, int num_items, ...) { // Specialize BlockScan for a 1D block of 128 threads using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Initialize running total BlockPrefixCallbackOp prefix_op(INT_MIN); // Have the block iterate over segments of items for (int block_offset = 0; block_offset < num_items; block_offset += 128) { // Load a segment of consecutive items that are blocked across threads int thread_data = d_data[block_offset]; // Collectively compute the block-wide inclusive prefix max scan BlockScan(temp_storage).InclusiveScan( thread_data, thread_data, cuda::maximum<>{}, prefix_op); __syncthreads(); // Store scanned items to output segment d_data[block_offset] = thread_data; }
Suppose the input
d_data
is0, -1, 2, -3, 4, -5, ...
. The corresponding output for the first segment will be0, 0, 2, 2, ..., 126, 126
. The output for the second segment will be128, 128, 130, 130, ..., 254, 254
.- Template Parameters
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
BlockPrefixCallbackOp – [inferred] Call-back functor type having member
T operator()(T block_aggregate)
- Parameters
input – [in] Calling thread’s input item
output – [out] Calling thread’s output item (may be aliased to
input
)scan_op – [in] Binary scan functor
block_prefix_callback_op – [inout]
warp0 only call-back functor for specifying a block-wide prefix to be applied to the logical input sequence.
Inclusive prefix scan operations (multiple data per thread)
-
template<int ITEMS_PER_THREAD, typename ScanOp>
inline void InclusiveScan(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], ScanOp scan_op) Computes an inclusive block-wide prefix scan using the specified binary
scan_op
functor. Each thread contributes an array of consecutive input elements.Supports non-commutative scan operators.
Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates an inclusive prefix max scan of 512 integer items that are partitioned in a [<em>blocked arrangement</em>](../index.html#sec5sec3) across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> __global__ void ExampleKernel(...) { // Specialize BlockScan for a 1D block of 128 threads of type int using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Obtain a segment of consecutive items that are blocked across threads int thread_data[4]; ... // Collectively compute the block-wide inclusive prefix max scan BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cuda::maximum<>{});
Suppose the set of input
thread_data
across the block of threads is{ [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }
. The corresponding outputthread_data
in those threads will be{ [0,0,2,2], [4,4,6,6], ..., [508,508,510,510] }
.- Template Parameters
ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
- Parameters
input – [in] Calling thread’s input items
output – [out] Calling thread’s output items (may be aliased to
input
)scan_op – [in] Binary scan functor
-
template<int ITEMS_PER_THREAD, typename ScanOp>
inline void InclusiveScan(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], T initial_value, ScanOp scan_op) Computes an inclusive block-wide prefix scan using the specified binary
scan_op
functor. Each thread contributes an array of consecutive input elements.Supports non-commutative scan operators.
Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates an inclusive prefix max scan of 128 integer items that are partitioned in a blocked arrangement across 64 threads where each thread owns 2 consecutive items.
__global__ void InclusiveBlockScanKernel(int* output) { // Specialize BlockScan for a 1D block of 64 threads of type int using block_scan_t = cub::BlockScan<int, 64>; using temp_storage_t = block_scan_t::TempStorage; // Allocate shared memory for BlockScan __shared__ temp_storage_t temp_storage; int initial_value = 1; int thread_data[] = { +1 * ((int) threadIdx.x * num_items_per_thread), // item 0 -1 * ((int) threadIdx.x * num_items_per_thread + 1) // item 1 }; // input: {[0, -1], [2, -3],[4, -5], ... [126, -127]} // Collectively compute the block-wide inclusive scan max block_scan_t(temp_storage).InclusiveScan(thread_data, thread_data, initial_value, ::cuda::maximum<>{}); // output: {[1, 1], [2, 2],[3, 3], ... [126, 126]} // ...
- Template Parameters
ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
- Parameters
input – [in] Calling thread’s input items
output – [out] Calling thread’s output items (may be aliased to
input
)initial_value – [in] Initial value to seed the inclusive scan (uniform across block)
scan_op – [in] Binary scan functor
-
template<int ITEMS_PER_THREAD, typename ScanOp>
inline void InclusiveScan(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], ScanOp scan_op, T &block_aggregate) Computes an inclusive block-wide prefix scan using the specified binary
scan_op
functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wideblock_aggregate
of all inputs.Supports non-commutative scan operators.
Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates an inclusive prefix max scan of 512 integer items that are partitioned in a [<em>blocked arrangement</em>](../index.html#sec5sec3) across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> __global__ void ExampleKernel(...) { // Specialize BlockScan for a 1D block of 128 threads of type int using BlockScan = cub::BlockScan<int, 128>; // Allocate shared memory for BlockScan __shared__ typename BlockScan::TempStorage temp_storage; // Obtain a segment of consecutive items that are blocked across threads int thread_data[4]; ... // Collectively compute the block-wide inclusive prefix max scan int block_aggregate; BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cuda::maximum<>{}, block_aggregate);
Suppose the set of input
thread_data
across the block of threads is{ [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }
. The corresponding outputthread_data
in those threads will be{ [0,0,2,2], [4,4,6,6], ..., [508,508,510,510] }
. Furthermore the value510
will be stored inblock_aggregate
for all threads.- Template Parameters
ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
- Parameters
input – [in] Calling thread’s input items
output – [out] Calling thread’s output items (may be aliased to
input
)scan_op – [in] Binary scan functor
block_aggregate – [out] Block-wide aggregate reduction of input items
-
template<int ITEMS_PER_THREAD, typename ScanOp>
inline void InclusiveScan(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], T initial_value, ScanOp scan_op, T &block_aggregate) Computes an inclusive block-wide prefix scan using the specified binary
scan_op
functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wideblock_aggregate
of all inputs.Supports non-commutative scan operators.
Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates an inclusive prefix max scan of 128 integer items that are partitioned in a blocked arrangement across 64 threads where each thread owns 2 consecutive items.
__global__ void InclusiveBlockScanKernelAggregate(int* output, int* d_block_aggregate) { // Specialize BlockScan for a 1D block of 64 threads of type int using block_scan_t = cub::BlockScan<int, 64>; using temp_storage_t = block_scan_t::TempStorage; // Allocate shared memory for BlockScan __shared__ temp_storage_t temp_storage; int initial_value = 1; int thread_data[] = { +1 * ((int) threadIdx.x * num_items_per_thread), // item 0 -1 * ((int) threadIdx.x * num_items_per_thread + 1) // item 1 }; // input: {[0, -1], [2, -3],[4, -5], ... [126, -127]} // Collectively compute the block-wide inclusive scan max int block_aggregate; block_scan_t(temp_storage) .InclusiveScan(thread_data, thread_data, initial_value, ::cuda::maximum<>{}, block_aggregate); // output: {[1, 1], [2, 2],[3, 3], ... [126, 126]} // block_aggregate = 126; // ...
The value
126
will be stored inblock_aggregate
for all threads.- Template Parameters
ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
- Parameters
input – [in] Calling thread’s input items
output – [out] Calling thread’s output items (may be aliased to
input
)initial_value – [in] Initial value to seed the inclusive scan (uniform across block). It is not taken into account for block_aggregate.
scan_op – [in] Binary scan functor
block_aggregate – [out] Block-wide aggregate reduction of input items
-
template<int ITEMS_PER_THREAD, typename ScanOp, typename BlockPrefixCallbackOp>
inline void InclusiveScan(T (&input)[ITEMS_PER_THREAD], T (&output)[ITEMS_PER_THREAD], ScanOp scan_op, BlockPrefixCallbackOp &block_prefix_callback_op) Computes an inclusive block-wide prefix scan using the specified binary
scan_op
functor. Each thread contributes an array of consecutive input elements. The call-back functorblock_prefix_callback_op
is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the “seed” value that logically prefixes the thread block’s scan inputs. Also provides every thread with the block-wideblock_aggregate
of all inputs.The
block_prefix_callback_op
functor must implement a member functionT operator()(T block_aggregate)
. The functor’s input parameterblock_aggregate
is the same value also returned by the scan operation. The functor will be invoked by the first warp of threads in the block, however only the return value from lane0 is applied as the block-wide prefix. Can be stateful.Supports non-commutative scan operators.
Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates a single thread block that progressively computes an inclusive prefix max scan over multiple “tiles” of input using a prefix functor to maintain a running total between block-wide scans. Each tile consists of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh> // A stateful callback functor that maintains a running prefix to be applied // during consecutive scan operations. struct BlockPrefixCallbackOp { // Running prefix int running_total; // Constructor __host__ BlockPrefixCallbackOp(int running_total) : running_total(running_total) {} // Callback operator to be entered by the first warp of threads in the block. // Thread-0 is responsible for returning a value for seeding the block-wide scan. __host__ int operator()(int block_aggregate) { int old_prefix = running_total; running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix; return old_prefix; } }; __global__ void ExampleKernel(int *d_data, int num_items, ...) { // Specialize BlockLoad, BlockStore, and BlockScan for a 1D block of 128 threads, 4 ints per thread using BlockLoad = cub::BlockLoad<int*, 128, 4, BLOCK_LOAD_TRANSPOSE> ; using BlockStore = cub::BlockStore<int, 128, 4, BLOCK_STORE_TRANSPOSE> ; using BlockScan = cub::BlockScan<int, 128> ; // Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan __shared__ union { typename BlockLoad::TempStorage load; typename BlockScan::TempStorage scan; typename BlockStore::TempStorage store; } temp_storage; // Initialize running total BlockPrefixCallbackOp prefix_op(0); // Have the block iterate over segments of items for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4) { // Load a segment of consecutive items that are blocked across threads int thread_data[4]; BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data); __syncthreads(); // Collectively compute the block-wide inclusive prefix max scan BlockScan(temp_storage.scan).InclusiveScan( thread_data, thread_data, cuda::maximum<>{}, prefix_op); __syncthreads(); // Store scanned items to output segment BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data); __syncthreads(); }
Suppose the input
d_data
is0, -1, 2, -3, 4, -5, ...
. The corresponding output for the first segment will be0, 0, 2, 2, 4, 4, ..., 510, 510
. The output for the second segment will be512, 512, 514, 514, 516, 516, ..., 1022, 1022
.- Template Parameters
ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
BlockPrefixCallbackOp – [inferred] Call-back functor type having member
T operator()(T block_aggregate)
- Parameters
input – [in] Calling thread’s input items
output – [out] Calling thread’s output items (may be aliased to
input
)scan_op – [in] Binary scan functor
block_prefix_callback_op – [inout]
warp0 only call-back functor for specifying a block-wide prefix to be applied to the logical input sequence.
-
struct TempStorage : public Uninitialized<_TempStorage>
The operations exposed by BlockScan require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the
__shared__
keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) orunion
’d with other storage allocation types to facilitate memory reuse.