cub::WarpScan
Defined in cub/warp/warp_scan.cuh
template<typename T, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int LEGACY_PTX_ARCH = 0>
class WarpScan

The WarpScan class provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp.
Overview
Given a list of input elements and a binary reduction operator, a prefix scan produces an output list where each element is computed to be the reduction of the elements occurring earlier in the input list. Prefix sum connotes a prefix scan with the addition operator. The term inclusive indicates that the ith output reduction incorporates the ith input. The term exclusive indicates the ith input is not incorporated into the ith output reduction.
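As a concrete illustration, consider the input list {8, 6, 7, 5} under addition: an inclusive prefix sum produces {8, 14, 21, 26}, while an exclusive prefix sum produces {0, 8, 14, 21}. The sequential sketch below (plain host C++, not part of the WarpScan API) makes the distinction explicit:

#include <cstdio>

int main()
{
    int input[4] = {8, 6, 7, 5};
    int inclusive[4], exclusive[4];

    int running = 0;  // identity value for addition
    for (int i = 0; i < 4; ++i)
    {
        exclusive[i] = running;   // i-th output excludes input[i]
        running += input[i];
        inclusive[i] = running;   // i-th output includes input[i]
    }

    printf("inclusive: %d %d %d %d\n", inclusive[0], inclusive[1], inclusive[2], inclusive[3]);
    printf("exclusive: %d %d %d %d\n", exclusive[0], exclusive[1], exclusive[2], exclusive[3]);
    return 0;
}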
Supports non-commutative scan operators
Supports “logical” warps smaller than the physical warp size (e.g., a logical warp of 8 threads)
The number of entrant threads must be a multiple of LOGICAL_WARP_THREADS (see the sketch below)
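For example, a 128-thread block can be treated as eight logical warps of 16 threads each. The kernel below is a minimal sketch of that configuration (the kernel name and input are illustrative):

#include <cub/cub.cuh>

__global__ void SubWarpScanKernel(...)
{
    // Specialize WarpScan for 16-thread logical warps of int
    using WarpScan = cub::WarpScan<int, 16>;

    // A 128-thread block holds eight 16-thread logical warps
    __shared__ typename WarpScan::TempStorage temp_storage[8];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute a prefix sum within each 16-thread logical warp
    int warp_id = threadIdx.x / 16;
    WarpScan(temp_storage[warp_id]).InclusiveSum(thread_data, thread_data);
}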
Performance Considerations
Uses special instructions when applicable (e.g., warp SHFL)
Uses synchronization-free communication between warp lanes when applicable
Incurs zero bank conflicts for most types
Computation is slightly more efficient (i.e., having lower instruction overhead) for:
  Summation (vs. generic scan)
  The architecture’s warp size is a whole multiple of LOGICAL_WARP_THREADS
Simple Examples
Every thread in the warp uses the WarpScan class by first specializing the WarpScan type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.
The code snippet below illustrates four concurrent warp prefix sums within a block of 128 threads (one per each of the 32-thread warps).
#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute warp-wide prefix sums
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).ExclusiveSum(thread_data, thread_data);
}
Suppose the set of input thread_data across the block of threads is {1, 1, 1, 1, ...}. The corresponding output thread_data in each of the four warps of threads will be {0, 1, 2, 3, ..., 31}.

The code snippet below illustrates a single warp prefix sum within a block of 128 threads.
#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for one warp
    __shared__ typename WarpScan::TempStorage temp_storage;
    ...

    // Only the first warp performs a prefix sum
    if (threadIdx.x < 32)
    {
        // Obtain one input item per thread
        int thread_data = ...

        // Compute warp-wide prefix sums
        WarpScan(temp_storage).ExclusiveSum(thread_data, thread_data);
    }
}
Suppose the set of input thread_data across the warp of threads is {1, 1, 1, 1, ...}. The corresponding output thread_data will be {0, 1, 2, 3, ..., 31}.

Template Parameters
T – The scan input/output element type
LOGICAL_WARP_THREADS – [optional] The number of threads per “logical” warp (may be less than the number of hardware warp threads). Default is the warp size associated with the CUDA Compute Capability targeted by the compiler (e.g., 32 threads for SM20).
LEGACY_PTX_ARCH – [optional] Unused.
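For completeness, the following is a minimal host-side sketch showing how a kernel like the ones above might be launched and verified (the kernel and buffer names are illustrative assumptions, not part of the library):

#include <cub/cub.cuh>
#include <cstdio>

// Hypothetical kernel: a warp-wide exclusive prefix sum per 32-thread warp,
// with results written out so the host can inspect them.
__global__ void WarpPrefixSumKernel(int *d_output)
{
    using WarpScan = cub::WarpScan<int>;
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    int thread_data = 1;  // every thread contributes 1
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).ExclusiveSum(thread_data, thread_data);

    d_output[threadIdx.x] = thread_data;  // {0, 1, 2, ..., 31} in each warp
}

int main()
{
    int *d_output;
    cudaMalloc(&d_output, 128 * sizeof(int));

    // One block of 128 threads = four 32-thread warps
    WarpPrefixSumKernel<<<1, 128>>>(d_output);

    int h_output[128];
    cudaMemcpy(h_output, d_output, sizeof(h_output), cudaMemcpyDeviceToHost);
    printf("warp 0, lane 5: %d\n", h_output[5]);  // prints 5
    cudaFree(d_output);
    return 0;
}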
Collective constructors
inline WarpScan(TempStorage &temp_storage)

Collective constructor using the specified memory allocation as temporary storage. Logical warp and lane identifiers are constructed from threadIdx.x.

Parameters
temp_storage – [in] Reference to memory allocation having layout type TempStorage
Inclusive prefix sums
inline void InclusiveSum(T input, T &inclusive_output)

Computes an inclusive prefix sum across the calling warp.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates four concurrent warp-wide inclusive prefix sums within a block of 128 threads (one per each of the 32-thread warps).
#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute inclusive warp-wide prefix sums
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).InclusiveSum(thread_data, thread_data);
}
Suppose the set of input thread_data across the block of threads is {1, 1, 1, 1, ...}. The corresponding output thread_data in each of the four warps of threads will be {1, 2, 3, ..., 32}.

Parameters
input – [in] Calling thread’s input item.
inclusive_output – [out] Calling thread’s output item. May be aliased with input.
inline void InclusiveSum(T input, T &inclusive_output, T &warp_aggregate)

Computes an inclusive prefix sum across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates four concurrent warp-wide inclusive prefix sums within a block of 128 threads (one per each of the 32-thread warps).
#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute inclusive warp-wide prefix sums
    int warp_aggregate;
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).InclusiveSum(thread_data, thread_data, warp_aggregate);
}
Suppose the set of input thread_data across the block of threads is {1, 1, 1, 1, ...}. The corresponding output thread_data in each of the four warps of threads will be {1, 2, 3, ..., 32}. Furthermore, warp_aggregate for all threads in all warps will be 32.

Parameters
input – [in] Calling thread’s input item
inclusive_output – [out] Calling thread’s output item. May be aliased with input
warp_aggregate – [out] Warp-wide aggregate reduction of input items
Exclusive prefix sums
inline void ExclusiveSum(T input, T &exclusive_output)

Computes an exclusive prefix sum across the calling warp. The value of 0 is applied as the initial value, and is assigned to exclusive_output in lane 0.

This operation assumes that the value obtained by T’s default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value zero for addition; a sketch illustrating this requirement follows this method.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates four concurrent warp-wide exclusive prefix sums within a block of 128 threads (one per each of the 32-thread warps).
#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute exclusive warp-wide prefix sums
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).ExclusiveSum(thread_data, thread_data);
}
Suppose the set of input thread_data across the block of threads is {1, 1, 1, 1, ...}. The corresponding output thread_data in each of the four warps of threads will be {0, 1, 2, ..., 31}.

Parameters
input – [in] Calling thread’s input item.
exclusive_output – [out] Calling thread’s output item. May be aliased with input.
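To illustrate the default-constructor requirement noted above, the sketch below uses a hypothetical user-defined type (Millimeters is illustrative, not part of CUB) whose default constructor yields the additive identity:

// Hypothetical wrapper type: Millimeters{} constructs the additive
// identity (0 mm), as ExclusiveSum assumes for type T.
struct Millimeters
{
    int value;

    __host__ __device__ Millimeters() : value(0) {}
    __host__ __device__ Millimeters(int v) : value(v) {}

    __host__ __device__ Millimeters operator+(Millimeters other) const
    {
        return Millimeters(value + other.value);
    }
};

__global__ void ExampleKernel(...)
{
    using WarpScan = cub::WarpScan<Millimeters>;
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    Millimeters thread_data = ...

    // Lane 0 of each warp receives Millimeters{}, i.e. 0 mm
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).ExclusiveSum(thread_data, thread_data);
}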
inline void ExclusiveSum(T input, T &exclusive_output, T &warp_aggregate)

Computes an exclusive prefix sum across the calling warp. The value of 0 is applied as the initial value, and is assigned to exclusive_output in lane 0. Also provides every thread with the warp-wide warp_aggregate of all inputs.

This operation assumes that the value obtained by T’s default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value zero for addition.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates four concurrent warp-wide exclusive prefix sums within a block of 128 threads (one per each of the 32-thread warps).
#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute exclusive warp-wide prefix sums
    int warp_aggregate;
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).ExclusiveSum(thread_data, thread_data, warp_aggregate);
}
Suppose the set of input thread_data across the block of threads is {1, 1, 1, 1, ...}. The corresponding output thread_data in each of the four warps of threads will be {0, 1, 2, ..., 31}. Furthermore, warp_aggregate for all threads in all warps will be 32.

Parameters
input – [in] Calling thread’s input item
exclusive_output – [out] Calling thread’s output item. May be aliased with input
warp_aggregate – [out] Warp-wide aggregate reduction of input items
Inclusive prefix scans
template<typename ScanOp>
inline void InclusiveScan(T input, T &inclusive_output, ScanOp scan_op)

Computes an inclusive prefix scan using the specified binary scan functor across the calling warp.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates four concurrent warp-wide inclusive prefix max scans within a block of 128 threads (one per each of the 32-thread warps).
#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute inclusive warp-wide prefix max scans
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).InclusiveScan(thread_data, thread_data, cuda::maximum<>{});
}
Suppose the set of input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output thread_data in the first warp would be {0, 0, 2, 2, ..., 30, 30}, the output for the second warp would be {32, 32, 34, 34, ..., 62, 62}, etc.

Template Parameters
ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
input – [in] Calling thread’s input item
inclusive_output – [out] Calling thread’s output item. May be aliased with input
scan_op – [in] Binary scan operator
template<typename ScanOp>
inline void InclusiveScan(T input, T &inclusive_output, T initial_value, ScanOp scan_op)

Computes an inclusive prefix scan using the specified binary scan functor across the calling warp.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates four concurrent warp-wide inclusive prefix max scans within a block of 128 threads (one per each of the 32-thread warps).
__global__ void InclusiveWarpScanKernel(int* output)
{
    // Specialize WarpScan for type int
    using warp_scan_t = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    constexpr int num_warps = 4;
    __shared__ typename warp_scan_t::TempStorage temp_storage[num_warps];

    int warp_id = threadIdx.x / 32;
    int initial_value = 3;
    int thread_data = threadIdx.x % 32 + warp_id;

    // warp #0 input: {0, 1, 2, 3, ..., 31}
    // warp #1 input: {1, 2, 3, 4, ..., 32}
    // warp #2 input: {2, 3, 4, 5, ..., 33}
    // warp #3 input: {3, 4, 5, 6, ..., 34}

    // Collectively compute the warp-wide inclusive prefix max scan
    warp_scan_t(temp_storage[warp_id])
        .InclusiveScan(thread_data, thread_data, initial_value, ::cuda::maximum<>{});

    // initial value = 3 (for each warp)
    // warp #0 output: {3, 3, 3, 3, ..., 31}
    // warp #1 output: {3, 3, 3, 4, ..., 32}
    // warp #2 output: {3, 3, 4, 5, ..., 33}
    // warp #3 output: {3, 4, 5, 6, ..., 34}

    output[threadIdx.x] = thread_data;
}
Suppose the set of input thread_data in the first warp is {0, 1, 2, 3, ..., 31}, in the second warp is {1, 2, 3, 4, ..., 32}, etc. The corresponding output thread_data for a max operation in the first warp would be {3, 3, 3, 3, ..., 31}, the output for the second warp would be {3, 3, 3, 4, ..., 32}, etc.

Template Parameters
ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
input – [in] Calling thread’s input item
inclusive_output – [out] Calling thread’s output item. May be aliased with input
initial_value – [in] Initial value to seed the inclusive scan (uniform across warp)
scan_op – [in] Binary scan operator
template<typename ScanOp>
inline void InclusiveScan(T input, T &inclusive_output, ScanOp scan_op, T &warp_aggregate)

Computes an inclusive prefix scan using the specified binary scan functor across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates four concurrent warp-wide inclusive prefix max scans within a block of 128 threads (one per each of the 32-thread warps).
#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute inclusive warp-wide prefix max scans
    int warp_aggregate;
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).InclusiveScan(
        thread_data, thread_data, cuda::maximum<>{}, warp_aggregate);
}
Suppose the set of input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output thread_data in the first warp would be {0, 0, 2, 2, ..., 30, 30}, the output for the second warp would be {32, 32, 34, 34, ..., 62, 62}, etc. Furthermore, warp_aggregate would be assigned 30 for threads in the first warp, 62 for threads in the second warp, etc.

Template Parameters
ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
input – [in] Calling thread’s input item
inclusive_output – [out] Calling thread’s output item. May be aliased with input
scan_op – [in] Binary scan operator
warp_aggregate – [out] Warp-wide aggregate reduction of input items.
template<typename ScanOp>
inline void InclusiveScan(T input, T &inclusive_output, T initial_value, ScanOp scan_op, T &warp_aggregate)

Computes an inclusive prefix scan using the specified binary scan functor across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates four concurrent warp-wide inclusive prefix sum scans within a block of 128 threads (one scan per warp).
__global__ void InclusiveWarpScanKernelAggr(int* output, int* d_warp_aggregate)
{
    // Specialize WarpScan for type int
    using warp_scan_t = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    constexpr int num_warps = 4;
    __shared__ typename warp_scan_t::TempStorage temp_storage[num_warps];

    int warp_id = threadIdx.x / 32;
    int initial_value = 3;  // for each warp
    int thread_data = 1;
    int warp_aggregate;

    // warp #0 input: {1, 1, 1, 1, ..., 1}
    // warp #1 input: {1, 1, 1, 1, ..., 1}
    // warp #2 input: {1, 1, 1, 1, ..., 1}
    // warp #3 input: {1, 1, 1, 1, ..., 1}

    // Collectively compute the warp-wide inclusive prefix sum scan
    warp_scan_t(temp_storage[warp_id])
        .InclusiveScan(thread_data, thread_data, initial_value, ::cuda::std::plus<>{}, warp_aggregate);

    // warp #0 output: {4, 5, 6, 7, ..., 35} - warp aggregate: 32
    // warp #1 output: {4, 5, 6, 7, ..., 35} - warp aggregate: 32
    // warp #2 output: {4, 5, 6, 7, ..., 35} - warp aggregate: 32
    // warp #3 output: {4, 5, 6, 7, ..., 35} - warp aggregate: 32

    output[threadIdx.x] = thread_data;
    d_warp_aggregate[warp_id] = warp_aggregate;
}
Suppose the set of input thread_data across the block of threads is {1, 1, 1, 1, ..., 1}. For an initial value of 3, the corresponding output thread_data for a sum operation in the first warp would be {4, 5, 6, 7, ..., 35}, the output for the second warp would be {4, 5, 6, 7, ..., 35}, etc. Furthermore, warp_aggregate would be assigned 32 for threads in each warp.

Template Parameters
ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
input – [in] Calling thread’s input item
inclusive_output – [out] Calling thread’s output item. May be aliased with input
initial_value – [in] Initial value to seed the inclusive scan (uniform across warp). It is not taken into account for warp_aggregate.
scan_op – [in] Binary scan operator
warp_aggregate – [out] Warp-wide aggregate reduction of input items.
Exclusive prefix scans
template<typename ScanOp>
inline void ExclusiveScan(T input, T &exclusive_output, ScanOp scan_op)

Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Because no initial value is supplied, the output computed for lane 0 is undefined.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates four concurrent warp-wide exclusive prefix max scans within a block of 128 threads (one per each of the 32-thread warps).
#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute exclusive warp-wide prefix max scans
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).ExclusiveScan(thread_data, thread_data, cuda::maximum<>{});
}
Suppose the set of input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output thread_data in the first warp would be {?, 0, 0, 2, ..., 28, 30}, the output for the second warp would be {?, 32, 32, 34, ..., 60, 62}, etc. (The output thread_data in warp lane 0 is undefined.)

Template Parameters
ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
input – [in] Calling thread’s input item
exclusive_output – [out] Calling thread’s output item. May be aliased with input
scan_op – [in] Binary scan operator
template<typename ScanOp>
inline void ExclusiveScan(T input, T &exclusive_output, T initial_value, ScanOp scan_op)

Computes an exclusive prefix scan using the specified binary scan functor across the calling warp.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates four concurrent warp-wide exclusive prefix max scans within a block of 128 threads (one per each of the 32-thread warps).
#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute exclusive warp-wide prefix max scans
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).ExclusiveScan(thread_data, thread_data, INT_MIN, cuda::maximum<>{});
}
Suppose the set of input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output thread_data in the first warp would be {INT_MIN, 0, 0, 2, ..., 28, 30}, the output for the second warp would be {INT_MIN, 32, 32, 34, ..., 60, 62}, etc. (Lane 0 of each warp is assigned the uniform initial value.)

Template Parameters
ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
input – [in] Calling thread’s input item
exclusive_output – [out] Calling thread’s output item. May be aliased with input
initial_value – [in] Initial value to seed the exclusive scan
scan_op – [in] Binary scan operator
template<typename ScanOp>
inline void ExclusiveScan(T input, T &exclusive_output, ScanOp scan_op, T &warp_aggregate)

Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Because no initial value is supplied, the output computed for lane 0 is undefined. Also provides every thread with the warp-wide warp_aggregate of all inputs.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates four concurrent warp-wide exclusive prefix max scans within a block of 128 threads (one per each of the 32-thread warps).
#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute exclusive warp-wide prefix max scans
    int warp_aggregate;
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).ExclusiveScan(thread_data, thread_data, cuda::maximum<>{}, warp_aggregate);
}
Suppose the set of input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output thread_data in the first warp would be {?, 0, 0, 2, ..., 28, 30}, the output for the second warp would be {?, 32, 32, 34, ..., 60, 62}, etc. (The output thread_data in warp lane 0 is undefined.) Furthermore, warp_aggregate would be assigned 30 for threads in the first warp, 62 for threads in the second warp, etc.

Template Parameters
ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
input – [in] Calling thread’s input item
exclusive_output – [out] Calling thread’s output item. May be aliased with input
scan_op – [in] Binary scan operator
warp_aggregate – [out] Warp-wide aggregate reduction of input items
template<typename ScanOp>
inline void ExclusiveScan(T input, T &exclusive_output, T initial_value, ScanOp scan_op, T &warp_aggregate)

Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates four concurrent warp-wide exclusive prefix max scans within a block of 128 threads (one per each of the 32-thread warps).
#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute exclusive warp-wide prefix max scans
    int warp_aggregate;
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).ExclusiveScan(thread_data, thread_data, INT_MIN, cuda::maximum<>{}, warp_aggregate);
}
Suppose the set of input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output thread_data in the first warp would be {INT_MIN, 0, 0, 2, ..., 28, 30}, the output for the second warp would be {INT_MIN, 32, 32, 34, ..., 60, 62}, etc. Furthermore, warp_aggregate would be assigned 30 for threads in the first warp, 62 for threads in the second warp, etc.

Template Parameters
ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
input – [in] Calling thread’s input item
exclusive_output – [out] Calling thread’s output item. May be aliased with input
initial_value – [in] Initial value to seed the exclusive scan
scan_op – [in] Binary scan operator
warp_aggregate – [out] Warp-wide aggregate reduction of input items
Combination (inclusive & exclusive) prefix scans
template<typename ScanOp>
inline void Scan(T input, T &inclusive_output, T &exclusive_output, ScanOp scan_op)

Computes both inclusive and exclusive prefix scans using the specified binary scan functor across the calling warp. Because no initial value is supplied, the exclusive_output computed for lane 0 is undefined.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates four concurrent warp-wide prefix max scans (computing both inclusive and exclusive partials) within a block of 128 threads (one per each of the 32-thread warps).
#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute both inclusive and exclusive warp-wide prefix max scans
    int warp_id = threadIdx.x / 32;
    int inclusive_partial, exclusive_partial;
    WarpScan(temp_storage[warp_id]).Scan(thread_data, inclusive_partial, exclusive_partial, cuda::maximum<>{});
}
Suppose the set of input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output inclusive_partial in the first warp would be {0, 0, 2, 2, ..., 30, 30}, the output for the second warp would be {32, 32, 34, 34, ..., 62, 62}, etc. The corresponding output exclusive_partial in the first warp would be {?, 0, 0, 2, ..., 28, 30}, the output for the second warp would be {?, 32, 32, 34, ..., 60, 62}, etc. (The output exclusive_partial in warp lane 0 is undefined.)

Template Parameters
ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
input – [in] Calling thread’s input item
inclusive_output – [out] Calling thread’s inclusive-scan output item
exclusive_output – [out] Calling thread’s exclusive-scan output item
scan_op – [in] Binary scan operator
template<typename ScanOp>
inline void Scan(T input, T &inclusive_output, T &exclusive_output, T initial_value, ScanOp scan_op)

Computes both inclusive and exclusive prefix scans using the specified binary scan functor across the calling warp.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates four concurrent warp-wide prefix max scans within a block of 128 threads (one per each of the 32-thread warps).
#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute both inclusive and exclusive warp-wide prefix max scans
    int warp_id = threadIdx.x / 32;
    int inclusive_partial, exclusive_partial;
    WarpScan(temp_storage[warp_id]).Scan(thread_data, inclusive_partial, exclusive_partial, INT_MIN, cuda::maximum<>{});
}
Suppose the set of input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output inclusive_partial in the first warp would be {0, 0, 2, 2, ..., 30, 30}, the output for the second warp would be {32, 32, 34, 34, ..., 62, 62}, etc. The corresponding output exclusive_partial in the first warp would be {INT_MIN, 0, 0, 2, ..., 28, 30}, the output for the second warp would be {INT_MIN, 32, 32, 34, ..., 60, 62}, etc.

Template Parameters
ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
input – [in] Calling thread’s input item
inclusive_output – [out] Calling thread’s inclusive-scan output item
exclusive_output – [out] Calling thread’s exclusive-scan output item
initial_value – [in] Initial value to seed the exclusive scan
scan_op – [in] Binary scan operator
Data exchange
inline T Broadcast(T input, unsigned int src_lane)

Broadcast the value input from lane src_lane to all lanes in the warp.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates the warp-wide broadcasts of values from lane 0 in each of four warps to all other threads in those warps.
#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Broadcast from lane 0 in each warp to all other threads in the warp
    int warp_id = threadIdx.x / 32;
    thread_data = WarpScan(temp_storage[warp_id]).Broadcast(thread_data, 0);
}
Suppose the set of input thread_data across the block of threads is {0, 1, 2, 3, ..., 127}. The corresponding output thread_data will be {0, 0, ..., 0} in warp 0, {32, 32, ..., 32} in warp 1, {64, 64, ..., 64} in warp 2, etc.

Parameters
input – [in] The value to broadcast
src_lane – [in] Which warp lane is to do the broadcasting
struct TempStorage : public Uninitialized<_TempStorage>

The operations exposed by WarpScan require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union’d with other storage allocation types to facilitate memory reuse; see the sketch below.
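For example, a kernel whose phases use different collectives can overlap their temporary storage. The following is a minimal sketch under the assumption that the two collectives are used in distinct phases (cub::WarpReduce appears here purely for illustration):

#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    using WarpScan   = cub::WarpScan<int>;
    using WarpReduce = cub::WarpReduce<int>;

    // The two collectives run in separate phases, so their temporary
    // storage may occupy the same shared memory bytes.
    __shared__ union
    {
        typename WarpScan::TempStorage   scan[4];
        typename WarpReduce::TempStorage reduce[4];
    } temp_storage;

    // Obtain one input item per thread
    int thread_data = ...
    int warp_id = threadIdx.x / 32;

    // Phase 1: warp-wide inclusive prefix sum
    WarpScan(temp_storage.scan[warp_id]).InclusiveSum(thread_data, thread_data);

    __syncwarp();  // barrier before the storage is repurposed

    // Phase 2: warp-wide reduction reusing the same storage
    int aggregate = WarpReduce(temp_storage.reduce[warp_id]).Sum(thread_data);
}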