cub::WarpScan

Defined in cub/warp/warp_scan.cuh

template<typename T, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int LEGACY_PTX_ARCH = 0>
class WarpScan

The WarpScan class provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp.


Overview

  • Given a list of input elements and a binary reduction operator, a prefix scan produces an output list where each element is computed to be the reduction of the elements occurring earlier in the input list. Prefix sum connotes a prefix scan with the addition operator. The term inclusive indicates that the ith output reduction incorporates the ith input. The term exclusive indicates the ith input is not incorporated into the ith output reduction. (A short sequential illustration follows this list.)

  • Supports non-commutative scan operators

  • Supports “logical” warps smaller than the physical warp size (e.g., a logical warp of 8 threads)

  • The number of entrant threads must be a multiple of LOGICAL_WARP_THREADS
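
For concreteness, the host-side sketch below (an illustration only, not part of CUB; the input values are arbitrary) computes both flavors of prefix sum sequentially with the addition operator:

#include <cstdio>

int main()
{
    const int n = 4;
    int input[] = {3, 1, 7, 0};
    int inclusive[4];
    int exclusive[4];

    int running = 0;
    for (int i = 0; i < n; ++i)
    {
        exclusive[i] = running;   // reduction of input[0..i-1]: excludes input[i]
        running     += input[i];
        inclusive[i] = running;   // reduction of input[0..i]: includes input[i]
    }

    // inclusive = {3, 4, 11, 11}; exclusive = {0, 3, 4, 11}
    printf("last inclusive partial: %d\n", inclusive[n - 1]);
    return 0;
}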

Performance Considerations

  • Uses special instructions when applicable (e.g., warp SHFL)

  • Uses synchronization-free communication between warp lanes when applicable

  • Incurs zero bank conflicts for most types

  • Computation is slightly more efficient (i.e., lower instruction overhead) for:

    • Summation (vs. generic scan)

    • Scans where the architecture’s warp size is a whole multiple of LOGICAL_WARP_THREADS

Simple Examples

Every thread in the warp uses the WarpScan class by first specializing the WarpScan type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.

The code snippet below illustrates four concurrent warp prefix sums within a block of 128 threads (one per each of the 32-thread warps).

#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute warp-wide prefix sums
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).ExclusiveSum(thread_data, thread_data);

Suppose the set of input thread_data across the block of threads is {1, 1, 1, 1, ...}. The corresponding output thread_data in each of the four warps of threads will be {0, 1, 2, 3, ..., 31}.

The code snippet below illustrates a single warp prefix sum within a block of 128 threads.

#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for one warp
    __shared__ typename WarpScan::TempStorage temp_storage;
    ...

    // Only the first warp performs a prefix sum
    if (threadIdx.x < 32)
    {
        // Obtain one input item per thread
        int thread_data = ...

        // Compute warp-wide prefix sums
        WarpScan(temp_storage).ExclusiveSum(thread_data, thread_data);

Suppose the set of input thread_data across the warp of threads is {1, 1, 1, 1, ...}. The corresponding output thread_data will be {0, 1, 2, 3, ..., 31}.

Template Parameters
  • T – The scan input/output element type

  • LOGICAL_WARP_THREADS – [optional] The number of threads per “logical” warp (may be less than the number of hardware warp threads). Default is the warp size associated with the CUDA Compute Capability targeted by the compiler (e.g., 32 threads for SM20).

  • LEGACY_PTX_ARCH – [optional] Unused.
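
As an illustration of LOGICAL_WARP_THREADS, the hedged sketch below (the kernel name is hypothetical; it assumes a power-of-two logical warp size and a 128-thread block) runs eight concurrent 16-thread prefix sums:

#include <cub/cub.cuh>

__global__ void LogicalWarpScanKernel(int* d_out)
{
    // Specialize WarpScan for 16-thread logical warps
    using WarpScan = cub::WarpScan<int, 16>;

    // A 128-thread block holds 128 / 16 = 8 logical warps
    __shared__ typename WarpScan::TempStorage temp_storage[8];

    // Each logical warp computes an independent 16-item prefix sum
    int logical_warp_id = threadIdx.x / 16;
    int thread_data     = 1;
    WarpScan(temp_storage[logical_warp_id]).InclusiveSum(thread_data, thread_data);

    // Output within every logical warp: {1, 2, 3, ..., 16}
    d_out[threadIdx.x] = thread_data;
}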

Collective constructors

inline WarpScan(TempStorage &temp_storage)

Collective constructor using the specified memory allocation as temporary storage. Logical warp and lane identifiers are constructed from threadIdx.x.

Parameters

temp_storage – [in] Reference to memory allocation having layout type TempStorage

Inclusive prefix sums

inline void InclusiveSum(T input, T &inclusive_output)

Computes an inclusive prefix sum across the calling warp.

  • A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates four concurrent warp-wide inclusive prefix sums within a block of 128 threads (one per each of the 32-thread warps).

#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute inclusive warp-wide prefix sums
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).InclusiveSum(thread_data, thread_data);

Suppose the set of input thread_data across the block of threads is {1, 1, 1, 1, ...}. The corresponding output thread_data in each of the four warps of threads will be {1, 2, 3, ..., 32}.

Parameters
  • input – [in] Calling thread’s input item.

  • inclusive_output – [out] Calling thread’s output item. May be aliased with input.
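
The storage-reuse caveat above applies to every collective in this class. The hedged sketch below (a hypothetical kernel; assumes a 128-thread block with arbitrary inputs) shows two dependent scans sharing one TempStorage allocation:

#include <cub/cub.cuh>

__global__ void BackToBackScanKernel(int* d_out)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // One TempStorage allocation per 32-thread warp, shared by both scans
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    int warp_id     = threadIdx.x / 32;
    int thread_data = 1;

    // The first scan reads and writes the temporary storage...
    WarpScan(temp_storage[warp_id]).InclusiveSum(thread_data, thread_data);

    // ...so a warp-wide barrier is required before the storage is reused
    __syncwarp();

    // The second scan safely reuses the same storage
    WarpScan(temp_storage[warp_id]).InclusiveSum(thread_data, thread_data);

    d_out[threadIdx.x] = thread_data;
}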

inline void InclusiveSum(T input, T &inclusive_output, T &warp_aggregate)

Computes an inclusive prefix sum across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs.

  • A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates four concurrent warp-wide inclusive prefix sums within a block of 128 threads (one per each of the 32-thread warps).

#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute inclusive warp-wide prefix sums
    int warp_aggregate;
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).InclusiveSum(thread_data,
                                                 thread_data,
                                                 warp_aggregate);

Suppose the set of input thread_data across the block of threads is {1, 1, 1, 1, ...}. The corresponding output thread_data in each of the four warps of threads will be {1, 2, 3, ..., 32}. Furthermore, warp_aggregate for all threads in all warps will be 32.

Parameters
  • input – [in] Calling thread’s input item

  • inclusive_output – [out] Calling thread’s output item. May be aliased with input

  • warp_aggregate – [out] Warp-wide aggregate reduction of input items

Exclusive prefix sums

inline void ExclusiveSum(T input, T &exclusive_output)

Computes an exclusive prefix sum across the calling warp. The value of 0 is applied as the initial value, and is assigned to exclusive_output in lane0.

  • This operation assumes the value obtained by T’s default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value zero for addition.

  • A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates four concurrent warp-wide exclusive prefix sums within a block of 128 threads (one per each of the 32-thread warps).

#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute exclusive warp-wide prefix sums
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).ExclusiveSum(thread_data, thread_data);

Suppose the set of input thread_data across the block of threads is {1, 1, 1, 1, ...}. The corresponding output thread_data in each of the four warps of threads will be {0, 1, 2, ..., 31}.

Parameters
  • input – [in] Calling thread’s input item.

  • exclusive_output – [out] Calling thread’s output item. May be aliased with input.

inline void ExclusiveSum(T input, T &exclusive_output, T &warp_aggregate)

Computes an exclusive prefix sum across the calling warp. The value of 0 is applied as the initial value, and is assigned to exclusive_output in lane0. Also provides every thread with the warp-wide warp_aggregate of all inputs.

  • This operation assumes the value obtained by T’s default constructor (or by zero-initialization if no user-defined default constructor exists) is suitable as the identity value zero for addition.

  • A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates four concurrent warp-wide exclusive prefix sums within a block of 128 threads (one per each of the 32-thread warps).

#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute exclusive warp-wide prefix sums
    int warp_aggregate;
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).ExclusiveSum(thread_data,
                                                 thread_data,
                                                 warp_aggregate);

Suppose the set of input thread_data across the block of threads is {1, 1, 1, 1, ...}. The corresponding output thread_data in each of the four warps of threads will be {0, 1, 2, ..., 31}. Furthermore, warp_aggregate for all threads in all warps will be 32.

Parameters
  • input – [in] Calling thread’s input item

  • exclusive_output – [out] Calling thread’s output item. May be aliased with input

  • warp_aggregate – [out] Warp-wide aggregate reduction of input items

Inclusive prefix scans

template<typename ScanOp>
inline void InclusiveScan(T input, T &inclusive_output, ScanOp scan_op)

Computes an inclusive prefix scan using the specified binary scan functor across the calling warp.

  • A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates four concurrent warp-wide inclusive prefix max scans within a block of 128 threads (one per each of the 32-thread warps).

#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute inclusive warp-wide prefix max scans
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).InclusiveScan(thread_data, thread_data, cuda::maximum<>{});

Suppose the set of input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output thread_data in the first warp would be {0, 0, 2, 2, ..., 30, 30}, the output for the second warp would be {32, 32, 34, 34, ..., 62, 62}, etc.

Template Parameters

ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
  • input – [in] Calling thread’s input item

  • inclusive_output – [out] Calling thread’s output item. May be aliased with input

  • scan_op – [in] Binary scan operator

template<typename ScanOp>
inline void InclusiveScan(T input, T &inclusive_output, T initial_value, ScanOp scan_op)

Computes an inclusive prefix scan using the specified binary scan functor across the calling warp.

  • A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates four concurrent warp-wide inclusive prefix max scans within a block of 128 threads (one per each of the 32-thread warps).

#include <cub/cub.cuh>

constexpr int num_warps = 4;

__global__ void InclusiveWarpScanKernel(int* output)
{
  // Specialize WarpScan for type int
  using warp_scan_t = cub::WarpScan<int>;
  // Allocate WarpScan shared memory for 4 warps
  __shared__ typename warp_scan_t::TempStorage temp_storage[num_warps];

  int warp_id       = threadIdx.x / 32;
  int initial_value = 3;
  int thread_data   = threadIdx.x % 32 + warp_id;

  // warp #0 input: {0, 1, 2, 3, ..., 31}
  // warp #1 input: {1, 2, 3, 4, ..., 32}
  // warp #2 input: {2, 3, 4, 5, ..., 33}
  // warp #3 input: {3, 4, 5, 6, ..., 34}

  // Collectively compute the warp-wide inclusive prefix max scan
  warp_scan_t(temp_storage[warp_id]).InclusiveScan(thread_data, thread_data, initial_value, ::cuda::maximum<>{});

  // initial value = 3 (for each warp)
  // warp #0 output: {3, 3, 3, 3, ..., 31}
  // warp #1 output: {3, 3, 3, 4, ..., 32}
  // warp #2 output: {3, 3, 4, 5, ..., 33}
  // warp #3 output: {3, 4, 5, 6, ..., 34}
  output[threadIdx.x] = thread_data;
}

Suppose the set of input thread_data in the first warp is {0, 1, 2, 3, ..., 31}, in the second warp is {1, 2, 3, 4, ..., 32} etc. The corresponding output thread_data for a max operation in the first warp would be {3, 3, 3, 3, ..., 31}, the output for the second warp would be {3, 3, 3, 4, ..., 32}, etc.

Template Parameters

ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
  • input – [in] Calling thread’s input item

  • inclusive_output – [out] Calling thread’s output item. May be aliased with input

  • initial_value – [in] Initial value to seed the inclusive scan (uniform across warp)

  • scan_op – [in] Binary scan operator

template<typename ScanOp>
inline void InclusiveScan(T input, T &inclusive_output, ScanOp scan_op, T &warp_aggregate)

Computes an inclusive prefix scan using the specified binary scan functor across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs.

  • A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates four concurrent warp-wide inclusive prefix max scans within a block of 128 threads (one per each of the 32-thread warps).

#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute inclusive warp-wide prefix max scans
    int warp_aggregate;
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).InclusiveScan(
        thread_data, thread_data, cuda::maximum<>{}, warp_aggregate);

Suppose the set of input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output thread_data in the first warp would be {0, 0, 2, 2, ..., 30, 30}, the output for the second warp would be {32, 32, 34, 34, ..., 62, 62}, etc. Furthermore, warp_aggregate would be assigned 30 for threads in the first warp, 62 for threads in the second warp, etc.

Template Parameters

ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
  • input – [in] Calling thread’s input item

  • inclusive_output – [out] Calling thread’s output item. May be aliased with input

  • scan_op – [in] Binary scan operator

  • warp_aggregate – [out] Warp-wide aggregate reduction of input items.

template<typename ScanOp>
inline void InclusiveScan(T input, T &inclusive_output, T initial_value, ScanOp scan_op, T &warp_aggregate)

Computes an inclusive prefix scan using the specified binary scan functor across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs.

  • A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates four concurrent warp-wide inclusive prefix sum scans within a block of 128 threads (one scan per warp).

#include <cub/cub.cuh>

constexpr int num_warps = 4;

__global__ void InclusiveWarpScanKernelAggr(int* output, int* d_warp_aggregate)
{
  // Specialize WarpScan for type int
  using warp_scan_t = cub::WarpScan<int>;
  // Allocate WarpScan shared memory for 4 warps
  __shared__ typename warp_scan_t::TempStorage temp_storage[num_warps];

  int warp_id       = threadIdx.x / 32;
  int initial_value = 3; // for each warp
  int thread_data   = 1;
  int warp_aggregate;

  // warp #0 input: {1, 1, 1, 1, ..., 1}
  // warp #1 input: {1, 1, 1, 1, ..., 1}
  // warp #2 input: {1, 1, 1, 1, ..., 1}
  // warp #3 input: {1, 1, 1, 1, ..., 1}

  // Collectively compute the warp-wide inclusive prefix sum scan
  warp_scan_t(temp_storage[warp_id])
    .InclusiveScan(thread_data, thread_data, initial_value, ::cuda::std::plus<>{}, warp_aggregate);

  // warp #0 output: {4, 5, 6, 7, ..., 35} - warp aggregate: 32
  // warp #1 output: {4, 5, 6, 7, ..., 35} - warp aggregate: 32
  // warp #2 output: {4, 5, 6, 7, ..., 35} - warp aggregate: 32
  // warp #3 output: {4, 5, 6, 7, ..., 35} - warp aggregate: 32
  output[threadIdx.x]       = thread_data;
  d_warp_aggregate[warp_id] = warp_aggregate;
}

Suppose the set of input thread_data across the block of threads is {1, 1, 1, 1, ..., 1}. For initial value equal to 3, the corresponding output thread_data for a sum operation in the first warp would be {4, 5, 6, 7, ..., 35}, the output for the second warp would be {4, 5, 6, 7, ..., 35}, etc. Furthermore, warp_aggregate would be assigned 32 for threads in each warp.

Template Parameters

ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
  • input – [in] Calling thread’s input item

  • inclusive_output – [out] Calling thread’s output item. May be aliased with input

  • initial_value – [in] Initial value to seed the inclusive scan (uniform across warp). It is not taken into account for warp_aggregate.

  • scan_op – [in] Binary scan operator

  • warp_aggregate – [out] Warp-wide aggregate reduction of input items.

Exclusive prefix scans

template<typename ScanOp>
inline void ExclusiveScan(T input, T &exclusive_output, ScanOp scan_op)

Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Because no initial value is supplied, the output computed for lane0 is undefined.

  • A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates four concurrent warp-wide exclusive prefix max scans within a block of 128 threads (one per each of the 32-thread warps).

#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute exclusive warp-wide prefix max scans
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).ExclusiveScan(thread_data, thread_data, cuda::maximum<>{});

Suppose the set of input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output thread_data in the first warp would be {?, 0, 0, 2, ..., 28, 30}, the output for the second warp would be {?, 32, 32, 34, ..., 60, 62}, etc. (The output thread_data in warp lane0 is undefined.)

Template Parameters

ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
  • input – [in] Calling thread’s input item

  • exclusive_output – [out] Calling thread’s output item. May be aliased with input

  • scan_op – [in] Binary scan operator

template<typename ScanOp>
inline void ExclusiveScan(T input, T &exclusive_output, T initial_value, ScanOp scan_op)

Computes an exclusive prefix scan using the specified binary scan functor across the calling warp.

  • A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates four concurrent warp-wide exclusive prefix max scans within a block of 128 threads (one per each of the 32-thread warps).

#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute exclusive warp-wide prefix max scans
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).ExclusiveScan(thread_data,
                                                  thread_data,
                                                  INT_MIN,
                                                  cuda::maximum<>{});

Suppose the set of input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output thread_data in the first warp would be {INT_MIN, 0, 0, 2, ..., 28, 30}, the output for the second warp would be {30, 32, 32, 34, ..., 60, 62}, etc.

Template Parameters

ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
  • input – [in] Calling thread’s input item

  • exclusive_output – [out] Calling thread’s output item. May be aliased with input

  • initial_value – [in] Initial value to seed the exclusive scan

  • scan_op – [in] Binary scan operator

template<typename ScanOp>
inline void ExclusiveScan(T input, T &exclusive_output, ScanOp scan_op, T &warp_aggregate)

Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Because no initial value is supplied, the output computed for lane0 is undefined. Also provides every thread with the warp-wide warp_aggregate of all inputs.

  • A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates four concurrent warp-wide exclusive prefix max scans within a block of 128 threads (one per each of the 32-thread warps).

#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute exclusive warp-wide prefix max scans
    int warp_aggregate;
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).ExclusiveScan(thread_data,
                                                  thread_data,
                                                  cuda::maximum<>{},
                                                  warp_aggregate);

Suppose the set of input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output thread_data in the first warp would be {?, 0, 0, 2, ..., 28, 30}, the output for the second warp would be {?, 32, 32, 34, ..., 60, 62}, etc. (The output thread_data in warp lane0 is undefined.) Furthermore, warp_aggregate would be assigned 30 for threads in the first warp, 62 for threads in the second warp, etc.

Template Parameters

ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
  • input – [in] Calling thread’s input item

  • exclusive_output – [out] Calling thread’s output item. May be aliased with input

  • scan_op – [in] Binary scan operator

  • warp_aggregate – [out] Warp-wide aggregate reduction of input items

template<typename ScanOp>
inline void ExclusiveScan(T input, T &exclusive_output, T initial_value, ScanOp scan_op, T &warp_aggregate)

Computes an exclusive prefix scan using the specified binary scan functor across the calling warp. Also provides every thread with the warp-wide warp_aggregate of all inputs.

  • A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates four concurrent warp-wide exclusive prefix max scans within a block of 128 threads (one per each of the 32-thread warps).

#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute exclusive warp-wide prefix max scans
    int warp_aggregate;
    int warp_id = threadIdx.x / 32;
    WarpScan(temp_storage[warp_id]).ExclusiveScan(thread_data,
                                                  thread_data,
                                                  INT_MIN,
                                                  cuda::maximum<>{},
                                                  warp_aggregate);

Suppose the set of input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output thread_data in the first warp would be {INT_MIN, 0, 0, 2, ..., 28, 30}, the output for the second warp would be {30, 32, 32, 34, ..., 60, 62}, etc. Furthermore, warp_aggregate would be assigned 30 for threads in the first warp, 62 for threads in the second warp, etc.

Template Parameters

ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
  • input – [in] Calling thread’s input item

  • exclusive_output – [out] Calling thread’s output item. May be aliased with input

  • initial_value – [in] Initial value to seed the exclusive scan

  • scan_op – [in] Binary scan operator

  • warp_aggregate – [out] Warp-wide aggregate reduction of input items

Combination (inclusive & exclusive) prefix scans

template<typename ScanOp>
inline void Scan(T input, T &inclusive_output, T &exclusive_output, ScanOp scan_op)

Computes both inclusive and exclusive prefix scans using the specified binary scan functor across the calling warp. Because no initial value is supplied, the exclusive_output computed for lane0 is undefined.

  • A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates four concurrent warp-wide prefix max scans, computing both inclusive and exclusive partials, within a block of 128 threads (one per each of the 32-thread warps).

#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute both inclusive and exclusive warp-wide prefix max scans
    int warp_id = threadIdx.x / 32;
    int inclusive_partial, exclusive_partial;
    WarpScan(temp_storage[warp_id]).Scan(thread_data,
                                         inclusive_partial,
                                         exclusive_partial,
                                         cuda::maximum<>{});

Suppose the set of input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output inclusive_partial in the first warp would be {0, 0, 2, 2, ..., 30, 30}, the output for the second warp would be {32, 32, 34, 34, ..., 62, 62}, etc. The corresponding output exclusive_partial in the first warp would be {?, 0, 0, 2, ..., 28, 30}, the output for the second warp would be {?, 32, 32, 34, ..., 60, 62}, etc. (The output exclusive_partial in warp lane0 is undefined.)

Template Parameters

ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
  • input – [in] Calling thread’s input item

  • inclusive_output – [out] Calling thread’s inclusive-scan output item

  • exclusive_output – [out] Calling thread’s exclusive-scan output item

  • scan_op – [in] Binary scan operator

template<typename ScanOp>
inline void Scan(T input, T &inclusive_output, T &exclusive_output, T initial_value, ScanOp scan_op)

Computes both inclusive and exclusive prefix scans using the specified binary scan functor across the calling warp.

  • A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates four concurrent warp-wide prefix max scans within a block of 128 threads (one per each of the 32-thread warps).

#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Compute both inclusive and exclusive warp-wide prefix max scans
    int warp_id = threadIdx.x / 32;
    int inclusive_partial, exclusive_partial;
    WarpScan(temp_storage[warp_id]).Scan(thread_data,
                                         inclusive_partial,
                                         exclusive_partial,
                                         INT_MIN,
                                         cuda::maximum<>{});

Suppose the set of input thread_data across the block of threads is {0, -1, 2, -3, ..., 126, -127}. The corresponding output inclusive_partial in the first warp would be {0, 0, 2, 2, ..., 30, 30}, the output for the second warp would be {32, 32, 34, 34, ..., 62, 62}, etc. The corresponding output exclusive_partial in the first warp would be {INT_MIN, 0, 0, 2, ..., 28, 30}, the output for the second warp would be {30, 32, 32, 34, ..., 60, 62}, etc.

Template Parameters

ScanOp – [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)

Parameters
  • input – [in] Calling thread’s input item

  • inclusive_output – [out] Calling thread’s inclusive-scan output item

  • exclusive_output – [out] Calling thread’s exclusive-scan output item

  • initial_value – [in] Initial value to seed the exclusive scan

  • scan_op – [in] Binary scan operator

Data exchange

inline T Broadcast(T input, unsigned int src_lane)

Broadcast the value input from lane src_lane to all lanes in the warp.

  • A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates the warp-wide broadcasts of values from lane0 in each of four warps to all other threads in those warps.

#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize WarpScan for type int
    using WarpScan = cub::WarpScan<int>;

    // Allocate WarpScan shared memory for 4 warps
    __shared__ typename WarpScan::TempStorage temp_storage[4];

    // Obtain one input item per thread
    int thread_data = ...

    // Broadcast from lane0 in each warp to all other threads in the warp
    int warp_id = threadIdx.x / 32;
    thread_data = WarpScan(temp_storage[warp_id]).Broadcast(thread_data, 0);

Suppose the set of input thread_data across the block of threads is {0, 1, 2, 3, ..., 127}. The corresponding output thread_data will be {0, 0, ..., 0} in warp0, {32, 32, ..., 32} in warp1, {64, 64, ..., 64} in warp2, etc.

Parameters
  • input – [in] The value to broadcast

  • src_lane – [in] Which warp lane is to do the broadcasting

struct TempStorage : public Uninitialized<_TempStorage>

The operations exposed by WarpScan require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union’d with other storage allocation types to facilitate memory reuse.
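
As a hedged sketch of the union’d-storage pattern (a hypothetical kernel; it pairs WarpScan with cub::WarpReduce and assumes a 128-thread block), both phases below share one shared-memory allocation:

#include <cub/cub.cuh>

__global__ void UnionedStorageKernel(int* d_out)
{
    using WarpScan   = cub::WarpScan<int>;
    using WarpReduce = cub::WarpReduce<int>;

    // Union the two storage layouts so both phases reuse the same bytes
    __shared__ union
    {
        typename WarpScan::TempStorage   scan[4];
        typename WarpReduce::TempStorage reduce[4];
    } temp_storage;

    int warp_id     = threadIdx.x / 32;
    int thread_data = 1;

    // Phase 1: warp-wide inclusive prefix sum
    WarpScan(temp_storage.scan[warp_id]).InclusiveSum(thread_data, thread_data);

    // Barrier before the storage is repurposed
    __syncwarp();

    // Phase 2: warp-wide reduction reusing the same shared memory
    // (WarpReduce::Sum returns the valid aggregate in lane0)
    int aggregate = WarpReduce(temp_storage.reduce[warp_id]).Sum(thread_data);
    if (threadIdx.x % 32 == 0)
    {
        d_out[warp_id] = aggregate;
    }
}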