cub::WarpReduceBatched#

template<typename T, int Batches, int LogicalWarpThreads = detail::warp_threads, bool SyncPhysicalWarp = false>
class WarpReduceBatched#

The WarpReduceBatched class provides collective methods for performing parallel reductions over multiple batches of items partitioned across a CUDA thread warp.

Overview#

  • A reduction (or fold) uses a binary combining operator to compute a single aggregate from a list of input elements. Parallel reductions are in general only deterministic when the reduction operator is both commutative and associative.

  • Supports “logical” warps smaller than the physical warp size (e.g., logical warps of 8 threads).

  • This primitive performs batched reductions taking one item per batch per thread.

  • Results are distributed among the threads. When there are more batches than logical warp threads, results can be distributed among the threads in either a striped or a blocked manner.

  • The number of batches must be non-negative. Compile time and register pressure increase with the number of batches.

Performance Characteristics#

  • Uses special instructions when applicable (e.g., warp SHFL instructions).

  • Uses synchronization-free communication between warp lanes when applicable.

  • Should generally give much better performance than sequential WarpReduce calls, independent of blocked or striped output arrangements.

  • Achieves peak efficiency when the number of batches is a multiple of the number of logical warp threads.

  • For logical warps smaller than the physical warp size, SyncPhysicalWarp = true should in general give better performance than SyncPhysicalWarp = false.

Note that SyncPhysicalWarp = true can cause a deadlock if not all non-exited logical warps from the same physical warp participate in the reduction (e.g., due to divergent branches).

  • An odd number of batches is less efficient than the next higher even number.

  • For more batches than logical warp threads, the striped output can be slightly more performant than blocked output depending on the number of batches and the number of logical warp threads.

  • Blocked output should generally give much better performance than converting striped output to blocked using, e.g., WarpExchange::StripedToBlocked().

  • For types smaller than 4 bytes, future optimizations might let blocked output outperform striped output.

Simple Example#

Every thread in the warp uses the WarpReduceBatched class by first specializing the WarpReduceBatched type, then constructing an instance with parameters for communication, and finally invoking one or more collective member functions.

The code snippet below illustrates reduction of 3 batches across 32 threads in each of 2 warps:

using WarpReduceBatched = cub::WarpReduceBatched<int, 3>;

// Assume 64 threads per block, so 64 / 32 = 2 logical warps
// Each logical warp has its own TempStorage
__shared__ typename WarpReduceBatched::TempStorage temp_storage[2];

const int logical_warp_id = static_cast<int>(threadIdx.x) / 32;

int thread_data[3];
thread_data[0] = static_cast<int>(threadIdx.x) - 1;
thread_data[1] = static_cast<int>(threadIdx.x);
thread_data[2] = static_cast<int>(threadIdx.x) + 1;

int result = WarpReduceBatched{temp_storage[logical_warp_id]}.Reduce(thread_data, cuda::maximum{});
// results across threads: [30, 31, 32, ?, ?, ..., ?, 62, 63, 64, ?, ?, ..., ?]

Template Parameters:
  • T – The reduction input/output element type

  • Batches – The number of batches to reduce. Also corresponds to the size of the range of inputs for each thread.

  • LogicalWarpThreads – [optional] The number of threads per “logical” warp (may be less than the number of hardware warp threads but has to be a power of two). Default is the warp size of the targeted CUDA compute-capability (e.g., 32 threads for SM80).

  • SyncPhysicalWarp – [optional] Whether to synchronize the entire physical warp rather than only the logical warp. Can improve performance for logical warps smaller than the physical warp size, but requires all non-exited logical warps of a physical warp to participate (see Performance Characteristics). Default is false.

Collective constructors

inline WarpReduceBatched(TempStorage &temp_storage)#

Collective constructor using the specified memory allocation as temporary storage. Logical warp and lane identifiers are constructed from threadIdx.x.

Parameters:

temp_storage[in] Reference to memory allocation having layout type TempStorage

Generic reductions

template<typename InputT, typename ReductionOp>
inline T Reduce(
const InputT &inputs,
ReductionOp reduction_op
)#

Computes a warp-wide reduction for each batch in the calling warp using the specified binary reduction functor. Thread i returns the result for batch i. Returned items that have no corresponding input batch are invalid. For more batches than logical warp threads or generic code that could result in zero batches, use ReduceToStriped() or ReduceToBlocked() instead.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet#

The code snippet below illustrates reduction of 3 batches across 16 threads in the branched-off first logical warp (using cuda::std::array inputs and outputs):

// Can't allow for physical warp synchronization since only the first logical warp participates due to the
// conditional. The other threads (assuming there are more than 16 threads per block) can't exit early due to the
// barrier.
using WarpReduceBatched = cub::WarpReduceBatched<int, 3, 16>;

// Only the first logical warp participates, so only a single TempStorage is needed
__shared__ typename WarpReduceBatched::TempStorage temp_storage;

int result{};
if (threadIdx.x < 16)
{
  cuda::std::array<int, 3> inputs{};
  inputs[0] = static_cast<int>(threadIdx.x) - 1;
  inputs[1] = static_cast<int>(threadIdx.x);
  inputs[2] = static_cast<int>(threadIdx.x) + 1;

  result = WarpReduceBatched{temp_storage}.Reduce(inputs, cuda::maximum{});
}
// results across threads: [14, 15, 16, ?, ?, ..., ?, 0, 0, ..., 0]
__syncthreads();
// Can reuse TempStorage after the barrier.

Template Parameters:
  • InputT[inferred] The data type to be reduced having member operator[](int i) and must be statically-sized (size() method or static array)

  • ReductionOp[inferred] Binary reduction operator type having member T operator()(const T &a, const T &b)

Parameters:
  • inputs[in] Statically-sized input range holding Batches items

  • reduction_op[in] Binary reduction operator

Returns:

The reduction of the input values of the batch corresponding to the logical lane.

template<typename InputT, typename OutputT, typename ReductionOp>
inline void ReduceToStriped(
const InputT &inputs,
OutputT &outputs,
ReductionOp reduction_op
)#

Computes a warp-wide reduction for each batch in the calling warp using the specified binary reduction functor. The user must provide an output range of max_out_per_thread = ceil_div(Batches, LogicalWarpThreads) items. Logical lane i stores results in its output range in a striped manner: outputs[0] = result of batch i, outputs[1] = result of batch i + LogicalWarpThreads, etc. Items in the output range that have no corresponding input batch are invalid.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet#

The code snippet below illustrates reduction of 3 batches across 2 threads in every second logical warp (using cuda::std::array inputs and outputs):

// Can't allow for physical warp synchronization since only every other logical warp participates.
// The other threads (assuming there are more than 2 threads per block) can't exit early due to the
// barrier.
using WarpReduceBatched = cub::WarpReduceBatched<int, 3, 2>;

// Assume 8 threads per block, so 8 / 2 = 4 logical warps
// Only every other logical warp participates, so only 2 TempStorage are needed
__shared__ typename WarpReduceBatched::TempStorage temp_storage[2];

const int logical_warp_id   = static_cast<int>(threadIdx.x) / 2;
const bool is_participating = logical_warp_id % 2 == 0;
const int participant_idx   = logical_warp_id / 2;

cuda::std::array<int, 2> results{};
if (is_participating)
{
  cuda::std::array<int, 3> inputs{};
  inputs[0] = static_cast<int>(threadIdx.x) - 1;
  inputs[1] = static_cast<int>(threadIdx.x);
  inputs[2] = static_cast<int>(threadIdx.x) + 1;

  WarpReduceBatched{temp_storage[participant_idx]}.ReduceToStriped(inputs, results, cuda::maximum{});
}
// results across threads:
// [[0, 2], [1, ?], [0, 0], [0, 0], [4, 6], [5, ?], [0, 0], [0, 0]]
__syncthreads();
// Can reuse TempStorage after the barrier.

Template Parameters:
  • InputT[inferred] The data type to be reduced having member operator[](int i) and must be statically-sized (size() method or static array)

  • OutputT[inferred] The data type to hold results having member operator[](int i) and must be statically-sized (size() method or static array)

  • ReductionOp[inferred] Binary reduction operator type having member T operator()(const T &a, const T &b)

Parameters:
  • inputs[in] Statically-sized input range holding Batches items

  • outputs[out] Statically-sized output range holding ceil_div(Batches, LogicalWarpThreads) items

  • reduction_op[in] Binary reduction operator

template<typename InputT, typename OutputT, typename ReductionOp>
inline void ReduceToBlocked(
const InputT &inputs,
OutputT &outputs,
ReductionOp reduction_op
)#

Computes a warp-wide reduction for each batch in the calling warp using the specified binary reduction functor. The user must provide an output range of max_out_per_thread = ceil_div(Batches, LogicalWarpThreads) items. Logical lane i stores results in its output range in a blocked manner: outputs[0] = result of batch i * max_out_per_thread, outputs[1] = result of batch i * max_out_per_thread + 1, etc. Items in the output range that have no corresponding input batch are invalid.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet#

The code snippet below illustrates reduction of 3 batches across 2 threads in every second logical warp (using cuda::std::array inputs and outputs):

// Can't allow for physical warp synchronization since only every other logical warp participates.
// The other threads (assuming there are more than 2 threads per block) can't exit early due to the
// barrier.
using WarpReduceBatched = cub::WarpReduceBatched<int, 3, 2>;

// Assume 8 threads per block, so 8 / 2 = 4 logical warps
// Only every other logical warp participates, so only 2 TempStorage are needed
__shared__ typename WarpReduceBatched::TempStorage temp_storage[2];

const int logical_warp_id   = static_cast<int>(threadIdx.x) / 2;
const bool is_participating = logical_warp_id % 2 == 0;
const int participant_idx   = logical_warp_id / 2;

cuda::std::array<int, 2> results{};
if (is_participating)
{
  cuda::std::array<int, 3> inputs{};
  inputs[0] = static_cast<int>(threadIdx.x) - 1;
  inputs[1] = static_cast<int>(threadIdx.x);
  inputs[2] = static_cast<int>(threadIdx.x) + 1;

  WarpReduceBatched{temp_storage[participant_idx]}.ReduceToBlocked(inputs, results, cuda::maximum{});
}
// results across threads:
// [[0, 1], [2, ?], [0, 0], [0, 0], [4, 5], [6, ?], [0, 0], [0, 0]]
__syncthreads();
// Can reuse TempStorage after the barrier.

Template Parameters:
  • InputT[inferred] The data type to be reduced having member operator[](int i) and must be statically-sized (size() method or static array)

  • OutputT[inferred] The data type to hold results having member operator[](int i) and must be statically-sized (size() method or static array)

  • ReductionOp[inferred] Binary reduction operator type having member T operator()(const T &a, const T &b)

Parameters:
  • inputs[in] Statically-sized input range holding Batches items

  • outputs[out] Statically-sized output range holding ceil_div(Batches, LogicalWarpThreads) items

  • reduction_op[in] Binary reduction operator

Sum reductions

template<typename InputT>
inline T Sum(const InputT &inputs)#

Computes a warp-wide sum for each batch in the calling warp. Thread i returns the result for batch i. Returned items that have no corresponding input batch are invalid. For more batches than logical warp threads or generic code that could result in zero batches, use SumToStriped() or SumToBlocked() instead.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet#

The code snippet below illustrates reduction of 3 batches across 4 threads in each of 2 logical warps (using cuda::std::array inputs):

// We can enable physical warp synchronization since all non-exited lanes do participate in the primitive.
using WarpReduceBatched = cub::WarpReduceBatched<int, 3, 4, true>;

// Assume 8 threads per block, so 8 / 4 = 2 logical warps
__shared__ typename WarpReduceBatched::TempStorage temp_storage[2];

const int logical_warp_id = static_cast<int>(threadIdx.x) / 4;

cuda::std::array<int, 3> inputs{};
inputs[0]  = static_cast<int>(threadIdx.x) - 1;
inputs[1]  = static_cast<int>(threadIdx.x);
inputs[2]  = static_cast<int>(threadIdx.x) + 1;
int result = WarpReduceBatched{temp_storage[logical_warp_id]}.Sum(inputs);
// results across threads:
// [2, 6, 10, ?, 18, 22, 26, ?]

Template Parameters:
  • InputT[inferred] The data type to be reduced having member operator[](int i) and must be statically-sized (size() method or static array)

Parameters:
  • inputs[in] Statically-sized input range holding Batches items

Returns:

The sum of the input values of the batch corresponding to the logical lane.

template<typename InputT, typename OutputT>
inline void SumToStriped(
const InputT &inputs,
OutputT &outputs
)#

Computes a warp-wide sum for each batch in the calling warp. The user must provide an output range of max_out_per_thread = ceil_div(Batches, LogicalWarpThreads) items. Logical lane i stores results in its output range in a striped manner: outputs[0] = result of batch i, outputs[1] = result of batch i + LogicalWarpThreads, etc. Items in the output range that have no corresponding input batch are invalid.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet#

The code snippet below illustrates reduction of 5 batches across 2 threads in each of 4 logical warps, yielding more than one output per thread (using cuda::std::span inputs and outputs):

// We can enable physical warp synchronization since all non-exited lanes do participate in the primitive.
using WarpReduceBatched = cub::WarpReduceBatched<int, 5, 2, true>;

// Assume 8 threads per block, so 8 / 2 = 4 logical warps
__shared__ typename WarpReduceBatched::TempStorage temp_storage[4];

const int logical_warp_id = static_cast<int>(threadIdx.x) / 2;

cuda::std::array<int, 5> thread_data{};
thread_data[0] = static_cast<int>(threadIdx.x) - 2;
thread_data[1] = static_cast<int>(threadIdx.x) - 1;
thread_data[2] = static_cast<int>(threadIdx.x);
thread_data[3] = static_cast<int>(threadIdx.x) + 1;
thread_data[4] = static_cast<int>(threadIdx.x) + 2;
// Use static-extent spans to pass all 5 elements as inputs and the last 3 elements as outputs (aliasing the inputs)
cuda::std::span<int, 5> inputs{cuda::std::begin(thread_data), 5};
cuda::std::span<int, 3> results{cuda::std::begin(thread_data) + 2, 3};
WarpReduceBatched{temp_storage[logical_warp_id]}.SumToStriped(inputs, results);
// results across threads:
// [[-3, 1, 5], [-1, 3, ?], [1, 5, 9], [3, 7, ?], ..., [9, 13, 17], [11, 15, ?]]


Template Parameters:
  • InputT[inferred] The data type to be reduced having member operator[](int i) and must be statically-sized (size() method or static array)

  • OutputT[inferred] The data type to hold results having member operator[](int i) and must be statically-sized (size() method or static array)

Parameters:
  • inputs[in] Statically-sized input range holding Batches items

  • outputs[out] Statically-sized output range holding ceil_div(Batches, LogicalWarpThreads) items

template<typename InputT, typename OutputT>
inline void SumToBlocked(
const InputT &inputs,
OutputT &outputs
)#

Computes a warp-wide sum for each batch in the calling warp. The user must provide an output range of max_out_per_thread = ceil_div(Batches, LogicalWarpThreads) items. Logical lane i stores results in its output range in a blocked manner: outputs[0] = result of batch i * max_out_per_thread, outputs[1] = result of batch i * max_out_per_thread + 1, etc. Items in the output range that have no corresponding input batch are invalid.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet#

The code snippet below illustrates reduction of 5 batches across 2 threads in each of 4 logical warps, yielding more than one output per thread (using cuda::std::span inputs and outputs):

// We can enable physical warp synchronization since all non-exited lanes do participate in the primitive.
using WarpReduceBatched = cub::WarpReduceBatched<int, 5, 2, true>;

// Assume 8 threads per block, so 8 / 2 = 4 logical warps
__shared__ typename WarpReduceBatched::TempStorage temp_storage[4];

const int logical_warp_id = static_cast<int>(threadIdx.x) / 2;

cuda::std::array<int, 5> thread_data{};
thread_data[0] = static_cast<int>(threadIdx.x) - 2;
thread_data[1] = static_cast<int>(threadIdx.x) - 1;
thread_data[2] = static_cast<int>(threadIdx.x);
thread_data[3] = static_cast<int>(threadIdx.x) + 1;
thread_data[4] = static_cast<int>(threadIdx.x) + 2;
// Use static-extent spans to pass all 5 elements as inputs and the last 3 elements as outputs (aliasing the inputs)
cuda::std::span<int, 5> inputs{cuda::std::begin(thread_data), 5};
cuda::std::span<int, 3> results{cuda::std::begin(thread_data) + 2, 3};
WarpReduceBatched{temp_storage[logical_warp_id]}.SumToBlocked(inputs, results);
// results across threads:
// [[-3, -1, 1], [3, 5, ?], [1, 3, 5], [7, 9, ?], ..., [9, 11, 13], [15, 17, ?]]


Template Parameters:
  • InputT[inferred] The data type to be reduced having member operator[](int i) and must be statically-sized (size() method or static array)

  • OutputT[inferred] The data type to hold results having member operator[](int i) and must be statically-sized (size() method or static array)

Parameters:
  • inputs[in] Statically-sized input range holding Batches items

  • outputs[out] Statically-sized output range holding ceil_div(Batches, LogicalWarpThreads) items

struct TempStorage : public Uninitialized<_TempStorage>#

The operations exposed by WarpReduceBatched require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union’d with other storage allocation types to facilitate memory reuse.
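The union pattern can be sketched as follows. This is an illustrative device-code fragment, not a complete program: the include path for WarpReduceBatched is assumed, and cub::WarpScan merely stands in for any other collective whose storage could share the allocation.

```cuda
#include <cub/warp/warp_scan.cuh>
// The header providing cub::WarpReduceBatched is assumed; the actual path may differ.

__global__ void example_kernel(int *out)
{
  using WarpReduceBatched = cub::WarpReduceBatched<int, 3>;
  using WarpScan          = cub::WarpScan<int>;

  // One shared-memory allocation serves both collectives in turn.
  __shared__ union
  {
    typename WarpReduceBatched::TempStorage reduce;
    typename WarpScan::TempStorage scan;
  } temp_storage;

  int items[3] = {1, 2, 3};
  int reduced  = WarpReduceBatched{temp_storage.reduce}.Sum(items);

  __syncwarp(); // barrier before repurposing the storage

  int scanned;
  WarpScan{temp_storage.scan}.InclusiveSum(reduced, scanned);
  if (threadIdx.x == 0)
  {
    *out = scanned;
  }
}
```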