cub::WarpReduceBatched#

template<typename T, int Batches, int LogicalWarpThreads = detail::warp_threads, bool SyncPhysicalWarp = false>
class WarpReduceBatched#

The WarpReduceBatched class provides collective methods for performing batched parallel reductions of multiple batches of items partitioned across a CUDA thread warp.

Overview#
A reduction (or fold) uses a binary combining operator to compute a single aggregate from a list of input elements. Parallel reductions are in general only deterministic when the reduction operator is both commutative and associative.
- Supports “logical” warps smaller than the physical warp size (e.g., logical warps of 8 threads).
- This primitive performs batched reductions, taking one item per batch per thread.
- Results are distributed among the threads. When there are more batches than logical warp threads, results can be distributed among threads in either a striped or a blocked manner.
- The number of batches must be non-negative. Compile time and register pressure increase with the number of batches.
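The striped and blocked output arrangements mentioned above can be sketched with a small host-side mapping (plain C++, not part of CUB; the helper names below are made up for illustration): with Batches results and LogicalWarpThreads lanes, each lane owns ceil_div(Batches, LogicalWarpThreads) output slots.

```cpp
// Host-side sketch (not part of CUB) of how batch results map to lanes.
// Striped: lane i, slot j holds the result of batch i + j * warp_threads.
// Blocked: lane i, slot j holds the result of batch i * max_out + j.
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Returns the batch index whose result lane `lane` holds in output slot `slot`,
// or -1 when that slot has no corresponding batch (its value is invalid).
constexpr int striped_batch(int lane, int slot, int batches, int warp_threads)
{
  const int b = lane + slot * warp_threads;
  return b < batches ? b : -1;
}

constexpr int blocked_batch(int lane, int slot, int batches, int warp_threads)
{
  const int b = lane * ceil_div(batches, warp_threads) + slot;
  return b < batches ? b : -1;
}
```

For example, with 5 batches and 2 logical warp threads, lane 0 holds batches {0, 2, 4} in the striped arrangement but {0, 1, 2} in the blocked arrangement.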
Performance Characteristics#
- Uses special instructions when applicable (e.g., warp SHFL instructions).
- Uses synchronization-free communication between warp lanes when applicable.
- Should generally give much better performance than sequential WarpReduce calls, independent of blocked or striped output arrangements.
- Achieves peak efficiency when the number of batches is a multiple of the number of logical warp threads.
- For logical warps smaller than the physical warp size, SyncPhysicalWarp = true should in general give better performance than SyncPhysicalWarp = false. Note that it can cause a deadlock if not all non-exited logical warps from the same physical warp participate in the reduction (e.g., due to branches).
- Any uneven number of batches is less efficient than the next higher even number.
- For more batches than logical warp threads, striped output can be slightly more performant than blocked output, depending on the number of batches and the number of logical warp threads.
- Blocked output should generally give much better performance than converting striped output to blocked using, e.g., WarpExchange::StripedToBlocked().
- For types smaller than 4 bytes, future optimization might let blocked output outperform striped output.

Simple Example#
Every thread in the warp uses the WarpReduceBatched class by first specializing the WarpReduceBatched type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.
The code snippet below illustrates reduction of 3 batches across 32 threads in each of 2 warps:
using WarpReduceBatched = cub::WarpReduceBatched<int, 3>;

// Assume 64 threads per block, so 64 / 32 = 2 logical warps
// Each logical warp has its own TempStorage
__shared__ typename WarpReduceBatched::TempStorage temp_storage[2];
const int logical_warp_id = static_cast<int>(threadIdx.x) / 32;

int thread_data[3];
thread_data[0] = static_cast<int>(threadIdx.x) - 1;
thread_data[1] = static_cast<int>(threadIdx.x);
thread_data[2] = static_cast<int>(threadIdx.x) + 1;

int result = WarpReduceBatched{temp_storage[logical_warp_id]}.Reduce(thread_data, cuda::maximum{});
// results across threads: [30, 31, 32, ?, ?, ..., ?, 62, 63, 64, ?, ?, ..., ?]
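The expected values in the comment above can be double-checked with a small host-side emulation (plain C++ mimicking the device snippet, not CUB itself; emulate_reduce_max is a made-up helper): for batch b, lane t of each 32-thread warp contributes t + b - 1, so the warp-wide maximum lands on thread b of the warp.

```cpp
#include <algorithm>
#include <vector>

// Host-side emulation (not CUB) of the 3-batch max-reduction above:
// thread t contributes {t - 1, t, t + 1}; thread b of each 32-thread
// warp receives the warp-wide maximum of batch b.
std::vector<int> emulate_reduce_max(int num_threads)
{
  constexpr int batches      = 3;
  constexpr int warp_threads = 32;
  std::vector<int> results(num_threads, -1); // -1 marks invalid/unset lanes
  for (int warp_start = 0; warp_start < num_threads; warp_start += warp_threads)
  {
    for (int b = 0; b < batches; ++b)
    {
      int agg = warp_start + b - 1; // input of the warp's first thread
      for (int t = warp_start; t < warp_start + warp_threads; ++t)
        agg = std::max(agg, t + b - 1);
      results[warp_start + b] = agg; // thread b of the warp holds batch b
    }
  }
  return results;
}
```

With 64 threads this reproduces the commented layout: lanes 0–2 hold {30, 31, 32} and lanes 32–34 hold {62, 63, 64}.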
- Template Parameters:
T – The reduction input/output element type
Batches – The number of batches to reduce. Also corresponds to the size of the range of inputs for each thread.
LogicalWarpThreads – [optional] The number of threads per “logical” warp (may be less than the number of hardware warp threads but has to be a power of two). Default is the warp size of the targeted CUDA compute-capability (e.g., 32 threads for SM80).
Collective constructors
-
inline WarpReduceBatched(TempStorage &temp_storage)#
Collective constructor using the specified memory allocation as temporary storage. Logical warp and lane identifiers are constructed from threadIdx.x.
- Parameters:
temp_storage – [in] Reference to memory allocation having layout type TempStorage
Generic reductions
-
template<typename InputT, typename ReductionOp>
inline T Reduce(
    const InputT &inputs,
    ReductionOp reduction_op
)#
Computes a warp-wide reduction for each batch in the calling warp using the specified binary reduction functor. Thread i returns the result for batch i. Returned items that have no corresponding input batch are invalid. For more batches than logical warp threads, or generic code that could result in zero batches, use ReduceToStriped() or ReduceToBlocked() instead.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet#
The code snippet below illustrates reduction of 3 batches across 16 threads in the branched-off first logical warp (using cuda::std::array inputs and outputs):

// Can't allow for physical warp synchronization since only the first logical warp participates due to the
// conditional. The other threads (assuming there are more than 16 threads per block) can't exit early due to the
// barrier.
using WarpReduceBatched = cub::WarpReduceBatched<int, 3, 16>;

// Only the first logical warp participates, so only a single TempStorage is needed
__shared__ typename WarpReduceBatched::TempStorage temp_storage;

int result{};
if (threadIdx.x < 16)
{
  cuda::std::array<int, 3> inputs{};
  inputs[0] = static_cast<int>(threadIdx.x) - 1;
  inputs[1] = static_cast<int>(threadIdx.x);
  inputs[2] = static_cast<int>(threadIdx.x) + 1;

  result = WarpReduceBatched{temp_storage}.Reduce(inputs, cuda::maximum{});
}
// results across threads: [14, 15, 16, ?, ?, ..., ?, 0, 0, ..., 0]

__syncthreads(); // Can reuse TempStorage after the barrier.
- Template Parameters:
InputT – [inferred] The data type to be reduced; must have member operator[](int i) and be statically sized (a size() method or a static array)
ReductionOp – [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b)
- Parameters:
inputs – [in] Statically-sized input range holding Batches items
reduction_op – [in] Binary reduction operator
- Returns:
The reduction of the input values of the batch corresponding to the logical lane.
-
template<typename InputT, typename OutputT, typename ReductionOp>
inline void ReduceToStriped(
    const InputT &inputs,
    OutputT &outputs,
    ReductionOp reduction_op
)#
Computes a warp-wide reduction for each batch in the calling warp using the specified binary reduction functor. The user must provide an output range of max_out_per_thread = ceil_div(Batches, LogicalWarpThreads) items. Logical lane i stores results in its output range in a striped manner: outputs[0] = result of batch i, outputs[1] = result of batch i + LogicalWarpThreads, etc. Items in the output range that have no corresponding input batch are invalid.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet#
The code snippet below illustrates reduction of 3 batches across 2 threads in every second logical warp (using cuda::std::array inputs and outputs):

// Can't allow for physical warp synchronization since only every other logical warp participates.
// The other threads (assuming there are more than 2 threads per block) can't exit early due to the
// barrier.
using WarpReduceBatched = cub::WarpReduceBatched<int, 3, 2>;

// Assume 8 threads per block, so 8 / 2 = 4 logical warps
// Only every other logical warp participates, so only 2 TempStorage are needed
__shared__ typename WarpReduceBatched::TempStorage temp_storage[2];
const int logical_warp_id = static_cast<int>(threadIdx.x) / 2;
const bool is_participating = logical_warp_id % 2 == 0;
const int participant_idx = logical_warp_id / 2;

cuda::std::array<int, 2> results{};
if (is_participating)
{
  cuda::std::array<int, 3> inputs{};
  inputs[0] = static_cast<int>(threadIdx.x) - 1;
  inputs[1] = static_cast<int>(threadIdx.x);
  inputs[2] = static_cast<int>(threadIdx.x) + 1;

  WarpReduceBatched{temp_storage[participant_idx]}.ReduceToStriped(inputs, results, cuda::maximum{});
}
// results across threads:
// [[0, 2], [1, ?], [0, 0], [0, 0], [4, 6], [5, ?], [0, 0], [0, 0]]

__syncthreads(); // Can reuse TempStorage after the barrier.
- Template Parameters:
InputT – [inferred] The data type to be reduced; must have member operator[](int i) and be statically sized (a size() method or a static array)
OutputT – [inferred] The data type to hold results; must have member operator[](int i) and be statically sized (a size() method or a static array)
ReductionOp – [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b)
- Parameters:
inputs – [in] Statically-sized input range holding Batches items
outputs – [out] Statically-sized output range holding ceil_div(Batches, LogicalWarpThreads) items
reduction_op – [in] Binary reduction operator
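The striped layout in the snippet above can be reproduced with a host-side emulation (plain C++, not CUB; emulate_striped_max is a made-up helper): with 3 batches and 2 logical warp threads, batch b reduces to the maximum of t + b - 1 over the two participating lanes, and lane i receives batches i and i + 2.

```cpp
#include <algorithm>
#include <array>

// Host-side emulation (not CUB) of ReduceToStriped for one participating
// 2-thread logical warp starting at thread `first`: thread t contributes
// {t - 1, t, t + 1}; lane i's slot j holds the max of batch i + j * 2.
std::array<std::array<int, 2>, 2> emulate_striped_max(int first)
{
  constexpr int batches      = 3;
  constexpr int warp_threads = 2;
  std::array<int, batches> batch_result{};
  for (int b = 0; b < batches; ++b)
    batch_result[b] = std::max(first + b - 1, first + 1 + b - 1);

  std::array<std::array<int, 2>, 2> out{}; // ceil_div(3, 2) = 2 slots per lane
  for (int lane = 0; lane < warp_threads; ++lane)
    for (int slot = 0; slot < 2; ++slot)
    {
      const int b  = lane + slot * warp_threads;            // striped mapping
      out[lane][slot] = b < batches ? batch_result[b] : -1; // -1: invalid slot
    }
  return out;
}
```

For the first participating warp (threads 0 and 1) this yields [[0, 2], [1, -1]], matching the commented [[0, 2], [1, ?]] above with -1 standing in for the invalid slot.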
-
template<typename InputT, typename OutputT, typename ReductionOp>
inline void ReduceToBlocked(
    const InputT &inputs,
    OutputT &outputs,
    ReductionOp reduction_op
)#
Computes a warp-wide reduction for each batch in the calling warp using the specified binary reduction functor. The user must provide an output range of max_out_per_thread = ceil_div(Batches, LogicalWarpThreads) items. Logical lane i stores results in its output range in a blocked manner: outputs[0] = result of batch i * max_out_per_thread, outputs[1] = result of batch i * max_out_per_thread + 1, etc. Items in the output range that have no corresponding input batch are invalid.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet#
The code snippet below illustrates reduction of 3 batches across 2 threads in every second logical warp (using cuda::std::array inputs and outputs):

// Can't allow for physical warp synchronization since only every other logical warp participates.
// The other threads (assuming there are more than 2 threads per block) can't exit early due to the
// barrier.
using WarpReduceBatched = cub::WarpReduceBatched<int, 3, 2>;

// Assume 8 threads per block, so 8 / 2 = 4 logical warps
// Only every other logical warp participates, so only 2 TempStorage are needed
__shared__ typename WarpReduceBatched::TempStorage temp_storage[2];
const int logical_warp_id = static_cast<int>(threadIdx.x) / 2;
const bool is_participating = logical_warp_id % 2 == 0;
const int participant_idx = logical_warp_id / 2;

cuda::std::array<int, 2> results{};
if (is_participating)
{
  cuda::std::array<int, 3> inputs{};
  inputs[0] = static_cast<int>(threadIdx.x) - 1;
  inputs[1] = static_cast<int>(threadIdx.x);
  inputs[2] = static_cast<int>(threadIdx.x) + 1;

  WarpReduceBatched{temp_storage[participant_idx]}.ReduceToBlocked(inputs, results, cuda::maximum{});
}
// results across threads:
// [[0, 1], [2, ?], [0, 0], [0, 0], [4, 5], [6, ?], [0, 0], [0, 0]]

__syncthreads(); // Can reuse TempStorage after the barrier.
- Template Parameters:
InputT – [inferred] The data type to be reduced; must have member operator[](int i) and be statically sized (a size() method or a static array)
OutputT – [inferred] The data type to hold results; must have member operator[](int i) and be statically sized (a size() method or a static array)
ReductionOp – [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b)
- Parameters:
inputs – [in] Statically-sized input range holding Batches items
outputs – [out] Statically-sized output range holding ceil_div(Batches, LogicalWarpThreads) items
reduction_op – [in] Binary reduction operator
Sum reductions
-
template<typename InputT>
inline T Sum(
    const InputT &inputs
)#
Computes a warp-wide sum for each batch in the calling warp. Thread i returns the result for batch i. Returned items that have no corresponding input batch are invalid. For more batches than logical warp threads, or generic code that could result in zero batches, use SumToStriped() or SumToBlocked() instead.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet#
The code snippet below illustrates reduction of 3 batches across 4 threads in each of 2 logical warps:

// We can enable physical warp synchronization since all non-exited lanes do participate in the primitive.
using WarpReduceBatched = cub::WarpReduceBatched<int, 3, 4, true>;

// Assume 8 threads per block, so 8 / 4 = 2 logical warps
__shared__ typename WarpReduceBatched::TempStorage temp_storage[2];
const int logical_warp_id = static_cast<int>(threadIdx.x) / 4;

cuda::std::array<int, 3> inputs{};
inputs[0] = static_cast<int>(threadIdx.x) - 1;
inputs[1] = static_cast<int>(threadIdx.x);
inputs[2] = static_cast<int>(threadIdx.x) + 1;

int result = WarpReduceBatched{temp_storage[logical_warp_id]}.Sum(inputs);
// results across threads:
// [2, 6, 10, ?, 18, 22, 26, ?]
- Template Parameters:
InputT – [inferred] The data type to be reduced; must have member operator[](int i) and be statically sized (a size() method or a static array)
- Parameters:
inputs – [in] Statically-sized input range holding Batches items
- Returns:
The sum of the input values of the batch corresponding to the logical lane.
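The commented Sum results above can be verified with a tiny host-side check (plain C++, not CUB; emulate_batch_sum is a made-up helper): thread b of a 4-thread logical warp receives the sum of t + b - 1 over the warp's four lanes.

```cpp
// Host-side check (not CUB) of the Sum snippet above: thread t contributes
// {t - 1, t, t + 1}; thread b of the 4-thread logical warp starting at
// `first` receives the sum of batch b over the four lanes.
int emulate_batch_sum(int first, int b)
{
  int sum = 0;
  for (int t = first; t < first + 4; ++t)
    sum += t + b - 1;
  return sum;
}
```

For the first logical warp (threads 0–3) this gives 2, 6, and 10 for batches 0–2, and 18, 22, and 26 for the second warp (threads 4–7), matching the comment.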
-
template<typename InputT, typename OutputT>
inline void SumToStriped(
    const InputT &inputs,
    OutputT &outputs
)#
Computes a warp-wide sum for each batch in the calling warp. The user must provide an output range of max_out_per_thread = ceil_div(Batches, LogicalWarpThreads) items. Logical lane i stores results in its output range in a striped manner: outputs[0] = result of batch i, outputs[1] = result of batch i + LogicalWarpThreads, etc. Items in the output range that have no corresponding input batch are invalid.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet#
The code snippet below illustrates reduction of 5 batches across 2 threads in each of 4 logical warps, meaning more than one output per thread (using cuda::std::span inputs and outputs):

// We can enable physical warp synchronization since all non-exited lanes do participate in the primitive.
using WarpReduceBatched = cub::WarpReduceBatched<int, 5, 2, true>;

// Assume 8 threads per block, so 8 / 2 = 4 logical warps
__shared__ typename WarpReduceBatched::TempStorage temp_storage[4];
const int logical_warp_id = static_cast<int>(threadIdx.x) / 2;

cuda::std::array<int, 5> thread_data{};
thread_data[0] = static_cast<int>(threadIdx.x) - 2;
thread_data[1] = static_cast<int>(threadIdx.x) - 1;
thread_data[2] = static_cast<int>(threadIdx.x);
thread_data[3] = static_cast<int>(threadIdx.x) + 1;
thread_data[4] = static_cast<int>(threadIdx.x) + 2;

// Use statically sized spans to view all 5 elements as inputs and the last 3 elements as outputs (aliasing the inputs)
cuda::std::span<int, 5> inputs{cuda::std::begin(thread_data), 5};
cuda::std::span<int, 3> results{cuda::std::begin(thread_data) + 2, 3};

WarpReduceBatched{temp_storage[logical_warp_id]}.SumToStriped(inputs, results);
// results across threads:
// [[-3, 1, 5], [-1, 3, ?], [1, 5, 9], [3, 7, ?], ..., [9, 13, 17], [11, 15, ?]]
- Template Parameters:
InputT – [inferred] The data type to be reduced; must have member operator[](int i) and be statically sized (a size() method or a static array)
OutputT – [inferred] The data type to hold results; must have member operator[](int i) and be statically sized (a size() method or a static array)
- Parameters:
inputs – [in] Statically-sized input range holding Batches items
outputs – [out] Statically-sized output range holding ceil_div(Batches, LogicalWarpThreads) items
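The multi-output striped results in the snippet above can be reproduced on the host (plain C++, not CUB; emulate_sum_to_striped is a made-up helper): each batch b sums two consecutive threads' t + b - 2 contributions, and lane i holds batches i, i + 2, and i + 4.

```cpp
#include <array>

// Host-side emulation (not CUB) of the 5-batch SumToStriped snippet for a
// 2-thread logical warp starting at thread `first`: thread t contributes
// {t - 2, ..., t + 2}; lane i's slot j holds the sum of batch i + j * 2.
std::array<std::array<int, 3>, 2> emulate_sum_to_striped(int first)
{
  constexpr int batches      = 5;
  constexpr int warp_threads = 2;
  std::array<int, batches> batch_sum{};
  for (int b = 0; b < batches; ++b)
    batch_sum[b] = (first + b - 2) + (first + 1 + b - 2);

  std::array<std::array<int, 3>, 2> out{}; // ceil_div(5, 2) = 3 slots per lane
  for (int lane = 0; lane < warp_threads; ++lane)
    for (int slot = 0; slot < 3; ++slot)
    {
      const int b  = lane + slot * warp_threads;        // striped mapping
      out[lane][slot] = b < batches ? batch_sum[b] : 0; // 0: invalid slot
    }
  return out;
}
```

For the first logical warp (threads 0 and 1) this gives [[-3, 1, 5], [-1, 3, 0]], matching the commented [[-3, 1, 5], [-1, 3, ?]] above with 0 standing in for the invalid slot.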
-
template<typename InputT, typename OutputT>
inline void SumToBlocked(
    const InputT &inputs,
    OutputT &outputs
)#
Computes a warp-wide sum for each batch in the calling warp. The user must provide an output range of max_out_per_thread = ceil_div(Batches, LogicalWarpThreads) items. Logical lane i stores results in its output range in a blocked manner: outputs[0] = result of batch i * max_out_per_thread, outputs[1] = result of batch i * max_out_per_thread + 1, etc. Items in the output range that have no corresponding input batch are invalid.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet#
The code snippet below illustrates reduction of 5 batches across 2 threads in each of 4 logical warps, meaning more than one output per thread (using cuda::std::span inputs and outputs):

// We can enable physical warp synchronization since all non-exited lanes do participate in the primitive.
using WarpReduceBatched = cub::WarpReduceBatched<int, 5, 2, true>;

// Assume 8 threads per block, so 8 / 2 = 4 logical warps
__shared__ typename WarpReduceBatched::TempStorage temp_storage[4];
const int logical_warp_id = static_cast<int>(threadIdx.x) / 2;

cuda::std::array<int, 5> thread_data{};
thread_data[0] = static_cast<int>(threadIdx.x) - 2;
thread_data[1] = static_cast<int>(threadIdx.x) - 1;
thread_data[2] = static_cast<int>(threadIdx.x);
thread_data[3] = static_cast<int>(threadIdx.x) + 1;
thread_data[4] = static_cast<int>(threadIdx.x) + 2;

// Use statically sized spans to view all 5 elements as inputs and the last 3 elements as outputs (aliasing the inputs)
cuda::std::span<int, 5> inputs{cuda::std::begin(thread_data), 5};
cuda::std::span<int, 3> results{cuda::std::begin(thread_data) + 2, 3};

WarpReduceBatched{temp_storage[logical_warp_id]}.SumToBlocked(inputs, results);
// results across threads:
// [[-3, -1, 1], [3, 5, ?], [1, 3, 5], [7, 9, ?], ..., [9, 11, 13], [15, 17, ?]]
- Template Parameters:
InputT – [inferred] The data type to be reduced; must have member operator[](int i) and be statically sized (a size() method or a static array)
OutputT – [inferred] The data type to hold results; must have member operator[](int i) and be statically sized (a size() method or a static array)
- Parameters:
inputs – [in] Statically-sized input range holding Batches items
outputs – [out] Statically-sized output range holding ceil_div(Batches, LogicalWarpThreads) items
-
struct TempStorage : public Uninitialized<_TempStorage>#
The operations exposed by WarpReduceBatched require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union’d with other storage allocation types to facilitate memory reuse.