cub::WarpExchange
Defined in cub/warp/warp_exchange.cuh
-
template<typename InputT, int ITEMS_PER_THREAD, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int LEGACY_PTX_ARCH = 0, WarpExchangeAlgorithm WARP_EXCHANGE_ALGORITHM = WARP_EXCHANGE_SMEM>
class WarpExchange : private detail::InternalWarpExchangeImpl<InputT, ITEMS_PER_THREAD, CUB_PTX_WARP_THREADS, WARP_EXCHANGE_SMEM> The WarpExchange class provides methods for rearranging data partitioned across a CUDA warp.
- Overview
It is commonplace for a warp of threads to rearrange data items between threads. For example, the global memory accesses prefer patterns where data items are “striped” across threads (where consecutive threads access consecutive items), yet most warp-wide operations prefer a “blocked” partitioning of items across threads (where consecutive items belong to a single thread).
WarpExchange supports the following types of data exchanges:
- A Simple Example
The code snippet below illustrates the conversion from a “blocked” to a “striped” arrangement of 64 integer items partitioned across 16 threads where each thread owns 4 items.
#include <cub/cub.cuh> // or equivalently <cub/warp/warp_exchange.cuh> __global__ void ExampleKernel(int *d_data, ...) { constexpr int warp_threads = 16; constexpr int block_threads = 256; constexpr int items_per_thread = 4; constexpr int warps_per_block = block_threads / warp_threads; const int warp_id = static_cast<int>(threadIdx.x) / warp_threads; // Specialize WarpExchange for a virtual warp of 16 threads owning 4 integer items each using WarpExchangeT = cub::WarpExchange<int, items_per_thread, warp_threads>; // Allocate shared memory for WarpExchange __shared__ typename WarpExchangeT::TempStorage temp_storage[warps_per_block]; // Load a tile of data striped across threads int thread_data[items_per_thread]; // ... // Collectively exchange data into a blocked arrangement across threads WarpExchangeT(temp_storage[warp_id]).StripedToBlocked(thread_data, thread_data);
Suppose the set of striped input
thread_data
across the block of threads is{ [0,16,32,48], [1,17,33,49], ..., [15, 32, 47, 63] }
. The corresponding outputthread_data
in those threads will be{ [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [60,61,62,63] }
.
- Template Parameters
T – The data type to be exchanged.
ITEMS_PER_THREAD – The number of items partitioned onto each thread.
LOGICAL_WARP_THREADS – [optional] The number of threads per “logical” warp (may be less than the number of hardware warp threads). Default is the warp size of the targeted CUDA compute-capability (e.g., 32 threads for SM86). Must be a power of two.
LEGACY_PTX_ARCH – Unused.
Collective constructors
-
WarpExchange() = delete
-
inline explicit WarpExchange(TempStorage &temp_storage)
Collective constructor using the specified memory allocation as temporary storage.
Data movement
-
template<typename OutputT>
inline void BlockedToStriped(const InputT (&input_items)[ITEMS_PER_THREAD], OutputT (&output_items)[ITEMS_PER_THREAD]) Transposes data items from blocked arrangement to striped arrangement.
A subsequent
__syncwarp()
warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.- Snippet
The code snippet below illustrates the conversion from a “blocked” to a “striped” arrangement of 64 integer items partitioned across 16 threads where each thread owns 4 items.
#include <cub/cub.cuh> // or equivalently <cub/warp/warp_exchange.cuh> __global__ void ExampleKernel(int *d_data, ...) { constexpr int warp_threads = 16; constexpr int block_threads = 256; constexpr int items_per_thread = 4; constexpr int warps_per_block = block_threads / warp_threads; const int warp_id = static_cast<int>(threadIdx.x) / warp_threads; // Specialize WarpExchange for a virtual warp of 16 threads owning 4 integer items each using WarpExchangeT = cub::WarpExchange<int, items_per_thread, warp_threads>; // Allocate shared memory for WarpExchange __shared__ typename WarpExchangeT::TempStorage temp_storage[warps_per_block]; // Obtain a segment of consecutive items that are blocked across threads int thread_data[items_per_thread]; // ... // Collectively exchange data into a striped arrangement across threads WarpExchangeT(temp_storage[warp_id]).BlockedToStriped(thread_data, thread_data);
Suppose the set of striped input
thread_data
across the block of threads is{ [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [60,61,62,63] }
. The corresponding outputthread_data
in those threads will be{ [0,16,32,48], [1,17,33,49], ..., [15, 32, 47, 63] }
.
- Parameters
input_items – [in] Items to exchange, converting between blocked and striped arrangements.
output_items – [out] Items from exchange, converting between striped and blocked arrangements. May be aliased to
input_items
.
-
template<typename OutputT>
inline void StripedToBlocked(const InputT (&input_items)[ITEMS_PER_THREAD], OutputT (&output_items)[ITEMS_PER_THREAD]) Transposes data items from striped arrangement to blocked arrangement.
A subsequent
__syncwarp()
warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.- Snippet
The code snippet below illustrates the conversion from a “striped” to a “blocked” arrangement of 64 integer items partitioned across 16 threads where each thread owns 4 items.
#include <cub/cub.cuh> // or equivalently <cub/warp/warp_exchange.cuh> __global__ void ExampleKernel(int *d_data, ...) { constexpr int warp_threads = 16; constexpr int block_threads = 256; constexpr int items_per_thread = 4; constexpr int warps_per_block = block_threads / warp_threads; const int warp_id = static_cast<int>(threadIdx.x) / warp_threads; // Specialize WarpExchange for a virtual warp of 16 threads owning 4 integer items each using WarpExchangeT = cub::WarpExchange<int, items_per_thread, warp_threads>; // Allocate shared memory for WarpExchange __shared__ typename WarpExchangeT::TempStorage temp_storage[warps_per_block]; // Load a tile of data striped across threads int thread_data[items_per_thread]; // ... // Collectively exchange data into a blocked arrangement across threads WarpExchangeT(temp_storage[warp_id]).StripedToBlocked(thread_data, thread_data);
Suppose the set of striped input
thread_data
across the block of threads is{ [0,16,32,48], [1,17,33,49], ..., [15, 32, 47, 63] }
. The corresponding outputthread_data
in those threads will be{ [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [60,61,62,63] }
.
- Parameters
input_items – [in] Items to exchange
output_items – [out] Items from exchange. May be aliased to
input_items
.
-
template<typename OffsetT>
inline void ScatterToStriped(InputT (&items)[ITEMS_PER_THREAD], OffsetT (&ranks)[ITEMS_PER_THREAD]) Exchanges valid data items annotated by rank into striped arrangement.
A subsequent
__syncwarp()
warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.- Snippet
The code snippet below illustrates the conversion from a “scatter” to a “striped” arrangement of 64 integer items partitioned across 16 threads where each thread owns 4 items.
#include <cub/cub.cuh> // or equivalently <cub/warp/warp_exchange.cuh> __global__ void ExampleKernel(int *d_data, ...) { constexpr int warp_threads = 16; constexpr int block_threads = 256; constexpr int items_per_thread = 4; constexpr int warps_per_block = block_threads / warp_threads; const int warp_id = static_cast<int>(threadIdx.x) / warp_threads; // Specialize WarpExchange for a virtual warp of 16 threads owning 4 integer items each using WarpExchangeT = cub::WarpExchange<int, items_per_thread, warp_threads>; // Allocate shared memory for WarpExchange __shared__ typename WarpExchangeT::TempStorage temp_storage[warps_per_block]; // Obtain a segment of consecutive items that are blocked across threads int thread_data[items_per_thread]; int thread_ranks[items_per_thread]; // ... // Collectively exchange data into a striped arrangement across threads WarpExchangeT(temp_storage[warp_id]).ScatterToStriped( thread_data, thread_ranks);
Suppose the set of input
thread_data
across the block of threads is{ [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }
, and the set ofthread_ranks
is{ [63,62,61,60], ..., [7,6,5,4], [3,2,1,0] }
. The corresponding outputthread_data
in those threads will be{ [63, 47, 31, 15], [62, 46, 30, 14], ..., [48, 32, 16, 0] }
.
- Template Parameters
OffsetT – [inferred] Signed integer type for local offsets
- Parameters
items – [inout] Items to exchange
ranks – [in] Corresponding scatter ranks
-
template<typename OutputT, typename OffsetT>
inline void ScatterToStriped(const InputT (&input_items)[ITEMS_PER_THREAD], OutputT (&output_items)[ITEMS_PER_THREAD], OffsetT (&ranks)[ITEMS_PER_THREAD]) Exchanges valid data items annotated by rank into striped arrangement.
A subsequent
__syncwarp()
warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.- Snippet
The code snippet below illustrates the conversion from a “scatter” to a “striped” arrangement of 64 integer items partitioned across 16 threads where each thread owns 4 items.
#include <cub/cub.cuh> // or equivalently <cub/warp/warp_exchange.cuh> __global__ void ExampleKernel(int *d_data, ...) { constexpr int warp_threads = 16; constexpr int block_threads = 256; constexpr int items_per_thread = 4; constexpr int warps_per_block = block_threads / warp_threads; const int warp_id = static_cast<int>(threadIdx.x) / warp_threads; // Specialize WarpExchange for a virtual warp of 16 threads owning 4 integer items each using WarpExchangeT = cub::WarpExchange<int, items_per_thread, warp_threads>; // Allocate shared memory for WarpExchange __shared__ typename WarpExchangeT::TempStorage temp_storage[warps_per_block]; // Obtain a segment of consecutive items that are blocked across threads int thread_input[items_per_thread]; int thread_ranks[items_per_thread]; // ... // Collectively exchange data into a striped arrangement across threads int thread_output[items_per_thread]; WarpExchangeT(temp_storage[warp_id]).ScatterToStriped( thread_input, thread_output, thread_ranks);
Suppose the set of input
thread_input
across the block of threads is{ [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }
, and the set ofthread_ranks
is{ [63,62,61,60], ..., [7,6,5,4], [3,2,1,0] }
. The correspondingthread_output
in those threads will be{ [63, 47, 31, 15], [62, 46, 30, 14], ..., [48, 32, 16, 0] }
.
- Template Parameters
OffsetT – [inferred] Signed integer type for local offsets
- Parameters
input_items – [in] Items to exchange
output_items – [out] Items from exchange. May be aliased to
input_items
.ranks – [in] Corresponding scatter ranks
Public Types
-
using TempStorage = typename InternalWarpExchange::TempStorage
The operations exposed by WarpExchange require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the
__shared__
keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) orunion
’d with other storage allocation types to facilitate memory reuse.