cub::BlockShuffle
Defined in cub/block/block_shuffle.cuh
-
template<typename T, int BLOCK_DIM_X, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int LEGACY_PTX_ARCH = 0>
class BlockShuffle The BlockShuffle class provides collective methods for shuffling data partitioned across a CUDA thread block.
Overview
It is commonplace for blocks of threads to rearrange data items between threads. The BlockShuffle abstraction allows threads to efficiently shift items either (a) up to their successor or (b) down to their predecessor
- Template Parameters
T – The data type to be exchanged.
BLOCK_DIM_X – The thread block length in threads along the X dimension
BLOCK_DIM_Y – [optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z – [optional] The thread block length in threads along the Z dimension (default: 1)
LEGACY_PTX_ARCH – [optional] Unused
Collective constructors
-
inline BlockShuffle()
Collective constructor using a private static allocation of shared memory as temporary storage.
-
inline BlockShuffle(TempStorage &temp_storage)
Collective constructor using the specified memory allocation as temporary storage.
- Parameters
temp_storage – [in] Reference to memory allocation having layout type TempStorage
Shuffle movement
-
inline void Offset(T input, T &output, int distance = 1)
Each threadi obtains the
input
provided by threadi + distance. The offsetdistance
may be negative.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
- Parameters
input – [in]
The input item from the calling thread (threadi)
output – [out]
The
input
item from the successor (or predecessor) thread threadi + distance (may be aliased toinput
). This value is only updated for for threadi when0 <= (i + distance) < BLOCK_THREADS - 1
distance – [in] Offset distance (may be negative)
-
inline void Rotate(T input, T &output, unsigned int distance = 1)
Each threadi obtains the
input
provided by threadi + distance.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
- Parameters
input – [in] The calling thread’s input item
output – [out]
The
input
item from thread thread(i + distance>) % BLOCK_THREADS (may be aliased toinput
). This value is not updated for threadBLOCK_THREADS - 1.distance – [in] Offset distance (
0 < distance <
BLOCK_THREADS`)
-
template<int ITEMS_PER_THREAD>
inline void Up(T (&input)[ITEMS_PER_THREAD], T (&prev)[ITEMS_PER_THREAD]) The thread block rotates its blocked arrangement of
input
items, shifting it up by one item.Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
- Parameters
input – [in] The calling thread’s input items
prev – [out]
The corresponding predecessor items (may be aliased to
input
). The itemprev[0]
is not updated for thread0.
-
template<int ITEMS_PER_THREAD>
inline void Up(T (&input)[ITEMS_PER_THREAD], T (&prev)[ITEMS_PER_THREAD], T &block_suffix) The thread block rotates its blocked arrangement of
input
items, shifting it up by one item. All threads receive theinput
provided by threadBLOCK_THREADS - 1.Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
- Parameters
input – [in] The calling thread’s input items
prev – [out]
The corresponding predecessor items (may be aliased to
input
). The itemprev[0]
is not updated for thread0.block_suffix – [out]
The item
input[ITEMS_PER_THREAD - 1]
from threadBLOCK_THREADS - 1, provided to all threads
-
template<int ITEMS_PER_THREAD>
inline void Down(T (&input)[ITEMS_PER_THREAD], T (&prev)[ITEMS_PER_THREAD]) The thread block rotates its blocked arrangement of
input
items, shifting it down by one item.Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
- Parameters
input – [in] The calling thread’s input items
prev – [out]
The corresponding predecessor items (may be aliased to
input
). The valueprev[0]
is not updated for threadBLOCK_THREADS - 1.
-
template<int ITEMS_PER_THREAD>
inline void Down(T (&input)[ITEMS_PER_THREAD], T (&prev)[ITEMS_PER_THREAD], T &block_prefix) The thread block rotates its blocked arrangement of input items, shifting it down by one item. All threads receive
input[0]
provided by thread0.Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
- Parameters
input – [in] The calling thread’s input items
prev – [out]
The corresponding predecessor items (may be aliased to
input
). The valueprev[0]
is not updated for threadBLOCK_THREADS - 1.block_prefix – [out]
The item
input[0]
from thread0, provided to all threads
-
struct TempStorage : public Uninitialized<_TempStorage>
The operations exposed by BlockShuffle require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the
__shared__
keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) orunion
’d with other storage allocation types to facilitate memory reuse.