cub::BlockShuffle

Defined in /home/runner/work/cccl/cccl/cub/cub/block/block_shuffle.cuh

template<typename T, int BLOCK_DIM_X, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int LEGACY_PTX_ARCH = 0>
class BlockShuffle

The BlockShuffle class provides collective methods for shuffling data partitioned across a CUDA thread block.

Overview

It is commonplace for blocks of threads to rearrange data items between threads. The BlockShuffle abstraction allows threads to efficiently shift items either (a) up to their successor or (b) down to their predecessor.
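
The following sketch (not from the CUB documentation; the kernel name, block size, and data are illustrative assumptions) shows the typical usage pattern: specialize BlockShuffle for the thread block, allocate its TempStorage in shared memory, and invoke a shuffle method.

#include <cub/block/block_shuffle.cuh>

__global__ void OverviewSketchKernel(int* d_out)
{
    // Specialize BlockShuffle for a 1D block of 128 threads owning int items
    using BlockShuffle = cub::BlockShuffle<int, 128>;

    // Allocate shared memory for BlockShuffle
    __shared__ typename BlockShuffle::TempStorage temp_storage;

    int thread_data = threadIdx.x;
    int shifted     = thread_data;  // fallback where the output is not updated

    // Each thread_i obtains the item of thread_(i + 1); the last thread keeps its fallback
    BlockShuffle(temp_storage).Offset(thread_data, shifted, 1);

    d_out[threadIdx.x] = shifted;
}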

Template Parameters
  • T – The data type to be exchanged.

  • BLOCK_DIM_X – The thread block length in threads along the X dimension

  • BLOCK_DIM_Y – [optional] The thread block length in threads along the Y dimension (default: 1)

  • BLOCK_DIM_Z – [optional] The thread block length in threads along the Z dimension (default: 1)

  • LEGACY_PTX_ARCH – [optional] Unused

Collective constructors

inline BlockShuffle()

Collective constructor using a private static allocation of shared memory as temporary storage.

inline BlockShuffle(TempStorage &temp_storage)

Collective constructor using the specified memory allocation as temporary storage.

Parameters

temp_storage[in] Reference to memory allocation having layout type TempStorage
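
A brief sketch of the two construction styles (illustrative only; the kernel name is an assumption):

#include <cub/block/block_shuffle.cuh>

__global__ void ConstructorSketchKernel()
{
    using BlockShuffle = cub::BlockShuffle<float, 64>;

    // (1) Collective constructor using a private static shared-memory allocation
    BlockShuffle shuffle_default;

    // (2) Collective constructor using caller-provided temporary storage
    __shared__ typename BlockShuffle::TempStorage temp_storage;
    BlockShuffle shuffle_explicit(temp_storage);

    (void) shuffle_default;
    (void) shuffle_explicit;
}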

Shuffle movement

inline void Offset(T input, T &output, int distance = 1)

Each thread_i obtains the input provided by thread_(i + distance). The offset distance may be negative.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Parameters
  • input[in]

    The input item from the calling thread (thread_i)

  • output[out]

    The input item from the successor (or predecessor) thread_(i + distance) (may be aliased to input). This value is only updated for thread_i when 0 <= (i + distance) < BLOCK_THREADS - 1

  • distance[in] Offset distance (may be negative)
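
A minimal sketch of Offset with a negative distance (kernel name, block size, and data are assumptions): each thread_i receives the item of thread_(i - 1), and the output of thread_0 is left unchanged.

#include <cub/block/block_shuffle.cuh>

__global__ void OffsetDownKernel(const int* d_in, int* d_out)
{
    using BlockShuffle = cub::BlockShuffle<int, 128>;
    __shared__ typename BlockShuffle::TempStorage temp_storage;

    int thread_data = d_in[threadIdx.x];
    int from_pred   = thread_data;  // fallback when (i + distance) is out of range

    // Each thread_i obtains the item of thread_(i - 1)
    BlockShuffle(temp_storage).Offset(thread_data, from_pred, -1);

    d_out[threadIdx.x] = from_pred;
}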

inline void Rotate(T input, T &output, unsigned int distance = 1)

Each thread_i obtains the input provided by thread_((i + distance) % BLOCK_THREADS).

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Parameters
  • input[in] The calling thread’s input item

  • output[out]

    The input item from thread_((i + distance) % BLOCK_THREADS) (may be aliased to input). This value is not updated for thread_(BLOCK_THREADS - 1).

  • distance[in] Offset distance (0 < distance < BLOCK_THREADS)
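
A minimal sketch of Rotate (kernel name, block size, and rotation distance are assumptions): each thread_i receives the item of thread_((i + 32) % 128).

#include <cub/block/block_shuffle.cuh>

__global__ void RotateKernel(const int* d_in, int* d_out)
{
    using BlockShuffle = cub::BlockShuffle<int, 128>;
    __shared__ typename BlockShuffle::TempStorage temp_storage;

    int thread_data = d_in[threadIdx.x];
    int rotated     = thread_data;  // fallback for thread_(BLOCK_THREADS - 1)

    // Rotate the block's items by one warp's worth of threads
    BlockShuffle(temp_storage).Rotate(thread_data, rotated, 32);

    d_out[threadIdx.x] = rotated;
}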

template<int ITEMS_PER_THREAD>
inline void Up(T (&input)[ITEMS_PER_THREAD], T (&prev)[ITEMS_PER_THREAD])

The thread block rotates its blocked arrangement of input items, shifting it up by one item.

  • Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread_i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Parameters
  • input[in] The calling thread’s input items

  • prev[out]

    The corresponding predecessor items (may be aliased to input). The item prev[0] is not updated for thread_0.
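
A minimal sketch of Up over a blocked arrangement of four items per thread (kernel name, sizes, and data are assumptions). After the call, each prev[j] holds the item that precedes input[j] in the block-wide sequence; prev[0] of thread_0 is left unchanged.

#include <cub/block/block_shuffle.cuh>

__global__ void ShiftUpKernel(const int* d_in, int* d_out)
{
    constexpr int BLOCK_THREADS    = 128;
    constexpr int ITEMS_PER_THREAD = 4;

    using BlockShuffle = cub::BlockShuffle<int, BLOCK_THREADS>;
    __shared__ typename BlockShuffle::TempStorage temp_storage;

    // Blocked arrangement: thread_i owns the i-th range of 4 contiguous items
    int items[ITEMS_PER_THREAD];
    int prev[ITEMS_PER_THREAD];
    for (int j = 0; j < ITEMS_PER_THREAD; ++j)
    {
        items[j] = d_in[threadIdx.x * ITEMS_PER_THREAD + j];
        prev[j]  = items[j];  // fallback for prev[0] of thread_0
    }

    // Rotate the blocked arrangement up by one item
    BlockShuffle(temp_storage).Up(items, prev);

    for (int j = 0; j < ITEMS_PER_THREAD; ++j)
    {
        d_out[threadIdx.x * ITEMS_PER_THREAD + j] = prev[j];
    }
}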

template<int ITEMS_PER_THREAD>
inline void Up(T (&input)[ITEMS_PER_THREAD], T (&prev)[ITEMS_PER_THREAD], T &block_suffix)

The thread block rotates its blocked arrangement of input items, shifting it up by one item. All threads receive the input provided by thread_(BLOCK_THREADS - 1).

  • Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread_i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Parameters
  • input[in] The calling thread’s input items

  • prev[out]

    The corresponding predecessor items (may be aliased to input). The item prev[0] is not updated for thread_0.

  • block_suffix[out]

    The item input[ITEMS_PER_THREAD - 1] from thread_(BLOCK_THREADS - 1), provided to all threads
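
A minimal sketch of the Up overload that additionally broadcasts the block-wide suffix (kernel name, sizes, and data are assumptions); the shift is done in place by aliasing prev to input.

#include <cub/block/block_shuffle.cuh>

__global__ void ShiftUpSuffixKernel(const int* d_in, int* d_suffix)
{
    using BlockShuffle = cub::BlockShuffle<int, 128>;
    __shared__ typename BlockShuffle::TempStorage temp_storage;

    int items[4];
    for (int j = 0; j < 4; ++j)
    {
        items[j] = d_in[threadIdx.x * 4 + j];
    }

    int block_suffix;
    // Shift up in place (prev aliased to input) and capture the block-wide suffix
    BlockShuffle(temp_storage).Up(items, items, block_suffix);

    if (threadIdx.x == 0)
    {
        *d_suffix = block_suffix;  // input[3] of thread_127, i.e. d_in[127 * 4 + 3]
    }
}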

template<int ITEMS_PER_THREAD>
inline void Down(T (&input)[ITEMS_PER_THREAD], T (&prev)[ITEMS_PER_THREAD])

The thread block rotates its blocked arrangement of input items, shifting it down by one item.

  • Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread_i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Parameters
  • input[in] The calling thread’s input items

  • prev[out]

    The corresponding predecessor items (may be aliased to input). The value prev[0] is not updated for thread_(BLOCK_THREADS - 1).
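
A minimal sketch of Down over a blocked arrangement of four items per thread (kernel name, sizes, and data are assumptions); prev[0] of the last thread keeps its fallback value.

#include <cub/block/block_shuffle.cuh>

__global__ void ShiftDownKernel(const int* d_in, int* d_out)
{
    using BlockShuffle = cub::BlockShuffle<int, 128>;
    __shared__ typename BlockShuffle::TempStorage temp_storage;

    int items[4];
    int prev[4];
    for (int j = 0; j < 4; ++j)
    {
        items[j] = d_in[threadIdx.x * 4 + j];
        prev[j]  = items[j];  // fallback for prev[0] of thread_127
    }

    // Rotate the blocked arrangement down by one item
    BlockShuffle(temp_storage).Down(items, prev);

    for (int j = 0; j < 4; ++j)
    {
        d_out[threadIdx.x * 4 + j] = prev[j];
    }
}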

template<int ITEMS_PER_THREAD>
inline void Down(T (&input)[ITEMS_PER_THREAD], T (&prev)[ITEMS_PER_THREAD], T &block_prefix)

The thread block rotates its blocked arrangement of input items, shifting it down by one item. All threads receive input[0] provided by thread_0.

  • Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread_i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Parameters
  • input[in] The calling thread’s input items

  • prev[out]

    The corresponding predecessor items (may be aliased to input). The value prev[0] is not updated for thread_(BLOCK_THREADS - 1).

  • block_prefix[out]

    The item input[0] from thread_0, provided to all threads
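
A minimal sketch of the Down overload that additionally broadcasts the block-wide prefix (kernel name, sizes, and data are assumptions); every thread receives input[0] held by thread_0.

#include <cub/block/block_shuffle.cuh>

__global__ void ShiftDownPrefixKernel(const int* d_in, int* d_prefix)
{
    using BlockShuffle = cub::BlockShuffle<int, 128>;
    __shared__ typename BlockShuffle::TempStorage temp_storage;

    int items[4];
    for (int j = 0; j < 4; ++j)
    {
        items[j] = d_in[threadIdx.x * 4 + j];
    }

    int block_prefix;
    // Rotate down in place (prev aliased to input) and capture the block-wide prefix
    BlockShuffle(temp_storage).Down(items, items, block_prefix);

    if (threadIdx.x == 0)
    {
        *d_prefix = block_prefix;  // input[0] of thread_0, i.e. d_in[0]
    }
}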

struct TempStorage : public Uninitialized<_TempStorage>

The operations exposed by BlockShuffle require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union’d with other storage allocation types to facilitate memory reuse.
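
A minimal sketch of union'ing TempStorage with another collective's storage to reuse shared memory (the kernel name and the pairing with cub::BlockReduce are illustrative assumptions); note the __syncthreads() barrier before the allocation is repurposed.

#include <cub/block/block_reduce.cuh>
#include <cub/block/block_shuffle.cuh>

__global__ void UnionStorageKernel(int* d_out)
{
    using BlockShuffle = cub::BlockShuffle<int, 128>;
    using BlockReduce  = cub::BlockReduce<int, 128>;

    // Union the two collectives' storage so they share one shared-memory allocation
    __shared__ union
    {
        typename BlockShuffle::TempStorage shuffle;
        typename BlockReduce::TempStorage  reduce;
    } temp_storage;

    int item    = threadIdx.x;
    int shifted = item;
    BlockShuffle(temp_storage.shuffle).Offset(item, shifted, 1);

    __syncthreads();  // barrier before repurposing the shared allocation

    int sum = BlockReduce(temp_storage.reduce).Sum(shifted);
    if (threadIdx.x == 0)
    {
        *d_out = sum;  // block-wide aggregate is valid only in thread_0
    }
}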