cub::BlockRadixRank
Defined in cub/block/block_radix_rank.cuh
BlockRadixRank provides operations for ranking unsigned integer types within a CUDA thread block.
Overview
Keys must be in a form suitable for radix ranking (i.e., unsigned bits).
Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Performance Considerations
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.
#include <cub/cub.cuh> __global__ void ExampleKernel(...) { constexpr int block_threads = 2; constexpr int radix_bits = 5; // Specialize BlockRadixRank for a 1D block of 2 threads // Specialize BlockRadixRank for a 1D block of 2 threads using block_radix_rank = cub::BlockRadixRank<block_threads, radix_bits>; using storage_t = typename block_radix_rank::TempStorage; // Allocate shared memory for BlockRadixSort __shared__ storage_t temp_storage; // Obtain a segment of consecutive items that are blocked across threads int keys[2]; int ranks[2]; ... cub::BFEDigitExtractor<int> extractor(0, radix_bits); block_radix_rank(temp_storage).RankKeys(keys, ranks, extractor); ...
Suppose the set of input
keys
across the block of threads is{ [16,10], [9,11] }
. The corresponding outputranks
in those threads will be{ [3,1], [0,2] }
.- Template Parameters
BLOCK_DIM_X – The thread block length in threads along the X dimension
RADIX_BITS – The number of radix bits per digit place
IS_DESCENDING – Whether or not the sorted-order is high-to-low
MEMOIZE_OUTER_SCAN – [optional] Whether or not to buffer outer raking scan partials to incur fewer shared memory reads at the expense of higher register pressure (default: true for architectures SM35 and newer, false otherwise). See
BlockScanAlgorithm::BLOCK_SCAN_RAKING_MEMOIZE
for more details.INNER_SCAN_ALGORITHM – [optional] The cub::BlockScanAlgorithm algorithm to use (default: cub::BLOCK_SCAN_WARP_SCANS)
SMEM_CONFIG – [optional] Shared memory bank mode (default:
cudaSharedMemBankSizeFourByte
)BLOCK_DIM_Y – [optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z – [optional] The thread block length in threads along the Z dimension (default: 1)
LEGACY_PTX_ARCH – [optional] Unused.
Collective constructors
-
inline BlockRadixRank()
Collective constructor using a private static allocation of shared memory as temporary storage.
-
inline BlockRadixRank(TempStorage &temp_storage)
Collective constructor using the specified memory allocation as temporary storage.
- Parameters
temp_storage – [in] Reference to memory allocation having layout type TempStorage
Raking
-
template<typename UnsignedBits, int KEYS_PER_THREAD, typename DigitExtractorT>
inline void RankKeys(UnsignedBits (&keys)[KEYS_PER_THREAD], int (&ranks)[KEYS_PER_THREAD], DigitExtractorT digit_extractor) Rank keys.
- Parameters
keys – [in] Keys for this tile
ranks – [out] For each key, the local rank within the tile
digit_extractor – [in] The digit extractor
-
template<typename UnsignedBits, int KEYS_PER_THREAD, typename DigitExtractorT>
inline void RankKeys(UnsignedBits (&keys)[KEYS_PER_THREAD], int (&ranks)[KEYS_PER_THREAD], DigitExtractorT digit_extractor, int (&exclusive_digit_prefix)[BINS_TRACKED_PER_THREAD]) Rank keys.
For the lower
RADIX_DIGITS
threads, digit counts for each digit are provided for the corresponding thread.- Parameters
keys – [in] Keys for this tile
ranks – [out] For each key, the local rank within the tile (out parameter)
digit_extractor – [in] The digit extractor
exclusive_digit_prefix – [out] The exclusive prefix sum for the digits [(threadIdx.x * BINS_TRACKED_PER_THREAD) … (threadIdx.x * BINS_TRACKED_PER_THREAD) + BINS_TRACKED_PER_THREAD - 1]
Public Types
-
struct TempStorage : public Uninitialized<_TempStorage>
The operations exposed by BlockScan require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the
__shared__
keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) orunion
’d with other storage allocation types to facilitate memory reuse.