cub::BlockRadixRank#
BlockRadixRank provides operations for ranking unsigned integer types within a CUDA thread block.
Overview#
Keys must be in a form suitable for radix ranking (i.e., unsigned bits).
Important: BlockRadixRank ranks only
RadixBitsbits at a time from the keys, not the entire key. The digit extractor determines which bits are ranked.Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
Performance Considerations#
Efficiency is increased with increased granularity
ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoadfor efficiently gathering a blocked arrangement of elements across threads.
#include <cub/cub.cuh> __global__ void ExampleKernel(...) { constexpr int block_threads = 2; constexpr int radix_bits = 5; // Specialize BlockRadixRank for a 1D block of 2 threads using block_radix_rank = cub::BlockRadixRank<block_threads, radix_bits, false>; using storage_t = typename block_radix_rank::TempStorage; // Allocate shared memory for BlockRadixRank __shared__ storage_t temp_storage; // Obtain a segment of consecutive items that are blocked across threads unsigned int keys[2]; int ranks[2]; ... // Extract the lowest radix_bits from each key cub::BFEDigitExtractor<unsigned> extractor(0, radix_bits); block_radix_rank(temp_storage).RankKeys(keys, ranks, extractor); ... }
Suppose the set of input
keysacross the block of threads is{ [16,10], [9,11] }. The extractor will rank only the lowest 5 bits:{ [16,10], [9,11] }(bits 0-4). The corresponding outputranksin those threads will be{ [3,1], [0,2] }.- Template Parameters:
BlockDimX – The thread block length in threads along the X dimension
RadixBits – The number of radix bits per digit place
IsDescending – Whether or not the sorted-order is high-to-low
MemoizeOuterScan – [optional] Whether or not to buffer outer raking scan partials to incur fewer shared memory reads at the expense of higher register pressure (default: true for architectures SM35 and newer, false otherwise). See
BlockScanAlgorithm::BLOCK_SCAN_RAKING_MEMOIZEfor more details.InnerScanAlgorithm – [optional] The cub::BlockScanAlgorithm algorithm to use (default: cub::BLOCK_SCAN_WARP_SCANS)
SMemConfig – [optional] Shared memory bank mode (default:
cudaSharedMemBankSizeFourByte)BlockDimY – [optional] The thread block length in threads along the Y dimension (default: 1)
BlockDimZ – [optional] The thread block length in threads along the Z dimension (default: 1)
Collective constructors
-
inline BlockRadixRank()#
Collective constructor using a private static allocation of shared memory as temporary storage.
-
inline BlockRadixRank(TempStorage &temp_storage)#
Collective constructor using the specified memory allocation as temporary storage.
- Parameters:
temp_storage – [in] Reference to memory allocation having layout type TempStorage
Raking
end member group
-
template<typename UnsignedBits, int KEYS_PER_THREAD, typename DigitExtractorT>
inline void RankKeys( - UnsignedBits (&keys)[KEYS_PER_THREAD],
- int (&ranks)[KEYS_PER_THREAD],
- DigitExtractorT digit_extractor,
Rank keys.
- Parameters:
keys – [in] Keys for this tile
ranks – [out] For each key, the local rank within the tile
digit_extractor – [in] The digit extractor
-
template<typename UnsignedBits, int KEYS_PER_THREAD, typename DigitExtractorT>
inline void RankKeys( - UnsignedBits (&keys)[KEYS_PER_THREAD],
- int (&ranks)[KEYS_PER_THREAD],
- DigitExtractorT digit_extractor,
- int (&exclusive_digit_prefix)[BINS_TRACKED_PER_THREAD],
Rank keys.
For the lower
RADIX_DIGITSthreads, digit counts for each digit are provided for the corresponding thread.- Parameters:
keys – [in] Keys for this tile
ranks – [out] For each key, the local rank within the tile (out parameter)
digit_extractor – [in] The digit extractor
exclusive_digit_prefix – [out] The exclusive prefix sum for the digits [(threadIdx.x * BINS_TRACKED_PER_THREAD) … (threadIdx.x * BINS_TRACKED_PER_THREAD) + BINS_TRACKED_PER_THREAD - 1]
Public Types
-
struct TempStorage : public Uninitialized<_TempStorage>#
The operations exposed by BlockScan require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the
__shared__keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) orunion’d with other storage allocation types to facilitate memory reuse.