cub::BlockRadixRank

Defined in /home/runner/work/cccl/cccl/cub/cub/block/block_radix_rank.cuh

template<int BLOCK_DIM_X, int RADIX_BITS, bool IS_DESCENDING, bool MEMOIZE_OUTER_SCAN = true, BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int LEGACY_PTX_ARCH = 0>
class BlockRadixRank

BlockRadixRank provides operations for ranking unsigned integer types within a CUDA thread block.

Overview

  • Keys must be in a form suitable for radix ranking (i.e., unsigned bits).

  • Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where threadi owns the ith range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

Performance Considerations

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
  constexpr int block_threads = 2;
  constexpr int radix_bits = 5;

  // Specialize BlockRadixRank for a 1D block of 2 threads
  // Specialize BlockRadixRank for a 1D block of 2 threads
  using block_radix_rank = cub::BlockRadixRank<block_threads, radix_bits>;
  using storage_t = typename block_radix_rank::TempStorage;

  // Allocate shared memory for BlockRadixSort
  __shared__ storage_t temp_storage;

  // Obtain a segment of consecutive items that are blocked across threads
  int keys[2];
  int ranks[2];
  ...

  cub::BFEDigitExtractor<int> extractor(0, radix_bits);
  block_radix_rank(temp_storage).RankKeys(keys, ranks, extractor);

  ...

Suppose the set of input keys across the block of threads is { [16,10], [9,11] }. The corresponding output ranks in those threads will be { [3,1], [0,2] }.

Re-using dynamically allocating shared memory

The block/example_block_reduce_dyn_smem.cu example illustrates usage of dynamically shared memory with BlockReduce and how to re-purpose the same memory region. This example can be easily adapted to the storage required by BlockRadixRank.

Template Parameters
  • BLOCK_DIM_X – The thread block length in threads along the X dimension

  • RADIX_BITS – The number of radix bits per digit place

  • IS_DESCENDING – Whether or not the sorted-order is high-to-low

  • MEMOIZE_OUTER_SCAN[optional] Whether or not to buffer outer raking scan partials to incur fewer shared memory reads at the expense of higher register pressure (default: true for architectures SM35 and newer, false otherwise). See BlockScanAlgorithm::BLOCK_SCAN_RAKING_MEMOIZE for more details.

  • INNER_SCAN_ALGORITHM[optional] The cub::BlockScanAlgorithm algorithm to use (default: cub::BLOCK_SCAN_WARP_SCANS)

  • SMEM_CONFIG[optional] Shared memory bank mode (default: cudaSharedMemBankSizeFourByte)

  • BLOCK_DIM_Y[optional] The thread block length in threads along the Y dimension (default: 1)

  • BLOCK_DIM_Z[optional] The thread block length in threads along the Z dimension (default: 1)

  • LEGACY_PTX_ARCH[optional] Unused.

Collective constructors

inline BlockRadixRank()

Collective constructor using a private static allocation of shared memory as temporary storage.

inline BlockRadixRank(TempStorage &temp_storage)

Collective constructor using the specified memory allocation as temporary storage.

Parameters

temp_storage[in] Reference to memory allocation having layout type TempStorage

Raking

template<typename UnsignedBits, int KEYS_PER_THREAD, typename DigitExtractorT>
inline void RankKeys(UnsignedBits (&keys)[KEYS_PER_THREAD], int (&ranks)[KEYS_PER_THREAD], DigitExtractorT digit_extractor)

Rank keys.

Parameters
  • keys[in] Keys for this tile

  • ranks[out] For each key, the local rank within the tile

  • digit_extractor[in] The digit extractor

template<typename UnsignedBits, int KEYS_PER_THREAD, typename DigitExtractorT>
inline void RankKeys(UnsignedBits (&keys)[KEYS_PER_THREAD], int (&ranks)[KEYS_PER_THREAD], DigitExtractorT digit_extractor, int (&exclusive_digit_prefix)[BINS_TRACKED_PER_THREAD])

Rank keys.

For the lower RADIX_DIGITS threads, digit counts for each digit are provided for the corresponding thread.

Parameters
  • keys[in] Keys for this tile

  • ranks[out] For each key, the local rank within the tile (out parameter)

  • digit_extractor[in] The digit extractor

  • exclusive_digit_prefix[out] The exclusive prefix sum for the digits [(threadIdx.x * BINS_TRACKED_PER_THREAD) … (threadIdx.x * BINS_TRACKED_PER_THREAD) + BINS_TRACKED_PER_THREAD - 1]

Public Types

enum [anonymous]

Values:

enumerator BINS_TRACKED_PER_THREAD

Number of bin-starting offsets tracked per thread.

struct TempStorage : public Uninitialized<_TempStorage>

The operations exposed by BlockScan require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union’d with other storage allocation types to facilitate memory reuse.