cub::BlockRunLengthDecode

Defined in /home/runner/work/cccl/cccl/cub/cub/block/block_run_length_decode.cuh

template<typename ItemT, int BLOCK_DIM_X, int RUNS_PER_THREAD, int DECODED_ITEMS_PER_THREAD, typename DecodedOffsetT = uint32_t, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1>
class BlockRunLengthDecode

The BlockRunLengthDecode class supports decoding a run-length encoded array of items. That is, given the two arrays run_value[N] and run_lengths[N], run_value[i] is repeated run_lengths[i] many times in the output array. Due to the nature of the run-length decoding algorithm (“decompression”), the output size of the run-length decoded array is runtime-dependent and potentially without any upper bound. To address this, BlockRunLengthDecode allows retrieving a “window” from the run-length decoded array. The window’s offset can be specified and BLOCK_THREADS * DECODED_ITEMS_PER_THREAD (i.e., referred to as window_size) decoded items from the specified window will be returned.

__global__ void ExampleKernel(...)
{
  // Specialising BlockRunLengthDecode to run-length decode items of type uint64_t
  using RunItemT = uint64_t;
  // Type large enough to index into the run-length decoded array
  using RunLengthT = uint32_t;

  // Specialising BlockRunLengthDecode for a 1D block of 128 threads
  constexpr int BLOCK_DIM_X = 128;
  // Specialising BlockRunLengthDecode to have each thread contribute 2 run-length encoded runs
  constexpr int RUNS_PER_THREAD = 2;
  // Specialising BlockRunLengthDecode to have each thread hold 4 run-length decoded items
  constexpr int DECODED_ITEMS_PER_THREAD = 4;

  // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer items each
  using BlockRunLengthDecodeT =
    cub::BlockRunLengthDecode<RunItemT, BLOCK_DIM_X, RUNS_PER_THREAD, DECODED_ITEMS_PER_THREAD>;

  // Allocate shared memory for BlockRunLengthDecode
  __shared__ typename BlockRunLengthDecodeT::TempStorage temp_storage;

  // The run-length encoded items and how often they shall be repeated in the run-length decoded output
  RunItemT run_values[RUNS_PER_THREAD];
  RunLengthT run_lengths[RUNS_PER_THREAD];
  ...

  // Initialize the BlockRunLengthDecode with the runs that we want to run-length decode
  uint32_t total_decoded_size = 0;
  BlockRunLengthDecodeT block_rld(temp_storage, run_values, run_lengths, total_decoded_size);

  // Run-length decode ("decompress") the runs into a window buffer of limited size. This is repeated until all
  runs
  // have been decoded.
  uint32_t decoded_window_offset = 0U;
  while (decoded_window_offset < total_decoded_size)
  {
    RunLengthT relative_offsets[DECODED_ITEMS_PER_THREAD];
    RunItemT decoded_items[DECODED_ITEMS_PER_THREAD];

    // The number of decoded items that are valid within this window (aka pass) of run-length decoding
    uint32_t num_valid_items = total_decoded_size - decoded_window_offset;
    block_rld.RunLengthDecode(decoded_items, relative_offsets, decoded_window_offset);

    decoded_window_offset += BLOCK_DIM_X * DECODED_ITEMS_PER_THREAD;

    ...
  }
}

Suppose the set of input run_values across the block of threads is { [0, 1], [2, 3], [4, 5], [6, 7], ..., [254, 255] } and run_lengths is { [1, 2], [3, 4], [5, 1], [2, 3], ..., [5, 1] }. The corresponding output decoded_items in those threads will be { [0, 1, 1, 2], [2, 2, 3, 3], [3, 3, 4, 4], [4, 4, 4, 5], ..., [169, 169, 170, 171] } and relative_offsets will be { [0, 0, 1, 0], [1, 2, 0, 1], [2, 3, 0, 1], [2, 3, 4, 0], ..., [3, 4, 0, 0] } during the first iteration of the while loop.

Note

Trailing runs of length 0 are supported (i.e., they may only appear at the end of the run_lengths array). A run of length zero may not be followed by a run length that is not zero.

Template Parameters
  • ItemT – The data type of the items being run-length decoded

  • BLOCK_DIM_X – The thread block length in threads along the X dimension

  • RUNS_PER_THREAD – The number of consecutive runs that each thread contributes

  • DECODED_ITEMS_PER_THREAD – The maximum number of decoded items that each thread holds

  • DecodedOffsetT – Type used to index into the block’s decoded items (large enough to hold the sum over all the runs’ lengths)

  • BLOCK_DIM_Y – The thread block length in threads along the Y dimension

  • BLOCK_DIM_Z – The thread block length in threads along the Z dimension

Public Functions

template<typename RunLengthT, typename TotalDecodedSizeT>
inline BlockRunLengthDecode(TempStorage &temp_storage, ItemT (&run_values)[RUNS_PER_THREAD], RunLengthT (&run_lengths)[RUNS_PER_THREAD], TotalDecodedSizeT &total_decoded_size)

Constructor specialised for user-provided temporary storage, initializing using the runs’ lengths. The algorithm’s temporary storage may not be repurposed between the constructor call and subsequent RunLengthDecode calls.

template<typename UserRunOffsetT>
inline BlockRunLengthDecode(TempStorage &temp_storage, ItemT (&run_values)[RUNS_PER_THREAD], UserRunOffsetT (&run_offsets)[RUNS_PER_THREAD])

Constructor specialised for user-provided temporary storage, initializing using the runs’ offsets. The algorithm’s temporary storage may not be repurposed between the constructor call and subsequent RunLengthDecode calls.

template<typename RunLengthT, typename TotalDecodedSizeT>
inline BlockRunLengthDecode(ItemT (&run_values)[RUNS_PER_THREAD], RunLengthT (&run_lengths)[RUNS_PER_THREAD], TotalDecodedSizeT &total_decoded_size)

Constructor specialised for static temporary storage, initializing using the runs’ lengths.

template<typename UserRunOffsetT>
inline BlockRunLengthDecode(ItemT (&run_values)[RUNS_PER_THREAD], UserRunOffsetT (&run_offsets)[RUNS_PER_THREAD])

Constructor specialised for static temporary storage, initializing using the runs’ offsets.

template<typename RelativeOffsetT>
inline void RunLengthDecode(ItemT (&decoded_items)[DECODED_ITEMS_PER_THREAD], RelativeOffsetT (&item_offsets)[DECODED_ITEMS_PER_THREAD], DecodedOffsetT from_decoded_offset = 0)

Run-length decodes the runs previously passed via a call to Init(…) and returns the run-length decoded items in a blocked arrangement to decoded_items.

If the number of run-length decoded items exceeds the run-length decode buffer (i.e., DECODED_ITEMS_PER_THREAD * BLOCK_THREADS), only the items that fit within the buffer are returned. Subsequent calls to RunLengthDecode adjusting from_decoded_offset can be used to retrieve the remaining run-length decoded items. Calling __syncthreads() between any two calls to RunLengthDecode is not required. item_offsets can be used to retrieve each run-length decoded item’s relative index within its run. E.g., the run-length encoded array of 3, 1, 4 with the respective run lengths of 2, 1, 3 would yield the run-length decoded array of 3, 3, 1, 4, 4, 4 with the relative offsets of 0, 1, 0, 0, 1, 2. A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Parameters
  • decoded_items[out] The run-length decoded items to be returned in a blocked arrangement

  • item_offsets[out] The run-length decoded items’ relative offset within the run they belong to

  • from_decoded_offset[in] If invoked with from_decoded_offset that is larger than total_decoded_size results in undefined behavior.

inline void RunLengthDecode(ItemT (&decoded_items)[DECODED_ITEMS_PER_THREAD], DecodedOffsetT from_decoded_offset = 0)

Run-length decodes the runs previously passed via a call to Init(…) and returns the run-length decoded items in a blocked arrangement to decoded_items.

If the number of run-length decoded items exceeds the run-length decode buffer (i.e., DECODED_ITEMS_PER_THREAD * BLOCK_THREADS), only the items that fit within the buffer are returned. Subsequent calls to RunLengthDecode adjusting from_decoded_offset can be used to retrieve the remaining run-length decoded items. Calling __syncthreads() between any two calls to RunLengthDecode is not required.

Parameters
  • decoded_items[out] The run-length decoded items to be returned in a blocked arrangement

  • from_decoded_offset[in] If invoked with from_decoded_offset that is larger than total_decoded_size results in undefined behavior.

struct TempStorage : public Uninitialized<_TempStorage>