cub::BlockRunLengthDecode
Defined in cub/block/block_run_length_decode.cuh
-
template<typename ItemT, int BLOCK_DIM_X, int RUNS_PER_THREAD, int DECODED_ITEMS_PER_THREAD, typename DecodedOffsetT = uint32_t, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1>
class BlockRunLengthDecode The BlockRunLengthDecode class supports decoding a run-length encoded array of items. That is, given the two arrays
run_value[N]
andrun_lengths[N]
,run_value[i]
is repeatedrun_lengths[i]
many times in the output array. Due to the nature of the run-length decoding algorithm (“decompression”), the output size of the run-length decoded array is runtime-dependent and potentially without any upper bound. To address this, BlockRunLengthDecode allows retrieving a “window” from the run-length decoded array. The window’s offset can be specified and BLOCK_THREADS * DECODED_ITEMS_PER_THREAD (i.e., referred to as window_size) decoded items from the specified window will be returned.__global__ void ExampleKernel(...) { // Specialising BlockRunLengthDecode to run-length decode items of type uint64_t using RunItemT = uint64_t; // Type large enough to index into the run-length decoded array using RunLengthT = uint32_t; // Specialising BlockRunLengthDecode for a 1D block of 128 threads constexpr int BLOCK_DIM_X = 128; // Specialising BlockRunLengthDecode to have each thread contribute 2 run-length encoded runs constexpr int RUNS_PER_THREAD = 2; // Specialising BlockRunLengthDecode to have each thread hold 4 run-length decoded items constexpr int DECODED_ITEMS_PER_THREAD = 4; // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer items each using BlockRunLengthDecodeT = cub::BlockRunLengthDecode<RunItemT, BLOCK_DIM_X, RUNS_PER_THREAD, DECODED_ITEMS_PER_THREAD>; // Allocate shared memory for BlockRunLengthDecode __shared__ typename BlockRunLengthDecodeT::TempStorage temp_storage; // The run-length encoded items and how often they shall be repeated in the run-length decoded output RunItemT run_values[RUNS_PER_THREAD]; RunLengthT run_lengths[RUNS_PER_THREAD]; ... // Initialize the BlockRunLengthDecode with the runs that we want to run-length decode uint32_t total_decoded_size = 0; BlockRunLengthDecodeT block_rld(temp_storage, run_values, run_lengths, total_decoded_size); // Run-length decode ("decompress") the runs into a window buffer of limited size. This is repeated until all runs // have been decoded. uint32_t decoded_window_offset = 0U; while (decoded_window_offset < total_decoded_size) { RunLengthT relative_offsets[DECODED_ITEMS_PER_THREAD]; RunItemT decoded_items[DECODED_ITEMS_PER_THREAD]; // The number of decoded items that are valid within this window (aka pass) of run-length decoding uint32_t num_valid_items = total_decoded_size - decoded_window_offset; block_rld.RunLengthDecode(decoded_items, relative_offsets, decoded_window_offset); decoded_window_offset += BLOCK_DIM_X * DECODED_ITEMS_PER_THREAD; ... } }
Suppose the set of input
run_values
across the block of threads is{ [0, 1], [2, 3], [4, 5], [6, 7], ..., [254, 255] }
andrun_lengths
is{ [1, 2], [3, 4], [5, 1], [2, 3], ..., [5, 1] }
. The corresponding outputdecoded_items
in those threads will be{ [0, 1, 1, 2], [2, 2, 3, 3], [3, 3, 4, 4], [4, 4, 4, 5], ..., [169, 169, 170, 171] }
andrelative_offsets
will be{ [0, 0, 1, 0], [1, 2, 0, 1], [2, 3, 0, 1], [2, 3, 4, 0], ..., [3, 4, 0, 0] }
during the first iteration of the while loop.Note
Trailing runs of length 0 are supported (i.e., they may only appear at the end of the run_lengths array). A run of length zero may not be followed by a run length that is not zero.
- Template Parameters
ItemT – The data type of the items being run-length decoded
BLOCK_DIM_X – The thread block length in threads along the X dimension
RUNS_PER_THREAD – The number of consecutive runs that each thread contributes
DECODED_ITEMS_PER_THREAD – The maximum number of decoded items that each thread holds
DecodedOffsetT – Type used to index into the block’s decoded items (large enough to hold the sum over all the runs’ lengths)
BLOCK_DIM_Y – The thread block length in threads along the Y dimension
BLOCK_DIM_Z – The thread block length in threads along the Z dimension
Public Functions
-
template<typename RunLengthT, typename TotalDecodedSizeT>
inline BlockRunLengthDecode(TempStorage &temp_storage, ItemT (&run_values)[RUNS_PER_THREAD], RunLengthT (&run_lengths)[RUNS_PER_THREAD], TotalDecodedSizeT &total_decoded_size) Constructor specialised for user-provided temporary storage, initializing using the runs’ lengths. The algorithm’s temporary storage may not be repurposed between the constructor call and subsequent
RunLengthDecode
calls.
-
template<typename UserRunOffsetT>
inline BlockRunLengthDecode(TempStorage &temp_storage, ItemT (&run_values)[RUNS_PER_THREAD], UserRunOffsetT (&run_offsets)[RUNS_PER_THREAD]) Constructor specialised for user-provided temporary storage, initializing using the runs’ offsets. The algorithm’s temporary storage may not be repurposed between the constructor call and subsequent
RunLengthDecode
calls.
-
template<typename RunLengthT, typename TotalDecodedSizeT>
inline BlockRunLengthDecode(ItemT (&run_values)[RUNS_PER_THREAD], RunLengthT (&run_lengths)[RUNS_PER_THREAD], TotalDecodedSizeT &total_decoded_size) Constructor specialised for static temporary storage, initializing using the runs’ lengths.
-
template<typename UserRunOffsetT>
inline BlockRunLengthDecode(ItemT (&run_values)[RUNS_PER_THREAD], UserRunOffsetT (&run_offsets)[RUNS_PER_THREAD]) Constructor specialised for static temporary storage, initializing using the runs’ offsets.
-
template<typename RelativeOffsetT>
inline void RunLengthDecode(ItemT (&decoded_items)[DECODED_ITEMS_PER_THREAD], RelativeOffsetT (&item_offsets)[DECODED_ITEMS_PER_THREAD], DecodedOffsetT from_decoded_offset = 0) Run-length decodes the runs previously passed via a call to Init(…) and returns the run-length decoded items in a blocked arrangement to
decoded_items
.If the number of run-length decoded items exceeds the run-length decode buffer (i.e.,
DECODED_ITEMS_PER_THREAD * BLOCK_THREADS
), only the items that fit within the buffer are returned. Subsequent calls toRunLengthDecode
adjustingfrom_decoded_offset
can be used to retrieve the remaining run-length decoded items. Calling __syncthreads() between any two calls toRunLengthDecode
is not required.item_offsets
can be used to retrieve each run-length decoded item’s relative index within its run. E.g., the run-length encoded array of3, 1, 4
with the respective run lengths of2, 1, 3
would yield the run-length decoded array of3, 3, 1, 4, 4, 4
with the relative offsets of0, 1, 0, 0, 1, 2
. A subsequent__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.- Parameters
decoded_items – [out] The run-length decoded items to be returned in a blocked arrangement
item_offsets – [out] The run-length decoded items’ relative offset within the run they belong to
from_decoded_offset – [in] If invoked with from_decoded_offset that is larger than total_decoded_size results in undefined behavior.
-
inline void RunLengthDecode(ItemT (&decoded_items)[DECODED_ITEMS_PER_THREAD], DecodedOffsetT from_decoded_offset = 0)
Run-length decodes the runs previously passed via a call to Init(…) and returns the run-length decoded items in a blocked arrangement to
decoded_items
.If the number of run-length decoded items exceeds the run-length decode buffer (i.e.,
DECODED_ITEMS_PER_THREAD * BLOCK_THREADS
), only the items that fit within the buffer are returned. Subsequent calls toRunLengthDecode
adjustingfrom_decoded_offset
can be used to retrieve the remaining run-length decoded items. Calling __syncthreads() between any two calls toRunLengthDecode
is not required.- Parameters
decoded_items – [out] The run-length decoded items to be returned in a blocked arrangement
from_decoded_offset – [in] If invoked with from_decoded_offset that is larger than total_decoded_size results in undefined behavior.
-
struct TempStorage : public Uninitialized<_TempStorage>