cub::DeviceRunLengthEncode#
-
struct DeviceRunLengthEncode#
DeviceRunLengthEncode provides device-wide, parallel operations for demarcating “runs” of same-valued items within a sequence residing within device-accessible memory.
Overview#
A run-length encoding computes a simple compressed representation of a sequence of input elements such that each maximal “run” of consecutive same-valued data items is encoded as a single data value along with a count of the elements in that run.
Usage Considerations#
Dynamic parallelism. DeviceRunLengthEncode methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
Performance#
The work-complexity of run-length encode as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.
Public Static Functions
-
template<typename InputIteratorT, typename UniqueOutputIteratorT, typename LengthsOutputIteratorT, typename NumRunsOutputIteratorT, typename NumItemsT>
static inline cudaError_t Encode( - void *d_temp_storage,
- size_t &temp_storage_bytes,
- InputIteratorT d_in,
- UniqueOutputIteratorT d_unique_out,
- LengthsOutputIteratorT d_counts_out,
- NumRunsOutputIteratorT d_num_runs_out,
- NumItemsT num_items,
- cudaStream_t stream = 0,
Computes a run-length encoding of the sequence
d_in
.For the ith run encountered, the first key of the run and its length are written to
d_unique_out[i]
andd_counts_out[i]
, respectively.The total number of runs encountered is written to
d_num_runs_out
.The
==
equality operator is used to determine whether values are equivalentIn-place operations are not supported. There must be no overlap between any of the provided ranges:
[d_unique_out, d_unique_out + *d_num_runs_out)
[d_counts_out, d_counts_out + *d_num_runs_out)
[d_num_runs_out, d_num_runs_out + 1)
[d_in, d_in + num_items)
When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet#
The code snippet below illustrates the run-length encoding of a sequence of
int
values.#include <cub/cub.cuh> // or equivalently <cub/device/device_run_length_encode.cuh> // Declare, allocate, and initialize device-accessible pointers for // input and output int num_items; // e.g., 8 int *d_in; // e.g., [0, 2, 2, 9, 5, 5, 5, 8] int *d_unique_out; // e.g., [ , , , , , , , ] int *d_counts_out; // e.g., [ , , , , , , , ] int *d_num_runs_out; // e.g., [ ] ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceRunLengthEncode::Encode( d_temp_storage, temp_storage_bytes, d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run encoding cub::DeviceRunLengthEncode::Encode( d_temp_storage, temp_storage_bytes, d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items); // d_unique_out <-- [0, 2, 9, 5, 8] // d_counts_out <-- [1, 2, 1, 3, 1] // d_num_runs_out <-- [5]
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
UniqueOutputIteratorT – [inferred] Random-access output iterator type for writing unique output items (may be a simple pointer type)
LengthsOutputIteratorT – [inferred] Random-access output iterator type for writing output counts (may be a simple pointer type)
NumRunsOutputIteratorT – [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Pointer to the input sequence of keys
d_unique_out – [out] Pointer to the output sequence of unique keys (one key per run)
d_counts_out – [out] Pointer to the output sequence of run-lengths (one count per run)
d_num_runs_out – [out] Pointer to total number of runs
num_items – [in] Total number of associated key+value pairs (i.e., the length of
d_in_keys
andd_in_values
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OffsetsOutputIteratorT, typename LengthsOutputIteratorT, typename NumRunsOutputIteratorT, typename NumItemsT>
static inline cudaError_t NonTrivialRuns( - void *d_temp_storage,
- size_t &temp_storage_bytes,
- InputIteratorT d_in,
- OffsetsOutputIteratorT d_offsets_out,
- LengthsOutputIteratorT d_lengths_out,
- NumRunsOutputIteratorT d_num_runs_out,
- NumItemsT num_items,
- cudaStream_t stream = 0,
Enumerates the starting offsets and lengths of all non-trivial runs (of
length > 1
) of same-valued keys in the sequenced_in
.For the ith non-trivial run, the run’s starting offset and its length are written to
d_offsets_out[i]
andd_lengths_out[i]
, respectively.The total number of runs encountered is written to
d_num_runs_out
.The
==
equality operator is used to determine whether values are equivalentIn-place operations are not supported. There must be no overlap between any of the provided ranges:
[d_offsets_out, d_offsets_out + *d_num_runs_out)
[d_lengths_out, d_lengths_out + *d_num_runs_out)
[d_num_runs_out, d_num_runs_out + 1)
[d_in, d_in + num_items)
When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet#
The code snippet below illustrates the identification of non-trivial runs within a sequence of
int
values.#include <cub/cub.cuh> // or equivalently <cub/device/device_run_length_encode.cuh> // Declare, allocate, and initialize device-accessible pointers // for input and output int num_items; // e.g., 8 int *d_in; // e.g., [0, 2, 2, 9, 5, 5, 5, 8] int *d_offsets_out; // e.g., [ , , , , , , , ] int *d_lengths_out; // e.g., [ , , , , , , , ] int *d_num_runs_out; // e.g., [ ] ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceRunLengthEncode::NonTrivialRuns( d_temp_storage, temp_storage_bytes, d_in, d_offsets_out, d_lengths_out, d_num_runs_out, num_items); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run encoding cub::DeviceRunLengthEncode::NonTrivialRuns( d_temp_storage, temp_storage_bytes, d_in, d_offsets_out, d_lengths_out, d_num_runs_out, num_items); // d_offsets_out <-- [1, 4] // d_lengths_out <-- [2, 3] // d_num_runs_out <-- [2]
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OffsetsOutputIteratorT – [inferred] Random-access output iterator type for writing run-offset values (may be a simple pointer type)
LengthsOutputIteratorT – [inferred] Random-access output iterator type for writing run-length values (may be a simple pointer type)
NumRunsOutputIteratorT – [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Pointer to input sequence of data items
d_offsets_out – [out] Pointer to output sequence of run-offsets (one offset per non-trivial run)
d_lengths_out – [out] Pointer to output sequence of run-lengths (one count per non-trivial run)
d_num_runs_out – [out] Pointer to total number of runs (i.e., length of
d_offsets_out
)num_items – [in] Total number of associated key+value pairs (i.e., the length of
d_in_keys
andd_in_values
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.