cub::DeviceRunLengthEncode
Defined in cub/device/device_run_length_encode.cuh
-
struct DeviceRunLengthEncode
DeviceRunLengthEncode provides device-wide, parallel operations for demarcating “runs” of same-valued items within a sequence residing within device-accessible memory.
Overview
A run-length encoding computes a simple compressed representation of a sequence of input elements such that each maximal “run” of consecutive same-valued data items is encoded as a single data value along with a count of the elements in that run.
Usage Considerations
Dynamic parallelism. DeviceRunLengthEncode methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
Performance
The work-complexity of run-length encode as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.
Public Static Functions
-
template<typename InputIteratorT, typename UniqueOutputIteratorT, typename LengthsOutputIteratorT, typename NumRunsOutputIteratorT>
static inline cudaError_t Encode(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, UniqueOutputIteratorT d_unique_out, LengthsOutputIteratorT d_counts_out, NumRunsOutputIteratorT d_num_runs_out, int num_items, cudaStream_t stream = 0) Computes a run-length encoding of the sequence
d_in
.For the ith run encountered, the first key of the run and its length are written to
d_unique_out[i]
andd_counts_out[i]
, respectively.The total number of runs encountered is written to
d_num_runs_out
.The
==
equality operator is used to determine whether values are equivalentIn-place operations are not supported. There must be no overlap between any of the provided ranges:
[d_unique_out, d_unique_out + *d_num_runs_out)
[d_counts_out, d_counts_out + *d_num_runs_out)
[d_num_runs_out, d_num_runs_out + 1)
[d_in, d_in + num_items)
When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the run-length encoding of a sequence of
int
values.#include <cub/cub.cuh> // or equivalently <cub/device/device_run_length_encode.cuh> // Declare, allocate, and initialize device-accessible pointers for // input and output int num_items; // e.g., 8 int *d_in; // e.g., [0, 2, 2, 9, 5, 5, 5, 8] int *d_unique_out; // e.g., [ , , , , , , , ] int *d_counts_out; // e.g., [ , , , , , , , ] int *d_num_runs_out; // e.g., [ ] ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceRunLengthEncode::Encode( d_temp_storage, temp_storage_bytes, d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run encoding cub::DeviceRunLengthEncode::Encode( d_temp_storage, temp_storage_bytes, d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items); // d_unique_out <-- [0, 2, 9, 5, 8] // d_counts_out <-- [1, 2, 1, 3, 1] // d_num_runs_out <-- [5]
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
UniqueOutputIteratorT – [inferred] Random-access output iterator type for writing unique output items (may be a simple pointer type)
LengthsOutputIteratorT – [inferred] Random-access output iterator type for writing output counts (may be a simple pointer type)
NumRunsOutputIteratorT – [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Pointer to the input sequence of keys
d_unique_out – [out] Pointer to the output sequence of unique keys (one key per run)
d_counts_out – [out] Pointer to the output sequence of run-lengths (one count per run)
d_num_runs_out – [out] Pointer to total number of runs
num_items – [in] Total number of associated key+value pairs (i.e., the length of
d_in_keys
andd_in_values
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OffsetsOutputIteratorT, typename LengthsOutputIteratorT, typename NumRunsOutputIteratorT>
static inline cudaError_t NonTrivialRuns(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OffsetsOutputIteratorT d_offsets_out, LengthsOutputIteratorT d_lengths_out, NumRunsOutputIteratorT d_num_runs_out, int num_items, cudaStream_t stream = 0) Enumerates the starting offsets and lengths of all non-trivial runs (of
length > 1
) of same-valued keys in the sequenced_in
.For the ith non-trivial run, the run’s starting offset and its length are written to
d_offsets_out[i]
andd_lengths_out[i]
, respectively.The total number of runs encountered is written to
d_num_runs_out
.The
==
equality operator is used to determine whether values are equivalentIn-place operations are not supported. There must be no overlap between any of the provided ranges:
[d_offsets_out, d_offsets_out + *d_num_runs_out)
[d_lengths_out, d_lengths_out + *d_num_runs_out)
[d_num_runs_out, d_num_runs_out + 1)
[d_in, d_in + num_items)
When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the identification of non-trivial runs within a sequence of
int
values.#include <cub/cub.cuh> // or equivalently <cub/device/device_run_length_encode.cuh> // Declare, allocate, and initialize device-accessible pointers // for input and output int num_items; // e.g., 8 int *d_in; // e.g., [0, 2, 2, 9, 5, 5, 5, 8] int *d_offsets_out; // e.g., [ , , , , , , , ] int *d_lengths_out; // e.g., [ , , , , , , , ] int *d_num_runs_out; // e.g., [ ] ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceRunLengthEncode::NonTrivialRuns( d_temp_storage, temp_storage_bytes, d_in, d_offsets_out, d_lengths_out, d_num_runs_out, num_items); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run encoding cub::DeviceRunLengthEncode::NonTrivialRuns( d_temp_storage, temp_storage_bytes, d_in, d_offsets_out, d_lengths_out, d_num_runs_out, num_items); // d_offsets_out <-- [1, 4] // d_lengths_out <-- [2, 3] // d_num_runs_out <-- [2]
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OffsetsOutputIteratorT – [inferred] Random-access output iterator type for writing run-offset values (may be a simple pointer type)
LengthsOutputIteratorT – [inferred] Random-access output iterator type for writing run-length values (may be a simple pointer type)
NumRunsOutputIteratorT – [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Pointer to input sequence of data items
d_offsets_out – [out] Pointer to output sequence of run-offsets (one offset per non-trivial run)
d_lengths_out – [out] Pointer to output sequence of run-lengths (one count per non-trivial run)
d_num_runs_out – [out] Pointer to total number of runs (i.e., length of
d_offsets_out
)num_items – [in] Total number of associated key+value pairs (i.e., the length of
d_in_keys
andd_in_values
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.