cub::DeviceRunLengthEncode#
-
struct DeviceRunLengthEncode#
DeviceRunLengthEncode provides device-wide, parallel operations for demarcating “runs” of same-valued items within a sequence residing within device-accessible memory.
Overview#
A run-length encoding computes a simple compressed representation of a sequence of input elements such that each maximal “run” of consecutive same-valued data items is encoded as a single data value along with a count of the elements in that run.
Usage Considerations#
Dynamic parallelism. DeviceRunLengthEncode methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
Performance#
The work-complexity of run-length encode as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.
Public Static Functions
-
template<typename InputIteratorT, typename UniqueOutputIteratorT, typename LengthsOutputIteratorT, typename NumRunsOutputIteratorT, typename NumItemsT>
static inline cudaError_t Encode( - void *d_temp_storage,
- size_t &temp_storage_bytes,
- InputIteratorT d_in,
- UniqueOutputIteratorT d_unique_out,
- LengthsOutputIteratorT d_counts_out,
- NumRunsOutputIteratorT d_num_runs_out,
- NumItemsT num_items,
- cudaStream_t stream = 0
Computes a run-length encoding of the sequence
d_in.Added in version 2.2.0: First appears in CUDA Toolkit 12.3.
For the ith run encountered, the first key of the run and its length are written to
d_unique_out[i]andd_counts_out[i], respectively.The total number of runs encountered is written to
d_num_runs_out.The
==equality operator is used to determine whether values are equivalentIn-place operations are not supported. There must be no overlap between any of the provided ranges:
[d_unique_out, d_unique_out + *d_num_runs_out)[d_counts_out, d_counts_out + *d_num_runs_out)[d_num_runs_out, d_num_runs_out + 1)[d_in, d_in + num_items)
When
d_temp_storageisnullptr, no work is done and the required allocation size is returned intemp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet#
The code snippet below illustrates the run-length encoding of a sequence of
intvalues.#include <cub/cub.cuh> // or equivalently <cub/device/device_run_length_encode.cuh> // Declare, allocate, and initialize device-accessible pointers for // input and output int num_items; // e.g., 8 int *d_in; // e.g., [0, 2, 2, 9, 5, 5, 5, 8] int *d_unique_out; // e.g., [ , , , , , , , ] int *d_counts_out; // e.g., [ , , , , , , , ] int *d_num_runs_out; // e.g., [ ] ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceRunLengthEncode::Encode( d_temp_storage, temp_storage_bytes, d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run encoding cub::DeviceRunLengthEncode::Encode( d_temp_storage, temp_storage_bytes, d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items); // d_unique_out <-- [0, 2, 9, 5, 8] // d_counts_out <-- [1, 2, 1, 3, 1] // d_num_runs_out <-- [5]
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
UniqueOutputIteratorT – [inferred] Random-access output iterator type for writing unique output items (may be a simple pointer type)
LengthsOutputIteratorT – [inferred] Random-access output iterator type for writing output counts (may be a simple pointer type)
NumRunsOutputIteratorT – [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr, the required allocation size is written totemp_storage_bytesand no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storageallocationd_in – [in] Pointer to the input sequence of keys
d_unique_out – [out] Pointer to the output sequence of unique keys (one key per run)
d_counts_out – [out] Pointer to the output sequence of run-lengths (one count per run)
d_num_runs_out – [out] Pointer to total number of runs
num_items – [in] Total number of associated key+value pairs (i.e., the length of
d_in_keysandd_in_values)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename UniqueOutputIteratorT, typename LengthsOutputIteratorT, typename NumRunsOutputIteratorT, typename NumItemsT, typename EnvT = ::cuda::std::execution::env<>, ::cuda::std::enable_if_t<!::cuda::std::is_same_v<InputIteratorT, void*>, int> = 0>
static inline cudaError_t Encode( - InputIteratorT d_in,
- UniqueOutputIteratorT d_unique_out,
- LengthsOutputIteratorT d_counts_out,
- NumRunsOutputIteratorT d_num_runs_out,
- NumItemsT num_items,
- EnvT env = {}
Computes a run-length encoding of the sequence
d_in.Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
This is an environment-based API that allows customization of:
Stream: Query via
cuda::get_streamMemory resource: Query via
cuda::mr::get_memory_resourceFor the ith run encountered, the first key of the run and its length are written to
d_unique_out[i]andd_counts_out[i], respectively.The total number of runs encountered is written to
d_num_runs_out.The
==equality operator is used to determine whether values are equivalentIn-place operations are not supported. There must be no overlap between any of the provided ranges:
[d_unique_out, d_unique_out + *d_num_runs_out)[d_counts_out, d_counts_out + *d_num_runs_out)[d_num_runs_out, d_num_runs_out + 1)[d_in, d_in + num_items)
Snippet#
The code snippet below illustrates the run-length encoding of a sequence of
intvalues using environment-based API:auto input = thrust::device_vector<int>{0, 2, 2, 9, 5, 5, 5, 8}; auto unique_out = thrust::device_vector<int>(8); auto counts_out = thrust::device_vector<int>(8); auto num_runs_out = thrust::device_vector<int>(1); cuda::stream stream{cuda::devices[0]}; cuda::stream_ref stream_ref{stream}; auto env = cuda::std::execution::env{stream_ref}; auto error = cub::DeviceRunLengthEncode::Encode( input.begin(), unique_out.begin(), counts_out.begin(), num_runs_out.begin(), input.size(), env); if (error != cudaSuccess) { std::cerr << "cub::DeviceRunLengthEncode::Encode failed with status: " << error << std::endl; } thrust::device_vector<int> expected_unique{0, 2, 9, 5, 8}; thrust::device_vector<int> expected_counts{1, 2, 1, 3, 1}; thrust::device_vector<int> expected_num_runs{5};
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
UniqueOutputIteratorT – [inferred] Random-access output iterator type for writing unique output items (may be a simple pointer type)
LengthsOutputIteratorT – [inferred] Random-access output iterator type for writing output counts (may be a simple pointer type)
NumRunsOutputIteratorT – [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)
NumItemsT – [inferred] Type of num_items
EnvT – [inferred] Environment type (e.g.,
cuda::std::execution::env<...>)
- Parameters:
d_in – [in] Pointer to the input sequence of keys
d_unique_out – [out] Pointer to the output sequence of unique keys (one key per run)
d_counts_out – [out] Pointer to the output sequence of run-lengths (one count per run)
d_num_runs_out – [out] Pointer to total number of runs
num_items – [in] Total number of input items (i.e., the length of
d_in)env – [in] [optional] Execution environment. Default is
cuda::std::execution::env{}.
-
template<typename InputIteratorT, typename OffsetsOutputIteratorT, typename LengthsOutputIteratorT, typename NumRunsOutputIteratorT, typename NumItemsT>
static inline cudaError_t NonTrivialRuns( - void *d_temp_storage,
- size_t &temp_storage_bytes,
- InputIteratorT d_in,
- OffsetsOutputIteratorT d_offsets_out,
- LengthsOutputIteratorT d_lengths_out,
- NumRunsOutputIteratorT d_num_runs_out,
- NumItemsT num_items,
- cudaStream_t stream = 0
Enumerates the starting offsets and lengths of all non-trivial runs (of
length > 1) of same-valued keys in the sequenced_in.Added in version 2.2.0: First appears in CUDA Toolkit 12.3.
For the ith non-trivial run, the run’s starting offset and its length are written to
d_offsets_out[i]andd_lengths_out[i], respectively.The total number of runs encountered is written to
d_num_runs_out.The
==equality operator is used to determine whether values are equivalentIn-place operations are not supported. There must be no overlap between any of the provided ranges:
[d_offsets_out, d_offsets_out + *d_num_runs_out)[d_lengths_out, d_lengths_out + *d_num_runs_out)[d_num_runs_out, d_num_runs_out + 1)[d_in, d_in + num_items)
When
d_temp_storageisnullptr, no work is done and the required allocation size is returned intemp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.
Snippet#
The code snippet below illustrates the identification of non-trivial runs within a sequence of
intvalues.#include <cub/cub.cuh> // or equivalently <cub/device/device_run_length_encode.cuh> // Declare, allocate, and initialize device-accessible pointers // for input and output int num_items; // e.g., 8 int *d_in; // e.g., [0, 2, 2, 9, 5, 5, 5, 8] int *d_offsets_out; // e.g., [ , , , , , , , ] int *d_lengths_out; // e.g., [ , , , , , , , ] int *d_num_runs_out; // e.g., [ ] ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceRunLengthEncode::NonTrivialRuns( d_temp_storage, temp_storage_bytes, d_in, d_offsets_out, d_lengths_out, d_num_runs_out, num_items); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run encoding cub::DeviceRunLengthEncode::NonTrivialRuns( d_temp_storage, temp_storage_bytes, d_in, d_offsets_out, d_lengths_out, d_num_runs_out, num_items); // d_offsets_out <-- [1, 4] // d_lengths_out <-- [2, 3] // d_num_runs_out <-- [2]
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OffsetsOutputIteratorT – [inferred] Random-access output iterator type for writing run-offset values (may be a simple pointer type)
LengthsOutputIteratorT – [inferred] Random-access output iterator type for writing run-length values (may be a simple pointer type)
NumRunsOutputIteratorT – [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)
- Parameters:
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr, the required allocation size is written totemp_storage_bytesand no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storageallocationd_in – [in] Pointer to input sequence of data items
d_offsets_out – [out] Pointer to output sequence of run-offsets (one offset per non-trivial run)
d_lengths_out – [out] Pointer to output sequence of run-lengths (one count per non-trivial run)
d_num_runs_out – [out] Pointer to total number of runs (i.e., length of
d_offsets_out)num_items – [in] Total number of associated key+value pairs (i.e., the length of
d_in_keysandd_in_values)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OffsetsOutputIteratorT, typename LengthsOutputIteratorT, typename NumRunsOutputIteratorT, typename NumItemsT, typename EnvT = ::cuda::std::execution::env<>, ::cuda::std::enable_if_t<!::cuda::std::is_same_v<InputIteratorT, void*>, int> = 0>
static inline cudaError_t NonTrivialRuns( - InputIteratorT d_in,
- OffsetsOutputIteratorT d_offsets_out,
- LengthsOutputIteratorT d_lengths_out,
- NumRunsOutputIteratorT d_num_runs_out,
- NumItemsT num_items,
- EnvT env = {}
Enumerates the starting offsets and lengths of all non-trivial runs (of
length > 1) of same-valued keys in the sequenced_in.Added in version 3.4.0: First appears in CUDA Toolkit 13.4.
This is an environment-based API that allows customization of:
Stream: Query via
cuda::get_streamMemory resource: Query via
cuda::mr::get_memory_resourceFor the ith non-trivial run, the run’s starting offset and its length are written to
d_offsets_out[i]andd_lengths_out[i], respectively.The total number of runs encountered is written to
d_num_runs_out.The
==equality operator is used to determine whether values are equivalentIn-place operations are not supported. There must be no overlap between any of the provided ranges:
[d_offsets_out, d_offsets_out + *d_num_runs_out)[d_lengths_out, d_lengths_out + *d_num_runs_out)[d_num_runs_out, d_num_runs_out + 1)[d_in, d_in + num_items)
Snippet#
The code snippet below illustrates the identification of non-trivial runs within a sequence of
intvalues using environment-based API:auto input = thrust::device_vector<int>{0, 2, 2, 9, 5, 5, 5, 8}; auto offsets_out = thrust::device_vector<int>(8); auto lengths_out = thrust::device_vector<int>(8); auto num_runs_out = thrust::device_vector<int>(1); cuda::stream stream{cuda::devices[0]}; cuda::stream_ref stream_ref{stream}; auto env = cuda::std::execution::env{stream_ref}; auto error = cub::DeviceRunLengthEncode::NonTrivialRuns( input.begin(), offsets_out.begin(), lengths_out.begin(), num_runs_out.begin(), input.size(), env); if (error != cudaSuccess) { std::cerr << "cub::DeviceRunLengthEncode::NonTrivialRuns failed with status: " << error << std::endl; } thrust::device_vector<int> expected_offsets{1, 4}; thrust::device_vector<int> expected_lengths{2, 3}; thrust::device_vector<int> expected_num_runs{2};
- Template Parameters:
InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OffsetsOutputIteratorT – [inferred] Random-access output iterator type for writing run-offset values (may be a simple pointer type)
LengthsOutputIteratorT – [inferred] Random-access output iterator type for writing run-length values (may be a simple pointer type)
NumRunsOutputIteratorT – [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)
NumItemsT – [inferred] Type of num_items
EnvT – [inferred] Environment type (e.g.,
cuda::std::execution::env<...>)
- Parameters:
d_in – [in] Pointer to input sequence of data items
d_offsets_out – [out] Pointer to output sequence of run-offsets (one offset per non-trivial run)
d_lengths_out – [out] Pointer to output sequence of run-lengths (one count per non-trivial run)
d_num_runs_out – [out] Pointer to total number of runs (i.e., length of
d_offsets_out)num_items – [in] Total number of input items (i.e., the length of
d_in)env – [in] [optional] Execution environment. Default is
cuda::std::execution::env{}.