cub::DeviceRunLengthEncode#

struct DeviceRunLengthEncode#

DeviceRunLengthEncode provides device-wide, parallel operations for demarcating “runs” of same-valued items within a sequence residing within device-accessible memory.

Overview#

A run-length encoding computes a simple compressed representation of a sequence of input elements such that each maximal “run” of consecutive same-valued data items is encoded as a single data value along with a count of the elements in that run.

Usage Considerations#

Dynamic parallelism. DeviceRunLengthEncode methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

Performance#

The work-complexity of run-length encode as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.

Public Static Functions

template<typename InputIteratorT, typename UniqueOutputIteratorT, typename LengthsOutputIteratorT, typename NumRunsOutputIteratorT, typename NumItemsT> static inline cudaError_t Encode( void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, UniqueOutputIteratorT d_unique_out, LengthsOutputIteratorT d_counts_out, NumRunsOutputIteratorT d_num_runs_out, NumItemsT num_items, cudaStream_t stream = nullptr )#

Computes a run-length encoding of the sequence d_in.

Added in version 2.2.0: First appears in CUDA Toolkit 12.3.

For the i^th run encountered, the first key of the run and its length are written to d_unique_out[i] and d_counts_out[i], respectively.
The total number of runs encountered is written to d_num_runs_out.
The == equality operator is used to determine whether values are equivalent
In-place operations are not supported. There must be no overlap between any of the provided ranges:
- [d_unique_out, d_unique_out + *d_num_runs_out)
- [d_counts_out, d_counts_out + *d_num_runs_out)
- [d_num_runs_out, d_num_runs_out + 1)
- [d_in, d_in + num_items)
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Two-Phase API (explicit temporary storage management) for usage guidance.

Snippet#

The code snippet below illustrates the run-length encoding of a sequence of int values.

#include <cub/cub.cuh> // or equivalently <cub/device/device_run_length_encode.cuh>

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int          num_items;          // e.g., 8
int          *d_in;              // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int          *d_unique_out;      // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int          *d_counts_out;      // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int          *d_num_runs_out;    // e.g., [ ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceRunLengthEncode::Encode(
  d_temp_storage, temp_storage_bytes,
  d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run encoding
cub::DeviceRunLengthEncode::Encode(
  d_temp_storage, temp_storage_bytes,
  d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items);

// d_unique_out      <-- [0, 2, 9, 5, 8]
// d_counts_out      <-- [1, 2, 1, 3, 1]
// d_num_runs_out    <-- [5]

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
UniqueOutputIteratorT – [inferred] Random-access output iterator type for writing unique output items (may be a simple pointer type)
LengthsOutputIteratorT – [inferred] Random-access output iterator type for writing output counts (may be a simple pointer type)
NumRunsOutputIteratorT – [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)
NumItemsT – [inferred] Type of num_items

Parameters:

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of keys
d_unique_out – [out] Pointer to the output sequence of unique keys (one key per run)
d_counts_out – [out] Pointer to the output sequence of run-lengths (one count per run)
d_num_runs_out – [out] Pointer to total number of runs
num_items – [in] Total number of input items (i.e., the length of d_in)
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename UniqueOutputIteratorT, typename LengthsOutputIteratorT, typename NumRunsOutputIteratorT, typename NumItemsT, typename EnvT = ::cuda::std::execution::env<>> static inline cudaError_t Encode( InputIteratorT d_in, UniqueOutputIteratorT d_unique_out, LengthsOutputIteratorT d_counts_out, NumRunsOutputIteratorT d_num_runs_out, NumItemsT num_items, EnvT env = {} )#

Computes a run-length encoding of the sequence d_in.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

This is an environment-based API that allows customization of:

Stream: Query via cuda::get_stream
Memory resource: Query via cuda::mr::get_memory_resource
For the i^th run encountered, the first key of the run and its length are written to d_unique_out[i] and d_counts_out[i], respectively.
The total number of runs encountered is written to d_num_runs_out.
The == equality operator is used to determine whether values are equivalent
In-place operations are not supported. There must be no overlap between any of the provided ranges:
- [d_unique_out, d_unique_out + *d_num_runs_out)
- [d_counts_out, d_counts_out + *d_num_runs_out)
- [d_num_runs_out, d_num_runs_out + 1)
- [d_in, d_in + num_items)

Snippet#

The code snippet below illustrates the run-length encoding of a sequence of int values using environment-based API:

auto input        = thrust::device_vector<int>{0, 2, 2, 9, 5, 5, 5, 8};
auto unique_out   = thrust::device_vector<int>(8);
auto counts_out   = thrust::device_vector<int>(8);
auto num_runs_out = thrust::device_vector<int>(1);

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};

auto error = cub::DeviceRunLengthEncode::Encode(
  input.begin(), unique_out.begin(), counts_out.begin(), num_runs_out.begin(), input.size(), stream_ref);
if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceRunLengthEncode::Encode failed with status: " << error << '\n';
}

thrust::device_vector<int> expected_unique{0, 2, 9, 5, 8};
thrust::device_vector<int> expected_counts{1, 2, 1, 3, 1};
thrust::device_vector<int> expected_num_runs{5};

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
UniqueOutputIteratorT – [inferred] Random-access output iterator type for writing unique output items (may be a simple pointer type)
LengthsOutputIteratorT – [inferred] Random-access output iterator type for writing output counts (may be a simple pointer type)
NumRunsOutputIteratorT – [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)
NumItemsT – [inferred] Type of num_items
EnvT – [inferred] Environment type (e.g., cuda::std::execution::env<...>)

Parameters:

d_in – [in] Pointer to the input sequence of keys
d_unique_out – [out] Pointer to the output sequence of unique keys (one key per run)
d_counts_out – [out] Pointer to the output sequence of run-lengths (one count per run)
d_num_runs_out – [out] Pointer to total number of runs
num_items – [in] Total number of input items (i.e., the length of d_in)
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename InputIteratorT, typename OffsetsOutputIteratorT, typename LengthsOutputIteratorT, typename NumRunsOutputIteratorT, typename NumItemsT> static inline cudaError_t NonTrivialRuns( void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OffsetsOutputIteratorT d_offsets_out, LengthsOutputIteratorT d_lengths_out, NumRunsOutputIteratorT d_num_runs_out, NumItemsT num_items, cudaStream_t stream = nullptr )#

Enumerates the starting offsets and lengths of all non-trivial runs (of length > 1) of same-valued keys in the sequence d_in.

Added in version 2.2.0: First appears in CUDA Toolkit 12.3.

For the i^th non-trivial run, the run’s starting offset and its length are written to d_offsets_out[i] and d_lengths_out[i], respectively.
The total number of runs encountered is written to d_num_runs_out.
The == equality operator is used to determine whether values are equivalent
In-place operations are not supported. There must be no overlap between any of the provided ranges:
- [d_offsets_out, d_offsets_out + *d_num_runs_out)
- [d_lengths_out, d_lengths_out + *d_num_runs_out)
- [d_num_runs_out, d_num_runs_out + 1)
- [d_in, d_in + num_items)
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Two-Phase API (explicit temporary storage management) for usage guidance.

Snippet#

The code snippet below illustrates the identification of non-trivial runs within a sequence of int values.

#include <cub/cub.cuh> // or equivalently <cub/device/device_run_length_encode.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for input and output
int          num_items;          // e.g., 8
int          *d_in;              // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int          *d_offsets_out;     // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int          *d_lengths_out;     // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int          *d_num_runs_out;    // e.g., [ ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceRunLengthEncode::NonTrivialRuns(
  d_temp_storage, temp_storage_bytes,
  d_in, d_offsets_out, d_lengths_out, d_num_runs_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run encoding
cub::DeviceRunLengthEncode::NonTrivialRuns(
  d_temp_storage, temp_storage_bytes,
  d_in, d_offsets_out, d_lengths_out, d_num_runs_out, num_items);

// d_offsets_out         <-- [1, 4]
// d_lengths_out         <-- [2, 3]
// d_num_runs_out        <-- [2]

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OffsetsOutputIteratorT – [inferred] Random-access output iterator type for writing run-offset values (may be a simple pointer type)
LengthsOutputIteratorT – [inferred] Random-access output iterator type for writing run-length values (may be a simple pointer type)
NumRunsOutputIteratorT – [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)
NumItemsT – [inferred] Type of num_items

Parameters:

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to input sequence of data items
d_offsets_out – [out] Pointer to output sequence of run-offsets (one offset per non-trivial run)
d_lengths_out – [out] Pointer to output sequence of run-lengths (one count per non-trivial run)
d_num_runs_out – [out] Pointer to total number of runs (i.e., length of d_offsets_out)
num_items – [in] Total number of input items (i.e., the length of d_in)
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OffsetsOutputIteratorT, typename LengthsOutputIteratorT, typename NumRunsOutputIteratorT, typename NumItemsT, typename EnvT = ::cuda::std::execution::env<>> static inline cudaError_t NonTrivialRuns( InputIteratorT d_in, OffsetsOutputIteratorT d_offsets_out, LengthsOutputIteratorT d_lengths_out, NumRunsOutputIteratorT d_num_runs_out, NumItemsT num_items, EnvT env = {} )#

Enumerates the starting offsets and lengths of all non-trivial runs (of length > 1) of same-valued keys in the sequence d_in.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

This is an environment-based API that allows customization of:

Stream: Query via cuda::get_stream
Memory resource: Query via cuda::mr::get_memory_resource
For the i^th non-trivial run, the run’s starting offset and its length are written to d_offsets_out[i] and d_lengths_out[i], respectively.
The total number of runs encountered is written to d_num_runs_out.
The == equality operator is used to determine whether values are equivalent
In-place operations are not supported. There must be no overlap between any of the provided ranges:
- [d_offsets_out, d_offsets_out + *d_num_runs_out)
- [d_lengths_out, d_lengths_out + *d_num_runs_out)
- [d_num_runs_out, d_num_runs_out + 1)
- [d_in, d_in + num_items)

Snippet#

The code snippet below illustrates the identification of non-trivial runs within a sequence of int values using environment-based API:

auto input        = thrust::device_vector<int>{0, 2, 2, 9, 5, 5, 5, 8};
auto offsets_out  = thrust::device_vector<int>(8);
auto lengths_out  = thrust::device_vector<int>(8);
auto num_runs_out = thrust::device_vector<int>(1);

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};

auto error = cub::DeviceRunLengthEncode::NonTrivialRuns(
  input.begin(), offsets_out.begin(), lengths_out.begin(), num_runs_out.begin(), input.size(), stream_ref);
if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceRunLengthEncode::NonTrivialRuns failed with status: " << error << '\n';
}

thrust::device_vector<int> expected_offsets{1, 4};
thrust::device_vector<int> expected_lengths{2, 3};
thrust::device_vector<int> expected_num_runs{2};

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OffsetsOutputIteratorT – [inferred] Random-access output iterator type for writing run-offset values (may be a simple pointer type)
LengthsOutputIteratorT – [inferred] Random-access output iterator type for writing run-length values (may be a simple pointer type)
NumRunsOutputIteratorT – [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)
NumItemsT – [inferred] Type of num_items
EnvT – [inferred] Environment type (e.g., cuda::std::execution::env<...>)

Parameters:

d_in – [in] Pointer to input sequence of data items
d_offsets_out – [out] Pointer to output sequence of run-offsets (one offset per non-trivial run)
d_lengths_out – [out] Pointer to output sequence of run-lengths (one count per non-trivial run)
d_num_runs_out – [out] Pointer to total number of runs (i.e., length of d_offsets_out)
num_items – [in] Total number of input items (i.e., the length of d_in)
env – [in] [optional] Execution environment. Default is cuda::std::execution::env{}.