cub::DeviceRunLengthEncode

Defined in cub/device/device_run_length_encode.cuh

struct DeviceRunLengthEncode

DeviceRunLengthEncode provides device-wide, parallel operations for demarcating “runs” of same-valued items within a sequence residing within device-accessible memory.

Overview

A run-length encoding computes a simple compressed representation of a sequence of input elements such that each maximal “run” of consecutive same-valued data items is encoded as a single data value along with a count of the elements in that run.

Usage Considerations

Dynamic parallelism. DeviceRunLengthEncode methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

Performance

The work-complexity of run-length encode as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.

Public Static Functions

template<typename InputIteratorT, typename UniqueOutputIteratorT, typename LengthsOutputIteratorT, typename NumRunsOutputIteratorT> static inline cudaError_t Encode(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, UniqueOutputIteratorT d_unique_out, LengthsOutputIteratorT d_counts_out, NumRunsOutputIteratorT d_num_runs_out, int num_items, cudaStream_t stream = 0)

Computes a run-length encoding of the sequence d_in.

For the i^th run encountered, the first key of the run and its length are written to d_unique_out[i] and d_counts_out[i], respectively.
The total number of runs encountered is written to d_num_runs_out.
The == equality operator is used to determine whether values are equivalent
In-place operations are not supported. There must be no overlap between any of the provided ranges:
- [d_unique_out, d_unique_out + *d_num_runs_out)
- [d_counts_out, d_counts_out + *d_num_runs_out)
- [d_num_runs_out, d_num_runs_out + 1)
- [d_in, d_in + num_items)
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the run-length encoding of a sequence of int values.

#include <cub/cub.cuh> // or equivalently <cub/device/device_run_length_encode.cuh>

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int          num_items;          // e.g., 8
int          *d_in;              // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int          *d_unique_out;      // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int          *d_counts_out;      // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int          *d_num_runs_out;    // e.g., [ ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceRunLengthEncode::Encode(
  d_temp_storage, temp_storage_bytes,
  d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run encoding
cub::DeviceRunLengthEncode::Encode(
  d_temp_storage, temp_storage_bytes,
  d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items);

// d_unique_out      <-- [0, 2, 9, 5, 8]
// d_counts_out      <-- [1, 2, 1, 3, 1]
// d_num_runs_out    <-- [5]

Template Parameters

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
UniqueOutputIteratorT – [inferred] Random-access output iterator type for writing unique output items (may be a simple pointer type)
LengthsOutputIteratorT – [inferred] Random-access output iterator type for writing output counts (may be a simple pointer type)
NumRunsOutputIteratorT – [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)

Parameters

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of keys
d_unique_out – [out] Pointer to the output sequence of unique keys (one key per run)
d_counts_out – [out] Pointer to the output sequence of run-lengths (one count per run)
d_num_runs_out – [out] Pointer to total number of runs
num_items – [in] Total number of associated key+value pairs (i.e., the length of d_in_keys and d_in_values)
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OffsetsOutputIteratorT, typename LengthsOutputIteratorT, typename NumRunsOutputIteratorT> static inline cudaError_t NonTrivialRuns(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OffsetsOutputIteratorT d_offsets_out, LengthsOutputIteratorT d_lengths_out, NumRunsOutputIteratorT d_num_runs_out, int num_items, cudaStream_t stream = 0)

Enumerates the starting offsets and lengths of all non-trivial runs (of length > 1) of same-valued keys in the sequence d_in.

For the i^th non-trivial run, the run’s starting offset and its length are written to d_offsets_out[i] and d_lengths_out[i], respectively.
The total number of runs encountered is written to d_num_runs_out.
The == equality operator is used to determine whether values are equivalent
In-place operations are not supported. There must be no overlap between any of the provided ranges:
- [d_offsets_out, d_offsets_out + *d_num_runs_out)
- [d_lengths_out, d_lengths_out + *d_num_runs_out)
- [d_num_runs_out, d_num_runs_out + 1)
- [d_in, d_in + num_items)
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the identification of non-trivial runs within a sequence of int values.

#include <cub/cub.cuh> // or equivalently <cub/device/device_run_length_encode.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for input and output
int          num_items;          // e.g., 8
int          *d_in;              // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int          *d_offsets_out;     // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int          *d_lengths_out;     // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int          *d_num_runs_out;    // e.g., [ ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceRunLengthEncode::NonTrivialRuns(
  d_temp_storage, temp_storage_bytes,
  d_in, d_offsets_out, d_lengths_out, d_num_runs_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run encoding
cub::DeviceRunLengthEncode::NonTrivialRuns(
  d_temp_storage, temp_storage_bytes,
  d_in, d_offsets_out, d_lengths_out, d_num_runs_out, num_items);

// d_offsets_out         <-- [1, 4]
// d_lengths_out         <-- [2, 3]
// d_num_runs_out        <-- [2]

Template Parameters

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OffsetsOutputIteratorT – [inferred] Random-access output iterator type for writing run-offset values (may be a simple pointer type)
LengthsOutputIteratorT – [inferred] Random-access output iterator type for writing run-length values (may be a simple pointer type)
NumRunsOutputIteratorT – [inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)

Parameters

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to input sequence of data items
d_offsets_out – [out] Pointer to output sequence of run-offsets (one offset per non-trivial run)
d_lengths_out – [out] Pointer to output sequence of run-lengths (one count per non-trivial run)
d_num_runs_out – [out] Pointer to total number of runs (i.e., length of d_offsets_out)
num_items – [in] Total number of associated key+value pairs (i.e., the length of d_in_keys and d_in_values)
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.