cub::DeviceRunLengthEncode

Defined in cub/device/device_run_length_encode.cuh

struct DeviceRunLengthEncode

DeviceRunLengthEncode provides device-wide, parallel operations for demarcating “runs” of same-valued items within a sequence of data items residing in device-accessible memory.

Overview

A run-length encoding computes a simple compressed representation of a sequence of input elements such that each maximal “run” of consecutive same-valued data items is encoded as a single data value along with a count of the elements in that run.
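For example, the sequence [0, 2, 2, 9, 5, 5, 5, 8] encodes to the unique values [0, 2, 9, 5, 8] with run lengths [1, 2, 1, 3, 1]. The host-only C++ sketch below (illustrative only, not part of the CUB API) spells out the same computation sequentially:

#include <cstdio>
#include <vector>

// Host-side illustration (not part of CUB): emit each maximal run's value
// together with the number of items in that run.
int main()
{
  std::vector<int> in{0, 2, 2, 9, 5, 5, 5, 8};
  std::vector<int> unique_out, counts_out;

  for (std::size_t i = 0; i < in.size();)
  {
    std::size_t j = i;
    while (j < in.size() && in[j] == in[i]) { ++j; }  // extend the current run
    unique_out.push_back(in[i]);                      // run value
    counts_out.push_back(static_cast<int>(j - i));    // run length
    i = j;
  }

  // unique_out == [0, 2, 9, 5, 8], counts_out == [1, 2, 1, 3, 1]
  for (std::size_t r = 0; r < unique_out.size(); ++r)
  {
    std::printf("value %d, count %d\n", unique_out[r], counts_out[r]);
  }
  return 0;
}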

Usage Considerations

  • Dynamic parallelism. DeviceRunLengthEncode methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

Performance

The work-complexity of run-length encode as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.

Public Static Functions

template<typename InputIteratorT, typename UniqueOutputIteratorT, typename LengthsOutputIteratorT, typename NumRunsOutputIteratorT>
static inline cudaError_t Encode(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, UniqueOutputIteratorT d_unique_out, LengthsOutputIteratorT d_counts_out, NumRunsOutputIteratorT d_num_runs_out, int num_items, cudaStream_t stream = 0)

Computes a run-length encoding of the sequence d_in.

  • For the ith run encountered, the first key of the run and its length are written to d_unique_out[i] and d_counts_out[i], respectively.

  • The total number of runs encountered is written to d_num_runs_out.

  • The == equality operator is used to determine whether values are equivalent.

  • In-place operations are not supported. There must be no overlap between any of the provided ranges:

    • [d_unique_out, d_unique_out + *d_num_runs_out)

    • [d_counts_out, d_counts_out + *d_num_runs_out)

    • [d_num_runs_out, d_num_runs_out + 1)

    • [d_in, d_in + num_items)

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the run-length encoding of a sequence of int values.

#include <cub/cub.cuh> // or equivalently <cub/device/device_run_length_encode.cuh>

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int          num_items;          // e.g., 8
int          *d_in;              // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int          *d_unique_out;      // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int          *d_counts_out;      // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int          *d_num_runs_out;    // e.g., [ ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceRunLengthEncode::Encode(
  d_temp_storage, temp_storage_bytes,
  d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run encoding
cub::DeviceRunLengthEncode::Encode(
  d_temp_storage, temp_storage_bytes,
  d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items);

// d_unique_out      <-- [0, 2, 9, 5, 8]
// d_counts_out      <-- [1, 2, 1, 3, 1]
// d_num_runs_out    <-- [5]
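The snippet above elides allocation and initialization of the device buffers. A fuller, self-contained sketch is shown below; the host/device buffer handling is illustrative (error checking omitted), while the two Encode calls follow the query-then-run pattern above:

#include <cub/cub.cuh>
#include <cstdio>

int main()
{
  const int num_items = 8;
  const int h_in[num_items] = {0, 2, 2, 9, 5, 5, 5, 8};

  // Allocate device buffers; at most num_items runs can occur
  int *d_in, *d_unique_out, *d_counts_out, *d_num_runs_out;
  cudaMalloc(&d_in,           num_items * sizeof(int));
  cudaMalloc(&d_unique_out,   num_items * sizeof(int));
  cudaMalloc(&d_counts_out,   num_items * sizeof(int));
  cudaMalloc(&d_num_runs_out, sizeof(int));
  cudaMemcpy(d_in, h_in, num_items * sizeof(int), cudaMemcpyHostToDevice);

  // First call: determine temporary storage requirements
  void   *d_temp_storage    = nullptr;
  size_t temp_storage_bytes = 0;
  cub::DeviceRunLengthEncode::Encode(
    d_temp_storage, temp_storage_bytes,
    d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items);

  // Allocate temporary storage
  cudaMalloc(&d_temp_storage, temp_storage_bytes);

  // Second call: perform the encoding
  cub::DeviceRunLengthEncode::Encode(
    d_temp_storage, temp_storage_bytes,
    d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items);

  // Copy the results back and print them
  int h_num_runs = 0;
  cudaMemcpy(&h_num_runs, d_num_runs_out, sizeof(int), cudaMemcpyDeviceToHost);

  int h_unique[num_items];
  int h_counts[num_items];
  cudaMemcpy(h_unique, d_unique_out, h_num_runs * sizeof(int), cudaMemcpyDeviceToHost);
  cudaMemcpy(h_counts, d_counts_out, h_num_runs * sizeof(int), cudaMemcpyDeviceToHost);

  for (int i = 0; i < h_num_runs; ++i)
  {
    std::printf("value %d, count %d\n", h_unique[i], h_counts[i]);
  }

  // Release device memory
  cudaFree(d_temp_storage);
  cudaFree(d_in);
  cudaFree(d_unique_out);
  cudaFree(d_counts_out);
  cudaFree(d_num_runs_out);
  return 0;
}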

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)

  • UniqueOutputIteratorT[inferred] Random-access output iterator type for writing unique output items (may be a simple pointer type)

  • LengthsOutputIteratorT[inferred] Random-access output iterator type for writing output counts (may be a simple pointer type)

  • NumRunsOutputIteratorT[inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to the input sequence of keys

  • d_unique_out[out] Pointer to the output sequence of unique keys (one key per run)

  • d_counts_out[out] Pointer to the output sequence of run-lengths (one count per run)

  • d_num_runs_out[out] Pointer to total number of runs

  • num_items[in] Total number of input items (i.e., the length of d_in)

  • stream[in] [optional] CUDA stream to launch kernels within. Default is stream 0.
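The stream argument routes both the size query and the actual launch onto a user-chosen stream. A minimal sketch, reusing the pointers from the snippet above (allocation of the data buffers and error checking omitted):

// Create a non-default stream and run both phases on it
cudaStream_t stream;
cudaStreamCreate(&stream);

void   *d_temp_storage    = nullptr;
size_t temp_storage_bytes = 0;

// Size query on the chosen stream
cub::DeviceRunLengthEncode::Encode(
  d_temp_storage, temp_storage_bytes,
  d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items, stream);

cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Encoding on the chosen stream
cub::DeviceRunLengthEncode::Encode(
  d_temp_storage, temp_storage_bytes,
  d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items, stream);

// Results are valid once the stream's work has completed
cudaStreamSynchronize(stream);

cudaFree(d_temp_storage);
cudaStreamDestroy(stream);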

template<typename InputIteratorT, typename OffsetsOutputIteratorT, typename LengthsOutputIteratorT, typename NumRunsOutputIteratorT>
static inline cudaError_t NonTrivialRuns(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OffsetsOutputIteratorT d_offsets_out, LengthsOutputIteratorT d_lengths_out, NumRunsOutputIteratorT d_num_runs_out, int num_items, cudaStream_t stream = 0)

Enumerates the starting offsets and lengths of all non-trivial runs (of length > 1) of same-valued keys in the sequence d_in.

  • For the ith non-trivial run, the run’s starting offset and its length are written to d_offsets_out[i] and d_lengths_out[i], respectively.

  • The total number of non-trivial runs encountered is written to d_num_runs_out.

  • The == equality operator is used to determine whether values are equivalent.

  • In-place operations are not supported. There must be no overlap between any of the provided ranges:

    • [d_offsets_out, d_offsets_out + *d_num_runs_out)

    • [d_lengths_out, d_lengths_out + *d_num_runs_out)

    • [d_num_runs_out, d_num_runs_out + 1)

    • [d_in, d_in + num_items)

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the identification of non-trivial runs within a sequence of int values.

#include <cub/cub.cuh> // or equivalently <cub/device/device_run_length_encode.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for input and output
int          num_items;          // e.g., 8
int          *d_in;              // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int          *d_offsets_out;     // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int          *d_lengths_out;     // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int          *d_num_runs_out;    // e.g., [ ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceRunLengthEncode::NonTrivialRuns(
  d_temp_storage, temp_storage_bytes,
  d_in, d_offsets_out, d_lengths_out, d_num_runs_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Identify non-trivial runs
cub::DeviceRunLengthEncode::NonTrivialRuns(
  d_temp_storage, temp_storage_bytes,
  d_in, d_offsets_out, d_lengths_out, d_num_runs_out, num_items);

// d_offsets_out         <-- [1, 4]
// d_lengths_out         <-- [2, 3]
// d_num_runs_out        <-- [2]
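Because every non-trivial run spans at least two items, at most num_items / 2 such runs can occur, so the per-run output buffers can be sized more tightly than the input. A small allocation sketch (buffer names follow the snippet above; error checking omitted):

// A non-trivial run contains at least two items, so there can be at most
// num_items / 2 of them; size the per-run output buffers accordingly.
int max_runs = num_items / 2;
cudaMalloc(&d_offsets_out,  max_runs * sizeof(int));
cudaMalloc(&d_lengths_out,  max_runs * sizeof(int));
cudaMalloc(&d_num_runs_out, sizeof(int));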

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)

  • OffsetsOutputIteratorT[inferred] Random-access output iterator type for writing run-offset values (may be a simple pointer type)

  • LengthsOutputIteratorT[inferred] Random-access output iterator type for writing run-length values (may be a simple pointer type)

  • NumRunsOutputIteratorT[inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to input sequence of data items

  • d_offsets_out[out] Pointer to output sequence of run-offsets (one offset per non-trivial run)

  • d_lengths_out[out] Pointer to output sequence of run-lengths (one count per non-trivial run)

  • d_num_runs_out[out] Pointer to total number of runs (i.e., length of d_offsets_out)

  • num_items[in] Total number of input items (i.e., the length of d_in)

  • stream[in] [optional] CUDA stream to launch kernels within. Default is stream 0.