cub::DeviceSelect

Defined in cub/device/device_select.cuh

struct DeviceSelect

DeviceSelect provides device-wide, parallel operations for compacting selected items from sequences of data items residing within device-accessible memory.

Overview

These operations apply a selection criterion to selectively copy items from a specified input sequence to a compact output sequence.

Usage Considerations

  • Dynamic parallelism. DeviceSelect methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

Performance

The work-complexity of select-flagged, select-if, and select-unique as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.

Public Static Functions

template<typename InputIteratorT, typename FlagIterator, typename OutputIteratorT, typename NumSelectedIteratorT>
static inline cudaError_t Flagged(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, FlagIterator d_flags, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, int num_items, cudaStream_t stream = 0)

Uses the d_flags sequence to selectively copy the corresponding items from d_in into d_out. The total number of items selected is written to d_num_selected_out.

  • The value type of d_flags must be castable to bool (e.g., bool, char, int, etc.).

  • Copies of the selected items are compacted into d_out and maintain their original relative ordering.

  • The range [d_out, d_out + *d_num_selected_out) shall not overlap [d_in, d_in + num_items),
    [d_flags, d_flags + num_items) nor d_num_selected_out in any way.
  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the compaction of items selected from an int device vector.

#include <cub/cub.cuh>  // or equivalently <cub/device/device_select.cuh>

// Declare, allocate, and initialize device-accessible pointers for input,
// flags, and output
int  num_items;              // e.g., 8
int  *d_in;                  // e.g., [1, 2, 3, 4, 5, 6, 7, 8]
char *d_flags;               // e.g., [1, 0, 0, 1, 0, 1, 1, 0]
int  *d_out;                 // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int  *d_num_selected_out;    // e.g., [ ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSelect::Flagged(
  d_temp_storage, temp_storage_bytes,
  d_in, d_flags, d_out, d_num_selected_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run selection
cub::DeviceSelect::Flagged(
  d_temp_storage, temp_storage_bytes,
  d_in, d_flags, d_out, d_num_selected_out, num_items);

// d_out                 <-- [1, 4, 6, 7]
// d_num_selected_out    <-- [4]

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)

  • FlagIterator[inferred] Random-access input iterator type for reading selection flags (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing selected items (may be a simple pointer type)

  • NumSelectedIteratorT[inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to the input sequence of data items

  • d_flags[in] Pointer to the input sequence of selection flags

  • d_out[out] Pointer to the output sequence of selected data items

  • d_num_selected_out[out] Pointer to the output total number of items selected (i.e., length of d_out)

  • num_items[in] Total number of input items (i.e., length of d_in)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename IteratorT, typename FlagIterator, typename NumSelectedIteratorT>
static inline cudaError_t Flagged(void *d_temp_storage, size_t &temp_storage_bytes, IteratorT d_data, FlagIterator d_flags, NumSelectedIteratorT d_num_selected_out, int num_items, cudaStream_t stream = 0)

Uses the d_flags sequence to selectively compact the items in d_data`. The total number of items selected is written to d_num_selected_out.

  • The value type of d_flags must be castable to bool (e.g., bool, char, int, etc.).

  • Copies of the selected items are compacted in-place and maintain their original relative ordering.

  • The d_data may equal d_flags. The range [d_data, d_data + num_items) shall not overlap
    [d_flags, d_flags + num_items) in any other way.
  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the compaction of items selected from an int device vector.

#include <cub/cub.cuh>  // or equivalently <cub/device/device_select.cuh>

// Declare, allocate, and initialize device-accessible pointers for input,
// flags, and output
int  num_items;              // e.g., 8
int  *d_data;                // e.g., [1, 2, 3, 4, 5, 6, 7, 8]
char *d_flags;               // e.g., [1, 0, 0, 1, 0, 1, 1, 0]
int  *d_num_selected_out;    // e.g., [ ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSelect::Flagged(
  d_temp_storage, temp_storage_bytes,
  d_in, d_flags, d_num_selected_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run selection
cub::DeviceSelect::Flagged(
  d_temp_storage, temp_storage_bytes,
  d_in, d_flags, d_num_selected_out, num_items);

// d_data                <-- [1, 4, 6, 7]
// d_num_selected_out    <-- [4]

Template Parameters
  • IteratorT[inferred] Random-access iterator type for reading and writing selected items (may be a simple pointer type)

  • FlagIterator[inferred] Random-access input iterator type for reading selection flags (may be a simple pointer type)

  • NumSelectedIteratorT[inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_data[inout] Pointer to the sequence of data items

  • d_flags[in] Pointer to the input sequence of selection flags

  • d_num_selected_out[out] Pointer to the output total number of items selected

  • num_items[in] Total number of input items (i.e., length of d_data)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename NumSelectedIteratorT, typename SelectOp>
static inline cudaError_t If(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, int num_items, SelectOp select_op, cudaStream_t stream = 0)

Uses the select_op functor to selectively copy items from d_in into d_out. The total number of items selected is written to d_num_selected_out.

  • Copies of the selected items are compacted into d_out and maintain their original relative ordering.

  • The range [d_out, d_out + *d_num_selected_out) shall not overlap
    [d_in, d_in + num_items) nor d_num_selected_out in any way.
  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the compaction of items selected from an int device vector.

#include <cub/cub.cuh>   // or equivalently <cub/device/device_select.cuh>

// Functor type for selecting values less than some criteria
struct LessThan
{
    int compare;

    __host__ __device__ __forceinline__
    LessThan(int compare) : compare(compare) {}

    __host__ __device__ __forceinline__
    bool operator()(const int &a) const {
        return (a < compare);
    }
};

// Declare, allocate, and initialize device-accessible pointers
// for input and output
int      num_items;              // e.g., 8
int      *d_in;                  // e.g., [0, 2, 3, 9, 5, 2, 81, 8]
int      *d_out;                 // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int      *d_num_selected_out;    // e.g., [ ]
LessThan select_op(7);
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSelect::If(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, d_num_selected_out, num_items, select_op);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run selection
cub::DeviceSelect::If(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, d_num_selected_out, num_items, select_op);

// d_out                 <-- [0, 2, 3, 5, 2]
// d_num_selected_out    <-- [5]

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing selected items (may be a simple pointer type)

  • NumSelectedIteratorT[inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)

  • SelectOp[inferred] Selection operator type having member bool operator()(const T &a)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to the input sequence of data items

  • d_out[out] Pointer to the output sequence of selected data items

  • d_num_selected_out[out] Pointer to the output total number of items selected (i.e., length of d_out)

  • num_items[in] Total number of input items (i.e., length of d_in)

  • select_op[in] Unary selection operator

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename IteratorT, typename NumSelectedIteratorT, typename SelectOp>
static inline cudaError_t If(void *d_temp_storage, size_t &temp_storage_bytes, IteratorT d_data, NumSelectedIteratorT d_num_selected_out, int num_items, SelectOp select_op, cudaStream_t stream = 0)

Uses the select_op functor to selectively compact items in d_data. The total number of items selected is written to d_num_selected_out.

  • Copies of the selected items are compacted in d_data and maintain
    their original relative ordering.
  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the compaction of items selected from an int device vector.

#include <cub/cub.cuh>   // or equivalently <cub/device/device_select.cuh>

// Functor type for selecting values less than some criteria
struct LessThan
{
    int compare;

    __host__ __device__ __forceinline__
    LessThan(int compare) : compare(compare) {}

    __host__ __device__ __forceinline__
    bool operator()(const int &a) const {
        return (a < compare);
    }
};

// Declare, allocate, and initialize device-accessible pointers
// for input and output
int      num_items;              // e.g., 8
int      *d_data;                // e.g., [0, 2, 3, 9, 5, 2, 81, 8]
int      *d_num_selected_out;    // e.g., [ ]
LessThan select_op(7);
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSelect::If(
  d_temp_storage, temp_storage_bytes,
  d_data, d_num_selected_out, num_items, select_op);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run selection
cub::DeviceSelect::If(
  d_temp_storage, temp_storage_bytes,
  d_data, d_num_selected_out, num_items, select_op);

// d_data                <-- [0, 2, 3, 5, 2]
// d_num_selected_out    <-- [5]

Template Parameters
  • IteratorT[inferred] Random-access input iterator type for reading and writing items (may be a simple pointer type)

  • NumSelectedIteratorT[inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)

  • SelectOp[inferred] Selection operator type having member bool operator()(const T &a)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_data[inout] Pointer to the sequence of data items

  • d_num_selected_out[out] Pointer to the output total number of items selected

  • num_items[in] Total number of input items (i.e., length of d_data)

  • select_op[in] Unary selection operator

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename FlagIterator, typename OutputIteratorT, typename NumSelectedIteratorT, typename SelectOp>
static inline cudaError_t FlaggedIf(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, FlagIterator d_flags, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, int num_items, SelectOp select_op, cudaStream_t stream = 0)

Uses the select_op functor applied to d_flags to selectively copy the corresponding items from d_in into d_out. The total number of items selected is written to d_num_selected_out.

  • The expression select_op(flag) must be convertible to bool, where the type of flag corresponds to the value type of FlagIterator.

  • Copies of the selected items are compacted into d_out and maintain their original relative ordering.

  • The range [d_out, d_out + *d_num_selected_out) shall not overlap
    [d_in, d_in + num_items) nor d_num_selected_out in any way.
  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the compaction of items selected from an int device vector.

struct is_even_t
{
  __host__ __device__ bool operator()(int flag) const
  {
    return !(flag % 2);
  }
};
constexpr int num_items            = 8;
thrust::device_vector<int> d_in    = {0, 1, 2, 3, 4, 5, 6, 7};
thrust::device_vector<int> d_flags = {8, 6, 7, 5, 3, 0, 9, 3};
thrust::device_vector<int> d_out(num_items);
thrust::device_vector<int> d_num_selected_out(num_items);
is_even_t is_even{};

// Determine temporary device storage requirements
size_t temp_storage_bytes = 0;
cub::DeviceSelect::FlaggedIf(
  nullptr,
  temp_storage_bytes,
  d_in.begin(),
  d_flags.begin(),
  d_out.begin(),
  d_num_selected_out.data(),
  num_items,
  is_even);

// Allocate temporary storage
c2h::device_vector<char> temp_storage(temp_storage_bytes);

// Run selection
cub::DeviceSelect::FlaggedIf(
  thrust::raw_pointer_cast(temp_storage.data()),
  temp_storage_bytes,
  d_in.begin(),
  d_flags.begin(),
  d_out.begin(),
  d_num_selected_out.data(),
  num_items,
  is_even);

thrust::device_vector<int> expected{0, 1, 5};

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)

  • FlagIterator[inferred] Random-access input iterator type for reading selection flags (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing selected items (may be a simple pointer type)

  • NumSelectedIteratorT[inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)

  • SelectOp[inferred] Selection operator type having member bool operator()(const T &a)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to the input sequence of data items

  • d_flags[in] Pointer to the input sequence of selection flags

  • d_out[out] Pointer to the output sequence of selected data items

  • d_num_selected_out[out] Pointer to the output total number of items selected (i.e., length of d_out)

  • num_items[in] Total number of input items (i.e., length of d_in)

  • select_op[in] Unary selection operator

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename IteratorT, typename FlagIterator, typename NumSelectedIteratorT, typename SelectOp>
static inline cudaError_t FlaggedIf(void *d_temp_storage, size_t &temp_storage_bytes, IteratorT d_data, FlagIterator d_flags, NumSelectedIteratorT d_num_selected_out, int num_items, SelectOp select_op, cudaStream_t stream = 0)

Uses the select_op functor applied to d_flags to selectively compact the corresponding items in d_data. The total number of items selected is written to d_num_selected_out.

  • The expression select_op(flag) must be convertible to bool, where the type of flag corresponds to the value type of FlagIterator.

  • Copies of the selected items are compacted in-place and maintain their original relative ordering.

  • The d_data may equal d_flags. The range [d_data, d_data + num_items) shall not overlap
    [d_flags, d_flags + num_items) in any other way.
  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the compaction of items selected from an int device vector.

struct is_even_t
{
  __host__ __device__ bool operator()(int flag) const
  {
    return !(flag % 2);
  }
};
constexpr int num_items            = 8;
thrust::device_vector<int> d_data  = {0, 1, 2, 3, 4, 5, 6, 7};
thrust::device_vector<int> d_flags = {8, 6, 7, 5, 3, 0, 9, 3};
thrust::device_vector<int> d_num_selected_out(num_items);
is_even_t is_even{};

// Determine temporary device storage requirements
size_t temp_storage_bytes = 0;
cub::DeviceSelect::FlaggedIf(
  nullptr, temp_storage_bytes, d_data.begin(), d_flags.begin(), d_num_selected_out.data(), num_items, is_even);

// Allocate temporary storage
c2h::device_vector<char> temp_storage(temp_storage_bytes);

// Run selection
cub::DeviceSelect::FlaggedIf(
  thrust::raw_pointer_cast(temp_storage.data()),
  temp_storage_bytes,
  d_data.begin(),
  d_flags.begin(),
  d_num_selected_out.data(),
  num_items,
  is_even);

thrust::device_vector<int> expected{0, 1, 5};

Template Parameters
  • IteratorT[inferred] Random-access iterator type for reading and writing selected items (may be a simple pointer type)

  • FlagIterator[inferred] Random-access input iterator type for reading selection flags (may be a simple pointer type)

  • NumSelectedIteratorT[inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)

  • SelectOp[inferred] Selection operator type having member bool operator()(const T &a)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_data[inout] Pointer to the sequence of data items

  • d_flags[in] Pointer to the input sequence of selection flags

  • d_num_selected_out[out] Pointer to the output total number of items selected

  • num_items[in] Total number of input items (i.e., length of d_data)

  • select_op[in] Unary selection operator

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename NumSelectedIteratorT>
static inline cudaError_t Unique(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, int num_items, cudaStream_t stream = 0)

Given an input sequence d_in having runs of consecutive equal-valued keys, only the first key from each run is selectively copied to d_out. The total number of items selected is written to d_num_selected_out.

  • The == equality operator is used to determine whether keys are equivalent

  • Copies of the selected items are compacted into d_out and maintain their original relative ordering.

  • The range [d_out, d_out + *d_num_selected_out) shall not overlap
    [d_in, d_in + num_items) nor d_num_selected_out in any way.
  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the compaction of items selected from an int device vector.

#include <cub/cub.cuh>   // or equivalently <cub/device/device_select.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for input and output
int  num_items;              // e.g., 8
int  *d_in;                  // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int  *d_out;                 // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int  *d_num_selected_out;    // e.g., [ ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSelect::Unique(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, d_num_selected_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run selection
cub::DeviceSelect::Unique(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, d_num_selected_out, num_items);

// d_out                 <-- [0, 2, 9, 5, 8]
// d_num_selected_out    <-- [5]

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing selected items (may be a simple pointer type)

  • NumSelectedIteratorT[inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to the input sequence of data items

  • d_out[out] Pointer to the output sequence of selected data items

  • d_num_selected_out[out] Pointer to the output total number of items selected (i.e., length of d_out)

  • num_items[in] Total number of input items (i.e., length of d_in)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename KeyInputIteratorT, typename ValueInputIteratorT, typename KeyOutputIteratorT, typename ValueOutputIteratorT, typename NumSelectedIteratorT, typename NumItemsT, typename EqualityOpT>
static inline typename ::cuda::std::enable_if<!::cuda::std::is_convertible<EqualityOpT, cudaStream_t>::value, cudaError_t>::type UniqueByKey(void *d_temp_storage, size_t &temp_storage_bytes, KeyInputIteratorT d_keys_in, ValueInputIteratorT d_values_in, KeyOutputIteratorT d_keys_out, ValueOutputIteratorT d_values_out, NumSelectedIteratorT d_num_selected_out, NumItemsT num_items, EqualityOpT equality_op, cudaStream_t stream = 0)

Given an input sequence d_keys_in and d_values_in with runs of key-value pairs with consecutive equal-valued keys, only the first key and its value from each run is selectively copied to d_keys_out and d_values_out. The total number of items selected is written to d_num_selected_out.

  • The user-provided equality operator, equality_op, is used to determine whether keys are equivalent

  • Copies of the selected items are compacted into d_out and maintain their original relative ordering.

  • In-place operations are not supported. There must be no overlap between any of the provided ranges:

    • [d_keys_in,          d_keys_in    + num_items)

    • [d_keys_out,         d_keys_out   + *d_num_selected_out)

    • [d_values_in,        d_values_in  + num_items)

    • [d_values_out,       d_values_out + *d_num_selected_out)

    • [d_num_selected_out, d_num_selected_out + 1)

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the compaction of items selected from an int device vector.

#include <cub/cub.cuh>   // or equivalently <cub/device/device_select.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for input and output
int  num_items;              // e.g., 8
int  *d_keys_in;             // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int  *d_values_in;           // e.g., [1, 2, 3, 4, 5, 6, 7, 8]
int  *d_keys_out;            // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int  *d_values_out;          // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int  *d_num_selected_out;    // e.g., [ ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSelect::UniqueByKey(
  d_temp_storage, temp_storage_bytes,
  d_keys_in, d_values_in,
  d_keys_out, d_values_out, d_num_selected_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run selection
cub::DeviceSelect::UniqueByKey(
  d_temp_storage, temp_storage_bytes,
  d_keys_in, d_values_in,
  d_keys_out, d_values_out, d_num_selected_out, num_items);

// d_keys_out            <-- [0, 2, 9, 5, 8]
// d_values_out          <-- [1, 2, 4, 5, 8]
// d_num_selected_out    <-- [5]

Template Parameters
  • KeyInputIteratorT[inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)

  • ValueInputIteratorT[inferred] Random-access input iterator type for reading input values (may be a simple pointer type)

  • KeyOutputIteratorT[inferred] Random-access output iterator type for writing selected keys (may be a simple pointer type)

  • ValueOutputIteratorT[inferred] Random-access output iterator type for writing selected values (may be a simple pointer type)

  • NumSelectedIteratorT[inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)

  • NumItemsT[inferred] Type of num_items

  • EqualityOpT[inferred] Type of equality_op

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_keys_in[in] Pointer to the input sequence of keys

  • d_values_in[in] Pointer to the input sequence of values

  • d_keys_out[out] Pointer to the output sequence of selected keys

  • d_values_out[out] Pointer to the output sequence of selected values

  • d_num_selected_out[out] Pointer to the total number of items selected (i.e., length of d_keys_out or d_values_out)

  • num_items[in] Total number of input items (i.e., length of d_keys_in or d_values_in)

  • equality_op[in] Binary predicate to determine equality

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename KeyInputIteratorT, typename ValueInputIteratorT, typename KeyOutputIteratorT, typename ValueOutputIteratorT, typename NumSelectedIteratorT, typename NumItemsT>
static inline cudaError_t UniqueByKey(void *d_temp_storage, size_t &temp_storage_bytes, KeyInputIteratorT d_keys_in, ValueInputIteratorT d_values_in, KeyOutputIteratorT d_keys_out, ValueOutputIteratorT d_values_out, NumSelectedIteratorT d_num_selected_out, NumItemsT num_items, cudaStream_t stream = 0)

Given an input sequence d_keys_in and d_values_in with runs of key-value pairs with consecutive equal-valued keys, only the first key and its value from each run is selectively copied to d_keys_out and d_values_out. The total number of items selected is written to d_num_selected_out.

  • The == equality operator is used to determine whether keys are equivalent

  • Copies of the selected items are compacted into d_out and maintain their original relative ordering.

  • In-place operations are not supported. There must be no overlap between any of the provided ranges:

    • [d_keys_in,          d_keys_in    + num_items)

    • [d_keys_out,         d_keys_out   + *d_num_selected_out)

    • [d_values_in,        d_values_in  + num_items)

    • [d_values_out,       d_values_out + *d_num_selected_out)

    • [d_num_selected_out, d_num_selected_out + 1)

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the compaction of items selected from an int device vector.

#include <cub/cub.cuh>   // or equivalently <cub/device/device_select.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for input and output
int  num_items;              // e.g., 8
int  *d_keys_in;             // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int  *d_values_in;           // e.g., [1, 2, 3, 4, 5, 6, 7, 8]
int  *d_keys_out;            // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int  *d_values_out;          // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int  *d_num_selected_out;    // e.g., [ ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSelect::UniqueByKey(
  d_temp_storage, temp_storage_bytes,
  d_keys_in, d_values_in,
  d_keys_out, d_values_out, d_num_selected_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run selection
cub::DeviceSelect::UniqueByKey(
  d_temp_storage, temp_storage_bytes,
  d_keys_in, d_values_in,
  d_keys_out, d_values_out, d_num_selected_out, num_items);

// d_keys_out            <-- [0, 2, 9, 5, 8]
// d_values_out          <-- [1, 2, 4, 5, 8]
// d_num_selected_out    <-- [5]

Template Parameters
  • KeyInputIteratorT[inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)

  • ValueInputIteratorT[inferred] Random-access input iterator type for reading input values (may be a simple pointer type)

  • KeyOutputIteratorT[inferred] Random-access output iterator type for writing selected keys (may be a simple pointer type)

  • ValueOutputIteratorT[inferred] Random-access output iterator type for writing selected values (may be a simple pointer type)

  • NumSelectedIteratorT[inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)

  • NumItemsT[inferred] Type of num_items

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_keys_in[in] Pointer to the input sequence of keys

  • d_values_in[in] Pointer to the input sequence of values

  • d_keys_out[out] Pointer to the output sequence of selected keys

  • d_values_out[out] Pointer to the output sequence of selected values

  • d_num_selected_out[out] Pointer to the total number of items selected (i.e., length of d_keys_out or d_values_out)

  • num_items[in] Total number of input items (i.e., length of d_keys_in or d_values_in)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.