cub::DeviceSelect#

struct DeviceSelect#

DeviceSelect provides device-wide, parallel operations for compacting selected items from sequences of data items residing within device-accessible memory. It is similar to DevicePartition, except that non-selected items are discarded, whereas DevicePartition retains them.

Overview#

These operations apply a selection criterion to selectively copy items from a specified input sequence to a compact output sequence.

Usage Considerations#

Dynamic parallelism. DeviceSelect methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

Performance#

The work-complexity of select-flagged, select-if, and select-unique as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.

Public Static Functions

template<typename InputIteratorT, typename FlagIterator, typename OutputIteratorT, typename NumSelectedIteratorT> static inline cudaError_t Flagged( void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, FlagIterator d_flags, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, ::cuda::std::int64_t num_items, cudaStream_t stream = 0 )#

Uses the d_flags sequence to selectively copy the corresponding items from d_in into d_out. The total number of items selected is written to d_num_selected_out.

Added in version 2.2.0: First appears in CUDA Toolkit 12.3.

The value type of d_flags must be castable to bool (e.g., bool, char, int, etc.).
Copies of the selected items are compacted into d_out and maintain their original relative ordering.
The range [d_out, d_out + *d_num_selected_out) shall not overlap [d_in, d_in + num_items),

[d_flags, d_flags + num_items) nor d_num_selected_out in any way.
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.

Snippet#

The code snippet below illustrates the compaction of items selected from an int device vector.

#include <cub/cub.cuh>  // or equivalently <cub/device/device_select.cuh>

// Declare, allocate, and initialize device-accessible pointers for input,
// flags, and output
int  num_items;              // e.g., 8
int  *d_in;                  // e.g., [1, 2, 3, 4, 5, 6, 7, 8]
char *d_flags;               // e.g., [1, 0, 0, 1, 0, 1, 1, 0]
int  *d_out;                 // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int  *d_num_selected_out;    // e.g., [ ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSelect::Flagged(
  d_temp_storage, temp_storage_bytes,
  d_in, d_flags, d_out, d_num_selected_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run selection
cub::DeviceSelect::Flagged(
  d_temp_storage, temp_storage_bytes,
  d_in, d_flags, d_out, d_num_selected_out, num_items);

// d_out                 <-- [1, 4, 6, 7]
// d_num_selected_out    <-- [4]

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
FlagIterator – [inferred] Random-access input iterator type for reading selection flags (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing selected items (may be a simple pointer type)
NumSelectedIteratorT – [inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)

Parameters:

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_flags – [in] Pointer to the input sequence of selection flags
d_out – [out] Pointer to the output sequence of selected data items
d_num_selected_out – [out] Pointer to the output total number of items selected (i.e., length of d_out)
num_items – [in] Total number of input items (i.e., length of d_in)
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename IteratorT, typename FlagIterator, typename NumSelectedIteratorT> static inline cudaError_t Flagged( void *d_temp_storage, size_t &temp_storage_bytes, IteratorT d_data, FlagIterator d_flags, NumSelectedIteratorT d_num_selected_out, ::cuda::std::int64_t num_items, cudaStream_t stream = 0 )#

Uses the d_flags sequence to selectively compact the items in d_data`. The total number of items selected is written to d_num_selected_out.

Added in version 2.2.0: First appears in CUDA Toolkit 12.3.

The value type of d_flags must be castable to bool (e.g., bool, char, int, etc.).
Copies of the selected items are compacted in-place and maintain their original relative ordering.
The d_data may equal d_flags. The range [d_data, d_data + num_items) shall not overlap

[d_flags, d_flags + num_items) in any other way.
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.

Snippet#

The code snippet below illustrates the compaction of items selected from an int device vector.

#include <cub/cub.cuh>  // or equivalently <cub/device/device_select.cuh>

// Declare, allocate, and initialize device-accessible pointers for input,
// flags, and output
int  num_items;              // e.g., 8
int  *d_data;                // e.g., [1, 2, 3, 4, 5, 6, 7, 8]
char *d_flags;               // e.g., [1, 0, 0, 1, 0, 1, 1, 0]
int  *d_num_selected_out;    // e.g., [ ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSelect::Flagged(
  d_temp_storage, temp_storage_bytes,
  d_in, d_flags, d_num_selected_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run selection
cub::DeviceSelect::Flagged(
  d_temp_storage, temp_storage_bytes,
  d_in, d_flags, d_num_selected_out, num_items);

// d_data                <-- [1, 4, 6, 7]
// d_num_selected_out    <-- [4]

Template Parameters:

IteratorT – [inferred] Random-access iterator type for reading and writing selected items (may be a simple pointer type)
FlagIterator – [inferred] Random-access input iterator type for reading selection flags (may be a simple pointer type)
NumSelectedIteratorT – [inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)

Parameters:

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_data – [inout] Pointer to the sequence of data items
d_flags – [in] Pointer to the input sequence of selection flags
d_num_selected_out – [out] Pointer to the output total number of items selected
num_items – [in] Total number of input items (i.e., length of d_data)
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename NumSelectedIteratorT, typename SelectOp> static inline cudaError_t If( void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, ::cuda::std::int64_t num_items, SelectOp select_op, cudaStream_t stream = 0 )#

Uses the select_op functor to selectively copy items from d_in into d_out. The total number of items selected is written to d_num_selected_out.

Added in version 2.2.0: First appears in CUDA Toolkit 12.3.

Copies of the selected items are compacted into d_out and maintain their original relative ordering.
The range [d_out, d_out + *d_num_selected_out) shall not overlap

[d_in, d_in + num_items) nor d_num_selected_out in any way.
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.

Snippet#

The code snippet below illustrates the compaction of items selected from an int device vector.

#include <cub/cub.cuh>   // or equivalently <cub/device/device_select.cuh>

// Functor type for selecting values less than some criteria
struct LessThan
{
    int compare;

    __host__ __device__ __forceinline__
    LessThan(int compare) : compare(compare) {}

    __host__ __device__ __forceinline__
    bool operator()(const int &a) const {
        return (a < compare);
    }
};

// Declare, allocate, and initialize device-accessible pointers
// for input and output
int      num_items;              // e.g., 8
int      *d_in;                  // e.g., [0, 2, 3, 9, 5, 2, 81, 8]
int      *d_out;                 // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int      *d_num_selected_out;    // e.g., [ ]
LessThan select_op(7);
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSelect::If(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, d_num_selected_out, num_items, select_op);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run selection
cub::DeviceSelect::If(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, d_num_selected_out, num_items, select_op);

// d_out                 <-- [0, 2, 3, 5, 2]
// d_num_selected_out    <-- [5]

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing selected items (may be a simple pointer type)
NumSelectedIteratorT – [inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)
SelectOp – [inferred] Selection operator type having member bool operator()(const T &a)

Parameters:

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output sequence of selected data items
d_num_selected_out – [out] Pointer to the output total number of items selected (i.e., length of d_out)
num_items – [in] Total number of input items (i.e., length of d_in)
select_op – [in] Unary selection operator
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename IteratorT, typename NumSelectedIteratorT, typename SelectOp> static inline cudaError_t If( void *d_temp_storage, size_t &temp_storage_bytes, IteratorT d_data, NumSelectedIteratorT d_num_selected_out, ::cuda::std::int64_t num_items, SelectOp select_op, cudaStream_t stream = 0 )#

Uses the select_op functor to selectively compact items in d_data. The total number of items selected is written to d_num_selected_out.

Added in version 2.2.0: First appears in CUDA Toolkit 12.3.

Copies of the selected items are compacted in d_data and maintain

their original relative ordering.
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.

Snippet#

The code snippet below illustrates the compaction of items selected from an int device vector.

#include <cub/cub.cuh>   // or equivalently <cub/device/device_select.cuh>

// Functor type for selecting values less than some criteria
struct LessThan
{
    int compare;

    __host__ __device__ __forceinline__
    LessThan(int compare) : compare(compare) {}

    __host__ __device__ __forceinline__
    bool operator()(const int &a) const {
        return (a < compare);
    }
};

// Declare, allocate, and initialize device-accessible pointers
// for input and output
int      num_items;              // e.g., 8
int      *d_data;                // e.g., [0, 2, 3, 9, 5, 2, 81, 8]
int      *d_num_selected_out;    // e.g., [ ]
LessThan select_op(7);
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSelect::If(
  d_temp_storage, temp_storage_bytes,
  d_data, d_num_selected_out, num_items, select_op);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run selection
cub::DeviceSelect::If(
  d_temp_storage, temp_storage_bytes,
  d_data, d_num_selected_out, num_items, select_op);

// d_data                <-- [0, 2, 3, 5, 2]
// d_num_selected_out    <-- [5]

Template Parameters:

IteratorT – [inferred] Random-access input iterator type for reading and writing items (may be a simple pointer type)
NumSelectedIteratorT – [inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)
SelectOp – [inferred] Selection operator type having member bool operator()(const T &a)

Parameters:

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_data – [inout] Pointer to the sequence of data items
d_num_selected_out – [out] Pointer to the output total number of items selected
num_items – [in] Total number of input items (i.e., length of d_data)
select_op – [in] Unary selection operator
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename FlagIterator, typename OutputIteratorT, typename NumSelectedIteratorT, typename SelectOp> static inline cudaError_t FlaggedIf( void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, FlagIterator d_flags, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, ::cuda::std::int64_t num_items, SelectOp select_op, cudaStream_t stream = 0 )#

Uses the select_op functor applied to d_flags to selectively copy the corresponding items from d_in into d_out. The total number of items selected is written to d_num_selected_out.

Added in version 2.2.0: First appears in CUDA Toolkit 12.3.

The expression select_op(flag) must be convertible to bool, where the type of flag corresponds to the value type of FlagIterator.
Copies of the selected items are compacted into d_out and maintain their original relative ordering.
The range [d_out, d_out + *d_num_selected_out) shall not overlap

[d_in, d_in + num_items) nor d_num_selected_out in any way.
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.

Snippet#

The code snippet below illustrates the compaction of items selected from an int device vector.

struct is_even_t
{
  __host__ __device__ bool operator()(int flag) const
  {
    return !(flag % 2);
  }
};

constexpr int num_items         = 8;
c2h::device_vector<int> d_in    = {0, 1, 2, 3, 4, 5, 6, 7};
c2h::device_vector<int> d_flags = {8, 6, 7, 5, 3, 0, 9, 3};
c2h::device_vector<int> d_out(num_items);
c2h::device_vector<int> d_num_selected_out(num_items);
is_even_t is_even{};

// Determine temporary device storage requirements
size_t temp_storage_bytes = 0;
cub::DeviceSelect::FlaggedIf(
  nullptr,
  temp_storage_bytes,
  d_in.begin(),
  d_flags.begin(),
  d_out.begin(),
  d_num_selected_out.data(),
  num_items,
  is_even);

// Allocate temporary storage
c2h::device_vector<char> temp_storage(temp_storage_bytes);

// Run selection
cub::DeviceSelect::FlaggedIf(
  thrust::raw_pointer_cast(temp_storage.data()),
  temp_storage_bytes,
  d_in.begin(),
  d_flags.begin(),
  d_out.begin(),
  d_num_selected_out.data(),
  num_items,
  is_even);

c2h::device_vector<int> expected{0, 1, 5};

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
FlagIterator – [inferred] Random-access input iterator type for reading selection flags (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing selected items (may be a simple pointer type)
NumSelectedIteratorT – [inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)
SelectOp – [inferred] Selection operator type having member bool operator()(const T &a)

Parameters:

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_flags – [in] Pointer to the input sequence of selection flags
d_out – [out] Pointer to the output sequence of selected data items
d_num_selected_out – [out] Pointer to the output total number of items selected (i.e., length of d_out)
num_items – [in] Total number of input items (i.e., length of d_in)
select_op – [in] Unary selection operator
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename IteratorT, typename FlagIterator, typename NumSelectedIteratorT, typename SelectOp> static inline cudaError_t FlaggedIf( void *d_temp_storage, size_t &temp_storage_bytes, IteratorT d_data, FlagIterator d_flags, NumSelectedIteratorT d_num_selected_out, ::cuda::std::int64_t num_items, SelectOp select_op, cudaStream_t stream = 0 )#

Uses the select_op functor applied to d_flags to selectively compact the corresponding items in d_data. The total number of items selected is written to d_num_selected_out.

Added in version 2.2.0: First appears in CUDA Toolkit 12.3.

The expression select_op(flag) must be convertible to bool, where the type of flag corresponds to the value type of FlagIterator.
Copies of the selected items are compacted in-place and maintain their original relative ordering.
The d_data may equal d_flags. The range [d_data, d_data + num_items) shall not overlap

[d_flags, d_flags + num_items) in any other way.
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.

Snippet#

The code snippet below illustrates the compaction of items selected from an int device vector.

struct is_even_t
{
  __host__ __device__ bool operator()(int flag) const
  {
    return !(flag % 2);
  }
};

constexpr int num_items         = 8;
c2h::device_vector<int> d_data  = {0, 1, 2, 3, 4, 5, 6, 7};
c2h::device_vector<int> d_flags = {8, 6, 7, 5, 3, 0, 9, 3};
c2h::device_vector<int> d_num_selected_out(num_items);
is_even_t is_even{};

// Determine temporary device storage requirements
size_t temp_storage_bytes = 0;
cub::DeviceSelect::FlaggedIf(
  nullptr, temp_storage_bytes, d_data.begin(), d_flags.begin(), d_num_selected_out.data(), num_items, is_even);

// Allocate temporary storage
c2h::device_vector<char> temp_storage(temp_storage_bytes);

// Run selection
cub::DeviceSelect::FlaggedIf(
  thrust::raw_pointer_cast(temp_storage.data()),
  temp_storage_bytes,
  d_data.begin(),
  d_flags.begin(),
  d_num_selected_out.data(),
  num_items,
  is_even);

c2h::device_vector<int> expected{0, 1, 5};

Template Parameters:

IteratorT – [inferred] Random-access iterator type for reading and writing selected items (may be a simple pointer type)
FlagIterator – [inferred] Random-access input iterator type for reading selection flags (may be a simple pointer type)
NumSelectedIteratorT – [inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)
SelectOp – [inferred] Selection operator type having member bool operator()(const T &a)

Parameters:

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_data – [inout] Pointer to the sequence of data items
d_flags – [in] Pointer to the input sequence of selection flags
d_num_selected_out – [out] Pointer to the output total number of items selected
num_items – [in] Total number of input items (i.e., length of d_data)
select_op – [in] Unary selection operator
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename InputIteratorT, typename OutputIteratorT, typename NumSelectedIteratorT> static inline cudaError_t Unique( void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumSelectedIteratorT d_num_selected_out, ::cuda::std::int64_t num_items, cudaStream_t stream = 0 )#

Given an input sequence d_in having runs of consecutive equal-valued keys, only the first key from each run is selectively copied to d_out. The total number of items selected is written to d_num_selected_out.

Added in version 2.2.0: First appears in CUDA Toolkit 12.3.

The == equality operator is used to determine whether keys are equivalent
Copies of the selected items are compacted into d_out and maintain their original relative ordering.
The range [d_out, d_out + *d_num_selected_out) shall not overlap

[d_in, d_in + num_items) nor d_num_selected_out in any way.
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.

Snippet#

The code snippet below illustrates the compaction of items selected from an int device vector.

#include <cub/cub.cuh>   // or equivalently <cub/device/device_select.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for input and output
int  num_items;              // e.g., 8
int  *d_in;                  // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int  *d_out;                 // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int  *d_num_selected_out;    // e.g., [ ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSelect::Unique(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, d_num_selected_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run selection
cub::DeviceSelect::Unique(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, d_num_selected_out, num_items);

// d_out                 <-- [0, 2, 9, 5, 8]
// d_num_selected_out    <-- [5]

Template Parameters:

InputIteratorT – [inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing selected items (may be a simple pointer type)
NumSelectedIteratorT – [inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)

Parameters:

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output sequence of selected data items
d_num_selected_out – [out] Pointer to the output total number of items selected (i.e., length of d_out)
num_items – [in] Total number of input items (i.e., length of d_in)
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename KeyInputIteratorT, typename ValueInputIteratorT, typename KeyOutputIteratorT, typename ValueOutputIteratorT, typename NumSelectedIteratorT, typename NumItemsT, typename EqualityOpT> static inline ::cuda::std::enable_if_t<!::cuda::std::is_convertible_v<EqualityOpT, cudaStream_t>, cudaError_t> UniqueByKey( void *d_temp_storage, size_t &temp_storage_bytes, KeyInputIteratorT d_keys_in, ValueInputIteratorT d_values_in, KeyOutputIteratorT d_keys_out, ValueOutputIteratorT d_values_out, NumSelectedIteratorT d_num_selected_out, NumItemsT num_items, EqualityOpT equality_op, cudaStream_t stream = 0 )#

Given an input sequence d_keys_in and d_values_in with runs of key-value pairs with consecutive equal-valued keys, only the first key and its value from each run is selectively copied to d_keys_out and d_values_out. The total number of items selected is written to d_num_selected_out.

Added in version 2.2.0: First appears in CUDA Toolkit 12.3.

The user-provided equality operator, equality_op, is used to determine whether keys are equivalent
Copies of the selected items are compacted into d_out and maintain their original relative ordering.
In-place operations are not supported. There must be no overlap between any of the provided ranges:
- [d_keys_in, d_keys_in + num_items)
- [d_keys_out, d_keys_out + *d_num_selected_out)
- [d_values_in, d_values_in + num_items)
- [d_values_out, d_values_out + *d_num_selected_out)
- [d_num_selected_out, d_num_selected_out + 1)
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.

Snippet#

The code snippet below illustrates the compaction of items selected from an int device vector.

#include <cub/cub.cuh>   // or equivalently <cub/device/device_select.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for input and output
int  num_items;              // e.g., 8
int  *d_keys_in;             // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int  *d_values_in;           // e.g., [1, 2, 3, 4, 5, 6, 7, 8]
int  *d_keys_out;            // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int  *d_values_out;          // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int  *d_num_selected_out;    // e.g., [ ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSelect::UniqueByKey(
  d_temp_storage, temp_storage_bytes,
  d_keys_in, d_values_in,
  d_keys_out, d_values_out, d_num_selected_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run selection
cub::DeviceSelect::UniqueByKey(
  d_temp_storage, temp_storage_bytes,
  d_keys_in, d_values_in,
  d_keys_out, d_values_out, d_num_selected_out, num_items);

// d_keys_out            <-- [0, 2, 9, 5, 8]
// d_values_out          <-- [1, 2, 4, 5, 8]
// d_num_selected_out    <-- [5]

Template Parameters:

KeyInputIteratorT – [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
ValueInputIteratorT – [inferred] Random-access input iterator type for reading input values (may be a simple pointer type)
KeyOutputIteratorT – [inferred] Random-access output iterator type for writing selected keys (may be a simple pointer type)
ValueOutputIteratorT – [inferred] Random-access output iterator type for writing selected values (may be a simple pointer type)
NumSelectedIteratorT – [inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)
NumItemsT – [inferred] Type of num_items
EqualityOpT – [inferred] Type of equality_op

Parameters:

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_keys_in – [in] Pointer to the input sequence of keys
d_values_in – [in] Pointer to the input sequence of values
d_keys_out – [out] Pointer to the output sequence of selected keys
d_values_out – [out] Pointer to the output sequence of selected values
d_num_selected_out – [out] Pointer to the total number of items selected (i.e., length of d_keys_out or d_values_out)
num_items – [in] Total number of input items (i.e., length of d_keys_in or d_values_in)
equality_op – [in] Binary predicate to determine equality
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.

template<typename KeyInputIteratorT, typename ValueInputIteratorT, typename KeyOutputIteratorT, typename ValueOutputIteratorT, typename NumSelectedIteratorT, typename NumItemsT> static inline cudaError_t UniqueByKey( void *d_temp_storage, size_t &temp_storage_bytes, KeyInputIteratorT d_keys_in, ValueInputIteratorT d_values_in, KeyOutputIteratorT d_keys_out, ValueOutputIteratorT d_values_out, NumSelectedIteratorT d_num_selected_out, NumItemsT num_items, cudaStream_t stream = 0 )#

Given an input sequence d_keys_in and d_values_in with runs of key-value pairs with consecutive equal-valued keys, only the first key and its value from each run is selectively copied to d_keys_out and d_values_out. The total number of items selected is written to d_num_selected_out.

Added in version 2.2.0: First appears in CUDA Toolkit 12.3.

The == equality operator is used to determine whether keys are equivalent
Copies of the selected items are compacted into d_out and maintain their original relative ordering.
In-place operations are not supported. There must be no overlap between any of the provided ranges:
- [d_keys_in, d_keys_in + num_items)
- [d_keys_out, d_keys_out + *d_num_selected_out)
- [d_values_in, d_values_in + num_items)
- [d_values_out, d_values_out + *d_num_selected_out)
- [d_num_selected_out, d_num_selected_out + 1)
When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes. See Determining Temporary Storage Requirements for usage guidance.

Snippet#

The code snippet below illustrates the compaction of items selected from an int device vector.

#include <cub/cub.cuh>   // or equivalently <cub/device/device_select.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for input and output
int  num_items;              // e.g., 8
int  *d_keys_in;             // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int  *d_values_in;           // e.g., [1, 2, 3, 4, 5, 6, 7, 8]
int  *d_keys_out;            // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int  *d_values_out;          // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int  *d_num_selected_out;    // e.g., [ ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSelect::UniqueByKey(
  d_temp_storage, temp_storage_bytes,
  d_keys_in, d_values_in,
  d_keys_out, d_values_out, d_num_selected_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run selection
cub::DeviceSelect::UniqueByKey(
  d_temp_storage, temp_storage_bytes,
  d_keys_in, d_values_in,
  d_keys_out, d_values_out, d_num_selected_out, num_items);

// d_keys_out            <-- [0, 2, 9, 5, 8]
// d_values_out          <-- [1, 2, 4, 5, 8]
// d_num_selected_out    <-- [5]

Template Parameters:

KeyInputIteratorT – [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
ValueInputIteratorT – [inferred] Random-access input iterator type for reading input values (may be a simple pointer type)
KeyOutputIteratorT – [inferred] Random-access output iterator type for writing selected keys (may be a simple pointer type)
ValueOutputIteratorT – [inferred] Random-access output iterator type for writing selected values (may be a simple pointer type)
NumSelectedIteratorT – [inferred] Output iterator type for recording the number of items selected (may be a simple pointer type)
NumItemsT – [inferred] Type of num_items

Parameters:

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_keys_in – [in] Pointer to the input sequence of keys
d_values_in – [in] Pointer to the input sequence of values
d_keys_out – [out] Pointer to the output sequence of selected keys
d_values_out – [out] Pointer to the output sequence of selected values
d_num_selected_out – [out] Pointer to the total number of items selected (i.e., length of d_keys_out or d_values_out)
num_items – [in] Total number of input items (i.e., length of d_keys_in or d_values_in)
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.