cub::DeviceTopK#

struct DeviceTopK#

DeviceTopK provides device-wide, parallel operations for finding the largest (or smallest) K items from sequences of unordered data items residing within device-accessible memory.

Overview#

The TopK algorithm tries to find the largest (or smallest) K items in an unordered list. A related problem is called K selection problem, which finds the Kth largest (or smallest) values in a list. DeviceTopK will return K items in an unspecified order as results. It is based on an algorithm called AIR TopK.

Supported Types#

DeviceTopK can process all of the built-in C++ numeric primitive types (unsigned char, int, double, etc.) as well as CUDA’s __half and __nv_bfloat16 16-bit floating-point types. User-defined types are supported as long as a decomposer object is provided.

Determinism, tie-breaking, and output ordering#

The result of DeviceTopK is governed by two orthogonal execution requirements: which items are selected (cuda::execution::determinism, optionally refined by cuda::execution::tie_break) and the order in which they are written (cuda::execution::output_ordering). When the caller does not opt out, the committed default is the most reproducible behavior: deterministic results (cuda::execution::determinism::gpu_to_gpu), ties resolved toward the smaller (lower) source index (cuda::execution::tie_break::prefer_smaller_index), and stable-sorted output (cuda::execution::output_ordering::stable_sorted). Callers opt out of these guarantees to obtain faster implementations.

See Top-K: Determinism, Tie-Breaking, and Output Ordering for the full requirement model, worked examples, and guidance on choosing requirements.

Usage Considerations#

Dynamic parallelism. DeviceTopK methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

Performance#

The work-complexity of top-k as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.

Note

Current support. This release only implements the fully opted-out configuration, which must be requested explicitly: cuda::execution::require(cuda::execution::determinism::not_guaranteed, cuda::execution::output_ordering::unsorted). Any other combination (including an empty, no-requirement environment) is rejected at compile time. In this configuration the output is unordered and may be non-deterministic: if multiple items tie at the K-th position, the subset of tied elements returned is not uniquely defined and may vary between runs.

Public Static Functions

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename ValueInputIteratorT, typename ValueOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename EnvT = ::cuda::std::execution::env<>, ::cuda::std::enable_if_t<!detail::radix::is_valid_decomposer<detail::it_value_t<KeyInputIteratorT>, EnvT>, int> = 0> static inline cudaError_t MaxPairs( void *d_temp_storage, size_t &temp_storage_bytes, KeyInputIteratorT d_keys_in, KeyOutputIteratorT d_keys_out, ValueInputIteratorT d_values_in, ValueOutputIteratorT d_values_out, NumItemsT num_items, NumOutItemsT k, const EnvT &env = {} )#

Overview#

Finds the largest K keys and their corresponding values from an unordered input sequence of key-value pairs.

Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.

Added in version 3.3.0.

A Simple Example#

The following code snippet demonstrates how to use the cub::DeviceTopK::MaxPairs function to find the largest K items:

const int k     = 4;
auto keys       = thrust::device_vector<int>{5, -3, 1, 7, 8, 2, 4, 6};
auto values     = cuda::make_counting_iterator<int>(0);
auto keys_out   = thrust::device_vector<int>(k, thrust::no_init);
auto values_out = thrust::device_vector<int>(k, thrust::no_init);

// Specify that we do not require a specific output order and do not require deterministic results
auto requirements =
  cuda::execution::require(cuda::execution::determinism::not_guaranteed, cuda::execution::output_ordering::unsorted);

// Prepare CUDA stream
cudaStream_t stream = nullptr;
cudaStreamCreate(&stream);
cuda::stream_ref stream_ref{stream};

// Create the environment with the stream and requirements
auto env = cuda::std::execution::env{stream_ref, requirements};

// Query temporary storage requirements
size_t temp_storage_bytes{};
cub::DeviceTopK::MaxPairs(
  nullptr, temp_storage_bytes, keys.begin(), keys_out.begin(), values, values_out.begin(), keys.size(), k, env);

// Allocate temporary storage
thrust::device_vector<char> temp_storage(temp_storage_bytes, thrust::no_init);

cub::DeviceTopK::MaxPairs(
  thrust::raw_pointer_cast(temp_storage.data()),
  temp_storage_bytes,
  keys.begin(),
  keys_out.begin(),
  values,
  values_out.begin(),
  keys.size(),
  k,
  env);

// Get the top-k results into sorted order for easy comparison
thrust::sort_by_key(keys_out.begin(), keys_out.end(), values_out.begin(), cuda::std::greater<>{});
thrust::host_vector<int> expected_keys{8, 7, 6, 5};
thrust::host_vector<int> expected_values{4, 3, 7, 0};

Note

The behavior is undefined if the input and output ranges overlap in any way.

Template Parameters:

KeyInputIteratorT – [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
KeyOutputIteratorT – [inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)
ValueInputIteratorT – [inferred] Random-access input iterator type for reading input values (may be a simple pointer type)
ValueOutputIteratorT – [inferred] Random-access input iterator type for writing output values (may be a simple pointer type)
NumItemsT – The integral type of variable num_items
NumOutItemsT – The integral type of variable k

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See :ref:device-temp-storage for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_keys_in – [in] Random-access iterator to the input sequence containing the keys
d_keys_out – [out] Random-access iterator to the output sequence of keys, where K values will be written to
d_values_in – [in] Random-access iterator to the input sequence containing the values associated to each key
d_values_out – [out] Random-access iterator to the output sequence of values, corresponding to the top k keys, where k values will be written to
num_items – [in] Number of items to be read and processed from d_keys_in and d_values_in each
k – [in] The value of K, which is the number of largest pairs to find from num_items pairs. Capped to a maximum of num_items.
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename ValueInputIteratorT, typename ValueOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename EnvT = ::cuda::std::execution::env<>, ::cuda::std::enable_if_t<!detail::radix::is_valid_decomposer<detail::it_value_t<KeyInputIteratorT>, EnvT>, int> = 0> static inline cudaError_t MaxPairs( KeyInputIteratorT d_keys_in, KeyOutputIteratorT d_keys_out, ValueInputIteratorT d_values_in, ValueOutputIteratorT d_values_out, NumItemsT num_items, NumOutItemsT k, const EnvT &env = {} )#

Finds the largest K keys and their corresponding values from an unordered input sequence of key-value pairs.

Added in version 3.5.0: First appears in CUDA Toolkit 13.5.

This is an environment-based API that allows customization of:

Stream: Query via cuda::get_stream
Memory resource: Query via cuda::mr::get_memory_resource

Unlike the temp-storage overload, this overload allocates and manages the required temporary storage internally using the memory resource queried from the environment.

Snippet#

auto d_keys_in    = thrust::device_vector<int>{8, 6, 7, 5, 3, 0, 9, 1, 4, 2};
auto d_values_in  = thrust::device_vector<int>{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
auto d_keys_out   = thrust::device_vector<int>(3);
auto d_values_out = thrust::device_vector<int>(3);
int k             = 3;

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};
auto env = cuda::std::execution::env{
  cuda::execution::require(cuda::execution::determinism::not_guaranteed, //
                           cuda::execution::output_ordering::unsorted),
  stream_ref};

auto error = cub::DeviceTopK::MaxPairs(
  d_keys_in.begin(),
  d_keys_out.begin(),
  d_values_in.begin(),
  d_values_out.begin(),
  static_cast<int>(d_keys_in.size()),
  k,
  env);
if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceTopK::MaxPairs failed with status: " << error << '\n';
}
thrust::device_vector<int> expected_keys{9, 8, 7}; // possibly in different order

Note

The behavior is undefined if the input and output ranges overlap in any way.

Template Parameters:

KeyInputIteratorT – [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
KeyOutputIteratorT – [inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)
ValueInputIteratorT – [inferred] Random-access input iterator type for reading input values (may be a simple pointer type)
ValueOutputIteratorT – [inferred] Random-access input iterator type for writing output values (may be a simple pointer type)
NumItemsT – The integral type of variable num_items
NumOutItemsT – The integral type of variable k
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_keys_in – [in] Random-access iterator to the input sequence containing the keys
d_keys_out – [out] Random-access iterator to the output sequence of keys, where K values will be written to
d_values_in – [in] Random-access iterator to the input sequence containing the values associated to each key
d_values_out – [out] Random-access iterator to the output sequence of values, corresponding to the top k keys, where k values will be written to
num_items – [in] Number of items to be read and processed from d_keys_in and d_values_in each
k – [in] The value of K, which is the number of largest pairs to find from num_items pairs. Capped to a maximum of num_items.
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename ValueInputIteratorT, typename ValueOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename DecomposerT, typename EnvT = ::cuda::std::execution::env<>> static inline ::cuda::std::enable_if_t<detail::radix::is_valid_decomposer<detail::it_value_t<KeyInputIteratorT>, DecomposerT>, cudaError_t> MaxPairs( void *d_temp_storage, size_t &temp_storage_bytes, KeyInputIteratorT d_keys_in, KeyOutputIteratorT d_keys_out, ValueInputIteratorT d_values_in, ValueOutputIteratorT d_values_out, NumItemsT num_items, NumOutItemsT k, DecomposerT decomposer, const EnvT &env = {} )#

Overview#

Finds the largest K keys and their corresponding values from an unordered input sequence of key-value pairs, using a decomposer to interpret user-defined key types.

Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

A Simple Example#

Let’s consider a user-defined custom_t type below. To find the top-k elements of an array of custom_t objects, we have to tell CUB about relevant members of the custom_t type. We do this by providing a decomposer that returns a tuple of references to relevant members of the key.

struct custom_t
{
  float f;
  int unused;
  long long int lli;

  custom_t() = default;
  custom_t(float f, long long int lli)
      : f(f)
      , unused(42)
      , lli(lli)
  {}
};

struct decomposer_t
{
  __host__ __device__ cuda::std::tuple<float&, long long int&> operator()(custom_t& key) const
  {
    return {key.f, key.lli};
  }
};

The following snippet shows how to find the top-k largest pairs of custom_t objects using cub::DeviceTopK::MaxPairs:

constexpr int num_items = 6;
constexpr int k         = 3;

thrust::device_vector<custom_t> keys_in = {
  {+2.5f, 4}, //
  {-2.5f, 0}, //
  {+1.1f, 3}, //
  {+0.0f, 1}, //
  {-0.0f, 2}, //
  {+3.7f, 5} //
};

thrust::device_vector<custom_t> keys_out(k);

const custom_t* d_keys_in = thrust::raw_pointer_cast(keys_in.data());
custom_t* d_keys_out      = thrust::raw_pointer_cast(keys_out.data());

thrust::device_vector<int> vals_in = {0, 1, 2, 3, 4, 5};
thrust::device_vector<int> vals_out(k);

const int* d_vals_in = thrust::raw_pointer_cast(vals_in.data());
int* d_vals_out      = thrust::raw_pointer_cast(vals_out.data());

auto requirements = cuda::execution::require(
  cuda::execution::determinism::not_guaranteed, cuda::execution::output_ordering::unsorted);

std::uint8_t* d_temp_storage{};
std::size_t temp_storage_bytes{};

cub::DeviceTopK::MaxPairs(
  d_temp_storage,
  temp_storage_bytes,
  d_keys_in,
  d_keys_out,
  d_vals_in,
  d_vals_out,
  num_items,
  k,
  decomposer_t{},
  requirements);

thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

cub::DeviceTopK::MaxPairs(
  d_temp_storage,
  temp_storage_bytes,
  d_keys_in,
  d_keys_out,
  d_vals_in,
  d_vals_out,
  num_items,
  k,
  decomposer_t{},
  requirements);

// Sort by key for comparison (output order is not guaranteed)
thrust::sort_by_key(keys_out.begin(), keys_out.end(), vals_out.begin(), cuda::std::greater<>{});

thrust::device_vector<custom_t> expected_keys = {
  {+3.7f, 5}, //
  {+2.5f, 4}, //
  {+1.1f, 3} //
};

thrust::device_vector<int> expected_vals = {5, 0, 2};

Note

The behavior is undefined if the input and output ranges overlap in any way.

Template Parameters:

KeyInputIteratorT – [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
KeyOutputIteratorT – [inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)
ValueInputIteratorT – [inferred] Random-access input iterator type for reading input values (may be a simple pointer type)
ValueOutputIteratorT – [inferred] Random-access input iterator type for writing output values (may be a simple pointer type)
NumItemsT – The integral type of variable num_items
NumOutItemsT – The integral type of variable k
DecomposerT – [inferred] Type of a callable object responsible for decomposing a key into a tuple of references to its constituent arithmetic types.

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See :ref:device-temp-storage for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_keys_in – [in] Random-access iterator to the input sequence containing the keys
d_keys_out – [out] Random-access iterator to the output sequence of keys, where K values will be written to
d_values_in – [in] Random-access iterator to the input sequence containing the values associated to each key
d_values_out – [out] Random-access iterator to the output sequence of values, corresponding to the top k keys, where k values will be written to
num_items – [in] Number of items to be read and processed from d_keys_in and d_values_in each
k – [in] The value of K, which is the number of largest pairs to find from num_items pairs. Capped to a maximum of num_items.
decomposer – Callable object responsible for decomposing a key into a tuple of references to its constituent arithmetic types.
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename ValueInputIteratorT, typename ValueOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename DecomposerT, typename EnvT = ::cuda::std::execution::env<>, ::cuda::std::enable_if_t<detail::radix::is_valid_decomposer<detail::it_value_t<KeyInputIteratorT>, DecomposerT>, int> = 0> static inline cudaError_t MaxPairs( KeyInputIteratorT d_keys_in, KeyOutputIteratorT d_keys_out, ValueInputIteratorT d_values_in, ValueOutputIteratorT d_values_out, NumItemsT num_items, NumOutItemsT k, DecomposerT decomposer, const EnvT &env = {} )#

Finds the largest K keys and their corresponding values from an unordered input sequence of key-value pairs, using a decomposer to interpret user-defined key types.

Added in version 3.5.0: First appears in CUDA Toolkit 13.5.

This is an environment-based API that allows customization of:

Stream: Query via cuda::get_stream
Memory resource: Query via cuda::mr::get_memory_resource

Unlike the temp-storage overload, this overload allocates and manages the required temporary storage internally using the memory resource queried from the environment.

Snippet#

thrust::host_vector<topk_custom_t> h_keys_in{
  {8, 0}, {6, 1}, {7, 2}, {5, 3}, {3, 4}, {0, 5}, {9, 6}, {1, 7}, {4, 8}, {2, 9}};
thrust::device_vector<topk_custom_t> d_keys_in = h_keys_in;
thrust::device_vector<int> d_values_in{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
thrust::device_vector<topk_custom_t> d_keys_out(3);
thrust::device_vector<int> d_values_out(3);
int k = 3;

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};
auto env = cuda::std::execution::env{
  cuda::execution::require(cuda::execution::determinism::not_guaranteed, //
                           cuda::execution::output_ordering::unsorted),
  stream_ref};

auto error = cub::DeviceTopK::MaxPairs(
  d_keys_in.begin(),
  d_keys_out.begin(),
  d_values_in.begin(),
  d_values_out.begin(),
  static_cast<int>(d_keys_in.size()),
  k,
  topk_custom_decomposer_t{},
  env);
if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceTopK::MaxPairs failed with status: " << error << '\n';
}
thrust::host_vector<int> expected_ranks{9, 8, 7}; // possibly in different order

Note

The behavior is undefined if the input and output ranges overlap in any way.

Template Parameters:

KeyInputIteratorT – [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
KeyOutputIteratorT – [inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)
ValueInputIteratorT – [inferred] Random-access input iterator type for reading input values (may be a simple pointer type)
ValueOutputIteratorT – [inferred] Random-access input iterator type for writing output values (may be a simple pointer type)
NumItemsT – The integral type of variable num_items
NumOutItemsT – The integral type of variable k
DecomposerT – [inferred] Type of a callable object responsible for decomposing a key into a tuple of references to its constituent arithmetic types.
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_keys_in – [in] Random-access iterator to the input sequence containing the keys
d_keys_out – [out] Random-access iterator to the output sequence of keys, where K values will be written to
d_values_in – [in] Random-access iterator to the input sequence containing the values associated to each key
d_values_out – [out] Random-access iterator to the output sequence of values, corresponding to the top k keys, where k values will be written to
num_items – [in] Number of items to be read and processed from d_keys_in and d_values_in each
k – [in] The value of K, which is the number of largest pairs to find from num_items pairs. Capped to a maximum of num_items.
decomposer – [in] Callable object responsible for decomposing a key into a tuple of references to its constituent arithmetic types.
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename ValueInputIteratorT, typename ValueOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename EnvT = ::cuda::std::execution::env<>, ::cuda::std::enable_if_t<!detail::radix::is_valid_decomposer<detail::it_value_t<KeyInputIteratorT>, EnvT>, int> = 0> static inline cudaError_t MinPairs( void *d_temp_storage, size_t &temp_storage_bytes, KeyInputIteratorT d_keys_in, KeyOutputIteratorT d_keys_out, ValueInputIteratorT d_values_in, ValueOutputIteratorT d_values_out, NumItemsT num_items, NumOutItemsT k, const EnvT &env = {} )#

Overview#

Finds the lowest K keys and their corresponding values from an unordered input sequence of key-value pairs.

Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.

Added in version 3.3.0.

A Simple Example#

The following code snippet demonstrates how to use the cub::DeviceTopK::MinPairs function to find the lowest K items:

const int k     = 4;
auto keys       = thrust::device_vector<int>{5, -3, 1, 7, 8, 2, 4, 6};
auto values     = cuda::make_counting_iterator<int>(0);
auto keys_out   = thrust::device_vector<int>(k, thrust::no_init);
auto values_out = thrust::device_vector<int>(k, thrust::no_init);

// Specify that we do not require a specific output order and do not require deterministic results
auto requirements =
  cuda::execution::require(cuda::execution::determinism::not_guaranteed, cuda::execution::output_ordering::unsorted);

// Prepare CUDA stream
cudaStream_t stream = nullptr;
cudaStreamCreate(&stream);
cuda::stream_ref stream_ref{stream};

// Create the environment with the stream and requirements
auto env = cuda::std::execution::env{stream_ref, requirements};

// Query temporary storage requirements
size_t temp_storage_bytes{};
cub::DeviceTopK::MinPairs(
  nullptr, temp_storage_bytes, keys.begin(), keys_out.begin(), values, values_out.begin(), keys.size(), k, env);

// Allocate temporary storage
thrust::device_vector<char> temp_storage(temp_storage_bytes, thrust::no_init);

cub::DeviceTopK::MinPairs(
  thrust::raw_pointer_cast(temp_storage.data()),
  temp_storage_bytes,
  keys.begin(),
  keys_out.begin(),
  values,
  values_out.begin(),
  keys.size(),
  k,
  env);

// Get the top-k results into sorted order for easy comparison
thrust::sort_by_key(keys_out.begin(), keys_out.end(), values_out.begin());
thrust::host_vector<int> expected_keys{-3, 1, 2, 4};
thrust::host_vector<int> expected_values{1, 2, 5, 6};

Note

The behavior is undefined if the input and output ranges overlap in any way.

Template Parameters:

KeyInputIteratorT – [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
KeyOutputIteratorT – [inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)
ValueInputIteratorT – [inferred] Random-access input iterator type for reading input values (may be a simple pointer type)
ValueOutputIteratorT – [inferred] Random-access input iterator type for writing output values (may be a simple pointer type)
NumItemsT – The integral type of variable num_items
NumOutItemsT – The integral type of variable k

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See :ref:device-temp-storage for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_keys_in – [in] Random-access iterator to the input sequence containing the keys
d_keys_out – [out] Random-access iterator to the output sequence of keys, where K values will be written to
d_values_in – [in] Random-access iterator to the input sequence containing the values associated to each key
d_values_out – [out] Random-access iterator to the output sequence of values, corresponding to the top k keys, where k values will be written to
num_items – [in] Number of items to be read and processed from d_keys_in and d_values_in each
k – [in] The value of K, which is the number of lowest pairs to find from num_items pairs. Capped to a maximum of num_items.
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename ValueInputIteratorT, typename ValueOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename EnvT = ::cuda::std::execution::env<>, ::cuda::std::enable_if_t<!detail::radix::is_valid_decomposer<detail::it_value_t<KeyInputIteratorT>, EnvT>, int> = 0> static inline cudaError_t MinPairs( KeyInputIteratorT d_keys_in, KeyOutputIteratorT d_keys_out, ValueInputIteratorT d_values_in, ValueOutputIteratorT d_values_out, NumItemsT num_items, NumOutItemsT k, const EnvT &env = {} )#

Finds the smallest K keys and their corresponding values from an unordered input sequence of key-value pairs.

Added in version 3.5.0: First appears in CUDA Toolkit 13.5.

This is an environment-based API that allows customization of:

Stream: Query via cuda::get_stream
Memory resource: Query via cuda::mr::get_memory_resource

Unlike the temp-storage overload, this overload allocates and manages the required temporary storage internally using the memory resource queried from the environment.

Snippet#

auto d_keys_in    = thrust::device_vector<int>{8, 6, 7, 5, 3, 0, 9, 1, 4, 2};
auto d_values_in  = thrust::device_vector<int>{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
auto d_keys_out   = thrust::device_vector<int>(3);
auto d_values_out = thrust::device_vector<int>(3);
int k             = 3;

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};
auto env = cuda::std::execution::env{
  cuda::execution::require(cuda::execution::determinism::not_guaranteed, //
                           cuda::execution::output_ordering::unsorted),
  stream_ref};

auto error = cub::DeviceTopK::MinPairs(
  d_keys_in.begin(),
  d_keys_out.begin(),
  d_values_in.begin(),
  d_values_out.begin(),
  static_cast<int>(d_keys_in.size()),
  k,
  env);
if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceTopK::MinPairs failed with status: " << error << '\n';
}
thrust::device_vector<int> expected_keys{0, 1, 2}; // possibly in different order

Note

The behavior is undefined if the input and output ranges overlap in any way.

Template Parameters:

KeyInputIteratorT – [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
KeyOutputIteratorT – [inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)
ValueInputIteratorT – [inferred] Random-access input iterator type for reading input values (may be a simple pointer type)
ValueOutputIteratorT – [inferred] Random-access input iterator type for writing output values (may be a simple pointer type)
NumItemsT – The integral type of variable num_items
NumOutItemsT – The integral type of variable k
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_keys_in – [in] Random-access iterator to the input sequence containing the keys
d_keys_out – [out] Random-access iterator to the output sequence of keys, where K values will be written to
d_values_in – [in] Random-access iterator to the input sequence containing the values associated to each key
d_values_out – [out] Random-access iterator to the output sequence of values, corresponding to the top k keys, where k values will be written to
num_items – [in] Number of items to be read and processed from d_keys_in and d_values_in each
k – [in] The value of K, which is the number of lowest pairs to find from num_items pairs. Capped to a maximum of num_items.
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename ValueInputIteratorT, typename ValueOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename DecomposerT, typename EnvT = ::cuda::std::execution::env<>> static inline ::cuda::std::enable_if_t<detail::radix::is_valid_decomposer<detail::it_value_t<KeyInputIteratorT>, DecomposerT>, cudaError_t> MinPairs( void *d_temp_storage, size_t &temp_storage_bytes, KeyInputIteratorT d_keys_in, KeyOutputIteratorT d_keys_out, ValueInputIteratorT d_values_in, ValueOutputIteratorT d_values_out, NumItemsT num_items, NumOutItemsT k, DecomposerT decomposer, const EnvT &env = {} )#

Overview#

Finds the lowest K keys and their corresponding values from an unordered input sequence of key-value pairs, using a decomposer to interpret user-defined key types.

Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

A Simple Example#

Let’s consider a user-defined custom_t type below. To find the top-k elements of an array of custom_t objects, we have to tell CUB about relevant members of the custom_t type. We do this by providing a decomposer that returns a tuple of references to relevant members of the key.

struct custom_t
{
  float f;
  int unused;
  long long int lli;

  custom_t() = default;
  custom_t(float f, long long int lli)
      : f(f)
      , unused(42)
      , lli(lli)
  {}
};

struct decomposer_t
{
  __host__ __device__ cuda::std::tuple<float&, long long int&> operator()(custom_t& key) const
  {
    return {key.f, key.lli};
  }
};

The following snippet shows how to find the top-k smallest pairs of custom_t objects using cub::DeviceTopK::MinPairs:

constexpr int num_items = 6;
constexpr int k         = 3;

thrust::device_vector<custom_t> keys_in = {
  {+2.5f, 4}, //
  {-2.5f, 0}, //
  {+1.1f, 3}, //
  {+0.0f, 1}, //
  {-0.0f, 2}, //
  {+3.7f, 5} //
};

thrust::device_vector<custom_t> keys_out(k);

const custom_t* d_keys_in = thrust::raw_pointer_cast(keys_in.data());
custom_t* d_keys_out      = thrust::raw_pointer_cast(keys_out.data());

thrust::device_vector<int> vals_in = {0, 1, 2, 3, 4, 5};
thrust::device_vector<int> vals_out(k);

const int* d_vals_in = thrust::raw_pointer_cast(vals_in.data());
int* d_vals_out      = thrust::raw_pointer_cast(vals_out.data());

auto requirements = cuda::execution::require(
  cuda::execution::determinism::not_guaranteed, cuda::execution::output_ordering::unsorted);

std::uint8_t* d_temp_storage{};
std::size_t temp_storage_bytes{};

cub::DeviceTopK::MinPairs(
  d_temp_storage,
  temp_storage_bytes,
  d_keys_in,
  d_keys_out,
  d_vals_in,
  d_vals_out,
  num_items,
  k,
  decomposer_t{},
  requirements);

thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

cub::DeviceTopK::MinPairs(
  d_temp_storage,
  temp_storage_bytes,
  d_keys_in,
  d_keys_out,
  d_vals_in,
  d_vals_out,
  num_items,
  k,
  decomposer_t{},
  requirements);

// Sort by key for comparison (output order is not guaranteed)
thrust::sort_by_key(keys_out.begin(), keys_out.end(), vals_out.begin());

thrust::device_vector<custom_t> expected_keys = {
  {-2.5f, 0}, //
  {+0.0f, 1}, //
  {-0.0f, 2} //
};

thrust::device_vector<int> expected_vals = {1, 3, 4};

Note

The behavior is undefined if the input and output ranges overlap in any way.

Template Parameters:

KeyInputIteratorT – [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
KeyOutputIteratorT – [inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)
ValueInputIteratorT – [inferred] Random-access input iterator type for reading input values (may be a simple pointer type)
ValueOutputIteratorT – [inferred] Random-access input iterator type for writing output values (may be a simple pointer type)
NumItemsT – The integral type of variable num_items
NumOutItemsT – The integral type of variable k
DecomposerT – [inferred] Type of a callable object responsible for decomposing a key into a tuple of references to its constituent arithmetic types.

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See :ref:device-temp-storage for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_keys_in – [in] Random-access iterator to the input sequence containing the keys
d_keys_out – [out] Random-access iterator to the output sequence of keys, where K values will be written to
d_values_in – [in] Random-access iterator to the input sequence containing the values associated to each key
d_values_out – [out] Random-access iterator to the output sequence of values, corresponding to the top k keys, where k values will be written to
num_items – [in] Number of items to be read and processed from d_keys_in and d_values_in each
k – [in] The value of K, which is the number of lowest pairs to find from num_items pairs. Capped to a maximum of num_items.
decomposer – Callable object responsible for decomposing a key into a tuple of references to its constituent arithmetic types.
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename ValueInputIteratorT, typename ValueOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename DecomposerT, typename EnvT = ::cuda::std::execution::env<>, ::cuda::std::enable_if_t<detail::radix::is_valid_decomposer<detail::it_value_t<KeyInputIteratorT>, DecomposerT>, int> = 0> static inline cudaError_t MinPairs( KeyInputIteratorT d_keys_in, KeyOutputIteratorT d_keys_out, ValueInputIteratorT d_values_in, ValueOutputIteratorT d_values_out, NumItemsT num_items, NumOutItemsT k, DecomposerT decomposer, const EnvT &env = {} )#

Finds the smallest K keys and their corresponding values from an unordered input sequence of key-value pairs, using a decomposer to interpret user-defined key types.

Added in version 3.5.0: First appears in CUDA Toolkit 13.5.

This is an environment-based API that allows customization of:

Stream: Query via cuda::get_stream
Memory resource: Query via cuda::mr::get_memory_resource

Unlike the temp-storage overload, this overload allocates and manages the required temporary storage internally using the memory resource queried from the environment.

Snippet#

thrust::host_vector<topk_custom_t> h_keys_in{
  {8, 0}, {6, 1}, {7, 2}, {5, 3}, {3, 4}, {0, 5}, {9, 6}, {1, 7}, {4, 8}, {2, 9}};
thrust::device_vector<topk_custom_t> d_keys_in = h_keys_in;
thrust::device_vector<int> d_values_in{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
thrust::device_vector<topk_custom_t> d_keys_out(3);
thrust::device_vector<int> d_values_out(3);
int k = 3;

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};
auto env = cuda::std::execution::env{
  cuda::execution::require(cuda::execution::determinism::not_guaranteed, //
                           cuda::execution::output_ordering::unsorted),
  stream_ref};

auto error = cub::DeviceTopK::MinPairs(
  d_keys_in.begin(),
  d_keys_out.begin(),
  d_values_in.begin(),
  d_values_out.begin(),
  static_cast<int>(d_keys_in.size()),
  k,
  topk_custom_decomposer_t{},
  env);
if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceTopK::MinPairs failed with status: " << error << '\n';
}
thrust::host_vector<int> expected_ranks{0, 1, 2}; // possibly in different order

Note

The behavior is undefined if the input and output ranges overlap in any way.

Template Parameters:

KeyInputIteratorT – [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
KeyOutputIteratorT – [inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)
ValueInputIteratorT – [inferred] Random-access input iterator type for reading input values (may be a simple pointer type)
ValueOutputIteratorT – [inferred] Random-access input iterator type for writing output values (may be a simple pointer type)
NumItemsT – The integral type of variable num_items
NumOutItemsT – The integral type of variable k
DecomposerT – [inferred] Type of a callable object responsible for decomposing a key into a tuple of references to its constituent arithmetic types.
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_keys_in – [in] Random-access iterator to the input sequence containing the keys
d_keys_out – [out] Random-access iterator to the output sequence of keys, where K values will be written to
d_values_in – [in] Random-access iterator to the input sequence containing the values associated to each key
d_values_out – [out] Random-access iterator to the output sequence of values, corresponding to the top k keys, where k values will be written to
num_items – [in] Number of items to be read and processed from d_keys_in and d_values_in each
k – [in] The value of K, which is the number of lowest pairs to find from num_items pairs. Capped to a maximum of num_items.
decomposer – [in] Callable object responsible for decomposing a key into a tuple of references to its constituent arithmetic types.
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename EnvT = ::cuda::std::execution::env<>, ::cuda::std::enable_if_t<!detail::radix::is_valid_decomposer<detail::it_value_t<KeyInputIteratorT>, EnvT>, int> = 0> static inline cudaError_t MaxKeys( void *d_temp_storage, size_t &temp_storage_bytes, KeyInputIteratorT d_keys_in, KeyOutputIteratorT d_keys_out, NumItemsT num_items, NumOutItemsT k, const EnvT &env = {} )#

Overview#

Finds the largest K keys from an unordered input sequence of keys.

Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.

Added in version 3.3.0.

A Simple Example#

The following code snippet demonstrates how to use the cub::DeviceTopK::MinKeys function to find the largest K items:

const int k = 4;
auto input  = thrust::device_vector<int>{5, -3, 1, 7, 8, 2, 4, 6};
auto output = thrust::device_vector<int>(k, thrust::no_init);

// Specify that we do not require a specific output order and do not require deterministic results
auto requirements =
  cuda::execution::require(cuda::execution::determinism::not_guaranteed, cuda::execution::output_ordering::unsorted);

// Prepare CUDA stream
cudaStream_t stream = nullptr;
cudaStreamCreate(&stream);
cuda::stream_ref stream_ref{stream};

// Create the environment with the stream and requirements
auto env = cuda::std::execution::env{stream_ref, requirements};

// Query temporary storage requirements
size_t temp_storage_bytes{};
cub::DeviceTopK::MaxKeys(nullptr, temp_storage_bytes, input.begin(), output.begin(), input.size(), k, env);

// Allocate temporary storage
thrust::device_vector<char> temp_storage(temp_storage_bytes, thrust::no_init);

cub::DeviceTopK::MaxKeys(
  thrust::raw_pointer_cast(temp_storage.data()),
  temp_storage_bytes,
  input.begin(),
  output.begin(),
  input.size(),
  k,
  env);

// Get the top-k results into sorted order for easy comparison
thrust::sort(output.begin(), output.end(), cuda::std::greater{});
thrust::host_vector<int> expected{8, 7, 6, 5};

Note

The behavior is undefined if the input and output ranges overlap in any way.

Template Parameters:

KeyInputIteratorT – [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
KeyOutputIteratorT – [inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)
NumItemsT – The integral type of variable num_items
NumOutItemsT – The integral type of variable k

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See :ref:device-temp-storage for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_keys_in – [in] Random-access iterator to the input sequence containing the keys
d_keys_out – [out] Random-access iterator to the output sequence of keys, where K values will be written to
num_items – [in] Number of items to be read and processed from d_keys_in
k – [in] The value of K, which is the number of largest pairs to find from num_items pairs. Capped to a maximum of num_items.
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename EnvT = ::cuda::std::execution::env<>, ::cuda::std::enable_if_t<!detail::radix::is_valid_decomposer<detail::it_value_t<KeyInputIteratorT>, EnvT>, int> = 0> static inline cudaError_t MaxKeys( KeyInputIteratorT d_keys_in, KeyOutputIteratorT d_keys_out, NumItemsT num_items, NumOutItemsT k, const EnvT &env = {} )#

Finds the largest K keys from an unordered input sequence.

Added in version 3.5.0: First appears in CUDA Toolkit 13.5.

This is an environment-based API that allows customization of:

Stream: Query via cuda::get_stream
Memory resource: Query via cuda::mr::get_memory_resource

Unlike the temp-storage overload, this overload allocates and manages the required temporary storage internally using the memory resource queried from the environment.

Snippet#

auto d_in  = thrust::device_vector<int>{8, 6, 7, 5, 3, 0, 9, 1, 4, 2};
auto d_out = thrust::device_vector<int>(3);
int k      = 3;

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};
auto env = cuda::std::execution::env{
  cuda::execution::require(cuda::execution::determinism::not_guaranteed, //
                           cuda::execution::output_ordering::unsorted),
  stream_ref};

auto error = cub::DeviceTopK::MaxKeys(d_in.begin(), d_out.begin(), static_cast<int>(d_in.size()), k, env);
if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceTopK::MaxKeys failed with status: " << error << '\n';
}
thrust::device_vector<int> expected{9, 8, 7}; // possibly in different order

Note

The behavior is undefined if the input and output ranges overlap in any way.

Template Parameters:

KeyInputIteratorT – [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
KeyOutputIteratorT – [inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)
NumItemsT – The integral type of variable num_items
NumOutItemsT – The integral type of variable k
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_keys_in – [in] Random-access iterator to the input sequence containing the keys
d_keys_out – [out] Random-access iterator to the output sequence of keys, where K values will be written to
num_items – [in] Number of items to be read and processed from d_keys_in
k – [in] The value of K, which is the number of largest keys to find from num_items keys. Capped to a maximum of num_items.
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename DecomposerT, typename EnvT = ::cuda::std::execution::env<>> static inline ::cuda::std::enable_if_t<detail::radix::is_valid_decomposer<detail::it_value_t<KeyInputIteratorT>, DecomposerT>, cudaError_t> MaxKeys( void *d_temp_storage, size_t &temp_storage_bytes, KeyInputIteratorT d_keys_in, KeyOutputIteratorT d_keys_out, NumItemsT num_items, NumOutItemsT k, DecomposerT decomposer, const EnvT &env = {} )#

Overview#

Finds the largest K keys from an unordered input sequence of keys, using a decomposer to interpret user-defined key types.

Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

A Simple Example#

Let’s consider a user-defined custom_t type below. To find the top-k elements of an array of custom_t objects, we have to tell CUB about relevant members of the custom_t type. We do this by providing a decomposer that returns a tuple of references to relevant members of the key.

struct custom_t
{
  float f;
  int unused;
  long long int lli;

  custom_t() = default;
  custom_t(float f, long long int lli)
      : f(f)
      , unused(42)
      , lli(lli)
  {}
};

struct decomposer_t
{
  __host__ __device__ cuda::std::tuple<float&, long long int&> operator()(custom_t& key) const
  {
    return {key.f, key.lli};
  }
};

The following snippet shows how to find the top-k largest keys of custom_t objects using cub::DeviceTopK::MaxKeys:

constexpr int num_items = 6;
constexpr int k         = 3;

thrust::device_vector<custom_t> in = {
  {+2.5f, 4}, //
  {-2.5f, 0}, //
  {+1.1f, 3}, //
  {+0.0f, 1}, //
  {-0.0f, 2}, //
  {+3.7f, 5} //
};

thrust::device_vector<custom_t> out(k);

const custom_t* d_in = thrust::raw_pointer_cast(in.data());
custom_t* d_out      = thrust::raw_pointer_cast(out.data());

auto requirements = cuda::execution::require(
  cuda::execution::determinism::not_guaranteed, cuda::execution::output_ordering::unsorted);

// 1) Get temp storage size
std::uint8_t* d_temp_storage{};
std::size_t temp_storage_bytes{};

cub::DeviceTopK::MaxKeys(
  d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, k, decomposer_t{}, requirements);

// 2) Allocate temp storage
thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

// 3) Find the top-k largest keys
cub::DeviceTopK::MaxKeys(
  d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, k, decomposer_t{}, requirements);

// Sort output for comparison (output order is not guaranteed)
thrust::sort(out.begin(), out.end(), cuda::std::greater<>{});
thrust::device_vector<custom_t> expected = {
  {+3.7f, 5}, //
  {+2.5f, 4}, //
  {+1.1f, 3} //
};

Note

The behavior is undefined if the input and output ranges overlap in any way.

Template Parameters:

KeyInputIteratorT – [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
KeyOutputIteratorT – [inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)
NumItemsT – The integral type of variable num_items
NumOutItemsT – The integral type of variable k
DecomposerT – [inferred] Type of a callable object responsible for decomposing a key into a tuple of references to its constituent arithmetic types.

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See :ref:device-temp-storage for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_keys_in – [in] Random-access iterator to the input sequence containing the keys
d_keys_out – [out] Random-access iterator to the output sequence of keys, where K values will be written to
num_items – [in] Number of items to be read and processed from d_keys_in
k – [in] The value of K, which is the number of largest keys to find from num_items keys. Capped to a maximum of num_items.
decomposer – Callable object responsible for decomposing a key into a tuple of references to its constituent arithmetic types.
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename DecomposerT, typename EnvT = ::cuda::std::execution::env<>, ::cuda::std::enable_if_t<detail::radix::is_valid_decomposer<detail::it_value_t<KeyInputIteratorT>, DecomposerT>, int> = 0> static inline cudaError_t MaxKeys( KeyInputIteratorT d_keys_in, KeyOutputIteratorT d_keys_out, NumItemsT num_items, NumOutItemsT k, DecomposerT decomposer, const EnvT &env = {} )#

Finds the largest K keys from an unordered input sequence, using a decomposer to interpret user-defined key types.

Added in version 3.5.0: First appears in CUDA Toolkit 13.5.

This is an environment-based API that allows customization of:

Stream: Query via cuda::get_stream
Memory resource: Query via cuda::mr::get_memory_resource

Unlike the temp-storage overload, this overload allocates and manages the required temporary storage internally using the memory resource queried from the environment.

Snippet#

thrust::host_vector<topk_custom_t> h_in{
  {8, 0}, {6, 1}, {7, 2}, {5, 3}, {3, 4}, {0, 5}, {9, 6}, {1, 7}, {4, 8}, {2, 9}};
thrust::device_vector<topk_custom_t> d_in = h_in;
thrust::device_vector<topk_custom_t> d_out(3);
int k = 3;

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};
auto env = cuda::std::execution::env{
  cuda::execution::require(cuda::execution::determinism::not_guaranteed, //
                           cuda::execution::output_ordering::unsorted),
  stream_ref};

auto error = cub::DeviceTopK::MaxKeys(
  d_in.begin(), d_out.begin(), static_cast<int>(d_in.size()), k, topk_custom_decomposer_t{}, env);
if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceTopK::MaxKeys failed with status: " << error << '\n';
}
thrust::host_vector<int> expected_ranks{9, 8, 7}; // possibly in different order

Note

The behavior is undefined if the input and output ranges overlap in any way.

Template Parameters:

KeyInputIteratorT – [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
KeyOutputIteratorT – [inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)
NumItemsT – The integral type of variable num_items
NumOutItemsT – The integral type of variable k
DecomposerT – [inferred] Type of a callable object responsible for decomposing a key into a tuple of references to its constituent arithmetic types.
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_keys_in – [in] Random-access iterator to the input sequence containing the keys
d_keys_out – [out] Random-access iterator to the output sequence of keys, where K values will be written to
num_items – [in] Number of items to be read and processed from d_keys_in
k – [in] The value of K, which is the number of largest keys to find from num_items keys. Capped to a maximum of num_items.
decomposer – [in] Callable object responsible for decomposing a key into a tuple of references to its constituent arithmetic types.
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename EnvT = ::cuda::std::execution::env<>, ::cuda::std::enable_if_t<!detail::radix::is_valid_decomposer<detail::it_value_t<KeyInputIteratorT>, EnvT>, int> = 0> static inline cudaError_t MinKeys( void *d_temp_storage, size_t &temp_storage_bytes, KeyInputIteratorT d_keys_in, KeyOutputIteratorT d_keys_out, NumItemsT num_items, NumOutItemsT k, const EnvT &env = {} )#

Overview#

Finds the lowest K keys from an unordered input sequence of keys.

Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.

Added in version 3.3.0.

A Simple Example#

The following code snippet demonstrates how to use the cub::DeviceTopK::MinKeys function to find the lowest K items:

const int k = 4;
auto input  = thrust::device_vector<int>{5, -3, 1, 7, 8, 2, 4, 6};
auto output = thrust::device_vector<int>(k, thrust::no_init);

// Specify that we do not require a specific output order and do not require deterministic results
auto requirements =
  cuda::execution::require(cuda::execution::determinism::not_guaranteed, cuda::execution::output_ordering::unsorted);

// Prepare CUDA stream
cudaStream_t stream = nullptr;
cudaStreamCreate(&stream);
cuda::stream_ref stream_ref{stream};

// Create the environment with the stream and requirements
auto env = cuda::std::execution::env{stream_ref, requirements};

// Query temporary storage requirements
size_t temp_storage_bytes{};
cub::DeviceTopK::MinKeys(nullptr, temp_storage_bytes, input.begin(), output.begin(), input.size(), k, env);

// Allocate temporary storage
thrust::device_vector<char> temp_storage(temp_storage_bytes, thrust::no_init);

cub::DeviceTopK::MinKeys(
  thrust::raw_pointer_cast(temp_storage.data()),
  temp_storage_bytes,
  input.begin(),
  output.begin(),
  input.size(),
  k,
  env);

// Get the top-k results into sorted order for easy comparison
thrust::sort(output.begin(), output.end());
thrust::host_vector<int> expected{-3, 1, 2, 4};

Note

The behavior is undefined if the input and output ranges overlap in any way.

Template Parameters:

KeyInputIteratorT – [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
KeyOutputIteratorT – [inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)
NumItemsT – The integral type of variable num_items
NumOutItemsT – The integral type of variable k

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See :ref:device-temp-storage for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_keys_in – [in] Random-access iterator to the input sequence containing the keys
d_keys_out – [out] Random-access iterator to the output sequence of keys, where K values will be written to
num_items – [in] Number of items to be read and processed from d_keys_in
k – [in] The value of K, which is the number of largest pairs to find from num_items pairs. Capped to a maximum of num_items.
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename EnvT = ::cuda::std::execution::env<>, ::cuda::std::enable_if_t<!detail::radix::is_valid_decomposer<detail::it_value_t<KeyInputIteratorT>, EnvT>, int> = 0> static inline cudaError_t MinKeys( KeyInputIteratorT d_keys_in, KeyOutputIteratorT d_keys_out, NumItemsT num_items, NumOutItemsT k, const EnvT &env = {} )#

Finds the smallest K keys from an unordered input sequence.

Added in version 3.5.0: First appears in CUDA Toolkit 13.5.

This is an environment-based API that allows customization of:

Stream: Query via cuda::get_stream
Memory resource: Query via cuda::mr::get_memory_resource

Unlike the temp-storage overload, this overload allocates and manages the required temporary storage internally using the memory resource queried from the environment.

Snippet#

auto d_in  = thrust::device_vector<int>{8, 6, 7, 5, 3, 0, 9, 1, 4, 2};
auto d_out = thrust::device_vector<int>(3);
int k      = 3;

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};
auto env = cuda::std::execution::env{
  cuda::execution::require(cuda::execution::determinism::not_guaranteed, //
                           cuda::execution::output_ordering::unsorted),
  stream_ref};

auto error = cub::DeviceTopK::MinKeys(d_in.begin(), d_out.begin(), static_cast<int>(d_in.size()), k, env);
if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceTopK::MinKeys failed with status: " << error << '\n';
}
thrust::device_vector<int> expected{0, 1, 2}; // possibly in different order

Note

The behavior is undefined if the input and output ranges overlap in any way.

Template Parameters:

KeyInputIteratorT – [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
KeyOutputIteratorT – [inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)
NumItemsT – The integral type of variable num_items
NumOutItemsT – The integral type of variable k
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_keys_in – [in] Random-access iterator to the input sequence containing the keys
d_keys_out – [out] Random-access iterator to the output sequence of keys, where K values will be written to
num_items – [in] Number of items to be read and processed from d_keys_in
k – [in] The value of K, which is the number of lowest keys to find from num_items keys. Capped to a maximum of num_items.
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename DecomposerT, typename EnvT = ::cuda::std::execution::env<>> static inline ::cuda::std::enable_if_t<detail::radix::is_valid_decomposer<detail::it_value_t<KeyInputIteratorT>, DecomposerT>, cudaError_t> MinKeys( void *d_temp_storage, size_t &temp_storage_bytes, KeyInputIteratorT d_keys_in, KeyOutputIteratorT d_keys_out, NumItemsT num_items, NumOutItemsT k, DecomposerT decomposer, const EnvT &env = {} )#

Overview#

Finds the lowest K keys from an unordered input sequence of keys, using a decomposer to interpret user-defined key types.

Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See Two-Phase API (explicit temporary storage management) for usage guidance.

Added in version 3.4.0: First appears in CUDA Toolkit 13.4.

A Simple Example#

Let’s consider a user-defined custom_t type below. To find the top-k elements of an array of custom_t objects, we have to tell CUB about relevant members of the custom_t type. We do this by providing a decomposer that returns a tuple of references to relevant members of the key.

struct custom_t
{
  float f;
  int unused;
  long long int lli;

  custom_t() = default;
  custom_t(float f, long long int lli)
      : f(f)
      , unused(42)
      , lli(lli)
  {}
};

struct decomposer_t
{
  __host__ __device__ cuda::std::tuple<float&, long long int&> operator()(custom_t& key) const
  {
    return {key.f, key.lli};
  }
};

The following snippet shows how to find the top-k smallest keys of custom_t objects using cub::DeviceTopK::MinKeys:

constexpr int num_items = 6;
constexpr int k         = 3;

thrust::device_vector<custom_t> in = {
  {+2.5f, 4}, //
  {-2.5f, 0}, //
  {+1.1f, 3}, //
  {+0.0f, 1}, //
  {-0.0f, 2}, //
  {+3.7f, 5} //
};

thrust::device_vector<custom_t> out(k);

const custom_t* d_in = thrust::raw_pointer_cast(in.data());
custom_t* d_out      = thrust::raw_pointer_cast(out.data());

auto requirements = cuda::execution::require(
  cuda::execution::determinism::not_guaranteed, cuda::execution::output_ordering::unsorted);

std::uint8_t* d_temp_storage{};
std::size_t temp_storage_bytes{};

cub::DeviceTopK::MinKeys(
  d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, k, decomposer_t{}, requirements);

thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

cub::DeviceTopK::MinKeys(
  d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, k, decomposer_t{}, requirements);

// Sort output for comparison (output order is not guaranteed)
thrust::sort(out.begin(), out.end());
thrust::device_vector<custom_t> expected = {
  {-2.5f, 0}, //
  {+0.0f, 1}, //
  {-0.0f, 2} //
};

Note

The behavior is undefined if the input and output ranges overlap in any way.

Template Parameters:

KeyInputIteratorT – [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
KeyOutputIteratorT – [inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)
NumItemsT – The integral type of variable num_items
NumOutItemsT – The integral type of variable k
DecomposerT – [inferred] Type of a callable object responsible for decomposing a key into a tuple of references to its constituent arithmetic types.

Parameters:

d_temp_storage – [in] Temporary storage for this operation. If d_temp_storage is nullptr, the required size is written to temp_storage_bytes without dereferencing iterators or launching kernels. Otherwise, d_temp_storage must point to a device-accessible allocation of at least temp_storage_bytes bytes. No special alignment is required. See :ref:device-temp-storage for usage guidance.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_keys_in – [in] Random-access iterator to the input sequence containing the keys
d_keys_out – [out] Random-access iterator to the output sequence of keys, where K values will be written to
num_items – [in] Number of items to be read and processed from d_keys_in
k – [in] The value of K, which is the number of lowest keys to find from num_items keys. Capped to a maximum of num_items.
decomposer – Callable object responsible for decomposing a key into a tuple of references to its constituent arithmetic types.
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename DecomposerT, typename EnvT = ::cuda::std::execution::env<>, ::cuda::std::enable_if_t<detail::radix::is_valid_decomposer<detail::it_value_t<KeyInputIteratorT>, DecomposerT>, int> = 0> static inline cudaError_t MinKeys( KeyInputIteratorT d_keys_in, KeyOutputIteratorT d_keys_out, NumItemsT num_items, NumOutItemsT k, DecomposerT decomposer, const EnvT &env = {} )#

Finds the smallest K keys from an unordered input sequence, using a decomposer to interpret user-defined key types.

Added in version 3.5.0: First appears in CUDA Toolkit 13.5.

This is an environment-based API that allows customization of:

Stream: Query via cuda::get_stream
Memory resource: Query via cuda::mr::get_memory_resource

Unlike the temp-storage overload, this overload allocates and manages the required temporary storage internally using the memory resource queried from the environment.

Snippet#

thrust::host_vector<topk_custom_t> h_in{
  {8, 0}, {6, 1}, {7, 2}, {5, 3}, {3, 4}, {0, 5}, {9, 6}, {1, 7}, {4, 8}, {2, 9}};
thrust::device_vector<topk_custom_t> d_in = h_in;
thrust::device_vector<topk_custom_t> d_out(3);
int k = 3;

cuda::stream stream{cuda::devices[0]};
cuda::stream_ref stream_ref{stream};
auto env = cuda::std::execution::env{
  cuda::execution::require(cuda::execution::determinism::not_guaranteed, //
                           cuda::execution::output_ordering::unsorted),
  stream_ref};

auto error = cub::DeviceTopK::MinKeys(
  d_in.begin(), d_out.begin(), static_cast<int>(d_in.size()), k, topk_custom_decomposer_t{}, env);
if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceTopK::MinKeys failed with status: " << error << '\n';
}
thrust::host_vector<int> expected_ranks{0, 1, 2}; // possibly in different order

Note

The behavior is undefined if the input and output ranges overlap in any way.

Template Parameters:

KeyInputIteratorT – [inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)
KeyOutputIteratorT – [inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)
NumItemsT – The integral type of variable num_items
NumOutItemsT – The integral type of variable k
DecomposerT – [inferred] Type of a callable object responsible for decomposing a key into a tuple of references to its constituent arithmetic types.
EnvT – [inferred] Execution environment type. Default is cuda::std::execution::env<>.

Parameters:

d_keys_in – [in] Random-access iterator to the input sequence containing the keys
d_keys_out – [out] Random-access iterator to the output sequence of keys, where K values will be written to
num_items – [in] Number of items to be read and processed from d_keys_in
k – [in] The value of K, which is the number of lowest keys to find from num_items keys. Capped to a maximum of num_items.
decomposer – [in] Callable object responsible for decomposing a key into a tuple of references to its constituent arithmetic types.
env – [in]
[optional] Execution environment. Default is cuda::std::execution::env{}.