cub::DeviceTopK#

struct DeviceTopK#

DeviceTopK provides device-wide, parallel operations for finding the largest (or smallest) K items from sequences of unordered data items residing within device-accessible memory.

Overview#

The TopK algorithm tries to find the largest (or smallest) K items in an unordered list. A related problem is called K selection problem, which finds the Kth largest (or smallest) values in a list. DeviceTopK will return K items in an unspecified order as results. It is based on an algorithm called AIR TopK.

Supported Types#

DeviceTopK can process all of the built-in C++ numeric primitive types (unsigned char, int, double, etc.) as well as CUDA’s __half and __nv_bfloat16 16-bit floating-point types.

Determinism#

DeviceTopK currently only supports unordered output, which may be non-deterministic for certain inputs. That is, if there are multiple items across the k-th position that compare equal, the subset of tied elements that ends up in the returned top‑k is not uniquely defined and may vary between runs. This behavior has to be explicitly acknowledged by the user by passing cuda::execution::determinism::not_guaranteed.

Usage Considerations#

  • Dynamic parallelism. DeviceTopK methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

Performance#

The work-complexity of top-k as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.

Public Static Functions

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename ValueInputIteratorT, typename ValueOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t MaxPairs(
void *d_temp_storage,
size_t &temp_storage_bytes,
KeyInputIteratorT d_keys_in,
KeyOutputIteratorT d_keys_out,
ValueInputIteratorT d_values_in,
ValueOutputIteratorT d_values_out,
NumItemsT num_items,
NumOutItemsT k,
EnvT env = {},
)#

Overview#

Finds the largest K keys and their corresponding values from an unordered input sequence of key-value pairs.

A Simple Example#

The following code snippet demonstrates how to use the cub::DeviceTopK::MaxPairs function to find the largest K items:

const int k     = 4;
auto keys       = thrust::device_vector<int>{5, -3, 1, 7, 8, 2, 4, 6};
auto values     = cuda::make_counting_iterator<int>(0);
auto keys_out   = thrust::device_vector<int>(k, thrust::no_init);
auto values_out = thrust::device_vector<int>(k, thrust::no_init);

// Specify that we do not require a specific output order and do not require deterministic results
auto requirements =
  cuda::execution::require(cuda::execution::determinism::not_guaranteed, cuda::execution::output_ordering::unsorted);

// Prepare CUDA stream
cudaStream_t stream = nullptr;
cudaStreamCreate(&stream);
cuda::stream_ref stream_ref{stream};

// Create the environment with the stream and requirements
auto env = cuda::std::execution::env{stream_ref, requirements};

// Query temporary storage requirements
size_t temp_storage_bytes{};
cub::DeviceTopK::MaxPairs(
  nullptr, temp_storage_bytes, keys.begin(), keys_out.begin(), values, values_out.begin(), keys.size(), k, env);

// Allocate temporary storage
thrust::device_vector<char> temp_storage(temp_storage_bytes, thrust::no_init);

cub::DeviceTopK::MaxPairs(
  thrust::raw_pointer_cast(temp_storage.data()),
  temp_storage_bytes,
  keys.begin(),
  keys_out.begin(),
  values,
  values_out.begin(),
  keys.size(),
  k,
  env);

// Get the top-k results into sorted order for easy comparison
thrust::sort_by_key(keys_out.begin(), keys_out.end(), values_out.begin(), cuda::std::greater<>{});
thrust::host_vector<int> expected_keys{8, 7, 6, 5};
thrust::host_vector<int> expected_values{4, 3, 7, 0};

Template Parameters:
  • KeyInputIteratorT[inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)

  • KeyOutputIteratorT[inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)

  • ValueInputIteratorT[inferred] Random-access input iterator type for reading input values (may be a simple pointer type)

  • ValueOutputIteratorT[inferred] Random-access input iterator type for writing output values (may be a simple pointer type)

  • NumItemsT – The integral type of variable num_items

  • NumOutItemsT – The integral type of variable k

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_keys_in[in] Random-access iterator to the input sequence containing the keys

  • d_keys_out[out] Random-access iterator to the output sequence of keys, where K values will be written to

  • d_values_in[in] Random-access iterator to the input sequence containing the values associated to each key

  • d_values_out[out] Random-access iterator to the output sequence of values, corresponding to the top k keys, where k values will be written to

  • num_items[in] Number of items to be read and processed from d_keys_in and d_values_in each

  • k[in] The value of K, which is the number of largest pairs to find from num_items pairs. Capped to a maximum of num_items.

  • env[in]

    [optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename ValueInputIteratorT, typename ValueOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t MinPairs(
void *d_temp_storage,
size_t &temp_storage_bytes,
KeyInputIteratorT d_keys_in,
KeyOutputIteratorT d_keys_out,
ValueInputIteratorT d_values_in,
ValueOutputIteratorT d_values_out,
NumItemsT num_items,
NumOutItemsT k,
EnvT env = {},
)#

Overview#

Finds the lowest K keys and their corresponding values from an unordered input sequence of key-value pairs.

A Simple Example#

The following code snippet demonstrates how to use the cub::DeviceTopK::MinPairs function to find the lowest K items:

const int k     = 4;
auto keys       = thrust::device_vector<int>{5, -3, 1, 7, 8, 2, 4, 6};
auto values     = cuda::make_counting_iterator<int>(0);
auto keys_out   = thrust::device_vector<int>(k, thrust::no_init);
auto values_out = thrust::device_vector<int>(k, thrust::no_init);

// Specify that we do not require a specific output order and do not require deterministic results
auto requirements =
  cuda::execution::require(cuda::execution::determinism::not_guaranteed, cuda::execution::output_ordering::unsorted);

// Prepare CUDA stream
cudaStream_t stream = nullptr;
cudaStreamCreate(&stream);
cuda::stream_ref stream_ref{stream};

// Create the environment with the stream and requirements
auto env = cuda::std::execution::env{stream_ref, requirements};

// Query temporary storage requirements
size_t temp_storage_bytes{};
cub::DeviceTopK::MinPairs(
  nullptr, temp_storage_bytes, keys.begin(), keys_out.begin(), values, values_out.begin(), keys.size(), k, env);

// Allocate temporary storage
thrust::device_vector<char> temp_storage(temp_storage_bytes, thrust::no_init);

cub::DeviceTopK::MinPairs(
  thrust::raw_pointer_cast(temp_storage.data()),
  temp_storage_bytes,
  keys.begin(),
  keys_out.begin(),
  values,
  values_out.begin(),
  keys.size(),
  k,
  env);

// Get the top-k results into sorted order for easy comparison
thrust::sort_by_key(keys_out.begin(), keys_out.end(), values_out.begin());
thrust::host_vector<int> expected_keys{-3, 1, 2, 4};
thrust::host_vector<int> expected_values{1, 2, 5, 6};

Template Parameters:
  • KeyInputIteratorT[inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)

  • KeyOutputIteratorT[inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)

  • ValueInputIteratorT[inferred] Random-access input iterator type for reading input values (may be a simple pointer type)

  • ValueOutputIteratorT[inferred] Random-access input iterator type for writing output values (may be a simple pointer type)

  • NumItemsT – The integral type of variable num_items

  • NumOutItemsT – The integral type of variable k

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_keys_in[in] Random-access iterator to the input sequence containing the keys

  • d_keys_out[out] Random-access iterator to the output sequence of keys, where K values will be written to

  • d_values_in[in] Random-access iterator to the input sequence containing the values associated to each key

  • d_values_out[out] Random-access iterator to the output sequence of values, corresponding to the top k keys, where k values will be written to

  • num_items[in] Number of items to be read and processed from d_keys_in and d_values_in each

  • k[in] The value of K, which is the number of lowest pairs to find from num_items pairs. Capped to a maximum of num_items.

  • env[in]

    [optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t MaxKeys(
void *d_temp_storage,
size_t &temp_storage_bytes,
KeyInputIteratorT d_keys_in,
KeyOutputIteratorT d_keys_out,
NumItemsT num_items,
NumOutItemsT k,
EnvT env = {},
)#

Overview#

Finds the largest K keys from an unordered input sequence of keys.

A Simple Example#

The following code snippet demonstrates how to use the cub::DeviceTopK::MinKeys function to find the largest K items:

const int k = 4;
auto input  = thrust::device_vector<int>{5, -3, 1, 7, 8, 2, 4, 6};
auto output = thrust::device_vector<int>(k, thrust::no_init);

// Specify that we do not require a specific output order and do not require deterministic results
auto requirements =
  cuda::execution::require(cuda::execution::determinism::not_guaranteed, cuda::execution::output_ordering::unsorted);

// Prepare CUDA stream
cudaStream_t stream = nullptr;
cudaStreamCreate(&stream);
cuda::stream_ref stream_ref{stream};

// Create the environment with the stream and requirements
auto env = cuda::std::execution::env{stream_ref, requirements};

// Query temporary storage requirements
size_t temp_storage_bytes{};
cub::DeviceTopK::MaxKeys(nullptr, temp_storage_bytes, input.begin(), output.begin(), input.size(), k, env);

// Allocate temporary storage
thrust::device_vector<char> temp_storage(temp_storage_bytes, thrust::no_init);

cub::DeviceTopK::MaxKeys(
  thrust::raw_pointer_cast(temp_storage.data()),
  temp_storage_bytes,
  input.begin(),
  output.begin(),
  input.size(),
  k,
  env);

// Get the top-k results into sorted order for easy comparison
thrust::sort(output.begin(), output.end(), cuda::std::greater{});
thrust::host_vector<int> expected{8, 7, 6, 5};

Template Parameters:
  • KeyInputIteratorT[inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)

  • KeyOutputIteratorT[inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)

  • NumItemsT – The integral type of variable num_items

  • NumOutItemsT – The integral type of variable k

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_keys_in[in] Random-access iterator to the input sequence containing the keys

  • d_keys_out[out] Random-access iterator to the output sequence of keys, where K values will be written to

  • num_items[in] Number of items to be read and processed from d_keys_in

  • k[in] The value of K, which is the number of largest pairs to find from num_items pairs. Capped to a maximum of num_items.

  • env[in]

    [optional] Execution environment. Default is cuda::std::execution::env{}.

template<typename KeyInputIteratorT, typename KeyOutputIteratorT, typename NumItemsT, typename NumOutItemsT, typename EnvT = ::cuda::std::execution::env<>>
static inline cudaError_t MinKeys(
void *d_temp_storage,
size_t &temp_storage_bytes,
KeyInputIteratorT d_keys_in,
KeyOutputIteratorT d_keys_out,
NumItemsT num_items,
NumOutItemsT k,
EnvT env = {},
)#

Overview#

Finds the lowest K keys from an unordered input sequence of keys.

A Simple Example#

The following code snippet demonstrates how to use the cub::DeviceTopK::MinKeys function to find the lowest K items:

const int k = 4;
auto input  = thrust::device_vector<int>{5, -3, 1, 7, 8, 2, 4, 6};
auto output = thrust::device_vector<int>(k, thrust::no_init);

// Specify that we do not require a specific output order and do not require deterministic results
auto requirements =
  cuda::execution::require(cuda::execution::determinism::not_guaranteed, cuda::execution::output_ordering::unsorted);

// Prepare CUDA stream
cudaStream_t stream = nullptr;
cudaStreamCreate(&stream);
cuda::stream_ref stream_ref{stream};

// Create the environment with the stream and requirements
auto env = cuda::std::execution::env{stream_ref, requirements};

// Query temporary storage requirements
size_t temp_storage_bytes{};
cub::DeviceTopK::MinKeys(nullptr, temp_storage_bytes, input.begin(), output.begin(), input.size(), k, env);

// Allocate temporary storage
thrust::device_vector<char> temp_storage(temp_storage_bytes, thrust::no_init);

cub::DeviceTopK::MinKeys(
  thrust::raw_pointer_cast(temp_storage.data()),
  temp_storage_bytes,
  input.begin(),
  output.begin(),
  input.size(),
  k,
  env);

// Get the top-k results into sorted order for easy comparison
thrust::sort(output.begin(), output.end());
thrust::host_vector<int> expected{-3, 1, 2, 4};

Template Parameters:
  • KeyInputIteratorT[inferred] Random-access input iterator type for reading input keys (may be a simple pointer type)

  • KeyOutputIteratorT[inferred] Random-access output iterator type for writing output keys (may be a simple pointer type)

  • NumItemsT – The integral type of variable num_items

  • NumOutItemsT – The integral type of variable k

Parameters:
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_keys_in[in] Random-access iterator to the input sequence containing the keys

  • d_keys_out[out] Random-access iterator to the output sequence of keys, where K values will be written to

  • num_items[in] Number of items to be read and processed from d_keys_in

  • k[in] The value of K, which is the number of largest pairs to find from num_items pairs. Capped to a maximum of num_items.

  • env[in]

    [optional] Execution environment. Default is cuda::std::execution::env{}.