Sampling#

struct SamplingParams#

Structure to hold sampling parameters.

Public Functions

inline SamplingParams(
int32_t batchSize_,
int32_t vocabSize_,
float temperature_ = 1.0f,
int32_t topK_ = 0,
float topP_ = 1.0f
)#

Constructor with default values.

Parameters:
  • batchSize_ – Number of samples in the batch

  • vocabSize_ – Size of the vocabulary

  • temperature_ – Temperature parameter (default: 1.0f)

  • topK_ – Top-K parameter (default: 0, disabled)

  • topP_ – Top-P parameter (default: 1.0f, disabled)

Throws:

std::invalid_argument – if neither top-K nor top-P sampling is enabled, or if the temperature is invalid

Public Members

int32_t batchSize#

Number of samples in the batch.

int32_t vocabSize#

Size of the vocabulary.

float temperature#

Temperature parameter for sampling (higher = more random).

int32_t topK#

Top-K sampling parameter (0 = disabled).

float topP#

Top-P (nucleus) sampling parameter (1.0 = disabled).

bool useTopK#

Flag indicating if top-K sampling is enabled.

bool useTopP#

Flag indicating if top-P sampling is enabled.
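
For illustration, a minimal construction sketch (the surrounding function and the chosen values are assumptions; the library header include is omitted because its path is not given in this section):

trt_edgellm::SamplingParams makeParams()
{
    // useTopK / useTopP are set to reflect which strategies are enabled; per the
    // constructor documentation, leaving both topK_ and topP_ at their disabled
    // defaults would throw std::invalid_argument.
    return trt_edgellm::SamplingParams(
        /*batchSize_*/ 4,
        /*vocabSize_*/ 32000,
        /*temperature_*/ 0.8f,
        /*topK_*/ 50,     // enable top-K with K = 50
        /*topP_*/ 0.95f); // enable nucleus sampling with p = 0.95
}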

size_t trt_edgellm::getTopKtopPSamplingWorkspaceSize(
int32_t batchSize,
int32_t vocabSize,
SamplingParams const &params
)#

Get workspace size required for top-K/top-P sampling (FP32 only).

Calculates the amount of GPU memory needed for intermediate computations during the sampling operation. The workspace must be allocated before calling topKtopPSamplingFromLogits().

Parameters:
  • batchSize[in] Batch size for sampling

  • vocabSize[in] Vocabulary size

  • params[in] Sampling parameters

Returns:

Required workspace size in bytes
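
A sketch of the allocation pattern (raw cudaMalloc is shown for clarity; in practice the returned bytes would back the [GPU, Int8] workspace tensor passed to topKtopPSamplingFromLogits()):

#include <cuda_runtime.h>
#include <cstddef>

// Query the required size for the given sampling parameters, then reserve device memory.
void* allocateSamplingWorkspace(trt_edgellm::SamplingParams const& params)
{
    size_t const bytes = trt_edgellm::getTopKtopPSamplingWorkspaceSize(
        params.batchSize, params.vocabSize, params);
    void* workspace = nullptr;
    cudaMalloc(&workspace, bytes); // caller releases with cudaFree
    return workspace;
}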

size_t trt_edgellm::getSelectAllTopKWorkspaceSize(
int32_t batchSize,
int32_t vocabSize,
int32_t topK
)#

Get workspace size required for selectAllTopK operation (FP32 only).

Calculates the amount of GPU memory needed for intermediate computations during the top-K selection operation. The workspace must be allocated before calling selectAllTopK().

Parameters:
  • batchSize[in] Batch size for selection

  • vocabSize[in] Vocabulary size

  • topK[in] Number of top elements to select

Returns:

Required workspace size in bytes
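
The selection path follows the same pattern; for example (values are illustrative):

// Workspace needed to pick the 40 largest logits per row from a 4 x 32000 input.
size_t const selectionBytes =
    trt_edgellm::getSelectAllTopKWorkspaceSize(/*batchSize*/ 4, /*vocabSize*/ 32000, /*topK*/ 40);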

void trt_edgellm::topKtopPSamplingFromLogits(
rt::Tensor const &logits,
rt::Tensor &selectedIndices,
SamplingParams const &params,
rt::Tensor &workspace,
cudaStream_t stream,
uint64_t philoxSeed = 42,
uint64_t philoxOffset = 0
)#

Main sampling function for top-K and top-P sampling from logits.

Performs token sampling using top-K and/or top-P (nucleus) sampling strategies on the input logits. The function applies temperature scaling and returns the selected token indices for each batch element.

Parameters:
  • logits[in] Input logits tensor [GPU, Float] with shape [batch-size, vocab-size]

  • selectedIndices[out] Selected token indices [GPU, Int32] with shape [batch-size, 1]

  • params[in] Sampling parameters including batch size, vocab size, temperature, top-K, and top-P values

  • workspace[inout] Workspace buffer [GPU, Int8] for intermediate computations

  • stream[in] CUDA stream to execute the kernel

  • philoxSeed[in] Random seed for sampling (default: 42)

  • philoxOffset[in] Random offset for sampling (default: 0)
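
An end-to-end sketch under stated assumptions: makeDeviceTensor() is a hypothetical helper standing in for whatever rt::Tensor factory the runtime provides, and only the trt_edgellm functions and SamplingParams documented in this section are actual API.

#include <cuda_runtime.h>
#include <cstdint>

// logits:          [GPU, Float], shape [batchSize, vocabSize]
// selectedIndices: [GPU, Int32], shape [batchSize, 1]
void sampleTokens(rt::Tensor const& logits, rt::Tensor& selectedIndices, cudaStream_t stream)
{
    int32_t const batchSize = 4;
    int32_t const vocabSize = 32000;

    trt_edgellm::SamplingParams params(batchSize, vocabSize,
        /*temperature_*/ 0.7f, /*topK_*/ 40, /*topP_*/ 1.0f); // top-P left disabled

    // The workspace must be sized and allocated before the sampling call.
    size_t const wsBytes =
        trt_edgellm::getTopKtopPSamplingWorkspaceSize(batchSize, vocabSize, params);
    rt::Tensor workspace = makeDeviceTensor(wsBytes); // hypothetical [GPU, Int8] factory

    // Temperature scaling plus top-K filtering, one sampled token per batch row.
    trt_edgellm::topKtopPSamplingFromLogits(logits, selectedIndices, params, workspace,
        stream, /*philoxSeed*/ 1234, /*philoxOffset*/ 0);
}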

void trt_edgellm::selectAllTopK(
rt::Tensor const &input,
rt::OptionalOutputTensor topKValues,
rt::Tensor &topKIndices,
int32_t topK,
rt::Tensor &workspace,
cudaStream_t stream
)#

Select all top-K elements from the input tensor.

Returns the top-K indices and the corresponding raw values from the input, with no transformations applied. This function identifies the K largest elements in each batch row and returns their indices and, optionally, their values.

Parameters:
  • input[in] Input tensor [GPU, Float] with shape [batch-size, vocab-size]

  • topKValues[out] Optional top-K values [GPU, Float] with shape [batch-size, top-K]. Can be std::nullopt if values are not needed

  • topKIndices[out] Top-K indices [GPU, Int32] with shape [batch-size, top-K]

  • topK[in] Number of top elements to select

  • workspace[inout] Workspace buffer [GPU, Int8] for intermediate computations

  • stream[in] CUDA stream to execute the kernel
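
A sketch of indices-only selection, again assuming hypothetical tensor factories (makeInt32Tensor, makeByteTensor) that are not part of the documented API:

#include <cuda_runtime.h>
#include <cstdint>
#include <optional>

// input: [GPU, Float], shape [batchSize, vocabSize]
void selectTopFive(rt::Tensor const& input, cudaStream_t stream)
{
    int32_t const batchSize = 4;
    int32_t const vocabSize = 32000;
    int32_t const topK = 5;

    rt::Tensor topKIndices = makeInt32Tensor(batchSize, topK); // hypothetical helper
    size_t const wsBytes =
        trt_edgellm::getSelectAllTopKWorkspaceSize(batchSize, vocabSize, topK);
    rt::Tensor workspace = makeByteTensor(wsBytes);            // hypothetical helper

    // Pass std::nullopt for the values output when only the indices are needed.
    trt_edgellm::selectAllTopK(input, std::nullopt, topKIndices, topK, workspace, stream);
}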

void trt_edgellm::mapReducedVocabToFullVocab(
rt::Tensor &vocabIds,
rt::Tensor const &vocabMappingTable,
cudaStream_t stream
)#

Map reduced vocabulary IDs to full vocabulary IDs using a lookup table (in-place).

Performs in-place mapping from the reduced vocabulary space to the full vocabulary space using the provided lookup table, modifying the input tensor directly: vocabIds[i] = vocabMappingTable[vocabIds[i]]

Parameters:
  • vocabIds[inout] Tensor [GPU, Int32] containing reduced vocabulary IDs as input, will be overwritten with full vocabulary IDs as output

  • vocabMappingTable[in] Lookup table [GPU, Int32] with shape [reduced_vocab_size] mapping reduced IDs to full IDs

  • stream[in] CUDA stream to execute the kernel
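
A sketch of the typical use after sampling against a reduced vocabulary (building the mapping table itself is outside the scope of this section):

#include <cuda_runtime.h>

// sampledIds:   [GPU, Int32], reduced-vocabulary IDs (e.g. from topKtopPSamplingFromLogits)
// mappingTable: [GPU, Int32], shape [reduced_vocab_size]
void remapToFullVocab(rt::Tensor& sampledIds, rt::Tensor const& mappingTable, cudaStream_t stream)
{
    // In place: sampledIds[i] = mappingTable[sampledIds[i]]
    trt_edgellm::mapReducedVocabToFullVocab(sampledIds, mappingTable, stream);
}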