Sampling
- struct SamplingParams
Structure to hold sampling parameters.
Public Functions
- inline SamplingParams(
- int32_t batchSize_,
- int32_t vocabSize_,
- float temperature_ = 1.0f,
- int32_t topK_ = 0,
- float topP_ = 1.0f
)
Constructor with default values.
- Parameters:
batchSize_ – Number of samples in the batch
vocabSize_ – Size of the vocabulary
temperature_ – Temperature parameter (default: 1.0f)
topK_ – Top-K parameter (default: 0, disabled)
topP_ – Top-P parameter (default: 1.0f, disabled)
- Throws:
std::invalid_argument – if neither topK nor topP is set, or if temperature is invalid
Public Members
- int32_t batchSize
Number of samples in the batch.
- int32_t vocabSize
Size of the vocabulary.
- float temperature
Temperature parameter for sampling (higher = more random).
- int32_t topK
Top-K sampling parameter (0 = disabled).
- float topP
Top-P (nucleus) sampling parameter (1.0 = disabled).
- bool useTopK
Flag indicating if top-K sampling is enabled.
- bool useTopP
Flag indicating if top-P sampling is enabled.
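The relationship between the constructor arguments, the useTopK/useTopP flags, and the documented exception can be sketched as follows. This is a hypothetical stand-in (SamplingParamsSketch), not the library's struct: the exact rules for deriving the flags and for what counts as an invalid temperature are assumptions inferred from the parameter descriptions above.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for SamplingParams. The validation rules below are
// assumptions; the real constructor may check different conditions.
struct SamplingParamsSketch
{
    int32_t batchSize;
    int32_t vocabSize;
    float temperature;
    int32_t topK;
    float topP;
    bool useTopK;
    bool useTopP;

    SamplingParamsSketch(int32_t batchSize_, int32_t vocabSize_, float temperature_ = 1.0f,
        int32_t topK_ = 0, float topP_ = 1.0f)
        : batchSize(batchSize_)
        , vocabSize(vocabSize_)
        , temperature(temperature_)
        , topK(topK_)
        , topP(topP_)
        , useTopK(topK_ > 0)                    // assumed: 0 disables top-K
        , useTopP(topP_ > 0.0f && topP_ < 1.0f) // assumed: 1.0 disables top-P
    {
        if (!useTopK && !useTopP)
        {
            throw std::invalid_argument("either top-K or top-P must be enabled");
        }
        // Assumed definition of an invalid temperature: non-positive or NaN.
        if (!(temperature > 0.0f))
        {
            throw std::invalid_argument("temperature must be positive");
        }
    }
};
```

Under these assumptions, constructing the sketch with only batchSize_ and vocabSize_ throws, because the default topK_ and topP_ values both mean "disabled"; at least one of them has to be set explicitly.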
- size_t trt_edgellm::getTopKtopPSamplingWorkspaceSize(
- int32_t batchSize,
- int32_t vocabSize,
- SamplingParams const &params
)
Get workspace size required for top-K/top-P sampling (FP32 only).
Calculates the amount of GPU memory needed for intermediate computations during the sampling operation. The workspace must be allocated before calling topKtopPSamplingFromLogits().
- Parameters:
batchSize – [in] Batch size for sampling
vocabSize – [in] Vocabulary size
params – [in] Sampling parameters
- Returns:
Required workspace size in bytes
- size_t trt_edgellm::getSelectAllTopKWorkspaceSize(
- int32_t batchSize,
- int32_t vocabSize,
- int32_t topK
)
Get workspace size required for selectAllTopK operation (FP32 only).
Calculates the amount of GPU memory needed for intermediate computations during the top-K selection operation. The workspace must be allocated before calling selectAllTopK().
- Parameters:
batchSize – [in] Batch size for selection
vocabSize – [in] Vocabulary size
topK – [in] Number of top elements to select
- Returns:
Required workspace size in bytes
- void trt_edgellm::topKtopPSamplingFromLogits(
- rt::Tensor const &logits,
- rt::Tensor &selectedIndices,
- SamplingParams const &params,
- rt::Tensor &workspace,
- cudaStream_t stream,
- uint64_t philoxSeed = 42,
- uint64_t philoxOffset = 0
)
Main sampling function for top-K and top-P sampling from logits.
Performs token sampling using top-K and/or top-P (nucleus) sampling strategies on the input logits. The function applies temperature scaling and returns the selected token indices for each batch element.
- Parameters:
logits – [in] Input logits tensor [GPU, Float] with shape [batch-size, vocab-size]
selectedIndices – [out] Selected token indices [GPU, Int32] with shape [batch-size, 1]
params – [in] Sampling parameters including batch size, vocab size, temperature, top-K, and top-P values
workspace – [inout] Workspace buffer [GPU, Int8] for intermediate computations
stream – [in] CUDA stream to execute the kernel
philoxSeed – [in] Random seed for sampling (default: 42)
philoxOffset – [in] Random offset for sampling (default: 0)
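For readers who want to see what top-K and top-P sampling do to one row of logits, here is a minimal CPU sketch. It is not the library's GPU kernel. The function name sampleTopKTopPReference, the order of operations (top-K filter, temperature-scaled softmax, then top-P filter), and the use of a caller-supplied uniform draw u in place of the Philox generator are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical CPU reference for a single batch row.
// `u` is a uniform random draw in [0, 1), supplied by the caller.
int32_t sampleTopKTopPReference(
    std::vector<float> const& logits, float temperature, int32_t topK, float topP, float u)
{
    assert(!logits.empty() && temperature > 0.0f);
    int32_t const vocabSize = static_cast<int32_t>(logits.size());

    // Order token IDs by descending logit.
    std::vector<int32_t> order(vocabSize);
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
        [&logits](int32_t a, int32_t b) { return logits[a] > logits[b]; });

    // Top-K: keep only the K largest logits (0 = disabled).
    int32_t keep = (topK > 0) ? std::min(topK, vocabSize) : vocabSize;

    // Temperature-scaled softmax over the kept candidates.
    std::vector<float> probs(keep);
    float const maxLogit = logits[order[0]];
    float sum = 0.0f;
    for (int32_t i = 0; i < keep; ++i)
    {
        probs[i] = std::exp((logits[order[i]] - maxLogit) / temperature);
        sum += probs[i];
    }
    for (float& p : probs)
    {
        p /= sum;
    }

    // Top-P: keep the smallest prefix whose cumulative probability reaches topP.
    if (topP < 1.0f)
    {
        float cumulative = 0.0f;
        int32_t nucleus = 0;
        while (nucleus < keep)
        {
            cumulative += probs[nucleus++];
            if (cumulative >= topP)
            {
                break;
            }
        }
        keep = nucleus;
        for (int32_t i = 0; i < keep; ++i)
        {
            probs[i] /= cumulative; // renormalize inside the nucleus
        }
    }

    // Inverse-CDF sampling with the supplied uniform draw.
    float cdf = 0.0f;
    for (int32_t i = 0; i < keep; ++i)
    {
        cdf += probs[i];
        if (u < cdf)
        {
            return order[i];
        }
    }
    return order[keep - 1]; // guard against rounding when u is close to 1
}
```

With topK = 1 the sketch reduces to greedy (argmax) selection regardless of u, which is a convenient sanity check when comparing against GPU output.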
- void trt_edgellm::selectAllTopK(
- rt::Tensor const &input,
- rt::OptionalOutputTensor topKValues,
- rt::Tensor &topKIndices,
- int32_t topK,
- rt::Tensor &workspace,
- cudaStream_t stream
)
Select all top-K elements from input tensor.
Identifies the K largest elements in each batch row and writes their indices and, optionally, their values. The values are taken from the input as-is, with no transformations applied.
- Parameters:
input – [in] Input tensor [GPU, Float] with shape [batch-size, vocab-size]
topKValues – [out] Optional top-K values [GPU, Float] with shape [batch-size, top-K]. Can be std::nullopt if the values are not needed
topKIndices – [out] Top-K indices [GPU, Int32] with shape [batch-size, top-K]
topK – [in] Number of top elements to select
workspace – [inout] Workspace buffer [GPU, Int8] for intermediate computations
stream – [in] CUDA stream to execute the kernel
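The selection semantics described above can be checked against a small CPU sketch. The names selectAllTopKReference and TopKResult are hypothetical, the input is a flat row-major buffer rather than an rt::Tensor, and returning the K entries in descending order is an assumption; the documentation above does not state the output order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical result holder: both vectors have shape [batchSize, topK], row-major.
struct TopKResult
{
    std::vector<int32_t> indices;
    std::vector<float> values;
};

// Hypothetical CPU reference: for each row, pick the topK largest raw values.
TopKResult selectAllTopKReference(
    std::vector<float> const& input, int32_t batchSize, int32_t vocabSize, int32_t topK)
{
    assert(topK > 0 && topK <= vocabSize);
    assert(input.size() == static_cast<std::size_t>(batchSize) * vocabSize);

    TopKResult result;
    std::vector<int32_t> order(vocabSize);
    for (int32_t b = 0; b < batchSize; ++b)
    {
        float const* row = input.data() + static_cast<std::size_t>(b) * vocabSize;
        std::iota(order.begin(), order.end(), 0);
        // Only the first topK positions need to be sorted.
        std::partial_sort(order.begin(), order.begin() + topK, order.end(),
            [row](int32_t a, int32_t c) { return row[a] > row[c]; });
        for (int32_t k = 0; k < topK; ++k)
        {
            result.indices.push_back(order[k]);
            result.values.push_back(row[order[k]]); // raw value, no softmax or scaling
        }
    }
    return result;
}
```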
- void trt_edgellm::mapReducedVocabToFullVocab( )
Map reduced vocabulary IDs to full vocabulary IDs using a lookup table (in-place).
Each element is mapped from the reduced vocabulary space to the full vocabulary space using the provided mapping table: vocabIds[i] = vocabMappingTable[vocabIds[i]]
The input tensor is modified directly; no separate output tensor is allocated.
- Parameters:
vocabIds – [inout] Tensor [GPU, Int32] containing reduced vocabulary IDs as input, will be overwritten with full vocabulary IDs as output
vocabMappingTable – [in] Lookup table [GPU, Int32] with shape [reduced_vocab_size] mapping reduced IDs to full IDs
stream – [in] CUDA stream to execute the kernel
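The lookup rule vocabIds[i] = vocabMappingTable[vocabIds[i]] is simple enough to mirror on the CPU for testing. The sketch below uses a hypothetical function name and std::vector in place of GPU tensors. The bounds check through at() is an addition for illustration; the documentation does not say how the GPU function handles out-of-range IDs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for the in-place reduced-to-full vocabulary mapping.
void mapReducedVocabToFullVocabReference(
    std::vector<int32_t>& vocabIds, std::vector<int32_t> const& vocabMappingTable)
{
    for (int32_t& id : vocabIds)
    {
        // at() throws std::out_of_range for IDs outside the table;
        // a negative ID wraps to a large index and is rejected as well.
        id = vocabMappingTable.at(static_cast<std::size_t>(id));
    }
}
```

For example, with a mapping table {10, 42, 7}, the reduced IDs {0, 2, 1} are overwritten with the full IDs {10, 7, 42}.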