Sampling#

struct SamplingParams#

Structure to hold sampling parameters.

Public Functions

inline SamplingParams(
int32_t batchSize_,
int32_t vocabSize_,
float temperature_ = 1.0f,
int32_t topK_ = 0,
float topP_ = 1.0f
)#

Constructor with default values.

Parameters:
  • batchSize_ – Number of samples in the batch

  • vocabSize_ – Size of the vocabulary

  • temperature_ – Temperature parameter (default: 1.0f)

  • topK_ – Top-K parameter (default: 0, disabled)

  • topP_ – Top-P parameter (default: 1.0f, disabled)

Throws:

std::invalid_argument – if neither top-K nor top-P sampling is enabled, or if the temperature is invalid

Public Members

int32_t batchSize#

Number of samples in the batch.

int32_t vocabSize#

Size of the vocabulary.

float temperature#

Temperature parameter for sampling (higher = more random).

int32_t topK#

Top-K sampling parameter (0 = disabled).

float topP#

Top-P (nucleus) sampling parameter (1.0 = disabled).

bool useTopK#

Flag indicating if top-K sampling is enabled.

bool useTopP#

Flag indicating if top-P sampling is enabled.
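
For illustration, a minimal construction sketch (the surrounding function and the chosen values are assumptions; the library header include is omitted because its path is not given in this section):

trt_edgellm::SamplingParams makeParams()
{
    // useTopK / useTopP are set to reflect which strategies are enabled; per the
    // constructor documentation, leaving both topK_ and topP_ at their disabled
    // defaults would throw std::invalid_argument.
    return trt_edgellm::SamplingParams(
        /*batchSize_*/ 4,
        /*vocabSize_*/ 32000,
        /*temperature_*/ 0.8f,
        /*topK_*/ 50,     // enable top-K with K = 50
        /*topP_*/ 0.95f); // enable nucleus sampling with p = 0.95
}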

size_t trt_edgellm::getTopKtopPSamplingWorkspaceSize(
int32_t batchSize,
int32_t vocabSize,
SamplingParams const &params
)#

Get workspace size required for top-K/top-P sampling (FP32 only).

Calculates the amount of GPU memory needed for intermediate computations during the sampling operation. The workspace must be allocated before calling topKtopPSamplingFromLogits().

Parameters:
  • batchSize[in] Batch size for sampling

  • vocabSize[in] Vocabulary size

  • params[in] Sampling parameters

Returns:

Required workspace size in bytes
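
A sketch of the allocation pattern (raw cudaMalloc is shown for clarity; in practice the returned bytes would back the [GPU, Int8] workspace tensor passed to topKtopPSamplingFromLogits()):

#include <cuda_runtime.h>
#include <cstddef>

// Query the required size for the given sampling parameters, then reserve device memory.
void* allocateSamplingWorkspace(trt_edgellm::SamplingParams const& params)
{
    size_t const bytes = trt_edgellm::getTopKtopPSamplingWorkspaceSize(
        params.batchSize, params.vocabSize, params);
    void* workspace = nullptr;
    cudaMalloc(&workspace, bytes); // caller releases with cudaFree
    return workspace;
}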

size_t trt_edgellm::getSelectAllTopKWorkspaceSize(
int32_t batchSize,
int32_t vocabSize,
int32_t topK
)#

Get workspace size required for selectAllTopK operation (FP32 only).

Calculates the amount of GPU memory needed for intermediate computations during the top-K selection operation. The workspace must be allocated before calling selectAllTopK().

Parameters:
  • batchSize[in] Batch size for selection

  • vocabSize[in] Vocabulary size

  • topK[in] Number of top elements to select

Returns:

Required workspace size in bytes
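
The selection path follows the same pattern; for example (values are illustrative):

// Workspace needed to pick the 40 largest logits per row from a 4 x 32000 input.
size_t const selectionBytes =
    trt_edgellm::getSelectAllTopKWorkspaceSize(/*batchSize*/ 4, /*vocabSize*/ 32000, /*topK*/ 40);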

void trt_edgellm::topKtopPSamplingFromLogits(
rt::Tensor const &logits,
rt::Tensor &selectedIndices,
SamplingParams const &params,
rt::Tensor &workspace,
cudaStream_t stream,
uint64_t philoxSeed = 42,
uint64_t philoxOffset = 0
)#

Main sampling function for top-K and top-P sampling from logits.

Performs token sampling using top-K and/or top-P (nucleus) sampling strategies on the input logits. The function applies temperature scaling and returns the selected token indices for each batch element.

Parameters:
  • logits[in] Input logits tensor [GPU, Float] with shape [batch-size, vocab-size]

  • selectedIndices[out] Selected token indices [GPU, Int32] with shape [batch-size, 1]

  • params[in] Sampling parameters including batch size, vocab size, temperature, top-K, and top-P values

  • workspace[inout] Workspace buffer [GPU, Int8] for intermediate computations

  • stream[in] CUDA stream to execute the kernel

  • philoxSeed[in] Random seed for sampling (default: 42)

  • philoxOffset[in] Random offset for sampling (default: 0)
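
An end-to-end sketch under stated assumptions: makeDeviceTensor() is a hypothetical helper standing in for whatever rt::Tensor factory the runtime provides, and only the trt_edgellm functions and SamplingParams documented in this section are actual API.

#include <cuda_runtime.h>
#include <cstdint>

// logits:          [GPU, Float], shape [batchSize, vocabSize]
// selectedIndices: [GPU, Int32], shape [batchSize, 1]
void sampleTokens(rt::Tensor const& logits, rt::Tensor& selectedIndices, cudaStream_t stream)
{
    int32_t const batchSize = 4;
    int32_t const vocabSize = 32000;

    trt_edgellm::SamplingParams params(batchSize, vocabSize,
        /*temperature_*/ 0.7f, /*topK_*/ 40, /*topP_*/ 1.0f); // top-P left disabled

    // The workspace must be sized and allocated before the sampling call.
    size_t const wsBytes =
        trt_edgellm::getTopKtopPSamplingWorkspaceSize(batchSize, vocabSize, params);
    rt::Tensor workspace = makeDeviceTensor(wsBytes); // hypothetical [GPU, Int8] factory

    // Temperature scaling plus top-K filtering, one sampled token per batch row.
    trt_edgellm::topKtopPSamplingFromLogits(logits, selectedIndices, params, workspace,
        stream, /*philoxSeed*/ 1234, /*philoxOffset*/ 0);
}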

void trt_edgellm::selectAllTopK(
rt::Tensor const &input,
rt::OptionalOutputTensor topKValues,
rt::Tensor &topKIndices,
int32_t topK,
rt::Tensor &workspace,
cudaStream_t stream
)#

Select all top-K elements from the input tensor.

Returns the top-K indices and the corresponding raw values from the input, with no transformations applied. This function identifies the K largest elements in each batch row and returns their indices and, optionally, their values.

Parameters:
  • input[in] Input tensor [GPU, Float] with shape [batch-size, vocab-size]

  • topKValues[out] Optional top-K values [GPU, Float] with shape [batch-size, top-K]. Can be std::nullopt if values are not needed

  • topKIndices[out] Top-K indices [GPU, Int32] with shape [batch-size, top-K]

  • topK[in] Number of top elements to select

  • workspace[inout] Workspace buffer [GPU, Int8] for intermediate computations

  • stream[in] CUDA stream to execute the kernel
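
A sketch of indices-only selection, again assuming hypothetical tensor factories (makeInt32Tensor, makeByteTensor) that are not part of the documented API:

#include <cuda_runtime.h>
#include <cstdint>
#include <optional>

// input: [GPU, Float], shape [batchSize, vocabSize]
void selectTopFive(rt::Tensor const& input, cudaStream_t stream)
{
    int32_t const batchSize = 4;
    int32_t const vocabSize = 32000;
    int32_t const topK = 5;

    rt::Tensor topKIndices = makeInt32Tensor(batchSize, topK); // hypothetical helper
    size_t const wsBytes =
        trt_edgellm::getSelectAllTopKWorkspaceSize(batchSize, vocabSize, topK);
    rt::Tensor workspace = makeByteTensor(wsBytes);            // hypothetical helper

    // Pass std::nullopt for the values output when only the indices are needed.
    trt_edgellm::selectAllTopK(input, std::nullopt, topKIndices, topK, workspace, stream);
}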

void trt_edgellm::mapReducedVocabToFullVocab(
rt::Tensor &vocabIds,
rt::Tensor const &vocabMappingTable,
cudaStream_t stream
)#

Map reduced vocabulary IDs to full vocabulary IDs using a lookup table (in-place).

Performs in-place mapping from the reduced vocabulary space to the full vocabulary space using the provided lookup table, modifying the input tensor directly: vocabIds[i] = vocabMappingTable[vocabIds[i]]

Parameters:
  • vocabIds[inout] Tensor [GPU, Int32] containing reduced vocabulary IDs as input, will be overwritten with full vocabulary IDs as output

  • vocabMappingTable[in] Lookup table [GPU, Int32] with shape [reduced_vocab_size] mapping reduced IDs to full IDs

  • stream[in] CUDA stream to execute the kernel
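
A sketch of the typical use after sampling against a reduced vocabulary (building the mapping table itself is outside the scope of this section):

#include <cuda_runtime.h>

// sampledIds:   [GPU, Int32], reduced-vocabulary IDs (e.g. from topKtopPSamplingFromLogits)
// mappingTable: [GPU, Int32], shape [reduced_vocab_size]
void remapToFullVocab(rt::Tensor& sampledIds, rt::Tensor const& mappingTable, cudaStream_t stream)
{
    // In place: sampledIds[i] = mappingTable[sampledIds[i]]
    trt_edgellm::mapReducedVocabToFullVocab(sampledIds, mappingTable, stream);
}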