LLM Engine Runner#
-
class LLMEngineRunner#
This class wraps a TensorRT engine built for an auto-regressive decoder model. LLMEngineRunner defines the interface through which the upper-level runtime executes engine actions that drive autoregressive decoding, with or without speculative decoding, for edge inference scenarios. The current design assumes prefill and decoding operations are synchronous, so a batch of requests must perform prefill and decoding together (no continuous batching). The LLMEngineRunner will:
Hold TensorRT resources of the LLM engine (TRT IRuntime, CUDA Engine, Execution Contexts).
Hold the LinearKVCache resources sized to support up to maxSupportedBatchSize and maxSequenceLength.
Hold the Rope CosSinCache tensor required for positional encoding.
Public Types
-
using DecodingGraphKey = std::tuple<int64_t, uintptr_t, uintptr_t, std::string>#
Key to uniquely identify a captured CUDA graph for the decoding step.
-
using BaseGraphKey = std::tuple<int64_t, uintptr_t, uintptr_t, uintptr_t, std::string>#
Key to uniquely identify a captured CUDA graph for the base model verification step.
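The tuple fields are not named in this reference; a minimal sketch under an assumed reading of the fields (batch size, two tensor device addresses, and the LoRA weights name — the helper below is hypothetical, not part of the API):

```cpp
#include <cstdint>
#include <string>
#include <tuple>

using DecodingGraphKey = std::tuple<int64_t, uintptr_t, uintptr_t, std::string>;

// Hypothetical reading of the fields: batch size, two tensor device
// addresses, and the active LoRA weights name ("" = no LoRA).
inline DecodingGraphKey makeDecodingGraphKey(int64_t batchSize,
                                             uintptr_t inputsAddr,
                                             uintptr_t outputsAddr,
                                             std::string const &loraName) {
    return {batchSize, inputsAddr, outputsAddr, loraName};
}
```

Because std::tuple provides lexicographic == and <, such a key can directly index a std::map of captured graphs: a different batch size, buffer address, or LoRA weights name selects a different captured graph.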
Public Functions
- LLMEngineRunner(
- std::filesystem::path const &enginePath,
- std::filesystem::path const &configPath,
- std::unordered_map<std::string, std::string> const &loraWeightsMap,
- cudaStream_t stream
)#
Construct LLM engine runner.
- Parameters:
enginePath – Path to TensorRT engine file
configPath – Path to model configuration file
loraWeightsMap – Map of LoRA weight names to file paths
stream – CUDA stream for operations
- Throws:
std::runtime_error – If engine loading, configuration parsing, or initialization fails, or a CUDA operation fails
-
~LLMEngineRunner() noexcept#
Destructor.
-
rt::Tensor &getRopeCosSinCacheTensor() noexcept#
API entry to get the Rope CosSinCache tensor. This API is useful when the RoPE cos/sin cache depends on context that cannot be initialized in advance when the LLMEngineRunner instance is created.
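A cos/sin cache of this kind can be precomputed on the CPU; the sketch below is illustrative only (the layout, base theta, and dtype expected by the engine are assumptions, not documented here):

```cpp
#include <cmath>
#include <vector>

// Illustrative RoPE cos/sin cache of shape [maxSeqLen, rotaryDim]:
// the first rotaryDim/2 columns hold cos(pos * freq_k), the rest sin.
// The actual layout and theta expected by the engine may differ.
std::vector<float> buildRopeCosSinCache(int maxSeqLen, int rotaryDim,
                                        float theta = 10000.0f) {
    int const half = rotaryDim / 2;
    std::vector<float> cache(static_cast<size_t>(maxSeqLen) * rotaryDim);
    for (int pos = 0; pos < maxSeqLen; ++pos) {
        for (int k = 0; k < half; ++k) {
            float const freq = std::pow(theta, -2.0f * k / rotaryDim);
            cache[pos * rotaryDim + k] = std::cos(pos * freq);
            cache[pos * rotaryDim + half + k] = std::sin(pos * freq);
        }
    }
    return cache;
}
```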
-
rt::LinearKVCache &getLinearKVCache() noexcept#
Get a reference to the linear KV cache (which also owns the Mamba SSM/conv state buffers for hybrid models).
- Returns:
Reference to LinearKVCache
-
LLMEngineRunnerConfig getEngineConfig() const noexcept#
Get engine configuration.
- Returns:
Engine configuration structure
- bool setLMHeadWeights( )#
Set an extra input tensor for the engine.
This is a temporary API for binding additional input tensors that are not part of the standard LLM input set.
Example use case: CodePredictor’s lm_head_weight input for dynamic lm_head selection.
Note
This is not a good design but we put it here temporarily to support TTS inference.
Note
This API will soon be replaced with a better design; please do not follow this pattern.
Note
Must be called before executePrefillStep/executeVanillaDecodingStep.
- Parameters:
name – The name of the LMHead input weights in the ONNX/TRT model
tensor – The tensor to bind (must be on GPU, shape must match engine expectation)
- Returns:
True if the binding was successful
- bool executePrefillStep(
- rt::Tensor const &inputsEmbeds,
- rt::Tensor const &contextLengths,
- rt::OptionalInputTensors deepstackEmbeds,
- rt::Tensor &outputLogits,
- rt::OptionalOutputTensor outputHiddenStates,
- cudaStream_t stream
)#
API entry to execute one prefill engine action for a batched request. The API clears the existing KVCache from the previous batch of requests, performs prefill to fill the KVCache, and produces the output logits.
- Parameters:
inputsEmbeds – [GPU] Input embeddings for the batch of new requests, shape [batchSize, seqLen, hiddenSize]
contextLengths – [CPU] Context lengths for each sequence in the batch
deepstackEmbeds – [GPU] Optional. Deepstack embeddings for Qwen3-VL (already embedded)
outputLogits – [GPU] Output logits for the batch of requests
outputHiddenStates – [GPU] Optional. Output hidden states for Eagle speculative decoding
stream – The CUDA stream to execute the prefill step
- Returns:
True if the prefill step is successful, false otherwise
- Throws:
std::runtime_error – if setting optimization profile fails, or a CUDA operation fails
- bool executeVanillaDecodingStep(
- rt::Tensor const &inputsEmbeds,
- rt::Tensor &outputLogits,
- rt::OptionalOutputTensor outputHiddenStates,
- cudaStream_t stream
)#
API entry to execute one vanilla decoding engine action for a batched request. The API performs decoding, fills the KVCache with the newly generated tokens, and produces the output logits. Decoding must be performed after the prefill step has completed.
- Parameters:
inputsEmbeds – [GPU] Input embeddings for the batch of requests, shape [batchSize, 1, hiddenSize]
outputLogits – [GPU] Output logits for the batch of requests
outputHiddenStates – [GPU] Optional. Output hidden states for Eagle speculative decoding
stream – The CUDA stream to execute the decoding step
- Returns:
True if the decoding step is successful, false otherwise
- Throws:
std::runtime_error – if setting optimization profile fails, or a CUDA operation fails
- bool executeEagleBaseTreeDecodingStep(
- rt::Tensor const &baseTreeDecodingInputsEmbeds,
- rt::Tensor const &baseTreeDecodingMask,
- rt::Tensor &outputLogits,
- rt::Tensor &outputHiddenStates,
- cudaStream_t stream
)#
API entry to execute one Eagle base-model tree decoding step. The API takes a draft tree of input embeddings; baseTreeDecodingMask denotes the relationship between the draft tree nodes.
- Parameters:
baseTreeDecodingInputsEmbeds – [GPU, Float16] Input embeddings for the base model, shape [batchSize, treeSize, hiddenSize]
baseTreeDecodingMask – [GPU, Int32] Relationship between the draft tree nodes, shape [batchSize, treeSize, treeSize]
outputLogits – [GPU, Float16] Output logits, shape [batchSize * treeSize, baseVocabSize]
outputHiddenStates – [GPU] Output hidden states, shape [batchSize * treeSize, baseHiddenDim]
stream – The CUDA stream to execute the base tree decoding step
- Returns:
True if the base tree decoding step is successful, false otherwise
- Throws:
std::runtime_error – if setting optimization profile fails, or a CUDA operation fails
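One way to populate a single sequence of the baseTreeDecodingMask on the CPU, given each draft-tree node's parent index, is sketched below. The mask convention (node j attends to itself and its ancestors) is an assumption for illustration; the exact convention expected by the engine is not documented here.

```cpp
#include <cstdint>
#include <vector>

// Sketch: build a [treeSize, treeSize] Int32 mask for one sequence, given
// each draft-tree node's parent index (-1 for the root). mask[j][i] = 1 when
// node j may attend to node i, i.e. i is j itself or one of j's ancestors.
std::vector<std::vector<int32_t>> buildTreeMask(std::vector<int> const &parent) {
    int const n = static_cast<int>(parent.size());
    std::vector<std::vector<int32_t>> mask(n, std::vector<int32_t>(n, 0));
    for (int j = 0; j < n; ++j) {
        for (int i = j; i >= 0; i = parent[i]) {
            mask[j][i] = 1;  // walk from j up to the root, marking ancestors
        }
    }
    return mask;
}
```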
- bool captureVanillaDecodingCudaGraph(
- rt::Tensor const &inputsEmbeds,
- rt::Tensor &outputLogits,
- std::string const &loraWeightsName,
- cudaStream_t stream,
- rt::OptionalOutputTensor outputHiddenStates = std::nullopt
)#
API entry to capture the CUDA graph for the vanilla decoding step. If the capture succeeds, later calls to executeVanillaDecodingStep() will always launch the captured CUDA graph.
- Parameters:
inputsEmbeds – [GPU] Input embeddings for the batch of requests, shape [batchSize, 1, hiddenSize]
outputLogits – [GPU] Output logits for the batch of requests
loraWeightsName – Name of the LoRA weights; empty string if no LoRA weights
stream – The CUDA stream to capture the CUDA graph
outputHiddenStates – [GPU] Optional. Output hidden states for Eagle speculative decoding
- Returns:
True if the CUDA graph capture is successful, false otherwise
- Throws:
std::runtime_error – if setting optimization profile fails, or a CUDA operation fails
-
bool switchLoraWeights(std::string const &loraWeightsName)#
API entry to switch the LoRA weights of the LLM engine.
- Parameters:
loraWeightsName – The name of the LoRA weights
- Returns:
True if the LoRA weights switch is successful, false otherwise
-
std::string getActiveLoraWeightsName() const#
API entry to get the active LoRA weights name.
- Returns:
The active LoRA weights name
-
std::vector<std::string> getAvailableLoraWeights() const#
API entry to list the available LoRA weights.
- Returns:
The names of the available LoRA weights
- bool captureEagleBaseTreeDecodingCudaGraph(
- rt::Tensor const &baseTreeDecodingInputsEmbeds,
- rt::Tensor const &baseTreeDecodingMask,
- rt::Tensor &outputLogits,
- rt::Tensor &outputHiddenStates,
- std::string const &loraWeightsName,
- cudaStream_t stream
)#
API entry to capture the CUDA graph for the Eagle base-model tree decoding step. If the capture succeeds, later calls to executeEagleBaseTreeDecodingStep() will always launch the captured CUDA graph.
- Parameters:
baseTreeDecodingInputsEmbeds – [GPU, Float16] Input embeddings for the base model, shape [batchSize, treeSize, hiddenSize]
baseTreeDecodingMask – [GPU, Int32] Relationship between the draft tree nodes, shape [batchSize, treeSize, treeSize]
outputLogits – [GPU, Float16] Output logits, shape [batchSize * treeSize, baseVocabSize]
outputHiddenStates – [GPU] Output hidden states, shape [batchSize * treeSize, baseHiddenDim]
loraWeightsName – Name of the LoRA weights; empty string if no LoRA weights
stream – The CUDA stream to capture the CUDA graph
- Returns:
True if the CUDA graph capture is successful, false otherwise
- Throws:
std::runtime_error – if setting optimization profile fails, or a CUDA operation fails
-
struct LLMEngineRunnerConfig#
Configuration structure for LLM engine runner.
Contains all runtime configuration parameters for the LLM engine.
Public Members
-
RopeConfig ropeConfig = {}#
Rotary positional encoding configuration.
-
bool useContextDependentRope = {false}#
Use context-dependent RoPE.
-
bool enableEagleSpecDecode = {false}#
Enable Eagle speculative decoding.
-
bool useTrtNativeOps = {false}#
Use TensorRT native operations instead of custom plugin.
-
int32_t numDecoderLayers = {}#
Number of decoder layers.
-
int32_t numKVHeads = {}#
Number of key-value heads.
-
int32_t headDim = {}#
Dimension of each attention head.
-
int32_t rotaryDim = {}#
Rotary embedding dimension.
-
int32_t hiddenSize = {}#
Model’s hidden dimension.
-
int32_t maxSupportedBatchSize = {}#
Maximum supported batch size.
-
int32_t maxSupportedInputLength = {}#
Maximum supported input length.
-
int32_t maxKVCacheCapacity = {}#
Maximum KV cache capacity.
-
int32_t vocabSize = {}#
Vocabulary size (full vocabulary)
-
int32_t reducedVocabSize = {0}#
Reduced vocabulary size (0 if not using reduced vocab)
-
int32_t outputVocabSize = {}#
Actual output vocabulary size (reducedVocabSize if enabled, else vocabSize)
-
int32_t maxSupportedLoraRank = {}#
Maximum supported LoRA rank.
-
int32_t outputHiddenDim = {}#
Output hidden dimension for Eagle speculative decoding (hidden_size * 3)
-
int32_t maxVerifyTreeSize = {}#
Maximum verification tree size for Eagle speculative decoding.
-
int32_t numDeepstackFeatures = {0}#
Number of deepstack features for Qwen3-VL and Qwen3-Omni.
-
int32_t audioTokenId = {0}#
Special token ID for audio in Qwen3-Omni.
-
int32_t imageTokenId = {0}#
Special token ID for image in Qwen3-Omni.
-
int32_t numMambaLayers = {0}#
Number of Mamba (SSM) layers (0 for pure attention models)
-
int32_t numAttentionLayers = {0}#
Number of attention layers (equals numDecoderLayers for pure attention)
-
int32_t mambaNumHeads = {0}#
Number of Mamba heads.
-
int32_t mambaHeadDim = {0}#
Dimension of each Mamba head.
-
int32_t ssmStateSize = {0}#
SSM state dimension (dstate)
-
int32_t convDim = {0}#
Conv1d dimension (intermediate_size + 2 * n_groups * ssm_state_size)
-
int32_t convKernel = {0}#
Conv1d kernel width.
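Several members are derived quantities. A minimal sketch of the relationships documented above, written as standalone helpers (intermediateSize and nGroups are model parameters referenced in the convDim formula, not members of this struct; the example values in the usage below are illustrative):

```cpp
#include <cstdint>

// outputVocabSize: reducedVocabSize if reduced vocab is enabled (non-zero),
// otherwise the full vocabSize.
inline int32_t computeOutputVocabSize(int32_t vocabSize, int32_t reducedVocabSize) {
    return reducedVocabSize > 0 ? reducedVocabSize : vocabSize;
}

// convDim = intermediate_size + 2 * n_groups * ssm_state_size (per the
// convDim member documentation).
inline int32_t computeConvDim(int32_t intermediateSize, int32_t nGroups,
                              int32_t ssmStateSize) {
    return intermediateSize + 2 * nGroups * ssmStateSize;
}

// outputHiddenDim for Eagle speculative decoding = hidden_size * 3 (per the
// outputHiddenDim member documentation).
inline int32_t computeOutputHiddenDim(int32_t hiddenSize) {
    return hiddenSize * 3;
}
```

For example, with intermediateSize = 8192, nGroups = 8, and ssmStateSize = 128, convDim works out to 8192 + 2 * 8 * 128 = 10240.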