LLM Engine Runner#
-
class LLMEngineRunner#
This class wraps a TensorRT engine built for an auto-regressive decoder model. LLMEngineRunner defines the interface through which the upper-level runtime executes engine actions that drive autoregressive decoding, with or without speculative decoding, for edge inference scenarios. The current design assumes prefill and decoding operations are synchronous, so a batch of requests must perform prefill and decoding together (no continuous batching). The LLMEngineRunner will:
Hold TensorRT resources of the LLM engine (TRT IRuntime, CUDA Engine, Execution Contexts).
Hold the LinearKVCache resources sized to support up to maxSupportedBatchSize and maxSequenceLength.
Hold the Rope CosSinCache tensor required for positional encoding.
Public Types
-
using DecodingGraphKey = std::tuple<int64_t, uintptr_t, uintptr_t, std::string>#
Key to uniquely identify a captured CUDA graph for the decoding step.
-
using BaseGraphKey = std::tuple<int64_t, uintptr_t, uintptr_t, uintptr_t, std::string>#
Key to uniquely identify a captured CUDA graph for the base model verification step.
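The tuple fields are not named in this reference; a minimal sketch under an assumed reading of the fields (batch size, two tensor device addresses, and the LoRA weights name — the helper below is hypothetical, not part of the API):

```cpp
#include <cstdint>
#include <string>
#include <tuple>

using DecodingGraphKey = std::tuple<int64_t, uintptr_t, uintptr_t, std::string>;

// Hypothetical reading of the fields: batch size, two tensor device
// addresses, and the active LoRA weights name ("" = no LoRA).
inline DecodingGraphKey makeDecodingGraphKey(int64_t batchSize,
                                             uintptr_t inputsAddr,
                                             uintptr_t outputsAddr,
                                             std::string const &loraName) {
    return {batchSize, inputsAddr, outputsAddr, loraName};
}
```

Because std::tuple provides lexicographic == and <, such a key can directly index a std::map of captured graphs: a different batch size, buffer address, or LoRA weights name selects a different captured graph.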
Public Functions
- LLMEngineRunner(
- std::filesystem::path const &enginePath,
- std::filesystem::path const &configPath,
- std::unordered_map<std::string, std::string> const &loraWeightsMap,
- cudaStream_t stream
)#
Construct LLM engine runner.
- Parameters:
enginePath – Path to TensorRT engine file
configPath – Path to model configuration file
loraWeightsMap – Map of LoRA weight names to file paths
stream – CUDA stream for operations
- Throws:
std::runtime_error – If engine loading, configuration parsing, or initialization fails, or a CUDA operation fails
-
~LLMEngineRunner() noexcept#
Destructor.
-
rt::Tensor &getRopeCosSinCacheTensor() noexcept#
API entry to get the Rope CosSinCache tensor. This API is useful when the RoPE cos/sin cache depends on context that cannot be initialized in advance when the LLMEngineRunner instance is created.
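A cos/sin cache of this kind can be precomputed on the CPU; the sketch below is illustrative only (the layout, base theta, and dtype expected by the engine are assumptions, not documented here):

```cpp
#include <cmath>
#include <vector>

// Illustrative RoPE cos/sin cache of shape [maxSeqLen, rotaryDim]:
// the first rotaryDim/2 columns hold cos(pos * freq_k), the rest sin.
// The actual layout and theta expected by the engine may differ.
std::vector<float> buildRopeCosSinCache(int maxSeqLen, int rotaryDim,
                                        float theta = 10000.0f) {
    int const half = rotaryDim / 2;
    std::vector<float> cache(static_cast<size_t>(maxSeqLen) * rotaryDim);
    for (int pos = 0; pos < maxSeqLen; ++pos) {
        for (int k = 0; k < half; ++k) {
            float const freq = std::pow(theta, -2.0f * k / rotaryDim);
            cache[pos * rotaryDim + k] = std::cos(pos * freq);
            cache[pos * rotaryDim + half + k] = std::sin(pos * freq);
        }
    }
    return cache;
}
```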
-
rt::LinearKVCache &getLinearKVCache() noexcept#
Get a reference to the linear KV cache (which also owns the Mamba SSM/conv state buffers for hybrid models).
- Returns:
Reference to LinearKVCache
-
LLMEngineRunnerConfig getEngineConfig() const noexcept#
Get engine configuration.
- Returns:
Engine configuration structure
- bool setLMHeadWeights( )#
Set an extra input tensor for the engine.
This is a temporary API for binding additional input tensors that are not part of the standard LLM input set.
Example use case: CodePredictor’s lm_head_weight input for dynamic lm_head selection.
Note
This is not a good design but we put it here temporarily to support TTS inference.
Note
This API will soon be replaced with a better design; please do not follow this pattern.
Note
Must be called before executePrefillStep/executeVanillaDecodingStep.
- Parameters:
name – The name of the LMHead input weights in the ONNX/TRT model
tensor – The tensor to bind (must be on GPU, shape must match engine expectation)
- Returns:
True if the binding was successful
- bool executePrefillStep(
- rt::Tensor const &inputsEmbeds,
- rt::Tensor const &contextLengths,
- rt::OptionalInputTensors deepstackEmbeds,
- rt::Tensor &outputLogits,
- rt::OptionalOutputTensor outputHiddenStates,
- cudaStream_t stream
)#
API entry to execute one prefill engine action for a batched request. The API clears the existing KVCache from the previous batch of requests, performs prefill to fill the KVCache, and produces the output logits.
- Parameters:
inputsEmbeds – [GPU] Input embeddings for the batch of new requests, shape [batchSize, seqLen, hiddenSize]
contextLengths – [CPU] Context lengths for each sequence in the batch
deepstackEmbeds – [GPU] Optional. Deepstack embeddings for Qwen3-VL (already embedded)
outputLogits – [GPU] Output logits for the batch of requests
outputHiddenStates – [GPU] Optional. Output hidden states for Eagle speculative decoding
stream – The CUDA stream to execute the prefill step
- Returns:
True if the prefill step is successful, false otherwise
- Throws:
std::runtime_error – if setting optimization profile fails, or a CUDA operation fails
- bool executeVanillaDecodingStep(
- rt::Tensor const &inputsEmbeds,
- rt::Tensor &outputLogits,
- rt::OptionalOutputTensor outputHiddenStates,
- cudaStream_t stream
)#
API entry to execute one vanilla decoding engine action for a batched request. The API performs decoding, fills the KVCache with the newly generated tokens, and produces the output logits. Decoding must be performed after the prefill step has completed.
- Parameters:
inputsEmbeds – [GPU] Input embeddings for the batch of requests, shape [batchSize, 1, hiddenSize]
outputLogits – [GPU] Output logits for the batch of requests
outputHiddenStates – [GPU] Optional. Output hidden states for Eagle speculative decoding
stream – The CUDA stream to execute the decoding step
- Returns:
True if the decoding step is successful, false otherwise
- Throws:
std::runtime_error – if setting optimization profile fails, or a CUDA operation fails
- bool executeEagleBaseTreeDecodingStep(
- rt::Tensor const &baseTreeDecodingInputsEmbeds,
- rt::Tensor const &baseTreeDecodingMask,
- rt::Tensor &outputLogits,
- rt::Tensor &outputHiddenStates,
- cudaStream_t stream
)#
API entry to execute one Eagle base-model tree decoding step. The API takes a draft tree of input embeddings; baseTreeDecodingMask denotes the relationship between the draft tree nodes.
- Parameters:
baseTreeDecodingInputsEmbeds – [GPU, Float16] Input embeddings for the base model, shape [batchSize, treeSize, hiddenSize]
baseTreeDecodingMask – [GPU, Int32] Relationship between the draft tree nodes, shape [batchSize, treeSize, treeSize]
outputLogits – [GPU, Float16] Output logits, shape [batchSize * treeSize, baseVocabSize]
outputHiddenStates – [GPU] Output hidden states, shape [batchSize * treeSize, baseHiddenDim]
stream – The CUDA stream to execute the base tree decoding step
- Returns:
True if the base tree decoding step is successful, false otherwise
- Throws:
std::runtime_error – if setting optimization profile fails, or a CUDA operation fails
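One way to populate a single sequence of the baseTreeDecodingMask on the CPU, given each draft-tree node's parent index, is sketched below. The mask convention (node j attends to itself and its ancestors) is an assumption for illustration; the exact convention expected by the engine is not documented here.

```cpp
#include <cstdint>
#include <vector>

// Sketch: build a [treeSize, treeSize] Int32 mask for one sequence, given
// each draft-tree node's parent index (-1 for the root). mask[j][i] = 1 when
// node j may attend to node i, i.e. i is j itself or one of j's ancestors.
std::vector<std::vector<int32_t>> buildTreeMask(std::vector<int> const &parent) {
    int const n = static_cast<int>(parent.size());
    std::vector<std::vector<int32_t>> mask(n, std::vector<int32_t>(n, 0));
    for (int j = 0; j < n; ++j) {
        for (int i = j; i >= 0; i = parent[i]) {
            mask[j][i] = 1;  // walk from j up to the root, marking ancestors
        }
    }
    return mask;
}
```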
- bool captureVanillaDecodingCudaGraph(
- rt::Tensor const &inputsEmbeds,
- rt::Tensor &outputLogits,
- std::string const &loraWeightsName,
- cudaStream_t stream,
- rt::OptionalOutputTensor outputHiddenStates = std::nullopt
)#
API entry to capture the CUDA graph for the vanilla decoding step. If the capture succeeds, later calls to executeVanillaDecodingStep() will always launch the captured CUDA graph.
- Parameters:
inputsEmbeds – [GPU] Input embeddings for the batch of requests, shape [batchSize, 1, hiddenSize]
outputLogits – [GPU] Output logits for the batch of requests
loraWeightsName – Name of the LoRA weights; empty string if no LoRA weights
stream – The CUDA stream to capture the CUDA graph
outputHiddenStates – [GPU] Optional. Output hidden states for Eagle speculative decoding
- Returns:
True if the CUDA graph capture is successful, false otherwise
- Throws:
std::runtime_error – if setting optimization profile fails, or a CUDA operation fails
-
bool switchLoraWeights(std::string const &loraWeightsName)#
API entry to switch the LoRA weights of the LLM engine.
- Parameters:
loraWeightsName – The name of the LoRA weights
- Returns:
True if the LoRA weights switch is successful, false otherwise
-
std::string getActiveLoraWeightsName() const#
API entry to get the active LoRA weights name.
- Returns:
The active LoRA weights name
-
std::vector<std::string> getAvailableLoraWeights() const#
API entry to list the available LoRA weights.
- Returns:
The names of the available LoRA weights
- bool captureEagleBaseTreeDecodingCudaGraph(
- rt::Tensor const &baseTreeDecodingInputsEmbeds,
- rt::Tensor const &baseTreeDecodingMask,
- rt::Tensor &outputLogits,
- rt::Tensor &outputHiddenStates,
- std::string const &loraWeightsName,
- cudaStream_t stream
)#
API entry to capture the CUDA graph for the Eagle base-model tree decoding step. If the capture succeeds, later calls to executeEagleBaseTreeDecodingStep() will always launch the captured CUDA graph.
- Parameters:
baseTreeDecodingInputsEmbeds – [GPU, Float16] Input embeddings for the base model, shape [batchSize, treeSize, hiddenSize]
baseTreeDecodingMask – [GPU, Int32] Relationship between the draft tree nodes, shape [batchSize, treeSize, treeSize]
outputLogits – [GPU, Float16] Output logits, shape [batchSize * treeSize, baseVocabSize]
outputHiddenStates – [GPU] Output hidden states, shape [batchSize * treeSize, baseHiddenDim]
loraWeightsName – Name of the LoRA weights; empty string if no LoRA weights
stream – The CUDA stream to capture the CUDA graph
- Returns:
True if the CUDA graph capture is successful, false otherwise
- Throws:
std::runtime_error – if setting optimization profile fails, or a CUDA operation fails
-
struct LLMEngineRunnerConfig#
Configuration structure for LLM engine runner.
Contains all runtime configuration parameters for the LLM engine.
Public Members
-
RopeConfig ropeConfig = {}#
Rotary positional encoding configuration.
-
bool useContextDependentRope = {false}#
Use context-dependent RoPE.
-
bool enableEagleSpecDecode = {false}#
Enable Eagle speculative decoding.
-
bool useTrtNativeOps = {false}#
Use TensorRT native operations instead of custom plugin.
-
int32_t numDecoderLayers = {}#
Number of decoder layers.
-
int32_t numKVHeads = {}#
Number of key-value heads.
-
int32_t headDim = {}#
Dimension of each attention head.
-
int32_t rotaryDim = {}#
Rotary embedding dimension.
-
int32_t hiddenSize = {}#
Model’s hidden dimension.
-
int32_t maxSupportedBatchSize = {}#
Maximum supported batch size.
-
int32_t maxSupportedInputLength = {}#
Maximum supported input length.
-
int32_t maxKVCacheCapacity = {}#
Maximum KV cache capacity.
-
int32_t vocabSize = {}#
Vocabulary size (full vocabulary)
-
int32_t reducedVocabSize = {0}#
Reduced vocabulary size (0 if not using reduced vocab)
-
int32_t outputVocabSize = {}#
Actual output vocabulary size (reducedVocabSize if enabled, else vocabSize)
-
int32_t maxSupportedLoraRank = {}#
Maximum supported LoRA rank.
-
int32_t outputHiddenDim = {}#
Output hidden dimension for Eagle speculative decoding (hidden_size * 3)
-
int32_t maxVerifyTreeSize = {}#
Maximum verification tree size for Eagle speculative decoding.
-
int32_t numDeepstackFeatures = {0}#
Number of deepstack features for Qwen3-VL and Qwen3-Omni.
-
int32_t audioTokenId = {0}#
Special token ID for audio in Qwen3-Omni.
-
int32_t imageTokenId = {0}#
Special token ID for image in Qwen3-Omni.
-
int32_t numMambaLayers = {0}#
Number of Mamba (SSM) layers (0 for pure attention models)
-
int32_t numAttentionLayers = {0}#
Number of attention layers (equals numDecoderLayers for pure attention)
-
int32_t mambaNumHeads = {0}#
Number of Mamba heads.
-
int32_t mambaHeadDim = {0}#
Dimension of each Mamba head.
-
int32_t ssmStateSize = {0}#
SSM state dimension (dstate)
-
int32_t convDim = {0}#
Conv1d dimension (intermediate_size + 2 * n_groups * ssm_state_size)
-
int32_t convKernel = {0}#
Conv1d kernel width.
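Several members are derived quantities. A minimal sketch of the relationships documented above, written as standalone helpers (intermediateSize and nGroups are model parameters referenced in the convDim formula, not members of this struct; the example values in the usage below are illustrative):

```cpp
#include <cstdint>

// outputVocabSize: reducedVocabSize if reduced vocab is enabled (non-zero),
// otherwise the full vocabSize.
inline int32_t computeOutputVocabSize(int32_t vocabSize, int32_t reducedVocabSize) {
    return reducedVocabSize > 0 ? reducedVocabSize : vocabSize;
}

// convDim = intermediate_size + 2 * n_groups * ssm_state_size (per the
// convDim member documentation).
inline int32_t computeConvDim(int32_t intermediateSize, int32_t nGroups,
                              int32_t ssmStateSize) {
    return intermediateSize + 2 * nGroups * ssmStateSize;
}

// outputHiddenDim for Eagle speculative decoding = hidden_size * 3 (per the
// outputHiddenDim member documentation).
inline int32_t computeOutputHiddenDim(int32_t hiddenSize) {
    return hiddenSize * 3;
}
```

For example, with intermediateSize = 8192, nGroups = 8, and ssmStateSize = 128, convDim works out to 8192 + 2 * 8 * 128 = 10240.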