LLM Engine Runner#
-
class LLMEngineRunner#
This class wraps a TensorRT engine built for an auto-regressive decoder model. LLMEngineRunner defines the interface for an upper-level runtime to execute engine actions that drive autoregressive decoding, with or without speculative decoding, in edge inference scenarios. The current design assumes prefill and decoding operations are synchronous, so a batch of requests must perform prefill and decoding at the same time (no continuous batching). The LLMEngineRunner will:
Hold TensorRT resources of the LLM engine (TRT IRuntime, CUDA Engine, Execution Contexts).
Hold the LinearKVCache resources that support up to maxSupportedBatchSize and maxSequenceLength.
Hold the Rope CosSinCache tensor required for positional encoding.
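The RoPE cos/sin cache is a precomputed table of rotary angles indexed by token position. A plain C++ sketch of how such a cache is commonly built (the interleaved cos/sin layout, the helper name, and the default base theta = 10000 are assumptions for illustration; the actual tensor layout is defined by the engine and RopeConfig):

```cpp
#include <cmath>
#include <vector>

// Precompute a [maxSeqLen, rotaryDim] RoPE cos/sin cache.
// cache[pos][2*i]   = cos(pos * invFreq_i)
// cache[pos][2*i+1] = sin(pos * invFreq_i)
std::vector<float> buildRopeCosSinCache(int maxSeqLen, int rotaryDim,
                                        float theta = 10000.0f) {
    std::vector<float> cache(static_cast<size_t>(maxSeqLen) * rotaryDim);
    for (int pos = 0; pos < maxSeqLen; ++pos) {
        for (int i = 0; i < rotaryDim / 2; ++i) {
            // Inverse frequency for dimension pair i: theta^(-2i/rotaryDim).
            float invFreq = std::pow(theta, -2.0f * i / rotaryDim);
            float angle = pos * invFreq;
            cache[static_cast<size_t>(pos) * rotaryDim + 2 * i] = std::cos(angle);
            cache[static_cast<size_t>(pos) * rotaryDim + 2 * i + 1] = std::sin(angle);
        }
    }
    return cache;
}
```

For context-dependent RoPE (see useContextDependentRope and getRopeCosSinCacheTensor below), a table like this cannot be filled at construction time and is written into the cache tensor later.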
Public Functions
- LLMEngineRunner(
- std::filesystem::path const &enginePath,
- std::filesystem::path const &configPath,
- std::unordered_map<std::string, std::string> const &loraWeightsMap,
- cudaStream_t stream
Construct LLM engine runner.
- Parameters:
enginePath – Path to TensorRT engine file
configPath – Path to model configuration file
loraWeightsMap – Map of LoRA weight names to file paths
stream – CUDA stream for operations
-
~LLMEngineRunner()#
Destructor.
-
rt::Tensor &getRopeCosSinCacheTensor()#
API entry to get the RoPE CosSinCache tensor. This API is useful when the RoPE cos/sin cache depends on the context and therefore cannot be initialized in advance when the LLMEngineRunner instance is created.
-
rt::LinearKVCache &getLinearKVCache()#
Get reference to the linear KV cache.
- Returns:
Reference to LinearKVCache
-
LLMEngineRunnerConfig getEngineConfig() const#
Get engine configuration.
- Returns:
Engine configuration structure
- bool executePrefillStep(
- rt::Tensor const &inputIds,
- rt::Tensor const &contextLengths,
- rt::OptionalInputTensor multimodalEmbeddings,
- rt::OptionalInputTensors extraInputTensors,
- rt::Tensor &outputLogits,
- rt::OptionalOutputTensor outputHiddenStates,
- cudaStream_t stream
API entry to execute one prefill engine action for a batched request. The API clears the existing KVCache from the last batch of requests and performs prefill operations to fill the KVCache and produce the output logits.
Inputs: inputIds [GPU]: The input token_ids for the batch of new requests. contextLengths [CPU]: The context lengths for each sequence in the batch. multimodalEmbeddings [GPU]: Optional. The multimodal embeddings for the batch of requests. extraInputTensors [GPU]: Optional. Extra input tensors (e.g., deepstack features for Qwen3-VL).
Outputs: outputLogits [GPU]: The output logits for the batch of requests. outputHiddenStates [GPU]: Optional. The output hidden states for the batch of requests.
stream: The CUDA stream to execute the prefill step.
Returns: True if the prefill step is successful, false otherwise.
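Because prefill runs synchronously for the whole batch, variable-length prompts must be packed into one dense inputIds buffer plus per-sequence contextLengths before the call. A host-side sketch of such packing (the row-major padded layout and padId are assumptions for illustration; PrefillBatch and packPrefillBatch are hypothetical helpers, not part of this API):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Dense batch layout mirroring the inputIds/contextLengths pair
// expected by executePrefillStep.
struct PrefillBatch {
    std::vector<int32_t> inputIds;        // [batch * maxLen], row-major, padded
    std::vector<int32_t> contextLengths;  // [batch], kept on CPU
    int32_t maxLen = 0;
};

PrefillBatch packPrefillBatch(std::vector<std::vector<int32_t>> const& prompts,
                              int32_t padId = 0) {
    PrefillBatch b;
    for (auto const& p : prompts)
        b.maxLen = std::max<int32_t>(b.maxLen, static_cast<int32_t>(p.size()));
    for (auto const& p : prompts) {
        b.contextLengths.push_back(static_cast<int32_t>(p.size()));
        // Copy the prompt, then pad the row out to maxLen with padId.
        b.inputIds.insert(b.inputIds.end(), p.begin(), p.end());
        b.inputIds.resize(b.inputIds.size() + (b.maxLen - p.size()), padId);
    }
    return b;
}
```

The packed inputIds would then be copied to a GPU tensor, while contextLengths stays host-side as documented above.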
- bool executeVanillaDecodingStep(
- rt::Tensor const &inputIds,
- rt::Tensor &outputLogits,
- cudaStream_t stream
API entry to execute one vanilla decoding engine action for a batched request. The API performs decoding operations that fill the KVCache with the newly generated tokens and produce the output logits. The decoding operation shall be performed after the prefill step has completed.
Inputs: inputIds [GPU]: The input token_ids for the batch of new requests.
Outputs: outputLogits [GPU]: The output logits for the batch of requests.
stream: The CUDA stream to execute the decoding step.
Returns: True if the decoding step is successful, false otherwise.
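A typical driver loop samples the next token for each sequence from outputLogits after every decoding step and feeds it back as the next inputIds. A minimal host-side sketch of greedy argmax selection (in practice the logits stay on the GPU and sampling may use top-k/top-p instead; greedyNextTokens is a hypothetical helper):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Pick the next token per sequence by greedy argmax over
// row-major [batch, vocabSize] logits.
std::vector<int32_t> greedyNextTokens(std::vector<float> const& logits,
                                      int32_t batch, int32_t vocabSize) {
    std::vector<int32_t> tokens(batch);
    for (int32_t b = 0; b < batch; ++b) {
        auto row = logits.begin() + static_cast<size_t>(b) * vocabSize;
        // Index of the maximum logit within this sequence's row.
        tokens[b] = static_cast<int32_t>(
            std::max_element(row, row + vocabSize) - row);
    }
    return tokens;
}
```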
- bool executeEagleBaseTreeDecodingStep(
- rt::Tensor const &baseTreeDecodingInputIds,
- rt::Tensor const &baseTreeDecodingMask,
- rt::Tensor &outputLogits,
- rt::Tensor &outputHiddenStates,
- cudaStream_t stream
API entry to execute one Eagle base tree decoding step. The API takes a draft tree of input token_ids; baseTreeDecodingMask denotes the relationship between the draft tree nodes.
Inputs: baseTreeDecodingInputIds [GPU, Int32]: Input token_ids for the base model with shape [1, Tree-Size]. baseTreeDecodingMask [GPU, Int32]: Denotes the relationship between the base tree nodes with shape [1, Tree-Size, Tree-Size].
Outputs: outputLogits [GPU, Float16]: The output logits with shape [topK, base-Vocab-Size]. outputHiddenStates [GPU]: The output hidden states with shape [topK, base-hidden-dim].
stream: The CUDA stream to execute the base tree decoding step.
- bool captureVanillaDecodingCudaGraph(
- rt::Tensor const &inputIds,
- rt::Tensor &outputLogits,
- std::string const &loraWeightsName,
- cudaStream_t stream
API entry to capture the CUDA graph for the decoding step. If CUDA graph capture is successful, later calls to executeVanillaDecodingStep() will always launch the captured CUDA graph.
Inputs: inputIds [GPU]: The input token_ids for the batch of new requests. outputLogits [GPU]: The output logits for the batch of requests. loraWeightsName: The name of the LoRA weights; an empty string if no LoRA weights. stream: The CUDA stream to capture the CUDA graph.
Returns: True if the CUDA graph capture is successful, false otherwise.
- bool switchLoraWeights(
- std::string const &loraWeightsName,
- cudaStream_t stream
API entry to switch the LoRA weights of the LLM engine. Inputs: loraWeightsName: The name of the LoRA weights. stream: The CUDA stream to execute the switch step. Returns: True if the LoRA weights switch is successful, false otherwise.
-
std::string getActiveLoraWeightsName() const#
API entry to get the active LoRA weights name. Returns: The active LoRA weights name.
-
std::vector<std::string> getAvailableLoraWeights() const#
API entry to get the names of the available LoRA weights. Returns: The available LoRA weights names.
- bool captureEagleBaseTreeDecodingCudaGraph(
- rt::Tensor const &baseTreeDecodingInputIds,
- rt::Tensor const &baseTreeDecodingMask,
- rt::Tensor &outputLogits,
- rt::Tensor &outputHiddenStates,
- cudaStream_t stream
API entry to capture the CUDA graph for the base model tree decoding step. If CUDA graph capture is successful, later calls to executeEagleBaseTreeDecodingStep() will always launch the captured CUDA graph.
Inputs: baseTreeDecodingInputIds [GPU, Int32]: Input token_ids for the base model with shape [1, Tree-Size]. baseTreeDecodingMask [GPU, Int32]: Denotes the relationship between the base tree nodes with shape [1, Tree-Size, Tree-Size].
Outputs: outputLogits [GPU, Float16]: The output logits with shape [topK, base-Vocab-Size]. outputHiddenStates [GPU]: The output hidden states with shape [topK, base-hidden-dim].
stream: The CUDA stream to capture the CUDA graph.
Returns: True if the CUDA graph capture is successful, false otherwise.
-
struct LLMEngineRunnerConfig#
Configuration structure for LLM engine runner.
Contains all runtime configuration parameters for the LLM engine.
Public Members
-
RopeConfig ropeConfig = {}#
Rotary positional encoding (RoPE) configuration.
-
bool useContextDependentRope = {false}#
Use context-dependent RoPE.
-
bool enableEagleSpecDecode = {false}#
Enable Eagle speculative decoding.
-
bool isVlm = {false}#
Whether this is a Vision-Language Model.
-
int32_t numDecoderLayers = {}#
Number of decoder layers.
-
int32_t numKVHeads = {}#
Number of key-value heads.
-
int32_t headDim = {}#
Dimension of each attention head.
-
int32_t rotaryDim = {}#
Rotary embedding dimension.
-
int32_t hiddenDim = {}#
Model's hidden dimension.
-
int32_t maxSupportedBatchSize = {}#
Maximum supported batch size.
-
int32_t minSupportedInputLength = {}#
Minimum supported input length.
-
int32_t maxSupportedInputLength = {}#
Maximum supported input length.
-
int32_t maxKVCacheCapacity = {}#
Maximum KV cache capacity.
-
int32_t vocabSize = {}#
Vocabulary size (full vocabulary).
-
int32_t reducedVocabSize = {0}#
Reduced vocabulary size (0 if not using reduced vocab).
-
int32_t outputVocabSize = {}#
Actual output vocabulary size (reducedVocabSize if enabled, else vocabSize).
-
int32_t maxSupportedLoraRank = {}#
Maximum supported LoRA rank.
-
int32_t outputHiddenDim = {}#
Output hidden dimension for Eagle speculative decoding (hidden_size * 3).
-
int32_t maxVerifyTreeSize = {}#
Maximum verification tree size for Eagle speculative decoding.
-
int32_t numDeepstackFeatures = {}#
Number of deepstack features for Qwen3-VL.
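The config fields above determine the footprint of the linear KV cache held by the runner: one K and one V entry per layer, per KV head, per head dimension, per cache slot. A back-of-the-envelope sizing sketch (FP16 storage and this exact multiplicative layout are assumptions for illustration; the runner's actual allocation strategy may differ):

```cpp
#include <cstddef>
#include <cstdint>

// Estimate linear KV cache bytes from LLMEngineRunnerConfig-style fields.
size_t linearKVCacheBytes(int32_t numDecoderLayers, int32_t numKVHeads,
                          int32_t headDim, int32_t maxKVCacheCapacity,
                          int32_t maxSupportedBatchSize,
                          size_t bytesPerElem = 2 /* FP16 */) {
    return 2ull /* K and V */ * numDecoderLayers * numKVHeads * headDim
         * maxKVCacheCapacity * maxSupportedBatchSize * bytesPerElem;
}
```

For example, 2 layers with 4 KV heads of dimension 64, a 1024-slot cache, and batch size 1 come to 2 MiB at FP16.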