LLM Inference Runtime#
Architecture#
The LLM Inference Runtime (LLMInferenceRuntime) is the unified runtime for all LLM inference in TensorRT Edge-LLM. It supports both standard autoregressive (vanilla) decoding and speculative decoding modes (EAGLE, MTP, etc.) through a pluggable DecodingStrategy layer. When constructed without a drafting config, it operates as a pure vanilla decoding runtime with zero draft-model memory overhead. When constructed with a SpecDecodeDraftingConfig, it additionally loads a draft model and enables speculative decoding strategies.
%%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%
graph TB
CLIENT[Client Application]
RT[LLM Inference Runtime]
MULTIMODAL[Multimodal Runner]
TOKENIZER[Tokenizer]
DECODER_REG[Decoder Registry<BR>& Decoding Strategies]
CLIENT -->|handleRequest| RT
RT -->|owns| TOKENIZER
RT -->|owns optional| MULTIMODAL
RT -->|owns| DECODER_REG
subgraph ENGINE_EXEC[Engine Execution]
BASE_EXEC[Engine Executor<BR>Base Model]
DRAFT_EXEC[Engine Executor<BR>Draft Model<BR>optional]
end
subgraph END_NODES[" "]
subgraph DRAFT_COMPONENTS[Draft Model — optional]
DRAFT_KV[Linear KV Cache]
DRAFT_ENGINE[TRT Engine]
end
subgraph COMMON_KERNELS["Common Kernels"]
SPECKERNELS[Speculative Decode<br>Util Kernels]
SAMPLING[Sampling<br>Kernels]
end
subgraph BASE_COMPONENTS [Base Model]
BASE_KV[Linear KV Cache]
BASE_ENGINE[TRT Engine]
end
end
MULTIMODAL -->|provides embeddings| ENGINE_EXEC
TOKENIZER -->|encode/decode| ENGINE_EXEC
RT -->|owns & manages| ENGINE_EXEC
RT -->|calls| COMMON_KERNELS
DECODER_REG -->|selects strategy| ENGINE_EXEC
BASE_EXEC -->|owns & manages| BASE_KV
BASE_EXEC -->|executes| BASE_ENGINE
DRAFT_EXEC -->|owns & manages| DRAFT_KV
DRAFT_EXEC -->|executes| DRAFT_ENGINE
DRAFT_EXEC -.->|candidate tokens| BASE_EXEC
BASE_EXEC -.->|verification results| DRAFT_EXEC
%% Styling
classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
classDef greyNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef greenSubGraph fill:none,stroke:#76B900,stroke-width:1.5px
classDef greySubGraph fill:none,stroke:#aaa,stroke-width:1.5px
classDef invisibleSubGraph fill:transparent,stroke:transparent
class CLIENT inputNode
class RT,BASE_EXEC,DRAFT_EXEC,TOKENIZER,MULTIMODAL,DECODER_REG,DRAFT_KV,BASE_KV,SPECKERNELS,SAMPLING nvNode
class DRAFT_ENGINE,BASE_ENGINE greyNode
class ENGINE_EXEC greenSubGraph
class DRAFT_COMPONENTS,BASE_COMPONENTS,COMMON_KERNELS greySubGraph
class END_NODES invisibleSubGraph
Key Components#
Component |
Description |
|---|---|
Engine Executor (Base) |
Executes TensorRT engines and manages dual-phase inference for the base model. Core engine execution component owned by |
Engine Executor (Draft — optional) |
Specialized engine executor for draft models used in speculative decoding (e.g. EAGLE). Only instantiated when a |
Decoder Registry & Decoding Strategies |
Pluggable decoding strategy layer introduced by the runtime refactor. The |
Tokenizer |
HuggingFace-compatible text tokenization system. Converts between text and token IDs using Byte-Pair Encoding (BPE). Supports various model vocabularies (GPT, Llama, Qwen) with configurable special tokens and preprocessing steps. Files: |
Multimodal Runner |
Vision and audio processing for multimodal models (VLMs). Processes image/audio inputs through Vision/Audio Transformer models and generates embeddings. Supports Qwen-VL, InternVL, and Qwen-Audio architectures. Integrates multimodal embeddings with text tokens for multimodal inference. Files: |
Linear KV Cache |
Attention key-value cache management. Each engine executor maintains its own Linear KV Cache instance. Stores attention key-value pairs across inference steps for efficient autoregressive generation. Uses linear memory layout optimized for GPU access with support for batched processing and variable sequence lengths. Files: |
Sampling Kernels |
Token generation from model logits. Converts model output logits into probability distributions and samples the next token using configurable strategies (greedy, top-k, top-p, temperature). Called directly by the runtime (not by engine executors) after engine execution produces logits. Operates on GPU for efficient batch processing. Files: |
Speculative Decode Util Kernels |
Speculative decoding utility kernels (only used when a draft model is present). Specialized CUDA kernels for tree-based speculative decoding operations (EAGLE). Handles tree construction, candidate token generation, verification logic, and accept/reject mechanisms for speculative tokens. Files: |
TRT Engines |
TensorRT inference engines. In vanilla mode, a single base engine is used. In speculative decoding mode, dual engines (base + draft) are used. Optimized TensorRT engines compiled from ONNX models. Provides high-performance inference through TensorRT optimizations including kernel fusion, precision calibration, and memory optimization. Files: Built by |
Inference Workflow#
Vanilla Decoding (No Draft Model)#
When constructed without a SpecDecodeDraftingConfig, the runtime performs standard autoregressive generation:
%%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%
graph LR
INPUT_PROMPT(Input<br/>Prompt) --> TOKENIZER(Tokenize)
subgraph VIT_BOX ["Optional"]
VIT_PROCESS(ViT<br>Processing)
end
TOKENIZER --> VIT_PROCESS
VIT_PROCESS --> PREFILL_ENGINE(Prefill<br/>**TRT Engine**)
PREFILL_ENGINE --> GENERATE_KV[Generate<br/>KV-Cache]
GENERATE_KV --> SAMPLE_FIRST(Sample First<br/>Token)
SAMPLE_FIRST --> GENERATION_ENGINE[Generation<br/>**TRT Engine** or<br>**CUDA Graph**]
GENERATION_ENGINE --> UPDATE_KV(Update<br>KV Cache)
UPDATE_KV --> SAMPLE_TOKEN(Sample Next<br/>Token)
SAMPLE_TOKEN --> STOP_CHECK{Stop<br/>Condition?}
STOP_CHECK -->|N| GENERATION_ENGINE
STOP_CHECK -->|Y| OUTPUT_SEQUENCE(Generated<br/>Sequence)
subgraph PHASE1 ["Phase 1: Prefill"]
PREFILL_ENGINE
GENERATE_KV
SAMPLE_FIRST
end
subgraph PHASE2 ["Phase 2: Generation"]
GENERATION_ENGINE
UPDATE_KV
SAMPLE_TOKEN
STOP_CHECK
OUTPUT_SEQUENCE
end
classDef greyNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
classDef nvLightNode fill:#b8d67e,stroke:#76B900,stroke-width:1px,color:#333
classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef itemNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
classDef lightSubGraph fill:none,stroke:#aaa,stroke-width:1.5px
classDef optionalBox fill:none,stroke:#aaa,stroke-width:1px,stroke-dasharray:5 5
class INPUT_PROMPT inputNode
class TOKENIZER,SAMPLE_FIRST,SAMPLE_TOKEN,VIT_PROCESS,STOP_CHECK,PREFILL_ENGINE,GENERATION_ENGINE greyNode
class GENERATE_KV,UPDATE_KV nvLightNode
class OUTPUT_SEQUENCE darkNode
class PHASE1,PHASE2 lightSubGraph
class VIT_BOX optionalBox
Phase 1: Prefill Processing
Input Processing: Text is tokenized and padded to batch requirements
Multimodal Integration: For VLMs, vision/audio embeddings are processed and integrated with text embeddings
Parallel Execution: All prompt tokens are processed simultaneously through transformer layers
KV-Cache Generation: Key-value cache is populated for all prompt tokens
First Token Sampling: Initial generated token is sampled from output logits
Phase 2: Generation (Autoregressive Decode)
Sequential Processing: Each iteration processes the previously generated token
KV-Cache Reuse: Leverages accumulated key-value cache from previous steps
CUDA Graph Optimization: Optional CUDA graph capture reduces kernel launch overhead
Sampling Strategies: Configurable token generation (greedy, top-k, top-p, temperature)
Stopping Criteria: Continues until EOS token, maximum length, or custom conditions
EAGLE Speculative Decoding (With Draft Model)#
When constructed with a SpecDecodeDraftingConfig, the runtime uses EAGLE tree-based speculative decoding for accelerated generation:
Phase 1: Prefill
%%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%
graph LR
INPUT_PROMPT(Input<BR>Prompt)
TOKENIZER(Tokenize)
subgraph VIT_BOX ["Optional"]
VIT_PROCESS(ViT<br>Processing)
end
BASE_PREFILL_ENGINE[Base Prefill<BR>**TRT Engine**]
BASE_KV_GEN(Generate Base<BR>KV-Cache, Logits <BR>& Hidden States)
BASE_SAMPLE(Sample)
DRAFT_PREFILL_ENGINE[Draft Prefill<BR>**TRT Engine**]
DRAFT_KV_GEN(Generate Draft<BR>KV-Cache, Logits<BR>& Hidden States)
PHASE2_START(Phase 2:<BR>Generation)
INPUT_PROMPT --> TOKENIZER
TOKENIZER --> VIT_PROCESS
VIT_PROCESS --> BASE_PREFILL_ENGINE
BASE_PREFILL_ENGINE --> BASE_KV_GEN
BASE_KV_GEN --> BASE_SAMPLE
BASE_SAMPLE --> DRAFT_PREFILL_ENGINE
DRAFT_PREFILL_ENGINE --> DRAFT_KV_GEN
DRAFT_KV_GEN --> PHASE2_START
classDef greyNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
classDef nvLightNode fill:#b8d67e,stroke:#76B900,stroke-width:1px,color:#333
classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef optionalBox fill:none,stroke:#aaa,stroke-width:1px,stroke-dasharray:5 5
class PHASE2_START nvNode
class INPUT_PROMPT inputNode
class TOKENIZER,VIT_PROCESS,BASE_PREFILL_ENGINE,DRAFT_PREFILL_ENGINE,BASE_SAMPLE greyNode
class BASE_KV_GEN,DRAFT_KV_GEN nvLightNode
class VIT_BOX optionalBox
Phase 2: Generation
%%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%
graph LR
PHASE1_INPUT(Phase 1:<BR>Prefill)
DRAFT_ACCEPT_ENGINE[Draft Accept Token<BR>**TRT Engine**]
DRAFT_KV_GEN2(Generate Draft<BR>KV-Cache, Logits<BR>& Hidden States)
subgraph TREE_CONSTRUCTION ["Draft Tree Construction"]
DRAFT_PROPOSAL_ENGINE[Draft Batch Proposal<BR>**TRT Engine**]
SELECT_TOP_N(Select Top-N <BR>Tokens via Logits)
UPDATE_DRAFT_KV(Update Draft<BR>KV-Cache & <BR>Hidden States)
TREE_ROUND_CHECK{Tree<BR>Done?}
end
BASE_VERIFY_ENGINE[Base Tree Verification<BR>**TRT Engine**]
EAGLE_ACCEPT(EAGLE Accept Algorithm<BR>Token Selection)
BUILD_KV_CACHE(Update Base Model<BR>KV Cache with <BR>Accepted Tokens)
STOP_CHECK{Stop?}
OUTPUT_SEQUENCE(Generated<BR>Sequence)
PHASE1_INPUT --> DRAFT_KV_GEN2
DRAFT_ACCEPT_ENGINE --> DRAFT_KV_GEN2
DRAFT_KV_GEN2 --> DRAFT_PROPOSAL_ENGINE
DRAFT_PROPOSAL_ENGINE --> SELECT_TOP_N
SELECT_TOP_N --> UPDATE_DRAFT_KV
UPDATE_DRAFT_KV --> TREE_ROUND_CHECK
TREE_ROUND_CHECK -->|N| DRAFT_PROPOSAL_ENGINE
TREE_ROUND_CHECK -->|Y| BASE_VERIFY_ENGINE
BASE_VERIFY_ENGINE --> EAGLE_ACCEPT
EAGLE_ACCEPT --> BUILD_KV_CACHE
BUILD_KV_CACHE --> STOP_CHECK
STOP_CHECK -->|N| DRAFT_ACCEPT_ENGINE
STOP_CHECK -->|Y| OUTPUT_SEQUENCE
classDef greyNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
classDef nvLightNode fill:#b8d67e,stroke:#76B900,stroke-width:1px,color:#333
classDef lightSubGraph fill:none,stroke:#aaa,stroke-width:1.5px
class PHASE1_INPUT nvNode
class TOKENIZER,TREE_ROUND_CHECK,STOP_CHECK,SELECT_TOP_N,DRAFT_ACCEPT_ENGINE,DRAFT_PROPOSAL_ENGINE,BASE_VERIFY_ENGINE greyNode
class DRAFT_KV_GEN2,UPDATE_DRAFT_KV,BUILD_KV_CACHE nvLightNode
class EAGLE_ACCEPT nvNode
class OUTPUT_SEQUENCE darkNode
class TREE_CONSTRUCTION lightSubGraph
Phase 1: Base Model Prefill
EAGLE starts with only the base model prefill:
Base Model Prefill: Standard prefill using
EngineExecutorto establish base model KV-cacheHidden States Generation: Base model produces hidden states required for draft model
Single Prefill: Only base model is prefilled initially; draft model prefill happens later
Multimodal Integration: Vision/audio embeddings processed once and used by base model
Phase 2: EAGLE Speculation Loop
The generation phase uses iterative tree-based speculation with conditional draft prefill:
First Round Only: Draft model prefill using draft
EngineExecutorwith base model hidden statesSubsequent Rounds: Draft model accept token operation instead of full prefill
Draft Tree Construction: Draft model generates candidate token trees using top-k sampling from draft logits
Base Model Verification: Base model processes entire draft tree in parallel and generates logits for all tree positions
EAGLE Accept Algorithm:
Base model’s top-1 predictions are always selected as final tokens
Draft tree tokens are accepted only when they match base model predictions
When draft tokens diverge from base predictions, remaining draft tokens are rejected
Process continues following the draft tree path as long as tokens match
Token Generation Source: All final output tokens come from base model, draft model only provides speculative candidates
CUDA Graph Capture Support: Supports CUDA graph capture for draft proposal, draft accept-token, base verification, and base vanilla decode paths
Iterative Process: Continues until stop conditions or maximum generation length reached
Vanilla vs. Speculative Decoding Differences#
Feature |
Vanilla Mode |
Speculative Decoding Mode |
|---|---|---|
Engines |
Single base engine |
Dual engines (base + draft) |
Prefill |
Base model only |
Base model first, draft model in first generation round |
Generation |
Standard autoregressive, one token per step |
Tree-based speculation, multiple tokens per step |
KV Cache |
Single KV cache |
Separate KV caches for base and draft models |
CUDA Graph |
Vanilla decode path |
Draft proposal, draft accept-token, base verification, base vanilla decode |
Dynamic Batching |
Supported with eviction |
Supported with eviction |
Usage Examples#
Standard LLM Inference (Vanilla)#
#include "runtime/llmInferenceRuntime.h"
#include <unordered_map>
// Initialize CUDA stream
cudaStream_t stream;
CUDA_CHECK(cudaStreamCreate(&stream));
// Initialize runtime — vanilla mode (no draft model)
std::unordered_map<std::string, std::string> loraWeightsMap{}; // Empty for no LoRA
LLMInferenceRuntime runtime(engineDir, "", loraWeightsMap, stream);
// Prepare request
LLMGenerationRequest request;
request.requests.resize(1);
Message userMsg;
userMsg.role = "user";
userMsg.contents.push_back({"text", "What is the capital of France?"});
request.requests[0].messages.push_back(std::move(userMsg));
request.maxGenerateLength = 100;
request.temperature = 1.0;
request.topK = 50;
request.topP = 0.8;
// Prepare response
LLMGenerationResponse response;
// Execute inference
if (runtime.handleRequest(request, response, stream)) {
std::cout << "Generated: " << response.outputTexts[0] << std::endl;
}
// Cleanup
CUDA_CHECK(cudaStreamDestroy(stream));
EAGLE Speculative Decoding#
#include "runtime/llmInferenceRuntime.h"
#include "runtime/llmRuntimeUtils.h"
// Initialize CUDA stream
cudaStream_t stream;
CUDA_CHECK(cudaStreamCreate(&stream));
// Configure EAGLE drafting parameters
SpecDecodeDraftingConfig draftingConfig;
draftingConfig.draftingTopK = 10;
draftingConfig.draftingStep = 6;
draftingConfig.verifyTreeSize = 60;
// Initialize runtime — speculative decoding mode
std::unordered_map<std::string, std::string> loraWeightsMap{};
LLMInferenceRuntime runtime(engineDir, "", loraWeightsMap, draftingConfig, stream);
// Prepare request
LLMGenerationRequest request;
request.requests.resize(1);
rt::Message userMsg;
userMsg.role = "user";
userMsg.contents.push_back({"text", "Explain quantum computing."});
request.requests[0].messages.push_back(std::move(userMsg));
request.maxGenerateLength = 200;
request.temperature = 1.0;
request.topK = 50;
request.topP = 0.8;
// Prepare response
LLMGenerationResponse response;
// Execute inference
if (runtime.handleRequest(request, response, stream)) {
std::cout << "Generated: " << response.outputTexts[0] << std::endl;
}
// Cleanup
CUDA_CHECK(cudaStreamDestroy(stream));
Multimodal VLM Inference#
#include "runtime/llmInferenceRuntime.h"
#include "runtime/imageUtils.h"
#include <unordered_map>
// Initialize CUDA stream
cudaStream_t stream;
CUDA_CHECK(cudaStreamCreate(&stream));
// Initialize multimodal runtime
std::unordered_map<std::string, std::string> loraWeightsMap{};
LLMInferenceRuntime runtime(engineDir, visualEngineDir, loraWeightsMap, stream);
// Prepare multimodal request
LLMGenerationRequest request;
request.requests.resize(1);
Message mmMsg;
mmMsg.role = "user";
// Content metadata in messages should align with the external buffers order.
mmMsg.contents.push_back({"image", "/path/to/image.jpg"});
mmMsg.contents.push_back({"text", "What's in this image?"});
request.requests[0].messages.push_back(std::move(mmMsg));
// Image bytes are provided separately via imageBuffers.
request.requests[0].imageBuffers.push_back(imageUtils::loadImageFromFile("/path/to/image.jpg"));
request.maxGenerateLength = 150;
request.temperature = 1.0;
request.topK = 50;
request.topP = 0.8;
// Prepare response
LLMGenerationResponse response;
// Execute inference
if (runtime.handleRequest(request, response, stream)) {
std::cout << "Generated: " << response.outputTexts[0] << std::endl;
}
// Cleanup
CUDA_CHECK(cudaStreamDestroy(stream));
LoRA Adapter Switching#
#include "runtime/llmInferenceRuntime.h"
#include <unordered_map>
// Initialize CUDA stream
cudaStream_t stream;
CUDA_CHECK(cudaStreamCreate(&stream));
// Initialize runtime with LoRA weights map
std::unordered_map<std::string, std::string> loraWeightsMap{
{"medical", "lora_weights/medical_adapter.safetensors"},
{"legal", "lora_weights/legal_adapter.safetensors"}
};
LLMInferenceRuntime runtime(engineDir, "", loraWeightsMap, stream);
// Execute inference with different LoRA adapters
LLMGenerationRequest medicalRequest;
medicalRequest.requests.resize(1);
{
Message msg;
msg.role = "user";
msg.contents.push_back({"text", "Medical question"});
medicalRequest.requests[0].messages.push_back(std::move(msg));
}
medicalRequest.loraWeightsName = "medical";
medicalRequest.maxGenerateLength = 100;
LLMGenerationResponse medicalResponse;
runtime.handleRequest(medicalRequest, medicalResponse, stream);
// Disable LoRA (use empty string)
LLMGenerationRequest baseRequest;
baseRequest.requests.resize(1);
{
Message msg;
msg.role = "user";
msg.contents.push_back({"text", "Base question"});
baseRequest.requests[0].messages.push_back(std::move(msg));
}
baseRequest.loraWeightsName = "";
baseRequest.maxGenerateLength = 100;
LLMGenerationResponse baseResponse;
runtime.handleRequest(baseRequest, baseResponse, stream);
// Cleanup
CUDA_CHECK(cudaStreamDestroy(stream));