# C++ Runtime Overview

## Overview
The TensorRT Edge-LLM C++ Runtime provides a comprehensive inference system for Large Language Models (LLMs) and Vision Language Models (VLMs) built on top of TensorRT. The runtime implements a layered architecture that manages the autoregressive decoding loop required for language model inference, handling everything from tokenization to final text generation.
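To make the layering concrete, the following sketch shows the shape of the autoregressive loop the runtime drives, from tokenization through repeated decode steps to final text generation. The `Tokenizer` and `Engine` interfaces here are illustrative placeholders, not the actual Edge-LLM API:

```cpp
#include <string>
#include <vector>

// Illustrative placeholder interfaces -- not the actual Edge-LLM API.
struct Tokenizer
{
    std::vector<int> encode(const std::string& text) const;
    std::string decode(const std::vector<int>& tokens) const;
    int eosId() const;
};

struct Engine
{
    // Runs one decoder step over the current sequence and returns the
    // sampled next token (greedy sampling assumed for simplicity).
    int forwardAndSample(const std::vector<int>& tokens);
};

std::string generate(Tokenizer& tokenizer, Engine& engine,
                     const std::string& prompt, int maxNewTokens)
{
    std::vector<int> tokens = tokenizer.encode(prompt); // tokenization
    for (int step = 0; step < maxNewTokens; ++step)
    {
        int next = engine.forwardAndSample(tokens);     // one decode step
        if (next == tokenizer.eosId())                  // stop at end-of-sequence
        {
            break;
        }
        tokens.push_back(next);                         // feed the token back in
    }
    return tokenizer.decode(tokens);                    // final text generation
}
```

The real runtime layers additional concerns on top of this loop (KV-cache management, batching, sampling configuration), but the control flow follows this basic shape.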
## Purpose
The C++ Runtime serves as the final stage in the TensorRT Edge-LLM workflow:
```mermaid
%%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%
graph LR
    HF_MODEL[HuggingFace<br>Model]
    PYTHON_EXPORT[Python<br>Export<br>Pipeline]
    ONNX_FILES[ONNX<br>Models]
    ENGINE_BUILDER[Engine<br>Builder]
    TRT_ENGINE[TensorRT<br>Engine]
    OUTPUT[Inference<br>Results]
    subgraph RUNTIME_SG [" "]
        CPP_RUNTIME[C++<br>Runtime]
    end
    HF_MODEL --> PYTHON_EXPORT
    PYTHON_EXPORT --> ONNX_FILES
    ONNX_FILES --> ENGINE_BUILDER
    ENGINE_BUILDER --> TRT_ENGINE
    TRT_ENGINE --> CPP_RUNTIME
    CPP_RUNTIME --> OUTPUT
    classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
    classDef nvLightNode fill:#b8d67e,stroke:#76B900,stroke-width:1px,color:#333
    classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
    classDef itemNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
    classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
    classDef greenSubGraph fill:none,stroke:#76B900,stroke-width:1.5px
    class HF_MODEL inputNode
    class PYTHON_EXPORT,ENGINE_BUILDER nvLightNode
    class CPP_RUNTIME nvNode
    class ONNX_FILES,TRT_ENGINE itemNode
    class OUTPUT darkNode
    class RUNTIME_SG greenSubGraph
```
## Runtime Architecture
The C++ runtime is organized around two distinct, mutually exclusive runtime implementations that serve different inference scenarios. Both runtimes share the same high-level API (`handleRequest`) but implement fundamentally different execution strategies, as the usage sketch after the table illustrates:
| Component | Description |
|---|---|
| LLM Inference Runtime | Top-level orchestrator for standard and multimodal inference. Owns and coordinates all components of the inference pipeline, creating and directly managing a single engine runner. |
| LLM Inference SpecDecode Runtime | Specialized runtime for EAGLE speculative decoding. Completely separate from the LLM Inference Runtime; owns and coordinates two distinct engine runners, one for the target model and one for the EAGLE draft model. |
## Next Steps

- **Standard Inference**: Learn about the LLM Inference Runtime
- **EAGLE Speculative Decoding**: Refer to the LLM Inference SpecDecode Runtime
- **Advanced Features**: Explore Advanced Runtime Features
- **Try Examples**: Run the Examples to see the runtime in action
## Additional Resources

- **Runtime API**: Refer to the `cpp/runtime/` directory
- **Example Applications**: Refer to `examples/llm/` and `examples/multimodal/`
- **Architecture Overview**: Refer to the Overview