Engine Builder#
Overview#
The TensorRT Edge-LLM Engine Builder is a C++ component that converts ONNX models into optimized TensorRT engines specifically designed for edge deployment. The builder abstracts the complexity of TensorRT engine compilation, providing specialized builders for different model architectures.
Purpose#
The Engine Builder serves as the second stage in the TensorRT Edge-LLM workflow:
%%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%
graph LR
HF_MODEL[HuggingFace<br>Model]
PYTHON_EXPORT[Python<br>Export<br>Pipeline]
ONNX_FILES[ONNX<br>Models]
TRT_ENGINE[TensorRT<br>Engine]
CPP_RUNTIME[C++<br>Runtime]
OUTPUT[Inference<br>Results]
subgraph BUILDER_SG [" "]
ENGINE_BUILDER[Engine<br>Builder]
end
HF_MODEL --> PYTHON_EXPORT
PYTHON_EXPORT --> ONNX_FILES
ONNX_FILES --> ENGINE_BUILDER
ENGINE_BUILDER --> TRT_ENGINE
TRT_ENGINE --> CPP_RUNTIME
CPP_RUNTIME --> OUTPUT
classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef nvLightNode fill:#b8d67e,stroke:#76B900,stroke-width:1px,color:#333
classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
classDef itemNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
classDef greenSubGraph fill:none,stroke:#76B900,stroke-width:1.5px
class HF_MODEL inputNode
class PYTHON_EXPORT,CPP_RUNTIME nvLightNode
class ENGINE_BUILDER nvNode
class ONNX_FILES,TRT_ENGINE itemNode
class OUTPUT darkNode
class BUILDER_SG greenSubGraph
Key Responsibilities:
Parse ONNX models and extract network structure
Configure optimization profiles for dynamic shapes
Compile TensorRT engines with platform-specific optimizations
Generate runtime configuration files
Handle model-specific requirements (EAGLE, VLM, LoRA)
Compatibility and Model Integrity#
⚠️ VERSION COMPATIBILITY: ONNX models and TensorRT engines are NOT portable across different versions of TensorRT Edge-LLM or TensorRT. Always re-export ONNX models and rebuild engines when upgrading versions.
⚠️ USER RESPONSIBILITY: Users are responsible for verifying model integrity before export and build, validating outputs, and performing model sanity checks throughout the pipeline. See the Security and Model Integrity section in the Python Export Pipeline guide for verification best practices.
Build Process Workflow#
%%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%
graph LR
subgraph INPUT_STAGE [" "]
INPUT_SPACER_LEFT
ONNX_INPUT[ONNX Models<br>+ Export Configs]
end
subgraph ENGINE_BUILDER ["Engine Builder"]
BUILDER_SPACER_1
BUILDER_SPACER_2
LLM_BUILDER[LLM<br>Builder]
TEXT_LLM("← IF LLM")
TEXT_VISUAL("IF Vision Encoder →")
VISION_BUILDER[Vision Encoder<br>Builder]
PLUGIN_LOADING( 1 <br>Plugin<br>Loading)
CONFIG_PARSING( 2 <br>Configuration<br>Parsing)
NETWORK_CREATION( 3 <br>Network<br>Creation)
MODEL_TYPE( 4 <br>Model Type<br>Detection)
ONNX_PARSING( 5 <br>ONNX<br>Parsing)
PROFILE_SETUP( 6 <br>Optimization Profile<br>Setup)
ENGINE_COMPILATION( 7 <br>Engine Compilation<br>*via TensorRT Builder*)
FILE_MANAGEMENT( 8 <br>File<br>Management)
BUILDER_SPACER_BOTTOM
end
INPUT_SPACER_LEFT -.-> BUILDER_SPACER_1 -.-> BUILDER_SPACER_2 -.-> LLM_BUILDER -.-> TEXT_LLM -.-> TEXT_VISUAL -.-> VISION_BUILDER
linkStyle 0 stroke:transparent
linkStyle 1 stroke:transparent
linkStyle 2 stroke:transparent
linkStyle 3 stroke:transparent
linkStyle 4 stroke:transparent
linkStyle 5 stroke:transparent
subgraph RESULTS [" "]
RUNTIME_CONFIG[Runtime<br>Config]
TOKENIZER_FILES[Tokenizer<br>Files]
TRT_ENGINE[TensorRT<br>Engine]
EAGLE_MAPPINGS[EAGLE<br>Mappings]
end
ONNX_INPUT --> PLUGIN_LOADING
PLUGIN_LOADING --> CONFIG_PARSING
CONFIG_PARSING --> NETWORK_CREATION
NETWORK_CREATION --> MODEL_TYPE
MODEL_TYPE --> ONNX_PARSING
ONNX_PARSING --> PROFILE_SETUP
PROFILE_SETUP --> ENGINE_COMPILATION
ENGINE_COMPILATION --> FILE_MANAGEMENT
FILE_MANAGEMENT --> RUNTIME_CONFIG
FILE_MANAGEMENT --> TOKENIZER_FILES
FILE_MANAGEMENT --> TRT_ENGINE
FILE_MANAGEMENT -->|IF LLM Builder<br>& IF EAGLE| EAGLE_MAPPINGS
classDef greyNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef itemNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
classDef greySubGraph fill:none,stroke:#bbb,stroke-width:1px
classDef greenSubGraph fill:none,stroke:#76B900,stroke-width:1.5px
classDef contextBox fill:none,stroke:transparent
classDef invisibleNode fill:transparent,stroke:transparent
classDef invisible fill:transparent,stroke:transparent,color:transparent,font-size:1px
classDef invisibleSubGraph fill:transparent,stroke:transparent
classDef textOnly fill:transparent,stroke:transparent
class VISION_BUILDER,LLM_BUILDER nvNode
class TEXT_LLM,TEXT_VISUAL textOnly
class ONNX_INPUT inputNode
class RUNTIME_CONFIG,TOKENIZER_FILES,TRT_ENGINE,EAGLE_MAPPINGS darkNode
class PLUGIN_LOADING,CONFIG_PARSING,NETWORK_CREATION,MODEL_TYPE,ONNX_PARSING,PROFILE_SETUP,ENGINE_COMPILATION,FILE_MANAGEMENT greyNode
class INPUT_SPACER_LEFT,BUILDER_SPACER_1,BUILDER_SPACER_2,BUILDER_SPACER_BOTTOM invisible
class INPUT_STAGE,RESULTS contextBox
class ENGINE_BUILDER greenSubGraph
For EAGLE3 speculative decoding, the LLM Builder executes this same 8-stage workflow twice with different configurations to create both base and draft models. This approach leverages the same optimization pipeline while producing specialized engines for each role in the speculative decoding process.
Component Overview#
The Engine Builder consists of two main components designed to handle different aspects of multimodal AI:
| Component | Description |
|---|---|
| LLM Builder | Converts language model ONNX files into optimized TensorRT engines. Supports standard LLMs, EAGLE3 speculative decoding, VLM language components, and LoRA adaptations. |
| Visual Encoder Builder | Converts visual encoder ONNX files into optimized TensorRT engines for multimodal models. Supports dynamic image token generation, multiple multimodal architectures, and variable resolutions. |
Build Process Stages#
Both the LLM Builder and Visual Encoder Builder follow a systematic multi-stage process to convert ONNX models into optimized TensorRT engines. The build process includes optional debugging steps that log network information for troubleshooting and validation purposes.
| Step | General Description | LLM Builder | Visual Encoder Builder |
|---|---|---|---|
| 1. Plugin Loading | Loads TensorRT Edge-LLM custom plugins required for model-specific operations | Attention mechanisms, quantization operations, autoregressive generation patterns | Vision processing operations, image patch handling, vision transformer components |
| 2. Configuration Parsing | Extracts model parameters and build settings from configuration files | Parses the LLM's exported configuration | Parses the visual encoder's exported configuration |
| 3. Network Creation | Creates TensorRT network definition with strongly typed tensors | Autoregressive language generation with attention mechanisms | Vision processing with image patch handling and feature extraction |
| 4. Model Type Detection | Identifies the specific model architecture and selects appropriate ONNX files | Standard LLM, EAGLE base/draft, VLM language component, LoRA-enabled; selects the ONNX file matching the detected type | Multimodal vision encoders; always uses the visual encoder ONNX file |
| 5. ONNX Parsing | Parses the ONNX model file and populates the TensorRT network with the model's computational graph | Attention layers, embedding layers, autoregressive components | Vision transformer layers, patch embedding, image processing components |
| 6. Optimization Profile Setup | Configures optimization profiles for dynamic input shapes and batch processing | Dual-phase profiles (context/prefill and generation/decode phases) for various model types | Single profile for image processing with dynamic image token counts and variable resolutions |
| 7. Engine Compilation | Invokes the TensorRT builder API to compile and serialize the optimized engine | Autoregressive generation, attention patterns, memory efficiency | Vision transformer workloads, image processing, feature extraction |
| 8. File Management | Copies and generates necessary runtime files and configurations with automatic directory creation | Runtime config and tokenizer files (`tokenizer.json`, `tokenizer_config.json`) | Runtime configuration only (`config.json`) |
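In code, these stages map onto a handful of public TensorRT C++ API calls. The following is a minimal, self-contained sketch of the happy path through stages 1, 3, 5, 6, 7, and 8; the tensor name `input_ids`, the file paths, and the shape values are illustrative assumptions rather than the builder's actual configuration, and the real implementation in `cpp/builder/` adds model type detection, error handling, and the auxiliary file management described above.

```cpp
#include <NvInfer.h>
#include <NvInferPlugin.h>
#include <NvOnnxParser.h>
#include <cstdio>
#include <fstream>
#include <memory>

using namespace nvinfer1;

// Minimal logger required by the TensorRT builder entry points.
class Logger : public ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::printf("%s\n", msg);
    }
} gLogger;

int main() {
    // Stage 1: load the standard TensorRT plugins (the Edge-LLM custom
    // plugins would be loaded from their own shared library in addition).
    initLibNvInferPlugins(&gLogger, "");

    // Stage 3: create a strongly typed network definition.
    auto builder = std::unique_ptr<IBuilder>(createInferBuilder(gLogger));
    auto network = std::unique_ptr<INetworkDefinition>(builder->createNetworkV2(
        1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kSTRONGLY_TYPED)));

    // Stage 5: parse the ONNX model into the network.
    auto parser = std::unique_ptr<nvonnxparser::IParser>(
        nvonnxparser::createParser(*network, gLogger));
    if (!parser->parseFromFile("onnx_models/model.onnx",
                               static_cast<int>(ILogger::Severity::kWARNING)))
        return 1;

    // Stage 6: declare how the dynamic input shapes may vary at runtime.
    auto config = std::unique_ptr<IBuilderConfig>(builder->createBuilderConfig());
    IOptimizationProfile* profile = builder->createOptimizationProfile();
    profile->setDimensions("input_ids", OptProfileSelector::kMIN, Dims2{1, 1});
    profile->setDimensions("input_ids", OptProfileSelector::kOPT, Dims2{1, 256});
    profile->setDimensions("input_ids", OptProfileSelector::kMAX, Dims2{1, 1024});
    config->addOptimizationProfile(profile);

    // Stage 7: compile and serialize the engine.
    auto plan = std::unique_ptr<IHostMemory>(
        builder->buildSerializedNetwork(*network, *config));
    if (!plan) return 1;

    // Stage 8: write the serialized engine to disk.
    std::ofstream out("engines/llm.engine", std::ios::binary);
    out.write(static_cast<const char*>(plan->data()),
              static_cast<std::streamsize>(plan->size()));
    return 0;
}
```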
LLM Builder#
The LLM Builder is the core component responsible for converting language model ONNX files into optimized TensorRT engines. It provides a unified interface for building various types of language models while handling the complexity of model-specific optimizations and configurations.
Supported Model Types#
The LLM Builder adapts its optimization profiles and tensor configurations based on the specific model type and requirements:
Standard LLMs (setupVanillaProfiles):
Input Tensors: Input IDs, attention masks, position embeddings
Dynamic Shapes: Variable sequence lengths and batch sizes
KV Cache: Key-value cache tensors for autoregressive generation
EAGLE Models (setupEagleProfiles):
Base Models: Standard LLM inputs plus EAGLE-specific attention patterns and tree verification
Draft Models: Hidden states from draft, tree attention masks, speculation tokens
Tree Size Limits: Configurable maximum tree sizes for verification (`maxVerifyTreeSize`) and draft generation (`maxDraftTreeSize`)
Vocabulary Mapping: Draft-to-target token mapping for EAGLE3 (`d2t.safetensors` file)
Vision-Language Models (setupVLMProfiles):
Image Embeddings: Dynamic image token inputs from visual encoder
Multimodal Integration: Coordination between vision and language processing
Token Flexibility: Variable image token counts based on input resolution
LoRA-Enabled Models (setupLoraProfiles):
Adapter Matrices: Dynamic LoRA weight matrices with configurable ranks
Runtime Switching: Support for multiple LoRA adapters per model
Memory Optimization: Efficient adapter weight storage and loading
Dual-Phase Optimization#
The LLMBuilder creates two optimization profiles for efficient inference:
Context Profile (Prefill Phase):
Purpose: Optimized for processing input prompts and initial context
Characteristics: Large sequence lengths, parallel processing, memory-intensive
Batch Sizes: Typically smaller batches due to memory constraints
Generation Profile (Decode Phase):
Purpose: Optimized for autoregressive token generation
Characteristics: Single token generation, sequential processing, compute-intensive
Batch Sizes: Larger batches possible due to smaller memory footprint
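In TensorRT terms, the two phases become two optimization profiles attached to the same builder configuration, and the runtime binds the appropriate profile for each phase. A minimal sketch, assuming a hypothetical `input_ids` tensor of shape (batch, sequence) built with `maxBatchSize=1` and `maxInputLen=1024`:

```cpp
#include <NvInfer.h>

using namespace nvinfer1;

// Attach the two phase-specific profiles to an existing builder config.
// The tensor name "input_ids" and the shape bounds are illustrative.
void addDualPhaseProfiles(IBuilder& builder, IBuilderConfig& config) {
    // Context (prefill) profile: tuned for full prompt lengths.
    IOptimizationProfile* context = builder.createOptimizationProfile();
    context->setDimensions("input_ids", OptProfileSelector::kMIN, Dims2{1, 1});
    context->setDimensions("input_ids", OptProfileSelector::kOPT, Dims2{1, 512});
    context->setDimensions("input_ids", OptProfileSelector::kMAX, Dims2{1, 1024});
    config.addOptimizationProfile(context);

    // Generation (decode) profile: exactly one new token per step, so the
    // selected kernels are tuned for short sequences and low latency.
    IOptimizationProfile* generation = builder.createOptimizationProfile();
    generation->setDimensions("input_ids", OptProfileSelector::kMIN, Dims2{1, 1});
    generation->setDimensions("input_ids", OptProfileSelector::kOPT, Dims2{1, 1});
    generation->setDimensions("input_ids", OptProfileSelector::kMAX, Dims2{1, 1});
    config.addOptimizationProfile(generation);
}
```

The prefill pass then executes under the first profile and each decode step under the second, so both phases run with shapes TensorRT has specifically optimized for.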
Output Generation#
Based on model type, the LLMBuilder produces different output configurations:
Standard Models:
`llm.engine`: TensorRT engine file
`config.json`: Runtime configuration with model parameters
Tokenizer files: `tokenizer.json`, `tokenizer_config.json`
EAGLE Base Models:
`eagle_base.engine`: Base model TensorRT engine
`base_config.json`: Base model runtime configuration
Tokenizer files: Shared tokenizer configuration
EAGLE Draft Models:
`eagle_draft.engine`: Draft model TensorRT engine
`draft_config.json`: Draft model runtime configuration
`d2t.safetensors`: Draft-to-target vocabulary mapping
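Since base and draft engines are built into the same directory (see the EAGLE usage example below), a completed EAGLE3 build directory would contain, illustratively:

engines/<model>_eagle3/
    eagle_base.engine
    base_config.json
    eagle_draft.engine
    draft_config.json
    d2t.safetensors
    tokenizer.json
    tokenizer_config.json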
Hardware Optimization#
The LLMBuilder automatically applies platform-specific optimizations for maximum performance on edge hardware:
Precision Selection: FP16, FP8, INT4, NVFP4 based on hardware capabilities
Memory Layout: Optimized tensor layouts for target GPU architecture
Kernel Selection: Hardware-specific kernel implementations chosen during engine compilation
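Because the network is strongly typed, numeric precision is fixed by the quantized ONNX export rather than by builder flags, so the practical question is which exported variant the target GPU can execute efficiently. The following is a sketch of such a capability probe; the file names and the selection policy are illustrative assumptions, not the builder's actual logic:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Choose which quantized ONNX export to build, based on GPU capability.
const char* selectOnnxVariant(nvinfer1::IBuilder& builder) {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, /*device=*/0);

    // FP8 tensor cores are available from Ada (SM 8.9) and Hopper (SM 9.0).
    bool hasFp8 = prop.major > 8 || (prop.major == 8 && prop.minor >= 9);
    if (hasFp8) return "model_fp8.onnx";            // hypothetical file name

    // Otherwise prefer FP16 wherever the platform accelerates it.
    if (builder.platformHasFastFp16()) return "model_fp16.onnx";
    return "model_fp32.onnx";
}
```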
Visual Encoder Builder#
The Visual Encoder Builder is specialized for converting visual encoder ONNX files into optimized TensorRT engines for multimodal AI applications. It handles the complexity of different vision architectures while providing dynamic image processing capabilities.
Supported Vision Architectures#
The Visual Encoder Builder supports multiple vision architectures with architecture-specific optimization profiles:
Qwen2/2.5/3-VL Models (setupQwenViTProfile):
Architecture: Qwen2-VL and Qwen2.5-VL vision encoders
Processing: Dynamic image patches with 4x spatial merge operations
Token Generation: Variable image tokens based on input resolution
Window Attention: Qwen2.5-VL includes window attention mechanisms for improved efficiency
InternVL Models (setupInternPhi4ViTProfile):
Architecture: InternVL3 vision encoder with 0.5 downsampling ratio (downsampling by a factor of 2 in each dimension, resulting in a 4× reduction in tokens).
Constraints: Image tokens must be multiples of 256 for optimal processing
Configuration: Configurable input channels (typically 3 for RGB)
Resolution: Fixed image size processing with dynamic token output
Phi-4-multimodal Models (setupInternPhi4ViTProfile):
Architecture: Phi-4-multimodal vision encoder with 0.5 downsampling ratio (downsampling by a factor of 2 in each dimension, resulting in a 4× reduction in tokens).
Constraints: Image tokens must be multiples of 256 for optimal processing
Configuration: Configurable input channels (typically 3 for RGB)
Resolution: Fixed image size processing with dynamic token output
Vision-Specific Optimization#
Unlike LLMs, visual encoders use a single optimization profile optimized for image processing:
Purpose: Image encoding and feature extraction
Dynamic Dimensions: Variable image resolutions and patch counts
Memory Pattern: Batch-oriented processing for multiple images
Token Output: Dynamic image token generation based on input complexity
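Concretely, the single profile only needs to leave the image token dimension dynamic. A minimal sketch, assuming a hypothetical encoder input `image_patches` of shape (tokens, features) and the `--minImageTokens`/`--maxImageTokens` range from the usage examples below; the tensor name and feature width are illustrative:

```cpp
#include <NvInfer.h>

using namespace nvinfer1;

// Attach the single vision profile: only the image token count varies.
// "image_patches" and kFeatureDim are illustrative assumptions.
void addVisionProfile(IBuilder& builder, IBuilderConfig& config) {
    constexpr int kFeatureDim = 1176;  // hypothetical flattened patch width
    constexpr int kMinTokens  = 128;   // --minImageTokens
    constexpr int kOptTokens  = 256;   // typical workload
    constexpr int kMaxTokens  = 512;   // --maxImageTokens

    IOptimizationProfile* vision = builder.createOptimizationProfile();
    vision->setDimensions("image_patches", OptProfileSelector::kMIN,
                          Dims2{kMinTokens, kFeatureDim});
    vision->setDimensions("image_patches", OptProfileSelector::kOPT,
                          Dims2{kOptTokens, kFeatureDim});
    vision->setDimensions("image_patches", OptProfileSelector::kMAX,
                          Dims2{kMaxTokens, kFeatureDim});
    config.addOptimizationProfile(vision);
}
```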
Output Generation#
The Visual Encoder Builder produces engines specifically designed for multimodal integration:
`visual.engine`: TensorRT engine optimized for visual processing
`config.json`: Runtime configuration with vision model parameters and builder settings
Token Interface: Produces image tokens compatible with LLM input requirements
Dynamic Sizing: Supports variable image token counts based on input resolution
Hardware Optimization#
Visual processing benefits from specific GPU optimizations tailored for vision transformer workloads:
Tensor Cores: Leverages mixed-precision operations for vision transformers
Memory Bandwidth: Optimized for high-resolution image processing
Batch Processing: Efficient handling of multiple images simultaneously
Precision Selection: Automatic FP16/FP8 selection based on hardware capabilities
Usage Examples#
Standard LLM Build#
./build/examples/llm/llm_build \
--onnxDir=onnx_models/qwen2.5-0.5b \
--engineDir=engines/qwen2.5-0.5b \
--maxBatchSize=1 \
--maxInputLen=1024 \
--maxKVCacheCapacity=4096
EAGLE Speculative Decoding Build#
Base and draft engine directories should be the same.
# Build base model
./build/examples/llm/llm_build \
--onnxDir=onnx_models/qwen2.5-vl-7b_eagle3_base \
--engineDir=engines/qwen2.5-vl-7b_eagle3 \
--maxBatchSize=1 \
--maxInputLen=1024 \
--maxKVCacheCapacity=4096 \
--vlm \
--minImageTokens=128 \
--maxImageTokens=512 \
--eagleBase
# Build draft model
./build/examples/llm/llm_build \
--onnxDir=onnx_models/qwen2.5-vl-7b_eagle3_draft \
--engineDir=engines/qwen2.5-vl-7b_eagle3 \
--maxBatchSize=1 \
--maxInputLen=1024 \
--maxKVCacheCapacity=4096 \
--vlm \
--minImageTokens=128 \
--maxImageTokens=512 \
--eagleDraft
# Build visual encoder (required for VLM)
./build/examples/multimodal/visual_build \
--onnxDir=onnx_models/qwen2.5-vl-7b/visual_enc_onnx \
--engineDir=visual_engines/qwen2.5-vl-7b_eagle3 \
--minImageTokens=128 \
--maxImageTokens=512 \
--maxImageTokensPerImage=512
Multimodal VLM Build#
# Build LLM engine
./build/examples/llm/llm_build \
--onnxDir=onnx_models/qwen2.5-vl-3b \
--engineDir=engines/qwen2.5-vl-3b \
--maxBatchSize=1 \
--maxInputLen=1024 \
--maxKVCacheCapacity=4096 \
--vlm \
--minImageTokens=128 \
--maxImageTokens=512
# Build visual encoder
./build/examples/multimodal/visual_build \
--onnxDir=onnx_models/qwen2.5-vl-3b/visual_enc_onnx \
--engineDir=visual_engines/qwen2.5-vl-3b \
--minImageTokens=128 \
--maxImageTokens=512 \
--maxImageTokensPerImage=512
LoRA-Enabled Build#
./build/examples/llm/llm_build \
--onnxDir=onnx_models/qwen2.5-0.5b \
--engineDir=engines/qwen2.5-0.5b-lora \
--maxBatchSize=1 \
--maxLoraRank=64
Best Practices#
Engine Building Strategy#
Optimize Batch Size: Set `maxBatchSize` based on your workload
Interactive applications: 1-2
Batch processing: 4-8
Configure Sequence Lengths: Balance memory and use case
Short prompts: `maxInputLen=512`
Long context: `maxInputLen=2048` or higher
KV cache capacity: `maxKVCacheCapacity` = `maxInputLen` + expected output length (for example, `maxInputLen=1024` with up to 512 generated tokens calls for a `maxKVCacheCapacity` of at least 1536)
Set Image Token Ranges for VLMs: Configure appropriate token ranges for multimodal models
InternVL and Phi-4-multimodal: Image tokens must be multiples of 256
Qwen-VL: Flexible image token counts based on dynamic patching
Use `--minImageTokens` and `--maxImageTokens` to set the range
Set `--maxImageTokensPerImage` for batch processing limits
Enable Verbose Logging: Use `--verbose` for debugging build issues
Troubleshooting#
Out of Memory During Build:
Reduce `maxBatchSize`
Reduce `maxInputLen` and `maxKVCacheCapacity`
Use lighter quantization (INT4 instead of FP16)
Build Takes Too Long:
Expected: 1-20 minutes depending on model size
Use a faster GPU for building
Consider reducing optimization profile complexity
Engine Not Loading:
Check TensorRT version compatibility (see the version-check sketch below)
Verify ONNX model integrity
Check plugin library loading
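A version mismatch is the most common cause of load failures and can be ruled out programmatically: TensorRT exposes both the compile-time header version and the version of the library actually loaded. A minimal sketch:

```cpp
#include <NvInfer.h>
#include <cstdio>

// Engines built with one TensorRT version generally do not load under
// another, so compare the header version the binary was compiled against
// with the version of libnvinfer loaded at runtime.
bool tensorrtVersionsMatch() {
    int32_t loaded   = getInferLibVersion();  // version of the loaded library
    int32_t compiled = NV_TENSORRT_VERSION;   // version from NvInfer.h
    if (loaded != compiled) {
        std::printf("TensorRT mismatch: compiled against %d, loaded %d\n",
                    compiled, loaded);
        return false;
    }
    return true;
}
```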
Next Steps#
After building your TensorRT engine:
Deploy with C++ Runtime: Use the C++ Runtime for inference
Run Examples: Try the Examples to validate your engine
Benchmark Performance: Measure latency and throughput for your use case
Additional Resources#
Builder API: Refer to the `cpp/builder/` directory
TensorRT Documentation: NVIDIA TensorRT
Plugin Development: Refer to the `cpp/plugins/` directory
Build Examples: Refer to `examples/llm/llm_build.cpp`