Examples and Complete Workflows
Code Location: examples/ | Build: examples/llm/, examples/multimodal/
Overview
This guide provides end-to-end workflows for TensorRT Edge-LLM, covering quantization, model export, engine building, and inference execution. Each workflow walks through the full pipeline, from a HuggingFace model to inference on the device, using a standardized folder structure.
⚠️ USER RESPONSIBILITY: Users are responsible for composing meaningful and appropriate prompts for their use cases. The examples provided demonstrate technical usage patterns but do not guarantee output quality or appropriateness.
Prerequisites: Complete the Installation Guide for both x86 host and edge device before proceeding.
Complete Workflow Summary
Note: Each example shows complete commands, including environment variable setup (WORKSPACE_DIR, MODEL_NAME). These variables persist for the duration of a shell session, so you only need to set them once per session on each machine (x86 host and edge device).
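Because every later command expands WORKSPACE_DIR and MODEL_NAME, an unset variable silently produces paths such as /onnx/llm. The following is a minimal sketch of a guard that can be run at the top of a session; the helper name require_env is hypothetical and not part of TensorRT Edge-LLM.

```shell
# Hypothetical helper (not part of TensorRT Edge-LLM): return non-zero
# if any of the named environment variables is unset or empty.
require_env() {
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: $name is not set" >&2
            return 1
        fi
    done
}

export WORKSPACE_DIR="$HOME/tensorrt-edgellm-workspace"
export MODEL_NAME=Qwen2.5-VL-3B-Instruct
require_env WORKSPACE_DIR MODEL_NAME && echo "environment ok"
```

The same check can be repeated on the edge device, since the host session's variables do not carry over to it.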
Every TensorRT Edge-LLM deployment follows this pattern:
graph LR
HF[HuggingFace Model<br/>x86] --> QUANT[Quantize<br/>x86]
QUANT --> EXPORT[Export<br/>x86]
EXPORT --> ONNX[ONNX<br/>transfer]
ONNX --> BUILD[Build<br/>device]
BUILD --> ENGINE[Engine<br/>device]
ENGINE --> INF[Inference<br/>device]
Pipeline Stages:
Quantize (x86 Host): Quantize model to target precision (FP8/FP4)
Export (x86 Host): Convert quantized model to ONNX format
Transfer: Copy ONNX models to edge device
Build (Edge Device): Compile ONNX into optimized TensorRT engines
Inference (Edge Device): Run inference using compiled engines
Example 1: VLM (Vision-Language Model) Inference
Complete workflow for vision-language models with image understanding capabilities.
Model: Qwen2.5-VL-3B-Instruct
Step 1: Quantize and Export (x86 Host)
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Qwen2.5-VL-3B-Instruct
mkdir -p $WORKSPACE_DIR
cd $WORKSPACE_DIR
# Quantize language model
tensorrt-edgellm-quantize-llm \
--model_dir Qwen/Qwen2.5-VL-3B-Instruct \
--quantization fp8 \
--output_dir $MODEL_NAME/quantized
# Export language model
tensorrt-edgellm-export-llm \
--model_dir $MODEL_NAME/quantized \
--output_dir $MODEL_NAME/onnx/llm
# Export visual encoder
tensorrt-edgellm-export-visual \
--model_dir Qwen/Qwen2.5-VL-3B-Instruct \
--output_dir $MODEL_NAME/onnx/visual
Step 2: Transfer to Device
# Transfer ONNX to device
scp -r $MODEL_NAME/onnx \
<device_user>@<device_ip>:~/tensorrt-edgellm-workspace/$MODEL_NAME/
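The scp command above assumes that ~/tensorrt-edgellm-workspace/$MODEL_NAME/ already exists on the device. If it does not, the copy may fail or, depending on the scp version, place the files somewhere other than $MODEL_NAME/onnx. The sketch below creates the directory first. The function name transfer_onnx, the RUN dry-run switch, and the address 192.0.2.10 are illustrative assumptions.

```shell
# Dry run by default: commands are printed instead of executed.
# Set RUN="" in the environment to actually run them.
RUN="${RUN-echo}"

# Hypothetical helper: create the remote model directory, then copy onnx/ into it.
transfer_onnx() {
    device="$1"     # <device_user>@<device_ip>
    model="$2"      # same value as MODEL_NAME
    remote_dir="tensorrt-edgellm-workspace/$model"
    # The tilde is left unexpanded locally so the device resolves it.
    $RUN ssh "$device" "mkdir -p ~/$remote_dir"
    $RUN scp -r "$model/onnx" "$device:~/$remote_dir/"
}

# 192.0.2.10 is a documentation-only placeholder address.
transfer_onnx "user@192.0.2.10" "Qwen2.5-VL-3B-Instruct"
```

The same helper applies unchanged to the transfer steps of the later examples, since they all copy $MODEL_NAME/onnx to the same location.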
Step 3: Build Engines (Thor Device)
# Set up workspace directory on device
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Qwen2.5-VL-3B-Instruct
cd ~/TensorRT-Edge-LLM
# Build language model engine
./build/examples/llm/llm_build \
--onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/llm \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/llm \
--maxBatchSize 1 \
--maxInputLen 1024 \
--maxKVCacheCapacity 4096
# Build visual encoder engine
./build/examples/multimodal/visual_build \
--onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/visual \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/visual \
--minImageTokens 128 \
--maxImageTokens 512 \
--maxImageTokensPerImage 512
Build time: ~10-15 minutes total
Step 4: Run Inference (Thor Device)
Create an input file $WORKSPACE_DIR/input_vlm.json (replace /path/to/image.jpg with an actual image file path):
{
"batch_size": 1,
"temperature": 1.0,
"top_p": 1.0,
"top_k": 50,
"max_generate_length": 128,
"requests": [
{
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": [
{
"type": "image",
"image": "/path/to/image.jpg"
},
{
"type": "text",
"text": "Please describe the image."
}
]
}
]
}
]
}
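One way to create this file from the shell is a here-document that substitutes the image path and refuses to write the request when the image is missing. This is a sketch only: the function name write_vlm_input is hypothetical, and the fallback to a temporary directory when WORKSPACE_DIR is unset is there purely so the snippet can be tried anywhere.

```shell
# Hypothetical helper: write the VLM request shown above for a given image.
write_vlm_input() {
    out_file="$1"
    image_path="$2"   # must not contain double quotes or backslashes
    if [ ! -f "$image_path" ]; then
        echo "error: image not found: $image_path" >&2
        return 1
    fi
    cat > "$out_file" <<EOF
{
  "batch_size": 1,
  "temperature": 1.0,
  "top_p": 1.0,
  "top_k": 50,
  "max_generate_length": 128,
  "requests": [
    {
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
          {"type": "image", "image": "$image_path"},
          {"type": "text", "text": "Please describe the image."}
        ]}
      ]
    }
  ]
}
EOF
}

out_dir="${WORKSPACE_DIR:-$(mktemp -d)}"
mkdir -p "$out_dir"
# Replace /path/to/image.jpg with a real file; a missing image is reported, not written.
write_vlm_input "$out_dir/input_vlm.json" "/path/to/image.jpg" || true
```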
Run inference:
cd ~/TensorRT-Edge-LLM
./build/examples/llm/llm_inference \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/llm \
--multimodalEngineDir $WORKSPACE_DIR/$MODEL_NAME/engines/visual \
--inputFile $WORKSPACE_DIR/input_vlm.json \
--outputFile $WORKSPACE_DIR/output_vlm.json
Success! 🎉 Check output_vlm.json for vision-language model responses.
Example 2: LLM EAGLE Speculative Decoding
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) uses a smaller draft model to accelerate generation for text-only models.
Model: Llama-3.1-8B-Instruct with EAGLE draft model
Step 1: Quantize and Export (x86 Host)
export MODEL_NAME=Llama-3.1-8B-Instruct
cd $WORKSPACE_DIR
# Download EAGLE draft model to workspace
git clone https://huggingface.co/yuhuili/EAGLE3-LLaMA3.1-Instruct-8B
cd EAGLE3-LLaMA3.1-Instruct-8B && git lfs pull && cd ..
# Quantize base model
tensorrt-edgellm-quantize-llm \
--model_dir meta-llama/Llama-3.1-8B-Instruct \
--quantization fp8 \
--output_dir $MODEL_NAME/quantized-base
# Export base model with EAGLE flag
tensorrt-edgellm-export-llm \
--model_dir $MODEL_NAME/quantized-base \
--output_dir $MODEL_NAME/onnx/base \
--is_eagle_base
# Quantize draft model
tensorrt-edgellm-quantize-draft \
--base_model_dir meta-llama/Llama-3.1-8B-Instruct \
--draft_model_dir EAGLE3-LLaMA3.1-Instruct-8B \
--quantization fp8 \
--output_dir $MODEL_NAME/quantized-draft
# Export draft model
tensorrt-edgellm-export-draft \
--draft_model_dir $MODEL_NAME/quantized-draft \
--base_model_dir meta-llama/Llama-3.1-8B-Instruct \
--output_dir $MODEL_NAME/onnx/draft
Step 2: Transfer to Device
# Transfer ONNX to device
scp -r $MODEL_NAME/onnx \
<device_user>@<device_ip>:~/tensorrt-edgellm-workspace/$MODEL_NAME/
Step 3: Build Engines (Thor Device)
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Llama-3.1-8B-Instruct
cd ~/TensorRT-Edge-LLM
# Build base model EAGLE engine
./build/examples/llm/llm_build \
--onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/base \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
--maxBatchSize 1 \
--maxInputLen 1024 \
--maxKVCacheCapacity 4096 \
--maxVerifyTreeSize 60 \
--maxDraftTreeSize 60 \
--eagleBase
# Build draft model engine
./build/examples/llm/llm_build \
--onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/draft \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
--maxBatchSize 1 \
--maxInputLen 1024 \
--maxKVCacheCapacity 4096 \
--maxVerifyTreeSize 60 \
--maxDraftTreeSize 60 \
--eagleDraft
Build time: ~15-20 minutes total
Step 4: Run Inference (Thor Device)
cd ~/TensorRT-Edge-LLM
./build/examples/llm/llm_inference \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
--inputFile $WORKSPACE_DIR/input.json \
--outputFile $WORKSPACE_DIR/output.json \
--eagle
Note: EAGLE speculative decoding provides 1.5-3x faster generation but is limited to batch size 1.
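Step 4 reads $WORKSPACE_DIR/input.json, which this example does not create. The sketch below writes a minimal text-only request. It assumes the same top-level fields as the Example 1 input file and assumes that a user message may carry a plain string as content, as the system message in Example 1 does; the Input Format Guide is the authoritative reference. The function name write_text_input is hypothetical.

```shell
# Hypothetical helper: write a minimal text-only request file.
# Schema assumed from the Example 1 input file, not from a specification.
write_text_input() {
    out_file="$1"
    prompt="$2"     # must not contain double quotes or backslashes
    cat > "$out_file" <<EOF
{
  "batch_size": 1,
  "temperature": 1.0,
  "top_p": 1.0,
  "top_k": 50,
  "max_generate_length": 128,
  "requests": [
    {
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "$prompt"}
      ]
    }
  ]
}
EOF
}

# Falls back to a temporary directory when WORKSPACE_DIR is unset.
out_dir="${WORKSPACE_DIR:-$(mktemp -d)}"
mkdir -p "$out_dir"
write_text_input "$out_dir/input.json" "Explain speculative decoding in two sentences."

# Optional syntax check; python3 is assumed to be installed.
if command -v python3 >/dev/null 2>&1; then
    python3 -m json.tool "$out_dir/input.json" >/dev/null && echo "valid JSON"
fi
```

batch_size is kept at 1 here because of the batch-size limit noted above.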
Example 3: VLM EAGLE Speculative Decoding
EAGLE for vision-language models combines accelerated text generation with image understanding.
Model: Qwen2.5-VL-7B-Instruct with EAGLE3 draft model
Step 1: Quantize and Export (x86 Host)
export MODEL_NAME=Qwen2.5-VL-7B-Instruct
cd $WORKSPACE_DIR
# Download EAGLE draft model to workspace
git clone https://huggingface.co/Rayzl/qwen2.5-vl-7b-eagle3-sgl
cd qwen2.5-vl-7b-eagle3-sgl && git lfs pull && cd ..
# Quantize base model
tensorrt-edgellm-quantize-llm \
--model_dir Qwen/Qwen2.5-VL-7B-Instruct \
--quantization fp8 \
--output_dir $MODEL_NAME/quantized-base
# Export base model with EAGLE flag
tensorrt-edgellm-export-llm \
--model_dir $MODEL_NAME/quantized-base \
--output_dir $MODEL_NAME/onnx/base \
--is_eagle_base
# Quantize draft model
tensorrt-edgellm-quantize-draft \
--base_model_dir Qwen/Qwen2.5-VL-7B-Instruct \
--draft_model_dir qwen2.5-vl-7b-eagle3-sgl \
--quantization fp8 \
--output_dir $MODEL_NAME/quantized-draft
# Export draft model
tensorrt-edgellm-export-draft \
--draft_model_dir $MODEL_NAME/quantized-draft \
--base_model_dir Qwen/Qwen2.5-VL-7B-Instruct \
--output_dir $MODEL_NAME/onnx/draft
# Export visual encoder
tensorrt-edgellm-export-visual \
--model_dir Qwen/Qwen2.5-VL-7B-Instruct \
--output_dir $MODEL_NAME/onnx/visual
Step 2: Transfer to Device
# Transfer ONNX to device
scp -r $MODEL_NAME/onnx \
<device_user>@<device_ip>:~/tensorrt-edgellm-workspace/$MODEL_NAME/
Step 3: Build Engines (Thor Device)
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Qwen2.5-VL-7B-Instruct
cd ~/TensorRT-Edge-LLM
# Build base model EAGLE engine
./build/examples/llm/llm_build \
--onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/base \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/llm \
--maxBatchSize 1 \
--maxInputLen 1024 \
--maxKVCacheCapacity 4096 \
--maxVerifyTreeSize 60 \
--maxDraftTreeSize 60 \
--eagleBase
# Build draft model engine
./build/examples/llm/llm_build \
--onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/draft \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/llm \
--maxBatchSize 1 \
--maxInputLen 1024 \
--maxKVCacheCapacity 4096 \
--maxVerifyTreeSize 60 \
--maxDraftTreeSize 60 \
--eagleDraft
# Build visual encoder engine
./build/examples/multimodal/visual_build \
--onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/visual \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/visual \
--minImageTokens 128 \
--maxImageTokens 512 \
--maxImageTokensPerImage 512
Build time: ~20-30 minutes total
Step 4: Run Inference (Thor Device)
cd ~/TensorRT-Edge-LLM
./build/examples/llm/llm_inference \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/llm \
--multimodalEngineDir $WORKSPACE_DIR/$MODEL_NAME/engines/visual \
--inputFile $WORKSPACE_DIR/input.json \
--outputFile $WORKSPACE_DIR/output.json \
--eagle
Success! 🎉 EAGLE VLM provides accelerated multimodal inference.
Example 4: LoRA-Enabled Models
Dynamic LoRA adapter support allows switching between fine-tuned adapters at runtime without rebuilding engines.
Model: Qwen2.5-0.5B-Instruct with LoRA adapter
Step 1: Export and Process (x86 Host)
export MODEL_NAME=Qwen2.5-0.5B-Instruct
cd $WORKSPACE_DIR
# Export base model (FP16, no quantization needed for small model)
tensorrt-edgellm-export-llm \
--model_dir Qwen/Qwen2.5-0.5B-Instruct \
--output_dir $MODEL_NAME/onnx
# Insert LoRA support into ONNX
tensorrt-edgellm-insert-lora \
--onnx_dir $MODEL_NAME/onnx
# Download and process LoRA adapter
git clone https://huggingface.co/madhurjindal/Jailbreak-Detector-2-XL
cd Jailbreak-Detector-2-XL && git lfs pull && cd ..
# Process LoRA weights
tensorrt-edgellm-process-lora \
--input_dir Jailbreak-Detector-2-XL \
--output_dir $MODEL_NAME/onnx/lora_weights/jailbreak_detector
Step 2: Transfer to Device
# Transfer ONNX and LoRA weights to device
scp -r $MODEL_NAME/onnx \
<device_user>@<device_ip>:~/tensorrt-edgellm-workspace/$MODEL_NAME/
Step 3: Build Engine (Thor Device)
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Qwen2.5-0.5B-Instruct
cd ~/TensorRT-Edge-LLM
# Build engine with LoRA support
./build/examples/llm/llm_build \
--onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
--maxBatchSize 1 \
--maxInputLen 1024 \
--maxKVCacheCapacity 4096 \
--maxLoraRank 64
Step 4: Run Inference (Thor Device)
cd ~/TensorRT-Edge-LLM
./build/examples/llm/llm_inference \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
--inputFile $WORKSPACE_DIR/input.json \
--outputFile $WORKSPACE_DIR/output.json
Note: You can add multiple LoRA adapters to the lora_weights/ directory and switch between them at runtime without rebuilding the engine.
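To see which adapters are available on the device, the lora_weights/ directory can simply be listed. The sketch below assumes each adapter lives in its own subdirectory, as jailbreak_detector does in Step 1; the function name list_lora_adapters is hypothetical.

```shell
# Hypothetical helper: print the name of each adapter subdirectory.
list_lora_adapters() {
    weights_dir="$1"
    for d in "$weights_dir"/*/; do
        # Skip the unexpanded pattern when the directory is empty or missing.
        [ -d "$d" ] || continue
        basename "$d"
    done
}

list_lora_adapters "${WORKSPACE_DIR:-.}/${MODEL_NAME:-Qwen2.5-0.5B-Instruct}/onnx/lora_weights"
```

With the layout from Step 1 this prints jailbreak_detector; plain files in lora_weights/ are ignored.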
Example 5: Phi-4-Multimodal with LoRA Merge
Phi-4-Multimodal requires merging the vision LoRA adapter into the base model before quantization and export.
Model: Phi-4-multimodal-instruct
Step 1: Merge, Quantize, and Export (x86 Host)
export MODEL_NAME=Phi-4-multimodal-instruct
cd $WORKSPACE_DIR
# Clone Phi-4-multimodal-instruct from HuggingFace
git clone https://huggingface.co/microsoft/Phi-4-multimodal-instruct
cd Phi-4-multimodal-instruct && git lfs pull && cd ..
# Merge vision LoRA adapter into base model
tensorrt-edgellm-merge-lora \
--model_dir Phi-4-multimodal-instruct \
--lora_dir Phi-4-multimodal-instruct/vision-lora \
--output_dir $MODEL_NAME/merged
# Quantize merged model
tensorrt-edgellm-quantize-llm \
--model_dir $MODEL_NAME/merged \
--output_dir $MODEL_NAME/quantized \
--quantization nvfp4
# Export language model
tensorrt-edgellm-export-llm \
--model_dir $MODEL_NAME/quantized \
--output_dir $MODEL_NAME/onnx/llm
# Export visual encoder (use original weights, not merged)
tensorrt-edgellm-export-visual \
--model_dir Phi-4-multimodal-instruct \
--output_dir $MODEL_NAME/onnx/visual
Step 2: Transfer to Device
# Transfer ONNX to device
scp -r $MODEL_NAME/onnx \
<device_user>@<device_ip>:~/tensorrt-edgellm-workspace/$MODEL_NAME/
Step 3: Build Engines (Thor Device)
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Phi-4-multimodal-instruct
cd ~/TensorRT-Edge-LLM
# Build language model engine
./build/examples/llm/llm_build \
--onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/llm \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/llm \
--maxBatchSize 1 \
--maxInputLen 1024 \
--maxKVCacheCapacity 4096
# Build visual encoder engine
./build/examples/multimodal/visual_build \
--onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/visual \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/visual \
--minImageTokens 128 \
--maxImageTokens 512 \
--maxImageTokensPerImage 512
Step 4: Run Inference (Thor Device)
cd ~/TensorRT-Edge-LLM
./build/examples/llm/llm_inference \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/llm \
--multimodalEngineDir $WORKSPACE_DIR/$MODEL_NAME/engines/visual \
--inputFile $WORKSPACE_DIR/input.json \
--outputFile $WORKSPACE_DIR/output.json
Success! 🎉 Phi-4-Multimodal with merged vision adapter running on edge device.
Input File Format Reference
All examples in this guide use standardized JSON input files. For complete input format specification including all parameters, multi-turn conversations, LoRA adapters, and advanced features, see the Input Format Guide.
Common Build Parameters
LLM Build Parameters (llm_build)
| Parameter | Description | Default | Used In |
|---|---|---|---|
| --onnxDir | Input ONNX directory | Required | All |
| --engineDir | Output engine directory | Required | All |
| --maxBatchSize | Maximum batch size | 4 | All |
| --maxInputLen | Maximum input length | 1024 | All |
| --maxKVCacheCapacity | Maximum KV-cache capacity (sequence length) | 4096 | All |
| --maxLoraRank | Maximum LoRA rank (0=disabled) | 0 | LoRA models |
| --maxVerifyTreeSize | Max verify tree size | 60 | EAGLE only |
| --maxDraftTreeSize | Max draft tree size | 60 | EAGLE only |
| --eagleBase | Build EAGLE base model | false | EAGLE only |
| --eagleDraft | Build EAGLE draft model | false | EAGLE only |
| | Enable debug logging | false | Optional |
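When choosing --maxInputLen and --maxKVCacheCapacity, a quick budget check can catch obviously inconsistent values before a long engine build. The sketch below rests on an assumption that this guide does not state: that the KV cache must hold the prompt tokens plus all generated tokens (max_generate_length in the input file). The function name fits_kv_cache is hypothetical, and EAGLE tree tokens are not accounted for.

```shell
# Hypothetical helper: succeed if prompt + generated tokens fit in the KV cache.
# Assumes capacity >= input length + generation length is the relevant constraint.
fits_kv_cache() {
    input_len="$1"
    gen_len="$2"
    capacity="$3"
    [ $((input_len + gen_len)) -le "$capacity" ]
}

# Values from the examples: --maxInputLen 1024, max_generate_length 128,
# --maxKVCacheCapacity 4096.
if fits_kv_cache 1024 128 4096; then
    echo "fits"
else
    echo "does not fit"
fi
```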
Visual Build Parameters (visual_build)
| Parameter | Description | Default |
|---|---|---|
| --onnxDir | Input ONNX directory | Required |
| --engineDir | Output engine directory | Required |
| --minImageTokens | Minimum image tokens | 4 |
| --maxImageTokens | Maximum image tokens | 1024 |
| --maxImageTokensPerImage | Max tokens per image | 512 |
| | Enable debug logging | false |
Inference Parameters (llm_inference)
| Parameter | Description | Used In |
|---|---|---|
| --engineDir | Engine directory (required) | All |
| --multimodalEngineDir | Visual/draft engine directory | VLM/EAGLE |
| --inputFile | Input JSON path (required) | All |
| --outputFile | Output JSON path | All |
| --eagle | Enable EAGLE speculative decoding | EAGLE only |
| | Tokens selected per drafting step | EAGLE (default: 10) |
| | Number of drafting steps | EAGLE (default: 6) |
| | Tokens for verification | EAGLE (default: 60) |
| | Override batch size from input file | Optional |
| | Override max generate length | Optional |
| --dumpProfile | Enable profiling output | Optional |
| --profileOutputFile | Profile output path | Optional |
| | Number of warmup runs | Optional (default: 0) |
| | Enable debug logging | Optional |
Note: Sampling parameters (temperature, top_p, top_k) are specified in the input JSON file, not as command-line arguments.
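Since sampling is controlled by the input file, trying a different temperature means producing a modified copy of that file rather than changing the command line. A minimal sketch using sed follows; the function name set_temperature is hypothetical, and it assumes the field is written as "temperature": <number>, as in the Example 1 input file.

```shell
# Hypothetical helper: copy a request file, replacing its temperature value.
set_temperature() {
    in_file="$1"
    value="$2"
    out_file="$3"
    sed -E "s/\"temperature\":[[:space:]]*[0-9.]+/\"temperature\": $value/" \
        "$in_file" > "$out_file"
}

# Self-contained demonstration on a small sample file in a temporary directory.
demo_dir="$(mktemp -d)"
printf '{\n  "temperature": 1.0,\n  "top_p": 1.0\n}\n' > "$demo_dir/input.json"
set_temperature "$demo_dir/input.json" 0.7 "$demo_dir/input_t07.json"
cat "$demo_dir/input_t07.json"
```

The resulting file is then passed to llm_inference through --inputFile in the usual way.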
Profiling and Performance Analysis
Enable profiling to measure inference performance:
cd ~/TensorRT-Edge-LLM
./build/examples/llm/llm_inference \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
--inputFile $WORKSPACE_DIR/input.json \
--outputFile $WORKSPACE_DIR/output.json \
--dumpProfile \
--profileOutputFile $WORKSPACE_DIR/profile.json
The profile output includes:
Per-token latency
Prefill time
Generation time
KV-cache usage
Memory allocation statistics