Examples and Complete Workflows

Code Location: examples/ | Build: examples/llm/, examples/multimodal/

Overview

This guide walks through end-to-end workflows for TensorRT Edge-LLM: model export, engine building, and inference execution. Each workflow covers the full pipeline from a HuggingFace model to on-device inference, using a standardized folder structure.

⚠️ USER RESPONSIBILITY: Users are responsible for composing meaningful and appropriate prompts for their use cases. The examples provided demonstrate technical usage patterns but do not guarantee output quality or appropriateness.

Prerequisites: Complete the Installation Guide for both x86 host and edge device before proceeding.

Complete Workflow Summary

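Note: Each example shows complete commands including environment variable setup (WORKSPACE_DIR, MODEL_NAME). These variables persist for the life of a shell session, so set them once per session on each machine (x86 host and edge device). Examples 2-5 reuse the WORKSPACE_DIR exported in Example 1 on the host and change only MODEL_NAME.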
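Because the later examples assume these variables are already set, a small guard at the top of a session can help. The sketch below keeps any value you have already exported and otherwise falls back to the defaults used in Example 1. It is a convenience, not part of the TensorRT Edge-LLM tooling.

```shell
# Keep values already exported in this session; otherwise fall back to the
# defaults used in Example 1 of this guide.
: "${WORKSPACE_DIR:=$HOME/tensorrt-edgellm-workspace}"
: "${MODEL_NAME:=Qwen2.5-VL-3B-Instruct}"
export WORKSPACE_DIR MODEL_NAME

# Show what the following commands will use.
echo "WORKSPACE_DIR=$WORKSPACE_DIR"
echo "MODEL_NAME=$MODEL_NAME"
```

Run it again after changing MODEL_NAME for a different example to confirm the value the next commands will pick up.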

Every TensorRT Edge-LLM deployment follows this pattern:

graph LR
    HF[HuggingFace Model<br/>x86] --> QUANT[Quantize<br/>x86]
    QUANT --> EXPORT[Export<br/>x86]
    EXPORT --> ONNX[ONNX<br/>transfer]
    ONNX --> BUILD[Build<br/>device]
    BUILD --> ENGINE[Engine<br/>device]
    ENGINE --> INF[Inference<br/>device]

Pipeline Stages:

Quantize (x86 Host): Quantize the model to the target precision (for example FP8 or NVFP4). Small models can skip this step and export in FP16, as in Example 4.
Export (x86 Host): Convert quantized model to ONNX format
Transfer: Copy ONNX models to edge device
Build (Edge Device): Compile ONNX into optimized TensorRT engines
Inference (Edge Device): Run inference using compiled engines

Example 1: VLM (Vision-Language Model) Inference

Complete workflow for vision-language models with image understanding capabilities.

Model: Qwen2.5-VL-3B-Instruct

Step 1: Quantize and Export (x86 Host)

export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Qwen2.5-VL-3B-Instruct
mkdir -p $WORKSPACE_DIR
cd $WORKSPACE_DIR

# Quantize language model
tensorrt-edgellm-quantize-llm \
  --model_dir Qwen/Qwen2.5-VL-3B-Instruct \
  --quantization fp8 \
  --output_dir $MODEL_NAME/quantized

# Export language model
tensorrt-edgellm-export-llm \
  --model_dir $MODEL_NAME/quantized \
  --output_dir $MODEL_NAME/onnx/llm

# Export visual encoder
tensorrt-edgellm-export-visual \
  --model_dir Qwen/Qwen2.5-VL-3B-Instruct \
  --output_dir $MODEL_NAME/onnx/visual

Step 2: Transfer to Device

# Transfer ONNX to device
scp -r $MODEL_NAME/onnx \
  <device_user>@<device_ip>:~/tensorrt-edgellm-workspace/$MODEL_NAME/
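The scp command above copies into ~/tensorrt-edgellm-workspace/$MODEL_NAME/ on the device, a directory that may not exist yet on a fresh device. One way to handle that, sketched below, is to create the directory over ssh first. DEVICE_USER, DEVICE_IP, DRY_RUN, and the run helper are hypothetical names introduced for illustration. By default the sketch only prints the commands.

```shell
# Hypothetical placeholders: replace with your device's login and address.
DEVICE_USER="${DEVICE_USER:-<device_user>}"
DEVICE_IP="${DEVICE_IP:-<device_ip>}"
MODEL_NAME="${MODEL_NAME:-Qwen2.5-VL-3B-Instruct}"

# Relative remote paths are resolved against the remote user's home directory.
REMOTE_DIR="tensorrt-edgellm-workspace/$MODEL_NAME"

# Print each command by default; set DRY_RUN=0 to execute it instead.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create the destination first so scp has an existing directory to copy into.
run ssh "$DEVICE_USER@$DEVICE_IP" "mkdir -p $REMOTE_DIR"
run scp -r "$MODEL_NAME/onnx" "$DEVICE_USER@$DEVICE_IP:$REMOTE_DIR/"
```

Once the printed commands look right, rerun with DRY_RUN=0 from $WORKSPACE_DIR on the x86 host.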

Step 3: Build Engines (Thor Device)

# Set up workspace directory on device
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Qwen2.5-VL-3B-Instruct
cd ~/TensorRT-Edge-LLM

# Build language model engine
./build/examples/llm/llm_build \
  --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/llm \
  --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/llm \
  --maxBatchSize 1 \
  --maxInputLen 1024 \
  --maxKVCacheCapacity 4096

# Build visual encoder engine
./build/examples/multimodal/visual_build \
  --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/visual \
  --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/visual \
  --minImageTokens 128 \
  --maxImageTokens 512 \
  --maxImageTokensPerImage 512

Build time: ~10-15 minutes total
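Before moving on to inference, it can be worth confirming that both builds wrote something to their engine directories. The sketch below only checks that each directory exists and is non-empty. It makes no assumption about engine file names, and dir_has_files is a hypothetical helper.

```shell
WORKSPACE_DIR="${WORKSPACE_DIR:-$HOME/tensorrt-edgellm-workspace}"
MODEL_NAME="${MODEL_NAME:-Qwen2.5-VL-3B-Instruct}"

# True when the directory exists and contains at least one entry.
dir_has_files() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Report on the two engine directories used in this example.
for dir in \
  "$WORKSPACE_DIR/$MODEL_NAME/engines/llm" \
  "$WORKSPACE_DIR/$MODEL_NAME/engines/visual"
do
  if dir_has_files "$dir"; then
    echo "ok: $dir"
  else
    echo "missing or empty: $dir"
  fi
done
```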

Step 4: Run Inference (Thor Device)

Create an input file $WORKSPACE_DIR/input_vlm.json (replace /path/to/image.jpg with an actual image file path):

{
    "batch_size": 1,
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": 50,
    "max_generate_length": 128,
    "requests": [
        {
            "messages": [
                {
                    "role": "system",
                    "content": "You are a helpful assistant."
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "image": "/path/to/image.jpg"
                        },
                        {
                            "type": "text",
                            "text": "Please describe the image."
                        }
                    ]
                }
            ]
        }
    ]
}
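A mistyped image path is an easy mistake to make in this file. As a quick preflight, the sketch below lists every "image" path in a request file that does not exist on disk. It relies on simple text matching rather than a JSON parser, so it assumes each "image" entry sits on one line as in the example above. missing_images is a hypothetical helper.

```shell
# Print every "image" path in the given request file that is not a regular
# file on disk. Prints nothing when all paths exist.
missing_images() {
  grep -o '"image": *"[^"]*"' "$1" |
    sed 's/^"image": *"\(.*\)"$/\1/' |
    while IFS= read -r path; do
      [ -f "$path" ] || echo "$path"
    done
}

INPUT="${WORKSPACE_DIR:-$HOME/tensorrt-edgellm-workspace}/input_vlm.json"
if [ -f "$INPUT" ]; then
  missing=$(missing_images "$INPUT")
  if [ -n "$missing" ]; then
    echo "image files not found:"
    echo "$missing"
  else
    echo "all image paths exist"
  fi
else
  echo "no input file at $INPUT"
fi
```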

Run inference:

cd ~/TensorRT-Edge-LLM

./build/examples/llm/llm_inference \
  --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/llm \
  --multimodalEngineDir $WORKSPACE_DIR/$MODEL_NAME/engines/visual \
  --inputFile $WORKSPACE_DIR/input_vlm.json \
  --outputFile $WORKSPACE_DIR/output_vlm.json

Success! 🎉 Check output_vlm.json for vision-language model responses.

Example 2: LLM EAGLE Speculative Decoding

EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) uses a smaller draft model to accelerate generation for text-only models.

Model: Llama-3.1-8B-Instruct with EAGLE3 draft model

Step 1: Quantize and Export (x86 Host)

export MODEL_NAME=Llama-3.1-8B-Instruct
cd $WORKSPACE_DIR

# Download EAGLE draft model to workspace
git clone https://huggingface.co/yuhuili/EAGLE3-LLaMA3.1-Instruct-8B
cd EAGLE3-LLaMA3.1-Instruct-8B && git lfs pull && cd ..

# Quantize base model
tensorrt-edgellm-quantize-llm \
  --model_dir meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --output_dir $MODEL_NAME/quantized-base

# Export base model with EAGLE flag
tensorrt-edgellm-export-llm \
  --model_dir $MODEL_NAME/quantized-base \
  --output_dir $MODEL_NAME/onnx/base \
  --is_eagle_base

# Quantize draft model
tensorrt-edgellm-quantize-draft \
  --base_model_dir meta-llama/Llama-3.1-8B-Instruct \
  --draft_model_dir EAGLE3-LLaMA3.1-Instruct-8B \
  --quantization fp8 \
  --output_dir $MODEL_NAME/quantized-draft

# Export draft model
tensorrt-edgellm-export-draft \
  --draft_model_dir $MODEL_NAME/quantized-draft \
  --base_model_dir meta-llama/Llama-3.1-8B-Instruct \
  --output_dir $MODEL_NAME/onnx/draft

Step 2: Transfer to Device

# Transfer ONNX to device
scp -r $MODEL_NAME/onnx \
  <device_user>@<device_ip>:~/tensorrt-edgellm-workspace/$MODEL_NAME/

Step 3: Build Engines (Thor Device)

export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Llama-3.1-8B-Instruct
cd ~/TensorRT-Edge-LLM

# Build base model EAGLE engine
./build/examples/llm/llm_build \
  --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/base \
  --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
  --maxBatchSize 1 \
  --maxInputLen 1024 \
  --maxKVCacheCapacity 4096 \
  --maxVerifyTreeSize 60 \
  --maxDraftTreeSize 60 \
  --eagleBase

# Build draft model engine
./build/examples/llm/llm_build \
  --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/draft \
  --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
  --maxBatchSize 1 \
  --maxInputLen 1024 \
  --maxKVCacheCapacity 4096 \
  --maxVerifyTreeSize 60 \
  --maxDraftTreeSize 60 \
  --eagleDraft

Build time: ~15-20 minutes total

Step 4: Run Inference (Thor Device)

cd ~/TensorRT-Edge-LLM

./build/examples/llm/llm_inference \
  --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
  --inputFile $WORKSPACE_DIR/input.json \
  --outputFile $WORKSPACE_DIR/output.json \
  --eagle
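The command above reads $WORKSPACE_DIR/input.json, which this example does not show. A minimal text-only request might look like the following. It is an assumption modeled on the VLM input file from Example 1, with the user content given as a plain string, so check the Input Format Guide for the authoritative schema. Note that the sketch overwrites any existing input.json.

```shell
WORKSPACE_DIR="${WORKSPACE_DIR:-$HOME/tensorrt-edgellm-workspace}"
mkdir -p "$WORKSPACE_DIR"

# Assumed text-only request: same top-level fields as input_vlm.json, with the
# user "content" written as a plain string (as the system message already is).
cat > "$WORKSPACE_DIR/input.json" <<'EOF'
{
    "batch_size": 1,
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": 50,
    "max_generate_length": 128,
    "requests": [
        {
            "messages": [
                {
                    "role": "system",
                    "content": "You are a helpful assistant."
                },
                {
                    "role": "user",
                    "content": "Summarize what speculative decoding does in two sentences."
                }
            ]
        }
    ]
}
EOF

# Catch JSON syntax errors before running inference (skipped without python3).
if command -v python3 >/dev/null 2>&1; then
  python3 -m json.tool "$WORKSPACE_DIR/input.json" >/dev/null &&
    echo "input.json is valid JSON"
fi
```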

Note: EAGLE speculative decoding typically speeds up generation by 1.5-3x, but it is limited to batch size 1.

Example 3: VLM EAGLE Speculative Decoding

EAGLE for vision-language models combines accelerated text generation with image understanding.

Model: Qwen2.5-VL-7B-Instruct with EAGLE3 draft model

Step 1: Quantize and Export (x86 Host)

export MODEL_NAME=Qwen2.5-VL-7B-Instruct
cd $WORKSPACE_DIR

# Download EAGLE draft model to workspace
git clone https://huggingface.co/Rayzl/qwen2.5-vl-7b-eagle3-sgl
cd qwen2.5-vl-7b-eagle3-sgl && git lfs pull && cd ..

# Quantize base model
tensorrt-edgellm-quantize-llm \
  --model_dir Qwen/Qwen2.5-VL-7B-Instruct \
  --quantization fp8 \
  --output_dir $MODEL_NAME/quantized-base

# Export base model with EAGLE flag
tensorrt-edgellm-export-llm \
  --model_dir $MODEL_NAME/quantized-base \
  --output_dir $MODEL_NAME/onnx/base \
  --is_eagle_base

# Quantize draft model
tensorrt-edgellm-quantize-draft \
  --base_model_dir Qwen/Qwen2.5-VL-7B-Instruct \
  --draft_model_dir qwen2.5-vl-7b-eagle3-sgl \
  --quantization fp8 \
  --output_dir $MODEL_NAME/quantized-draft

# Export draft model
tensorrt-edgellm-export-draft \
  --draft_model_dir $MODEL_NAME/quantized-draft \
  --base_model_dir Qwen/Qwen2.5-VL-7B-Instruct \
  --output_dir $MODEL_NAME/onnx/draft

# Export visual encoder
tensorrt-edgellm-export-visual \
  --model_dir Qwen/Qwen2.5-VL-7B-Instruct \
  --output_dir $MODEL_NAME/onnx/visual

Step 2: Transfer to Device

# Transfer ONNX to device
scp -r $MODEL_NAME/onnx \
  <device_user>@<device_ip>:~/tensorrt-edgellm-workspace/$MODEL_NAME/

Step 3: Build Engines (Thor Device)

export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Qwen2.5-VL-7B-Instruct
cd ~/TensorRT-Edge-LLM

# Build base model EAGLE engine
./build/examples/llm/llm_build \
  --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/base \
  --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/llm \
  --maxBatchSize 1 \
  --maxInputLen 1024 \
  --maxKVCacheCapacity 4096 \
  --maxVerifyTreeSize 60 \
  --maxDraftTreeSize 60 \
  --eagleBase

# Build draft model engine
./build/examples/llm/llm_build \
  --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/draft \
  --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/llm \
  --maxBatchSize 1 \
  --maxInputLen 1024 \
  --maxKVCacheCapacity 4096 \
  --maxVerifyTreeSize 60 \
  --maxDraftTreeSize 60 \
  --eagleDraft

# Build visual encoder engine
./build/examples/multimodal/visual_build \
  --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/visual \
  --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/visual \
  --minImageTokens 128 \
  --maxImageTokens 512 \
  --maxImageTokensPerImage 512

Build time: ~20-30 minutes total

Step 4: Run Inference (Thor Device)

cd ~/TensorRT-Edge-LLM

./build/examples/llm/llm_inference \
  --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/llm \
  --multimodalEngineDir $WORKSPACE_DIR/$MODEL_NAME/engines/visual \
  --inputFile $WORKSPACE_DIR/input.json \
  --outputFile $WORKSPACE_DIR/output.json \
  --eagle

Success! 🎉 EAGLE VLM provides accelerated multimodal inference.

Example 4: LoRA-Enabled Models

Dynamic LoRA adapter support allows switching between fine-tuned adapters at runtime without rebuilding engines.

Model: Qwen2.5-0.5B-Instruct with LoRA adapter

Step 1: Export and Process (x86 Host)

export MODEL_NAME=Qwen2.5-0.5B-Instruct
cd $WORKSPACE_DIR

# Export base model (FP16, no quantization needed for small model)
tensorrt-edgellm-export-llm \
  --model_dir Qwen/Qwen2.5-0.5B-Instruct \
  --output_dir $MODEL_NAME/onnx

# Insert LoRA support into ONNX
tensorrt-edgellm-insert-lora \
  --onnx_dir $MODEL_NAME/onnx

# Download LoRA adapter
git clone https://huggingface.co/madhurjindal/Jailbreak-Detector-2-XL
cd Jailbreak-Detector-2-XL && git lfs pull && cd ..

# Process LoRA weights
tensorrt-edgellm-process-lora \
  --input_dir Jailbreak-Detector-2-XL \
  --output_dir $MODEL_NAME/onnx/lora_weights/jailbreak_detector

Step 2: Transfer to Device

# Transfer ONNX and LoRA weights to device
scp -r $MODEL_NAME/onnx \
  <device_user>@<device_ip>:~/tensorrt-edgellm-workspace/$MODEL_NAME/

Step 3: Build Engine (Thor Device)

export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Qwen2.5-0.5B-Instruct
cd ~/TensorRT-Edge-LLM

# Build engine with LoRA support
./build/examples/llm/llm_build \
  --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx \
  --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
  --maxBatchSize 1 \
  --maxInputLen 1024 \
  --maxKVCacheCapacity 4096 \
  --maxLoraRank 64

Step 4: Run Inference (Thor Device)

cd ~/TensorRT-Edge-LLM

./build/examples/llm/llm_inference \
  --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
  --inputFile $WORKSPACE_DIR/input.json \
  --outputFile $WORKSPACE_DIR/output.json

Note: You can add multiple LoRA adapters to the lora_weights/ directory and switch between them at runtime without rebuilding the engine. The Input Format Guide describes how to reference LoRA adapters in the input file.
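If you keep several adapters, the processing step from Step 1 can be repeated once per adapter. The sketch below loops over a list of local adapter directories and writes each one under lora_weights/, named after its source directory (Step 1 instead chose the name jailbreak_detector by hand). ADAPTER_DIRS is a hypothetical variable. When the tool or an adapter directory is not present, the loop only prints the command it would run.

```shell
MODEL_NAME="${MODEL_NAME:-Qwen2.5-0.5B-Instruct}"

# Hypothetical space-separated list of local directories, one per adapter.
ADAPTER_DIRS="${ADAPTER_DIRS:-Jailbreak-Detector-2-XL}"

for src in $ADAPTER_DIRS; do
  name=$(basename "$src")
  out="$MODEL_NAME/onnx/lora_weights/$name"
  if command -v tensorrt-edgellm-process-lora >/dev/null 2>&1 && [ -d "$src" ]; then
    # Same flags as the single-adapter command in Step 1.
    tensorrt-edgellm-process-lora --input_dir "$src" --output_dir "$out"
  else
    echo "would run: tensorrt-edgellm-process-lora --input_dir $src --output_dir $out"
  fi
done
```

Run it from $WORKSPACE_DIR on the x86 host, then transfer the onnx directory again so the new adapters reach the device.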

Example 5: Phi-4-Multimodal with LoRA Merge

Phi-4-Multimodal requires merging the vision LoRA adapter into the base model before quantization and export.

Model: Phi-4-multimodal-instruct

Step 1: Merge, Quantize, and Export (x86 Host)

export MODEL_NAME=Phi-4-multimodal-instruct
cd $WORKSPACE_DIR

# Clone Phi-4-multimodal-instruct from HuggingFace
git clone https://huggingface.co/microsoft/Phi-4-multimodal-instruct
cd Phi-4-multimodal-instruct && git lfs pull && cd ..

# Merge vision LoRA adapter into base model
tensorrt-edgellm-merge-lora \
  --model_dir Phi-4-multimodal-instruct \
  --lora_dir Phi-4-multimodal-instruct/vision-lora \
  --output_dir $MODEL_NAME/merged

# Quantize merged model
tensorrt-edgellm-quantize-llm \
  --model_dir $MODEL_NAME/merged \
  --output_dir $MODEL_NAME/quantized \
  --quantization nvfp4

# Export language model
tensorrt-edgellm-export-llm \
  --model_dir $MODEL_NAME/quantized \
  --output_dir $MODEL_NAME/onnx/llm

# Export visual encoder (use original weights, not merged)
tensorrt-edgellm-export-visual \
  --model_dir Phi-4-multimodal-instruct \
  --output_dir $MODEL_NAME/onnx/visual

Step 2: Transfer to Device

# Transfer ONNX to device
scp -r $MODEL_NAME/onnx \
  <device_user>@<device_ip>:~/tensorrt-edgellm-workspace/$MODEL_NAME/

Step 3: Build Engines (Thor Device)

export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Phi-4-multimodal-instruct
cd ~/TensorRT-Edge-LLM

# Build language model engine
./build/examples/llm/llm_build \
  --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/llm \
  --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/llm \
  --maxBatchSize 1 \
  --maxInputLen 1024 \
  --maxKVCacheCapacity 4096

# Build visual encoder engine
./build/examples/multimodal/visual_build \
  --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/visual \
  --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/visual \
  --minImageTokens 128 \
  --maxImageTokens 512 \
  --maxImageTokensPerImage 512

Step 4: Run Inference (Thor Device)

cd ~/TensorRT-Edge-LLM

./build/examples/llm/llm_inference \
  --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/llm \
  --multimodalEngineDir $WORKSPACE_DIR/$MODEL_NAME/engines/visual \
  --inputFile $WORKSPACE_DIR/input.json \
  --outputFile $WORKSPACE_DIR/output.json

Success! 🎉 Phi-4-Multimodal with merged vision adapter running on edge device.

Input File Format Reference

All examples in this guide use standardized JSON input files. For complete input format specification including all parameters, multi-turn conversations, LoRA adapters, and advanced features, see the Input Format Guide.

Common Build Parameters

LLM Build Parameters (`llm_build`)

Parameter	Description	Default	Used In
`--onnxDir`	Input ONNX directory	Required	All
`--engineDir`	Output engine directory	Required	All
`--maxBatchSize`	Maximum batch size	4	All
`--maxInputLen`	Maximum input length	1024	All
`--maxKVCacheCapacity`	Maximum KV-cache capacity (sequence length)	4096	All
`--maxLoraRank`	Maximum LoRA rank (0=disabled)	0	LoRA models
`--maxVerifyTreeSize`	Max verify tree size	60	EAGLE only
`--maxDraftTreeSize`	Max draft tree size	60	EAGLE only
`--eagleBase`	Build EAGLE base model	false	EAGLE only
`--eagleDraft`	Build EAGLE draft model	false	EAGLE only
`--debug`	Enable debug logging	false	Optional

Visual Build Parameters (`visual_build`)

Parameter	Description	Default
`--onnxDir`	Input ONNX directory	Required
`--engineDir`	Output engine directory	Required
`--minImageTokens`	Minimum image tokens	4
`--maxImageTokens`	Maximum image tokens	1024
`--maxImageTokensPerImage`	Max tokens per image	512
`--debug`	Enable debug logging	false

Inference Parameters (`llm_inference`)

Parameter	Description	Used In
`--engineDir`	Engine directory (required)	All
`--multimodalEngineDir`	Visual/draft engine directory	VLM/EAGLE
`--inputFile`	Input JSON path (required)	All
`--outputFile`	Output JSON path	All
`--eagle`	Enable EAGLE speculative decoding	EAGLE only
`--eagleDraftTopK`	Tokens selected per drafting step	EAGLE (default: 10)
`--eagleDraftStep`	Number of drafting steps	EAGLE (default: 6)
`--eagleVerifyTreeSize`	Tokens for verification	EAGLE (default: 60)
`--batchSize`	Override batch size from input file	Optional
`--maxGenerateLength`	Override max generate length	Optional
`--dumpProfile`	Enable profiling output	Optional
`--profileOutputFile`	Profile output path	Optional
`--warmup`	Number of warmup runs	Optional (default: 0)
`--debug`	Enable debug logging	Optional

Note: Sampling parameters (temperature, top_p, top_k) are specified in the input JSON file, not as command-line arguments.

Profiling and Performance Analysis

Enable profiling to measure inference performance:

cd ~/TensorRT-Edge-LLM

./build/examples/llm/llm_inference \
  --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
  --inputFile $WORKSPACE_DIR/input.json \
  --outputFile $WORKSPACE_DIR/output.json \
  --dumpProfile \
  --profileOutputFile $WORKSPACE_DIR/profile.json

The profile output includes:

Per-token latency
Prefill time
Generation time
KV-cache usage
Memory allocation statistics
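First-run overheads can skew these numbers, so it may help to combine profiling with the `--warmup` option from the inference parameter table. The sketch below assembles such a command and runs it only if the binary has been built. EDGELLM_ROOT is a hypothetical variable pointing at the repository checkout, and the warmup count of 3 is an arbitrary illustrative value.

```shell
WORKSPACE_DIR="${WORKSPACE_DIR:-$HOME/tensorrt-edgellm-workspace}"
MODEL_NAME="${MODEL_NAME:-Llama-3.1-8B-Instruct}"

# Hypothetical variable: location of the TensorRT-Edge-LLM checkout.
EDGELLM_ROOT="${EDGELLM_ROOT:-$HOME/TensorRT-Edge-LLM}"
BIN="$EDGELLM_ROOT/build/examples/llm/llm_inference"

# Collect the arguments once so they can be printed or executed unchanged.
set -- \
  --engineDir "$WORKSPACE_DIR/$MODEL_NAME/engines" \
  --inputFile "$WORKSPACE_DIR/input.json" \
  --outputFile "$WORKSPACE_DIR/output.json" \
  --dumpProfile \
  --profileOutputFile "$WORKSPACE_DIR/profile.json" \
  --warmup 3

if [ -x "$BIN" ]; then
  "$BIN" "$@"
else
  echo "llm_inference not found at $BIN; would run:"
  echo "$BIN $*"
fi
```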
