Python Export Pipeline#
Overview#
The TensorRT Edge-LLM Export Pipeline is a Python toolchain that converts HuggingFace models into optimized ONNX models ready for TensorRT engine compilation. It handles model quantization, ONNX export, and specialized features such as LoRA adaptation and multimodal processing.
Purpose#
The export pipeline serves as the first stage in the TensorRT Edge-LLM workflow:
%%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%
graph LR
HF_MODEL[HuggingFace<br>Model]
ONNX_FILES[ONNX<br>Models]
ENGINE_BUILDER[Engine<br>Builder]
TRT_ENGINE[TensorRT<br>Engine]
CPP_RUNTIME[C++<br>Runtime]
OUTPUT[Inference<br>Results]
subgraph EXPORT_SG [" "]
PYTHON_EXPORT[Python<br>Export<br>Pipeline]
end
HF_MODEL --> PYTHON_EXPORT
PYTHON_EXPORT --> ONNX_FILES
ONNX_FILES --> ENGINE_BUILDER
ENGINE_BUILDER --> TRT_ENGINE
TRT_ENGINE --> CPP_RUNTIME
CPP_RUNTIME --> OUTPUT
classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
classDef nvLightNode fill:#b8d67e,stroke:#76B900,stroke-width:1px,color:#333
classDef itemNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
classDef greenSubGraph fill:none,stroke:#76B900,stroke-width:1.5px
class HF_MODEL inputNode
class PYTHON_EXPORT nvNode
class ENGINE_BUILDER,CPP_RUNTIME nvLightNode
class ONNX_FILES,TRT_ENGINE itemNode
class OUTPUT darkNode
class EXPORT_SG greenSubGraph
%%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%
graph LR
HF_MODEL[HuggingFace<br>Model]
QUANTIZATION(Model<br>Quantization)
ONNX_EXPORT(ONNX<br>Export)
GRAPH_SURGERY(Graph<br>Surgery)
ONNX_OUTPUT[Optimized<br>ONNX Model]
subgraph EXPORT_TOOLS ["Python Export Pipeline"]
QUANTIZATION
ONNX_EXPORT
GRAPH_SURGERY
end
HF_MODEL --> QUANTIZATION
QUANTIZATION --> ONNX_EXPORT
ONNX_EXPORT --> GRAPH_SURGERY
GRAPH_SURGERY --> ONNX_OUTPUT
classDef greyNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef lightSubGraph fill:none,stroke:#aaa,stroke-width:1.5px
class HF_MODEL inputNode
class ONNX_OUTPUT darkNode
class QUANTIZATION,ONNX_EXPORT,GRAPH_SURGERY nvNode
class EXPORT_TOOLS lightSubGraph
Pipeline Stages#
Model Loading: Load HuggingFace model and tokenizer
Quantization (Optional): Apply precision reduction techniques
ONNX Export: Convert PyTorch model to ONNX format
Graph Surgery: Optimize ONNX graph for TensorRT
Configuration Generation: Create build configuration files
Export Tools#
TensorRT Edge-LLM provides specialized command-line tools to support quantization and export to ONNX format:
%%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%
graph LR
subgraph INPUTS [" "]
VLM_MODEL[VLM<BR>Model]
DRAFT_MODEL[EAGLE<BR>Draft Model]
BASE_MODEL[Base<BR>Model]
LORA_WEIGHTS[LoRA<BR>Weights]
end
subgraph QUANT [Optional Quantization]
QUANTIZE_VISUAL(Quantization via<BR>export-visual)
QUANTIZE_DRAFT(quantize-draft)
QUANTIZE_LLM(quantize-llm)
end
subgraph EXPORT [Export & Processing]
EXPORT_VISUAL(export-visual)
EXPORT_DRAFT(export-draft)
EXPORT_LLM(export-llm)
INSERT_LORA(insert-lora)
PROCESS_LORA(process-lora)
end
subgraph RESULTS [" "]
VISUAL_ONNX[Visual ONNX]
DRAFT_ONNX[Draft ONNX]
LLM_ONNX[Base ONNX]
LORA_ONNX[LoRA-Enabled<br>ONNX]
SAFETENSORS[SafeTensors]
end
VLM_MODEL --> QUANTIZE_VISUAL
VLM_MODEL --> QUANTIZE_LLM
QUANTIZE_VISUAL --> EXPORT_VISUAL
EXPORT_VISUAL --> VISUAL_ONNX
DRAFT_MODEL --> QUANTIZE_DRAFT
QUANTIZE_DRAFT --> EXPORT_DRAFT
EXPORT_DRAFT --> DRAFT_ONNX
BASE_MODEL --> QUANTIZE_LLM
QUANTIZE_LLM --> EXPORT_LLM
EXPORT_LLM --> LLM_ONNX
EXPORT_LLM -->|If LoRA| INSERT_LORA
INSERT_LORA --> LORA_ONNX
LORA_WEIGHTS --> PROCESS_LORA
PROCESS_LORA --> SAFETENSORS
classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef lightSubGraph fill:none,stroke:#aaa,stroke-width:1.5px
classDef greyNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef contextBox fill:none,stroke:transparent
class VLM_MODEL,DRAFT_MODEL,BASE_MODEL,LORA_WEIGHTS inputNode
class VISUAL_ONNX,DRAFT_ONNX,LLM_ONNX,LORA_ONNX,SAFETENSORS darkNode
class EXPORT_LLM,EXPORT_VISUAL,EXPORT_DRAFT,INSERT_LORA,PROCESS_LORA,QUANTIZE_DRAFT,QUANTIZE_LLM nvNode
class QUANTIZE_VISUAL greyNode
class QUANT,EXPORT lightSubGraph
class INPUTS,RESULTS contextBox
Note: Vision Language Models (VLMs) require both export-llm (for the language model component) and export-visual (for the vision encoder) to be fully exported. See the Multimodal VLM Export example below.
Tool Overview#
The tensorrt-edgellm package provides seven specialized command-line tools for different export scenarios:
| Tool | Inputs | Outputs | Description |
|---|---|---|---|
| tensorrt-edgellm-quantize-llm | HuggingFace Model | Quantized Model | Quantize LLM models using NVIDIA ModelOpt. Supports FP8, INT4 AWQ, and NVFP4 quantization methods for memory reduction and performance optimization. |
| tensorrt-edgellm-export-llm | HuggingFace/Quantized Model | ONNX Model | Export LLM models to ONNX format. Handles standard LLMs and EAGLE base models with precision-specific optimizations and graph surgery. |
| tensorrt-edgellm-export-visual | VLM Model | Visual ONNX | Export visual encoders for multimodal models. Supports vision components with FP8 quantization and dynamic resolution. |
| tensorrt-edgellm-export-draft | Base + Draft Models | Draft ONNX | Export EAGLE draft models for speculative decoding. Specialized export for EAGLE3 draft model architectures with vocabulary mapping support. |
| tensorrt-edgellm-quantize-draft | Base + Draft Models | Quantized Draft | Quantize EAGLE draft models. Specialized quantization for draft models using base model inputs for calibration. |
| tensorrt-edgellm-insert-lora | ONNX Model | LoRA-enabled ONNX | Insert LoRA patterns into existing ONNX models. Adds dynamic LoRA support to ONNX models by modifying the computational graph. |
| tensorrt-edgellm-process-lora | LoRA Weights | SafeTensors | Process LoRA weights for runtime use. Processes LoRA adapter weights according to TensorRT Edge-LLM specifications for runtime loading. |
Quantization Methods#
The export pipeline supports multiple quantization methods optimized for different hardware platforms and performance requirements.
For complete details on quantization methods, precision requirements, platform compatibility, and memory reduction, see the Precision Support section in the Supported Models guide.
Note: INT4 GPTQ models can be loaded directly from HuggingFace Hub or quantized using GPTQModel. No additional quantization step with tensorrt-edgellm-quantize-llm is required for pre-quantized GPTQ checkpoints.
Usage Examples#
Standard LLM Export#
# Step 1: Quantize model (optional)
tensorrt-edgellm-quantize-llm \
--model_dir Qwen/Qwen2.5-0.5B-Instruct \
--quantization fp8 \
--output_dir quantized/qwen2.5-0.5b-fp8
# Step 2: Export to ONNX
tensorrt-edgellm-export-llm \
--model_dir quantized/qwen2.5-0.5b-fp8 \
--output_dir onnx_models/qwen2.5-0.5b
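The two steps above can also be driven from a script, for example in a CI job. The following is a minimal sketch, assuming the command-line tools are installed and on the PATH. The helper names (`build_quantize_cmd`, `build_export_cmd`, `plan_export`, `run_plan`) are hypothetical and not part of the tensorrt-edgellm package; only the flags shown in the example above are used.

```python
import shlex
import subprocess
from typing import List, Optional


def build_quantize_cmd(model_dir: str, output_dir: str, quantization: str) -> List[str]:
    """Assemble the quantize-llm command with the flags used in the example above."""
    return [
        "tensorrt-edgellm-quantize-llm",
        "--model_dir", model_dir,
        "--quantization", quantization,
        "--output_dir", output_dir,
    ]


def build_export_cmd(model_dir: str, output_dir: str) -> List[str]:
    """Assemble the export-llm command with the flags used in the example above."""
    return [
        "tensorrt-edgellm-export-llm",
        "--model_dir", model_dir,
        "--output_dir", output_dir,
    ]


def plan_export(model: str, quantized_dir: str, onnx_dir: str,
                quantization: Optional[str] = None) -> List[List[str]]:
    """Return the commands in execution order; quantization is skipped when None."""
    commands: List[List[str]] = []
    source = model
    if quantization is not None:
        commands.append(build_quantize_cmd(model, quantized_dir, quantization))
        source = quantized_dir  # export reads the quantized checkpoint
    commands.append(build_export_cmd(source, onnx_dir))
    return commands


def run_plan(commands: List[List[str]]) -> None:
    """Run each command and stop at the first failure (check=True raises)."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    plan = plan_export(
        "Qwen/Qwen2.5-0.5B-Instruct",
        "quantized/qwen2.5-0.5b-fp8",
        "onnx_models/qwen2.5-0.5b",
        quantization="fp8",
    )
    for cmd in plan:
        print(shlex.join(cmd))
```

Separating command construction from execution keeps the plan easy to log and review before any GPU time is spent; call `run_plan(plan)` to execute it.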
Multimodal VLM Export#
# Export LLM component (same command as for a standard LLM)
tensorrt-edgellm-export-llm \
--model_dir Qwen/Qwen2.5-VL-3B-Instruct \
--output_dir onnx_models/qwen2.5-vl-3b
# Export visual encoder
tensorrt-edgellm-export-visual \
--model_dir Qwen/Qwen2.5-VL-3B-Instruct \
--output_dir onnx_models/qwen2.5-vl-3b/visual_enc_onnx
EAGLE3 Speculative Decoding Export#
# Download draft model from HF or prepare your own. Install git lfs first using https://git-lfs.com
git clone https://huggingface.co/Rayzl/qwen2.5-vl-7b-eagle3-sgl
cd qwen2.5-vl-7b-eagle3-sgl
git lfs pull
# Quantize base model
tensorrt-edgellm-quantize-llm \
--model_dir Qwen/Qwen2.5-VL-7B-Instruct \
--quantization fp8 \
--output_dir quantized/qwen2.5-vl-7b-base
# Export base model
tensorrt-edgellm-export-llm \
--model_dir quantized/qwen2.5-vl-7b-base \
--output_dir onnx_models/qwen2.5-vl-7b_eagle3_base \
--is_eagle_base
# Quantize draft model
tensorrt-edgellm-quantize-draft \
--base_model_dir Qwen/Qwen2.5-VL-7B-Instruct \
--draft_model_dir qwen2.5-vl-7b-eagle3-sgl \
--quantization fp8 \
--output_dir quantized/qwen2.5-vl-7b-draft
# Export draft model
tensorrt-edgellm-export-draft \
--draft_model_dir quantized/qwen2.5-vl-7b-draft \
--base_model_dir Qwen/Qwen2.5-VL-7B-Instruct \
--output_dir onnx_models/qwen2.5-vl-7b_eagle3_draft
# Export visual encoder
tensorrt-edgellm-export-visual \
--model_dir Qwen/Qwen2.5-VL-7B-Instruct \
--output_dir onnx_models/qwen2.5-vl-7b/visual_enc_onnx
Where to Get Draft Models for EAGLE3#
Open-Source Draft Models:
Draft models for EAGLE speculative decoding can be found on HuggingFace:
EAGLE-3 Models on HuggingFace - Official list of available EAGLE-3 draft models for various base models
Search HuggingFace for your specific base model name + “EAGLE”
Training Your Own Draft Models:
If no pre-trained draft model exists for your base model, you’ll need to train one yourself. Refer to the EAGLE training repository for instructions on training draft models.
Important: Draft models must be trained specifically for their corresponding base model. A draft model trained for Qwen2.5-7B will only work with that exact base model and cannot be used with other models.
LoRA-Enabled Export#
# Export base model
tensorrt-edgellm-export-llm \
--model_dir Qwen/Qwen2.5-0.5B-Instruct \
--output_dir onnx_models/qwen2.5-0.5b
# Insert LoRA support. This step does not depend on any particular LoRA adapter
tensorrt-edgellm-insert-lora \
--onnx_dir onnx_models/qwen2.5-0.5b
# Download LoRA model(s) that you want to serve. This is just an example.
git clone https://huggingface.co/madhurjindal/Jailbreak-Detector-2-XL
cd Jailbreak-Detector-2-XL
git lfs pull
# Process LoRA weights
tensorrt-edgellm-process-lora \
--input_dir Jailbreak-Detector-2-XL \
--output_dir lora_weights
Best Practices#
Model Selection#
Choose appropriate quantization: Match precision to platform capabilities
Validate accuracy: Test quantized models against FP16 baseline
Consider memory constraints: Use INT4/NVFP4 for memory-limited platforms
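One way to act on the "validate accuracy" item is to score the FP16 and quantized models on the same prompts and compare. The sketch below assumes you have already collected generated answers from both models as plain strings; the function names are hypothetical, and exact-match accuracy is only a crude proxy that should be replaced with a task-appropriate metric where one exists.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Exact-match accuracy after trimming surrounding whitespace."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def accuracy_drop(fp16_preds: Sequence[str], quant_preds: Sequence[str],
                  references: Sequence[str]) -> float:
    """Positive values mean the quantized model scored lower than the FP16 baseline."""
    return accuracy(fp16_preds, references) - accuracy(quant_preds, references)


if __name__ == "__main__":
    # Toy data for illustration only; real checks need a representative evaluation set.
    refs = ["4", "9", "16", "25"]
    fp16 = ["4", "9", "16", "25"]
    quant = ["4", "9", "16", "26"]
    print(f"accuracy drop: {accuracy_drop(fp16, quant, refs):.2f}")
```

What counts as an acceptable drop depends on the application, so the threshold is deliberately left to the caller.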
Quantization Strategy#
Start with FP16: Establish accuracy baseline
Try FP8 for accuracy: Best balance of accuracy and memory reduction on SM89+ hardware
Use INT4 for fast decoding: When prefill length is short and decode performance is critical
Leverage NVFP4 on Thor: Optimal prefill and decode performance on Thor
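The guidance above can be written down as a small decision helper. This is only a sketch: the function name, its parameters, and the order in which the rules are checked are assumptions made for illustration, and the returned strings are human-readable labels, not values for the --quantization flag. Always confirm the choice against the Precision Support section and your own accuracy measurements.

```python
def suggest_precision(sm_version: int, is_thor: bool = False,
                      memory_limited: bool = False, decode_bound: bool = False) -> str:
    """Map the strategy list above to a starting precision.

    sm_version is the GPU compute capability as an integer (89 for SM89).
    The rule ordering below is an assumption, not an official policy.
    """
    if is_thor:
        return "NVFP4"  # "Leverage NVFP4 on Thor"
    if memory_limited or decode_bound:
        return "INT4"   # short prefill, decode-critical, or tight memory
    if sm_version >= 89:
        return "FP8"    # accuracy/memory balance on SM89+ hardware
    return "FP16"       # baseline when none of the above applies


if __name__ == "__main__":
    print(suggest_precision(89))
    print(suggest_precision(87, decode_bound=True))
```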
Export Workflow#
Validate model loading: Ensure model loads correctly from HuggingFace
Check tokenizer compatibility: Verify tokenizer exports properly
Test ONNX output: Validate ONNX model with ONNX Runtime
Document configurations: Save export parameters for reproducibility
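For the "document configurations" item, a simple approach is to write the parameters of each run next to its output. The sketch below uses only the Python standard library; the function name and the `export_record.json` file name are hypothetical choices, not files that the export tools read or write.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Dict


def save_export_record(output_dir: str, tool: str, params: Dict[str, str]) -> Path:
    """Write the tool name, its parameters, and a UTC timestamp as JSON."""
    record = {
        "tool": tool,
        "params": params,
        "created_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "export_record.json"  # hypothetical file name
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return path


if __name__ == "__main__":
    # Demonstrate in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        saved = save_export_record(
            tmp,
            "tensorrt-edgellm-export-llm",
            {"model_dir": "quantized/qwen2.5-0.5b-fp8",
             "output_dir": "onnx_models/qwen2.5-0.5b"},
        )
        print(saved.read_text())
```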
Performance Optimization#
Calibration data quality: Use representative data for quantization
Batch export: Export multiple models in parallel when possible
Cache downloads: Reuse downloaded models across exports
Monitor memory usage: Track peak memory during export
Model Signing and Verification#
Verify model integrity: Users are responsible for verifying the integrity of model artifacts (base models, LoRA weights, tokenizers, configs) before deployment
Sign models: It is strongly recommended to use the model-signing package to sign and verify models before inference.
Installation:
pip install model-signing
Basic Usage:
# Sign a model
model_signing sign /path/to/your/model --signature model.sig
# Verify a model
model_signing verify /path/to/your/model \
--signature model.sig \
--identity "$identity" \
--identity_provider "$oidc_provider"
For more details, refer to the model-signing documentation
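As a lightweight complement to signing, a checksum manifest can catch accidental corruption or incomplete copies when artifacts move between machines. The sketch below uses only the Python standard library, and its function names are hypothetical. It does not authenticate who produced the files, so it is not a substitute for the model-signing workflow above.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(model_dir: str) -> Dict[str, str]:
    """Map each file's path (relative to model_dir) to its SHA-256 digest."""
    root = Path(model_dir)
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(expected: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """List files that were added, removed, or modified between two manifests."""
    names = set(expected) | set(actual)
    return sorted(n for n in names if expected.get(n) != actual.get(n))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "config.json").write_bytes(b"{}")
        before = build_manifest(tmp)
        (Path(tmp) / "config.json").write_bytes(b"{ }")
        print(changed_files(before, build_manifest(tmp)))
```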
Common Issues and Solutions#
Issue: Model Download Fails or Times Out#
Cause: Network issues, insufficient disk space, or HuggingFace access problems.
Solution:
Check disk space:
df -h $WORKSPACE_DIR
# Ensure at least 10-20GB free for small models
Check network connectivity:
curl -I https://huggingface.co
# Should return HTTP 200 OK
For gated models (Llama, Phi-4), log in to HuggingFace:
huggingface-cli login
# Enter your access token
Manual download as a workaround:
git lfs install
git clone https://huggingface.co/Qwen/Qwen3-0.6B
# Then use local path for quantization
tensorrt-edgellm-quantize-llm \
--model_dir ./Qwen3-0.6B \
--output_dir quantized/Qwen3-0.6B \
--quantization fp8
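The disk-space and connectivity checks above can be combined into a small preflight script that runs before a long download. This is a minimal sketch using only the Python standard library; the function names and the 20 GiB threshold are illustrative assumptions, so adjust the threshold to the model you plan to fetch.

```python
import shutil
import urllib.request


def free_gib(path: str) -> float:
    """Free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path: str, required_gib: float) -> bool:
    """True when at least required_gib GiB are free at path."""
    return free_gib(path) >= required_gib


def can_reach(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request (like curl -I) and report whether it succeeded."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except OSError:  # covers URLError, HTTPError, and timeouts
        return False


if __name__ == "__main__":
    # 20 GiB is an illustrative threshold, not a documented requirement.
    print("disk ok:", has_room(".", 20.0))
    print("network ok:", can_reach("https://huggingface.co"))
```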
Issue: GPU Out of Memory During Export or Quantization#
Cause: Model size exceeds available GPU memory.
Solution:
Move to a larger GPU. Empirically, a 40 GB GPU is sufficient for models with 4B parameters or fewer, and an 80 GB GPU for models with 8B parameters or fewer.
You may try the --device cpu flag during quantization and export. However, CPU execution may fail for some precisions.
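For a rough sense of scale, the memory taken by the weights alone is the parameter count multiplied by the bytes per parameter. The sketch below is back-of-the-envelope arithmetic, not a measurement: the helper name is hypothetical, the INT4 figure ignores scale and zero-point overhead, and real peak usage during quantization and export is higher because of activations, calibration batches, and temporary buffers.

```python
# Storage per parameter for the weights only (INT4 ignores scale overhead).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gib(num_params: float, precision: str) -> float:
    """Approximate size of the weights in GiB for a given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


if __name__ == "__main__":
    for precision in ("fp16", "fp8", "int4"):
        print(f"4B params at {precision}: {weight_gib(4e9, precision):.1f} GiB")
```

At FP16, a model with 4 billion parameters needs about 7.5 GiB for weights, which suggests that most of the headroom in the empirical guidance above goes to working memory rather than to the checkpoint itself.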
Issue: Calibration Dataset Download Fails (cnn_dailymail not found)#
Cause: Network connectivity issues preventing download of the calibration dataset from HuggingFace, or firewall/proxy blocking access.
Solution:
Check network connectivity to HuggingFace:
curl -I https://huggingface.co
curl -I https://huggingface.co/datasets/abisee/cnn_dailymail
# Should return HTTP 200 OK
If network is down or blocked, download the dataset manually:
git lfs install
git clone https://huggingface.co/datasets/abisee/cnn_dailymail
# Pass the local dataset path explicitly to quantization
tensorrt-edgellm-quantize-llm \
--model_dir Qwen/Qwen3-0.6B \
--output_dir quantized/Qwen3-0.6B \
--quantization fp8 \
--calib_dataset ./cnn_dailymail/3.0.0
Note: Replace ./cnn_dailymail/3.0.0 with the actual path where you downloaded the dataset. Dataset version 3.0.0 is commonly used for calibration. Passing the path explicitly lets you use a local copy instead of relying on the HuggingFace cache.
Issue: Quantization Degrades Accuracy#
Cause: Aggressive quantization or insufficient calibration.
Solution:
Use less aggressive quantization:
# Use FP8 instead of INT4 for better accuracy
tensorrt-edgellm-quantize-llm \
--model_dir model_name \
--output_dir quantized/model_name \
--quantization fp8 # Better accuracy than int4, nvfp4, or int8_sq
Change the quantization recipe in tensorrt_edgellm/quantization/llm_quantization.py or tensorrt_edgellm/quantization/visual_quantization.py to disable quantization for the most sensitive layers. Follow the NVIDIA Model Optimizer documentation.
Increase the calibration size (num_samples field) in tensorrt_edgellm/quantization/llm_quantization.py.