Python Export Pipeline#
Overview#
The TensorRT Edge-LLM Export Pipeline is a comprehensive Python-based system that transforms HuggingFace models into optimized ONNX representations suitable for TensorRT engine compilation. The pipeline handles model quantization, ONNX export, and specialized features like LoRA adaptation and multimodal processing.
Purpose#
The export pipeline serves as the first stage in the TensorRT Edge-LLM workflow:
%%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%
graph LR
HF_MODEL[HuggingFace<br>Model]
ONNX_FILES[ONNX<br>Models]
ENGINE_BUILDER[Engine<br>Builder]
TRT_ENGINE[TensorRT<br>Engine]
CPP_RUNTIME[C++<br>Runtime]
OUTPUT[Inference<br>Results]
subgraph EXPORT_SG [" "]
PYTHON_EXPORT[Python<br>Export<br>Pipeline]
end
HF_MODEL --> PYTHON_EXPORT
PYTHON_EXPORT --> ONNX_FILES
ONNX_FILES --> ENGINE_BUILDER
ENGINE_BUILDER --> TRT_ENGINE
TRT_ENGINE --> CPP_RUNTIME
CPP_RUNTIME --> OUTPUT
classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
classDef nvLightNode fill:#b8d67e,stroke:#76B900,stroke-width:1px,color:#333
classDef itemNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
classDef greenSubGraph fill:none,stroke:#76B900,stroke-width:1.5px
class HF_MODEL inputNode
class PYTHON_EXPORT nvNode
class ENGINE_BUILDER,CPP_RUNTIME nvLightNode
class ONNX_FILES,TRT_ENGINE itemNode
class OUTPUT darkNode
class EXPORT_SG greenSubGraph
%%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%
graph LR
HF_MODEL[HuggingFace<br>Model]
QUANTIZATION(Model<br>Quantization)
ONNX_EXPORT(ONNX<br>Export)
GRAPH_SURGERY(Graph<br>Surgery)
ONNX_OUTPUT[Optimized<br>ONNX Model]
subgraph EXPORT_TOOLS ["Python Export Pipeline"]
QUANTIZATION
ONNX_EXPORT
GRAPH_SURGERY
end
HF_MODEL --> QUANTIZATION
QUANTIZATION --> ONNX_EXPORT
ONNX_EXPORT --> GRAPH_SURGERY
GRAPH_SURGERY --> ONNX_OUTPUT
classDef greyNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef lightSubGraph fill:none,stroke:#aaa,stroke-width:1.5px
class HF_MODEL inputNode
class ONNX_OUTPUT darkNode
class QUANTIZATION,ONNX_EXPORT,GRAPH_SURGERY nvNode
class EXPORT_TOOLS lightSubGraph
Pipeline Stages#
Model Loading: Load HuggingFace model and tokenizer
Quantization (Optional): Apply precision reduction techniques
ONNX Export: Convert PyTorch model to ONNX format
Graph Surgery: Optimize ONNX graph for TensorRT
Configuration Generation: Create build configuration files
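If you prefer to drive these stages from a Python script rather than the shell, a minimal sketch is shown below. It uses only the command-line tools and flags that appear in the Usage Examples later on this page; the model name and output paths are placeholders.

import subprocess

# Placeholder model id and output paths (taken from the Usage Examples below).
model_id = "Qwen/Qwen2.5-0.5B-Instruct"
quant_dir = "quantized/qwen2.5-0.5b-fp8"
onnx_dir = "onnx_models/qwen2.5-0.5b"

# Stage 2 (optional): quantization.
subprocess.run(
    ["tensorrt-edgellm-quantize-llm",
     "--model_dir", model_id,
     "--quantization", "fp8",
     "--output_dir", quant_dir],
    check=True,
)

# Stages 1 and 3-5 (model loading, ONNX export, graph surgery, configuration
# generation) are handled internally by the export tool.
subprocess.run(
    ["tensorrt-edgellm-export-llm",
     "--model_dir", quant_dir,
     "--output_dir", onnx_dir],
    check=True,
)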
Export Tools#
TensorRT Edge-LLM provides specialized command-line tools to support quantization and export to ONNX format:
%%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%
graph LR
subgraph INPUTS [" "]
VLM_MODEL[VLM<BR>Model]
DRAFT_MODEL[EAGLE<BR>Draft Model]
BASE_MODEL[Base<BR>Model]
LORA_WEIGHTS[LoRA<BR>Weights]
end
subgraph QUANT [Optional Quantization]
QUANTIZE_VISUAL(Quantization via<BR>export-visual)
QUANTIZE_DRAFT(quantize-draft)
QUANTIZE_LLM(quantize-llm)
end
subgraph EXPORT [Export & Processing]
EXPORT_VISUAL(export-visual)
EXPORT_DRAFT(export-draft)
EXPORT_LLM(export-llm)
INSERT_LORA(insert-lora)
PROCESS_LORA(process-lora)
end
subgraph RESULTS [" "]
VISUAL_ONNX[Visual ONNX]
DRAFT_ONNX[Draft ONNX]
LLM_ONNX[Base ONNX]
LORA_ONNX[LoRA-Enabled<br>ONNX]
SAFETENSORS[SafeTensors]
end
VLM_MODEL --> QUANTIZE_VISUAL
VLM_MODEL --> QUANTIZE_LLM
QUANTIZE_VISUAL --> EXPORT_VISUAL
EXPORT_VISUAL --> VISUAL_ONNX
DRAFT_MODEL --> QUANTIZE_DRAFT
QUANTIZE_DRAFT --> EXPORT_DRAFT
EXPORT_DRAFT --> DRAFT_ONNX
BASE_MODEL --> QUANTIZE_LLM
QUANTIZE_LLM --> EXPORT_LLM
EXPORT_LLM --> LLM_ONNX
EXPORT_LLM -->|If LoRA| INSERT_LORA
INSERT_LORA --> LORA_ONNX
LORA_WEIGHTS --> PROCESS_LORA
PROCESS_LORA --> SAFETENSORS
classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef lightSubGraph fill:none,stroke:#aaa,stroke-width:1.5px
classDef greyNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef contextBox fill:none,stroke:transparent
class VLM_MODEL,DRAFT_MODEL,BASE_MODEL,LORA_WEIGHTS inputNode
class VISUAL_ONNX,DRAFT_ONNX,LLM_ONNX,LORA_ONNX,SAFETENSORS darkNode
class EXPORT_LLM,EXPORT_VISUAL,EXPORT_DRAFT,INSERT_LORA,PROCESS_LORA,QUANTIZE_DRAFT,QUANTIZE_LLM nvNode
class QUANTIZE_VISUAL greyNode
class QUANT,EXPORT lightSubGraph
class INPUTS,RESULTS contextBox
Note: Vision Language Models (VLMs) require both export-llm (for the language model component) and export-visual (for the vision encoder) to be fully exported. See the Multimodal VLM Export example below.
Tool Overview#
The tensorrt-edgellm package provides seven specialized command-line tools for different export scenarios:
| Tool | Inputs | Outputs | Description |
|---|---|---|---|
| tensorrt-edgellm-quantize-llm | HuggingFace Model | Quantized Model | Quantize LLM models using NVIDIA ModelOpt. Supports FP8, INT4 AWQ, and NVFP4 quantization methods for memory reduction and performance optimization. |
| tensorrt-edgellm-export-llm | HuggingFace/Quantized Model | ONNX Model | Export LLM models to ONNX format. Handles standard LLMs and EAGLE base models with precision-specific optimizations and graph surgery. |
| tensorrt-edgellm-export-visual | VLM Model | Visual ONNX | Export visual encoders for multimodal models. Supports vision components with FP8 quantization and dynamic resolution. |
| tensorrt-edgellm-export-draft | Base + Draft Models | Draft ONNX | Export EAGLE draft models for speculative decoding. Specialized export for EAGLE3 draft model architectures with vocabulary mapping support. |
| tensorrt-edgellm-quantize-draft | Base + Draft Models | Quantized Draft | Quantize EAGLE draft models. Specialized quantization for draft models using base model inputs for calibration. |
| tensorrt-edgellm-insert-lora | ONNX Model | LoRA-Enabled ONNX | Insert LoRA patterns into existing ONNX models. Adds dynamic LoRA support by modifying the computational graph. |
| tensorrt-edgellm-process-lora | LoRA Weights | SafeTensors | Process LoRA adapter weights according to TensorRT Edge-LLM specifications for runtime loading. |
Quantization Methods#
The export pipeline supports multiple quantization methods optimized for different hardware platforms and performance requirements:
| Method | Description | Precision | Platform Requirements | Memory Reduction |
|---|---|---|---|---|
| FP16 | Half-precision floating point | 16-bit | All platforms | Baseline |
| FP8 | 8-bit floating point | 8-bit | SM89+ (Ada Lovelace and newer) | 2x |
| INT8 SQ | 8-bit SmoothQuant | 8-bit | All platforms | 2x |
| INT4 AWQ | 4-bit integer with AWQ | 4-bit | All platforms | 4x |
| INT4 GPTQ | 4-bit GPTQ weight quantization | 4-bit | All platforms | 4x |
| NVFP4 | NVIDIA 4-bit floating point | 4-bit | SM100+ (Blackwell and newer) | 4x |
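The platform requirements above can be checked programmatically before choosing a method. A minimal sketch, assuming PyTorch is installed; fp8 appears in the Usage Examples below, while the other method strings are assumptions about the accepted --quantization values.

import torch

# Compute capability maps to the SM version used in the table above
# (SM89 = Ada Lovelace, SM100 = Blackwell).
major, minor = torch.cuda.get_device_capability(0)
sm = major * 10 + minor

if sm >= 100:
    method = "nvfp4"      # assumption: NVFP4 flag value
elif sm >= 89:
    method = "fp8"        # shown in the Usage Examples
else:
    method = "int4_awq"   # assumption: INT4 AWQ flag value

print(f"Detected SM{sm}; suggested --quantization {method}")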
Quantization Details#
FP16 (Baseline)
Standard half-precision floating point
Universal compatibility across all platforms
Best accuracy, largest memory footprint
Recommended for validation and accuracy baselines
FP8 (General Purpose)
8-bit floating point quantization
2x memory reduction with minimal accuracy loss
Requires SM89+ (Ada Lovelace generation or newer GPUs)
Automatic calibration using sample data
FP8 Vision Encoder: Supported for visual models
FP8 LM Head: Supported for language model heads
INT8 SQ (SmoothQuant)
8-bit integer quantization with SmoothQuant algorithm
Supported on all platforms; primarily intended for the Ampere generation
On the Blackwell generation, prefer FP8 or NVFP4 for better accuracy and performance
INT4 AWQ (Activation-Aware Weight Quantization)
4-bit integer weight quantization
Uses activation statistics for optimal quantization
4x memory reduction
Good accuracy preservation with proper calibration
Supported on all platforms
INT4 GPTQ
4-bit GPTQ weight quantization
Can load quantized models from HuggingFace directly
No additional quantization step needed for GPTQ checkpoints
Install:
BUILD_CUDA_EXT=0 pip install -v gptqmodel --no-build-isolation
Supported on all platforms
NVFP4 (NVIDIA Floating Point 4-bit)
NVIDIA’s proprietary 4-bit floating point format
Hardware-accelerated on SM100+ (Blackwell generation and newer GPUs)
4x memory reduction with optimal performance
Recommended for Thor platforms
NVFP4 LM Head: Supported for language model heads
Note: INT4 GPTQ models can be loaded directly from HuggingFace Hub or quantized using GPTQModel. No additional quantization step with tensorrt-edgellm-quantize-llm is required for pre-quantized GPTQ checkpoints.
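A minimal sketch of that flow is shown below: it downloads a pre-quantized GPTQ checkpoint and passes it straight to the export tool. The repository id is a placeholder, and huggingface_hub is assumed to be installed.

import subprocess
from huggingface_hub import snapshot_download

# Placeholder repo id -- substitute the GPTQ checkpoint you actually want to use.
local_dir = snapshot_download(repo_id="your-org/your-model-GPTQ-Int4")

# No tensorrt-edgellm-quantize-llm step is needed for a pre-quantized GPTQ checkpoint.
subprocess.run(
    ["tensorrt-edgellm-export-llm",
     "--model_dir", local_dir,
     "--output_dir", "onnx_models/your-model-gptq"],
    check=True,
)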
Security and Model Integrity#
⚠️ USER RESPONSIBILITY: Users are responsible for verifying the integrity of all model artifacts (base models, LoRA weights, tokenizers, configs) before exporting models to TensorRT Edge-LLM format.
Model Signing and Verification#
It is strongly recommended to use the model-signing package to sign and verify models before inference.
Installation:
pip install model-signing
Basic Usage:
# Sign a model
model_signing sign /path/to/your/model --signature model.sig
# Verify a model
model_signing verify /path/to/your/model \
--signature model.sig \
--identity "$identity" \
--identity_provider "$oidc_provider"
For more details, refer to the model-signing documentation.
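As a lightweight supplement to signing (not a replacement for it), the sketch below records SHA-256 digests of every file in a model directory so the artifacts can be re-checked before export; the paths are placeholders.

import hashlib
import json
from pathlib import Path

def digest_tree(model_dir: str) -> dict:
    """Return a relative-path -> SHA-256 digest manifest for a model directory."""
    digests = {}
    for path in sorted(Path(model_dir).rglob("*")):
        if path.is_file():
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[str(path.relative_to(model_dir))] = h.hexdigest()
    return digests

# Placeholder paths: write the manifest next to the model and re-check it before export.
manifest = digest_tree("/path/to/your/model")
Path("model.sha256.json").write_text(json.dumps(manifest, indent=2))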
Usage Examples#
Standard LLM Export#
# Step 1: Quantize model (optional)
tensorrt-edgellm-quantize-llm \
--model_dir Qwen/Qwen2.5-0.5B-Instruct \
--quantization fp8 \
--output_dir quantized/qwen2.5-0.5b-fp8
# Step 2: Export to ONNX
tensorrt-edgellm-export-llm \
--model_dir quantized/qwen2.5-0.5b-fp8 \
--output_dir onnx_models/qwen2.5-0.5b
Multimodal VLM Export#
# Export the LLM component (same as a standard LLM export)
tensorrt-edgellm-export-llm \
--model_dir Qwen/Qwen2.5-VL-3B-Instruct \
--output_dir onnx_models/qwen2.5-vl-3b
# Export visual encoder
tensorrt-edgellm-export-visual \
--model_dir Qwen/Qwen2.5-VL-3B-Instruct \
--output_dir onnx_models/qwen2.5-vl-3b/visual_enc_onnx
EAGLE3 Speculative Decoding Export#
# Download the draft model from Hugging Face or prepare your own. Install Git LFS first: https://git-lfs.com
git clone https://huggingface.co/Rayzl/qwen2.5-vl-7b-eagle3-sgl
cd qwen2.5-vl-7b-eagle3-sgl
git lfs pull
# Quantize base model
tensorrt-edgellm-quantize-llm \
--model_dir Qwen/Qwen2.5-VL-7B-Instruct \
--quantization fp8 \
--output_dir quantized/qwen2.5-vl-7b-base
# Export base model
tensorrt-edgellm-export-llm \
--model_dir quantized/qwen2.5-vl-7b-base \
--output_dir onnx_models/qwen2.5-vl-7b_eagle3_base \
--is_eagle_base
# Quantize draft model
tensorrt-edgellm-quantize-draft \
--base_model_dir Qwen/Qwen2.5-VL-7B-Instruct \
--draft_model_dir qwen2.5-vl-7b-eagle3-sgl \
--quantization fp8 \
--output_dir quantized/qwen2.5-vl-7b-draft
# Export draft model
tensorrt-edgellm-export-draft \
--draft_model_dir quantized/qwen2.5-vl-7b-draft \
--base_model_dir Qwen/Qwen2.5-VL-7B-Instruct \
--output_dir onnx_models/qwen2.5-vl-7b_eagle3_draft \
--use_prompt_tuning
# Export visual encoder
tensorrt-edgellm-export-visual \
--model_dir Qwen/Qwen2.5-VL-7B-Instruct \
--output_dir onnx_models/qwen2.5-vl-7b/visual_enc_onnx
LoRA-Enabled Export#
# Export base model
tensorrt-edgellm-export-llm \
--model_dir Qwen/Qwen2.5-0.5B-Instruct \
--output_dir onnx_models/qwen2.5-0.5b
# Insert LoRA support. This step is independent of any specific LoRA adapter
tensorrt-edgellm-insert-lora \
--onnx_dir onnx_models/qwen2.5-0.5b
# Download LoRA model(s) that you want to serve. This is just an example.
git clone https://huggingface.co/madhurjindal/Jailbreak-Detector-2-XL
cd Jailbreak-Detector-2-XL
git lfs pull
# Process LoRA weights
tensorrt-edgellm-process-lora \
--input_dir Jailbreak-Detector-2-XL \
--output_dir lora_weights
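To sanity-check the processed adapter, a minimal sketch that lists the tensors in the generated SafeTensors file; the exact file name under lora_weights/ is an assumption, so adjust it to whatever process-lora produced.

from safetensors import safe_open

# Assumed file name -- replace with the actual output under lora_weights/.
with safe_open("lora_weights/adapter.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        print(f"{name}: shape={tuple(tensor.shape)}, dtype={tensor.dtype}")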
Best Practices#
Security#
Verify model integrity before export (see Security and Model Integrity)
Sign models before deployment using model-signing
Model Selection#
Choose appropriate quantization: Match precision to platform capabilities
Validate accuracy: Test quantized models against FP16 baseline
Consider memory constraints: Use INT4/NVFP4 for memory-limited platforms
Quantization Strategy#
Start with FP16: Establish accuracy baseline
Try FP8 for accuracy: Best balance of accuracy and memory reduction on SM89+ hardware
Use INT4 for fast decoding: When prefill length is short and decode performance is critical
Leverage NVFP4 on Thor: Optimal prefill and decode performance on Thor
Export Workflow#
Validate model loading: Ensure model loads correctly from HuggingFace
Check tokenizer compatibility: Verify tokenizer exports properly
Test ONNX output: Validate the exported ONNX model with ONNX Runtime or the ONNX checker (see the sketch after this list)
Document configurations: Save export parameters for reproducibility
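A minimal sketch of the ONNX validation step referenced above. It performs a structural check only, since the graph-surgeried model may contain TensorRT-specific operators that ONNX Runtime cannot execute; the file name is a placeholder for whatever export-llm produced.

import onnx

model_path = "onnx_models/qwen2.5-0.5b/model.onnx"  # placeholder file name

# Passing the path (rather than a loaded proto) lets the checker handle models
# larger than 2 GB whose weights are stored as external data.
onnx.checker.check_model(model_path)

# Inspect graph inputs/outputs without pulling the external weight data into memory.
model = onnx.load(model_path, load_external_data=False)
print("Inputs: ", [i.name for i in model.graph.input])
print("Outputs:", [o.name for o in model.graph.output])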
Performance Optimization#
Calibration data quality: Use representative data for quantization
Batch export: Export multiple models in parallel when possible
Cache downloads: Reuse downloaded models across exports
Monitor memory usage: Track peak memory during export
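For the last point, a minimal sketch that polls GPU memory with NVML while an export command runs and reports the peak observed; it assumes the nvidia-ml-py package is installed, and the model and paths are the ones from the Usage Examples.

import subprocess
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

proc = subprocess.Popen(
    ["tensorrt-edgellm-export-llm",
     "--model_dir", "Qwen/Qwen2.5-0.5B-Instruct",
     "--output_dir", "onnx_models/qwen2.5-0.5b"]
)

peak = 0
while proc.poll() is None:
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    peak = max(peak, info.used)
    try:
        proc.wait(timeout=0.5)  # sample roughly twice per second
    except subprocess.TimeoutExpired:
        pass

print(f"Peak GPU memory during export: {peak / 2**30:.1f} GiB")
pynvml.nvmlShutdown()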
Common Issues and Solutions#
Issue: GPU Out of Memory During Export or Quantization#
Solution:
Use a larger GPU. Empirically, a 40 GB GPU is enough for models of 4B parameters or fewer, and an 80 GB GPU for models of 8B parameters or fewer.
You may try the --device cpu flag during quantization and export. However, CPU support may fail for some precisions.
Issue: Quantization Degrades Accuracy#
Solution:
Increase calibration dataset size or use less aggressive quantization
# Use FP8 instead of INT4: fp8 preserves accuracy better than int4, nvfp4, or int8_sq
# Increase --calib_size from the default to improve calibration
tensorrt-edgellm-quantize-llm \
--model_dir model_name \
--output_dir quantized/model_name \
--quantization fp8 \
--calib_size 512
Change the quantization recipe in tensorrt_edgellm/quantization/llm_quantization.py or tensorrt_edgellm/quantization/visual_quantization.py to disable quantization for the most sensitive layers. Follow the documentation of NVIDIA Model Optimizer.
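For orientation, a sketch of what such a recipe change can look like with NVIDIA Model Optimizer. The config keys and the wildcard layer pattern are assumptions based on ModelOpt's documented configuration format and may differ by version; the authoritative recipes are the files named above.

import copy
import modelopt.torch.quantization as mtq

# Assumption: start from a stock ModelOpt config and disable quantization for
# sensitive layers (here the lm_head) by adding a wildcard override.
quant_cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)
quant_cfg["quant_cfg"]["*lm_head*"] = {"enable": False}

# The recipe then passes this config to mtq.quantize(model, quant_cfg, forward_loop=...).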
Next Steps#
After exporting your model to ONNX:
Build TensorRT Engine: Use the Engine Builder to compile ONNX to TRT
Deploy with C++ Runtime: Use the C++ Runtime for inference
Run Examples: Try the Examples to validate your export
Additional Resources#
Model Signing: model-signing package
Python API Documentation: Refer to the tensorrt_edgellm/ directory
Quantization Details: Refer to NVIDIA Model Optimizer
ONNX Format: Refer to ONNX GitHub
Model Support: Refer to Supported Models