Python Export Pipeline#

Overview#

The TensorRT Edge-LLM Export Pipeline is a comprehensive Python-based system that transforms HuggingFace models into optimized ONNX representations suitable for TensorRT engine compilation. The pipeline handles model quantization, ONNX export, and specialized features like LoRA adaptation and multimodal processing.

Purpose#

The export pipeline serves as the first stage in the TensorRT Edge-LLM workflow:

        %%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%
graph LR
    HF_MODEL[HuggingFace<br>Model]
    ONNX_FILES[ONNX<br>Models]
    ENGINE_BUILDER[Engine<br>Builder]
    TRT_ENGINE[TensorRT<br>Engine]
    CPP_RUNTIME[C++<br>Runtime]
    OUTPUT[Inference<br>Results]
    
    subgraph EXPORT_SG [" "]
        PYTHON_EXPORT[Python<br>Export<br>Pipeline]
    end
    
    HF_MODEL --> PYTHON_EXPORT
    PYTHON_EXPORT --> ONNX_FILES
    ONNX_FILES --> ENGINE_BUILDER
    ENGINE_BUILDER --> TRT_ENGINE
    TRT_ENGINE --> CPP_RUNTIME
    CPP_RUNTIME --> OUTPUT
    
    classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
    classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
    classDef nvLightNode fill:#b8d67e,stroke:#76B900,stroke-width:1px,color:#333
    classDef itemNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
    classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
    classDef greenSubGraph fill:none,stroke:#76B900,stroke-width:1.5px
    
    class HF_MODEL inputNode
    class PYTHON_EXPORT nvNode
    class ENGINE_BUILDER,CPP_RUNTIME nvLightNode
    class ONNX_FILES,TRT_ENGINE itemNode
    class OUTPUT darkNode
    class EXPORT_SG greenSubGraph
    
        %%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%

graph LR
    HF_MODEL[HuggingFace<br>Model]
    QUANTIZATION(Model<br>Quantization)
    ONNX_EXPORT(ONNX<br>Export)
    GRAPH_SURGERY(Graph<br>Surgery)
    ONNX_OUTPUT[Optimized<br>ONNX Model]
    
    subgraph EXPORT_TOOLS ["Python Export Pipeline"]
        QUANTIZATION
        ONNX_EXPORT
        GRAPH_SURGERY
    end
    
    HF_MODEL --> QUANTIZATION
    QUANTIZATION --> ONNX_EXPORT
    ONNX_EXPORT --> GRAPH_SURGERY
    GRAPH_SURGERY --> ONNX_OUTPUT

    classDef greyNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
    classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
    classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
    classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
    classDef lightSubGraph fill:none,stroke:#aaa,stroke-width:1.5px

    class HF_MODEL inputNode
    class ONNX_OUTPUT darkNode
    class QUANTIZATION,ONNX_EXPORT,GRAPH_SURGERY nvNode
    class EXPORT_TOOLS lightSubGraph
    

Pipeline Stages#

  1. Model Loading: Load HuggingFace model and tokenizer

  2. Quantization (Optional): Apply precision reduction techniques

  3. ONNX Export: Convert PyTorch model to ONNX format

  4. Graph Surgery: Optimize ONNX graph for TensorRT

  5. Configuration Generation: Create build configuration files
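
For orientation, the sketch below shows roughly how these stages map onto plain PyTorch/Transformers calls. It is a minimal conceptual illustration in Python, not the tensorrt-edgellm implementation: the real pipeline adds quantization, KV-cache inputs, graph surgery, and configuration generation through the command-line tools described in the next section.

# Conceptual sketch only -- not the tensorrt-edgellm pipeline code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # example model used elsewhere in this guide

# 1. Model loading
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.config.use_cache = False  # keep the traced graph simple for this sketch
model.eval()

# 2. Quantization (optional) would happen here, e.g. via tensorrt-edgellm-quantize-llm.

# 3. ONNX export with dynamic batch and sequence axes
dummy = tokenizer("Hello", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
    opset_version=17,
)

# 4./5. Graph surgery and build-configuration generation are handled by the
#       export tools described below.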

Export Tools#

TensorRT Edge-LLM provides specialized command-line tools to support quantization and export to ONNX format:

        %%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%

graph LR
    subgraph INPUTS [" "]
        VLM_MODEL[VLM<BR>Model]
        DRAFT_MODEL[EAGLE<BR>Draft Model]
        BASE_MODEL[Base<BR>Model]
        LORA_WEIGHTS[LoRA<BR>Weights]
    end
    
    subgraph QUANT [Optional Quantization]
        QUANTIZE_VISUAL(Quantization via<BR>export-visual)
        QUANTIZE_DRAFT(quantize-draft)
        QUANTIZE_LLM(quantize-llm)
    end
    
    subgraph EXPORT [Export & Processing]
        EXPORT_VISUAL(export-visual)
        EXPORT_DRAFT(export-draft)
        EXPORT_LLM(export-llm)
        INSERT_LORA(insert-lora)
        PROCESS_LORA(process-lora)
    end
    
    subgraph RESULTS [" "]
        VISUAL_ONNX[Visual ONNX]
        DRAFT_ONNX[Draft ONNX]
        LLM_ONNX[Base ONNX]
        LORA_ONNX[LoRA-Enabled<br>ONNX]
        SAFETENSORS[SafeTensors]
    end
    
    VLM_MODEL --> QUANTIZE_VISUAL
    VLM_MODEL --> QUANTIZE_LLM
    QUANTIZE_VISUAL --> EXPORT_VISUAL
    EXPORT_VISUAL --> VISUAL_ONNX
    
    DRAFT_MODEL --> QUANTIZE_DRAFT
    QUANTIZE_DRAFT --> EXPORT_DRAFT
    EXPORT_DRAFT --> DRAFT_ONNX
    
    BASE_MODEL --> QUANTIZE_LLM
    QUANTIZE_LLM --> EXPORT_LLM
    EXPORT_LLM --> LLM_ONNX
    EXPORT_LLM -->|If LoRA| INSERT_LORA
    INSERT_LORA --> LORA_ONNX
    
    LORA_WEIGHTS --> PROCESS_LORA
    PROCESS_LORA --> SAFETENSORS

    classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
    classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
    classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
    classDef lightSubGraph fill:none,stroke:#aaa,stroke-width:1.5px
    classDef greyNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
    classDef contextBox fill:none,stroke:transparent

    class VLM_MODEL,DRAFT_MODEL,BASE_MODEL,LORA_WEIGHTS inputNode
    class VISUAL_ONNX,DRAFT_ONNX,LLM_ONNX,LORA_ONNX,SAFETENSORS darkNode
    class EXPORT_LLM,EXPORT_VISUAL,EXPORT_DRAFT,INSERT_LORA,PROCESS_LORA,QUANTIZE_DRAFT,QUANTIZE_LLM nvNode
    class QUANTIZE_VISUAL greyNode
    class QUANT,EXPORT lightSubGraph
    class INPUTS,RESULTS contextBox
    

Note: Vision Language Models (VLMs) require both export-llm (for the language model component) and export-visual (for the vision encoder) for a complete export. See the Multimodal VLM Export example below.

Tool Overview#

The tensorrt-edgellm package provides seven specialized command-line tools for different export scenarios:

| Tool | Inputs | Outputs | Description |
|---|---|---|---|
| quantize-llm | HuggingFace Model | Quantized Model | Quantize LLM models using NVIDIA ModelOpt. Supports FP8, INT4 AWQ, and NVFP4 quantization methods for memory reduction and performance optimization |
| export-llm | HuggingFace/Quantized Model | ONNX Model | Export LLM models to ONNX format. Handles standard LLMs and EAGLE base models with precision-specific optimizations and graph surgery |
| export-visual | VLM Model | Visual ONNX | Export visual encoders for multimodal models. Supports vision components with FP8 quantization and dynamic resolution |
| export-draft | Base + Draft Models | Draft ONNX | Export EAGLE draft models for speculative decoding. Specialized export for EAGLE3 draft model architectures with vocabulary mapping support |
| quantize-draft | Base + Draft Models | Quantized Draft | Quantize EAGLE draft models. Specialized quantization that uses base model inputs for calibration |
| insert-lora | ONNX Model | LoRA-enabled ONNX | Insert LoRA patterns into existing ONNX models, adding dynamic LoRA support by modifying the computational graph |
| process-lora | LoRA Weights | SafeTensors | Process LoRA adapter weights according to TensorRT Edge-LLM specifications for runtime loading |


Quantization Methods#

The export pipeline supports multiple quantization methods optimized for different hardware platforms and performance requirements:

| Method | Description | Precision | Platform Requirements | Memory Reduction |
|---|---|---|---|---|
| FP16 | Half-precision floating point | 16-bit | All platforms | Baseline |
| FP8 | 8-bit floating point | 8-bit | SM89+ (Ada Lovelace and newer) | 2x |
| INT8 SQ | 8-bit SmoothQuant | 8-bit | All platforms | 2x |
| INT4 AWQ | 4-bit integer with AWQ | 4-bit | All platforms | 4x |
| INT4 GPTQ | 4-bit GPTQ weight quantization | 4-bit | All platforms | 4x |
| NVFP4 | NVIDIA 4-bit floating point | 4-bit | SM100+ (Blackwell and newer) | 4x |
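
As a quick way to map the table above onto your hardware, the following sketch (assuming a CUDA-enabled PyTorch installation) reports the GPU's compute capability and whether the FP8 and NVFP4 hardware requirements are met:

import torch

# Compute capability (major, minor), e.g. (8, 9) for Ada Lovelace, (10, 0) for Blackwell
major, minor = torch.cuda.get_device_capability()
sm = major * 10 + minor
print(f"Detected SM{sm}")
print("FP8 eligible (SM89+):   ", sm >= 89)   # Ada Lovelace and newer
print("NVFP4 eligible (SM100+):", sm >= 100)  # Blackwell and newer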

Quantization Details#

FP16 (Baseline)

  • Standard half-precision floating point

  • Universal compatibility across all platforms

  • Best accuracy, largest memory footprint

  • Recommended for validation and accuracy baselines

FP8 (General Purpose)

  • 8-bit floating point quantization

  • 2x memory reduction with minimal accuracy loss

  • Requires SM89+ (Ada Lovelace generation or newer GPUs)

  • Automatic calibration using sample data

  • FP8 Vision Encoder: Supported for visual models

  • FP8 LM Head: Supported for language model heads

INT8 SQ (SmoothQuant)

  • 8-bit integer quantization with SmoothQuant algorithm

  • Supported on all platforms; primarily intended for the Ampere generation

  • On the Blackwell generation, prefer FP8 or NVFP4 for better accuracy and performance

INT4 AWQ (Activation-Aware Weight Quantization)

  • 4-bit integer weight quantization

  • Uses activation statistics for optimal quantization

  • 4x memory reduction

  • Good accuracy preservation with proper calibration

  • Supported on all platforms

INT4 GPTQ

  • 4-bit GPTQ weight quantization

  • Can load quantized models from HuggingFace directly

  • No additional quantization step needed for GPTQ checkpoints

  • Install: BUILD_CUDA_EXT=0 pip install -v gptqmodel --no-build-isolation

  • Supported on all platforms

NVFP4 (NVIDIA Floating Point 4-bit)

  • NVIDIA’s proprietary 4-bit floating point format

  • Hardware-accelerated on SM100+ (Blackwell generation and newer GPUs)

  • 4x memory reduction with optimal performance

  • Recommended for Thor platforms

  • NVFP4 LM Head: Supported for language model heads

Note: INT4 GPTQ models can be loaded directly from HuggingFace Hub or quantized using GPTQModel. No additional quantization step with tensorrt-edgellm-quantize-llm is required for pre-quantized GPTQ checkpoints.
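
To check whether a local checkpoint is already GPTQ-quantized, you can inspect its config.json, which by HuggingFace convention carries a quantization_config block for pre-quantized models. This is a small illustrative sketch; the directory path is a placeholder:

import json
from pathlib import Path

def is_gptq_checkpoint(model_dir: str) -> bool:
    """Return True if config.json declares a GPTQ quantization_config."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return config.get("quantization_config", {}).get("quant_method") == "gptq"

if is_gptq_checkpoint("path/to/checkpoint"):
    print("Pre-quantized GPTQ model: skip tensorrt-edgellm-quantize-llm.")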


Security and Model Integrity#

⚠️ USER RESPONSIBILITY: Users are responsible for verifying the integrity of all model artifacts (base models, LoRA weights, tokenizers, configs) before exporting models to TensorRT Edge-LLM format.

Model Signing and Verification#

It is strongly recommended to use the model-signing package to sign and verify models before inference.

Installation:

pip install model-signing

Basic Usage:

# Sign a model
model_signing sign /path/to/your/model --signature model.sig

# Verify a model
model_signing verify /path/to/your/model \
  --signature model.sig \
  --identity "$identity" \
  --identity_provider "$oidc_provider"

For more details, refer to the model-signing documentation.
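
As a lightweight supplement to model signing (not a replacement for it), you can record SHA-256 digests of every file in a model directory and compare them against a trusted manifest before export. A minimal Python sketch, with a placeholder path:

import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

model_dir = Path("path/to/your/model")  # placeholder
for path in sorted(p for p in model_dir.rglob("*") if p.is_file()):
    print(sha256_file(path), path.relative_to(model_dir))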


Usage Examples#

Standard LLM Export#

# Step 1: Quantize model (optional)
tensorrt-edgellm-quantize-llm \
  --model_dir Qwen/Qwen2.5-0.5B-Instruct \
  --quantization fp8 \
  --output_dir quantized/qwen2.5-0.5b-fp8

# Step 2: Export to ONNX
tensorrt-edgellm-export-llm \
  --model_dir quantized/qwen2.5-0.5b-fp8 \
  --output_dir onnx_models/qwen2.5-0.5b
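
After the export finishes, a quick structural sanity check with the onnx Python package can catch obvious problems early. The ONNX file name inside the output directory is an assumption here; adjust it to whatever the export tool actually produced:

import onnx

onnx_path = "onnx_models/qwen2.5-0.5b/model.onnx"  # assumed file name
model = onnx.load(onnx_path)
# For models larger than 2 GB, pass the file path (not the loaded proto) to check_model.
onnx.checker.check_model(onnx_path)
print("Opsets:", [(op.domain or "ai.onnx", op.version) for op in model.opset_import])
print("Inputs:", [inp.name for inp in model.graph.input])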

Multimodal VLM Export#

# Export the LLM component (same as a standard LLM export)
tensorrt-edgellm-export-llm \
  --model_dir Qwen/Qwen2.5-VL-3B-Instruct \
  --output_dir onnx_models/qwen2.5-vl-3b

# Export visual encoder
tensorrt-edgellm-export-visual \
  --model_dir Qwen/Qwen2.5-VL-3B-Instruct \
  --output_dir onnx_models/qwen2.5-vl-3b/visual_enc_onnx

EAGLE3 Speculative Decoding Export#

# Download the draft model from HuggingFace or prepare your own. Install Git LFS first (https://git-lfs.com)
git clone https://huggingface.co/Rayzl/qwen2.5-vl-7b-eagle3-sgl
cd qwen2.5-vl-7b-eagle3-sgl
git lfs pull

# Quantize base model
tensorrt-edgellm-quantize-llm \
  --model_dir Qwen/Qwen2.5-VL-7B-Instruct \
  --quantization fp8 \
  --output_dir quantized/qwen2.5-vl-7b-base

# Export base model
tensorrt-edgellm-export-llm \
  --model_dir quantized/qwen2.5-vl-7b-base \
  --output_dir onnx_models/qwen2.5-vl-7b_eagle3_base \
  --is_eagle_base

# Quantize draft model
tensorrt-edgellm-quantize-draft \
  --base_model_dir Qwen/Qwen2.5-VL-7B-Instruct \
  --draft_model_dir qwen2.5-vl-7b-eagle3-sgl \
  --quantization fp8 \
  --output_dir quantized/qwen2.5-vl-7b-draft

# Export draft model
tensorrt-edgellm-export-draft \
  --draft_model_dir quantized/qwen2.5-vl-7b-draft \
  --base_model_dir Qwen/Qwen2.5-VL-7B-Instruct \
  --output_dir onnx_models/qwen2.5-vl-7b_eagle3_draft \
  --use_prompt_tuning

# Export visual encoder
tensorrt-edgellm-export-visual \
  --model_dir Qwen/Qwen2.5-VL-7B-Instruct \
  --output_dir onnx_models/qwen2.5-vl-7b/visual_enc_onnx

LoRA-Enabled Export#

# Export base model
tensorrt-edgellm-export-llm \
  --model_dir Qwen/Qwen2.5-0.5B-Instruct \
  --output_dir onnx_models/qwen2.5-0.5b

# Insert LoRA support. This step is independent of any specific LoRA adapter
tensorrt-edgellm-insert-lora \
  --onnx_dir onnx_models/qwen2.5-0.5b

# Download LoRA model(s) that you want to serve. This is just an example.
git clone https://huggingface.co/madhurjindal/Jailbreak-Detector-2-XL
cd Jailbreak-Detector-2-XL
git lfs pull

# Process LoRA weights
tensorrt-edgellm-process-lora \
  --input_dir Jailbreak-Detector-2-XL \
  --output_dir lora_weights
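
To confirm that the processed adapter looks reasonable, you can list the tensors in the resulting SafeTensors file with the safetensors Python library. The exact file name under lora_weights/ is an assumption:

from safetensors import safe_open

with safe_open("lora_weights/adapter.safetensors", framework="pt") as f:  # assumed file name
    for name in f.keys():
        print(name, tuple(f.get_tensor(name).shape))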

Best Practices#

Security#

  1. Verify model integrity before export (see Security and Model Integrity)

  2. Sign models before deployment using model-signing

Model Selection#

  1. Choose appropriate quantization: Match precision to platform capabilities

  2. Validate accuracy: Test quantized models against FP16 baseline

  3. Consider memory constraints: Use INT4/NVFP4 for memory-limited platforms

Quantization Strategy#

  1. Start with FP16: Establish accuracy baseline

  2. Try FP8 for accuracy: Best balance of accuracy and memory reduction on SM89+ hardware

  3. Use INT4 for fast decoding: When prefill length is short and decode performance is critical

  4. Leverage NVFP4 on Thor: Best prefill and decode performance on that platform
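
These heuristics can be summarized in a small, purely illustrative Python helper. The thresholds simply restate the platform requirements and recommendations on this page, and the returned strings are informal labels rather than guaranteed --quantization flag values:

def pick_quantization(sm: int, memory_limited: bool, platform: str = "") -> str:
    """Illustrative only: encodes the strategy above, not an official policy."""
    if platform.lower() == "thor" and sm >= 100:
        return "nvfp4"     # best prefill and decode performance on Thor
    if memory_limited:
        return "int4_awq"  # 4x memory reduction, fast decoding
    if sm >= 89:
        return "fp8"       # best accuracy/memory balance on SM89+
    return "fp16"          # accuracy baseline

print(pick_quantization(sm=90, memory_limited=False))  # -> fp8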

Export Workflow#

  1. Validate model loading: Ensure model loads correctly from HuggingFace

  2. Check tokenizer compatibility: Verify tokenizer exports properly

  3. Test ONNX output: Validate the ONNX model with ONNX Runtime (see the sketch after this list)

  4. Document configurations: Save export parameters for reproducibility
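
For step 3 in the list above, a minimal ONNX Runtime check might look like the following. The file path is an example, and large decoder graphs may expose additional KV-cache inputs:

import onnxruntime as ort

session = ort.InferenceSession(
    "onnx_models/qwen2.5-0.5b/model.onnx",  # example path
    providers=["CPUExecutionProvider"],
)
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)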

Performance Optimization#

  1. Calibration data quality: Use representative data for quantization

  2. Batch export: Export multiple models in parallel when possible

  3. Cache downloads: Reuse downloaded models across exports

  4. Monitor memory usage: Track peak memory during export (see the sketch below)
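
For step 4, if you drive quantization or export from Python, PyTorch's memory statistics give a simple way to record the peak GPU usage of a run (a sketch, assuming a CUDA device):

import torch

torch.cuda.reset_peak_memory_stats()
# ... run the quantization / export step here ...
peak_gib = torch.cuda.max_memory_allocated() / (1024 ** 3)
print(f"Peak GPU memory: {peak_gib:.1f} GiB")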


Common Issues and Solutions#

Issue: GPU Out of Memory During Export or Quantization#

Solution:

  1. Use a larger GPU. Empirically, a 40 GB GPU is sufficient for models up to 4B parameters, and an 80 GB GPU for models up to 8B.

  2. You can try the --device cpu flag during quantization and export; however, CPU support may fail for some precisions.
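
Before retrying, it can also help to confirm how much GPU memory is actually free. A quick Python sketch (assumes a CUDA-enabled PyTorch installation):

import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"Free GPU memory: {free_bytes / 2**30:.1f} GiB of {total_bytes / 2**30:.1f} GiB total")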

Issue: Quantization Degrades Accuracy#

Solution:

  1. Increase the calibration dataset size or use a less aggressive quantization method, for example:

# Use FP8 instead of INT4 for better accuracy
# (fp8 generally preserves accuracy better than int4, nvfp4, or int8_sq)
tensorrt-edgellm-quantize-llm \
  --model_dir model_name \
  --output_dir quantized/model_name \
  --quantization fp8 \
  --calib_size 512   # increase the calibration size from the default

  2. Change the quantization recipe in tensorrt_edgellm/quantization/llm_quantization.py or tensorrt_edgellm/quantization/visual_quantization.py to disable quantization for the most sensitive layers, following the NVIDIA Model Optimizer documentation.


Next Steps#

After exporting your model to ONNX:

  1. Build TensorRT Engine: Use the Engine Builder to compile ONNX to TRT

  2. Deploy with C++ Runtime: Use the C++ Runtime for inference

  3. Run Examples: Try the Examples to validate your export


Additional Resources#