Python Export Pipeline#

Overview#

The TensorRT Edge-LLM export pipeline is a set of Python tools that converts HuggingFace models into optimized ONNX models ready for TensorRT engine compilation. It handles model quantization and ONNX export, and supports specialized features such as LoRA adaptation and multimodal processing.

Purpose#

The export pipeline serves as the first stage in the TensorRT Edge-LLM workflow:

        %%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%
graph LR
    HF_MODEL[HuggingFace<br>Model]
    ONNX_FILES[ONNX<br>Models]
    ENGINE_BUILDER[Engine<br>Builder]
    TRT_ENGINE[TensorRT<br>Engine]
    CPP_RUNTIME[C++<br>Runtime]
    OUTPUT[Inference<br>Results]

    subgraph EXPORT_SG [" "]
        PYTHON_EXPORT[Python<br>Export<br>Pipeline]
    end

    HF_MODEL --> PYTHON_EXPORT
    PYTHON_EXPORT --> ONNX_FILES
    ONNX_FILES --> ENGINE_BUILDER
    ENGINE_BUILDER --> TRT_ENGINE
    TRT_ENGINE --> CPP_RUNTIME
    CPP_RUNTIME --> OUTPUT

    classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
    classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
    classDef nvLightNode fill:#b8d67e,stroke:#76B900,stroke-width:1px,color:#333
    classDef itemNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
    classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
    classDef greenSubGraph fill:none,stroke:#76B900,stroke-width:1.5px

    class HF_MODEL inputNode
    class PYTHON_EXPORT nvNode
    class ENGINE_BUILDER,CPP_RUNTIME nvLightNode
    class ONNX_FILES,TRT_ENGINE itemNode
    class OUTPUT darkNode
    class EXPORT_SG greenSubGraph
    
        %%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%

graph LR
    HF_MODEL[HuggingFace<br>Model]
    QUANTIZATION(Model<br>Quantization)
    ONNX_EXPORT(ONNX<br>Export)
    GRAPH_SURGERY(Graph<br>Surgery)
    ONNX_OUTPUT[Optimized<br>ONNX Model]

    subgraph EXPORT_TOOLS ["Python Export Pipeline"]
        QUANTIZATION
        ONNX_EXPORT
        GRAPH_SURGERY
    end

    HF_MODEL --> QUANTIZATION
    QUANTIZATION --> ONNX_EXPORT
    ONNX_EXPORT --> GRAPH_SURGERY
    GRAPH_SURGERY --> ONNX_OUTPUT

    classDef greyNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
    classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
    classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
    classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
    classDef lightSubGraph fill:none,stroke:#aaa,stroke-width:1.5px

    class HF_MODEL inputNode
    class ONNX_OUTPUT darkNode
    class QUANTIZATION,ONNX_EXPORT,GRAPH_SURGERY nvNode
    class EXPORT_TOOLS lightSubGraph
    

Pipeline Stages#

  1. Model Loading: Load HuggingFace model and tokenizer

  2. Quantization (Optional): Apply precision reduction techniques

  3. ONNX Export: Convert PyTorch model to ONNX format

  4. Graph Surgery: Optimize ONNX graph for TensorRT

  5. Configuration Generation: Create build configuration files

Export Tools#

TensorRT Edge-LLM provides specialized command-line tools to support quantization and export to ONNX format:

        %%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%

graph LR
    subgraph INPUTS [" "]
        VLM_MODEL[VLM<BR>Model]
        DRAFT_MODEL[EAGLE<BR>Draft Model]
        BASE_MODEL[Base<BR>Model]
        LORA_WEIGHTS[LoRA<BR>Weights]
    end

    subgraph QUANT [Optional Quantization]
        QUANTIZE_VISUAL(Quantization via<BR>export-visual)
        QUANTIZE_DRAFT(quantize-draft)
        QUANTIZE_LLM(quantize-llm)
    end

    subgraph EXPORT [Export & Processing]
        EXPORT_VISUAL(export-visual)
        EXPORT_DRAFT(export-draft)
        EXPORT_LLM(export-llm)
        INSERT_LORA(insert-lora)
        PROCESS_LORA(process-lora)
    end

    subgraph RESULTS [" "]
        VISUAL_ONNX[Visual ONNX]
        DRAFT_ONNX[Draft ONNX]
        LLM_ONNX[Base ONNX]
        LORA_ONNX[LoRA-Enabled<br>ONNX]
        SAFETENSORS[SafeTensors]
    end

    VLM_MODEL --> QUANTIZE_VISUAL
    VLM_MODEL --> QUANTIZE_LLM
    QUANTIZE_VISUAL --> EXPORT_VISUAL
    EXPORT_VISUAL --> VISUAL_ONNX

    DRAFT_MODEL --> QUANTIZE_DRAFT
    QUANTIZE_DRAFT --> EXPORT_DRAFT
    EXPORT_DRAFT --> DRAFT_ONNX

    BASE_MODEL --> QUANTIZE_LLM
    QUANTIZE_LLM --> EXPORT_LLM
    EXPORT_LLM --> LLM_ONNX
    EXPORT_LLM -->|If LoRA| INSERT_LORA
    INSERT_LORA --> LORA_ONNX

    LORA_WEIGHTS --> PROCESS_LORA
    PROCESS_LORA --> SAFETENSORS

    classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
    classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
    classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
    classDef lightSubGraph fill:none,stroke:#aaa,stroke-width:1.5px
    classDef greyNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
    classDef contextBox fill:none,stroke:transparent

    class VLM_MODEL,DRAFT_MODEL,BASE_MODEL,LORA_WEIGHTS inputNode
    class VISUAL_ONNX,DRAFT_ONNX,LLM_ONNX,LORA_ONNX,SAFETENSORS darkNode
    class EXPORT_LLM,EXPORT_VISUAL,EXPORT_DRAFT,INSERT_LORA,PROCESS_LORA,QUANTIZE_DRAFT,QUANTIZE_LLM nvNode
    class QUANTIZE_VISUAL greyNode
    class QUANT,EXPORT lightSubGraph
    class INPUTS,RESULTS contextBox
    

Note: Vision Language Models (VLMs) require both export-llm (for the language model component) and export-visual (for the vision encoder) to be fully exported. See the Multimodal VLM Export example below.

Tool Overview#

The tensorrt-edgellm package provides seven specialized command-line tools for different export scenarios:

| Tool | Inputs | Outputs | Description |
| --- | --- | --- | --- |
| quantize-llm | HuggingFace Model | Quantized Model | Quantize LLM models using NVIDIA ModelOpt. Supports FP8, INT4 AWQ, and NVFP4 quantization methods for memory reduction and performance optimization. |
| export-llm | HuggingFace/Quantized Model | ONNX Model | Export LLM models to ONNX format. Handles standard LLMs and EAGLE base models with precision-specific optimizations and graph surgery. |
| export-visual | VLM Model | Visual ONNX | Export visual encoders for multimodal models. Supports vision components with FP8 quantization and dynamic resolution. |
| export-draft | Base + Draft Models | Draft ONNX | Export EAGLE draft models for speculative decoding. Specialized export for EAGLE3 draft model architectures with vocabulary mapping support. |
| quantize-draft | Base + Draft Models | Quantized Draft | Quantize EAGLE draft models. Specialized quantization for draft models using base model inputs for calibration. |
| insert-lora | ONNX Model | LoRA-enabled ONNX | Insert LoRA patterns into existing ONNX models. Adds dynamic LoRA support to ONNX models by modifying the computational graph. |
| process-lora | LoRA Weights | SafeTensors | Process LoRA weights for runtime use. Processes LoRA adapter weights according to TensorRT Edge-LLM specifications for runtime loading. |
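
Before starting an export, it can help to confirm that the seven tools are actually installed. The sketch below is a minimal check, assuming each tool is installed under the `tensorrt-edgellm-` prefix used in the usage examples of this guide. The helper name `missing_tools` is hypothetical and not part of the package.

```python
import shutil

# Assumption: every tool is installed as "tensorrt-edgellm-<name>",
# matching the commands shown in this guide's usage examples.
TOOL_PREFIX = "tensorrt-edgellm-"
TOOLS = [
    "quantize-llm",
    "export-llm",
    "export-visual",
    "export-draft",
    "quantize-draft",
    "insert-lora",
    "process-lora",
]


def missing_tools(which=shutil.which):
    """Return the full command names that cannot be found on PATH."""
    return [TOOL_PREFIX + name for name in TOOLS if which(TOOL_PREFIX + name) is None]


missing = missing_tools()
print("all tools found" if not missing else f"missing: {missing}")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.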


Quantization Methods#

The export pipeline supports multiple quantization methods optimized for different hardware platforms and performance requirements.

For complete details on quantization methods, precision requirements, platform compatibility, and memory reduction, see the Precision Support section in the Supported Models guide.

Note: INT4 GPTQ models can be loaded directly from HuggingFace Hub or quantized using GPTQModel. No additional quantization step with tensorrt-edgellm-quantize-llm is required for pre-quantized GPTQ checkpoints.


Usage Examples#

Standard LLM Export#

# Step 1: Quantize model (optional)
tensorrt-edgellm-quantize-llm \
  --model_dir Qwen/Qwen2.5-0.5B-Instruct \
  --quantization fp8 \
  --output_dir quantized/qwen2.5-0.5b-fp8

# Step 2: Export to ONNX
tensorrt-edgellm-export-llm \
  --model_dir quantized/qwen2.5-0.5b-fp8 \
  --output_dir onnx_models/qwen2.5-0.5b
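
The two steps above can also be driven from a Python script, which makes it easier to skip the optional quantization step. This is a sketch under stated assumptions: it uses only the flags shown above, and the helper names `build_commands` and `run_all` are hypothetical.

```python
import shlex
import subprocess


def build_commands(model, quantized_dir, onnx_dir, quantization=None):
    """Return argv lists for the optional quantize step and the export step."""
    commands = []
    export_source = model
    if quantization is not None:
        commands.append([
            "tensorrt-edgellm-quantize-llm",
            "--model_dir", model,
            "--quantization", quantization,
            "--output_dir", quantized_dir,
        ])
        export_source = quantized_dir  # export reads the quantized checkpoint
    commands.append([
        "tensorrt-edgellm-export-llm",
        "--model_dir", export_source,
        "--output_dir", onnx_dir,
    ])
    return commands


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing step


cmds = build_commands(
    "Qwen/Qwen2.5-0.5B-Instruct",
    "quantized/qwen2.5-0.5b-fp8",
    "onnx_models/qwen2.5-0.5b",
    quantization="fp8",
)
run_all(cmds)  # dry run: prints the two commands without executing them
```

Calling `run_all(cmds, dry_run=False)` would execute the steps in order. Passing `quantization=None` to `build_commands` yields only the export command, reading directly from the HuggingFace model.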

Multimodal VLM Export#

# Export the LLM component (same command as for a standard LLM)
tensorrt-edgellm-export-llm \
  --model_dir Qwen/Qwen2.5-VL-3B-Instruct \
  --output_dir onnx_models/qwen2.5-vl-3b

# Export visual encoder
tensorrt-edgellm-export-visual \
  --model_dir Qwen/Qwen2.5-VL-3B-Instruct \
  --output_dir onnx_models/qwen2.5-vl-3b/visual_enc_onnx

EAGLE3 Speculative Decoding Export#

# Download a draft model from HuggingFace or prepare your own. Install Git LFS first (https://git-lfs.com).
git clone https://huggingface.co/Rayzl/qwen2.5-vl-7b-eagle3-sgl
cd qwen2.5-vl-7b-eagle3-sgl
git lfs pull
cd ..  # Return to the working directory; the steps below use relative paths

# Quantize base model
tensorrt-edgellm-quantize-llm \
  --model_dir Qwen/Qwen2.5-VL-7B-Instruct \
  --quantization fp8 \
  --output_dir quantized/qwen2.5-vl-7b-base

# Export base model
tensorrt-edgellm-export-llm \
  --model_dir quantized/qwen2.5-vl-7b-base \
  --output_dir onnx_models/qwen2.5-vl-7b_eagle3_base \
  --is_eagle_base

# Quantize draft model
tensorrt-edgellm-quantize-draft \
  --base_model_dir Qwen/Qwen2.5-VL-7B-Instruct \
  --draft_model_dir qwen2.5-vl-7b-eagle3-sgl \
  --quantization fp8 \
  --output_dir quantized/qwen2.5-vl-7b-draft

# Export draft model
tensorrt-edgellm-export-draft \
  --draft_model_dir quantized/qwen2.5-vl-7b-draft \
  --base_model_dir Qwen/Qwen2.5-VL-7B-Instruct \
  --output_dir onnx_models/qwen2.5-vl-7b_eagle3_draft

# Export visual encoder
tensorrt-edgellm-export-visual \
  --model_dir Qwen/Qwen2.5-VL-7B-Instruct \
  --output_dir onnx_models/qwen2.5-vl-7b/visual_enc_onnx

Where to Get Draft Models for EAGLE3#

Open-Source Draft Models:

Draft models for EAGLE speculative decoding can be found on HuggingFace:

  • EAGLE-3 Models on HuggingFace - Official list of available EAGLE-3 draft models for various base models

  • Search HuggingFace for your specific base model name + “EAGLE”

Training Your Own Draft Models:

If no pre-trained draft model exists for your base model, you’ll need to train one yourself. Refer to the EAGLE training repository for instructions on training draft models.

Important: Draft models must be trained specifically for their corresponding base model. A draft model trained for Qwen2.5-7B will only work with that exact base model and cannot be used with other models.

LoRA-Enabled Export#

# Export base model
tensorrt-edgellm-export-llm \
  --model_dir Qwen/Qwen2.5-0.5B-Instruct \
  --output_dir onnx_models/qwen2.5-0.5b

# Insert LoRA support. This step does not depend on any specific LoRA adapter
tensorrt-edgellm-insert-lora \
  --onnx_dir onnx_models/qwen2.5-0.5b

# Download the LoRA model(s) that you want to serve. This is just an example.
git clone https://huggingface.co/madhurjindal/Jailbreak-Detector-2-XL
cd Jailbreak-Detector-2-XL
git lfs pull
cd ..  # Return to the working directory; the next step uses relative paths

# Process LoRA weights
tensorrt-edgellm-process-lora \
  --input_dir Jailbreak-Detector-2-XL \
  --output_dir lora_weights

Best Practices#

Model Selection#

  1. Choose appropriate quantization: Match precision to platform capabilities

  2. Validate accuracy: Test quantized models against FP16 baseline

  3. Consider memory constraints: Use INT4/NVFP4 for memory-limited platforms
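
To get a feel for how precision affects memory, a rough weight-only estimate is parameters times bits per weight divided by eight. The sketch below uses nominal bit widths and is a lower bound: it ignores quantization scales, layers kept in higher precision, activations, and the KV cache. The function names are hypothetical.

```python
# Nominal storage width per weight, in bits. Assumption: scale factors,
# zero points, and unquantized layers are ignored, so real checkpoints
# are somewhat larger than these estimates.
BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "int4": 4, "nvfp4": 4}


def weight_bytes(num_params, precision):
    """Lower-bound estimate of weight storage in bytes."""
    return num_params * BITS_PER_WEIGHT[precision] / 8


def weight_gib(num_params, precision):
    """Same estimate expressed in GiB."""
    return weight_bytes(num_params, precision) / 2**30


for precision in BITS_PER_WEIGHT:
    print(f"{precision:>5}: {weight_gib(7e9, precision):.1f} GiB for 7e9 parameters")
```

Under these assumptions, the 4-bit formats need about a quarter of the FP16 weight storage, which is why they are the usual choice on memory-limited platforms.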

Quantization Strategy#

  1. Start with FP16: Establish accuracy baseline

  2. Try FP8 for accuracy: Best balance of accuracy and memory reduction on SM89+ hardware

  3. Use INT4 for fast decoding: When prefill length is short and decode performance is critical

  4. Leverage NVFP4 on Thor: Optimal prefill and decode performance on Thor

Export Workflow#

  1. Validate model loading: Ensure model loads correctly from HuggingFace

  2. Check tokenizer compatibility: Verify tokenizer exports properly

  3. Test ONNX output: Validate ONNX model with ONNX Runtime

  4. Document configurations: Save export parameters for reproducibility
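
One lightweight way to document configurations is to write a small JSON record next to the exported model. The sketch below is only an illustration: the record layout and the file name `export_record.json` are assumptions, not something the export tools produce or read.

```python
import json
import sys
from datetime import datetime, timezone
from pathlib import Path


def save_export_record(output_dir, tool, args, filename="export_record.json"):
    """Write the tool name and its arguments to a JSON file in output_dir."""
    record = {
        "tool": tool,
        "args": args,
        "python": sys.version.split()[0],  # interpreter version used for the export
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / filename
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return path
```

For example, `save_export_record("onnx_models/qwen2.5-0.5b", "tensorrt-edgellm-export-llm", {"model_dir": "quantized/qwen2.5-0.5b-fp8"})` would store the parameters used in the Standard LLM Export example.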

Performance Optimization#

  1. Calibration data quality: Use representative data for quantization

  2. Batch export: Export multiple models in parallel when possible

  3. Cache downloads: Reuse downloaded models across exports

  4. Monitor memory usage: Track peak memory during export
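
Batch export can be sketched with a thread pool that launches independent commands side by side. This assumes the commands do not depend on each other's outputs and that the machine has enough GPU and host memory to run them at once; the helper names are hypothetical, and the example commands reuse the flags from the Multimodal VLM Export example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_command(argv):
    """Run one command and return its exit code."""
    return subprocess.run(argv).returncode


def run_parallel(commands, max_workers=2, runner=run_command):
    """Run independent commands concurrently; return exit codes in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))


# Example input (not executed here): the two VLM exports read the same
# HuggingFace model and write to different directories.
vlm_commands = [
    ["tensorrt-edgellm-export-llm",
     "--model_dir", "Qwen/Qwen2.5-VL-3B-Instruct",
     "--output_dir", "onnx_models/qwen2.5-vl-3b"],
    ["tensorrt-edgellm-export-visual",
     "--model_dir", "Qwen/Qwen2.5-VL-3B-Instruct",
     "--output_dir", "onnx_models/qwen2.5-vl-3b/visual_enc_onnx"],
]
```

Calling `run_parallel(vlm_commands)` would start both exports; a non-zero entry in the returned list marks a failed command. Keep `max_workers` low, since each export holds its own copy of the model in memory.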

Model Signing and Verification#

  1. Verify model integrity: Users are responsible for verifying the integrity of model artifacts (base models, LoRA weights, tokenizers, configs) before deployment

  2. Sign models: It is strongly recommended to use the model-signing package to sign and verify models before inference.

Installation:

pip install model-signing

Basic Usage:

# Sign a model
model_signing sign /path/to/your/model --signature model.sig

# Verify a model
model_signing verify /path/to/your/model \
  --signature model.sig \
  --identity "$identity" \
  --identity_provider "$oidc_provider"

For more details, refer to the model-signing documentation.
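
As a complement to signing, a plain checksum manifest can detect accidental corruption or modification of model artifacts between export and deployment. The sketch below uses only the Python standard library and is not a substitute for model-signing, because a checksum alone does not prove who produced the files. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(model_dir):
    """Map each file's relative path to its SHA-256 digest."""
    root = Path(model_dir)
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(model_dir, expected):
    """Return paths that were added, removed, or modified relative to expected."""
    current = build_manifest(model_dir)
    return sorted(
        name
        for name in set(current) | set(expected)
        if current.get(name) != expected.get(name)
    )
```

A typical flow is to call `build_manifest` right after export, store the result alongside the signature, and call `changed_files` on the target device before loading the model.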


Common Issues and Solutions#

Issue: Model Download Fails or Times Out#

Cause: Network issues, insufficient disk space, or HuggingFace access problems.

Solution:

  1. Check disk space:

df -h $WORKSPACE_DIR
# Ensure at least 10-20GB free for small models

  2. Check network connectivity:

curl -I https://huggingface.co
# Should return HTTP 200 OK

  3. For gated models (Llama, Phi-4), log in to HuggingFace:

huggingface-cli login
# Enter your access token

  4. Manual download as a workaround:

git lfs install
git clone https://huggingface.co/Qwen/Qwen3-0.6B

# Then use local path for quantization
tensorrt-edgellm-quantize-llm \
    --model_dir ./Qwen3-0.6B \
    --output_dir quantized/Qwen3-0.6B \
    --quantization fp8
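
The disk space check from step 1 can also be scripted as a preflight test before a long download. This is a minimal sketch using the standard library; the 20 GiB default mirrors the upper end of the guidance above and is an assumption that should be raised for larger models.

```python
import shutil


def free_gib(path="."):
    """Free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path=".", required_gib=20.0):
    """True when at least required_gib GiB are free at path."""
    return free_gib(path) >= required_gib


if not has_room("."):
    print(f"only {free_gib('.'):.1f} GiB free; free up space before downloading")
```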

Issue: GPU Out of Memory During Export or Quantization#

Cause: Model size exceeds available GPU memory.

Solution:

  1. Switch to a GPU with more memory. Empirically, a 40GB GPU is sufficient for models with 4B parameters or fewer, and an 80GB GPU is sufficient for models with 8B parameters or fewer.

  2. Try the --device cpu flag during quantization and export. Note that CPU execution may fail for some precisions.

Issue: Calibration Dataset Download Fails (cnn_dailymail not found)#

Cause: Network connectivity issues preventing download of the calibration dataset from HuggingFace, or firewall/proxy blocking access.

Solution:

  1. Check network connectivity to HuggingFace:

curl -I https://huggingface.co
curl -I https://huggingface.co/datasets/abisee/cnn_dailymail
# Should return HTTP 200 OK

  2. If the network is down or blocked, download the dataset manually:

git lfs install
git clone https://huggingface.co/datasets/abisee/cnn_dailymail

# Pass the local dataset path explicitly to quantization
tensorrt-edgellm-quantize-llm \
    --model_dir Qwen/Qwen3-0.6B \
    --output_dir quantized/Qwen3-0.6B \
    --quantization fp8 \
    --calib_dataset ./cnn_dailymail/3.0.0

Note: Replace ./cnn_dailymail/3.0.0 with the actual path where you downloaded the dataset. The dataset version 3.0.0 is commonly used for calibration. Passing --calib_dataset lets you use a local copy of the dataset instead of relying on the HuggingFace cache.

Issue: Quantization Degrades Accuracy#

Cause: Aggressive quantization or insufficient calibration.

Solution:

  1. Use less aggressive quantization

# Use FP8 instead of INT4 for better accuracy
tensorrt-edgellm-quantize-llm \
  --model_dir model_name \
  --output_dir quantized/model_name \
  --quantization fp8  # Better accuracy than int4, nvfp4, or int8_sq

  2. Change the quantization recipe in tensorrt_edgellm/quantization/llm_quantization.py or tensorrt_edgellm/quantization/visual_quantization.py to disable quantization for the most sensitive layers. See the NVIDIA Model Optimizer documentation for details.

  3. Increase the calibration size (the num_samples field) in tensorrt_edgellm/quantization/llm_quantization.py.
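
To judge whether one of these changes helped, it is useful to have a simple number to compare before and after. One possible metric, shown here as a sketch, is top-1 agreement: the fraction of positions at which the quantized model predicts the same next token as the FP16 baseline for the same input. How the token IDs are obtained is outside the scope of this example, and the function name is hypothetical.

```python
def top1_agreement(baseline_ids, quantized_ids):
    """Fraction of positions where both models predict the same token ID.

    Both sequences must hold per-position predictions for the same input,
    so that a mismatch at one position does not shift later positions.
    """
    if len(baseline_ids) != len(quantized_ids):
        raise ValueError("sequences must have the same length")
    if not baseline_ids:
        raise ValueError("sequences must not be empty")
    matches = sum(b == q for b, q in zip(baseline_ids, quantized_ids))
    return matches / len(baseline_ids)


# Toy example: three of four predictions match.
print(top1_agreement([11, 42, 7, 99], [11, 42, 8, 99]))  # 0.75
```

Agreement on its own does not replace task-level evaluation, but a drop after changing the precision or recipe is a quick signal that accuracy should be checked more carefully.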