# Export ONNX for EdgeLLM

AutoDeploy provides a mode to export PyTorch/HuggingFace models to ONNX format specifically designed for EdgeLLM deployment. This mode performs graph transformations to fuse RoPE (Rotary Position Embedding) and attention operations into a single `AttentionPlugin` operation, then exports the optimized graph to ONNX.

## Overview

The `export_edgellm_onnx` mode differs from the standard AutoDeploy workflow in two key ways:

1. **Operation Fusion**: Fuses `torch_rope_with_explicit_cos_sin` and `torch_cached_attention_with_cache` into a single `AttentionPlugin` operation.
2. **ONNX Export**: Outputs an ONNX model file instead of a TensorRT engine.
## Quick Start

Use the `onnx_export_llm.py` script to export a model:

```bash
cd examples/auto_deploy
python onnx_export_llm.py --model "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
```
This will export the model to ONNX format in the current directory.
## Command Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| `--model` | str | Required | HuggingFace model name or path to a local checkpoint |
| `--device` | str | `cuda` | Device to use for export (e.g., `cuda` or `cpu`) |
| `--output_dir` | str | Current directory | Directory to save the exported ONNX model |
## Examples

### Basic Export

Export a model with default settings:

```bash
python onnx_export_llm.py --model "Qwen/Qwen2.5-0.5B-Instruct"
```
### Custom Output Location

Export to a specific output directory:

```bash
python onnx_export_llm.py \
    --model "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" \
    --output_dir "./exported_models"
```
## Output Files

The export process generates the following files in the output directory:

| File | Description |
|---|---|
| ONNX model file | The exported ONNX model with fused attention operations |
| `config.json` | Model configuration (architecture, hidden size, etc.) |
| `tokenizer.json` | Tokenizer vocabulary and configuration |
| `tokenizer_config.json` | Tokenizer settings |
| `special_tokens_map.json` | Special token mappings |
| Chat template file | Processed chat template for inference |
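As a quick sanity check, the tokenizer artifacts listed above can typically be loaded straight from the output directory with Hugging Face `transformers`. A minimal sketch, assuming the `./exported_models` directory from the earlier example:

```python
from transformers import AutoTokenizer

# Load the tokenizer files written by the export
# (path matches the "Custom Output Location" example above).
tokenizer = AutoTokenizer.from_pretrained("./exported_models")

# Tokenize a prompt the same way the target runtime would need to.
input_ids = tokenizer("Hello, EdgeLLM!")["input_ids"]
print(input_ids)
```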
## Programmatic Usage

You can also use the ONNX export functionality programmatically:

```python
from tensorrt_llm._torch.auto_deploy import LLM, AutoDeployConfig

# Create AutoDeploy config with export_edgellm_onnx mode
ad_config = AutoDeployConfig(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    mode="export_edgellm_onnx",
    max_batch_size=8,
    max_seq_len=512,
    device="cpu",
)

# Configure attention backend
ad_config.attn_backend = "torch"

# Optionally customize output location
ad_config.transforms["export_to_onnx"]["output_dir"] = "./my_output"

# Run the export
LLM(**ad_config.to_llm_kwargs())
```
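Constructing the `LLM` object triggers AutoDeploy's transform pipeline; with `mode="export_edgellm_onnx"`, the pipeline ends with the ONNX export step, so the files described in Output Files are written to the configured `output_dir` instead of a TensorRT engine being built.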
## Notes

- **Device Selection**: Using `cpu` for the `--device` option is recommended to reduce GPU memory footprint during export.
- **Custom Operations**: The exported ONNX model contains custom operations (e.g., `AttentionPlugin`) in the `trt` domain that require corresponding implementations in the target inference runtime.
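To check which custom operations a particular export relies on, the model can be inspected with the `onnx` Python package. A minimal sketch; the `model.onnx` filename is a placeholder for the actual file produced by the export:

```python
import onnx
from collections import Counter

# Placeholder path: point this at the ONNX file produced by the export.
model = onnx.load("./exported_models/model.onnx")

# Nodes outside the default ONNX domains (e.g. the "trt" domain) are the
# custom operations that the target inference runtime must implement.
custom_ops = Counter(
    (node.domain, node.op_type)
    for node in model.graph.node
    if node.domain not in ("", "ai.onnx")
)
for (domain, op_type), count in custom_ops.items():
    print(f"{domain}::{op_type}: {count} node(s)")
```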