Supported Models
Code Location: tensorrt_edgellm/ (checkpoint export), tensorrt_edgellm/quantization/ (checkpoint quantization), experimental/server/ (Python API/server), cpp/ (runtime)
Pre-Quantized Checkpoints: When a supported pre-quantized checkpoint is available, the checkpoint exporter can export it directly without a separate quantization step.
Support Policy
TensorRT Edge-LLM supports the checkpoint IDs listed below. Dense LLM families include official dense checkpoints below 30B parameters. Larger dense checkpoints and non-dense variants require case-by-case validation. MoE, multimodal, audio, TTS, omni, and EAGLE support is limited to the listed rows.
The model coverage list is not comprehensive, and not every listed checkpoint has been fully verified on every supported platform and precision. If a listed model does not export, build, or run correctly, please report an issue with the checkpoint ID, precision, platform, and command line used.
The model class names were checked against the installed transformers==5.9.0 package and the upstream Transformers model source tree. Checkpoint IDs are linked to their Hugging Face pages and grouped into original checkpoints and quantized checkpoints.
Precision Notes
Dense precision set: FP16/BF16 checkpoints, ModelOpt FP8/MXFP8/FP4/NVFP4/INT4 AWQ/INT8 SmoothQuant checkpoints, and INT4 GPTQ checkpoints. INT8 GPTQ is not supported.
Jetson Orin supports FP16, INT8, and INT4 runtime precision in the supported JetPack configurations. Do not select FP8, MXFP8, FP4, or NVFP4 checkpoints for Orin.
For INT4 engine builds on Jetson Orin devices with less system memory, such as Jetson Orin Nano, pass --externalize-weights int4_ffn for dense checkpoints or --externalize-weights int4_ffn int4_moe for MoE checkpoints to reduce engine build memory.
For FP16/BF16 source checkpoints, use the Quantization script to create a unified quantized checkpoint for tensorrt_edgellm, then export the generated checkpoint.
FP8 KV cache is detected automatically from checkpoint metadata by tensorrt_edgellm.
tensorrt-edgellm-export exports visual encoders. Use tensorrt-edgellm-quantize --visual_quantization fp8 before export when FP8 visual weights are required.
MXFP8 and FP4/NVFP4 require Blackwell-class hardware for runtime execution.
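The platform rules in these notes can be written down as a small lookup. The following is a minimal sketch and not part of tensorrt_edgellm: the function names and the boolean platform flags are hypothetical, and only the precision names and the --externalize-weights values are taken from the notes.

```python
from typing import Optional

# Precision names come from the notes above; everything else here is a
# hypothetical helper for illustration, not a tensorrt_edgellm API.
ORIN_EXCLUDED_PRECISIONS = {"FP8", "MXFP8", "FP4", "NVFP4"}
BLACKWELL_ONLY_PRECISIONS = {"MXFP8", "FP4", "NVFP4"}


def precision_problem(
    precision: str, *, is_orin: bool, is_blackwell_class: bool
) -> Optional[str]:
    """Return a reason string if the notes rule out this choice, else None.

    None only means "not ruled out by these notes", not "verified to work".
    """
    name = precision.upper()
    if is_orin and name in ORIN_EXCLUDED_PRECISIONS:
        return f"{name} checkpoints should not be selected for Jetson Orin"
    if name in BLACKWELL_ONLY_PRECISIONS and not is_blackwell_class:
        return f"{name} requires Blackwell-class hardware at runtime"
    return None


def externalize_weights_args(is_moe: bool) -> list:
    """Build the --externalize-weights arguments for a low-memory INT4 build."""
    targets = ["int4_ffn", "int4_moe"] if is_moe else ["int4_ffn"]
    return ["--externalize-weights", *targets]
```

Under these assumptions, precision_problem("nvfp4", is_orin=True, is_blackwell_class=False) returns a reason string, while INT4 on Orin returns None. The argument list from externalize_weights_args would be appended to whichever engine build command is in use.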
Support Matrix
Dense LLM
| Model Series | Transformers Class | | Supported Precisions |
|---|---|---|---|
| Llama 3.x Instruct | | | Dense precision set |
| Qwen2/Qwen2.5 dense | | | Dense precision set |
| Qwen3 dense | | | Dense precision set |
| Qwen3.5/3.6 text | | | Dense precision set |
| Nemotron Nano dense | | | BF16, FP8, NVFP4 |
Llama 3.x Instruct checkpoints
Original:
Quantized:
Qwen2/Qwen2.5 dense and Qwen-derived dense checkpoints
Original:
Quantized:
Qwen3 dense checkpoints
Original:
Quantized:
Qwen3.5/3.6 text checkpoints
Qwen3.5:
Qwen3.6 (same architecture as Qwen3.5):
Nemotron Nano dense checkpoints
Original:
Quantized:
MoE
| Model Series | Transformers Class | | Supported Precisions |
|---|---|---|---|
| Qwen3-MoE | | | INT4, NVFP4 |
| Qwen3.5/3.6-MoE | | | INT4 GPTQ, NVFP4 |
| Nemotron3-MoE | | | NVFP4 only |
For NVFP4 MoE exports, the --nvfp4-moe-backend flag selects the plugin backend:
thor: uses Nvfp4MoePlugin (SM100/101/110, CuTe DSL kernels). This is the default when the checkpoint config does not specify a backend.
geforce: uses NvFP4MoEPluginGeforce (SM120/121).
tensorrt-edgellm-export \
/path/to/Qwen3-MoE-NVFP4 \
/tmp/qwen3_moe_onnx \
--nvfp4-moe-backend thor
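When the export is scripted, the backend choice can be derived from the target GPU's SM version. The sketch below assumes the SM-to-backend mapping listed above and reuses the command shape from the example. The helper names and the integer SM encoding are hypothetical and are not part of the tensorrt-edgellm-export CLI.

```python
import shlex

# SM versions per backend, as listed for --nvfp4-moe-backend above.
# Representing SM versions as integers is an assumption of this sketch.
BACKEND_BY_SM = {
    100: "thor",
    101: "thor",
    110: "thor",
    120: "geforce",
    121: "geforce",
}


def nvfp4_moe_backend(sm: int) -> str:
    """Return the backend name for an SM version, or raise ValueError."""
    try:
        return BACKEND_BY_SM[sm]
    except KeyError:
        raise ValueError(f"no NVFP4 MoE backend listed for SM{sm}") from None


def export_command(checkpoint_dir: str, onnx_dir: str, sm: int) -> list:
    """Build the export command as an argument list (no shell quoting needed)."""
    return [
        "tensorrt-edgellm-export",
        checkpoint_dir,
        onnx_dir,
        "--nvfp4-moe-backend",
        nvfp4_moe_backend(sm),
    ]


if __name__ == "__main__":
    args = export_command("/path/to/Qwen3-MoE-NVFP4", "/tmp/qwen3_moe_onnx", 110)
    print(shlex.join(args))
```

Returning an argument list rather than a string keeps the command safe to pass to subprocess.run without a shell. Raising on an unlisted SM version avoids silently falling back to a backend the target does not support.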
Qwen3-MoE checkpoints
Qwen3.5/3.6-MoE checkpoints
Nemotron3-MoE checkpoints
VLM
| Model Series | Transformers Class | | Supported Precisions |
|---|---|---|---|
| Qwen2.5-VL | | | Dense precision set for LLM backbone |
| Qwen3-VL / compatible | | | Dense precision set for LLM backbone |
| Qwen3.5/3.6 VLM | | | VLM original checkpoints only |
| InternVL3 / InternVL3.5 HF format | | | Dense precision set for LLM backbone |
| Phi-4-Multimodal | | | Merge vision LoRA, then dense precision set for the LLM backbone |
Qwen2.5-VL checkpoints
Original:
Quantized:
Qwen3-VL / compatible checkpoints
Original:
Quantized:
Qwen3.5/3.6 VLM — same checkpoints as Qwen3.5/3.6 text
Qwen3.5 and Qwen3.6 checkpoints are unified text+VLM models. The same checkpoints listed under Qwen3.5/3.6 text are used; tensorrt_edgellm selects the VLM path (qwen3_5 handler) when visual inputs are provided.
InternVL3 / InternVL3.5 HF format checkpoints
Original:
Quantized:
Phi-4-Multimodal checkpoints
VLA
| Model Series | Transformers Class | | Supported Precisions |
|---|---|---|---|
| Alpamayo R1 | Checkpoint architecture | | FP16 |
Alpamayo R1 checkpoints
Audio / Speech
| Model Series | Transformers Class | | Supported Precisions |
|---|---|---|---|
| Qwen3-ASR | Checkpoint architecture | | FP16 |
Qwen3-ASR checkpoints
TTS
| Model Series | Transformers Class | | Supported Precisions |
|---|---|---|---|
| Qwen3-TTS | Checkpoint architecture | | FP16 |
Qwen3-TTS checkpoints
Omni
| Model Series | Transformers Class | | Supported Precisions |
|---|---|---|---|
| | Checkpoint architecture | | NVFP4 only |
Nemotron-Omni checkpoints
Qwen3-ASR and Qwen3-TTS use checkpoint architecture names that are not present in the installed transformers==5.3.0 package, so TensorRT Edge-LLM handles their speech/audio/talker/Code2Wav components with local model implementations. Qwen3-TTS support is limited to the CustomVoice checkpoints listed above.
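One way to see whether a checkpoint's architecture names are defined by a given package is to compare the architectures list in config.json against that package's attributes. This is a minimal sketch under stated assumptions, not how TensorRT Edge-LLM performs the check: the function name is hypothetical, and the module to inspect is passed in as a parameter so the helper does not itself depend on Transformers.

```python
import json
from pathlib import Path


def missing_architectures(config_path, namespace) -> list:
    """Return names from config.json's "architectures" list that
    `namespace` (for example an imported module) does not define."""
    config = json.loads(Path(config_path).read_text())
    names = config.get("architectures") or []
    return [name for name in names if not hasattr(namespace, name)]
```

A typical call would be missing_architectures("/path/to/checkpoint/config.json", transformers) after import transformers. A non-empty result means the listed classes would need a local implementation. Note that Transformers imports model classes lazily, so a lookup can also fail for unrelated reasons such as a missing optional dependency.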
EAGLE3 Draft Models
EAGLE3 draft checkpoints are detected by draft_vocab_size in config.json and exported with Eagle3DraftModel. Draft checkpoints can be quantized with tensorrt-edgellm-quantize using the same ModelOpt methods exposed by the draft quantization CLI: fp8, int4_awq, nvfp4, mxfp8, and int8_sq for the backbone; fp8, int4_awq, nvfp4, and mxfp8 for the LM head; and fp8 for KV cache.
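The detection rule and the per-component method lists can be sketched as a pre-flight check to run before calling the quantization CLI. This is an illustration only: the component keys (backbone, lm_head, kv_cache) and function names are hypothetical labels, not CLI flags. It also assumes that "detected by draft_vocab_size" means the key is present with a non-null value.

```python
import json
from pathlib import Path

# Methods per component, copied from the paragraph above. The dictionary
# keys are labels chosen for this sketch, not tensorrt-edgellm-quantize flags.
DRAFT_QUANT_METHODS = {
    "backbone": {"fp8", "int4_awq", "nvfp4", "mxfp8", "int8_sq"},
    "lm_head": {"fp8", "int4_awq", "nvfp4", "mxfp8"},
    "kv_cache": {"fp8"},
}


def is_eagle3_draft(checkpoint_dir) -> bool:
    """Assumed rule: a draft checkpoint sets draft_vocab_size in config.json."""
    config = json.loads((Path(checkpoint_dir) / "config.json").read_text())
    return config.get("draft_vocab_size") is not None


def unsupported_draft_methods(requested: dict) -> dict:
    """Return the {component: method} pairs that are not in the lists above."""
    return {
        component: method
        for component, method in requested.items()
        if method not in DRAFT_QUANT_METHODS.get(component, set())
    }
```

For example, requesting int8_sq for the LM head would be reported by unsupported_draft_methods, since int8_sq is listed only for the backbone.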
| Draft checkpoint | Base model | Draft config class |
|---|---|---|
|
||
|
||
|
||
|
||
|
||
|