Supported Models#

Code Location: experimental/llm_loader/ (recommended export), experimental/quantization/ (checkpoint quantization), experimental/server/ (Python API/server), tensorrt_edgellm/ (legacy export), cpp/ (runtime)

Pre-Quantized Checkpoints: When a supported pre-quantized checkpoint is available, the checkpoint-based loader can export it directly without a separate quantization step.

Support Policy#

TensorRT Edge-LLM supports the checkpoint IDs listed below. For the dense LLM families, this covers official dense checkpoints below 30B parameters; larger dense checkpoints and non-dense variants require case-by-case validation. MoE, multimodal, audio, TTS, omni, and EAGLE support is limited to the listed rows.

The model coverage list is not comprehensive, and not every listed checkpoint has been fully verified on every supported platform and precision. If a listed model does not export, build, or run correctly, please report an issue with the checkpoint ID, precision, platform, and command line used.
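As a convenience, the four pieces of information requested above can be gathered in one place before filing an issue. The sketch below is only an illustration: the `IssueReport` class and its field names are hypothetical and are not part of TensorRT Edge-LLM.

```python
from dataclasses import dataclass


@dataclass
class IssueReport:
    """Hypothetical container for the details an issue report should include."""
    checkpoint_id: str   # Hugging Face checkpoint ID
    precision: str       # precision that was exported or built
    platform: str        # device or platform where the failure occurred
    command_line: str    # exact command that was run

    def to_text(self) -> str:
        # Render the fields as plain text that can be pasted into an issue.
        return "\n".join([
            f"Checkpoint ID: {self.checkpoint_id}",
            f"Precision: {self.precision}",
            f"Platform: {self.platform}",
            f"Command line: {self.command_line}",
        ])


report = IssueReport(
    checkpoint_id="Qwen/Qwen3-8B",
    precision="FP8",
    platform="<your platform>",
    command_line="<exact command you ran>",
)
print(report.to_text())
```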

The model class names were checked against the installed transformers==5.3.0 package and the upstream Transformers model source tree. Checkpoint IDs are linked to their Hugging Face pages and grouped into original checkpoints and quantized checkpoints.

Precision Notes#

  • Dense precision set: FP16/BF16 checkpoints; ModelOpt checkpoints in FP8, MXFP8, FP4, NVFP4, INT4 AWQ, or INT8 SmoothQuant; and INT4 GPTQ checkpoints. INT8 GPTQ is not supported.

  • For FP16/BF16 source checkpoints, use the quantization script (experimental/quantization/) to create a unified quantized checkpoint for llm_loader, then export the generated checkpoint.

  • FP8 KV cache is detected automatically from checkpoint metadata by llm_loader.

  • llm_loader exports visual encoders in FP16. FP8 visual encoder export is available through the legacy tensorrt_edgellm visual quantization/export tools.

  • MXFP8 and FP4/NVFP4 require Blackwell-class hardware for runtime execution.
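The precision notes above can be read as a small rule table. The following is a minimal sketch of that reading, assuming hypothetical lowercase labels for each precision; the function `check_dense_precision` and the label strings are illustrative and are not an API of llm_loader.

```python
# Hypothetical labels for the precisions named in the notes above.
DENSE_PRECISION_SET = frozenset({
    "fp16", "bf16",                                          # unquantized checkpoints
    "fp8", "mxfp8", "fp4", "nvfp4", "int4_awq", "int8_sq",   # ModelOpt checkpoints
    "int4_gptq",                                             # GPTQ (INT4 only)
})

# Precisions the notes say need Blackwell-class hardware at runtime.
BLACKWELL_ONLY = frozenset({"mxfp8", "fp4", "nvfp4"})


def check_dense_precision(precision: str, blackwell: bool) -> str:
    """Classify a precision label against the dense precision set."""
    label = precision.strip().lower()
    if label not in DENSE_PRECISION_SET:
        return "unsupported"
    if label in BLACKWELL_ONLY and not blackwell:
        return "requires Blackwell-class hardware"
    return "ok"


for name in ("fp8", "nvfp4", "int8_gptq"):
    print(name, "->", check_dense_precision(name, blackwell=False))
```

Keeping the Blackwell-only subset separate from the membership check mirrors the notes: a precision can be part of the dense set and still be unusable on a given device.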

Support Matrix#

Dense LLM#

| Model Series | Transformers Class | llm_loader Handling | Supported Precisions |
|---|---|---|---|
| Llama 3.x Instruct | LlamaForCausalLM | llama -> default CausalLM | Dense precision set |
| Qwen2/Qwen2.5 dense | Qwen2ForCausalLM | qwen2 -> default CausalLM | Dense precision set |
| Qwen3 dense | Qwen3ForCausalLM | qwen3 -> default CausalLM | Dense precision set |
| Qwen3.5/3.6 text | Qwen3_5ForCausalLM | qwen3_5_text -> Qwen3_5CausalLM | Dense precision set |
| Nemotron Nano dense | NemotronHForCausalLM | nemotron_h -> NemotronHCausalLM | BF16, FP8, NVFP4 |
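The llm_loader Handling entries for the dense families pair a short key with a handler name. A minimal sketch of that lookup follows, under the assumption that the key on the left of each arrow is the `model_type` field of the checkpoint's config.json; the dictionary and function names are hypothetical and do not reflect the actual llm_loader code.

```python
import json
from pathlib import Path
from typing import Optional

# Key -> handler pairs copied from the dense LLM rows above.
# Assumption: the key corresponds to "model_type" in config.json.
DENSE_HANDLERS = {
    "llama": "default CausalLM",
    "qwen2": "default CausalLM",
    "qwen3": "default CausalLM",
    "qwen3_5_text": "Qwen3_5CausalLM",
    "nemotron_h": "NemotronHCausalLM",
}


def handler_for(config: dict) -> Optional[str]:
    """Return the handler name for a parsed config, or None if not listed."""
    return DENSE_HANDLERS.get(config.get("model_type"))


def handler_for_checkpoint(checkpoint_dir: str) -> Optional[str]:
    """Read config.json from a local checkpoint directory and look it up."""
    with open(Path(checkpoint_dir) / "config.json", encoding="utf-8") as f:
        return handler_for(json.load(f))


print(handler_for({"model_type": "qwen3"}))
```

A `None` result only means the family is absent from this illustrative table; it says nothing about the other categories on this page.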

Llama 3.x Instruct checkpoints

Original:

Quantized:

Qwen2/Qwen2.5 dense and Qwen-derived dense checkpoints

Original:

Quantized:

Qwen3 dense checkpoints

Original:

Quantized:

Qwen3.5/3.6 text checkpoints

Qwen3.5:

Qwen3.6 (same architecture as Qwen3.5):

Quantized:

Nemotron Nano dense checkpoints

Original:

Quantized:


MoE#

| Model Series | Transformers Class | llm_loader Handling | Supported Precisions |
|---|---|---|---|
| Qwen3-MoE | Qwen3MoeForCausalLM | qwen3_moe -> Qwen3MoeCausalLM | INT4 only |
| Nemotron3-MoE | NemotronHForCausalLM | nemotron_h -> NemotronHCausalLM | NVFP4 only |

Qwen3-MoE checkpoints

Nemotron3-MoE checkpoints

VLM#

| Model Series | Transformers Class | llm_loader Handling | Supported Precisions |
|---|---|---|---|
| Qwen2.5-VL | Qwen2_5_VLForConditionalGeneration | qwen2_5_vl + Qwen2_5VLVisualModel | Dense precision set for LLM backbone |
| Qwen3-VL / compatible | Qwen3VLForConditionalGeneration | qwen3_vl + Qwen3VLVisualModel | Dense precision set for LLM backbone |
| Qwen3.5/3.6 VLM | Qwen3_5ForConditionalGeneration | qwen3_5 -> Qwen3_5CausalLM + Qwen3_5VLVisualModel | VLM original checkpoints only |
| InternVL3 / InternVL3.5 HF format | InternVLForConditionalGeneration | internvl_chat / internvl + InternVL visual models | Dense precision set for LLM backbone |
| Phi-4-Multimodal | Phi4MultimodalForCausalLM | phi4mm / phi4_multimodal + Phi4MMVisualModel | Merge vision LoRA, then dense precision set for the LLM backbone |

Qwen2.5-VL checkpoints

Original:

Quantized:

Qwen3-VL / compatible checkpoints

Original:

Quantized:

Qwen3.5/3.6 VLM — same checkpoints as Qwen3.5/3.6 text

Qwen3.5 and Qwen3.6 checkpoints are unified text+VLM models. The same checkpoints listed under Qwen3.5/3.6 text are used; llm_loader selects the VLM path (qwen3_5 handler) when visual inputs are provided.

InternVL3 / InternVL3.5 HF format checkpoints

Original:

Quantized:

Phi-4-Multimodal checkpoints

VLA#

| Model Series | Transformers Class | llm_loader Handling | Supported Precisions |
|---|---|---|---|
| Alpamayo R1 | Checkpoint architecture alpamayo_r1; VLM backbone compatible with Qwen3VLForConditionalGeneration | qwen3_vl + Qwen3VLVisualModel + AlpamayoAction | FP16 |

Alpamayo R1 checkpoints

Audio / Speech#

| Model Series | Transformers Class | llm_loader Handling | Supported Precisions |
|---|---|---|---|
| Qwen3-ASR | Checkpoint architecture Qwen3ASRForConditionalGeneration; text backbone compatible with Qwen3ForCausalLM | Qwen3ASRLanguageModel + QwenAudioEncoder | FP16 |

Qwen3-ASR checkpoints

TTS#

| Model Series | Transformers Class | llm_loader Handling | Supported Precisions |
|---|---|---|---|
| Qwen3-TTS | Checkpoint architecture Qwen3TTSForConditionalGeneration; talker/code-predictor decoders compatible with Qwen3ForCausalLM | TalkerCausalLM + CodePredictorCausalLM + Code2Wav from speech_tokenizer/ | FP16 |

Qwen3-TTS checkpoints

Omni#

| Model Series | Transformers Class | llm_loader Handling | Supported Precisions |
|---|---|---|---|
| Nemotron-Omni | Checkpoint architecture NemotronH_Nano_Omni_Reasoning_V3; LLM is Nemotron-H compatible with NemotronHForCausalLM | NemotronHCausalLM + NemotronOmniVisualModel + NemotronOmniAudioModel | NVFP4 only |

Nemotron-Omni checkpoints

Qwen3-ASR and Qwen3-TTS use checkpoint architecture names that are not present in the installed transformers==5.3.0 package, so TensorRT Edge-LLM handles their speech/audio/talker/Code2Wav components with local model implementations. Qwen3-TTS support is limited to the CustomVoice checkpoints listed above.
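One way to see which checkpoint architectures fall outside an installed Transformers package is to compare the `architectures` list in config.json against the names the package exposes. The sketch below illustrates that check only; `missing_architectures` is a hypothetical helper, and the example uses a stand-in namespace so it runs without Transformers installed. It does not describe how TensorRT Edge-LLM itself decides to use its local implementations.

```python
from types import SimpleNamespace


def missing_architectures(config: dict, transformers_module) -> list:
    """Return architecture names from config that the given module lacks."""
    return [
        name
        for name in config.get("architectures", [])
        if not hasattr(transformers_module, name)
    ]


# Stand-in for the real module; with Transformers installed you could pass
# the imported `transformers` module instead.
fake_transformers = SimpleNamespace(Qwen3ForCausalLM=object)

asr_config = {"architectures": ["Qwen3ASRForConditionalGeneration"]}
print(missing_architectures(asr_config, fake_transformers))
```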

EAGLE3 Draft Models#

EAGLE3 draft checkpoints are detected by draft_vocab_size in config.json and exported with Eagle3DraftModel. Draft checkpoints can be quantized with experimental.quantization using the same ModelOpt methods exposed by the draft quantization CLI:

  • Backbone: fp8, int4_awq, nvfp4, mxfp8, and int8_sq.

  • LM head: fp8, int4_awq, nvfp4, and mxfp8.

  • KV cache: fp8.

| Draft checkpoint | Base model | Draft config class |
|---|---|---|
| yuhuili/EAGLE3-LLaMA3.1-Instruct-8B | meta-llama/Llama-3.1-8B-Instruct | LlamaForCausalLM-style draft |
| AngelSlim/Qwen3-1.7B_eagle3 | Qwen/Qwen3-1.7B | LlamaForCausalLMEagle3-style draft |
| AngelSlim/Qwen3-4B_eagle3 | Qwen/Qwen3-4B | Eagle3LlamaForCausalLM-style draft |
| Tengyunw/qwen3_8b_eagle3 | Qwen/Qwen3-8B | LlamaForCausalLMEagle3-style draft |
| AngelSlim/Qwen3-8B_eagle3 | Qwen/Qwen3-8B | LlamaForCausalLMEagle3-style draft |
| Rayzl/qwen2.5-vl-7b-eagle3-sgl | Qwen/Qwen2.5-VL-7B-Instruct | LlamaForCausalLMEagle3-style draft |
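The detection rule and the per-component method lists for EAGLE3 drafts can be sketched as follows. This is an illustration under stated assumptions: a draft is taken to be any config whose `draft_vocab_size` key is present and not null, and the component names `backbone`, `lm_head`, and `kv_cache` are hypothetical labels rather than actual CLI flags.

```python
import json
from pathlib import Path

# Method names as listed in this section; component keys are hypothetical.
DRAFT_QUANT_METHODS = {
    "backbone": frozenset({"fp8", "int4_awq", "nvfp4", "mxfp8", "int8_sq"}),
    "lm_head": frozenset({"fp8", "int4_awq", "nvfp4", "mxfp8"}),
    "kv_cache": frozenset({"fp8"}),
}


def is_eagle3_draft(config: dict) -> bool:
    """Assumed rule: a non-null draft_vocab_size marks an EAGLE3 draft."""
    return config.get("draft_vocab_size") is not None


def is_eagle3_draft_checkpoint(checkpoint_dir: str) -> bool:
    """Apply the same rule to config.json in a local checkpoint directory."""
    with open(Path(checkpoint_dir) / "config.json", encoding="utf-8") as f:
        return is_eagle3_draft(json.load(f))


def is_allowed_method(component: str, method: str) -> bool:
    """Check a quantization method against the list for one component."""
    return method in DRAFT_QUANT_METHODS.get(component, frozenset())


print(is_eagle3_draft({"draft_vocab_size": 32000}))
print(is_allowed_method("lm_head", "int8_sq"))
```

Note that int8_sq appears only in the backbone list, so the second call reports that it is not available for the LM head.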