Supported Models

Code Location: experimental/llm_loader/ (recommended export), experimental/quantization/ (checkpoint quantization), experimental/server/ (Python API/server), tensorrt_edgellm/ (legacy export), cpp/ (runtime)

Pre-Quantized Checkpoints: When a supported pre-quantized checkpoint is available, the checkpoint-based loader can export it directly without a separate quantization step.

Support Policy

TensorRT Edge-LLM supports the checkpoint IDs listed below. For the dense LLM families, coverage extends to the official dense checkpoints below 30B parameters. Larger dense checkpoints and non-dense variants require case-by-case validation. MoE, multimodal, audio, TTS, omni, and EAGLE support is limited to the listed rows.

The model coverage list is not comprehensive, and not every listed checkpoint has been fully verified on every supported platform and precision. If a listed model does not export, build, or run correctly, please report an issue with the checkpoint ID, precision, platform, and command line used.

The model class names were checked against the installed transformers==5.3.0 package and the upstream Transformers model source tree. Checkpoint IDs are linked to their Hugging Face pages and grouped into original checkpoints and quantized checkpoints.
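A check of this kind can be reproduced with a few lines of Python. The sketch below is not part of TensorRT Edge-LLM: `missing_classes` and `check_installed_transformers` are hypothetical helper names, and the class list is only a sample copied from the Transformers Class columns on this page.

```python
import importlib

# Sample of class names copied from the "Transformers Class" columns on this page.
EXPECTED_CLASSES = [
    "LlamaForCausalLM",
    "Qwen2ForCausalLM",
    "Qwen3ForCausalLM",
    "Qwen3MoeForCausalLM",
    "Qwen2_5_VLForConditionalGeneration",
    "Qwen3VLForConditionalGeneration",
]


def missing_classes(module, names):
    """Return the names that `module` does not expose as attributes."""
    return [name for name in names if not hasattr(module, name)]


def check_installed_transformers(names=EXPECTED_CLASSES):
    """Import the installed transformers package and report absent classes.

    Hypothetical helper: returns (installed version, list of missing names).
    """
    transformers = importlib.import_module("transformers")
    return transformers.__version__, missing_classes(transformers, names)
```

Calling `check_installed_transformers()` in the export environment returns the installed version together with any class names that the package does not provide, which is a quick way to spot a version mismatch before exporting.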

Precision Notes

Dense precision set: FP16/BF16 checkpoints; ModelOpt FP8, MXFP8, FP4/NVFP4, INT4 AWQ, and INT8 SmoothQuant checkpoints; and INT4 GPTQ checkpoints. INT8 GPTQ is not supported.
For FP16/BF16 source checkpoints, use the quantization script in experimental/quantization/ to create a unified quantized checkpoint for llm_loader, then export the generated checkpoint.
FP8 KV cache is detected automatically from checkpoint metadata by llm_loader.
llm_loader exports visual encoders in FP16. FP8 visual encoder export is available through the legacy tensorrt_edgellm visual quantization/export tools.
MXFP8 and FP4/NVFP4 require Blackwell-class hardware for runtime execution.
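The precision rules above can be encoded as a small pre-flight check. This is a minimal sketch rather than anything llm_loader provides: the lowercase identifiers (`int4_awq`, `int8_sq`, `int4_gptq`, and so on) and the function name `check_dense_precision` are illustrative assumptions, and the caller is expected to know whether the target device is Blackwell-class.

```python
# Precisions named in the notes above, normalized to lowercase identifiers.
# The identifier spellings are illustrative, not llm_loader option names.
DENSE_PRECISION_SET = {
    "fp16", "bf16", "fp8", "mxfp8", "fp4", "nvfp4",
    "int4_awq", "int8_sq", "int4_gptq",
}

# Precisions that the notes say need Blackwell-class hardware at runtime.
BLACKWELL_ONLY = {"mxfp8", "fp4", "nvfp4"}


def check_dense_precision(precision, has_blackwell):
    """Return (ok, reason) for a dense checkpoint precision."""
    key = precision.lower()
    if key not in DENSE_PRECISION_SET:
        return False, f"{precision} is not in the dense precision set"
    if key in BLACKWELL_ONLY and not has_blackwell:
        return False, f"{precision} needs Blackwell-class hardware at runtime"
    return True, "supported"
```

For example, `check_dense_precision("int8_gptq", True)` is rejected because INT8 GPTQ is outside the dense precision set, while `check_dense_precision("nvfp4", False)` is rejected only because of the hardware requirement.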

Support Matrix

Dense LLM

Model Series	Transformers Class	`llm_loader` Handling	Supported Precisions
Llama 3.x Instruct	`LlamaForCausalLM`	`llama` -> default `CausalLM`	Dense precision set
Qwen2/Qwen2.5 dense	`Qwen2ForCausalLM`	`qwen2` -> default `CausalLM`	Dense precision set
Qwen3 dense	`Qwen3ForCausalLM`	`qwen3` -> default `CausalLM`	Dense precision set
Qwen3.5/3.6 text	`Qwen3_5ForCausalLM`	`qwen3_5_text` -> `Qwen3_5CausalLM`	Dense precision set
Nemotron Nano dense	`NemotronHForCausalLM`	`nemotron_h` -> `NemotronHCausalLM`	BF16, FP8, NVFP4
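To see which row of the table a local checkpoint falls under, one option is to read its config.json. The sketch below assumes that the first identifier in the `llm_loader` Handling column matches the `model_type` field of the checkpoint's config.json. That mapping is an assumption made for illustration, `dense_handler_for` is a hypothetical helper, and none of this reproduces llm_loader's actual dispatch code.

```python
import json
from pathlib import Path

# model_type -> handler class name, copied from the Dense LLM table above.
# Assumes the table's first identifier equals config.json's "model_type".
DENSE_HANDLERS = {
    "llama": "CausalLM",
    "qwen2": "CausalLM",
    "qwen3": "CausalLM",
    "qwen3_5_text": "Qwen3_5CausalLM",
    "nemotron_h": "NemotronHCausalLM",
}


def dense_handler_for(checkpoint_dir):
    """Return (model_type, handler name or None) for a checkpoint directory."""
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text())
    model_type = config.get("model_type")
    return model_type, DENSE_HANDLERS.get(model_type)
```

A result whose second element is `None` means the checkpoint's `model_type` does not appear in the dense table, so it belongs to another section of the matrix or is not listed.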

Llama 3.x Instruct checkpoints

Original:

Quantized:

Qwen2/Qwen2.5 dense and Qwen-derived dense checkpoints

Original:

Quantized:

Qwen3 dense checkpoints

Original:

Quantized:

Qwen3.5/3.6 text checkpoints

Qwen3.5:

Qwen3.6 (same architecture as Qwen3.5):

Qwen/Qwen3.6-27B

Quantized:

Qwen/Qwen3.6-27B-FP8

Nemotron Nano dense checkpoints

Original:

Quantized:

MoE

Model Series	Transformers Class	`llm_loader` Handling	Supported Precisions
Qwen3-MoE	`Qwen3MoeForCausalLM`	`qwen3_moe` -> `Qwen3MoeCausalLM`	INT4 only
Nemotron3-MoE	`NemotronHForCausalLM`	`nemotron_h` -> `NemotronHCausalLM`	NVFP4 only

Qwen3-MoE checkpoints

Qwen/Qwen3-30B-A3B-GPTQ-Int4

Nemotron3-MoE checkpoints

nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4

VLM

Model Series	Transformers Class	`llm_loader` Handling	Supported Precisions
Qwen2.5-VL	`Qwen2_5_VLForConditionalGeneration`	`qwen2_5_vl` + `Qwen2_5VLVisualModel`	Dense precision set for LLM backbone
Qwen3-VL / compatible	`Qwen3VLForConditionalGeneration`	`qwen3_vl` + `Qwen3VLVisualModel`	Dense precision set for LLM backbone
Qwen3.5/3.6 VLM	`Qwen3_5ForConditionalGeneration`	`qwen3_5` -> `Qwen3_5CausalLM` + `Qwen3_5VLVisualModel`	VLM original checkpoints only
InternVL3 / InternVL3.5 HF format	`InternVLForConditionalGeneration`	`internvl_chat` / `internvl` + InternVL visual models	Dense precision set for LLM backbone
Phi-4-Multimodal	`Phi4MultimodalForCausalLM`	`phi4mm` / `phi4_multimodal` + `Phi4MMVisualModel`	Merge vision LoRA, then dense precision set for the LLM backbone

Qwen2.5-VL checkpoints

Original:

Quantized:

Qwen3-VL / compatible checkpoints

Original:

Quantized:

Qwen3.5/3.6 VLM checkpoints (same as Qwen3.5/3.6 text)

Qwen3.5 and Qwen3.6 checkpoints are unified text+VLM models. The same checkpoints listed under Qwen3.5/3.6 text are used; llm_loader selects the VLM path (qwen3_5 handler) when visual inputs are provided.

InternVL3 / InternVL3.5 HF format checkpoints

Original:

Quantized:

Phi-4-Multimodal checkpoints

microsoft/Phi-4-multimodal-instruct

VLA

Model Series	Transformers Class	`llm_loader` Handling	Supported Precisions
Alpamayo R1	Checkpoint architecture `alpamayo_r1`; VLM backbone compatible with `Qwen3VLForConditionalGeneration`	`qwen3_vl` + `Qwen3VLVisualModel` + `AlpamayoAction`	FP16

Alpamayo R1 checkpoints

nvidia/Alpamayo-R1-10B

Audio / Speech

Model Series	Transformers Class	`llm_loader` Handling	Supported Precisions
Qwen3-ASR	Checkpoint architecture `Qwen3ASRForConditionalGeneration`; text backbone compatible with `Qwen3ForCausalLM`	`Qwen3ASRLanguageModel` + `QwenAudioEncoder`	FP16

Qwen3-ASR checkpoints

TTS

Model Series	Transformers Class	`llm_loader` Handling	Supported Precisions
Qwen3-TTS	Checkpoint architecture `Qwen3TTSForConditionalGeneration`; talker/code-predictor decoders compatible with `Qwen3ForCausalLM`	`TalkerCausalLM` + `CodePredictorCausalLM` + Code2Wav from `speech_tokenizer/`	FP16

Qwen3-TTS checkpoints

Omni

Model Series	Transformers Class	`llm_loader` Handling	Supported Precisions
Nemotron-Omni	Checkpoint architecture `NemotronH_Nano_Omni_Reasoning_V3`; LLM is Nemotron-H compatible with `NemotronHForCausalLM`	`NemotronHCausalLM` + `NemotronOmniVisualModel` + `NemotronOmniAudioModel`	NVFP4 only

Nemotron-Omni checkpoints

nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4

Qwen3-ASR and Qwen3-TTS use checkpoint architecture names that are not present in the installed transformers==5.3.0 package, so TensorRT Edge-LLM handles their speech/audio/talker/Code2Wav components with local model implementations. Qwen3-TTS support is limited to the CustomVoice checkpoints listed above.

EAGLE3 Draft Models

EAGLE3 draft checkpoints are detected by draft_vocab_size in config.json and exported with Eagle3DraftModel. Draft checkpoints can be quantized with experimental.quantization using the same ModelOpt methods exposed by the draft quantization CLI: fp8, int4_awq, nvfp4, mxfp8, and int8_sq for the backbone; fp8, int4_awq, nvfp4, and mxfp8 for the LM head; and fp8 for KV cache.

Draft checkpoint	Base model	Draft config class
yuhuili/EAGLE3-LLaMA3.1-Instruct-8B	meta-llama/Llama-3.1-8B-Instruct	`LlamaForCausalLM`-style draft
AngelSlim/Qwen3-1.7B_eagle3	Qwen/Qwen3-1.7B	`LlamaForCausalLMEagle3`-style draft
AngelSlim/Qwen3-4B_eagle3	Qwen/Qwen3-4B	`Eagle3LlamaForCausalLM`-style draft
Tengyunw/qwen3_8b_eagle3	Qwen/Qwen3-8B	`LlamaForCausalLMEagle3`-style draft
AngelSlim/Qwen3-8B_eagle3	Qwen/Qwen3-8B	`LlamaForCausalLMEagle3`-style draft
Rayzl/qwen2.5-vl-7b-eagle3-sgl	Qwen/Qwen2.5-VL-7B-Instruct	`LlamaForCausalLMEagle3`-style draft
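The detection rule and the per-component method lists described in this section can be sketched as follows. This is an illustration under stated assumptions, not the loader's implementation: detection is modeled as the mere presence of the `draft_vocab_size` key in config.json, the component labels `backbone`, `lm_head`, and `kv_cache` are names chosen here, and `is_eagle3_draft` and `invalid_draft_methods` are hypothetical helpers.

```python
import json
from pathlib import Path

# Method names copied from the paragraph in this section, grouped by component.
# The component keys are labels chosen for this sketch.
DRAFT_QUANT_METHODS = {
    "backbone": {"fp8", "int4_awq", "nvfp4", "mxfp8", "int8_sq"},
    "lm_head": {"fp8", "int4_awq", "nvfp4", "mxfp8"},
    "kv_cache": {"fp8"},
}


def is_eagle3_draft(checkpoint_dir):
    """Treat a checkpoint as an EAGLE3 draft if config.json has draft_vocab_size.

    Assumption: key presence alone is the signal; the real check may differ.
    """
    config = json.loads((Path(checkpoint_dir) / "config.json").read_text())
    return "draft_vocab_size" in config


def invalid_draft_methods(**requested):
    """Return the {component: method} entries not listed for draft checkpoints."""
    return {
        component: method
        for component, method in requested.items()
        if method not in DRAFT_QUANT_METHODS.get(component, set())
    }
```

For instance, `invalid_draft_methods(backbone="int8_sq", lm_head="int8_sq", kv_cache="fp8")` flags only the LM head, since `int8_sq` is listed for the backbone but not for the LM head.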