Supported Models

Code Location: tensorrt_edgellm/ (checkpoint export), tensorrt_edgellm/quantization/ (checkpoint quantization), experimental/server/ (Python API/server), cpp/ (runtime)

Pre-Quantized Checkpoints: When a supported pre-quantized checkpoint is available, the checkpoint exporter can export it directly without a separate quantization step.

Support Policy

TensorRT Edge-LLM supports the checkpoint IDs listed below. Dense LLM families include official dense checkpoints below 30B parameters. Larger dense checkpoints and non-dense variants require case-by-case validation. MoE, multimodal, audio, TTS, omni, EAGLE3, and DFlash support is limited to the listed rows.

The model coverage list is not comprehensive, and not every listed checkpoint has been fully verified on every supported platform and precision. If a listed model does not export, build, or run correctly, please report an issue with the checkpoint ID, precision, platform, and command line used.

The model class names were checked against the installed transformers==5.9.0 package and the upstream Transformers model source tree. Checkpoint IDs are linked to their Hugging Face pages and grouped into original checkpoints and quantized checkpoints.

Precision Notes

Dense precision set: FP16/BF16 checkpoints, ModelOpt FP8/MXFP8/FP4/NVFP4/INT4 AWQ/INT8 SmoothQuant checkpoints, and INT4 GPTQ checkpoints. INT8 GPTQ is not supported.
Jetson Orin supports FP16, INT8, and INT4 runtime precision in the supported JetPack configurations. Do not select FP8, MXFP8, FP4, or NVFP4 checkpoints for Orin.
For INT4 engine builds on Jetson Orin devices with limited system memory, such as Jetson Orin Nano, reduce engine build memory by passing --externalize-weights int4_ffn for dense checkpoints or --externalize-weights int4_ffn int4_moe for MoE checkpoints.
For FP16/BF16 source checkpoints, use the Quantization script to create a unified quantized checkpoint for tensorrt_edgellm, then export the generated checkpoint.
FP8 KV cache is detected automatically from checkpoint metadata by tensorrt_edgellm.
tensorrt-edgellm-export exports visual encoders. Use tensorrt-edgellm-quantize llm --visual_quantization fp8 before export when FP8 visual weights are required.
MXFP8 and FP4/NVFP4 require Blackwell-class hardware for runtime execution.
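
The platform rules in these notes can be written as a small pre-flight check. The sketch below is illustrative only: the function names, the lowercase precision labels, and the platform labels ("orin", "blackwell", "other") are assumptions made for this example and are not part of tensorrt_edgellm.

```python
# Illustrative only: none of these names exist in tensorrt_edgellm.
ORIN_PRECISIONS = {"fp16", "int8", "int4"}
BLACKWELL_ONLY = {"mxfp8", "fp4", "nvfp4"}


def check_precision(platform: str, runtime_precision: str) -> str:
    """Return "ok" or a short reason, following the precision notes."""
    p = runtime_precision.lower()
    if p == "int8_gptq":
        return "INT8 GPTQ is not supported"
    if platform == "orin":
        # Orin is limited to FP16, INT8, and INT4 runtime precision.
        if p not in ORIN_PRECISIONS:
            return f"{p} is not an Orin runtime precision"
        return "ok"
    if p in BLACKWELL_ONLY and platform != "blackwell":
        return f"{p} requires Blackwell-class hardware"
    return "ok"


def externalize_weights_args(is_moe: bool) -> list[str]:
    """Flags for INT4 builds on memory-constrained Orin devices."""
    kinds = ["int4_ffn", "int4_moe"] if is_moe else ["int4_ffn"]
    return ["--externalize-weights", *kinds]
```

For example, check_precision("orin", "nvfp4") returns a reason string rather than "ok", and externalize_weights_args(True) yields the MoE form of the flag.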

Support Matrix

Dense LLM

| Model Series | Transformers Class | `tensorrt_edgellm` Handling | Supported Precisions |
| --- | --- | --- | --- |
| Llama 3.x Instruct | `LlamaForCausalLM` | `llama` -> default `CausalLM` | Dense precision set |
| Qwen2/Qwen2.5 dense | `Qwen2ForCausalLM` | `qwen2` -> default `CausalLM` | Dense precision set |
| Qwen3 dense | `Qwen3ForCausalLM` | `qwen3` -> default `CausalLM` | Dense precision set |
| Qwen3.5/3.6 text | `Qwen3_5ForCausalLM` | `qwen3_5_text` -> `Qwen3_5CausalLM` | Dense precision set |
| Nemotron Nano dense | `NemotronHForCausalLM` | `nemotron_h` -> `NemotronHCausalLM` | BF16, FP8, NVFP4 |
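
The handling column can be read as a lookup from a model identifier to a handler class. The sketch below copies the keys and handler names from the Dense LLM table; treating the key as the model_type field of config.json is an assumption of this example, and dense_handler is a hypothetical helper rather than a tensorrt_edgellm function.

```python
import json

# Keys and values are copied from the Dense LLM table. Reading the key from
# config.json "model_type" is an assumption made for this sketch.
DENSE_HANDLERS = {
    "llama": "CausalLM",
    "qwen2": "CausalLM",
    "qwen3": "CausalLM",
    "qwen3_5_text": "Qwen3_5CausalLM",
    "nemotron_h": "NemotronHCausalLM",
}


def dense_handler(config_text: str) -> str:
    """Map the text of a config.json file to the handler name in the table."""
    model_type = json.loads(config_text).get("model_type")
    if model_type not in DENSE_HANDLERS:
        raise ValueError(f"no dense handler listed for {model_type!r}")
    return DENSE_HANDLERS[model_type]
```

A configuration whose identifier is not in the table raises ValueError, which mirrors the policy that support is limited to the listed rows.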

Llama 3.x Instruct checkpoints

Original:

Quantized:

Qwen2/Qwen2.5 dense and Qwen-derived dense checkpoints

Original:

Quantized:

Qwen3 dense checkpoints

Original:

Quantized:

Qwen3.5/3.6 text checkpoints

Qwen3.5:

Qwen3.6 (same architecture as Qwen3.5):

Qwen/Qwen3.6-27B

Nemotron Nano dense checkpoints

Original:

Quantized:

MoE

| Model Series | Transformers Class | `tensorrt_edgellm` Handling | Supported Precisions |
| --- | --- | --- | --- |
| Qwen3-MoE | `Qwen3MoeForCausalLM` | `qwen3_moe` -> `Qwen3MoeCausalLM` | INT4, NVFP4 |
| Qwen3.5/3.6-MoE | `Qwen3_5MoeForConditionalGeneration` | `qwen3_5_moe` -> `Qwen3_5MoeCausalLM` + `Qwen3_5VLVisualModel` | INT4 GPTQ, NVFP4 |
| Nemotron3-MoE (Nano 30B-A3B, Super 120B-A12B) | `NemotronHForCausalLM` | `nemotron_h` -> `NemotronHCausalLM` (the Super 120B-A12B variant uses a latent MoE routed path) | NVFP4 only |

NVFP4 MoE export picks the plugin and FC1 weight layout from EDGELLM_NVFP4_MOE_TARGET; see MoE Example.

Qwen3-MoE checkpoints

Qwen3.5/3.6-MoE checkpoints

Nemotron3-MoE checkpoints

Nemotron3 Super uses latent MoE: routing is computed from the model hidden states, while the routed expert payload is projected to moe_latent_size before the NVFP4 MoE plugin path. The shared expert path remains separate.
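
The data flow described above can be illustrated with a toy single-token layer in plain Python. This is a minimal sketch under stated assumptions, not the Nemotron3 or TensorRT Edge-LLM implementation: the softmax over the selected logits, the linear experts, the up-projection back to the hidden width, and all argument names are placeholders chosen for the example.

```python
import math


def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def latent_moe(x, router_w, down_w, experts, up_w, shared_w, top_k=2):
    """Toy latent-MoE layer for one token with hidden state x."""
    # 1. Routing uses the full hidden state, not the latent projection.
    logits = matvec(router_w, x)
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    exps = [math.exp(logits[i] - logits[top[0]]) for i in top]
    weights = [e / sum(exps) for e in exps]

    # 2. The routed payload is projected down to the latent width.
    z = matvec(down_w, x)

    # 3. Selected experts run in latent space (linear here, as a stand-in)
    #    and their outputs are mixed by routing weight.
    mixed = [0.0] * len(z)
    for w, idx in zip(weights, top):
        out = matvec(experts[idx], z)
        mixed = [m + w * o for m, o in zip(mixed, out)]

    # 4. Assumed up-projection back to hidden width, plus the separate
    #    shared expert, which sees the original hidden state.
    routed = matvec(up_w, mixed)
    shared = matvec(shared_w, x)
    return [r + s for r, s in zip(routed, shared)]
```

The point of the sketch is the ordering: the router and the shared expert both consume the hidden state, while only the routed experts operate at the latent width.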

VLM

| Model Series | Transformers Class | `tensorrt_edgellm` Handling | Supported Precisions |
| --- | --- | --- | --- |
| Qwen2.5-VL | `Qwen2_5_VLForConditionalGeneration` | `qwen2_5_vl` + `Qwen2_5VLVisualModel` | Dense precision set for LLM backbone |
| Qwen3-VL / compatible | `Qwen3VLForConditionalGeneration` | `qwen3_vl` + `Qwen3VLVisualModel` | Dense precision set for LLM backbone |
| Qwen3.5/3.6 VLM | `Qwen3_5ForConditionalGeneration` | `qwen3_5` -> `Qwen3_5CausalLM` + `Qwen3_5VLVisualModel` | VLM original checkpoints only |
| InternVL3 / InternVL3.5 HF format | `InternVLForConditionalGeneration` | `internvl_chat` / `internvl` + InternVL visual models | Dense precision set for LLM backbone |
| Phi-4-Multimodal | `Phi4MultimodalForCausalLM` | `phi4mm` / `phi4_multimodal` + `Phi4MMVisualModel` | Merge vision LoRA, then dense precision set for the LLM backbone |

Qwen2.5-VL checkpoints

Original:

Quantized:

Qwen3-VL / compatible checkpoints

Original:

Quantized:

Qwen3.5/3.6 VLM — same checkpoints as Qwen3.5/3.6 text

Qwen3.5 and Qwen3.6 checkpoints are unified text+VLM models. The same checkpoints listed under Qwen3.5/3.6 text are used; tensorrt_edgellm selects the VLM path (qwen3_5 handler) when visual inputs are provided.

InternVL3 / InternVL3.5 HF format checkpoints

Original:

Quantized:

Phi-4-Multimodal checkpoints

microsoft/Phi-4-multimodal-instruct

VLA

| Model Series | Transformers Class | `tensorrt_edgellm` Handling | Supported Precisions |
| --- | --- | --- | --- |
| Alpamayo R1 | Checkpoint architecture `alpamayo_r1`; VLM backbone compatible with `Qwen3VLForConditionalGeneration` | `qwen3_vl` + `Qwen3VLVisualModel` + `AlpamayoAction` | FP16 |

Alpamayo R1 checkpoints

nvidia/Alpamayo-R1-10B

Audio / Speech

| Model Series | Transformers Class | `tensorrt_edgellm` Handling | Supported Precisions |
| --- | --- | --- | --- |
| Qwen3-ASR | Checkpoint architecture `Qwen3ASRForConditionalGeneration`; text backbone compatible with `Qwen3ForCausalLM` | `Qwen3ASRLanguageModel` + `QwenAudioEncoder` | FP16; FP8 LLM (optional FP8 audio); NVFP4 LLM (optional FP8 audio; see ASR example) |

Qwen3-ASR checkpoints

TTS

| Model Series | Transformers Class | `tensorrt_edgellm` Handling | Supported Precisions |
| --- | --- | --- | --- |
| Qwen3-TTS | Checkpoint architecture `Qwen3TTSForConditionalGeneration`; talker/code-predictor decoders compatible with `Qwen3ForCausalLM` | `TalkerCausalLM` + `CodePredictorCausalLM` + Code2Wav from `speech_tokenizer/` | FP16 |

Qwen3-TTS checkpoints

Omni

| Model Series | Transformers Class | `tensorrt_edgellm` Handling | Supported Precisions |
| --- | --- | --- | --- |
| Qwen3-Omni | `Qwen3OmniMoeForConditionalGeneration` | `Qwen3OmniMoeThinkerCausalLM` + `Qwen3OmniMoeTalkerCausalLM` + `CodePredictorCausalLM` + visual/audio/Code2Wav (six-engine layout; see the Omni example) | NVFP4 only |
| Nemotron-Omni | Checkpoint architecture `NemotronH_Nano_Omni_Reasoning_V3`; LLM is Nemotron-H compatible with `NemotronHForCausalLM` | `NemotronHCausalLM` + `NemotronOmniVisualModel` + `NemotronOmniAudioModel` | NVFP4 only |
| Gemma4 E2B/E4B (text + image + audio) | `Gemma4ForCausalLM` | `gemma4` / `gemma4_text` -> text decoder (PLE, dual-RoPE) with paired-assistant MTP, plus `Gemma4VisualModel` (image) and `Gemma4AudioModel` (audio) | BF16/FP16 source checkpoints; paired-assistant MTP via a matched Gemma4 assistant checkpoint (released for both sizes); text + image + audio input |
| Gemma4 31B (text + image) | `Gemma4ForCausalLM` | `gemma4` / `gemma4_text` -> text decoder (PLE, dual-RoPE) with paired-assistant MTP, plus `Gemma4VisualModel` for image input | BF16/FP16 source plus NVFP4; paired-assistant MTP via a matched Gemma4 assistant checkpoint; text + image input |
| Gemma4 26B-A4B MoE (text + image) | `Gemma4ForCausalLM` | `gemma4` / `gemma4_text` -> text decoder (PLE, dual-RoPE) with GeGLU sparse-MoE FFN and paired-assistant MTP, plus `Gemma4VisualModel` for image input | NVFP4; paired-assistant MTP via a matched Gemma4 assistant checkpoint; text + image input |
| Gemma4 Unified 12B | Checkpoint architecture `Gemma4UnifiedForConditionalGeneration`; text backbone compatible with `Gemma4ForCausalLM` | `gemma4_unified` -> Gemma4 text decoder (dual-RoPE, per-layer heterogeneous KV, decoder-side vision-block bidirectional attention) + `Gemma4UnifiedVisualModel` + `Gemma4UnifiedAudioModel` (encoder-free patch/PCM embedders) | FP16 LLM backbone; FP32 multimodal embedders; image and audio input |

Nemotron-Omni checkpoints

nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4

Note: The end-to-end Omni example workflow currently documents Qwen3-Omni. Nemotron-Omni is supported through its dedicated NemotronOmniVisualModel / NemotronOmniAudioModel component paths; follow the same export → engine-build → inference structure as the Qwen3-Omni example, substituting the Nemotron-Omni checkpoint.

Gemma4 E2B/E4B/12B/31B checkpoints

Modality: E2B, E4B, and the Unified 12B accept text + image + audio; 31B accepts text + image. E2B/E4B/31B support paired-assistant MTP.

Original (BF16/FP16):

Quantized:

nvidia/Gemma-4-31B-IT-NVFP4

Paired MTP assistant checkpoints (one per base; pair with the matching size — released for every size):

Gemma4 26B-A4B MoE checkpoints

Quantized (NVFP4):

nvidia/Gemma-4-26B-A4B-NVFP4

Paired MTP assistant:

google/gemma-4-26B-A4B-it-assistant

Qwen3-ASR and Qwen3-TTS use checkpoint architecture names that are not present in the installed transformers==5.9.0 package, so TensorRT Edge-LLM handles their speech/audio/talker/Code2Wav components with local model implementations. Qwen3-TTS support is limited to the CustomVoice checkpoints listed above.

EAGLE3 Draft Models

EAGLE3 draft checkpoints are detected by draft_vocab_size in config.json and exported with Eagle3DraftModel. Draft checkpoints can be quantized with tensorrt-edgellm-quantize using the same ModelOpt methods exposed by the draft quantization CLI: fp8, int4_awq, nvfp4, mxfp8, and int8_sq for the backbone; fp8, int4_awq, nvfp4, and mxfp8 for the LM head; and fp8 for KV cache.
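
The detection rule and the per-target method lists above can be captured in a few lines. The method names are taken from the paragraph; the dictionary, the target labels ("backbone", "lm_head", "kv_cache"), and both function names are hypothetical and only serve this sketch.

```python
# Method names are copied from the paragraph above; the dictionary and
# function names are hypothetical and not part of tensorrt_edgellm.
DRAFT_QUANT_METHODS = {
    "backbone": {"fp8", "int4_awq", "nvfp4", "mxfp8", "int8_sq"},
    "lm_head": {"fp8", "int4_awq", "nvfp4", "mxfp8"},
    "kv_cache": {"fp8"},
}


def is_eagle3_draft(config: dict) -> bool:
    """An EAGLE3 draft is recognized by a draft_vocab_size key in config.json."""
    return "draft_vocab_size" in config


def invalid_draft_quant(choices: dict) -> list[str]:
    """Return the targets whose requested method is not listed for them."""
    return sorted(
        target
        for target, method in choices.items()
        if method not in DRAFT_QUANT_METHODS.get(target, set())
    )
```

For instance, requesting int8_sq for both the backbone and the LM head flags only the LM head, since int8_sq is listed for the backbone alone.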

| Draft checkpoint | Base model | Draft config class |
| --- | --- | --- |
| yuhuili/EAGLE3-LLaMA3.1-Instruct-8B | meta-llama/Llama-3.1-8B-Instruct | `LlamaForCausalLM`-style draft |
| AngelSlim/Qwen3-1.7B_eagle3 | Qwen/Qwen3-1.7B | `LlamaForCausalLMEagle3`-style draft |
| AngelSlim/Qwen3-4B_eagle3 | Qwen/Qwen3-4B | `Eagle3LlamaForCausalLM`-style draft |
| Tengyunw/qwen3_8b_eagle3 | Qwen/Qwen3-8B | `LlamaForCausalLMEagle3`-style draft |
| AngelSlim/Qwen3-8B_eagle3 | Qwen/Qwen3-8B | `LlamaForCausalLMEagle3`-style draft |
| Rayzl/qwen2.5-vl-7b-eagle3-sgl | Qwen/Qwen2.5-VL-7B-Instruct | `LlamaForCausalLMEagle3`-style draft |

DFlash Draft Models

DFlash draft checkpoints are detected by dflash_config in config.json and exported with DFlashDraftModel. The export flags depend on the mode: linear DFlash base export uses --dflash-base --dflash-draft-dir <draft_checkpoint>; Qwen3.5 hybrid DDTree base export uses --dflash-tree-base --dflash-draft-dir <draft_checkpoint>; and draft export uses --dflash-draft --dflash-draft-dir <draft_checkpoint>. Run a tree-base engine only with DFlash DDTree runtime settings such as --specDraftTopK 8, never with the linear setting --specDraftTopK 1. DFlash draft checkpoints can be quantized with tensorrt-edgellm-quantize draft; NVFP4 backbone quantization and optional NVFP4 LM-head quantization are validated.
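
The three export modes and the top-k constraint can be summarized as follows. The flag strings are the ones quoted above; the mode labels and function names are hypothetical, and the sketch builds only the DFlash-specific flags rather than a complete export command.

```python
# Flag strings come from the paragraph above; the mode labels and the
# function names are hypothetical.
DFLASH_MODE_FLAG = {
    "linear_base": "--dflash-base",
    "tree_base": "--dflash-tree-base",
    "draft": "--dflash-draft",
}


def is_dflash_draft(config: dict) -> bool:
    """A DFlash draft is recognized by a dflash_config key in config.json."""
    return "dflash_config" in config


def dflash_export_flags(mode: str, draft_dir: str) -> list[str]:
    """Return the DFlash-specific export flags for one export mode."""
    return [DFLASH_MODE_FLAG[mode], "--dflash-draft-dir", draft_dir]


def tree_base_top_k_ok(spec_draft_top_k: int) -> bool:
    """A tree-base engine must not run with the linear top-k value of 1."""
    return spec_draft_top_k != 1
```

Keeping the mode-to-flag mapping in one table makes it harder to pair a tree-base engine with linear runtime settings by accident.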

DFlash support in TensorRT Edge-LLM is currently validated only for Qwen3 and Qwen3.5. Other DFlash draft models in the z-lab collection have not been tested for TensorRT Edge-LLM accuracy, acceptance rate, or runtime compatibility. When evaluating performance for the listed pairs, match the generation behavior of the paired Hugging Face model: enable thinking for Qwen3.5 DFlash models and disable thinking for Qwen3 DFlash models.

| Draft checkpoint | Base model | Draft config class |
| --- | --- | --- |
| z-lab/Qwen3-4B-DFlash-b16 | Qwen/Qwen3-4B-Instruct-2507 | `DFlashDraftModel` |
| z-lab/Qwen3-8B-DFlash-b16 | Qwen/Qwen3-8B | `DFlashDraftModel` |
| z-lab/Qwen3.5-4B-DFlash | Qwen/Qwen3.5-4B | `DFlashDraftModel` |
| z-lab/Qwen3.5-4B-DFlash (quantized checkpoint: `Qwen3.5-4B-DFlash-NVFP4`) | Qwen3.5-4B-NVFP4 | `DFlashDraftModel` |
| z-lab/Qwen3.5-9B-DFlash | Qwen/Qwen3.5-9B | `DFlashDraftModel` |
| z-lab/Qwen3.5-27B-DFlash | Qwen/Qwen3.5-27B | `DFlashDraftModel` |
| z-lab/Qwen3.5-35B-A3B-DFlash | Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 | `DFlashDraftModel` |
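
When scripting an evaluation, the pairing table and the thinking guidance can be kept together in one lookup. The pairs below are copied from the table (the quantized Qwen3.5-4B row is omitted because it reuses the same draft ID); thinking_enabled is a hypothetical helper that only returns the setting to apply and does not call any library.

```python
# Pairs are copied from the DFlash table; the quantized Qwen3.5-4B row is
# left out because it shares its draft ID with the unquantized row.
DFLASH_BASE = {
    "z-lab/Qwen3-4B-DFlash-b16": "Qwen/Qwen3-4B-Instruct-2507",
    "z-lab/Qwen3-8B-DFlash-b16": "Qwen/Qwen3-8B",
    "z-lab/Qwen3.5-4B-DFlash": "Qwen/Qwen3.5-4B",
    "z-lab/Qwen3.5-9B-DFlash": "Qwen/Qwen3.5-9B",
    "z-lab/Qwen3.5-27B-DFlash": "Qwen/Qwen3.5-27B",
    "z-lab/Qwen3.5-35B-A3B-DFlash": "Qwen/Qwen3.5-35B-A3B-GPTQ-Int4",
}


def thinking_enabled(draft_id: str) -> bool:
    """Qwen3.5 DFlash pairs are evaluated with thinking on, Qwen3 pairs with it off."""
    if draft_id not in DFLASH_BASE:
        raise KeyError(f"{draft_id} is not a listed DFlash draft")
    return draft_id.startswith("z-lab/Qwen3.5-")
```

Rejecting unlisted draft IDs reflects the statement that other DFlash drafts are untested with TensorRT Edge-LLM.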