Supported Models#

Code Location: tensorrt_edgellm/ (export), cpp/ (runtime)

Large Language Models (LLMs)#

Llama Family#

| Model | Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|---|
| Llama-3-8B-Instruct | 8B | | | | |
| Llama-3.1-8B-Instruct | 8B | | | | |
| Llama-3.2-3B-Instruct | 3B | | | | |

Qwen2/2.5 Family#

| Model | Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|---|
| Qwen2-0.5B-Instruct | 0.5B | | | | |
| Qwen2-1.5B-Instruct | 1.5B | | | | |
| Qwen2-7B-Instruct | 7B | | | | |
| Qwen2.5-0.5B-Instruct | 0.5B | | | | |
| Qwen2.5-1.5B-Instruct | 1.5B | | | | |
| Qwen2.5-3B-Instruct | 3B | | | | |
| Qwen2.5-7B-Instruct | 7B | | | | |

Qwen3 Family#

| Model | Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|---|
| Qwen3-0.6B | 0.6B | | | | |
| Qwen3-1.7B | 1.7B | | | | |
| Qwen3-4B-Instruct-2507 | 4B | | | | |
| Qwen3-8B | 8B | | | | |

DeepSeek-R1 Distilled Family#

| Model | Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | | | | |
| DeepSeek-R1-Distill-Qwen-7B | 7B | | | | |


Vision-Language Models (VLMs)#

| Model | Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|---|
| Qwen2-VL-2B-Instruct | 2B | | | | |
| Qwen2-VL-7B-Instruct | 7B | | | | |
| Qwen2.5-VL-3B-Instruct | 3B | | | | |
| Qwen2.5-VL-7B-Instruct | 7B | | | | |
| Qwen3-VL-2B-Instruct | 2B | | | | |
| Qwen3-VL-4B-Instruct | 4B | | | | |
| Qwen3-VL-8B-Instruct | 8B | | | | |
| InternVL3-1B | 1B | | | | |
| InternVL3-2B | 2B | | | | |
| Phi-4-multimodal-instruct | 5.6B | | | | |


Precision Support#

| Precision | Memory | Compute | Platform Requirements | Best For |
|---|---|---|---|---|
| FP16 | 1x (baseline) | FP16 | All platforms | Accuracy baseline, universal compatibility |
| FP8 | 2x reduction | FP8 GEMMs + FP16 | SM89+ (Ada Lovelace and newer) | Balanced performance on modern GPUs |
| INT4 AWQ | 4x reduction | FP16 (AWQ quantization) | All platforms | Memory-constrained devices |
| INT4 GPTQ | 4x reduction | FP16 (GPTQ quantization) | All platforms | Memory-constrained devices |
| NVFP4 | 4x reduction | NVFP4 GEMMs + FP16 | SM100+ (Blackwell and newer) | Thor platforms (recommended) |
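As a quick sanity check on the memory column above, weight storage scales with bits per parameter: roughly 2 bytes per weight for FP16, 1 byte for FP8, and half a byte for INT4 and NVFP4 (quantization-scale overhead ignored). A minimal back-of-the-envelope sketch in Python, assuming weight-only storage; the model name and parameter count are taken from the Llama table above, everything else is illustrative:

```python
# Back-of-the-envelope weight memory for the precisions in the table above.
# Ignores activation memory, KV cache, and quantization-scale overhead.

BYTES_PER_PARAM = {
    "FP16": 2.0,        # baseline (1x)
    "FP8": 1.0,         # 2x reduction
    "INT4 AWQ": 0.5,    # 4x reduction
    "INT4 GPTQ": 0.5,   # 4x reduction
    "NVFP4": 0.5,       # 4x reduction
}

def weight_memory_gib(num_params: float, precision: str) -> float:
    """Approximate weight memory in GiB for a given parameter count."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30

for precision in BYTES_PER_PARAM:
    print(f"Llama-3-8B @ {precision}: ~{weight_memory_gib(8e9, precision):.1f} GiB")
# FP16 ~14.9 GiB, FP8 ~7.5 GiB, INT4/NVFP4 ~3.7 GiB
```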

Additional Features#

  • FP8 Vision Encoder: Supported for visual models on SM89+

  • FP8/NVFP4 LM Head: Supported for language model heads with platform-specific requirements

Experimental Support: INT8 SmoothQuant and MXFP8 are experimental features; functionality, accuracy, and performance are not guaranteed.


Platform Compatibility#

| GPU Architecture | Compute Capability | Supported Precisions |
|---|---|---|
| All Platforms | Any | FP16, INT4 AWQ, INT4 GPTQ |
| Ada Lovelace+ | SM89+ | FP16, FP8, INT4 AWQ, INT4 GPTQ |
| Blackwell+ | SM100+ | FP16, FP8, INT4 AWQ, INT4 GPTQ, NVFP4 |

Notes:

  • FP16 and INT4 (AWQ/GPTQ) quantization methods work on all CUDA-capable platforms

  • FP8 quantization requires SM89+ (Ada Lovelace architecture or newer, e.g., RTX 40-series)

  • NVFP4 quantization requires SM100+ (Blackwell architecture or newer, e.g., Thor platforms)

  • Platform requirements apply to both model weights and operations (including ViT encoders and LM heads)
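To apply these requirements programmatically, the device's compute capability can be queried and mapped to the rows of the table above. A minimal sketch using PyTorch's standard `torch.cuda.get_device_capability`; the mapping function simply restates the table and is not an official TensorRT Edge-LLM API:

```python
import torch

def supported_precisions(device: int = 0) -> list[str]:
    """Map a CUDA device's compute capability (SM version) to the
    precisions listed in the compatibility table above."""
    major, minor = torch.cuda.get_device_capability(device)
    sm = major * 10 + minor      # e.g. (8, 9) -> SM89
    precisions = ["FP16", "INT4 AWQ", "INT4 GPTQ"]  # all platforms
    if sm >= 89:
        precisions.append("FP8")    # Ada Lovelace and newer
    if sm >= 100:
        precisions.append("NVFP4")  # Blackwell and newer
    return precisions

if torch.cuda.is_available():
    print(supported_precisions())   # e.g. SM89 devices include FP8
```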

Development GPUs:

For development purposes, TensorRT Edge-LLM supports the following discrete GPU compute capabilities:

  • SM80: Ampere (e.g., A100, A30, A10)

  • SM86: Ampere (e.g., RTX 30 series, RTX Pro Ampere series)

  • SM89: Ada Lovelace (e.g., RTX 40 series, L4, L40, RTX Pro Ada series)

  • SM100: Blackwell (e.g., GB200)

  • SM120: Blackwell (e.g., RTX 50 series, RTX Pro Blackwell series)

Note: While these GPUs are supported for development and testing, please refer to TensorRT-LLM for performant inference solutions on these GPUs.