Supported Models#

Code Location: tensorrt_edgellm/ (export), cpp/ (runtime)

Large Language Models (LLMs)#

Llama Family#

| Model | Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|---|
| Llama-3-8B-Instruct | 8B | | | | |
| Llama-3.1-8B | 8B | | | | |
| Llama-3.2-3B | 3B | | | | |

Qwen2/2.5 Family#

| Model | Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|---|
| Qwen2-0.5B-Instruct | 0.5B | | | | |
| Qwen2-1.5B-Instruct | 1.5B | | | | |
| Qwen2-7B-Instruct | 7B | | | | |
| Qwen2.5-0.5B-Instruct | 0.5B | | | | |
| Qwen2.5-1.5B-Instruct | 1.5B | | | | |
| Qwen2.5-3B-Instruct | 3B | | | | |
| Qwen2.5-7B-Instruct | 7B | | | | |

Qwen3 Family#

| Model | Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|---|
| Qwen3-0.6B | 0.6B | | | | |
| Qwen3-4B-Instruct-2507 | 4B | | | | |
| Qwen3-8B | 8B | | | | |

DeepSeek-R1 Distilled Family#

| Model | Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | | | | |
| DeepSeek-R1-Distill-Qwen-7B | 7B | | | | |


Vision-Language Models (VLMs)#

| Model | Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|---|
| Qwen2-VL-2B-Instruct | 2B | | | | |
| Qwen2-VL-7B-Instruct | 7B | | | | |
| Qwen2.5-VL-3B-Instruct | 3B | | | | |
| Qwen2.5-VL-7B-Instruct | 7B | | | | |
| Qwen3-VL-2B-Instruct | 2B | | | | |
| Qwen3-VL-4B-Instruct | 4B | | | | |
| Qwen3-VL-8B-Instruct | 8B | | | | |
| InternVL3-1B | 1B | | | | |
| InternVL3-2B | 2B | | | | |
| Phi-4-multimodal-instruct | 5.6B | | | | |


Precision Support#

| Precision | Memory | Compute | Platform Requirements | Best For |
|---|---|---|---|---|
| FP16 | 1x (baseline) | FP16 | All platforms | Accuracy baseline, universal compatibility |
| FP8 | 2x reduction | FP8 GEMMs + FP16 | SM89+ (Ada Lovelace and newer) | Balanced performance on modern GPUs |
| INT4 AWQ | 4x reduction | FP16 (AWQ quantization) | All platforms | Memory-constrained devices |
| INT4 GPTQ | 4x reduction | FP16 (GPTQ quantization) | All platforms | Memory-constrained devices |
| NVFP4 | 4x reduction | NVFP4 GEMMs + FP16 | SM100+ (Blackwell and newer) | Thor platforms (recommended) |
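
The Memory column maps directly onto weight-footprint arithmetic: halving or quartering the bits stored per weight halves or quarters the resident weight memory. The following back-of-the-envelope sketch is illustrative only (the 8B example and helper name are ours); it counts weight bits alone and ignores KV cache, activations, and the per-group scale overhead that quantized formats add:

```python
# Back-of-the-envelope weight footprint per precision (weights only).
# Ignores KV cache, activations, and quantization scale overhead, so
# real memory use is somewhat higher. Illustrative, not an official tool.
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "INT4": 4, "NVFP4": 4}

def weight_memory_gib(num_params: float, precision: str) -> float:
    """Approximate dense-weight footprint in GiB."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 2**30

for precision in BITS_PER_WEIGHT:
    print(f"8B model @ {precision:5}: ~{weight_memory_gib(8e9, precision):.1f} GiB")
# 8B model @ FP16 : ~14.9 GiB
# 8B model @ FP8  : ~7.5 GiB
# 8B model @ INT4 : ~3.7 GiB (same for NVFP4)
```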

Additional Features#

  • FP8 Vision Encoder: Supported for vision-language models (Qwen2-VL, InternVL3) on SM89+

  • FP8/NVFP4 LM Head: Supported for the language-model head, subject to the platform requirements listed above


Platform Compatibility#

| GPU Architecture | Compute Capability | Supported Precisions |
|---|---|---|
| All platforms | Any | FP16, INT4 AWQ, INT4 GPTQ |
| Ada Lovelace+ | SM89+ | FP16, FP8, INT4 AWQ, INT4 GPTQ |
| Blackwell+ | SM100+ | FP16, FP8, INT4 AWQ, INT4 GPTQ, NVFP4 |

Notes:

  • FP16 inference and INT4 (AWQ/GPTQ) quantization work on all CUDA-capable platforms

  • FP8 quantization requires SM89+ (Ada Lovelace architecture or newer, e.g., RTX 40-series)

  • NVFP4 quantization requires SM100+ (Blackwell architecture or newer, e.g., Thor platforms)

  • Platform requirements apply to both model weights and operations (including ViT encoders and LM heads)
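
The architecture-to-precision mapping above can be checked programmatically at export time. The following is a minimal sketch, assuming PyTorch is available to query the device; it mirrors the table and is not part of the TensorRT Edge-LLM API:

```python
# Minimal sketch: map the current GPU's compute capability to the
# precisions in the table above. Assumes PyTorch is installed; this is
# an illustrative helper, not part of the TensorRT Edge-LLM API.
import torch

def supported_precisions(device: int = 0) -> list[str]:
    major, minor = torch.cuda.get_device_capability(device)
    sm = major * 10 + minor                          # e.g., (8, 9) -> SM89
    precisions = ["FP16", "INT4 AWQ", "INT4 GPTQ"]   # all CUDA platforms
    if sm >= 89:                                     # Ada Lovelace and newer
        precisions.append("FP8")
    if sm >= 100:                                    # Blackwell and newer
        precisions.append("NVFP4")
    return precisions

print(supported_precisions())
```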

Development GPUs:

For development purposes, TensorRT Edge-LLM supports the following discrete GPU compute capabilities:

  • SM80: Ampere (e.g., A100, A30)

  • SM86: Ampere (e.g., A10, RTX 30 series, RTX Pro Ampere series)

  • SM89: Ada Lovelace (e.g., RTX 40 series, L4, L40, RTX Pro Ada series)

  • SM100: Blackwell (e.g., GB200)

  • SM120: Blackwell (e.g., RTX 50 series, RTX Pro Blackwell series)

Note: While these GPUs are supported for development and testing, the officially supported deployment platforms are NVIDIA Jetson Thor (JetPack 7.1) and NVIDIA DRIVE Thor (DriveOS 7). For performant inference on these discrete GPUs, please refer to TensorRT-LLM.


Additional Resources#