# Supported Models

**Code Location:** `tensorrt_edgellm/` (export), `cpp/` (runtime)
## Large Language Models (LLMs)

### Llama Family

| Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|
| 8B | ✅ | ✅ | ✅ | ✅ |
| 8B | ✅ | ✅ | ✅ | ✅ |
| 3B | ✅ | ✅ | ✅ | ✅ |
### Qwen2/2.5 Family

| Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|
| 0.5B | ✅ | ✅ | ✅ | ✅ |
| 1.5B | ✅ | ✅ | ✅ | ✅ |
| 7B | ✅ | ✅ | ✅ | ✅ |
| 0.5B | ✅ | ✅ | ✅ | ✅ |
| 1.5B | ✅ | ✅ | ✅ | ✅ |
| 3B | ✅ | ✅ | ✅ | ✅ |
| 7B | ✅ | ✅ | ✅ | ✅ |
### Qwen3 Family

| Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|
| 0.6B | ✅ | ✅ | ✅ | ✅ |
| 4B | ✅ | ✅ | ✅ | ✅ |
| 8B | ✅ | ✅ | ✅ | ✅ |
### DeepSeek-R1 Distilled Family

| Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|
| 1.5B | ✅ | ✅ | ✅ | ✅ |
| 7B | ✅ | ✅ | ✅ | ✅ |
## Vision-Language Models (VLMs)

| Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|
| 2B | ✅ | ✅ | ✅ | ✅ |
| 7B | ✅ | ✅ | ✅ | ✅ |
| 3B | ✅ | ✅ | ✅ | ✅ |
| 7B | ✅ | ✅ | ✅ | ✅ |
| 2B | ✅ | ✅ | ✅ | ✅ |
| 4B | ✅ | ✅ | ✅ | ✅ |
| 8B | ✅ | ✅ | ✅ | ✅ |
| 1B | ✅ | ✅ | ✅ | ✅ |
| 2B | ✅ | ✅ | ✅ | ✅ |
| 5.6B | ✅ | ✅ | ✅ | ✅ |
## Precision Support

| Precision | Memory (vs. FP16) | Compute | Platform Requirements | Best For |
|---|---|---|---|---|
| FP16 | 1x (baseline) | FP16 | All platforms | Accuracy baseline, universal compatibility |
| FP8 | 2x reduction | FP8 GEMMs + FP16 | SM89+ (Ada Lovelace and newer) | Balanced performance on modern GPUs |
| INT4 AWQ | 4x reduction | FP16 (AWQ quantization) | All platforms | Memory-constrained devices |
| INT4 GPTQ | 4x reduction | FP16 (GPTQ quantization) | All platforms | Memory-constrained devices |
| NVFP4 | 4x reduction | NVFP4 GEMMs + FP16 | SM100+ (Blackwell and newer) | Thor platforms (recommended) |
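
To make the memory column concrete, here is a rough back-of-the-envelope sketch (illustrative only, not part of TensorRT Edge-LLM; it counts weight bytes alone and ignores KV cache, activations, and quantization scale overheads):

```python
# Rough weight-footprint estimate per precision; counts parameter bytes only
# and ignores KV cache, activations, and quantization scale/zero-point overhead.
BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "INT4": 4, "NVFP4": 4}

def weight_memory_gib(num_params: float, precision: str) -> float:
    """Approximate weight memory in GiB for a model with num_params parameters."""
    return num_params * BITS_PER_PARAM[precision] / 8 / 2**30

# Example: an 8B-parameter model.
for precision in BITS_PER_PARAM:
    print(f"{precision}: {weight_memory_gib(8e9, precision):.1f} GiB")
# FP16: 14.9 GiB, FP8: 7.5 GiB, INT4: 3.7 GiB, NVFP4: 3.7 GiB
```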
## Additional Features

- **FP8 Vision Encoder:** supported for vision models (Qwen2-VL, InternVL3) on SM89+
- **FP8/NVFP4 LM Head:** supported for language-model heads, subject to the platform requirements above
## Platform Compatibility

| GPU Architecture | Compute Capability | Supported Precisions |
|---|---|---|
| All platforms | Any | FP16, INT4 AWQ, INT4 GPTQ |
| Ada Lovelace+ | SM89+ | FP16, FP8, INT4 AWQ, INT4 GPTQ |
| Blackwell+ | SM100+ | FP16, FP8, INT4 AWQ, INT4 GPTQ, NVFP4 |
**Notes:**

- FP16 and INT4 (AWQ/GPTQ) quantization work on all CUDA-capable platforms
- FP8 quantization requires SM89+ (Ada Lovelace architecture or newer, e.g., RTX 40-series)
- NVFP4 quantization requires SM100+ (Blackwell architecture or newer, e.g., Thor platforms)
- Platform requirements apply to both model weights and operations (including ViT encoders and LM heads); a sketch of this capability check follows this list
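
As an illustration of how these constraints could be checked before choosing a quantization mode, the sketch below queries the device's compute capability with PyTorch (`torch.cuda.get_device_capability` is standard PyTorch; the precision mapping mirrors the table above and is not TensorRT Edge-LLM's actual API):

```python
import torch

# Minimum compute capability (SM version) per precision, per the table above.
MIN_SM = {
    "FP16": 0,        # all CUDA-capable platforms
    "INT4 AWQ": 0,    # all CUDA-capable platforms
    "INT4 GPTQ": 0,   # all CUDA-capable platforms
    "FP8": 89,        # Ada Lovelace and newer
    "NVFP4": 100,     # Blackwell and newer
}

def supported_precisions(device: int = 0) -> list[str]:
    """Return the precisions the given GPU can run, by compute capability."""
    major, minor = torch.cuda.get_device_capability(device)
    sm = major * 10 + minor  # e.g., (8, 9) -> SM89, (10, 0) -> SM100
    return [p for p, min_sm in MIN_SM.items() if sm >= min_sm]

# On an SM89 GPU (e.g., RTX 4090) this prints:
# ['FP16', 'INT4 AWQ', 'INT4 GPTQ', 'FP8']
print(supported_precisions())
```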
**Development GPUs:**

For development purposes, TensorRT Edge-LLM supports the following discrete-GPU compute capabilities:

- SM80: Ampere (e.g., A100, A30, A10)
- SM86: Ampere (e.g., RTX 30 series, RTX Pro Ampere series)
- SM89: Ada Lovelace (e.g., RTX 40 series, L4, L40, RTX Pro Ada series)
- SM100: Blackwell (e.g., GB200)
- SM120: Blackwell (e.g., RTX 50 series, RTX Pro Blackwell series)
**Note:** While these GPUs are supported for development and testing, the officially supported deployment platforms are NVIDIA Jetson Thor (JetPack 7.1) and NVIDIA DRIVE Thor (DriveOS 7). For performant inference on these GPUs, please refer to TensorRT-LLM.