Supported Models#
Code Location: `tensorrt_edgellm/` (export), `cpp/` (runtime)
Large Language Models (LLMs)#
Llama Family#
| Model | Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|---|
| | 8B | ✅ | ✅ | ✅ | ✅ |
| | 8B | ✅ | ✅ | ✅ | ✅ |
| | 3B | ✅ | ✅ | ✅ | ✅ |
Qwen2/2.5 Family#
| Model | Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|---|
| | 0.5B | ✅ | ✅ | ✅ | ✅ |
| | 1.5B | ✅ | ✅ | ✅ | ✅ |
| | 7B | ✅ | ✅ | ✅ | ✅ |
| | 0.5B | ✅ | ✅ | ✅ | ✅ |
| | 1.5B | ✅ | ✅ | ✅ | ✅ |
| | 3B | ✅ | ✅ | ✅ | ✅ |
| | 7B | ✅ | ✅ | ✅ | ✅ |
Qwen3 Family#
| Model | Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|---|
| | 0.6B | ✅ | ✅ | ✅ | ✅ |
| | 1.7B | ✅ | ✅ | ✅ | ✅ |
| | 4B | ✅ | ✅ | ✅ | ✅ |
| | 8B | ✅ | ✅ | ✅ | ✅ |
DeepSeek-R1 Distilled Family#
| Model | Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|---|
| | 1.5B | ✅ | ✅ | ✅ | ✅ |
| | 7B | ✅ | ✅ | ✅ | ✅ |
Vision-Language Models (VLMs)#
| Model | Parameters | FP16 | FP8 | INT4 | NVFP4 |
|---|---|---|---|---|---|
| | 2B | ✅ | ✅ | ✅ | ✅ |
| | 7B | ✅ | ✅ | ✅ | ✅ |
| | 3B | ✅ | ✅ | ✅ | ✅ |
| | 7B | ✅ | ✅ | ✅ | ✅ |
| | 2B | ✅ | ✅ | ✅ | ✅ |
| | 4B | ✅ | ✅ | ✅ | ✅ |
| | 8B | ✅ | ✅ | ✅ | ✅ |
| | 1B | ✅ | ✅ | ✅ | ✅ |
| | 2B | ✅ | ✅ | ✅ | ✅ |
| | 5.6B | ✅ | ✅ | ✅ | ✅ |
Precision Support#
| Precision | Memory | Compute | Platform Requirements | Best For |
|---|---|---|---|---|
| FP16 | 1x (baseline) | FP16 | All platforms | Accuracy baseline, universal compatibility |
| FP8 | 2x reduction | FP8 GEMMs + FP16 | SM89+ (Ada Lovelace and newer) | Balanced performance on modern GPUs |
| INT4 AWQ | 4x reduction | FP16 (AWQ quantization) | All platforms | Memory-constrained devices |
| INT4 GPTQ | 4x reduction | FP16 (GPTQ quantization) | All platforms | Memory-constrained devices |
| NVFP4 | 4x reduction | NVFP4 GEMMs + FP16 | SM100+ (Blackwell and newer) | Thor platforms (recommended) |
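The Memory column gives the weight-footprint reduction relative to FP16, so a rough weight-memory estimate for any model size follows directly. A minimal sketch of that arithmetic (the function name and the 2-bytes-per-FP16-parameter assumption are mine for illustration; it ignores activations, the KV cache, and per-group quantization scales):

```python
def weight_memory_gb(num_params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB, using the reduction
    factors from the precision table above."""
    reduction = {
        "FP16": 1,      # baseline: 16 bits per weight
        "FP8": 2,       # 8 bits per weight -> 2x reduction
        "INT4 AWQ": 4,  # 4 bits per weight -> 4x reduction
        "INT4 GPTQ": 4,
        "NVFP4": 4,
    }[precision]
    fp16_bytes = num_params_billion * 1e9 * 2  # 2 bytes per FP16 param
    return fp16_bytes / reduction / 1e9        # back to GB

# Example: an 8B model drops from ~16 GB of weights at FP16
# to ~4 GB at INT4, which is what makes edge deployment viable.
print(weight_memory_gb(8, "FP16"))      # 16.0
print(weight_memory_gb(8, "INT4 AWQ"))  # 4.0
```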
Additional Features#
- FP8 Vision Encoder: Supported for vision models on SM89+
- FP8/NVFP4 LM Head: Supported for language model heads, subject to the platform requirements above
- Experimental Support: INT8 SmoothQuant and MXFP8 are experimental features; functionality, accuracy, and performance are not guaranteed.
Platform Compatibility#
| GPU Architecture | Compute Capability | Supported Precisions |
|---|---|---|
| All platforms | Any | FP16, INT4 AWQ, INT4 GPTQ |
| Ada Lovelace+ | SM89+ | FP16, FP8, INT4 AWQ, INT4 GPTQ |
| Blackwell+ | SM100+ | FP16, FP8, INT4 AWQ, INT4 GPTQ, NVFP4 |
Notes:

- FP16 and INT4 (AWQ/GPTQ) quantization methods work on all CUDA-capable platforms
- FP8 quantization requires SM89+ (Ada Lovelace architecture or newer, e.g., RTX 40-series)
- NVFP4 quantization requires SM100+ (Blackwell architecture or newer, e.g., Thor platforms)
- Platform requirements apply to both model weights and operations (including ViT encoders and LM heads)
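The compatibility rules above reduce to two thresholds on the compute capability. A hedged sketch (the helper name and the integer SM encoding, e.g. 89 for SM89, are assumptions, not part of the TensorRT Edge-LLM API):

```python
def supported_precisions(sm: int) -> list[str]:
    """Precisions available on a GPU with the given compute
    capability, per the platform compatibility table above."""
    precisions = ["FP16", "INT4 AWQ", "INT4 GPTQ"]  # all platforms
    if sm >= 89:
        precisions.append("FP8")    # Ada Lovelace and newer
    if sm >= 100:
        precisions.append("NVFP4")  # Blackwell and newer
    return precisions

print(supported_precisions(86))   # Ampere: FP16 + INT4 only
print(supported_precisions(120))  # Blackwell: all precisions
```

On a live system the SM value can be derived from `torch.cuda.get_device_capability()`, which returns a `(major, minor)` pair, as `major * 10 + minor`.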
Development GPUs:

For development purposes, TensorRT Edge-LLM supports the following discrete GPU compute capabilities:

- SM80: Ampere (e.g., A100, A30, A10)
- SM86: Ampere (e.g., RTX 30 series, RTX Pro Ampere series)
- SM89: Ada Lovelace (e.g., RTX 40 series, L4, L40, RTX Pro Ada series)
- SM100: Blackwell (e.g., GB200)
- SM120: Blackwell (e.g., RTX 50 series, RTX Pro Blackwell series)
Note: While these GPUs are supported for development and testing, refer to TensorRT-LLM for performant inference solutions on these GPUs.