Support Matrix#

TensorRT-LLM optimizes the performance of a range of well-known models on NVIDIA GPUs. The following sections provide a list of supported GPU architectures as well as important features implemented in TensorRT-LLM.

Models (PyTorch Backend)#

| Architecture | Model | HuggingFace Example | Modality |
|--------------|-------|---------------------|----------|
| BertForSequenceClassification | BERT-based | textattack/bert-base-uncased-yelp-polarity | L |
| DeciLMForCausalLM | Nemotron | nvidia/Llama-3_1-Nemotron-51B-Instruct | L |
| DeepseekV3ForCausalLM | DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | L |
| LlavaLlamaModel | VILA | Efficient-Large-Model/NVILA-8B | L + V |
| LlavaNextForConditionalGeneration | LLaVA-NeXT | llava-hf/llava-v1.6-mistral-7b-hf | L + V |
| LlamaForCausalLM | Llama 3.1, Llama 3, Llama 2, LLaMA | meta-llama/Meta-Llama-3.1-70B | L |
| Llama4ForConditionalGeneration | Llama 4 | meta-llama/Llama-4-Scout-17B-16E-Instruct | L |
| MistralForCausalLM | Mistral | mistralai/Mistral-7B-v0.1 | L |
| MixtralForCausalLM | Mixtral | mistralai/Mixtral-8x7B-v0.1 | L |
| MllamaForConditionalGeneration | Llama 3.2 | meta-llama/Llama-3.2-11B-Vision | L |
| NemotronForCausalLM | Nemotron-3, Nemotron-4, Minitron | nvidia/Minitron-8B-Base | L |
| NemotronNASForCausalLM | NemotronNAS | nvidia/Llama-3_3-Nemotron-Super-49B-v1 | L |
| Qwen2ForCausalLM | QwQ, Qwen2 | Qwen/Qwen2-7B-Instruct | L |
| Qwen2ForProcessRewardModel | Qwen2-based | Qwen/Qwen2.5-Math-PRM-7B | L |
| Qwen2ForRewardModel | Qwen2-based | Qwen/Qwen2.5-Math-RM-72B | L |
| Qwen2VLForConditionalGeneration | Qwen2-VL | Qwen/Qwen2-VL-7B-Instruct | L + V |
| Qwen2_5_VLForConditionalGeneration | Qwen2.5-VL | Qwen/Qwen2.5-VL-7B-Instruct | L + V |

Note:

  • L: Language only

  • L + V: Language and vision (multimodal) support

  • Llama 3.2 accepts vision input, but our support is currently limited to text only.

Models (TensorRT Backend)#

LLM Models#

Multi-Modal Models [3]#

Hardware#

The following table shows the supported hardware for TensorRT-LLM.

If a GPU architecture is not listed, the TensorRT-LLM team does not develop or test the software on that architecture, and support is limited to community support. In addition, older architectures can have limitations for newer software releases.

| Hardware Compatibility | |
|------------------------|--|
| Operating System | TensorRT-LLM requires Linux x86_64 or Linux aarch64. |
| GPU Model Architectures | |
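The operating-system requirement above can be checked with a quick platform probe. This helper is a local sketch, not a TensorRT-LLM utility:

```python
import platform

def os_supported() -> bool:
    # Per the hardware table above: TensorRT-LLM requires Linux
    # on x86_64 or aarch64.
    return (
        platform.system() == "Linux"
        and platform.machine() in ("x86_64", "aarch64")
    )
```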

Software#

The following table shows the supported software for TensorRT-LLM.

| Software Compatibility | Version |
|------------------------|---------|
| Container | 25.04 |
| TensorRT | 10.10 |

Precision:

  • Hopper (SM90) - FP32, FP16, BF16, FP8, INT8, INT4

  • Ada Lovelace (SM89) - FP32, FP16, BF16, FP8, INT8, INT4

  • Ampere (SM80, SM86) - FP32, FP16, BF16, INT8, INT4[5]
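The precision list can also be expressed as a lookup keyed by SM (compute capability) number, which makes the Hopper/Ada-only availability of FP8 easy to test for. This is an illustrative sketch mirroring the list above, not a TensorRT-LLM API:

```python
# Illustrative sketch of the precision support list above; not a
# TensorRT-LLM API. Keys are SM numbers, values the listed precisions.
PRECISIONS = {
    90: {"FP32", "FP16", "BF16", "FP8", "INT8", "INT4"},  # Hopper
    89: {"FP32", "FP16", "BF16", "FP8", "INT8", "INT4"},  # Ada Lovelace
    80: {"FP32", "FP16", "BF16", "INT8", "INT4"},         # Ampere
    86: {"FP32", "FP16", "BF16", "INT8", "INT4"},         # Ampere
}

def supports_fp8(sm: int) -> bool:
    """FP8 requires SM89 (Ada Lovelace) or SM90 (Hopper) per the list above."""
    return "FP8" in PRECISIONS.get(sm, set())
```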

Note

Support for FP8 and quantized data types (INT8 or INT4) is not implemented for all models. Refer to Numerical Precision and the examples folder for additional information.