Support Matrix

TensorRT-LLM optimizes the performance of a range of well-known models on NVIDIA GPUs. The following sections provide a list of supported GPU architectures as well as important features implemented in TensorRT-LLM.

Hardware

The following table shows the supported hardware for TensorRT-LLM.

If a GPU is not listed, it is important to note that TensorRT-LLM is expected to work on GPUs based on the Volta, Turing, Ampere, Hopper, and Ada Lovelace architectures. Certain limitations may, however, apply.

Hardware Compatibility

Operating System

TensorRT-LLM requires Linux x86_64 or Windows.

GPU Model Architectures

Software

The following table shows the supported software for TensorRT-LLM.

Software Compatibility

Container

[23.10](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html#framework-matrix-2023)

TensorRT

[9.2](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/index.html)

Precision

  • Hopper (SM90) - FP32, FP16, BF16, FP8, INT8, INT4

  • Ada Lovelace (SM89) - FP32, FP16, BF16, FP8, INT8, INT4

  • Ampere (SM80, SM86) - FP32, FP16, BF16, INT8, INT4(3)

  • Turing (SM75) - FP32, FP16, INT8(1), INT4(2)

  • Volta (SM70) - FP32, FP16, INT8(1), INT4(2)

Models

Multi-Modal Models (5)

(1) INT8 SmoothQuant is not supported on SM70 and SM75.
(2) INT4 AWQ and GPTQ are not supported on SM < 80.
(3) INT4 AWQ and GPTQ with FP8 activations require SM >= 89.
(4) Encoder-Decoder provides general encoder-decoder functionality that supports many encoder-decoder models such as T5 family, BART family, Whisper family, NMT family, and so on. (5) Multi-modal provides general multi-modal functionality that supports many multi-modal architectures such as BLIP family, LLaVA family, and so on.

Note

Support for FP8 and quantized data types (INT8 or INT4) is not implemented for all the models. Refer to Numerical Precision and examples folder for additional information.