Quantization#

Quantization in TensorRT LLM#

Quantization is a technique used to reduce memory footprint and computational cost by converting the model’s weights and/or activations from high-precision floating-point numbers (like BF16) to lower-precision data types, such as INT8, FP8, or FP4.
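
For intuition, the snippet below is a minimal PyTorch sketch of symmetric per-tensor FP8 (E4M3) quantization and dequantization. It only illustrates the general idea (scale by the tensor's absolute maximum, round into the low-precision format, rescale on the way back); it is not how TensorRT LLM implements its quantized kernels.

import torch

# A tensor in the original precision (BF16 here).
x = torch.randn(4, 8, dtype=torch.bfloat16)

# Symmetric per-tensor scale: map the tensor's amax onto the largest E4M3 value (448).
scale = x.abs().max().float() / 448.0

# Quantize to FP8 (E4M3), then dequantize to inspect the rounding error.
x_fp8 = (x.float() / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)
x_dq = x_fp8.float() * scale
print((x.float() - x_dq).abs().max())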

TensorRT LLM offers a variety of quantization recipes to optimize LLM inference. These recipes can be broadly categorized as follows:

  • FP4

  • FP8 Per Tensor

  • FP8 Block Scaling

  • FP8 Rowwise

  • FP8 KV Cache

  • NVFP4 KV Cache

  • W4A16 GPTQ

  • W4A8 GPTQ

  • W4A16 AWQ

  • W4A8 AWQ

Usage#

The default PyTorch backend supports FP4 quantization on Blackwell GPUs and FP8 quantization on Blackwell and Hopper GPUs (see the Hardware Support Matrix below).

Running Pre-quantized Models#

TensorRT LLM can directly run pre-quantized models generated with the NVIDIA TensorRT Model Optimizer (ModelOpt).

from tensorrt_llm import LLM

# Load a pre-quantized FP8 checkpoint directly from the Hugging Face Hub.
llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8')
llm.generate("Hello, my name is")
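
The LLM object can also be combined with sampling parameters for batched generation. The snippet below is a usage sketch; the prompts and sampling values are arbitrary.

from tensorrt_llm import LLM, SamplingParams

llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8')

# Arbitrary example prompts and sampling settings.
sampling_params = SamplingParams(max_tokens=64, temperature=0.8, top_p=0.95)
outputs = llm.generate(["Hello, my name is", "The capital of France is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)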

FP8 KV Cache#

Note

TensorRT LLM allows you to enable the FP8 KV cache manually, even for checkpoints that do not have it enabled by default.

Here is an example of how to set the FP8 KV Cache option:

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Store the KV cache in FP8, regardless of the checkpoint's default setting.
llm = LLM(model='/path/to/model',
          kv_cache_config=KvCacheConfig(dtype='fp8'))
llm.generate("Hello, my name is")
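
The FP8 cache dtype can be combined with other KvCacheConfig options. The sketch below uses free_gpu_memory_fraction and enable_block_reuse as examples of such options; treat the specific values as placeholders.

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# FP8 KV cache combined with other common cache settings (values are placeholders).
kv_cache_config = KvCacheConfig(
    dtype='fp8',
    free_gpu_memory_fraction=0.85,  # portion of free GPU memory reserved for the cache
    enable_block_reuse=True,        # reuse cached blocks across requests sharing a prefix
)
llm = LLM(model='/path/to/model', kv_cache_config=kv_cache_config)
llm.generate("Hello, my name is")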

NVFP4 KV Cache#

Enabling the NVFP4 KV cache requires offline quantization with ModelOpt; follow the instructions in the Offline Quantization with ModelOpt section below. Once quantization is complete, set the NVFP4 KV cache option as follows:

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Store the KV cache in NVFP4; requires a checkpoint quantized with NVFP4 KV cache enabled.
llm = LLM(model='/path/to/model',
          kv_cache_config=KvCacheConfig(dtype='nvfp4'))
llm.generate("Hello, my name is")

Offline Quantization with ModelOpt#

If a pre-quantized model is not available on the Hugging Face Hub, you can quantize it offline using ModelOpt.

Follow this step-by-step guide to quantize a model:

git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer/examples/llm_ptq
scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8
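
After the script finishes, the exported checkpoint directory can be loaded with the LLM API exactly like a pre-quantized Hub model. The path below is a placeholder; the actual output location depends on the script's arguments.

from tensorrt_llm import LLM

# Placeholder path: point this at the checkpoint directory exported by the ModelOpt script.
llm = LLM(model='/path/to/quantized/checkpoint')
llm.generate("Hello, my name is")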

NVFP4 KV Cache#

To generate the checkpoint for NVFP4 KV cache:

git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer/examples/llm_ptq
scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8 --kv_cache_quant nvfp4

Note that TensorRT LLM currently supports only FP8 weight/activation quantization when the NVFP4 KV cache is enabled, so --quant fp8 is required here.

Model Support Matrix#

| Model | NVFP4 | MXFP4 | FP8 (per tensor) | FP8 (block scaling) | FP8 (rowwise) | FP8 KV Cache | NVFP4 KV Cache | W4A8 AWQ | W4A16 AWQ | W4A8 GPTQ | W4A16 GPTQ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT | . | . | . | . | . | Y | . | . | . | . | . |
| DeepSeek-R1 | Y | . | . | Y | . | Y | . | . | . | . | . |
| EXAONE | . | . | Y | . | . | Y | . | Y | Y | . | . |
| Gemma 3 | . | . | Y | . | . | Y | . | Y | Y | . | . |
| GPT-OSS | . | Y | . | . | . | Y | . | . | . | . | . |
| LLaMA | Y | . | Y | . | . | Y | . | . | Y | . | Y |
| LLaMA-v2 | Y | . | Y | . | . | Y | Y | Y | Y | . | Y |
| LLaMA 3 | . | . | . | . | Y | Y | Y | Y | . | . | . |
| LLaMA 4 | Y | . | Y | . | . | Y | . | . | . | . | . |
| Mistral | . | . | Y | . | . | Y | . | . | Y | . | . |
| Mixtral | Y | . | Y | . | . | Y | . | . | . | . | . |
| Phi | . | . | . | . | . | Y | . | Y | . | . | . |
| Qwen | . | . | . | . | . | Y | . | Y | Y | . | Y |
| Qwen-2/2.5 | Y | . | Y | . | . | Y | . | Y | Y | . | Y |
| Qwen-3 | Y | . | Y | . | . | Y | Y | . | Y | . | Y |
| BLIP2-OPT | . | . | . | . | . | Y | . | . | . | . | . |
| BLIP2-T5 | . | . | . | . | . | Y | . | . | . | . | . |
| LLaVA | . | . | Y | . | . | Y | . | . | Y | . | Y |
| VILA | . | . | Y | . | . | Y | . | . | Y | . | Y |
| Nougat | . | . | . | . | . | Y | . | . | . | . | . |

Note

The vision component of multi-modal models (BLIP2-OPT, BLIP2-T5, LLaVA, VILA, Nougat) uses FP16 by default. The language component determines which quantization methods a given multi-modal model supports.

Hardware Support Matrix#

| GPU Architecture | NVFP4 | MXFP4 | FP8 (per tensor) | FP8 (block scaling) | FP8 (rowwise) | FP8 KV Cache | NVFP4 KV Cache | W4A8 AWQ | W4A16 AWQ | W4A8 GPTQ | W4A16 GPTQ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Blackwell (sm120) | Y | Y | Y | . | . | Y | . | . | . | . | . |
| Blackwell (sm100) | Y | Y | Y | Y | . | Y | Y | . | . | . | . |
| Hopper | . | . | Y | Y | Y | Y | . | Y | Y | Y | Y |
| Ada Lovelace | . | . | Y | . | . | Y | . | Y | Y | Y | Y |
| Ampere | . | . | . | . | . | Y | . | . | Y | . | Y |

Note

The FP8 block-scaling GEMM kernels for sm100 use the MXFP8 recipe (E4M3 activations/weights with UE8M0 activation/weight scales), which differs slightly from the SM90 FP8 recipe (E4M3 activations/weights with FP32 activation/weight scales).
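
As a rough illustration of the difference between the two recipes, the sketch below computes a per-block scale both ways for the same amax: an FP32 scale can match the amax exactly, while a UE8M0 scale is exponent-only and is rounded to a power of two (rounded up here so the block still fits in E4M3). This is a simplified view of the recipes named above, not the kernel implementation.

import math

E4M3_MAX = 448.0
amax = 3.7  # example per-block absolute maximum

# SM90-style recipe: FP32 scale, the amax maps exactly onto the E4M3 range.
scale_fp32 = amax / E4M3_MAX

# MXFP8-style recipe (sm100): UE8M0 scale is a pure power of two; round the
# exponent up so quantized values never exceed the E4M3 range.
scale_ue8m0 = 2.0 ** math.ceil(math.log2(amax / E4M3_MAX))

print(scale_fp32, scale_ue8m0)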