Quantization#

Quantization in TensorRT LLM#

Quantization is a technique used to reduce memory footprint and computational cost by converting the model’s weights and/or activations from high-precision floating-point numbers (like BF16) to lower-precision data types, such as INT8, FP8, or FP4.
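
For intuition, the snippet below is a minimal PyTorch sketch of symmetric per-tensor FP8 (E4M3) quantization and dequantization. It only illustrates the general idea (scale by the tensor's absolute maximum, round into the low-precision format, rescale on the way back); it is not how TensorRT LLM implements its quantized kernels.

import torch

# A tensor in the original precision (BF16 here).
x = torch.randn(4, 8, dtype=torch.bfloat16)

# Symmetric per-tensor scale: map the tensor's amax onto the largest E4M3 value (448).
scale = x.abs().max().float() / 448.0

# Quantize to FP8 (E4M3), then dequantize to inspect the rounding error.
x_fp8 = (x.float() / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)
x_dq = x_fp8.float() * scale
print((x.float() - x_dq).abs().max())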

TensorRT LLM offers a variety of quantization recipes to optimize LLM inference. These recipes can be broadly categorized as follows:

  • FP4

  • FP8 Per Tensor

  • FP8 Block Scaling

  • FP8 Rowwise

  • FP8 KV Cache

  • NVFP4 KV Cache

  • W4A16 GPTQ

  • W4A8 GPTQ

  • W4A16 AWQ

  • W4A8 AWQ

Usage#

The default PyTorch backend supports FP4 quantization on Blackwell GPUs and FP8 quantization on Blackwell and Hopper GPUs (see the Hardware Support Matrix below).

Running Pre-quantized Models#

TensorRT LLM can directly run pre-quantized models generated with the NVIDIA TensorRT Model Optimizer (ModelOpt).

from tensorrt_llm import LLM

# Load a pre-quantized FP8 checkpoint directly from the Hugging Face Hub.
llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8')
llm.generate("Hello, my name is")
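
The LLM object can also be combined with sampling parameters for batched generation. The snippet below is a usage sketch; the prompts and sampling values are arbitrary.

from tensorrt_llm import LLM, SamplingParams

llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8')

# Arbitrary example prompts and sampling settings.
sampling_params = SamplingParams(max_tokens=64, temperature=0.8, top_p=0.95)
outputs = llm.generate(["Hello, my name is", "The capital of France is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)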

FP8 KV Cache#

Note

TensorRT LLM allows you to enable the FP8 KV cache manually, even for checkpoints that do not have it enabled by default.

Here is an example of how to set the FP8 KV Cache option:

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Store the KV cache in FP8, regardless of the checkpoint's default setting.
llm = LLM(model='/path/to/model',
          kv_cache_config=KvCacheConfig(dtype='fp8'))
llm.generate("Hello, my name is")
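
The FP8 cache dtype can be combined with other KvCacheConfig options. The sketch below uses free_gpu_memory_fraction and enable_block_reuse as examples of such options; treat the specific values as placeholders.

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# FP8 KV cache combined with other common cache settings (values are placeholders).
kv_cache_config = KvCacheConfig(
    dtype='fp8',
    free_gpu_memory_fraction=0.85,  # portion of free GPU memory reserved for the cache
    enable_block_reuse=True,        # reuse cached blocks across requests sharing a prefix
)
llm = LLM(model='/path/to/model', kv_cache_config=kv_cache_config)
llm.generate("Hello, my name is")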

NVFP4 KV Cache#

Enabling the NVFP4 KV cache requires offline quantization with ModelOpt; follow the instructions in the Offline Quantization with ModelOpt section below. Once quantization is complete, set the NVFP4 KV cache option as follows:

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Store the KV cache in NVFP4; requires a checkpoint quantized with NVFP4 KV cache enabled.
llm = LLM(model='/path/to/model',
          kv_cache_config=KvCacheConfig(dtype='nvfp4'))
llm.generate("Hello, my name is")

Offline Quantization with ModelOpt#

If a pre-quantized model is not available on the Hugging Face Hub, you can quantize it offline using ModelOpt.

Follow this step-by-step guide to quantize a model:

git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer/examples/llm_ptq
scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8
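
After the script finishes, the exported checkpoint directory can be loaded with the LLM API exactly like a pre-quantized Hub model. The path below is a placeholder; the actual output location depends on the script's arguments.

from tensorrt_llm import LLM

# Placeholder path: point this at the checkpoint directory exported by the ModelOpt script.
llm = LLM(model='/path/to/quantized/checkpoint')
llm.generate("Hello, my name is")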

NVFP4 KV Cache#

To generate the checkpoint for NVFP4 KV cache:

git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer/examples/llm_ptq
scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8 --kv_cache_quant nvfp4

Note that TensorRT LLM currently supports only FP8 weight/activation quantization when the NVFP4 KV cache is enabled, so --quant fp8 is required here.

Model Support Matrix#

| Model | NVFP4 | MXFP4 | FP8 (per tensor) | FP8 (block scaling) | FP8 (rowwise) | FP8 KV Cache | NVFP4 KV Cache | W4A8 AWQ | W4A16 AWQ | W4A8 GPTQ | W4A16 GPTQ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT | . | . | . | . | . | Y | . | . | . | . | . |
| DeepSeek-R1 | Y | . | . | Y | . | Y | . | . | . | . | . |
| EXAONE | . | . | Y | . | . | Y | . | Y | Y | . | . |
| Gemma 3 | . | . | Y | . | . | Y | . | Y | Y | . | . |
| GPT-OSS | . | Y | . | . | . | Y | . | . | . | . | . |
| LLaMA | Y | . | Y | . | . | Y | . | . | Y | . | Y |
| LLaMA-v2 | Y | . | Y | . | . | Y | Y | Y | Y | . | Y |
| LLaMA 3 | . | . | . | . | Y | Y | Y | Y | . | . | . |
| LLaMA 4 | Y | . | Y | . | . | Y | . | . | . | . | . |
| Mistral | . | . | Y | . | . | Y | . | . | Y | . | . |
| Mixtral | Y | . | Y | . | . | Y | . | . | . | . | . |
| Phi | . | . | . | . | . | Y | . | Y | . | . | . |
| Qwen | . | . | . | . | . | Y | . | Y | Y | . | Y |
| Qwen-2/2.5 | Y | . | Y | . | . | Y | . | Y | Y | . | Y |
| Qwen-3 | Y | . | Y | . | . | Y | Y | . | Y | . | Y |
| BLIP2-OPT | . | . | . | . | . | Y | . | . | . | . | . |
| BLIP2-T5 | . | . | . | . | . | Y | . | . | . | . | . |
| LLaVA | . | . | Y | . | . | Y | . | . | Y | . | Y |
| VILA | . | . | Y | . | . | Y | . | . | Y | . | Y |
| Nougat | . | . | . | . | . | Y | . | . | . | . | . |

Note

The vision component of multi-modal models (BLIP2-OPT, BLIP2-T5, LLaVA, VILA, Nougat) uses FP16 by default. The language component determines which quantization methods a given multi-modal model supports.

Hardware Support Matrix#

| GPU Architecture | NVFP4 | MXFP4 | FP8 (per tensor) | FP8 (block scaling) | FP8 (rowwise) | FP8 KV Cache | NVFP4 KV Cache | W4A8 AWQ | W4A16 AWQ | W4A8 GPTQ | W4A16 GPTQ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Blackwell (sm120) | Y | Y | Y | . | . | Y | . | . | . | . | . |
| Blackwell (sm100) | Y | Y | Y | Y | . | Y | Y | . | . | . | . |
| Hopper | . | . | Y | Y | Y | Y | . | Y | Y | Y | Y |
| Ada Lovelace | . | . | Y | . | . | Y | . | Y | Y | Y | Y |
| Ampere | . | . | . | . | . | Y | . | . | Y | . | Y |

Note

The FP8 block-scaling GEMM kernels for sm100 use the MXFP8 recipe (E4M3 activations/weights with UE8M0 activation/weight scales), which differs slightly from the SM90 FP8 recipe (E4M3 activations/weights with FP32 activation/weight scales).
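
As a rough illustration of the difference between the two recipes, the sketch below computes a per-block scale both ways for the same amax: an FP32 scale can match the amax exactly, while a UE8M0 scale is exponent-only and is rounded to a power of two (rounded up here so the block still fits in E4M3). This is a simplified view of the recipes named above, not the kernel implementation.

import math

E4M3_MAX = 448.0
amax = 3.7  # example per-block absolute maximum

# SM90-style recipe: FP32 scale, the amax maps exactly onto the E4M3 range.
scale_fp32 = amax / E4M3_MAX

# MXFP8-style recipe (sm100): UE8M0 scale is a pure power of two; round the
# exponent up so quantized values never exceed the E4M3 range.
scale_ue8m0 = 2.0 ** math.ceil(math.log2(amax / E4M3_MAX))

print(scale_fp32, scale_ue8m0)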