FP8 Quantization
Quantization is a technique that allows models to run at lower precisions such as int8 and fp8 while maintaining acceptable output quality. Running at lower precision can greatly boost performance, significantly increasing throughput and decreasing latency. The tradeoff is a potential drop in output quality; in many cases the quality remains acceptable, and many real-world deployments rely on quantization. To learn more about quantization, refer to Mastering LLM Techniques - Inference Optimization.
This section walks through enabling fp8 quantization and highlights some fp8-specific configuration options for boosting performance. It also continues the case study of Llama-3.3-70B split across 4 H100-SXM-80GB GPUs via tensor parallelism and showcases the effect of these configuration options on performance.
Disclaimer: While performance numbers shown here are real, they are only for demonstration purposes. Differences in environment, SKU, interconnect, and workload can all significantly affect performance and lead to your results differing from what is shown here.
Enabling Quantization
To enable quantization, configure the `QuantConfig` class and pass it to the `quant_config` parameter of the `LLM` class. At a minimum, the `quant_algo` parameter, which sets the quantization algorithm (fp8, fp8 per token, int8 AWQ, etc.), must be specified. You can find all supported quantization algorithms and other configurable options for `QuantConfig` in the LLM-API->Reference section of the docs. This is not required if you are using weights/checkpoints that are already quantized, but if you are starting from an fp16 checkpoint you also need to specify the calibration dataset used to determine the quantization scales via `CalibConfig`. `CalibConfig` provides several options for setting the calibration dataset, also documented in the LLM-API->Reference section. Although TensorRT-LLM supports several other types of quantization, this guide focuses on fp8.
Here is an example of building and saving an fp8 engine from a bf16 checkpoint (note that fp8 is supported only on devices with compute capability 8.9 or higher: Ada, Hopper, Blackwell, and beyond):
```python
from tensorrt_llm import LLM, BuildConfig
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo, CalibConfig


def main():
    # Quantize weights and activations to fp8.
    quant_config = QuantConfig(quant_algo=QuantAlgo.FP8)

    # Calibration settings used to determine the fp8 quantization scales.
    calib_config = CalibConfig(
        calib_batches=512,
        calib_batch_size=1,
        calib_max_seq_length=2048,
        tokenizer_max_seq_length=4096
    )

    build_config = BuildConfig(
        max_num_tokens=2048,
        max_batch_size=512,
    )
    build_config.plugin_config.use_paged_context_fmha = True
    build_config.plugin_config.multiple_profiles = True

    llm = LLM(
        model="/path/to/Llama-3.3-70B",
        tensor_parallel_size=4,
        pipeline_parallel_size=1,
        build_config=build_config,
        quant_config=quant_config,
        calib_config=calib_config
    )

    llm.save("baseline_fp8_engine")


if __name__ == '__main__':
    main()
```
For an example of how to build an fp8 engine using the TensorRT-LLM CLI workflow, see the TensorRT-LLM LLaMA examples. In short, you first run `examples/quantization/quantize.py` to quantize the model and convert the checkpoint to TensorRT-LLM format, and then build the engine with `trtllm-build`.
Note: While quantization aims to preserve model accuracy, this is not guaranteed, so it is extremely important to check that the quality of outputs remains sufficient after quantization.
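One way to do that spot check is to load the saved fp8 engine with the LLM API, generate completions for a handful of representative prompts, and compare them against the original bf16 model or a small evaluation set. The sketch below is illustrative only: it assumes the `baseline_fp8_engine` directory produced by the example above, and the prompts and sampling settings are placeholders you should replace with ones representative of your workload.

```python
from tensorrt_llm import LLM, SamplingParams

# Hypothetical spot-check prompts; use prompts representative of your workload.
PROMPTS = [
    "Summarize the benefits of quantization in one sentence.",
    "What is the capital of France?",
]


def spot_check(engine_dir: str):
    # Load the previously saved engine directory and generate a few completions.
    llm = LLM(model=engine_dir, tensor_parallel_size=4)
    sampling = SamplingParams(max_tokens=64)
    for output in llm.generate(PROMPTS, sampling):
        print(f"Prompt: {output.prompt!r}")
        print(f"Output: {output.outputs[0].text!r}\n")


if __name__ == "__main__":
    # Compare these outputs against the unquantized model or a small eval set.
    spot_check("baseline_fp8_engine")
```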
FP8 “Baseline” Performance
Benchmarking the engine produced by the example above yielded the following results. Note that we enabled some of the build flags covered earlier (multiple profiles, `use_paged_context_fmha`) and also tuned max batch size and max num tokens. This gives a sense of what performance is achievable when you tune an fp8 engine but exclude the options tailored specifically for quantization. We recommend disabling the GEMM plugin for quantized engines, which is why it is not included here (it is off by default). Reduce fusion has a quantization-specific optimization that is covered later. For the remainder of this page we refer to this setup as the “baseline” numbers for fp8.
| Metric | Value |
|---|---|
| Token Throughput (tokens/sec) | 3389.5305 |
| Request Throughput (req/sec) | 1.6550 |
| Average Time To First Token (ms) | 96.1597 |
| Average Inter-Token Latency (ms) | 12.4248 |
Quantized KV-Cache
By default the KV cache is not quantized, but TensorRT-LLM supports quantizing it to further improve performance. However, quantizing the model more aggressively also increases the risk of degrading output quality, so it is important to verify output quality when using this feature.
Enabling Quantized KV Cache
The LLM-API exposes the quantization algorithm used for the KV cache via the `kv_cache_quant_algo` field in `QuantConfig`. To enable fp8 KV cache, modify `QuantConfig` as follows:
```python
quant_config = QuantConfig(quant_algo=QuantAlgo.FP8,
                           kv_cache_quant_algo=QuantAlgo.FP8)
```
If you are using the CLI flow for building engines, pass `--kv_cache_dtype fp8` to `examples/quantization/quantize.py`.
Performance with Quantized KV Cache
| Metric | Baseline | FP8 KV-Cache ON |
|---|---|---|
| Token Throughput (tokens/sec) | 3389.5305 | 5299.6372 |
| Request Throughput (req/sec) | 1.6550 | 2.5877 |
| Average Time To First Token (ms) | 96.1597 | 97.1287 |
| Average Inter-Token Latency (ms) | 12.4248 | 12.5496 |
Reduce Norm Fusion with User Buffers for Llama Models
The Reduce Norm Fusion feature is supported for fp8. An additional optimization called “User Buffers” is also supported for fp8 models. The user buffer feature aims to eliminate extra copies from the local buffer to the shared buffer in the communication kernel, leading to improved end-to-end performance.
Enabling Reduce Norm Fusion with User Buffers
To enable reduce norm fusion with user buffers, add the following lines below `BuildConfig`’s initialization:

```python
build_config.plugin_config.reduce_fusion = True
build_config.plugin_config.user_buffer = True
```
If you are using the CLI flow for building engines, pass `--reduce_fusion enable` and `--user_buffer enable` to `trtllm-build` to enable the feature.
Note: You must enable `reduce_fusion` in order to enable `user_buffer`.
Performance with Reduce Norm Fusion + User Buffers:
Reduce Norm Fusion + User Buffer OFF: Same engine previously referred to as FP8 KV-Cache ON.
Reduce Norm Fusion + User Buffer ON: Previous example with reduce fusion and user buffers enabled. Max num tokens set to 16384 and max batch size set to 512 after tuning.
| Metric | Reduce Norm Fusion + User Buffer OFF | Reduce Norm Fusion + User Buffer ON |
|---|---|---|
| Token Throughput (tokens/sec) | 5299.6372 | 5980.7842 |
| Request Throughput (req/sec) | 2.5877 | 2.9203 |
| Average Time To First Token (ms) | 97.1287 | 82.2679 |
| Average Inter-Token Latency (ms) | 12.5496 | 12.6975 |
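For reference, here is a sketch of the build configuration behind the ON column, combining the plugin flags and the retuned max num tokens / max batch size values quoted above; everything else (quantization, calibration, and the LLM call) is assumed unchanged from the earlier fp8 KV-cache example.

```python
from tensorrt_llm import BuildConfig

build_config = BuildConfig(
    max_num_tokens=16384,  # retuned for reduce fusion + user buffers
    max_batch_size=512,
)
build_config.plugin_config.use_paged_context_fmha = True
build_config.plugin_config.multiple_profiles = True
build_config.plugin_config.reduce_fusion = True  # required for user_buffer
build_config.plugin_config.user_buffer = True
```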
GEMM + SwiGLU Fusion in Gated-MLP
The GEMM + SwiGLU fusion in Gated-MLP combines two Matmul operations and one SwiGLU operation into a single kernel. Currently this is only supported for FP8 precision on Hopper. While this fusion improves performance, it can slightly reduce accuracy in FP8 PTQ because one quantization scaling factor is discarded.
We recommend enabling this feature for large models running on Hopper with FP8 precision. We do not recommend enabling it for very small workloads or if the accuracy loss is unacceptable.
Enabling GEMM + SwiGLU Fusion
To enable the GEMM + SwiGLU fusion, add the following line below `BuildConfig`’s initialization:

```python
build_config.plugin_config.gemm_swiglu_plugin = 'fp8'
```
For small batch size cases where latency is important, you can replace the above line with:

```python
build_config.plugin_config.low_latency_gemm_swiglu_plugin = 'fp8'
```
If you are using the CLI flow for building engines, pass `--gemm_swiglu_plugin=fp8`, or `--low_latency_gemm_swiglu_plugin=fp8` for the low-latency case (include only one or the other), to `trtllm-build`.
Performance with GEMM + SwiGLU Fusion
| Metric | GEMM + SwiGLU fusion OFF | GEMM + SwiGLU fusion ON |
|---|---|---|
| Token Throughput (tokens/sec) | 5980.7842 | 5976.7977 |
| Request Throughput (req/sec) | 2.9203 | 2.9184 |
| Average Time To First Token (ms) | 82.2679 | 81.8841 |
| Average Inter-Token Latency (ms) | 12.6975 | 11.7031 |
In this case, the GEMM + SwiGLU plugin performs almost identically to when it was disabled. The throughput drop is within run-to-run variance, and the TTFT and ITL improvements are slight. However, we found that when paired with the low latency GEMM plugin discussed next, enabling this feature was necessary to reach maximum throughput.
Low Latency GEMM Plugin
Previously we mentioned the GEMM plugin feature. Although it has fp8 support, we recommend disabling it (it is disabled by default). However, for low-latency fp8 scenarios we recommend trying the low latency GEMM plugin to see if it is effective for your workload.
Enabling Low Latency GEMM plugin
To enable the low latency GEMM plugin, add the following line below `BuildConfig`’s initialization:

```python
build_config.plugin_config.low_latency_gemm_plugin = 'fp8'
```
If you are using the CLI flow for building engines, pass `--low_latency_gemm_plugin=fp8` to `trtllm-build` to enable the feature. Again, we recommend disabling the GEMM plugin for fp8, so if you are passing `--gemm_plugin=fp8` to `trtllm-build`, we recommend removing it.
Performance with Low Latency GEMM plugin
Low Latency GEMM ON: Same configuration as the previous example but with the low latency GEMM plugin enabled. Max num tokens was set to 16384 and max batch size was set to 512 after tuning.
| Metric | Low Latency GEMM OFF | Low Latency GEMM ON |
|---|---|---|
| Token Throughput (tokens/sec) | 5976.7977 | 6049.1625 |
| Request Throughput (req/sec) | 2.9184 | 2.9537 |
| Average Time To First Token (ms) | 81.8841 | 88.0162 |
| Average Inter-Token Latency (ms) | 11.7031 | 10.8225 |
In this case, enabling the low-latency GEMM plugin provided a meaningful boost to throughput. It also improved ITL, at the expense of TTFT. Furthermore, when used without the GEMM + SwiGLU fusion, performance was actually worse than with the plugin turned off. This suggests that for this workload the low-latency GEMM plugin was choosing a worse kernel for the GEMM immediately before the SwiGLU; once that GEMM was handled by the fused GEMM + SwiGLU kernel, the remaining kernels selected by the low-latency GEMM plugin were better than the baseline, resulting in improved performance. This underscores the importance of benchmarking different settings, as the impact of this plugin is highly workload dependent. If possible, a grid search over configuration options can be useful for extremely performance-sensitive workloads.
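As a rough sketch of what such a grid search might look like with the LLM API, the loop below builds and saves one engine per combination of the two plugins so that each variant can then be benchmarked separately (for example with your benchmarking tool of choice). The engine naming scheme is illustrative, and the quantization, calibration, and other build settings are assumed to be the ones used throughout this page.

```python
from itertools import product

from tensorrt_llm import LLM, BuildConfig
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo, CalibConfig

quant_config = QuantConfig(quant_algo=QuantAlgo.FP8,
                           kv_cache_quant_algo=QuantAlgo.FP8)
calib_config = CalibConfig(calib_batches=512,
                           calib_batch_size=1,
                           calib_max_seq_length=2048,
                           tokenizer_max_seq_length=4096)

# Sweep the two plugin settings discussed above; None means "leave disabled".
for gemm_swiglu, low_latency_gemm in product(['fp8', None], repeat=2):
    build_config = BuildConfig(max_num_tokens=16384, max_batch_size=512)
    build_config.plugin_config.use_paged_context_fmha = True
    build_config.plugin_config.multiple_profiles = True
    build_config.plugin_config.reduce_fusion = True
    build_config.plugin_config.user_buffer = True
    if gemm_swiglu is not None:
        build_config.plugin_config.gemm_swiglu_plugin = gemm_swiglu
    if low_latency_gemm is not None:
        build_config.plugin_config.low_latency_gemm_plugin = low_latency_gemm

    llm = LLM(model="/path/to/Llama-3.3-70B",
              tensor_parallel_size=4,
              build_config=build_config,
              quant_config=quant_config,
              calib_config=calib_config)
    # Illustrative naming scheme; benchmark each saved engine separately.
    llm.save(f"fp8_engine_swiglu-{gemm_swiglu}_llgemm-{low_latency_gemm}")
```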
Conclusion
Overall, leveraging quantization can provide significant performance uplifts. Here are the gains of our tuned fp8 model compared to the tuned fp16 numbers we reached on the previous page of this guide:
| Metric | Tuned FP16 Model | Tuned FP8 Model | % Improvement |
|---|---|---|---|
| Token Throughput (tokens/sec) | 2474.2581 | 6049.1625 | 144.48 |
| Request Throughput (req/sec) | 1.2081 | 2.9537 | 144.49 |
| Average Time To First Token (ms) | 147.5742 | 88.0162 | 40.36 |
| Average Inter-Token Latency (ms) | 14.6852 | 10.8225 | 26.30 |
Additionally, compared to the fp8 baseline numbers (which already included some degree of tuning; see FP8 “Baseline” Performance above for details), enabling the flags discussed on this page yielded the following uplifts:
| Metric | Baseline FP8 Model | Tuned FP8 Model | % Improvement |
|---|---|---|---|
| Token Throughput (tokens/sec) | 3389.5305 | 6049.1625 | 78.47 |
| Request Throughput (req/sec) | 1.6550 | 2.9537 | 78.47 |
| Average Time To First Token (ms) | 96.1597 | 88.0162 | 8.47 |
| Average Inter-Token Latency (ms) | 12.4248 | 10.8225 | 12.90 |
As mentioned previously, the caveat with leveraging quantization is the potential drop in accuracy, and we strongly recommend having a way to test whether model output quality is acceptable before deploying a quantized model. That said, many real-world deployments successfully use quantization, and the significant performance boosts it enables are often worth the effort of evaluating whether it is a fit.
Summary of Configuration Option Recommendations:
- Quantized KV cache: Typically provides a significant throughput boost. We recommend turning it on as long as output quality remains acceptable with the feature enabled.
- Reduce fusion + user buffers: Only supported for fp8 Llama and Mistral/Mixtral models. Effectiveness is workload dependent, so we recommend turning it on and benchmarking to check.
- GEMM + SwiGLU plugin: Only supported for fp8 models with SwiGLU operators (Llama, Mixtral, etc.). Like reduce fusion, effectiveness is workload dependent, and we recommend sanity checking it. It carries an increased risk of affecting accuracy since it drops a quantization scale.
- Low-latency GEMM plugin: Effectiveness is workload dependent, so we recommend turning it on and benchmarking. As we saw in the case study, its effectiveness can be affected by other flags, so benchmarking various combinations of configuration options is ideal when possible.
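Putting it all together, here is a sketch of the fully tuned fp8 build from this page's case study, combining all of the options discussed above. The checkpoint path and engine directory name are placeholders; as always, verify output quality before deploying an engine built this way.

```python
from tensorrt_llm import LLM, BuildConfig
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo, CalibConfig

# fp8 weights/activations plus fp8 KV cache.
quant_config = QuantConfig(quant_algo=QuantAlgo.FP8,
                           kv_cache_quant_algo=QuantAlgo.FP8)

calib_config = CalibConfig(calib_batches=512,
                           calib_batch_size=1,
                           calib_max_seq_length=2048,
                           tokenizer_max_seq_length=4096)

# Retuned limits from the case study above.
build_config = BuildConfig(max_num_tokens=16384, max_batch_size=512)
build_config.plugin_config.use_paged_context_fmha = True
build_config.plugin_config.multiple_profiles = True
build_config.plugin_config.reduce_fusion = True       # required for user_buffer
build_config.plugin_config.user_buffer = True
build_config.plugin_config.gemm_swiglu_plugin = 'fp8'
build_config.plugin_config.low_latency_gemm_plugin = 'fp8'

llm = LLM(model="/path/to/Llama-3.3-70B",
          tensor_parallel_size=4,
          pipeline_parallel_size=1,
          build_config=build_config,
          quant_config=quant_config,
          calib_config=calib_config)
llm.save("tuned_fp8_engine")  # placeholder directory name
```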