Best practices to choose the right quantization methods
A quantization method comprises three primary components:

- Weight precision format
- Activation precision format
- Calibration algorithm
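
To make these three components concrete, here is a minimal sketch of how they typically appear together in a quantization config, loosely modeled on TensorRT Model Optimizer's config dictionaries. The exact keys are an assumption for illustration and vary between releases:

```python
# Illustrative only: the three components of a quantization method expressed
# as a config dict, loosely following TensorRT Model Optimizer's format.
# Exact keys are an assumption and differ between releases.
int4_awq_like_cfg = {
    "quant_cfg": {
        # 1. Weight precision format: 4-bit weights with block-wise scales.
        "*weight_quantizer": {"num_bits": 4, "block_sizes": {-1: 128}},
        # 2. Activation precision format: disabled here, i.e. weight-only
        #    quantization (W4A16).
        "*input_quantizer": {"enable": False},
    },
    # 3. Calibration algorithm: AWQ searches for per-channel scaling factors
    #    that minimize quantization error on calibration data.
    "algorithm": {"method": "awq_lite"},
}
```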
In small-batch inference scenarios (batch size ≤ 4), inference is typically "memory-bound": throughput is limited by the time it takes to load the weights from GPU memory into the GPU cache, i.e., inference is limited by memory bandwidth. In this regime, weight-only quantization methods such as INT4 AWQ or INT4-FP8 AWQ give the largest performance improvement.
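
To make the memory-bound argument concrete, here is a back-of-the-envelope sketch of arithmetic intensity for a single linear layer. The ~300 FLOPs/byte compute-bound threshold is an assumption based on a typical modern datacenter GPU (roughly 1000 TFLOPS of FP16 compute against ~3.3 TB/s of HBM bandwidth), and the layer size is arbitrary:

```python
def arithmetic_intensity(batch: int, d_in: int, d_out: int,
                         bytes_per_weight: float) -> float:
    """FLOPs per byte of weight traffic for one GEMM. Activation traffic is
    ignored, since at small batch the weight matrix dominates memory reads."""
    flops = 2 * batch * d_in * d_out            # one multiply + one add per element
    weight_bytes = d_in * d_out * bytes_per_weight
    return flops / weight_bytes

# Example: a 4096x4096 projection layer with FP16 (2-byte) weights.
for bs in (1, 4, 16, 64):
    ai = arithmetic_intensity(bs, 4096, 4096, bytes_per_weight=2.0)
    print(f"batch={bs:3d}  ~{ai:.0f} FLOPs per weight byte")

# At batch 1-4 the intensity (1-4 FLOPs/byte) is far below the ~300 FLOPs/byte
# needed to saturate compute, so the GEMM stalls on weight loads; halving the
# weight bytes (e.g. FP16 -> INT4) directly cuts that stall time.
```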
Conversely, in large-batch inference scenarios such as serving (batch size ≥ 16), both memory bandwidth and computation density become crucial factors. It is therefore recommended to choose a quantization method that quantizes both weights and activations and runs on lower-precision computation kernels. For batch size ≥ 16, the best choice of quantization method can be model specific.
We suggest trying FP8 first, as it causes very little accuracy degradation and gives strong performance. If FP8 performance does not meet your requirements, try INT4-FP8 AWQ. If your deployment is on Ampere GPUs or earlier, we recommend INT4 AWQ or INT8 SmoothQuant (SQ), since those GPUs lack FP8 support.
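
This recommendation can be encoded as a small, hypothetical helper. The only facts assumed here are the compute-capability cutoffs: FP8 tensor cores require compute capability 8.9 or higher (Ada, Hopper, and newer), while Ampere GPUs report 8.0/8.6:

```python
import torch

def pick_quantization_method() -> str:
    """Hypothetical helper encoding the guidance above.
    Assumes a CUDA device is visible to PyTorch."""
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (8, 9):
        # Ada/Hopper and newer: try FP8 first (near-lossless accuracy,
        # strong performance). If FP8 is still not fast enough for your
        # workload, consider INT4-FP8 AWQ next.
        return "FP8"
    # Ampere or earlier: no FP8 kernels available.
    return "INT4 AWQ (small batch) or INT8 SmoothQuant (large batch)"
```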
Depending on the use case, users may have different tolerances for accuracy degradation and calibration time. The table below summarizes the tradeoffs* to consider when choosing a quantization method.
| Quantization Method | Small-batch performance | Large-batch performance | Accuracy degradation | Details |
|---|---|---|---|---|
| FP8 | Medium | Medium | Very Low | |
| INT8 SmoothQuant | Medium | Medium | Medium | |
| INT4 weight-only AWQ (W4A16) | High | Low | Low | |
| INT4-FP8 AWQ (W4A8) | High | Medium | Low | |
\* Please see how to apply these quantization methods below:
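
As a minimal sketch of applying one of these methods, the following uses the config presets and the `mtq.quantize` entry point documented for NVIDIA's TensorRT Model Optimizer; preset names can vary between releases, so verify them against your installed version:

```python
import modelopt.torch.quantization as mtq

model = ...             # your torch.nn.Module (user-supplied)
calib_dataloader = ...  # a few hundred representative samples (user-supplied)

# Pick a config matching the table above; these preset names follow
# modelopt's documentation (verify against your installed release):
#   mtq.FP8_DEFAULT_CFG, mtq.INT8_SMOOTHQUANT_CFG, mtq.INT4_AWQ_CFG
config = mtq.FP8_DEFAULT_CFG

def forward_loop(model):
    # Run representative data through the model so the calibration
    # algorithm can observe activation ranges.
    for batch in calib_dataloader:
        model(batch)

# Calibrate and insert quantizers into the model.
model = mtq.quantize(model, config, forward_loop)
```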