Quantization

Quantization reduces memory and compute requirements by running operations in low precision.

See also

The TensorRT developer guide explains quantization in more detail.

Post-Training Quantization With ModelOpt

If the model was not trained with quantization-aware training (QAT), we can use TensorRT ModelOpt to do calibration to determine scaling factors.

Info

Calibration runs a model with a small set of input data to determine the numerical distribution of each tensor.

The dynamic range is the most important range within this distribution, and scaling factors are chosen to target it.
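
For intuition, here is a minimal sketch of symmetric int8 quantization in plain Python (not the Tripy API; the numbers are only illustrative):

amax = 0.8646  # dynamic range found by calibration
scale = amax / 127  # 127 is the largest representable int8 magnitude (maxbound)

x = 0.5
q = max(-127, min(127, round(x / scale)))  # quantize: 73
x_approx = q * scale  # dequantize: ~0.497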

Let’s calibrate a GPT model:

  1. Install ModelOpt:

    python3 -m pip install nvidia-modelopt==0.11.1 transformers==4.46.2 datasets==2.21.0
    
  2. Download the model:

    from transformers import GPT2LMHeadModel

    model = GPT2LMHeadModel.from_pretrained("gpt2")
    
  3. Calibrate for int8 precision:

    1. Define the forward pass:

       from transformers import AutoTokenizer
       from modelopt.torch.utils.dataset_utils import create_forward_loop

       MAX_SEQ_LEN = 512
       tokenizer = AutoTokenizer.from_pretrained(
           "gpt2",
           use_fast=True,
           model_max_length=MAX_SEQ_LEN,
           padding_side="left",
           trust_remote_code=True,
       )
       tokenizer.pad_token = tokenizer.eos_token

       forward_loop = create_forward_loop(
           model=model,
           dataset_name="cnn_dailymail",
           tokenizer=tokenizer,
           device=model.device,
           num_samples=8,
       )
      
    2. Set up quantization configuration:

       import modelopt.torch.quantization as mtq

       quant_cfg = mtq.INT8_DEFAULT_CFG
      
    3. Run calibration to replace linear layers with QuantLinear, which contain calibration information:

       mtq.quantize(model, quant_cfg, forward_loop=forward_loop)
      

The amax attributes of QuantLinear’s quantizers specify dynamic ranges:

torch_qlinear = model.transformer.h[0].attn.c_attn
print(torch_qlinear)

Output:

QuantLinear(
  in_features=768, out_features=2304, bias=True
  (input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=0.8646 calibrator=MaxCalibrator quant)
  (output_quantizer): TensorQuantizer(disabled)
  (weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.1202, 2.8436](2304) calibrator=MaxCalibrator quant)
)

We must convert dynamic ranges to scaling factors to load them into Tripy:

def get_scale(quantizer):
    amax = quantizer.export_amax()
    # `maxbound` is the maximum value representable by the data type.
    # For `int8`, this is 127.
    scale = amax.float() / quantizer.maxbound
    return tp.Tensor(scale.squeeze().contiguous())


input_scale = get_scale(torch_qlinear.input_quantizer)
weight_scale = get_scale(torch_qlinear.weight_quantizer)
Local Variables
>>> input_scale
tensor(0.006808243691921234, dtype=float32, loc=gpu:0, shape=())

>>> weight_scale
tensor([0.0073, 0.0070, 0.0067, ..., 0.0026, 0.0016, 0.0021], dtype=float32, loc=gpu:0, shape=(2304,))
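
As a sanity check, input_scale is consistent with the input quantizer's dynamic range divided by the int8 maxbound: 0.8646 / 127 ≈ 0.006808.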

Loading Scales Into Tripy

Using Modules

Modules that support quantization usually:

  • Expose additional model parameters for scales.

  • Accept arguments that control how quantization is performed.

Let’s load the scales into an nvtripy.Linear module:

qlinear = tp.Linear(
    768,
    2304,
    # The data type to quantize to:
    quant_dtype=tp.int8,
    # The dimension along which the weights are quantized:
    weight_quant_dim=torch_qlinear.weight_quantizer.axis,
)

# Load weights:
qlinear.weight = tp.Tensor(torch_qlinear.weight.detach().contiguous())
qlinear.bias = tp.Tensor(torch_qlinear.bias.detach().contiguous())

# Load scaling factors:
qlinear.input_scale = input_scale
qlinear.weight_scale = weight_scale
Local Variables
>>> qlinear
Linear(
    weight: Parameter = (shape=[2304, 768], dtype=float32),
    bias: Parameter = (shape=[2304], dtype=float32),
    weight_scale: Parameter = (shape=[2304], dtype=float32),
    input_scale: Parameter = (shape=[], dtype=float32),
)
>>> qlinear.state_dict()
{
    weight: tensor(
        [[-0.4738, 0.0874, 0.0039, ..., -0.2592, 0.1517, -0.4100],
         [-0.2614, 0.1473, 0.0695, ..., -0.0164, 0.2170, -0.1924],
         [-0.0978, 0.2387, 0.3668, ..., 0.1991, 0.1043, -0.2400],
         ...,
         [0.0513, -0.0525, 0.1143, ..., 0.0095, 0.0293, -0.0046],
         [-0.0584, -0.0113, 0.0363, ..., -0.0516, -0.0429, 0.0070],
         [0.0250, -0.0156, -0.0318, ..., 0.0319, -0.0475, 0.0198]],
        dtype=float32, loc=gpu:0, shape=(2304, 768)),
    bias: tensor([0.4803, -0.5254, -0.4293, ..., 0.0126, -0.0499, 0.0032], dtype=float32, loc=gpu:0, shape=(2304,)),
    weight_scale: tensor([0.0073, 0.0070, 0.0067, ..., 0.0026, 0.0016, 0.0021], dtype=float32, loc=gpu:0, shape=(2304,)),
    input_scale: tensor(0.006808243691921234, dtype=float32, loc=gpu:0, shape=()),
}

Note

We use scales from ModelOpt here, but scaling factors can come from anywhere.
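
For example, per-channel weight scales could be derived directly from the weights via max calibration. Below is a minimal PyTorch sketch, independent of ModelOpt (`manual_weight_scale` is just an illustrative name):

# The largest magnitude along each output channel (axis 0) is that channel's
# dynamic range; dividing by the int8 maxbound (127) yields the scale.
amax = torch_qlinear.weight.detach().abs().amax(dim=1)
manual_weight_scale = tp.Tensor((amax.float() / 127.0).contiguous())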

We can run it just like a regular float32 module. Inputs/weights are quantized internally:

input = tp.ones((1, 768), dtype=tp.float32)

output = qlinear(input)
Local Variables
>>> input
tensor(
    [[1.0000, 1.0000, 1.0000, ..., 1.0000, 1.0000, 1.0000]],
    dtype=float32, loc=gpu:0, shape=(1, 768))

>>> output
tensor(
    [[-11.8799, 11.4679, 12.3159, ..., 0.1293, 1.8775, -0.6599]],
    dtype=float32, loc=gpu:0, shape=(1, 2304))

See also

load_quant_weights_from_hf in the nanoGPT weight loader is an example of loading scaling factors for an entire model.
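
As a rough illustration of that idea, the per-layer pattern above can be repeated across the calibrated model by walking its modules. This is a hypothetical sketch (not the actual nanoGPT loader), reusing get_scale from earlier and assuming each calibrated layer exposes in_features, out_features, weight, and bias:

def load_quant_linear(torch_module):
    linear = tp.Linear(
        torch_module.in_features,
        torch_module.out_features,
        quant_dtype=tp.int8,
        weight_quant_dim=torch_module.weight_quantizer.axis,
    )
    # Weights and biases:
    linear.weight = tp.Tensor(torch_module.weight.detach().contiguous())
    linear.bias = tp.Tensor(torch_module.bias.detach().contiguous())
    # Scaling factors derived from the calibrated quantizers:
    linear.input_scale = get_scale(torch_module.input_quantizer)
    linear.weight_scale = get_scale(torch_module.weight_quantizer)
    return linear


# Convert every layer that carries calibration information:
quant_linears = {
    name: load_quant_linear(module)
    for name, module in model.named_modules()
    if hasattr(module, "input_quantizer") and hasattr(module, "weight_quantizer")
}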

Manually

When using nvtripy.quantize()/nvtripy.dequantize(), dequantize must immediately follow quantize.

TensorRT will rotate dequantize over subsequent ops as needed.

See also

The TensorRT developer guide includes recommendations on placement of quantization and dequantization ops.

To mimic the behavior of the nvtripy.Linear module above, we can:

  1. Quantize the input:

    input = tp.ones((1, 768), dtype=tp.float32)

    input = tp.quantize(input, input_scale, dtype=tp.int8)
    # Note the placement of dequantize:
    input = tp.dequantize(input, input_scale, dtype=tp.float32)
    
  2. Quantize the weights:

    weight = tp.Tensor(torch_qlinear.weight.detach().contiguous())

    dim = torch_qlinear.weight_quantizer.axis
    weight = tp.quantize(weight, weight_scale, dtype=tp.int8, dim=dim)
    weight = tp.dequantize(weight, weight_scale, dtype=tp.float32, dim=dim)
    
  3. Perform the computation (matrix multiply in this case):

    bias = tp.Tensor(torch_qlinear.bias.detach().contiguous())

    output = input @ tp.transpose(weight, 0, 1) + bias
    
    Local Variables
    >>> output
    tensor(
        [[-11.8781, 11.4667, 12.3143, ..., 0.1296, 1.8768, -0.6599]],
        dtype=float32, loc=gpu:0, shape=(1, 2304))
    

Warning

Evaluating the tensor produced by dequantize will affect accuracy.

  • Why: Evaluation replaces the tensor with a constant, losing information like which op produced it.

    So, TensorRT won’t see dequantize when evaluating subsequent ops and won’t rotate it correctly.

For example, don’t do this:

tensor = tp.ones(...)

tensor = tp.quantize(tensor, ...)
tensor = tp.dequantize(tensor, ...)

# The `print` below will trigger an evaluation of the tensor, which will prevent
# TensorRT from rotating the dequantization node. This will affect accuracy!
print(tensor)

# Rest of the program, including some computation involving tensor
...
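
Instead, defer evaluation (e.g. printing) until the full program has been built, so that only final outputs are materialized. A minimal sketch of the corrected pattern (`final_output` is a placeholder for whatever the rest of the program produces):

tensor = tp.ones(...)

tensor = tp.quantize(tensor, ...)
tensor = tp.dequantize(tensor, ...)

# Rest of the program, including some computation involving tensor
...

# Only evaluate final outputs:
print(final_output)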