ONNX Quantization - Linux (Beta)

ModelOpt provides ONNX quantization that works together with TensorRT Explicit Quantization (EQ). The key advantages of ModelOpt's ONNX quantization are:

  1. Easy to use for non-expert users.

  2. White-box design allowing expert users to customize the quantization process.

  3. Better support for vision transformers.

Currently, ONNX quantization supports FP8, INT4, and INT8 quantization.

Note

ModelOpt ONNX quantization generates new ONNX models with QDQ nodes following TensorRT rules. For real speedup, the generated ONNX should be compiled into a TensorRT engine.

Requirements

  1. TensorRT >= 8.6 (>= 10.0 preferred). Please refer to the TensorRT 10.0 download link.

Apply Post Training Quantization (PTQ)

PTQ should be done with a calibration dataset. If a calibration dataset is not provided, ModelOpt will use random scales for the QDQ nodes.

Prepare calibration dataset

ModelOpt supports npz/npy files as the calibration data format. The numpy file should contain a dictionary whose keys are the model input names and whose values are numpy arrays.

import numpy as np

# Example numpy file for single-input ONNX
calib_data = np.random.randn(batch_size, channels, h, w)
np.save("calib_data.npy", calib_data)

# Example numpy file for single/multi-input ONNX
# Dict keys should match the input names of the ONNX model
calib_data = {
    "input_name": np.random.randn(*shape),
    "input_name2": np.random.randn(*shape2),
}
np.savez("calib_data.npz", **calib_data)

Call PTQ function

import numpy as np

import modelopt.onnx.quantization as moq

calibration_data = np.load(calibration_data_path)

moq.quantize(
    onnx_path=onnx_path,
    calibration_data=calibration_data,
    output_path="quant.onnx",
    quantize_mode="int8",
)

Alternatively, you can run PTQ from the command line:

python -m modelopt.onnx.quantization \
    --onnx_path /path/to/the/onnx \
    --calibration_data_path /calibration/data/in/npz/npy/format \
    --output_path /path/to/the/quantized/onnx/output \
    --quantize_mode int8
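
Whichever interface you use, a quick sanity check is to confirm that QuantizeLinear/DequantizeLinear (QDQ) nodes were actually inserted into the output model. A minimal sketch using the onnx Python package (not part of the ModelOpt API):

import onnx
from collections import Counter

# Count op types in the quantized graph; Q/DQ pairs should appear
model = onnx.load("quant.onnx")
op_counts = Counter(node.op_type for node in model.graph.node)
print("QuantizeLinear nodes:", op_counts["QuantizeLinear"])
print("DequantizeLinear nodes:", op_counts["DequantizeLinear"])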

By default, after running calibration, the quantization tool inserts QDQ nodes following a TensorRT-friendly QDQ insertion algorithm. You can change the default quantization behavior by tweaking API parameters such as op_types_to_quantize and op_types_to_exclude. See modelopt.onnx.quantization.quantize() for details.
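
For example, the sketch below excludes a couple of op types from quantization via op_types_to_exclude (the op type names are purely illustrative; choose them based on your model and accuracy requirements):

import modelopt.onnx.quantization as moq

# Same call as before, but keep the listed op types un-quantized
moq.quantize(
    onnx_path=onnx_path,
    calibration_data=calibration_data,
    output_path="quant_custom.onnx",
    quantize_mode="int8",
    op_types_to_exclude=["Resize", "Softmax"],
)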

Deploy Quantized ONNX Model

trtexec is a command-line tool provided by TensorRT. Typically, it is located in the /usr/src/tensorrt/bin/ directory. Below is a simple command to compile the quantized ONNX model generated in the previous step into a TensorRT engine file.

trtexec --onnx=quant.onnx --saveEngine=quant.engine --best
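
Once built, the saved engine can be benchmarked again without rebuilding by loading it directly, for example:

trtexec --loadEngine=quant.engine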

Compare the performance

The following command builds an engine from the original model with FP16 precision. After building, check the reported “Latency” and “Throughput” fields and compare them with those of the quantized engine.

trtexec --onnx=original.onnx --saveEngine=fp16.engine --fp16

Note

If you replace the --fp16 flag with the --best flag, this command will create an int8 engine using TensorRT’s implicit quantization.
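
For example (the engine file name is arbitrary):

trtexec --onnx=original.onnx --saveEngine=best.engine --best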