ONNX Quantization - Linux (Beta)
ModelOpt provides ONNX quantization that works together with TensorRT Explicit Quantization (EQ). The key advantages offered by ModelOpt’s ONNX quantization are:
Easy to use for non-expert users.
White-box design allowing expert users to customize the quantization process.
Better support for vision transformers.
Currently, FP8, INT4, and INT8 quantization modes are supported.
Note
ModelOpt ONNX quantization generates new ONNX models with QDQ nodes following TensorRT rules. For a real speedup, the generated ONNX model must be compiled into a TensorRT engine.
Requirements
TensorRT >= 8.6 (>= 10.0 preferred). Please refer to the TensorRT 10.0 download link.
Apply Post Training Quantization (PTQ)
PTQ should be done with a calibration dataset. If a calibration dataset is not provided, ModelOpt will use random scales for the QDQ nodes.
Prepare calibration dataset
ModelOpt supports npz/npy files as the calibration data format. The numpy file should contain a dictionary whose keys are the model input names and whose values are numpy arrays.
import numpy as np

# Example numpy file for single-input ONNX
# (batch_size, channels, h, w should match the model's input shape)
calib_data = np.random.randn(batch_size, channels, h, w)
np.save("calib_data.npy", calib_data)

# Example numpy file for single/multi-input ONNX
# Dict keys should match the input names of the ONNX model
calib_data = {
    "input_name": np.random.randn(*shape),
    "input_name2": np.random.randn(*shape2),
}
# Unpack the dict so each array is saved under its input name
np.savez("calib_data.npz", **calib_data)
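Before calling PTQ, it can be worth checking that the dictionary keys actually match the model’s input names. The snippet below is a minimal sanity-check sketch, not part of the ModelOpt API; the model path "model.onnx" is a placeholder.
import numpy as np
import onnx

# Hypothetical check: every calibration key should correspond to a graph input.
model = onnx.load("model.onnx")  # placeholder path
input_names = {inp.name for inp in model.graph.input}
calib = np.load("calib_data.npz")
missing = set(calib.keys()) - input_names
assert not missing, f"Calibration keys not found among model inputs: {missing}"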
Call PTQ function
import numpy as np

import modelopt.onnx.quantization as moq

calibration_data = np.load(calibration_data_path)

moq.quantize(
    onnx_path=onnx_path,
    calibration_data=calibration_data,
    output_path="quant.onnx",
    quantize_mode="int8",
)
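To verify that the output matches the note above (QDQ nodes inserted according to TensorRT rules), you can count the inserted nodes with the onnx package. This is an optional sketch, not part of the ModelOpt API:
import onnx

# Count the QuantizeLinear/DequantizeLinear (QDQ) nodes inserted by ModelOpt.
quant_model = onnx.load("quant.onnx")
qdq_nodes = [n for n in quant_model.graph.node if n.op_type in ("QuantizeLinear", "DequantizeLinear")]
print(f"Found {len(qdq_nodes)} QDQ nodes in quant.onnx")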
Alternatively, instead of the Python API, you can run PTQ from the command line:
python -m modelopt.onnx.quantization \
    --onnx_path /path/to/the/onnx/model \
    --calibration_data_path /calibration/data/in/npz/npy/format \
    --output_path /path/to/the/quantized/onnx/output \
    --quantize_mode int8
By default, after running calibration, the quantization tool inserts QDQ nodes following a TensorRT-friendly QDQ insertion algorithm. You can change the default quantization behavior by tweaking API parameters such as op_types_to_quantize and op_types_to_exclude, as sketched below; see modelopt.onnx.quantization.quantize() for details.
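For example, the call below is a minimal sketch of such a customization. The operator names passed to op_types_to_quantize and op_types_to_exclude are illustrative placeholders; choose values appropriate for your model.
import numpy as np

import modelopt.onnx.quantization as moq

calibration_data = np.load("calib_data.npz")

moq.quantize(
    onnx_path="model.onnx",
    calibration_data=calibration_data,
    output_path="quant_custom.onnx",
    quantize_mode="int8",
    # Only quantize these op types (placeholder list).
    op_types_to_quantize=["Conv", "MatMul"],
    # Keep numerically sensitive ops in higher precision (placeholder list).
    op_types_to_exclude=["Softmax"],
)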
Deploy Quantized ONNX Model
trtexec is a command-line tool provided by TensorRT, typically located in the /usr/src/tensorrt/bin/ directory. Below is a simple command to compile the quantized ONNX model generated in the previous step into a TensorRT engine file.
trtexec --onnx=quant.onnx --saveEngine=quant.engine --best
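Once the engine is built, it can be loaded and inspected from Python. The snippet below is a minimal sketch using the TensorRT Python bindings and the I/O tensor API of TensorRT 10; it is independent of ModelOpt.
import tensorrt as trt

# Deserialize the engine built by trtexec.
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("quant.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# Print each I/O tensor with its mode (input/output), shape, and dtype.
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    print(name, engine.get_tensor_mode(name), engine.get_tensor_shape(name), engine.get_tensor_dtype(name))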
Compare the performance
The following command builds an engine from the original model using FP16 precision. After building, compare the reported “Latency” and “Throughput” fields of the FP16 engine against those of the quantized engine; a helper that automates this comparison is sketched after the note below.
trtexec --onnx=original.onnx --saveEngine=fp16.engine --fp16
Note
If you replace the --fp16 flag with the --best flag, this command will create an int8 engine with TensorRT’s implicit quantization.
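The comparison can also be scripted. Below is a hypothetical helper, not part of ModelOpt or TensorRT, that runs trtexec for both models and prints the reported latency/throughput lines; it assumes trtexec is on PATH, and the exact wording of trtexec’s summary lines may differ between TensorRT versions.
import subprocess

def benchmark(onnx_file: str, engine_file: str, precision_flag: str) -> None:
    """Build an engine with trtexec and print its latency/throughput summary lines."""
    cmd = ["trtexec", f"--onnx={onnx_file}", f"--saveEngine={engine_file}", precision_flag]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(f"=== {precision_flag} ===")
    for line in result.stdout.splitlines():
        if "Latency" in line or "Throughput" in line:
            print(line.strip())

benchmark("quant.onnx", "quant.engine", "--best")    # explicitly quantized model
benchmark("original.onnx", "fp16.engine", "--fp16")  # FP16 baseline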