ONNX Quantization - Linux (Beta)
ModelOpt provides ONNX quantization that works together with TensorRT Explicit Quantization (EQ). The key advantages offered by ModelOpt’s ONNX quantization are:
Easy to use for non-expert users.
White-box design allowing expert users to customize the quantization process.
Better support for vision transformers.
Currently, ONNX quantization supports FP8, INT8, and INT4 modes.
Note
ModelOpt ONNX quantization generates new ONNX models with QDQ nodes following TensorRT rules. For real speedup, the generated ONNX model should be compiled into a TensorRT engine.
Requirements
Execution Provider | Requirements
---|---
CPU |
CUDA |
TensorRT |
Apply Post Training Quantization (PTQ)
PTQ should be done with a calibration dataset. If a calibration dataset is not provided, ModelOpt will use random scales for the QDQ nodes.
Prepare calibration dataset
ModelOpt supports npz/npy files as the calibration data format. The numpy file should be either a single array (for single-input models) or a dictionary with model input names as keys and numpy arrays as values.
import numpy as np

# Example numpy file for a single-input ONNX model
calib_data = np.random.randn(batch_size, channels, h, w)
np.save("calib_data.npy", calib_data)

# Example numpy file for a single- or multi-input ONNX model
# Dict keys should match the input names of the ONNX model
calib_data = {
    "input_name": np.random.randn(*shape),
    "input_name2": np.random.randn(*shape2),
}
np.savez("calib_data.npz", **calib_data)
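In practice, calibration data is usually drawn from a small set of representative inputs rather than random tensors. Below is a minimal sketch assuming an image model; the preprocessing, the file list, and the input name "input" are placeholders for your own pipeline, and Pillow is assumed to be available.
import numpy as np
from PIL import Image  # assumption: Pillow is installed

# Collect representative samples and apply the same preprocessing
# that the model expects at inference time.
samples = []
for path in ["img_0.jpg", "img_1.jpg"]:  # placeholder file list
    img = Image.open(path).convert("RGB").resize((224, 224))
    arr = np.asarray(img, dtype=np.float32) / 255.0  # scale to [0, 1]
    samples.append(arr.transpose(2, 0, 1))  # HWC -> CHW

# The dict key must match the ONNX model's input name ("input" is a placeholder).
np.savez("calib_data.npz", **{"input": np.stack(samples)})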
Call PTQ function
import numpy as np

import modelopt.onnx.quantization as moq

# Load the calibration data prepared in the previous step
calibration_data = np.load(calibration_data_path)

moq.quantize(
    onnx_path=onnx_path,
    calibration_data=calibration_data,
    output_path="quant.onnx",
    quantize_mode="int8",
)
Alternatively, you can call the PTQ function from the command line:
usage: python -m modelopt.onnx.quantization [-h] --onnx_path ONNX_PATH
[--quantize_mode {fp8,int8,int4}]
[--calibration_method {max,entropy,awq_clip,rtn_dq}]
[--calibration_data_path CALIBRATION_DATA_PATH | --calibration_cache_path CALIBRATION_CACHE_PATH]
[--calibration_shapes CALIBRATION_SHAPES]
[--calibration_eps CALIBRATION_EPS [CALIBRATION_EPS ...]]
[--override_shapes OVERRIDE_SHAPES]
[--op_types_to_quantize OP_TYPES_TO_QUANTIZE [OP_TYPES_TO_QUANTIZE ...]]
[--op_types_to_exclude OP_TYPES_TO_EXCLUDE [OP_TYPES_TO_EXCLUDE ...]]
[--nodes_to_quantize NODES_TO_QUANTIZE [NODES_TO_QUANTIZE ...]]
[--nodes_to_exclude NODES_TO_EXCLUDE [NODES_TO_EXCLUDE ...]]
[--use_external_data_format]
[--keep_intermediate_files]
[--output_path OUTPUT_PATH]
[--log_level {DEBUG,INFO,WARNING,ERROR,debug,info,warning,error}]
[--log_file LOG_FILE]
[--trt_plugins TRT_PLUGINS]
[--trt_plugins_precision TRT_PLUGINS_PRECISION [TRT_PLUGINS_PRECISION ...]]
[--high_precision_dtype HIGH_PRECISION_DTYPE]
[--mha_accumulation_dtype MHA_ACCUMULATION_DTYPE]
[--disable_mha_qdq] [--dq_only]
[--use_zero_point USE_ZERO_POINT]
[--passes {concat_elimination} [{concat_elimination} ...]]
[--simplify]
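For example, a typical int8 PTQ run might look like the following (the model and calibration data paths are placeholders):
python -m modelopt.onnx.quantization \
    --onnx_path model.onnx \
    --calibration_data_path calib_data.npz \
    --quantize_mode int8 \
    --output_path quant.onnx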
Named Arguments
- --onnx_path
Input onnx model without Q/DQ nodes.
- --quantize_mode
Possible choices: fp8, int8, int4
Quantization mode for the given ONNX model.
Default:
'int8'
- --calibration_method
Possible choices: max, entropy, awq_clip, rtn_dq
Calibration method choices for fp8: {max (default)}, int8: {entropy (default), max}, int4: {awq_clip (default), rtn_dq}.
- --calibration_data_path
Calibration data in npz/npy format. If None, random data for calibration will be used.
- --calibration_cache_path
Pre-calculated activation tensor scaling factors aka calibration cache path.
- --calibration_shapes
Optional model input shapes for calibration. Users should provide the shapes specifically if the model has non-batch dynamic dimensions. Example input shapes spec: input0:1x3x256x256,input1:1x3x128x128
- --calibration_eps
Priority order for the execution providers (EP) to calibrate the model. Any subset of [‘trt’, ‘cuda:x’, ‘dml:x’, ‘cpu’], where ‘x’ is the device id. If a custom op is detected in the model, ‘trt’ will automatically be added to the EP list.
Default:
['cpu', 'cuda:0', 'trt']
- --override_shapes
Override model input shapes with static shapes. Example input shapes spec: input0:1x3x256x256,input1:1x3x128x128
- --op_types_to_quantize
A space-separated list of node types to quantize.
Default:
[]
- --op_types_to_exclude
A space-separated list of node types to exclude from quantization.
Default:
[]
- --nodes_to_quantize
A space-separated list of node names to quantize. Regular expressions are supported.
Default:
[]
- --nodes_to_exclude
A space-separated list of node names to exclude from quantization. Regular expressions are supported.
Default:
[]
- --use_external_data_format
If True, <MODEL_NAME>.onnx_data will be used to load and/or write weights and constants.
Default:
False
- --keep_intermediate_files
If True, keep the files generated during the ONNX models’ conversion/calibration. Otherwise, only the converted ONNX file is kept for the user.
Default:
False
- --output_path
Output filename to save the converted ONNX model. If None, save it in the same dir as the original ONNX model with an appropriate suffix.
- --log_level
Possible choices: DEBUG, INFO, WARNING, ERROR, debug, info, warning, error
Set the logging level for the quantization process.
Default:
'INFO'
- --log_file
Path to the log file for the quantization process.
- --trt_plugins
Specifies custom TensorRT plugin library paths in .so format (compiled shared library). For multiple paths, separate them with a semicolon, e.g. “lib_1.so;lib_2.so”. If this is not None, the TensorrtExecutionProvider is invoked, so make sure that the TensorRT libraries are in the PATH or LD_LIBRARY_PATH variables.
- --trt_plugins_precision
A space-separated list indicating the precision for each custom op. Each item should have the format <op_type>:<precision>, where precision can be fp32 (default) or fp16. For example: op_type_1:fp16 op_type_2:fp32.
- --high_precision_dtype
High precision data type, one of [‘fp32’, ‘fp16’]. The default is ‘fp32’ for int8 quantization and ‘fp16’ for other quantization modes.
- --mha_accumulation_dtype
Accumulation dtype of MHA, one of [‘fp32’, ‘fp16’]. This flag only takes effect when mha_accumulation_dtype == ‘fp32’ and quantize_mode == ‘fp8’.
Default:
'fp16'
- --disable_mha_qdq
If True, Q/DQ will not be added to MatMuls in MHA pattern.
Default:
False
- --dq_only
If True, FP32/FP16 weights will be converted to INT8/FP8 weights. Q nodes are removed from the weight inputs, so the output model contains only DQ nodes with the converted INT8/FP8 weights.
Default:
False
- --use_zero_point
If True, zero-point based quantization will be used; currently this is applicable to the awq_lite algorithm.
Default:
False
- --passes
Possible choices: concat_elimination
A space-separated list of optimization pass names. If set, the appropriate pre/post-processing passes will be invoked.
Default:
['concat_elimination']
- --simplify
If True, the given ONNX model will be simplified before quantization is performed.
Default:
False
If the model contains custom ops, enable calibration with the TensorRT Execution Provider backend with CUDA and CPU fallback (--calibration_eps trt cuda:0 cpu) and, if relevant, provide the location of the TensorRT plugin in .so format via the --trt_plugins flag.
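As an illustration, a run for a model with custom TensorRT ops might look like the following (the model and plugin paths are placeholders):
python -m modelopt.onnx.quantization \
    --onnx_path model_with_plugins.onnx \
    --calibration_data_path calib_data.npz \
    --calibration_eps trt cuda:0 cpu \
    --trt_plugins /path/to/libcustom_plugin.so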
By default, after running calibration, the quantization tool inserts the QDQ nodes following a TensorRT-friendly QDQ insertion algorithm. Users can change the default quantization behavior by tweaking API params such as op_types_to_quantize, op_types_to_exclude, etc. See modelopt.onnx.quantization.quantize() for details.
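For instance, the sketch below keeps certain op types and node-name patterns unquantized. The exclusion lists are purely illustrative, and the keyword names are assumed to mirror the CLI flags above; check quantize() in your installed version before relying on them.
import numpy as np

import modelopt.onnx.quantization as moq

calibration_data = np.load("calib_data.npz")  # placeholder path

moq.quantize(
    onnx_path="model.onnx",  # placeholder path
    calibration_data=calibration_data,
    output_path="quant_custom.onnx",
    quantize_mode="int8",
    op_types_to_exclude=["Resize"],  # illustrative: keep Resize nodes unquantized
    nodes_to_exclude=["/head/.*"],  # illustrative: regex on node names
)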
Deploy Quantized ONNX Model
trtexec is a command-line tool provided by TensorRT. Typically, it’s within the /usr/src/tensorrt/bin/ directory. Below is a simple command to compile the quantized ONNX model generated by the previous step into a TensorRT engine file.
trtexec --onnx=quant.onnx --saveEngine=quant.engine --best
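To benchmark the saved engine again later without rebuilding it, trtexec can also load a previously built engine via its --loadEngine option; verify the flag against your TensorRT version.
trtexec --loadEngine=quant.engine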
Compare the performance
The following command builds an engine from the original model using fp16 precision. After building, compare the reported “Latency” and “Throughput” fields against those of the quantized engine.
trtexec --onnx=original.onnx --saveEngine=fp16.engine --fp16
Note
If you replace the --fp16 flag with the --best flag, this command will create an int8 engine with TensorRT’s implicit quantization.