ONNX Quantization - Linux (Beta)
ModelOpt provides ONNX quantization that works together with TensorRT Explicit Quantization (EQ). The key advantages offered by ModelOpt’s ONNX quantization are:
Easy to use for non-expert users.
White-box design allowing expert users to customize the quantization process.
Better support for vision transformers.
Currently, ONNX quantization supports FP8, INT8, and INT4 modes.
Note
ModelOpt ONNX quantization generates new ONNX models with QDQ nodes following TensorRT rules. For real speedup, the generated ONNX model should be compiled into a TensorRT engine.
Requirements
Execution Provider | Requirements
---|---
CPU |
CUDA |
TensorRT |
Apply Post Training Quantization (PTQ)
PTQ should be done with a calibration dataset. If a calibration dataset is not provided, ModelOpt will use random scales for the QDQ nodes.
Prepare calibration dataset
ModelOpt supports npz/npy files as the calibration data format. The numpy file should be either a single array (for single-input models) or a dictionary with model input names as keys and numpy arrays as values.
import numpy as np

# Example numpy file for a single-input ONNX model
calib_data = np.random.randn(batch_size, channels, h, w)
np.save("calib_data.npy", calib_data)

# Example numpy file for a single- or multi-input ONNX model
# Dict keys should match the input names of the ONNX model
calib_data = {
    "input_name": np.random.randn(*shape),
    "input_name2": np.random.randn(*shape2),
}
np.savez("calib_data.npz", **calib_data)
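In practice, calibration data is usually drawn from a small set of representative inputs rather than random tensors. Below is a minimal sketch assuming an image model; the preprocessing, the file list, and the input name "input" are placeholders for your own pipeline, and Pillow is assumed to be available.
import numpy as np
from PIL import Image  # assumption: Pillow is installed

# Collect representative samples and apply the same preprocessing
# that the model expects at inference time.
samples = []
for path in ["img_0.jpg", "img_1.jpg"]:  # placeholder file list
    img = Image.open(path).convert("RGB").resize((224, 224))
    arr = np.asarray(img, dtype=np.float32) / 255.0  # scale to [0, 1]
    samples.append(arr.transpose(2, 0, 1))  # HWC -> CHW

# The dict key must match the ONNX model's input name ("input" is a placeholder).
np.savez("calib_data.npz", **{"input": np.stack(samples)})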
Call PTQ function
import numpy as np

import modelopt.onnx.quantization as moq

# Load the calibration data prepared in the previous step
calibration_data = np.load(calibration_data_path)

moq.quantize(
    onnx_path=onnx_path,
    calibration_data=calibration_data,
    output_path="quant.onnx",
    quantize_mode="int8",
)
Alternatively, you can call the PTQ function from the command line:
usage: python -m modelopt.onnx.quantization [-h] --onnx_path ONNX_PATH
[--quantize_mode {fp8,int8,int4}]
[--calibration_method {max,entropy,awq_clip,rtn_dq}]
[--calibration_data_path CALIBRATION_DATA_PATH | --calibration_cache_path CALIBRATION_CACHE_PATH]
[--calibration_shapes CALIBRATION_SHAPES]
[--calibration_eps CALIBRATION_EPS [CALIBRATION_EPS ...]]
[--override_shapes OVERRIDE_SHAPES]
[--op_types_to_quantize OP_TYPES_TO_QUANTIZE [OP_TYPES_TO_QUANTIZE ...]]
[--op_types_to_exclude OP_TYPES_TO_EXCLUDE [OP_TYPES_TO_EXCLUDE ...]]
[--nodes_to_quantize NODES_TO_QUANTIZE [NODES_TO_QUANTIZE ...]]
[--nodes_to_exclude NODES_TO_EXCLUDE [NODES_TO_EXCLUDE ...]]
[--use_external_data_format]
[--keep_intermediate_files]
[--output_path OUTPUT_PATH]
[--log_level {DEBUG,INFO,WARNING,ERROR,debug,info,warning,error}]
[--log_file LOG_FILE]
[--trt_plugins TRT_PLUGINS]
[--trt_plugins_precision TRT_PLUGINS_PRECISION [TRT_PLUGINS_PRECISION ...]]
[--high_precision_dtype HIGH_PRECISION_DTYPE]
[--mha_accumulation_dtype MHA_ACCUMULATION_DTYPE]
[--disable_mha_qdq] [--dq_only]
[--use_zero_point USE_ZERO_POINT]
[--passes {concat_elimination} [{concat_elimination} ...]]
[--simplify]
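For example, a typical int8 PTQ run might look like the following (the model and calibration data paths are placeholders):
python -m modelopt.onnx.quantization \
    --onnx_path model.onnx \
    --calibration_data_path calib_data.npz \
    --quantize_mode int8 \
    --output_path quant.onnx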
Named Arguments
- --onnx_path
Input onnx model without Q/DQ nodes.
- --quantize_mode
Possible choices: fp8, int8, int4
Quantization mode for the given ONNX model.
Default:
'int8'
- --calibration_method
Possible choices: max, entropy, awq_clip, rtn_dq
Calibration method choices for fp8: {max (default)}, int8: {entropy (default), max}, int4: {awq_clip (default), rtn_dq}.
- --calibration_data_path
Calibration data in npz/npy format. If None, random data for calibration will be used.
- --calibration_cache_path
Pre-calculated activation tensor scaling factors aka calibration cache path.
- --calibration_shapes
Optional model input shapes for calibration. Users should provide the shapes specifically if the model has non-batch dynamic dimensions. Example input shapes spec: input0:1x3x256x256,input1:1x3x128x128
- --calibration_eps
Priority order for the execution providers (EP) to calibrate the model. Any subset of [‘trt’, ‘cuda:x’, ‘dml:x’, ‘cpu’], where ‘x’ is the device id. If a custom op is detected in the model, ‘trt’ will automatically be added to the EP list.
Default:
['cpu', 'cuda:0', 'trt']
- --override_shapes
Override model input shapes with static shapes. Example input shapes spec: input0:1x3x256x256,input1:1x3x128x128
- --op_types_to_quantize
A space-separated list of node types to quantize.
Default:
[]
- --op_types_to_exclude
A space-separated list of node types to exclude from quantization.
Default:
[]
- --nodes_to_quantize
A space-separated list of node names to quantize. Regular expressions are supported.
Default:
[]
- --nodes_to_exclude
A space-separated list of node names to exclude from quantization. Regular expressions are supported.
Default:
[]
- --use_external_data_format
If True, <MODEL_NAME>.onnx_data will be used to load and/or write weights and constants.
Default:
False
- --keep_intermediate_files
If True, keep the files generated during the ONNX models’ conversion/calibration. Otherwise, only the converted ONNX file is kept for the user.
Default:
False
- --output_path
Output filename to save the converted ONNX model. If None, save it in the same dir as the original ONNX model with an appropriate suffix.
- --log_level
Possible choices: DEBUG, INFO, WARNING, ERROR, debug, info, warning, error
Set the logging level for the quantization process.
Default:
'INFO'
- --log_file
Path to the log file for the quantization process.
- --trt_plugins
Specifies custom TensorRT plugin library paths in .so format (compiled shared library). For multiple paths, separate them with a semicolon, e.g. “lib_1.so;lib_2.so”. If this is not None, the TensorrtExecutionProvider is invoked, so make sure that the TensorRT libraries are in the PATH or LD_LIBRARY_PATH variables.
- --trt_plugins_precision
A space-separated list indicating the precision for each custom op. Each item should have the format <op_type>:<precision>, where precision can be fp32 (default) or fp16. For example: op_type_1:fp16 op_type_2:fp32.
- --high_precision_dtype
High precision data type, one of [‘fp32’, ‘fp16’]. The default is ‘fp32’ for int8 quantization and ‘fp16’ for other quantization modes.
- --mha_accumulation_dtype
Accumulation dtype of MHA, one of [‘fp32’, ‘fp16’]. This flag only takes effect when mha_accumulation_dtype == ‘fp32’ and quantize_mode == ‘fp8’.
Default:
'fp16'
- --disable_mha_qdq
If True, Q/DQ will not be added to MatMuls in MHA pattern.
Default:
False
- --dq_only
If True, FP32/FP16 weights will be converted to INT8/FP8 weights. Q nodes are removed from the weight inputs, so the output model contains only DQ nodes with the converted INT8/FP8 weights.
Default:
False
- --use_zero_point
If True, zero-point based quantization will be used; currently this is applicable to the awq_lite algorithm.
Default:
False
- --passes
Possible choices: concat_elimination
A space-separated list of optimization pass names. If set, the appropriate pre/post-processing passes will be invoked.
Default:
['concat_elimination']
- --simplify
If True, the given ONNX model will be simplified before quantization is performed.
Default:
False
If the model contains custom ops, enable calibration with the TensorRT Execution Provider backend with CUDA and CPU fallback (--calibration_eps trt cuda:0 cpu) and, if relevant, provide the location of the TensorRT plugin in .so format via the --trt_plugins flag.
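As an illustration, a run for a model with custom TensorRT ops might look like the following (the model and plugin paths are placeholders):
python -m modelopt.onnx.quantization \
    --onnx_path model_with_plugins.onnx \
    --calibration_data_path calib_data.npz \
    --calibration_eps trt cuda:0 cpu \
    --trt_plugins /path/to/libcustom_plugin.so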
By default, after running calibration, the quantization tool inserts the QDQ nodes following a TensorRT-friendly QDQ insertion algorithm. Users can change the default quantization behavior by tweaking API params such as op_types_to_quantize, op_types_to_exclude, etc. See modelopt.onnx.quantization.quantize() for details.
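For instance, the sketch below keeps certain op types and node-name patterns unquantized. The exclusion lists are purely illustrative, and the keyword names are assumed to mirror the CLI flags above; check quantize() in your installed version before relying on them.
import numpy as np

import modelopt.onnx.quantization as moq

calibration_data = np.load("calib_data.npz")  # placeholder path

moq.quantize(
    onnx_path="model.onnx",  # placeholder path
    calibration_data=calibration_data,
    output_path="quant_custom.onnx",
    quantize_mode="int8",
    op_types_to_exclude=["Resize"],  # illustrative: keep Resize nodes unquantized
    nodes_to_exclude=["/head/.*"],  # illustrative: regex on node names
)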
Deploy Quantized ONNX Model
trtexec is a command-line tool provided by TensorRT. Typically, it’s within the /usr/src/tensorrt/bin/ directory. Below is a simple command to compile the quantized ONNX model generated by the previous step into a TensorRT engine file.
trtexec --onnx=quant.onnx --saveEngine=quant.engine --best
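To benchmark the saved engine again later without rebuilding it, trtexec can also load a previously built engine via its --loadEngine option; verify the flag against your TensorRT version.
trtexec --loadEngine=quant.engine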
Compare the performance
The following command builds an engine from the original model using fp16 precision. After building, compare the reported “Latency” and “Throughput” fields against those of the quantized engine.
trtexec --onnx=original.onnx --saveEngine=fp16.engine --fp16
Note
If you replace the --fp16 flag with the --best flag, this command will create an int8 engine with TensorRT’s implicit quantization.