Quick Start: Quantization (Windows)

Quantization is a crucial technique for reducing memory usage and speeding up inference in deep learning models.

The ONNX quantization API in ModelOpt-Windows offers advanced Post-Training Quantization (PTQ) options like Activation-Aware Quantization (AWQ).

ONNX Model Quantization (PTQ)

The ONNX quantization API requires a model and calibration data, along with quantization settings such as the algorithm and calibration EPs. Here’s an example implementing INT4 AWQ:

import os
import onnx
from modelopt.onnx.quantization.int4 import quantize as quantize_int4
# import other packages as needed
# get_calib_inputs is a user-defined helper (see the example script) that prepares calibration data
calib_inputs = get_calib_inputs(dataset, model_name, cache_dir, calib_size, batch_size,...)
# use_random_calib toggles calibration with random data instead of the prepared dataset
quantized_onnx_model = quantize_int4(
    onnx_path,
    calibration_method="awq_lite",
    calibration_data_reader=None if use_random_calib else calib_inputs,
    calibration_eps=["dml", "cpu"]
)
onnx.save_model(
    quantized_onnx_model,
    output_path,
    save_as_external_data=True,
    location=os.path.basename(output_path) + "_data",
    size_threshold=0,
)
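
As an optional sanity check (not part of the required workflow), you can reload the saved model and confirm that DequantizeLinear (DQ) nodes were inserted; the short sketch below assumes the output_path used above.

# Optional check: count the DequantizeLinear (DQ) nodes inserted by quantization.
check_model = onnx.load(output_path)
dq_nodes = [node for node in check_model.graph.node if node.op_type == "DequantizeLinear"]
print(f"Found {len(dq_nodes)} DequantizeLinear nodes")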

See modelopt.onnx.quantization.quantize_int4 for details about the quantization API.

Refer to the Feature Support Matrix for details about supported features and models.

To learn more about ONNX PTQ, refer to ONNX Quantization - Windows and the example script.

Deployment

The quantized ONNX model is deployment-ready, just like a standard ONNX model. ModelOpt-Windows uses ONNX’s DequantizeLinear (DQ) nodes, which support the INT4 data type from opset version 21 onward, so ensure the model’s opset version is 21 or higher. Refer to Apply Post Training Quantization (PTQ) for details.

# Upgrade the opset to 21 if it is lower, e.g. via an upgrade_opset_to_21() helper
# (one possible implementation is sketched after this snippet).
quantized_onnx_model = upgrade_opset_to_21(quantized_onnx_model)
onnx.save_model(
    quantized_onnx_model,
    output_path,
    save_as_external_data=True,
    location=os.path.basename(output_path) + "_data",
    size_threshold=0,
)
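
The upgrade_opset_to_21() helper is not part of ModelOpt-Windows; it is a placeholder for whatever opset-upgrade step your pipeline uses. One possible implementation, sketched below, relies on ONNX’s built-in version converter (this is an assumption; very large models with external data may need a different approach):

from onnx import version_converter

def upgrade_opset_to_21(model):
    # Hypothetical helper, not a ModelOpt-Windows API: bump the default-domain
    # opset to 21 if it is lower, using ONNX's version converter.
    for opset in model.opset_import:
        if opset.domain in ("", "ai.onnx") and opset.version >= 21:
            return model  # already at opset 21 or higher
    return version_converter.convert_version(model, 21)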

Deploy the quantized model using the DirectML backend. For detailed deployment instructions, see the DirectML Deployment guide.
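
As an illustration, the minimal sketch below runs the quantized model with ONNX Runtime on the DirectML execution provider; it assumes the onnxruntime-directml package is installed and that output_path points to the saved quantized model (the input names and values are model-specific).

import onnxruntime as ort

# Create an inference session on the DirectML execution provider,
# falling back to CPU for any unsupported operators.
session = ort.InferenceSession(
    output_path,
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)
# outputs = session.run(None, {...})  # supply model-specific input tensors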

Note

The ready-to-deploy, optimized ONNX models from ModelOpt-Windows are available in the NVIDIA collections on Hugging Face.