Quick Start: Quantization (Windows)

Quantization is a crucial technique for reducing memory usage and speeding up inference in deep learning models.

The ONNX quantization API in ModelOpt-Windows offers advanced Post-Training Quantization (PTQ) options like Activation-Aware Quantization (AWQ).

ONNX Model Quantization (PTQ)

The ONNX quantization API requires a model, calibration data, and quantization settings such as the algorithm and the calibration EPs (execution providers). Here's an example snippet that applies INT4 AWQ quantization:

import os

import onnx

from modelopt.onnx.quantization.int4 import quantize as quantize_int4
# import other packages as needed

# Prepare calibration inputs (get_calib_inputs is provided in the example scripts)
calib_inputs = get_calib_inputs(dataset, model_name, cache_dir, calib_size, batch_size,...)

# Quantize the base ONNX model with the AWQ (awq_lite) algorithm,
# calibrating on the DirectML EP with a CPU fallback
quantized_onnx_model = quantize_int4(
    onnx_path,
    calibration_method="awq_lite",
    calibration_data_reader=None if use_random_calib else calib_inputs,
    calibration_eps=["dml", "cpu"],
)

# Save the quantized model with external data (weights stored alongside the .onnx file)
onnx.save_model(
    quantized_onnx_model,
    output_path,
    save_as_external_data=True,
    location=os.path.basename(output_path) + "_data",
    size_threshold=0,
)

See modelopt.onnx.quantization.quantize_int4 for details about the INT4 quantization API.

Refer to the Support Matrix for details about supported features and models.

To learn more about ONNX PTQ, refer to ONNX Quantization - Windows and the example scripts.

Deployment

The quantized ONNX model can be deployed with frameworks like ONNX Runtime. Ensure that the model's opset is 19 or higher for FP8 quantization and 21 or higher for INT4 quantization; ONNX's Q/DQ nodes require these minimum opsets to support the FP8 and INT4 data types. Refer to Apply Post Training Quantization (PTQ) for details.
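
As a quick sanity check, the model's current opset can be inspected with the standard onnx API before deciding whether an upgrade is required (a minimal sketch; "model.onnx" is a placeholder path):

import onnx

# Load the model and print the opset version of the default ONNX domain
model = onnx.load("model.onnx")
for opset in model.opset_import:
    if opset.domain in ("", "ai.onnx"):
        print("Default ONNX opset:", opset.version)  # should be >= 21 for INT4, >= 19 for FP8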

# Implement a helper (say, an upgrade_opset() method) to upgrade or patch the model's opset, if needed.
# The opset upgrade, if needed, can be applied to either the base ONNX model or the quantized model.
# Finally, save the quantized model.

quantized_onnx_model = upgrade_opset(quantized_onnx_model)
onnx.save_model(
    quantized_onnx_model,
    output_path,
    save_as_external_data=True,
    location=os.path.basename(output_path) + "_data",
    size_threshold=0,
)
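
The upgrade_opset() helper above is a placeholder, not part of ModelOpt-Windows. A minimal sketch using onnx.version_converter is shown below, assuming the converter supports all operators in the model; some models may instead require patching the opset import directly.

import onnx
from onnx import version_converter

def upgrade_opset(model: onnx.ModelProto, target_opset: int = 21) -> onnx.ModelProto:
    """Upgrade the default-domain opset of the model to target_opset, if it is lower (hypothetical helper)."""
    current = next(
        (op.version for op in model.opset_import if op.domain in ("", "ai.onnx")), None
    )
    if current is not None and current >= target_opset:
        return model  # already new enough
    # version_converter re-emits the graph with ops converted to the target opset
    return version_converter.convert_version(model, target_opset)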

For detailed instructions on deploying quantized models with the DirectML backend (ORT-DML), see the DirectML deployment guide. Also refer to the example scripts for any model-specific inference guidance.
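
A minimal inference sketch with ONNX Runtime and the DirectML execution provider is shown below, assuming an onnxruntime build that includes DmlExecutionProvider (e.g., the onnxruntime-directml package); the model path, input name, and shapes are placeholders:

import numpy as np
import onnxruntime as ort

# Create a session with the DirectML EP, falling back to CPU for unsupported ops
session = ort.InferenceSession(
    "quantized_model.onnx",  # placeholder path to the quantized model
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)

# Inspect the expected inputs and feed real data for your model
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)

# Example run with a single input named "input_ids" (placeholder; use your model's actual inputs)
outputs = session.run(None, {"input_ids": np.zeros((1, 8), dtype=np.int64)})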

Note

Ready-to-deploy optimized ONNX models from ModelOpt-Windows are available in the HuggingFace NVIDIA collections.