Quick Start: Quantization

Quantization is an effective technique for reducing the memory footprint of deep learning models and accelerating inference.

ModelOpt’s mtq.quantize() API lets users quantize a model with advanced algorithms such as SmoothQuant and AWQ. ModelOpt supports both Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT).

Tip

Please refer to Quantization Formats for details on the quantization formats supported by ModelOpt and their use cases.

PTQ for PyTorch models

mtq.quantize() requires the model, the appropriate quantization configuration, and a forward loop as inputs. Here is a quick example of quantizing a model with INT8 SmoothQuant using mtq.quantize():

import modelopt.torch.quantization as mtq

# Setup the model
model = get_model()

# The quantization algorithm requires calibration data. Below we show a rough example of how to
# set up a calibration data loader with the desired calib_size
calib_size = 512  # example calibration set size; adjust for your model and dataset
data_loader = get_dataloader(num_samples=calib_size)


# Define the forward_loop function with the model as input. The data loader should be wrapped
# inside the function.
def forward_loop(model):
    for batch in data_loader:
        model(batch)


# Quantize the model and perform calibration (PTQ)
model = mtq.quantize(model, mtq.INT8_SMOOTHQUANT_CFG, forward_loop)

Refer to Quantization Configs for the quantization configurations available from ModelOpt.
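
Switching formats only requires passing a different configuration to the same call. As a rough sketch, the snippet below applies INT4 AWQ instead of INT8 SmoothQuant; the config name mtq.INT4_AWQ_CFG is an assumption, so check Quantization Configs for the exact names available in your ModelOpt version.

# Same model and forward_loop as above; only the configuration changes.
# mtq.INT4_AWQ_CFG is assumed here -- see Quantization Configs for the exact config names.
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)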

Deployment

The quantized model behaves just like a regular PyTorch model and is ready for evaluation or deployment.
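
For example, the quantized model can be evaluated with the same code used for the original model. The sketch below assumes hypothetical get_eval_dataloader() and compute_accuracy() helpers, which are placeholders rather than ModelOpt APIs.

import torch

# Evaluate the quantized model exactly like a regular PyTorch model.
# get_eval_dataloader() and compute_accuracy() are placeholder helpers.
model.eval()
with torch.no_grad():
    for batch in get_eval_dataloader():
        outputs = model(batch)
        compute_accuracy(outputs, batch)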

Hugging Face or NeMo LLM models can be exported to TensorRT-LLM using ModelOpt. Please see the TensorRT-LLM Deployment guide for more details.

The model can also be exported to ONNX using torch.onnx.export.
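
A minimal export sketch is shown below; the input shape, file name, and opset version are illustrative placeholders, and some models may need additional export arguments.

import torch

# Dummy input matching the model's expected input shape (placeholder shape).
dummy_input = torch.randn(1, 3, 224, 224)

# Export the quantized model with the standard PyTorch ONNX exporter.
torch.onnx.export(model, dummy_input, "quantized_model.onnx", opset_version=17)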


Next Steps
  • Learn more about quantization and advanced usage of Model Optimizer quantization in the Quantization guide.

  • Check out the end-to-end examples on GitHub for PTQ and QAT here.