PyTorch Quantization

ModelOpt's PyTorch quantization is a refactored version of pytorch_quantization.

Key advantages offered by ModelOpt’s PyTorch quantization:

  1. Support for advanced quantization formats, e.g., block-wise INT4 and FP8.

  2. Native support for LLM models in Hugging Face and NeMo.

  3. Advanced quantization algorithms, e.g., SmoothQuant and AWQ.

  4. Deployment support to ONNX and NVIDIA TensorRT.

Note

ModelOpt quantization is fake quantization, which means it only simulates the low-precision computation in PyTorch. Real speedups and memory savings are obtained only after exporting the model to a deployment framework.
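To make the idea of fake quantization concrete, here is a minimal pure-Python sketch of a quantize-dequantize step for a signed 8-bit format. The function name `fake_quantize` and its parameters are hypothetical and chosen for illustration only; this is not ModelOpt's implementation, just the general technique under the assumption of a symmetric integer grid scaled by a calibrated maximum (`amax`).

```python
def fake_quantize(values, amax, num_bits=8):
    """Quantize-dequantize: snap each value to a signed integer grid, then map it back to float.

    Hypothetical helper for illustration; not part of ModelOpt.
    """
    max_int = 2 ** (num_bits - 1) - 1      # 127 for 8 bits
    scale = amax / max_int                 # float distance between adjacent integer levels
    result = []
    for v in values:
        q = round(v / scale)               # nearest integer level
        q = max(-max_int, min(max_int, q)) # clamp to the representable range
        result.append(q * scale)           # back to float: storage and math stay high precision
    return result


x = [0.0, 0.5, -1.0, 3.0]
y = fake_quantize(x, amax=1.0)
```

The outputs are still ordinary floats, so nothing runs faster or uses less memory; only the rounding and clamping error of the low-precision format is reproduced. In this sketch the value 3.0 lies outside the calibrated range and is clamped to approximately 1.0.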

Tip

This guide covers the usage of ModelOpt quantization. For details on the quantization formats and recommended use cases, please refer to Quantization Formats.

Apply Post Training Quantization (PTQ)

PTQ can be achieved with simple calibration on a small set of training or evaluation data (typically 128-512 samples) after converting a regular PyTorch model to a quantized model. The simplest way to quantize a model using ModelOpt is to use mtq.quantize().

mtq.quantize() takes a model, a quantization config and a forward loop callable as input. The quantization config specifies the layers to quantize, their quantization formats as well as the algorithm to use for calibration. Please refer to Quantization Configs for the list of quantization configs supported by default. You may also define your own quantization config as described in customizing quantizer config.

ModelOpt supports algorithms such as AWQ, SmoothQuant or max for calibration. Please refer to mtq.calibrate for more details.

The forward loop is used to pass data through the model in order to collect statistics for calibration. It should wrap around the calibration dataloader and the model.
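The role of the forward loop can be illustrated with a small pure-Python sketch of max calibration. `MaxCalibrator` and `ToyQuantizedModel` are hypothetical stand-ins invented for this illustration; the sketch assumes that a quantizer in calibration mode simply records the largest absolute value it sees, and it is not how ModelOpt implements calibration internally.

```python
class MaxCalibrator:
    """Tracks the largest absolute value seen across all calibration batches (hypothetical)."""

    def __init__(self):
        self.amax = 0.0

    def observe(self, batch):
        self.amax = max(self.amax, max(abs(v) for v in batch))


class ToyQuantizedModel:
    """Stand-in for a quantized model whose input quantizer is in calibration mode."""

    def __init__(self):
        self.input_calibrator = MaxCalibrator()

    def __call__(self, batch):
        self.input_calibrator.observe(batch)  # only statistics are collected here
        return batch


# Toy calibration data: three batches of two values each
data_loader = [[0.1, -0.4], [2.5, 0.3], [-1.2, 0.9]]


def forward_loop(model):
    # Same shape as the forward loop described above: it wraps the data loader and the model
    for batch in data_loader:
        model(batch)


model = ToyQuantizedModel()
forward_loop(model)

amax = model.input_calibrator.amax
scale = amax / 127  # step size a signed 8-bit quantizer would use under this assumption
```

The forward loop returns nothing; its only job is to push representative data through the model so that every quantizer can observe realistic value ranges.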

Here is an example of performing PTQ using ModelOpt:

import modelopt.torch.quantization as mtq

# Setup the model
model = get_model()

# Select quantization config
config = mtq.INT8_SMOOTHQUANT_CFG

# Quantization needs calibration data. Set up the calibration data loader.
# An example of creating a calibration data loader looks like the following:
data_loader = get_dataloader(num_samples=calib_size)


# Define forward_loop. Please wrap the data loader in the forward_loop
def forward_loop(model):
    for batch in data_loader:
        model(batch)


# Quantize the model and perform calibration (PTQ)
model = mtq.quantize(model, config, forward_loop)

To verify that the quantizer nodes are placed correctly in the model, print the quantized model summary as shown below:

# Print quantization summary after successfully quantizing the model with mtq.quantize
# This will show the quantizers inserted in the model and their configurations
mtq.print_quantization_summary(model)

After PTQ, the model can be exported to ONNX with the normal PyTorch ONNX export flow.

import torch

torch.onnx.export(model, sample_input, onnx_file)

ModelOpt also supports direct export of Hugging Face or NeMo LLM models to TensorRT-LLM for deployment. Please see TensorRT-LLM Deployment for more details.

Quantization-aware Training (QAT)

QAT is the technique of fine-tuning a quantized model to recover the model quality lost to quantization. While QAT requires much more compute resources than PTQ, it is highly effective in recovering model quality.

A model quantized using mtq.quantize() can be directly fine-tuned with QAT. Typically during QAT, the quantizer states are frozen and the model weights are fine-tuned.

Here is an example of performing QAT:

import modelopt.torch.quantization as mtq

# Select quantization config
config = mtq.INT8_DEFAULT_CFG


# Define forward loop for calibration
def forward_loop(model):
    for data in calib_set:
        model(data)


# Replace regular modules with quantized modules and calibrate them before QAT
model = mtq.quantize(model, config, forward_loop)

# Fine-tune with original training pipeline
# Adjust learning rate and training duration
train(model, train_loader, optimizer, scheduler, ...)

Tip

We recommend running QAT for about 10% of the original training epochs. For LLMs, we find that QAT fine-tuning for even less than 1% of the original pre-training duration is often sufficient to recover the model quality.

Storing and loading a quantized model

The model weights and quantizer states need to be saved for future use or to resume training. The quantizer states of the model should be saved and loaded separately from the model weights.

mto.modelopt_state() provides the quantizer states of the model. The quantizer states can be saved with torch.save. For example:

import torch
import modelopt.torch.opt as mto

# Save quantizer states
torch.save(mto.modelopt_state(model), "modelopt_state.pt")

# Save model weights using torch.save or custom check-pointing function
# trainer.save_model("model.pt")
torch.save(model.state_dict(), "model.pt")

To restore a quantized model, first restore the quantizer states using mto.restore_from_modelopt_state. After quantizer states are restored, load the model weights. For example:

import torch
import modelopt.torch.opt as mto

# Initialize the un-quantized model
model = ...

# Load quantizer states
model = mto.restore_from_modelopt_state(model, torch.load("modelopt_state.pt"))

# Load model weights using torch.load or custom check-pointing function
# model.from_pretrained("model.pt")
model.load_state_dict(torch.load("model.pt"))

Advanced Topics

TensorQuantizer

Under the hood, mtq.quantize() inserts TensorQuantizer modules (quantizer modules) into model layers such as linear and convolution layers, and patches their forward methods to perform quantization.

To create a TensorQuantizer instance, you need to specify a QuantDescriptor, which describes quantization parameters such as the number of bits and the quantization axis.

Here is an example of creating a quantizer module:

from modelopt.torch.quantization.tensor_quant import QuantDescriptor
from modelopt.torch.quantization.nn import TensorQuantizer

# Create quantizer descriptor
quant_desc = QuantDescriptor(num_bits=8, axis=(-1,), unsigned=True)

# Create quantizer module
quantizer = TensorQuantizer(quant_desc)

quant_x = quantizer(x)  # Quantize input x

Customize quantizer config

ModelOpt inserts an input quantizer, a weight quantizer and an output quantizer into common layers, but disables the output quantizer by default. Expert users who want to customize the default quantizer configuration can update the config dictionary passed to mtq.quantize, using wildcard patterns or filter functions to match the quantizers to configure.

Here is an example of specifying a custom quantizer configuration to mtq.quantize:

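import copy

# Select quantization config. Deep-copy it so that the nested "quant_cfg"
# dictionary of the shared default config is not modified in place.
config = copy.deepcopy(mtq.INT8_DEFAULT_CFG)
# Enable output quantizer for bmm layer
config["quant_cfg"]["*.bmm.output_quantizer"] = {"enable": True}

# Perform PTQ/QAT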
model = mtq.quantize(model, config, forward_loop)
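The way wildcard and filter-function keys select quantizers can be sketched in plain Python with the standard-library fnmatch module. The helper `resolve_quantizer_cfg`, the quantizer names, and the rule that later entries override earlier ones are all assumptions made for this illustration; they are not taken from ModelOpt's source and may differ from its actual resolution logic.

```python
from fnmatch import fnmatch


def resolve_quantizer_cfg(quantizer_name, quant_cfg):
    """Merge the settings of every key that matches the quantizer name.

    Hypothetical helper. Keys are either wildcard strings or filter functions.
    Assumption of this sketch: later entries override earlier ones.
    """
    resolved = {}
    for key, settings in quant_cfg.items():
        matched = key(quantizer_name) if callable(key) else fnmatch(quantizer_name, key)
        if matched:
            resolved.update(settings)
    return resolved


# Toy config in the spirit of the example above (values are illustrative)
quant_cfg = {
    "*weight_quantizer": {"num_bits": 8, "enable": True},
    "*input_quantizer": {"num_bits": 8, "enable": True},
    "*output_quantizer": {"enable": False},
    "*.bmm.output_quantizer": {"enable": True},
}

bmm_cfg = resolve_quantizer_cfg("layers.0.attn.bmm.output_quantizer", quant_cfg)
fc_cfg = resolve_quantizer_cfg("layers.0.mlp.fc1.output_quantizer", quant_cfg)
```

In this sketch the bmm output quantizer ends up enabled because the more specific pattern is listed last, while the output quantizer of the hypothetical `fc1` layer stays disabled.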

Custom quantized module and quantizer placement

modelopt.torch.quantization has a default set of quantized modules (see modelopt.torch.quantization.nn.modules for a detailed list) and quantizer placement rules (input, output and weight quantizers). However, there might be cases where you want to define a custom quantized module and/or customize the quantizer placement.

ModelOpt provides a way to define custom quantized modules and register them with the quantization framework. This allows you to:

  1. Handle unsupported modules, e.g., a subclassed Linear layer that requires quantization.

  2. Customize the quantizer placement, e.g., placing the quantizer in special places like the KV Cache of an Attention layer.

Here is an example of defining a custom quantized LayerNorm module:

import torch.nn as nn
import torch.nn.functional as F
from modelopt.torch.quantization.nn import TensorQuantizer


class QuantLayerNorm(nn.LayerNorm):
    def __init__(self, normalized_shape):
        super().__init__(normalized_shape)
        self._setup()

    def _setup(self):
        # Method to setup the quantizers
        self.input_quantizer = TensorQuantizer()
        self.weight_quantizer = TensorQuantizer()

    def forward(self, input):
        # You can customize the quantizer placement anywhere in the forward method
        input = self.input_quantizer(input)
        weight = self.weight_quantizer(self.weight)
        return F.layer_norm(input, self.normalized_shape, weight, self.bias, self.eps)

After defining the custom quantized module, you need to register it so that the mtq.quantize API automatically replaces the original module with the quantized version. Note that the custom QuantLayerNorm must have a _setup method that instantiates the quantizer attributes used in the forward method. Here is the code to register the custom quantized module:

import modelopt.torch.quantization as mtq

# Register the custom quantized module
mtq.register(original_cls=nn.LayerNorm, quantized_cls=QuantLayerNorm)

# Perform PTQ
# nn.LayerNorm modules in the model will be replaced with the QuantLayerNorm module
model = mtq.quantize(model, config, forward_loop)

The quantization config might need to be customized if you define a custom quantized module. Please see customizing quantizer config for more details.

Fast evaluation

Weight folding avoids re-quantizing the weights on every inference forward pass and therefore speeds up evaluation. This can be done with the following code:

# Fold quantizer together with weight tensor
mtq.fold_weight(quantized_model)

# Run model evaluation
user_evaluate_func(quantized_model)

Note

After weight folding, the model can no longer be exported to ONNX or fine-tuned with QAT.
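The effect of weight folding can be illustrated with a toy layer in pure Python. `ToyQuantLinear` and its attributes are hypothetical and exist only for this sketch, which assumes a signed 8-bit quantize-dequantize step; it shows the general idea of computing the fake-quantized weights once and reusing them, and it is not ModelOpt's implementation of mtq.fold_weight.

```python
class ToyQuantLinear:
    """Toy dot-product layer that fake-quantizes its weights (hypothetical, for illustration)."""

    def __init__(self, weights, amax):
        self.weights = list(weights)
        self.amax = amax
        self.weight_quantizer_enabled = True
        self.quantize_calls = 0  # counts how often the weights are quantized

    def _fake_quantize(self, w):
        self.quantize_calls += 1
        scale = self.amax / 127
        return [max(-127, min(127, round(v / scale))) * scale for v in w]

    def forward(self, x):
        # Without folding, the weights are re-quantized on every call
        if self.weight_quantizer_enabled:
            w = self._fake_quantize(self.weights)
        else:
            w = self.weights
        return sum(wi * xi for wi, xi in zip(w, x))

    def fold_weight(self):
        # Quantize once, overwrite the stored weights, and switch the quantizer off
        self.weights = self._fake_quantize(self.weights)
        self.weight_quantizer_enabled = False


layer = ToyQuantLinear([0.25, -0.5, 1.0], amax=1.0)
before = layer.forward([1.0, 2.0, 3.0])

layer.fold_weight()
calls_after_fold = layer.quantize_calls
after = [layer.forward([1.0, 2.0, 3.0]) for _ in range(3)]
```

The outputs before and after folding agree, but the quantization counter stops increasing once the weights are folded. Note that in this sketch the original full-precision weights are overwritten, which gives some intuition for why a folded model is no longer a good starting point for further fine-tuning.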