config

This document lists the quantization formats supported by Model Optimizer and example quantization configs.

Quantization Formats

The following table lists the quantization formats supported by Model Optimizer and the corresponding quantization config. See Quantization Configs for the specific quantization config definitions.

Please see choosing the right quantization formats to learn more about the formats and their use-cases.

Note

The recommended configs given below are for LLM models. For CNN models, only INT8 quantization is supported. Please use quantization config INT8_DEFAULT_CFG for CNN models.

Quantization Format	Model Optimizer config
INT8	`INT8_SMOOTHQUANT_CFG`
FP8	`FP8_DEFAULT_CFG`
INT4 Weights only AWQ (W4A16)	`INT4_AWQ_CFG`
INT4-FP8 AWQ (W4A8)	`W4A8_AWQ_BETA_CFG`

Quantization Configs

A quantization config is a dictionary specifying the values for the keys "quant_cfg" and "algorithm". The "quant_cfg" key specifies the quantization configurations. The "algorithm" key specifies the algorithm argument passed to calibrate. Please see QuantizeConfig for the quantization config definition.

‘Quantization configurations’ is a dictionary mapping wildcards or filter functions to their ‘quantizer attributes’. The wildcards or filter functions are matched against the quantizer module names. The quantizer modules have names ending with weight_quantizer and input_quantizer, and they perform weight quantization and input quantization (or activation quantization) respectively. The quantizer modules are generally instances of TensorQuantizer. The quantizer attributes are defined by QuantizerAttributeConfig. See QuantizerAttributeConfig for details on the quantizer attributes and their values.

The key “default” from the quantization configuration dictionary is applied if no other wildcard or filter functions match the quantizer module name.

The quantizer attributes are applied in the order they are specified. For the missing attributes, the default attributes as defined by QuantizerAttributeConfig are used.
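The matching rules above can be sketched in plain Python. This is an illustration only, not Model Optimizer's implementation: resolve_attributes is a hypothetical helper, it assumes wildcards behave like the shell-style patterns handled by Python's fnmatch, and it only covers attribute dictionaries (not lists).

```python
from fnmatch import fnmatch


def resolve_attributes(quantizer_name, quant_cfg):
    """Hypothetical helper: collect the attributes that apply to one quantizer."""
    resolved = {}
    matched = False
    for key, attrs in quant_cfg.items():  # dicts preserve insertion order
        if key == "default":
            continue
        # A key is either a filter function or a wildcard string
        hit = key(quantizer_name) if callable(key) else fnmatch(quantizer_name, key)
        if hit:
            resolved.update(attrs)  # later entries override earlier ones
            matched = True
    if not matched:
        # "default" is used only when nothing else matched
        resolved.update(quant_cfg.get("default", {}))
    return resolved


quant_cfg = {
    "*weight_quantizer": {"num_bits": 8, "axis": 0},
    "*input_quantizer": {"num_bits": 8, "axis": None},
    "*lm_head*": {"enable": False},
    "default": {"enable": False},
}

print(resolve_attributes("layers.0.mlp.fc1.weight_quantizer", quant_cfg))
print(resolve_attributes("lm_head.weight_quantizer", quant_cfg))
print(resolve_attributes("layers.0.attn.output_quantizer", quant_cfg))
```

In this sketch the lm_head weight quantizer matches two patterns, so it receives the 8-bit attributes first and is then disabled by the later "*lm_head*" entry.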

Quantizer attributes can also be a list of dictionaries. In this case, the matched quantizer module is replaced with a SequentialQuantizer module which is used to quantize a tensor in multiple formats sequentially. Each quantizer attribute dictionary in the list specifies the quantization formats for each quantization step of the sequential quantizer. For example, SequentialQuantizer is used in ‘INT4 Weights, FP8 Activations’ quantization in which the weights are quantized in INT4 followed by FP8.
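As a sketch of the list form, the config below gives the weight quantizers two attribute dictionaries: INT4 with blocks of 128 elements, followed by FP8 (E4M3). The name W4A8_STYLE_QUANT_CFG, the block size, and the choice of "awq_lite" are illustrative assumptions, not a copy of any built-in config.

```python
# Illustrative only: the values below are assumptions, not a built-in config.
W4A8_STYLE_QUANT_CFG = {
    "quant_cfg": {
        # A list of attribute dicts: one entry per sequential quantization step
        "*weight_quantizer": [
            {"num_bits": 4, "block_sizes": {-1: 128, "type": "static"}, "enable": True},
            {"num_bits": (4, 3), "axis": None, "enable": True},
        ],
        # A single attribute dict: activations are quantized once, to FP8
        "*input_quantizer": {"num_bits": (4, 3), "axis": None, "enable": True},
        "default": {"enable": False},
    },
    "algorithm": "awq_lite",
}

weight_steps = W4A8_STYLE_QUANT_CFG["quant_cfg"]["*weight_quantizer"]
for step, attrs in enumerate(weight_steps):
    print(f"step {step}: num_bits={attrs['num_bits']}")
```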

In addition, the dictionary keys can be PyTorch module class names that map to class-specific quantization configurations. The PyTorch modules should have a quantized equivalent.

To get the string representation of a module class, do:

import torch.nn as nn
from modelopt.torch.quantization import QuantModuleRegistry

# Get the class name for nn.Conv2d
class_name = QuantModuleRegistry.get_key(nn.Conv2d)

Here is an example of a quantization config:

MY_QUANT_CFG = {
    "quant_cfg": {
        # Quantizer wildcard strings mapping to quantizer attributes
        "*weight_quantizer": {"num_bits": 8, "axis": 0},
        "*input_quantizer": {"num_bits": 8, "axis": None},

        # Module class names mapping to quantizer configurations
        "nn.LeakyReLU": {"*input_quantizer": {"enable": False}},

    }
}

Example Quantization Configurations

These example configs can be accessed as attributes of modelopt.torch.quantization and can be given as input to mtq.quantize(). For example:

import modelopt.torch.quantization as mtq
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

You can also create your own config by following these examples. For instance, if you want to quantize a model with the INT4 AWQ algorithm but need to skip quantizing the layer named lm_head, you can create a custom config and quantize your model as follows:

import copy

# Create custom config
CUSTOM_INT4_AWQ_CFG = copy.deepcopy(mtq.INT4_AWQ_CFG)
CUSTOM_INT4_AWQ_CFG["quant_cfg"]["*lm_head*"] = {"enable": False}

# quantize model
model = mtq.quantize(model, CUSTOM_INT4_AWQ_CFG, forward_loop)

Functions

need_calibration

Check if calibration is needed for the given config.

ModeloptConfig AWQClipCalibConfig

Bases: QuantizeAlgorithmConfig

The config for awq_clip (AWQ clip) algorithm.

AWQ clip searches for the clipped amax for per-group quantization. This search requires much more compute than AWQ lite. To avoid running out of memory (OOM), the linear layer weights are processed in batches of size max_co_batch_size along the out_features dimension. AWQ clip calibration also takes longer than AWQ lite.

Default config (JSON):

{
   "method": "awq_clip",
   "max_co_batch_size": 1024,
   "max_tokens_per_batch": 64,
   "min_clip_ratio": 0.5,
   "shrink_step": 0.05,
   "debug": false
}

field debug: bool | None

If True, module’s search metadata will be kept as a module attribute named awq_clip.

field max_co_batch_size: int | None

Reduce this number if a CUDA out-of-memory error occurs.

field max_tokens_per_batch: int | None

The total tokens used for clip search would be max_tokens_per_batch * number of batches. Original AWQ uses a total of 512 tokens to search for clip values.

field method: Literal['awq_clip']

field min_clip_ratio: float | None

It should be in (0, 1.0). Clip will search for the optimal clipping value in the range [original block amax * min_clip_ratio, original block amax].

Constraints:

gt = 0.0
lt = 1.0

field shrink_step: float | None

It should be in range (0, 1.0]. The clip ratio will be searched from min_clip_ratio to 1 with the step size specified.

Constraints:

gt = 0.0
le = 1.0
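The role of min_clip_ratio and shrink_step can be illustrated with a small pure-Python search over a single weight block. This is a simplified sketch, not Model Optimizer's implementation: search_clip_ratio and fake_quantize are hypothetical helpers, the block is quantized to symmetric INT4, and the search minimizes the weight reconstruction error rather than the layer output error used by AWQ.

```python
def fake_quantize(values, amax, num_bits=4):
    """Symmetric integer fake quantization: quantize, clamp, dequantize."""
    q_max = 2 ** (num_bits - 1) - 1
    scale = amax / q_max
    return [max(-q_max, min(q_max, round(v / scale))) * scale for v in values]


def search_clip_ratio(block, min_clip_ratio=0.5, shrink_step=0.05):
    """Hypothetical helper: pick the clip ratio with the lowest squared error."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, 0.0
    # Number of steps that fit between min_clip_ratio and 1.0
    n_steps = int((1.0 - min_clip_ratio) / shrink_step + 1e-9)
    best_ratio, best_err = 1.0, float("inf")
    for i in range(n_steps + 1):
        ratio = 1.0 - i * shrink_step  # candidates between min_clip_ratio and 1.0
        dequant = fake_quantize(block, amax * ratio)
        err = sum((w - d) ** 2 for w, d in zip(block, dequant))
        if err < best_err:
            best_ratio, best_err = ratio, err
    return best_ratio, best_err


block = [0.1, -0.2, 0.15, 0.05, 3.0, -0.1, 0.12, -0.07]
ratio, err = search_clip_ratio(block)
print(f"best clip ratio: {ratio:.2f}, squared error: {err:.4f}")
```

With the default values the loop evaluates 11 candidate ratios per block, which is why a smaller shrink_step makes calibration slower.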

ModeloptConfig AWQFullCalibConfig

Bases: AWQLiteCalibConfig, AWQClipCalibConfig

The config for awq or awq_full algorithm (AWQ full).

AWQ full performs awq_lite followed by awq_clip.

Default config (JSON):

{
   "method": "awq_full",
   "max_co_batch_size": 1024,
   "max_tokens_per_batch": 64,
   "min_clip_ratio": 0.5,
   "shrink_step": 0.05,
   "debug": false,
   "alpha_step": 0.1
}

field debug: bool | None

If True, module’s search metadata will be kept as module attributes named awq_lite and awq_clip.

field method: Literal['awq_full']

ModeloptConfig AWQLiteCalibConfig

Bases: QuantizeAlgorithmConfig

The config for awq_lite (AWQ lite) algorithm.

AWQ lite applies a channel-wise scaling factor which minimizes the output difference after quantization. See AWQ paper for more details.

Default config (JSON):

{
   "method": "awq_lite",
   "alpha_step": 0.1,
   "debug": false
}

field alpha_step: float | None

The alpha will be searched from 0 to 1 with the step size specified.

Constraints:

gt = 0.0
le = 1.0
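The alpha search can be sketched as a grid search over a channel-wise scale. The sketch below follows the formulation in the AWQ paper (scale s = s_X ** alpha, where s_X is the mean absolute activation per input channel), which may differ from Model Optimizer's implementation. search_alpha and fake_quantize are hypothetical helpers, and the example uses a single output channel with per-tensor INT4 fake quantization.

```python
def fake_quantize(values, num_bits=4):
    """Symmetric per-tensor integer fake quantization."""
    q_max = 2 ** (num_bits - 1) - 1
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    scale = amax / q_max
    return [max(-q_max, min(q_max, round(v / scale))) * scale for v in values]


def search_alpha(weights, activations, alpha_step=0.1):
    """weights: one output channel; activations: list of calibration input vectors."""
    n_in = len(weights)
    # Mean absolute activation per input channel (floored to avoid division by zero)
    act_scale = [
        max(sum(abs(x[j]) for x in activations) / len(activations), 1e-8)
        for j in range(n_in)
    ]
    # Unquantized reference outputs
    reference = [sum(x[j] * weights[j] for j in range(n_in)) for x in activations]
    n_steps = int(1.0 / alpha_step + 1e-9)
    best_alpha, best_err = 0.0, float("inf")
    for i in range(n_steps + 1):
        alpha = i * alpha_step
        s = [a ** alpha for a in act_scale]
        # Scale weights up, quantize them, and scale activations down by the same factor
        w_q = fake_quantize([w * sj for w, sj in zip(weights, s)])
        outputs = [sum((x[j] / s[j]) * w_q[j] for j in range(n_in)) for x in activations]
        err = sum((r - o) ** 2 for r, o in zip(reference, outputs))
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha, best_err


weights = [0.5, -0.3, 0.8, 0.1]
activations = [[1.0, 0.1, 20.0, 0.2], [0.5, -0.2, 15.0, 0.1], [-1.0, 0.3, -18.0, 0.05]]
alpha, err = search_alpha(weights, activations)
print(f"best alpha: {alpha:.1f}, output squared error: {err:.6f}")
```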

field debug: bool | None

If True, module’s search metadata will be kept as a module attribute named awq_lite.

field method: Literal['awq_lite']

ModeloptConfig CompressConfig

Bases: ModeloptBaseConfig

Default configuration for compress mode.

Default config (JSON):

{
   "compress": {
      "*": true
   },
   "quant_gemm": true
}

field compress: dict[str, bool]

field quant_gemm: bool

If True, quantized GEMM compute will be enabled. Otherwise, we only do weight-only quantization.

ModeloptConfig MaxCalibConfig

Bases: QuantizeAlgorithmConfig

The config for max calibration algorithm.

Max calibration estimates the max values of activations or weights and uses these max values to set the quantization scaling factor. See Integer Quantization for the concepts.

Default config (JSON):

{
   "method": "max",
   "distributed_sync": true
}

field distributed_sync: bool | None

If True, the amax will be synced across the distributed processes.

field method: Literal['max']
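A minimal sketch of max calibration for symmetric integer quantization is shown below. max_calibrate is a hypothetical helper, and the scale is taken as amax / q_max. With distributed_sync enabled, the amax would additionally be synced across processes, which this single-process sketch does not show.

```python
def max_calibrate(batches, num_bits=8):
    """Track the running max absolute value and derive the scaling factor."""
    amax = 0.0
    for batch in batches:
        amax = max(amax, max(abs(v) for v in batch))
    q_max = 2 ** (num_bits - 1) - 1  # 127 for signed 8-bit
    return amax, amax / q_max


calibration_batches = [[0.1, -2.0, 0.7], [1.5, -0.3, 2.54]]
amax, scale = max_calibrate(calibration_batches)
print(f"amax={amax}, scale={scale}")
```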

ModeloptConfig MseCalibConfig

Bases: QuantizeAlgorithmConfig

Configuration for per-tensor MSE calibration.

Finds a scale s (via amax a, with s = a / q_max) that minimizes the reconstruction error of a tensor after uniform Q→DQ:

s* = argmin_s E[(X - DQ(Q(X; s)))^2], X ∈ {weights | activations}

Default config (JSON):

{
   "method": "mse",
   "num_steps": 10,
   "start_multiplier": 0.25,
   "stop_multiplier": 4.0,
   "distributed_sync": true
}

field distributed_sync: bool | None

If True, the amax will be synced across the distributed processes.

field method: Literal['mse']

field num_steps: int | None

Number of amax candidates to search over for MSE minimization.

Constraints:

ge = 1

field start_multiplier: float | None

Starting multiplier for amax search range (multiplies initial amax).

Constraints:

gt = 0.0

field stop_multiplier: float | None

Ending multiplier for amax search range (multiplies initial amax).

Constraints:

gt = 0.0
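The interaction of num_steps, start_multiplier and stop_multiplier can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: mse_calibrate and fake_quantize are hypothetical helpers, the candidate multipliers are assumed to be linearly spaced, and quantization is symmetric 8-bit with s = a / q_max.

```python
def fake_quantize(values, amax, num_bits=8):
    """Symmetric integer Q->DQ with scale s = amax / q_max."""
    q_max = 2 ** (num_bits - 1) - 1
    scale = amax / q_max
    return [max(-q_max, min(q_max, round(v / scale))) * scale for v in values]


def mse_calibrate(values, num_steps=10, start_multiplier=0.25, stop_multiplier=4.0):
    """Hypothetical helper: choose the amax candidate with the lowest MSE."""
    initial_amax = max(abs(v) for v in values)
    if initial_amax == 0.0:
        return 0.0, 0.0
    if num_steps == 1:
        multipliers = [start_multiplier]
    else:
        # Assumption: candidates are linearly spaced over the search range
        step = (stop_multiplier - start_multiplier) / (num_steps - 1)
        multipliers = [start_multiplier + i * step for i in range(num_steps)]
    best_amax, best_mse = initial_amax, float("inf")
    for m in multipliers:
        amax = initial_amax * m
        dequant = fake_quantize(values, amax)
        mse = sum((v - d) ** 2 for v, d in zip(values, dequant)) / len(values)
        if mse < best_mse:
            best_amax, best_mse = amax, mse
    return best_amax, best_mse


# Mostly small values plus one outlier, where clipping can reduce the error
values = [0.02 * i for i in range(-50, 51)] + [8.0]
best_amax, best_mse = mse_calibrate(values)
print(f"best amax: {best_amax:.3f}, mse: {best_mse:.6f}")
```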

ModeloptConfig QuantizeAlgorithmConfig

Bases: ModeloptBaseConfig

Calibration algorithm config base.

Default config (JSON):

{
   "method": null
}

field method: Literal[None]

ModeloptConfig QuantizeConfig

Bases: ModeloptBaseConfig

Default configuration for quantize mode.

Default config (JSON):

{
   "quant_cfg": {
      "default": {
         "num_bits": 8,
         "axis": null
      }
   },
   "algorithm": "max"
}

field algorithm: str | dict | QuantizeAlgorithmConfig | None | list[str | dict | QuantizeAlgorithmConfig | None]

field quant_cfg: dict[str | Callable, QuantizerAttributeConfig | list[QuantizerAttributeConfig] | dict[str | Callable, QuantizerAttributeConfig | list[QuantizerAttributeConfig]]]

ModeloptConfig QuantizerAttributeConfig

Bases: ModeloptBaseConfig

Quantizer attribute type.

Default config (JSON):

{
   "enable": true,
   "num_bits": 8,
   "axis": null,
   "fake_quant": true,
   "unsigned": false,
   "narrow_range": false,
   "learn_amax": false,
   "type": "static",
   "block_sizes": null,
   "bias": null,
   "trt_high_precision_dtype": "Float",
   "calibrator": "max",
   "rotate": false,
   "pass_through_bwd": false
}

field axis: int | tuple[int, ...] | None

This field is for static per-channel quantization. It cannot coexist with `block_sizes`. You should set axis if you want a fixed shape of scale factor.

For example:

- If axis is set to None, the scale factor will be a scalar (per-tensor quantization).
- If axis is set to 0, the scale factor will be a vector of shape (dim0, ) (per-channel quantization).
- If axis is set to (-2, -1), the scale factor will be a tensor of shape (dim-2, dim-1), i.e. the sizes of the last two dimensions.

The axis value must be in the range [-rank(input_tensor), rank(input_tensor)).
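The relationship between axis and the scale factor shape can be sketched with a small helper. scale_shape is hypothetical and only reports the sizes of the kept axes as described above. The scale tensor stored by the library may carry extra singleton dimensions.

```python
def scale_shape(tensor_shape, axis):
    """Hypothetical helper: shape of the scale factor for a given axis setting."""
    if axis is None:
        return ()  # per-tensor quantization: a single scalar
    axes = (axis,) if isinstance(axis, int) else tuple(axis)
    rank = len(tensor_shape)
    for a in axes:
        if not -rank <= a < rank:
            raise ValueError(f"axis {a} is out of range for a rank-{rank} tensor")
    kept = sorted({a % rank for a in axes})  # normalize negative axes
    return tuple(tensor_shape[a] for a in kept)


print(scale_shape((64, 128), None))
print(scale_shape((64, 128), 0))
print(scale_shape((4, 64, 128), (-2, -1)))
```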

field bias: dict | None

Configuration for bias handling in affine quantization. The keys are:

- "enable": Boolean to enable/disable bias handling, default is False
- "type": Specify the type of bias ["static", "dynamic"], default is "static"
- "method": Specify the method of bias calibration ["mean", "max_min"], default is "mean"
- "axis": Tuple of integers specifying axes for bias computation, default is None

Examples:

bias = {"enable": True}
bias = {"enable": True, "type": "static", "axis": -1}
bias = {"enable": True, "type": "dynamic", "axis": (-1, -3)}

field block_sizes: dict | None

This field is for static or dynamic block quantization. It cannot coexist with `axis`. You should set block_sizes if you want a fixed number of elements to share each scale factor.

The keys are the axes for block quantization and the values are block sizes for quantization along the respective axes. Keys must be in the range [-tensor.dim(), tensor.dim()). Values, which are the block sizes for quantization, must be positive integers or None. A positive block size specifies the block size for quantization along that axis. None means that the block size will be the maximum possible size in that dimension. This is useful for specifying certain quantization formats such as per-token dynamic quantization, which has the amax shared along the last dimension.

In addition, there can be special string keys "type", "scale_bits" and "scale_block_sizes".

Key "type" should map to "dynamic" or "static", where "dynamic" indicates dynamic block quantization and "static" indicates static calibrated block quantization. By default, the type is "static".

Key "scale_bits" specifies the quantization bits for the per-block quantization scale factor (i.e., a double quantization scheme).

Key "scale_block_sizes" specifies the block size for double quantization. By default, the per-block quantization scale is not quantized.

For example, block_sizes = {-1: 32} will quantize the last axis of the input tensor in blocks of size 32 with static calibration, with a total of numel(tensor) / 32 scale factors. block_sizes = {-1: 32, "type": "dynamic"} will perform dynamic block quantization. block_sizes = {-1: None, "type": "dynamic"} can be used to specify per-token dynamic quantization.
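The scale factor counts in these examples can be reproduced with a short calculation. num_scale_factors is a hypothetical helper. It assumes that axes without an entry in block_sizes keep one scale per index, and that a partial trailing block gets its own scale factor.

```python
import math


def num_scale_factors(tensor_shape, block_sizes):
    """Hypothetical helper: count the scale factors implied by block_sizes."""
    rank = len(tensor_shape)
    # Only integer keys describe axes; string keys such as "type" are options
    blocks = {k % rank: v for k, v in block_sizes.items() if isinstance(k, int)}
    count = 1
    for dim, size in enumerate(tensor_shape):
        if dim in blocks:
            # None means the block spans the whole dimension
            block = size if blocks[dim] is None else blocks[dim]
            count *= math.ceil(size / block)
        else:
            count *= size
    return count


# Static block quantization: numel / 32 scale factors
print(num_scale_factors((4096, 4096), {-1: 32}))
# Per-token dynamic quantization: one scale factor per token
print(num_scale_factors((8, 16, 4096), {-1: None, "type": "dynamic"}))
```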

field calibrator: str | Callable | tuple

The calibrator can be a string from ["max", "histogram"] or a constructor to create a calibrator which subclasses _Calibrator. See standardize_constructor_args for more information on how to specify the constructor.

field enable: bool

If True, enables the quantizer. If False, the quantizer is bypassed and returns the input tensor unchanged.

field fake_quant: bool

If True, enable fake quantization.

field learn_amax: bool

learn_amax is deprecated and reserved for backward compatibility.

field narrow_range: bool

If True, enable narrow range quantization. Used only for integer quantization.

field num_bits: int | tuple[int, int]

num_bits can be:

- A positive integer for integer quantization. num_bits specifies the number of bits used for integer quantization.
- A constant integer tuple (E, M) for floating point quantization emulating NVIDIA's FPx quantization. E is the number of exponent bits and M is the number of mantissa bits. Supported FPx quantization formats: FP8 (E4M3, E5M2), FP6 (E3M2, E2M3), FP4 (E2M1).

field pass_through_bwd: bool

Gradient computation in which fake quantization is treated as a pass-through is called the ‘Straight-Through Estimator (STE)’. STE does not require saving the input tensor for the backward pass and hence consumes less memory.

If set to False, we will use STE with zeroed outlier gradients. This setting could yield better QAT accuracy depending on the quantization format. However, this setting requires saving of the input tensor for computing gradients which uses more memory.

For dynamic quantization formats like MXFP4, STE with zeroed outlier gradients is not needed since fake quantization with dynamic amax results in minimal/no clipping.

field rotate: bool

If True, the input of the quantizer will be rotated with a Hadamard matrix given by scipy.linalg.hadamard, i.e. input = input @ scipy.linalg.hadamard(input.shape[-1]) / sqrt(input.shape[-1]).

This can be used for rotation-based PTQ methods, e.g. QuaRot or SpinQuant. See https://arxiv.org/abs/2404.00456 for example.
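The rotation formula can be tried out without SciPy using the sketch below. hadamard is a pure-Python stand-in that builds the matrix with the Sylvester construction (so the size must be a power of two), and rotate is a hypothetical helper that applies input @ H / sqrt(n) to a single vector.

```python
import math


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def rotate(vector):
    """Compute vector @ H / sqrt(n) for a single input vector."""
    n = len(vector)
    h = hadamard(n)
    norm = math.sqrt(n)
    return [sum(vector[i] * h[i][j] for i in range(n)) / norm for j in range(n)]


x = [1.0, 2.0, 3.0, 4.0]
y = rotate(x)
print(y)          # rotated input
print(rotate(y))  # rotating again recovers x, since H / sqrt(n) is its own inverse
```

Because the normalized matrix is orthogonal, the rotation preserves the vector norm while spreading large entries across all positions, which is what makes it useful before quantization.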

field trt_high_precision_dtype: str

The value is a string from ["Float", "Half", "BFloat16"]. The QDQs will be assigned the appropriate data type, and this variable will only be used when the user is exporting the quantized ONNX model.

Constraints:

pattern = ^Float$|^Half$|^BFloat16$

field type: str

The value is a string from ["static", "dynamic"]. If "dynamic", dynamic quantization is enabled, which does not collect any statistics during calibration.

Constraints:

pattern = ^static$|^dynamic$

field unsigned: bool

If True, enable unsigned quantization. Used only for integer quantization.
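For integer quantization, num_bits, narrow_range and unsigned together determine the representable integer range. The sketch below follows the common convention for these flags. integer_bounds is a hypothetical helper, and the exact clamping used by the library should be confirmed against its source.

```python
def integer_bounds(num_bits=8, unsigned=False, narrow_range=False):
    """Hypothetical helper: (min, max) integer values for the given settings."""
    max_bound = 2 ** (num_bits - 1 + int(unsigned)) - 1
    if unsigned:
        min_bound = 0
    elif narrow_range:
        min_bound = -max_bound  # symmetric range, e.g. [-127, 127]
    else:
        min_bound = -max_bound - 1  # full two's complement range
    return min_bound, max_bound


print(integer_bounds(8))
print(integer_bounds(8, narrow_range=True))
print(integer_bounds(8, unsigned=True))
print(integer_bounds(4))
```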

ModeloptConfig SVDQuantConfig

Bases: QuantizeAlgorithmConfig

The config for SVDQuant.

Refer to the SVDQuant paper for more details.

Default config (JSON):

{
   "method": "svdquant",
   "lowrank": 32
}

field lowrank: int | None

Specifies the rank of the LoRA used in the SVDQuant method, which captures outliers from the original weights.

field method: Literal['svdquant']

ModeloptConfig SmoothQuantCalibConfig

Bases: QuantizeAlgorithmConfig

The config for smoothquant algorithm (SmoothQuant).

SmoothQuant applies a smoothing factor which balances the scale of outliers in weights and activations. See SmoothQuant paper for more details.

Default config (JSON):

{
   "method": "smoothquant",
   "alpha": 1.0
}

field alpha: float | None

This hyper-parameter controls the migration strength. The migration strength is within [0, 1]; a larger value migrates more quantization difficulty to weights.

Constraints:

ge = 0.0
le = 1.0
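The effect of alpha can be sketched with the per-channel smoothing factor from the SmoothQuant paper, s_j = max|X_j| ** alpha / max|W_j| ** (1 - alpha), where activations are divided by s and weights are multiplied by s. smoothing_factors is a hypothetical helper, and Model Optimizer's implementation may differ in details.

```python
def smoothing_factors(act_amax, weight_amax, alpha=1.0, eps=1e-8):
    """Per-input-channel smoothing factors following the SmoothQuant paper."""
    return [
        max(a, eps) ** alpha / max(w, eps) ** (1.0 - alpha)
        for a, w in zip(act_amax, weight_amax)
    ]


act_amax = [60.0, 1.0, 0.5]    # channel 0 has an activation outlier
weight_amax = [0.4, 0.5, 0.3]

for alpha in (0.5, 1.0):
    s = smoothing_factors(act_amax, weight_amax, alpha)
    smoothed_act = [a / sj for a, sj in zip(act_amax, s)]
    smoothed_weight = [w * sj for w, sj in zip(weight_amax, s)]
    print(alpha, smoothed_act, smoothed_weight)
```

With alpha = 0.5 the activation and weight ranges of each channel end up equal, while alpha = 1.0 flattens the activation ranges to 1 and moves the whole outlier into the weights.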

field method: Literal['smoothquant']

need_calibration(config): Check if calibration is needed for the given config.