config

This document lists the quantization formats supported by Model Optimizer and example quantization configs.

Quantization Formats

The following table lists the quantization formats supported by Model Optimizer and the corresponding quantization config. See Quantization Configs for the specific quantization config definitions.

Please see choosing the right quantization formats to learn more about the formats and their use-cases.

Note

The recommended configs below are for LLMs. For CNN models, only INT8 quantization is supported; use the INT8_DEFAULT_CFG quantization config for CNN models.

Quantization Format              Model Optimizer config
-------------------              ----------------------
INT8                             INT8_SMOOTHQUANT_CFG
FP8                              FP8_DEFAULT_CFG
INT4 Weights only AWQ (W4A16)    INT4_AWQ_CFG
INT4-FP8 AWQ (W4A8)              W4A8_AWQ_BETA_CFG

Quantization Configs

A quantization config is a dictionary with two top-level keys:

  • "quant_cfg": an ordered list of QuantizerCfgEntry dicts that specify which quantizers to configure and how.

  • "algorithm": the calibration algorithm passed to calibrate.

Please see QuantizeConfig for the full config schema.
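As a rough illustration of the two-key layout, the sketch below builds one config dict and shows the algorithm given first as a method name and then as a dict with options. The entry values and the alpha value of 0.5 are assumptions chosen for illustration, not values confirmed by this page.

```python
import copy

# Config with the two top-level keys; the algorithm is given by method name.
SKETCH_CFG = {
    "quant_cfg": [
        {"quantizer_name": "*", "enable": False},
        {"quantizer_name": "*weight_quantizer", "cfg": {"num_bits": 8, "axis": 0}},
        {"quantizer_name": "*input_quantizer", "cfg": {"num_bits": 8, "axis": None}},
    ],
    "algorithm": "max",
}

# Same entries, with the algorithm as a dict so that options can be passed.
# alpha=0.5 is an illustrative value (assumption).
SKETCH_CFG_WITH_OPTIONS = copy.deepcopy(SKETCH_CFG)
SKETCH_CFG_WITH_OPTIONS["algorithm"] = {"method": "smoothquant", "alpha": 0.5}
```

The dict form uses the method key that each calibration config class declares, followed by that class's own fields.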

quant_cfg — Entry Format

Each entry in the quant_cfg list is a QuantizerCfgEntry with the following fields:

  • quantizer_name (required): a wildcard string matched against quantizer module names. Quantizer modules are instances of TensorQuantizer and have names ending with weight_quantizer, input_quantizer, etc.

  • parent_class (optional): restricts matching to quantizers whose immediate parent module is of this PyTorch class (e.g. "nn.Linear"). If omitted, all matching quantizers are targeted regardless of their parent class.

  • cfg (optional): a dict of quantizer attributes as defined by QuantizerAttributeConfig, or a list of such dicts. When a list is given, the matched TensorQuantizer is replaced with a SequentialQuantizer that applies each format in sequence. This is used, for example, in W4A8 quantization, where weights are quantized first to INT4 and then to FP8.

  • enable (optional): toggles matched quantizers on (True) or off (False), independently of cfg. When cfg is present and enable is absent, the quantizer is implicitly enabled. When enable is the only field (no cfg), it only flips the on/off state — all other attributes remain unchanged.
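The four fields can be combined in a few typical ways. The entries below are a sketch only: the block size of 128 and the reading of num_bits=(4, 3) as FP8 E4M3 are assumptions made for illustration.

```python
# Single-format entry: cfg is one dict of quantizer attributes.
# With cfg present and enable absent, the quantizer is implicitly enabled.
fp8_input = {
    "quantizer_name": "*input_quantizer",
    "cfg": {"num_bits": (4, 3), "axis": None},  # (4, 3) assumed to mean FP8 E4M3
}

# Multi-format entry: cfg is a list, so the formats are applied in sequence
# (INT4 first, then FP8), as described for W4A8.
w4a8_weight = {
    "quantizer_name": "*weight_quantizer",
    "cfg": [
        {"num_bits": 4, "block_sizes": {-1: 128}},  # block size is an assumption
        {"num_bits": (4, 3), "axis": None},
    ],
}

# enable-only entry: flips the on/off state and leaves other attributes alone.
skip_lm_head = {"quantizer_name": "*lm_head*", "enable": False}

# parent_class restricts the match to quantizers under one module class.
linear_weights_only = {
    "quantizer_name": "*weight_quantizer",
    "parent_class": "nn.Linear",
    "cfg": {"num_bits": 8, "axis": 0},
}
```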

quant_cfg — Ordering and Precedence

Entries are applied in list order; later entries override earlier ones for any quantizer they match. The recommended pattern is:

  1. Start with a deny-all entry {"quantizer_name": "*", "enable": False} (provided as _base_disable_all) to disable every quantizer by default.

  2. Follow with format-specific entries that selectively enable and configure the desired quantizers.

  3. Append _default_disabled_quantizer_cfg to enforce standard exclusions (e.g. BatchNorm layers, LM head, MoE routers).

To get the string representation of a module class for use in parent_class, do:

import torch.nn as nn
from modelopt.torch.quantization import QuantModuleRegistry

# Get the class name for nn.Conv2d
class_name = QuantModuleRegistry.get_key(nn.Conv2d)

Here is an example of a quantization config:

MY_QUANT_CFG = {
    "quant_cfg": [
        # Deny all quantizers by default
        {"quantizer_name": "*", "enable": False},

        # Enable and configure weight and input quantizers
        {"quantizer_name": "*weight_quantizer", "cfg": {"num_bits": 8, "axis": 0}},
        {"quantizer_name": "*input_quantizer", "cfg": {"num_bits": 8, "axis": None}},

        # Disable input quantizers specifically for LeakyReLU layers
        {"quantizer_name": "*input_quantizer", "parent_class": "nn.LeakyReLU", "enable": False},
    ]
}
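To see how list order decides the outcome for a config like the one above, the following sketch replays the entries against a handful of quantizer names. It is not the library's implementation: resolve_quant_cfg is a hypothetical helper, the quantizer names are made up, and the starting state (enabled, no attributes) is an assumption.

```python
from fnmatch import fnmatchcase


def resolve_quant_cfg(entries, quantizers):
    """Apply entries in order; return the effective state per quantizer.

    `quantizers` maps a quantizer name to its parent class string.
    """
    # Assumed starting point: every quantizer enabled with no attributes set.
    state = {name: {"enable": True, "cfg": None} for name in quantizers}
    for entry in entries:
        for name, parent in quantizers.items():
            if not fnmatchcase(name, entry["quantizer_name"]):
                continue
            if entry.get("parent_class") not in (None, parent):
                continue
            if "cfg" in entry:
                # cfg present: set attributes; enabled unless stated otherwise.
                state[name] = {"enable": entry.get("enable", True), "cfg": entry["cfg"]}
            else:
                # enable-only: flip the switch, keep existing attributes.
                state[name]["enable"] = entry["enable"]
    return state


entries = [
    {"quantizer_name": "*", "enable": False},
    {"quantizer_name": "*weight_quantizer", "cfg": {"num_bits": 8, "axis": 0}},
    {"quantizer_name": "*input_quantizer", "cfg": {"num_bits": 8, "axis": None}},
    {"quantizer_name": "*input_quantizer", "parent_class": "nn.LeakyReLU", "enable": False},
]
quantizers = {
    "fc.weight_quantizer": "nn.Linear",
    "fc.input_quantizer": "nn.Linear",
    "fc.output_quantizer": "nn.Linear",
    "act.input_quantizer": "nn.LeakyReLU",
}
effective = resolve_quant_cfg(entries, quantizers)
```

In this replay the output quantizer stays disabled because only the deny-all entry matches it, while the LeakyReLU input quantizer is configured by the third entry and then switched off by the fourth.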

Example Quantization Configurations

These example configs can be accessed as attributes of modelopt.torch.quantization and can be given as input to mtq.quantize(). For example:

import modelopt.torch.quantization as mtq
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
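The forward_loop argument is not defined in the snippet above. A minimal sketch, assuming forward_loop is a callable that receives the model and runs calibration batches through it (the helper name make_forward_loop and the calib_batches iterable are hypothetical):

```python
def make_forward_loop(calib_batches):
    """Build a forward_loop that feeds each calibration batch to the model."""

    def forward_loop(model):
        # Outputs are discarded; the passes only let quantizers observe data.
        for batch in calib_batches:
            model(batch)

    return forward_loop
```

With a PyTorch model, the loop body would usually run under torch.no_grad() and move each batch to the model's device first.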

You can also create your own config by following these examples. For instance, to quantize a model with the INT4 AWQ algorithm while skipping the layer named lm_head, create a custom config and quantize the model as follows:

import copy

# Create custom config
CUSTOM_INT4_AWQ_CFG = copy.deepcopy(mtq.INT4_AWQ_CFG)
CUSTOM_INT4_AWQ_CFG["quant_cfg"].append({"quantizer_name": "*lm_head*", "enable": False})

# quantize model
model = mtq.quantize(model, CUSTOM_INT4_AWQ_CFG, forward_loop)

Classes

AWQClipCalibConfig

The config for awq_clip (AWQ clip) algorithm.

AWQFullCalibConfig

The config for awq or awq_full algorithm (AWQ full).

AWQLiteCalibConfig

The config for awq_lite (AWQ lite) algorithm.

CompressConfig

Default configuration for compress mode.

GPTQCalibConfig

The config for GPTQ quantization.

LayerwiseConfig

Nested config for layer-by-layer calibration behavior.

LocalHessianCalibConfig

Configuration for local Hessian-weighted MSE calibration.

MaxCalibConfig

The config for max calibration algorithm.

MseCalibConfig

Configuration for per-tensor MSE calibration.

QuantizeAlgorithmConfig

Calibration algorithm config base.

QuantizeConfig

Default configuration for quantize mode.

QuantizerAttributeConfig

Quantizer attribute type.

QuantizerCfgEntry

A single entry in a quant_cfg list.

RotateConfig

Configuration for rotating quantizer input via Hadamard transform (RHT/QuaRot/SpinQuant).

SVDQuantConfig

The config for SVDQuant.

SmoothQuantCalibConfig

The config for smoothquant algorithm (SmoothQuant).

Functions

find_quant_cfg_entry_by_path

Find the last entry in a quant_cfg list whose quantizer_name key equals the query.

need_calibration

Check if calibration is needed for the given config.

normalize_quant_cfg_list

Normalize a raw quant_cfg into a list of QuantizerCfgEntry instances.

class AWQClipCalibConfig

Bases: QuantizeAlgorithmConfig

The config for awq_clip (AWQ clip) algorithm.

AWQ clip searches for the clipped amax for per-group quantization. This search requires much more compute than AWQ lite, so AWQ clip calibration also takes longer. To avoid out-of-memory (OOM) errors, the linear layer weights are processed in batches of size max_co_batch_size along the out_features dimension.

debug: bool | None
max_co_batch_size: int | None
max_tokens_per_batch: int | None
method: Literal['awq_clip']
min_clip_ratio: float | None
model_config = {'extra': 'forbid', 'validate_assignment': True}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

shrink_step: float | None

class AWQFullCalibConfig

Bases: AWQLiteCalibConfig, AWQClipCalibConfig

The config for awq or awq_full algorithm (AWQ full).

AWQ full performs awq_lite followed by awq_clip.

debug: bool | None
method: Literal['awq_full']
model_config = {'extra': 'forbid', 'validate_assignment': True}

class AWQLiteCalibConfig

Bases: QuantizeAlgorithmConfig

The config for awq_lite (AWQ lite) algorithm.

AWQ lite applies a channel-wise scaling factor which minimizes the output difference after quantization. See AWQ paper for more details.
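The idea can be sketched in plain Python as a grid search over an exponent alpha, following the formulation in the AWQ paper: scale each input channel of the weights by s = act_scale ** alpha, divide the inputs by the same s, and keep the alpha with the smallest output error after quantization. This is a simplified illustration, not the library's code: it quantizes per output channel rather than per group, and the helper names and the use of alpha_step as the grid spacing are assumptions.

```python
def quantize_row(row, num_bits=4):
    """Symmetric fake quantization of one output channel (a list of weights)."""
    q_max = 2 ** (num_bits - 1) - 1
    amax = max(abs(w) for w in row) or 1.0
    scale = amax / q_max
    return [max(-q_max, min(q_max, round(w / scale))) * scale for w in row]


def linear(weights, x):
    """Output of a bias-free linear layer; weights is out_features x in_features."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def awq_lite_search(weights, inputs, alpha_step=0.1):
    """Return (best_alpha, best_scales, best_error) over a grid of alpha values."""
    n_in = len(weights[0])
    # Mean absolute activation per input channel.
    act_scale = [sum(abs(x[j]) for x in inputs) / len(inputs) for j in range(n_in)]
    reference = [linear(weights, x) for x in inputs]
    best_alpha, best_scales, best_error = None, None, float("inf")
    steps = int(round(1.0 / alpha_step)) + 1
    for i in range(steps):
        alpha = min(i * alpha_step, 1.0)
        scales = [max(a, 1e-8) ** alpha for a in act_scale]
        # Scale weights up per input channel, quantize, and scale inputs down.
        quantized = [quantize_row([w * s for w, s in zip(row, scales)]) for row in weights]
        error = 0.0
        for x, ref in zip(inputs, reference):
            out = linear(quantized, [v / s for v, s in zip(x, scales)])
            error += sum((o - r) ** 2 for o, r in zip(out, ref))
        if error < best_error:
            best_alpha, best_scales, best_error = alpha, scales, error
    return best_alpha, best_scales, best_error
```

Because alpha = 0 gives scales of 1, the search can never do worse than plain quantization on the calibration inputs.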

alpha_step: float | None
debug: bool | None
method: Literal['awq_lite']
model_config = {'extra': 'forbid', 'validate_assignment': True}

class CompressConfig

Bases: ModeloptBaseConfig

Default configuration for compress mode.

compress: dict[str, bool]
model_config = {'extra': 'forbid', 'validate_assignment': True}

quant_gemm: bool

class GPTQCalibConfig

Bases: QuantizeAlgorithmConfig

The config for GPTQ quantization.

GPTQ minimizes the layer-wise quantization error by using second-order (Hessian) information to perform blockwise weight updates that compensate for rounding loss. Layers are quantized sequentially so that each layer’s Hessian is computed from activations that already reflect the quantization of preceding layers.

The default values are taken from the official GPTQ implementation: https://github.com/IST-DASLab/FP-Quant/blob/d2e3092f968262c4de5fb050e1aef568a280dadd/src/quantization/gptq.py#L35

block_size: int | None
fused: bool
method: Literal['gptq']
model_config = {'extra': 'forbid', 'validate_assignment': True}

perc_damp: float | None

class LayerwiseConfig

Bases: ModeloptBaseConfig

Nested config for layer-by-layer calibration behavior.

checkpoint_dir: str | None
enable: bool
get_qdq_activations_from_prev_layer: bool
model_config = {'extra': 'forbid', 'validate_assignment': True}

save_every: int

class LocalHessianCalibConfig

Bases: _SharedStatesConfig, QuantizeAlgorithmConfig

Configuration for local Hessian-weighted MSE calibration.

This algorithm uses activation information to optimize per-block scales for weight quantization. It minimizes the output reconstruction error by weighting the loss with the local Hessian matrix computed from input activations.

The local Hessian loss for each block is dw @ H @ dw.T, where:

  • dw = weight - quantized_weight (the weight reconstruction error per block)

  • H = X @ X.T is the local Hessian computed from the input activations X
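A small worked instance of this loss, written with plain lists so that the shapes are explicit. The helper names are hypothetical, and the layout (one block as a 1 x k row, activations X as k x n) is an assumption made so that X @ X.T is k x k.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def local_hessian_loss(weight_block, quantized_block, activations):
    """Compute dw @ H @ dw.T for one block.

    weight_block, quantized_block: 1 x k rows; activations X: k x n.
    """
    dw = [[w - q for w, q in zip(weight_block[0], quantized_block[0])]]
    hessian = matmul(activations, transpose(activations))  # H = X @ X.T, k x k
    return matmul(matmul(dw, hessian), transpose(dw))[0][0]


loss = local_hessian_loss(
    weight_block=[[1.0, 2.0]],
    quantized_block=[[0.5, 2.0]],
    activations=[[1.0, 2.0], [3.0, 4.0]],
)
```

Since dw @ X @ X.T @ dw.T is the squared norm of dw @ X, the loss measures the change in the block's output rather than the raw weight error.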

block_size: int | None
debug: bool | None
distributed_sync: bool | None
fp8_scale_sweep: bool | None
method: Literal['local_hessian']
model_config = {'extra': 'forbid', 'validate_assignment': True}

start_multiplier: float | None
step_size: float | None
stop_multiplier: float | None

class MaxCalibConfig

Bases: _SharedStatesConfig, QuantizeAlgorithmConfig

The config for max calibration algorithm.

Max calibration estimates the max values of activations or weights and uses these max values to set the quantization scaling factor. See Integer Quantization for the concepts.
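For symmetric signed integer quantization, the idea reduces to a few lines. This is a minimal sketch with hypothetical helper names, assuming the scale is amax divided by the largest representable integer; it is not the library's calibrator.

```python
def max_calibrate(values, num_bits=8):
    """Return (amax, scale) for symmetric signed integer quantization."""
    amax = max(abs(v) for v in values)
    q_max = 2 ** (num_bits - 1) - 1  # 127 for INT8
    return amax, amax / q_max


def fake_quantize(values, scale, num_bits=8):
    """Quantize to integers, clamp to the valid range, and dequantize."""
    q_max = 2 ** (num_bits - 1) - 1
    return [max(-q_max, min(q_max, round(v / scale))) * scale for v in values]


amax, scale = max_calibrate([0.5, -2.54, 1.27])
dequantized = fake_quantize([0.5, -2.54, 1.27], scale)
```

Here the largest magnitude, 2.54, maps to the integer 127, so no value is clipped; the price is that a single outlier stretches the scale for every other value.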

distributed_sync: bool | None
method: Literal['max']
model_config = {'extra': 'forbid', 'validate_assignment': True}

sync_expert_weight_amax: bool

class MseCalibConfig

Bases: _SharedStatesConfig, QuantizeAlgorithmConfig

Configuration for per-tensor MSE calibration.

Finds a scale s (via amax a, with s = a / q_max) that minimizes the reconstruction error of a tensor after uniform Q→DQ:

s* = argmin_s E[(W - DQ(Q(W; s)))^2], W ∈ weights

When fp8_scale_sweep is enabled for a supported FP8-scale format, step_size is ignored.
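The search can be pictured as a sweep over candidate amax values. The sketch below assumes that candidates are multiples of the max-calibrated amax, running from start_multiplier to stop_multiplier in increments of step_size; that reading of the three fields, the default values, and the helper names are assumptions, and the input is assumed to contain at least one nonzero value.

```python
def fake_quantize(values, amax, num_bits=8):
    """Uniform Q->DQ with scale s = amax / q_max; values beyond amax are clipped."""
    q_max = 2 ** (num_bits - 1) - 1
    scale = amax / q_max
    return [max(-q_max, min(q_max, round(v / scale))) * scale for v in values]


def mse_calibrate(values, start_multiplier=0.25, stop_multiplier=4.0,
                  step_size=0.25, num_bits=8):
    """Return (best_amax, best_mse) over the swept candidates."""
    base_amax = max(abs(v) for v in values)
    best_amax, best_mse = None, float("inf")
    steps = int(round((stop_multiplier - start_multiplier) / step_size)) + 1
    for i in range(steps):
        amax = (start_multiplier + i * step_size) * base_amax
        dequantized = fake_quantize(values, amax, num_bits)
        mse = sum((v - d) ** 2 for v, d in zip(values, dequantized)) / len(values)
        if mse < best_mse:
            best_amax, best_mse = amax, mse
    return best_amax, best_mse
```

When the multiplier 1.0 is on the grid, the result is never worse than max calibration, and it can be better when clipping a rare outlier buys finer resolution for the bulk of the values.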

distributed_sync: bool | None
fp8_scale_sweep: bool | None
method: Literal['mse']
model_config = {'extra': 'forbid', 'validate_assignment': True}

start_multiplier: float | None
step_size: float | None
stop_multiplier: float | None

class QuantizeAlgorithmConfig

Bases: ModeloptBaseConfig

Calibration algorithm config base.

layerwise: LayerwiseConfig
method: Literal[None]
model_config = {'extra': 'forbid', 'validate_assignment': True}

moe_calib_experts_ratio: float | None

validate_layerwise_checkpoint_dir()

Raise if layerwise.checkpoint_dir is set but layerwise.enable is False.

class QuantizeConfig

Bases: ModeloptBaseConfig

Default configuration for quantize mode.

algorithm: str | dict | QuantizeAlgorithmConfig | None | list[str | dict | QuantizeAlgorithmConfig | None]
model_config = {'extra': 'forbid', 'validate_assignment': True}

classmethod normalize_quant_cfg(v)

Normalize raw quant_cfg input into a list[QuantizerCfgEntry].

Delegates to normalize_quant_cfg_list(), which accepts every supported input shape (new-format list, legacy single-key-dict list, legacy flat dict, and lists containing already-validated QuantizerCfgEntry instances) and rejects anything else with a clear ValueError before pydantic’s field-type check would see it.

Parameters:

v (Sequence[QuantizerCfgEntry] | Sequence[Mapping[str, Any]] | Mapping[str, Any])

Return type:

list[QuantizerCfgEntry]

quant_cfg: list[QuantizerCfgEntry]

class QuantizerAttributeConfig

Bases: ModeloptBaseConfig

Quantizer attribute type.

axis: int | tuple[int, ...] | None
backend: str | None
backend_extra_args: dict | None
bias: dict[int | str, Literal['static', 'dynamic'] | Literal['mean', 'max_min'] | tuple[int, ...] | bool | int | None] | None
block_sizes: dict[int | str, int | tuple[int, int] | str | dict[int, int] | None] | None
calibrator: str | Callable | tuple
enable: bool
fake_quant: bool
learn_amax: bool
model_config = {'extra': 'forbid', 'validate_assignment': True}

narrow_range: bool
num_bits: int | tuple[int, int] | str
pass_through_bwd: bool
rotate: bool | RotateConfig
trt_high_precision_dtype: str
type: str
unsigned: bool
use_constant_amax: bool

classmethod validate_bias(v)

Validate bias.

classmethod validate_block_sizes(v, info)

Validate block sizes.

Parameters:

info (ValidationInfo)

classmethod validate_calibrator(v, info)

Validate calibrator.

Parameters:

info (ValidationInfo)

classmethod validate_config(values)

Validate quantizer config.

classmethod validate_learn_amax(v)

Validate learn_amax.

validate_num_bits()

Validate num_bits.

class QuantizerCfgEntry

Bases: ModeloptBaseConfig

A single entry in a quant_cfg list.

cfg: QuantizerAttributeConfig | list[QuantizerAttributeConfig] | None
enable: bool
model_config = {'extra': 'forbid', 'validate_assignment': True}

parent_class: str | None
quantizer_name: str

class RotateConfig

Bases: ModeloptBaseConfig

Configuration for rotating quantizer input via Hadamard transform (RHT/QuaRot/SpinQuant).

See normalized_hadamard_transform for transform details.

block_size: int | None
enable: bool
model_config = {'extra': 'forbid', 'validate_assignment': True}

rotate_fp32: bool

classmethod validate_block_size(v)

Validate block_size is a positive int (mode=before to catch bool before int coercion).
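The reason for checking before coercion is that bool is a subclass of int in Python, so True would otherwise pass as the integer 1. A plain-Python sketch of such a check follows; the function name check_block_size is hypothetical, and letting None through is an assumption based on the int | None field type.

```python
def check_block_size(value):
    """Accept None or a positive int; reject bool, non-int, and non-positive values."""
    if value is None:
        return None
    # bool must be tested first: isinstance(True, int) is True in Python.
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(f"block_size must be a positive int, got {value!r}")
    return value
```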

class SVDQuantConfig

Bases: QuantizeAlgorithmConfig

The config for SVDQuant.

Refer to the SVDQuant paper for more details.

lowrank: int | None
method: Literal['svdquant']
model_config = {'extra': 'forbid', 'validate_assignment': True}

class SmoothQuantCalibConfig

Bases: QuantizeAlgorithmConfig

The config for smoothquant algorithm (SmoothQuant).

SmoothQuant applies a smoothing factor which balances the scale of outliers in weights and activations. See SmoothQuant paper for more details.
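In the formulation from the SmoothQuant paper, the per-input-channel factor is s_j = max|X_j| ** alpha / max|W_j| ** (1 - alpha); activations are divided by s and weights are multiplied by s, which leaves the layer output unchanged. A minimal sketch with a hypothetical helper name and made-up channel statistics:

```python
def smoothing_factors(act_absmax, weight_absmax, alpha=0.5):
    """Per-input-channel factors s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    return [a ** alpha / w ** (1.0 - alpha) for a, w in zip(act_absmax, weight_absmax)]


# Channel 0 has an activation outlier; channel 1 has a large weight.
act_absmax = [16.0, 1.0]
weight_absmax = [1.0, 4.0]

s = smoothing_factors(act_absmax, weight_absmax, alpha=0.5)
smoothed_act = [a / f for a, f in zip(act_absmax, s)]
smoothed_weight = [w * f for w, f in zip(weight_absmax, s)]
```

With alpha = 0.5 both sides end up with the same per-channel maxima, so the quantization difficulty is split evenly; larger alpha moves more of it onto the weights.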

alpha: float | None
method: Literal['smoothquant']
model_config = {'extra': 'forbid', 'validate_assignment': True}

find_quant_cfg_entry_by_path(quant_cfg_list, quantizer_name)

Find the last entry in a quant_cfg list whose quantizer_name key equals the query.

This performs an exact string comparison against the quantizer_name field of each entry — it does not apply fnmatch pattern matching. For example, passing "*input_quantizer" will only match entries whose quantizer_name is literally "*input_quantizer", not entries with a different wildcard that would match the same module names at apply time.

Returns the last match because entries are applied in list order and later entries override earlier ones, so the last match represents the effective configuration.

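Parameters:

  • quant_cfg_list – The quant_cfg list to search.

  • quantizer_name – The exact quantizer_name string to look for.

Returns: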

The last entry whose quantizer_name equals quantizer_name.

Raises:

KeyError – If no entry with the given quantizer_name is found.

Return type:

QuantizerCfgEntry
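The documented behavior (exact string comparison, last match wins, KeyError when nothing matches) can be mirrored on plain dicts. The function find_last_entry below is a hypothetical stand-in written for illustration, not the library function itself.

```python
def find_last_entry(quant_cfg_list, quantizer_name):
    """Return the last entry whose quantizer_name equals the query exactly."""
    for entry in reversed(quant_cfg_list):
        if entry["quantizer_name"] == quantizer_name:
            return entry
    raise KeyError(quantizer_name)


entries = [
    {"quantizer_name": "*input_quantizer", "cfg": {"num_bits": 8, "axis": None}},
    {"quantizer_name": "*.input_quantizer", "enable": False},
    {"quantizer_name": "*input_quantizer", "enable": False},
]
match = find_last_entry(entries, "*input_quantizer")
```

The second entry is skipped even though its wildcard would match many of the same modules, because the lookup compares pattern strings and never expands them.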

need_calibration(config)

Check if calibration is needed for the given config.

Parameters:

config (QuantizeConfig | Mapping[str, Any])

Return type:

bool

normalize_quant_cfg_list(v)

Normalize a raw quant_cfg into a list of QuantizerCfgEntry instances.

Supports the following input forms:

  • A list of entries in any of the per-entry forms below.

  • A legacy flat dict ({"*": ..., "*weight_quantizer": ...}) — each key/value pair is converted to a single-key dict entry and then normalized.

Per-entry forms (when input is a list):

  • New format: {"quantizer_name": ..., "enable": ..., "cfg": ...} — passed through.

  • Legacy single-key format: {"<quantizer_name>": <cfg_or_dict>} — converted to new format.

  • Legacy nn.*-scoped format: {"nn.<Class>": {"<quantizer_name>": <cfg>}} — converted to a new-format entry with parent_class set.

Each normalized dict is then constructed into a QuantizerCfgEntry, whose own validator enforces that every entry specifies cfg, enable, or both, and that any cfg for an enabled quantizer is a non-empty dict or non-empty list of non-empty dicts.

Parameters:

v (Sequence[QuantizerCfgEntry] | Sequence[Mapping[str, Any]] | Mapping[str, Any]) – A list of raw quant_cfg entries in any supported format, or a legacy flat dict.

Returns:

A list of validated QuantizerCfgEntry instances.

Raises:

ValueError – If any entry’s shape is not recognized, or if it fails QuantizerCfgEntry validation (missing instruction or invalid cfg).

Return type:

list[QuantizerCfgEntry]
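The shape conversions described above can be sketched on plain dicts as follows. This is an illustration under stated assumptions, not the library's code: normalize_sketch and _from_legacy are hypothetical names, the result is a list of dicts rather than QuantizerCfgEntry instances, no validation is performed, and lifting a legacy "enable" key out of the value to the entry level is a guess at how legacy values are split.

```python
def _from_legacy(name, value, parent_class=None):
    """Build a new-format entry from a legacy name/value pair."""
    entry = {"quantizer_name": name}
    if parent_class is not None:
        entry["parent_class"] = parent_class
    if isinstance(value, list):
        entry["cfg"] = value
        return entry
    attrs = dict(value)
    # Assumption: a legacy "enable" key is lifted to the entry level.
    if "enable" in attrs:
        entry["enable"] = attrs.pop("enable")
    if attrs:
        entry["cfg"] = attrs
    return entry


def normalize_sketch(raw):
    """Convert a legacy flat dict or a mixed list into new-format dict entries."""
    if isinstance(raw, dict):
        # Legacy flat dict: one single-key entry per key/value pair.
        raw = [{key: value} for key, value in raw.items()]
    normalized = []
    for item in raw:
        if "quantizer_name" in item:
            normalized.append(dict(item))  # new format: passed through
        elif len(item) == 1:
            (key, value), = item.items()
            if key.startswith("nn."):
                # Legacy nn.*-scoped format: the key becomes parent_class.
                normalized.extend(
                    _from_legacy(n, v, parent_class=key) for n, v in value.items()
                )
            else:
                normalized.append(_from_legacy(key, value))
        else:
            raise ValueError(f"Unrecognized quant_cfg entry: {item!r}")
    return normalized
```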