config
This document lists the quantization formats supported by Model Optimizer and example quantization configs.
Quantization Formats
The following table lists the quantization formats supported by Model Optimizer and the corresponding quantization config. See Quantization Configs for the specific quantization config definitions.
Please see choosing the right quantization formats to learn more about the formats and their use cases.
Note
The recommended configs given below are for LLMs. For CNN models, only INT8 quantization is supported; please use the quantization config INT8_DEFAULT_CFG for CNN models.
Quantization Format | Model Optimizer config
---|---
INT8 | INT8_DEFAULT_CFG, INT8_SMOOTHQUANT_CFG
FP8 | FP8_DEFAULT_CFG
INT4 Weights only AWQ (W4A16) | INT4_AWQ_CFG
INT4-FP8 AWQ (W4A8) | W4A8_AWQ_BETA_CFG
Quantization Configs
A quantization config is a dictionary specifying values for the keys "quant_cfg" and "algorithm". The "quant_cfg" key specifies the quantization configurations. The "algorithm" key specifies the algorithm argument to calibrate. Please see QuantizeConfig for the quantization config definition.
‘Quantization configurations’ is a dictionary mapping wildcards or filter functions to ‘quantizer attributes’. The wildcards or filter functions are matched against the quantizer module names. The quantizer modules have names ending with weight_quantizer and input_quantizer; they perform weight quantization and input quantization (or activation quantization), respectively. The quantizer modules are generally instances of TensorQuantizer. The quantizer attributes are defined by QuantizerAttributeConfig. See QuantizerAttributeConfig for details on the quantizer attributes and their values.
The key “default” from the quantization configuration dictionary is applied if no other wildcard or filter function matches the quantizer module name. The quantizer attributes are applied in the order they are specified; for missing attributes, the default attributes as defined by QuantizerAttributeConfig are used (as sketched below).
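For instance, a minimal sketch of how these matching rules play out (the module names are illustrative):

```python
# For a quantizer module named "lm_head.weight_quantizer", both patterns below
# match. Attributes are applied in the order specified, so {"enable": False}
# is applied last and that quantizer ends up disabled. A name matching neither
# pattern falls back to the "default" entry.
quant_cfg = {
    "*weight_quantizer": {"num_bits": 8, "axis": 0},
    "*lm_head*": {"enable": False},
    "default": {"enable": False},
}
```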
Quantizer attributes can also be a list of dictionaries. In this case, the matched quantizer module is replaced with a SequentialQuantizer module, which is used to quantize a tensor in multiple formats sequentially. Each quantizer attribute dictionary in the list specifies the quantization format for the corresponding step of the sequential quantizer. For example, SequentialQuantizer is used in ‘INT4 Weights, FP8 Activations’ quantization, in which the weights are quantized in INT4 followed by FP8 (see the sketch below).
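A minimal sketch of such a list-valued entry (it mirrors the W4A8_AWQ_BETA_CFG shown later; the dictionary name is illustrative):

```python
# The matched weight quantizer becomes a SequentialQuantizer with two steps:
# step 1 quantizes to INT4 in blocks of 128, step 2 to FP8 (E4M3) per tensor.
SEQUENTIAL_WEIGHT_CFG = {
    "*weight_quantizer": [
        {"num_bits": 4, "block_sizes": {-1: 128}, "enable": True},  # INT4 blockwise
        {"num_bits": (4, 3), "axis": None, "enable": True},  # FP8 per-tensor
    ],
}
```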
In addition, the dictionary entries can also be PyTorch module class names mapped to class-specific quantization configurations. The PyTorch modules should have a quantized equivalent.
To get the string representation of a module class, do:

```python
import torch.nn as nn

from modelopt.torch.quantization.nn import QuantModuleRegistry

# Get the registry key for nn.Conv2d (e.g. "nn.Conv2d")
class_name = QuantModuleRegistry.get_key(nn.Conv2d)
```
Here is an example of a quantization config:
```python
MY_QUANT_CFG = {
    "quant_cfg": {
        # Quantizer wildcard strings mapping to quantizer attributes
        "*weight_quantizer": {"num_bits": 8, "axis": 0},
        "*input_quantizer": {"num_bits": 8, "axis": None},
        # Module class names mapping to quantizer configurations
        "nn.LeakyReLU": {"*input_quantizer": {"enable": False}},
    }
}
```
Example Quantization Configurations
Here are the recommended quantization configs from Model Optimizer for quantization formats such as FP8, INT8, INT4, etc.:
```python
INT8_DEFAULT_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": 8, "axis": 0},
        "*input_quantizer": {"num_bits": 8, "axis": None},
        "*lm_head*": {"enable": False},
        "*block_sparse_moe.gate*": {"enable": False},  # Skip the MOE router
        "*router*": {"enable": False},  # Skip the MOE router
        "default": {"enable": False},
    },
    "algorithm": "max",
}

INT8_SMOOTHQUANT_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": 8, "axis": 0},
        "*input_quantizer": {"num_bits": 8, "axis": -1},
        "*lm_head*": {"enable": False},
        "*block_sparse_moe.gate*": {"enable": False},  # Skip the MOE router
        "*router*": {"enable": False},  # Skip the MOE router
        "nn.Conv2d": {
            "*weight_quantizer": {"num_bits": 8, "axis": 0},
            "*input_quantizer": {"num_bits": 8, "axis": None},
        },
        "default": {"enable": False},
    },
    "algorithm": "smoothquant",
}

FP8_DEFAULT_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": (4, 3), "axis": None},
        "*input_quantizer": {"num_bits": (4, 3), "axis": None},
        "*block_sparse_moe.gate*": {"enable": False},  # Skip the MOE router
        "*router*": {"enable": False},  # Skip the MOE router
        "default": {"enable": False},
    },
    "algorithm": "max",
}

INT4_BLOCKWISE_WEIGHT_ONLY_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": 4, "block_sizes": {-1: 128}, "enable": True},
        "*input_quantizer": {"enable": False},
        "*lm_head*": {"enable": False},
        "*block_sparse_moe.gate*": {"enable": False},  # Skip the MOE router
        "*router*": {"enable": False},  # Skip the MOE router
        "default": {"enable": False},
    },
    "algorithm": "max",
}

INT4_AWQ_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": 4, "block_sizes": {-1: 128}, "enable": True},
        "*input_quantizer": {"enable": False},
        "*lm_head*": {"enable": False},
        "*block_sparse_moe.gate*": {"enable": False},  # Skip the MOE router
        "*router*": {"enable": False},  # Skip the MOE router
        "default": {"enable": False},
    },
    "algorithm": {"method": "awq_lite", "alpha_step": 0.1},
    # "algorithm": {"method": "awq_full", "alpha_step": 0.1, "max_co_batch_size": 1024},
    # "algorithm": {"method": "awq_clip", "max_co_batch_size": 2048},
}

W4A8_AWQ_BETA_CFG = {
    "quant_cfg": {
        "*weight_quantizer": [
            {"num_bits": 4, "block_sizes": {-1: 128}, "enable": True},
            {"num_bits": (4, 3), "axis": None, "enable": True},
        ],
        "*input_quantizer": {"num_bits": (4, 3), "axis": None, "enable": True},
        "*lm_head*": {"enable": False},
        "*block_sparse_moe.gate*": {"enable": False},  # Skip the MOE router
        "*router*": {"enable": False},  # Skip the MOE router
        "default": {"enable": False},
    },
    "algorithm": "awq_lite",
}
```
These configs can be accessed as attributes of modelopt.torch.quantization and can be given as input to mtq.quantize(). For example:
```python
import modelopt.torch.quantization as mtq

model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```
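Here forward_loop is a callable that runs calibration data through the model. A minimal sketch, assuming a hypothetical calib_dataloader that yields batches the model can consume directly:

```python
# Hypothetical calibration loop passed to mtq.quantize(); `calib_dataloader`
# is assumed to exist and yield model inputs.
def forward_loop(model):
    for batch in calib_dataloader:
        model(batch)
```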
You can also create your own config by following these examples. For instance, if you want to quantize a model with the INT4 AWQ algorithm but need to skip quantizing the layer named lm_head, you can create a custom config and quantize your model as follows:
```python
import copy

import modelopt.torch.quantization as mtq

# Create custom config
CUSTOM_INT4_AWQ_CFG = copy.deepcopy(mtq.INT4_AWQ_CFG)
CUSTOM_INT4_AWQ_CFG["quant_cfg"]["*lm_head*"] = {"enable": False}

# Quantize model
model = mtq.quantize(model, CUSTOM_INT4_AWQ_CFG, forward_loop)
```
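Keys in "quant_cfg" can also be filter functions instead of wildcard strings. A minimal sketch, assuming (per the description above) that a callable key is matched against quantizer module names and returns True on a match; the layer pattern is purely illustrative:

```python
import copy

import modelopt.torch.quantization as mtq

# Hypothetical filter function matched against quantizer module names.
def skip_first_layer(name: str) -> bool:
    return "layers.0." in name

CUSTOM_CFG = copy.deepcopy(mtq.INT8_DEFAULT_CFG)
CUSTOM_CFG["quant_cfg"][skip_first_layer] = {"enable": False}
```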
- ModeloptConfig AWQClipCalibConfig
Bases: QuantizeAlgorithmConfig
The config for the awq_clip (AWQ clip) algorithm. AWQ clip searches for a clipped amax for per-group quantization. This search requires much more compute than AWQ lite; to avoid out-of-memory (OOM) errors, the linear layer weights are batched along the out_features dimension with batch size max_co_batch_size. AWQ clip calibration also takes longer than AWQ lite.
- Default config (JSON):
{ "method": "max", "max_co_batch_size": 1024, "max_tokens_per_batch": 64, "min_clip_ratio": 0.5, "shrink_step": 0.05, "debug": false }
- field debug: bool | None
If True, the module's search metadata will be kept as a module attribute named awq_clip.
- field max_co_batch_size: int | None
Reduce this number if a CUDA out-of-memory error occurs.
- field max_tokens_per_batch: int | None
The total number of tokens used for the clip search is max_tokens_per_batch * number of batches. The original AWQ uses a total of 512 tokens to search for clip values.
- field min_clip_ratio: float | None
It should be in the range (0, 1.0). Clip will search for the optimal clipping value in the range [original block amax * min_clip_ratio, original block amax].
- Constraints:
gt = 0.0
lt = 1.0
- field shrink_step: float | None
It should be in the range (0, 1.0]. The clip ratio will be searched from min_clip_ratio to 1 with the specified step size.
- Constraints:
gt = 0.0
le = 1.0
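As a sketch of tuning these fields (it follows the commented-out awq_clip alternative shown in INT4_AWQ_CFG above; the batch size value is illustrative):

```python
import copy

import modelopt.torch.quantization as mtq

# Use AWQ clip instead of AWQ lite, with a smaller out_features batch size
# to reduce peak memory during the clip search.
CFG = copy.deepcopy(mtq.INT4_AWQ_CFG)
CFG["algorithm"] = {"method": "awq_clip", "max_co_batch_size": 512}
```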
- ModeloptConfig AWQFullCalibConfig
Bases: AWQLiteCalibConfig, AWQClipCalibConfig
The config for the awq or awq_full algorithm (AWQ full). AWQ full performs awq_lite followed by awq_clip.
- Default config (JSON):
{ "method": "max", "max_co_batch_size": 1024, "max_tokens_per_batch": 64, "min_clip_ratio": 0.5, "shrink_step": 0.05, "debug": false, "alpha_step": 0.1 }
- field debug: bool | None
If True, the module's search metadata will be kept as module attributes named awq_lite and awq_clip.
- ModeloptConfig AWQLiteCalibConfig
Bases: QuantizeAlgorithmConfig
The config for the awq_lite (AWQ lite) algorithm. AWQ lite applies a channel-wise scaling factor which minimizes the output difference after quantization. See the AWQ paper for more details.
- Default config (JSON):
{ "method": "max", "alpha_step": 0.1, "debug": false }
- field alpha_step: float | None
The alpha will be searched from 0 to 1 with the specified step size.
- Constraints:
gt = 0.0
le = 1.0
- field debug: bool | None
If True, the module's search metadata will be kept as a module attribute named awq_lite.
- ModeloptConfig MaxCalibConfig
Bases: QuantizeAlgorithmConfig
The config for the max calibration algorithm.
Max calibration estimates the max values of activations or weights and uses these max values to set the quantization scaling factor. See Integer Quantization for the concepts.
- Default config (JSON):
{ "method": "max" }
- ModeloptConfig QuantizeAlgorithmConfig
Bases: ModeloptBaseConfig
Calibration algorithm config base.
- Default config (JSON):
{ "method": "max" }
- field method: str
The algorithm used for calibration. Supported algorithms include "max", "smoothquant", "awq_lite", "awq_full", and "awq_clip".
- ModeloptConfig QuantizeConfig
Bases: ModeloptBaseConfig
Default configuration for quantize mode.
- Default config (JSON):
{ "quant_cfg": { "default": { "num_bits": 8, "axis": null } }, "algorithm": "max" }
- field algorithm: None | str | MaxCalibConfig | SmoothQuantCalibConfig | AWQLiteCalibConfig | AWQClipCalibConfig | AWQFullCalibConfig | RealQuantizeConfig
- field quant_cfg: Dict[str | Callable, QuantizerAttributeConfig | List[QuantizerAttributeConfig] | Dict[str | Callable, QuantizerAttributeConfig | List[QuantizerAttributeConfig]]]
- ModeloptConfig QuantizerAttributeConfig
Bases: ModeloptBaseConfig
Quantizer attribute type.
- Default config (JSON):
{ "enable": true, "num_bits": 8, "axis": null, "fake_quant": true, "unsigned": false, "narrow_range": false, "learn_amax": false, "type": "static", "block_sizes": null, "trt_high_precision_dtype": "Float", "calibrator": "max" }
- field axis: int | Tuple[int, ...] | None
The specified axis/axes will have their own amax for computing the scaling factor. If None (the default), a per-tensor scale is used. Must be in the range [-rank(input_tensor), rank(input_tensor)). E.g., for a KCRS weight tensor, quant_axis=(0) will yield per-channel scaling.
- field block_sizes: Dict[int | str, int | Tuple[int, int] | str | Dict[int, int]] | None
The keys are the axes for block quantization and the values are the block sizes for quantization along the respective axes. Keys must be in the range [-tensor.dim(), tensor.dim()). Values, which are the block sizes for quantization, must be positive integers.
In addition, there can be special string keys "type", "scale_bits" and "scale_block_sizes".
Key "type" should map to "dynamic" or "static", where "dynamic" indicates dynamic block quantization and "static" indicates static calibrated block quantization. By default, the type is "static".
Key "scale_bits" specifies the quantization bits for the per-block quantization scale factor (i.e. a double quantization scheme).
Key "scale_block_sizes" specifies the block size for double quantization. By default, the per-block quantization scale is not quantized.
For example, block_sizes = {-1: 32} will quantize the last axis of the input tensor in blocks of size 32 with static calibration, and block_sizes = {-1: 32, "type": "dynamic"} will perform dynamic block quantization. If None, block quantization is not performed. axis must be None when block_sizes is not None. (See the sketch after this field list.)
- field calibrator: str | Callable | Tuple
The calibrator can be a string from ["max", "histogram"] or a constructor to create a calibrator which subclasses _Calibrator. See standardize_constructor_args for more information on how to specify the constructor.
- field enable: bool
If True, enables the quantizer. If False, bypasses the quantizer and returns the input tensor.
- field fake_quant: bool
If True, enable fake quantization.
- field learn_amax: bool
If True, enable learning amax.
- field narrow_range: bool
If True, enable narrow range quantization. Used only for integer quantization.
- field num_bits: int | Tuple[int, int]
num_bits can be:
- A positive integer for integer quantization, specifying the number of bits used.
- A constant integer tuple (E, M) for floating-point quantization emulating NVIDIA's FPx quantization, where E is the number of exponent bits and M is the number of mantissa bits. Supported FPx quantization formats: FP8 (E4M3, E5M2), FP6 (E3M2, E2M3), FP4 (E2M1).
- field trt_high_precision_dtype: str
The value is a string from ["Float", "Half", "BFloat16"]. The QDQs will be assigned the appropriate data type; this variable is only used when exporting the quantized ONNX model.
- Constraints:
pattern = ^Float$|^Half$|^BFloat16$
- field type: str
The value is a string from ["static", "dynamic"]. If "dynamic", dynamic quantization is enabled, which does not collect any statistics during calibration.
- Constraints:
pattern = ^static$|^dynamic$
- field unsigned: bool
If True, enable unsigned quantization. Used only for integer quantization.
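As a sketch combining several of these attributes (the values are illustrative, not an official config; in particular the tuple form of "scale_bits" is an assumption based on the field's type annotation):

```python
# Hypothetical quantizer attribute dict: FP4 (E2M1) dynamic block quantization
# along the last axis in blocks of 16, with the per-block scale factor itself
# quantized to FP8 (E4M3) via "scale_bits" (a double quantization scheme).
# `axis` is left at its default of None, as required when block_sizes is set.
attrs = {
    "num_bits": (2, 1),  # FP4: 2 exponent bits, 1 mantissa bit
    "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
}
```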
- ModeloptConfig RealQuantizeConfig
Bases: QuantizeAlgorithmConfig
The config for real quantization.
The additional_algorithm will be used for calibration before quantizing weights into low precision.
- Default config (JSON):
{ "method": "max", "additional_algorithm": "" }
- field additional_algorithm: AWQLiteCalibConfig | AWQClipCalibConfig | AWQFullCalibConfig | None
The algorithm used for calibration. Supported algorithms include "awq_lite", "awq_full", and "awq_clip".
- ModeloptConfig SmoothQuantCalibConfig
Bases: QuantizeAlgorithmConfig
The config for the smoothquant algorithm (SmoothQuant). SmoothQuant applies a smoothing factor which balances the scale of outliers in weights and activations. See the SmoothQuant paper for more details.
- Default config (JSON):
{ "method": "max", "alpha": 1.0 }
- field alpha: float | None
This hyper-parameter controls the migration strength. The migration strength is within [0, 1]; a larger value migrates more quantization difficulty to the weights (see the sketch below).
- Constraints:
ge = 0.0
le = 1.0
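As a sketch of overriding this hyper-parameter (it follows the same dict-valued "algorithm" pattern used by INT4_AWQ_CFG above; the value 0.8 is illustrative):

```python
import copy

import modelopt.torch.quantization as mtq

# Start from the recommended SmoothQuant config and lower the migration
# strength so less quantization difficulty is moved onto the weights.
CFG = copy.deepcopy(mtq.INT8_SMOOTHQUANT_CFG)
CFG["algorithm"] = {"method": "smoothquant", "alpha": 0.8}
```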