config

This document lists the quantization formats supported by Model Optimizer and example quantization configs.

Quantization Formats

The following table lists the quantization formats supported by Model Optimizer and the corresponding quantization config. See Quantization Configs for the specific quantization config definitions.

Please see choosing the right quantization formats to learn more about the formats and their use-cases.

Note

The recommended configs below are for LLMs. For CNN models, only INT8 quantization is supported; please use the INT8_DEFAULT_CFG quantization config for CNN models.

Quantization Format              Model Optimizer config
INT8                             INT8_SMOOTHQUANT_CFG
FP8                              FP8_DEFAULT_CFG
INT4 Weights only AWQ (W4A16)    INT4_AWQ_CFG
INT4-FP8 AWQ (W4A8)              W4A8_AWQ_BETA_CFG

Quantization Configs

A quantization config is a dictionary specifying values for the keys "quant_cfg" and "algorithm". The "quant_cfg" key specifies the quantization configuration. The "algorithm" key specifies the algorithm argument to calibrate.

The quantization configuration is a dictionary mapping wildcards or filter functions to quantizer attributes. The wildcards or filter functions are matched against the quantizer module names. The quantizer modules have names ending with weight_quantizer and input_quantizer; they perform weight quantization and input quantization (also called activation quantization), respectively. The quantizer modules are generally instances of TensorQuantizer, and the specified quantizer attributes describe their quantization behavior. Quantizer attributes are given as a dictionary mapping quantizer attribute names to their values.

Quantizer attributes can also be a list of dictionaries. In this case, the matched quantizer module is replaced with a SequentialQuantizer module, which quantizes a tensor in multiple formats sequentially. Each quantizer attribute dictionary in the list specifies the quantization format for one step of the sequential quantizer. For example, SequentialQuantizer is used in "INT4 Weights, FP8 Activations" quantization, in which the weights are quantized in INT4 followed by FP8 (see W4A8_AWQ_BETA_CFG below).

Here are example quantization configs from Model Optimizer:

INT8_DEFAULT_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": 8, "axis": 0},
        "*input_quantizer": {"num_bits": 8, "axis": None},
        "*lm_head*": {"enable": False},
        "*block_sparse_moe.gate*": {"enable": False},  # Skip the MOE router
        "default": {"num_bits": 8, "axis": None},
    },
    "algorithm": "max",
}

INT8_SMOOTHQUANT_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": 8, "axis": 0},
        "*input_quantizer": {"num_bits": 8, "axis": -1},
        "*lm_head*": {"enable": False},
        "*block_sparse_moe.gate*": {"enable": False},  # Skip the MOE router
        "default": {"num_bits": 8, "axis": None},
    },
    "algorithm": "smoothquant",
}

FP8_DEFAULT_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": (4, 3), "axis": None},
        "*input_quantizer": {"num_bits": (4, 3), "axis": None},
        "*block_sparse_moe.gate*": {"enable": False},  # Skip the MOE router
        "default": {"num_bits": (4, 3), "axis": None},
    },
    "algorithm": "max",
}

INT4_BLOCKWISE_WEIGHT_ONLY_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": 4, "block_sizes": {-1: 128}, "enable": True},
        "*input_quantizer": {"enable": False},
        "*lm_head*": {"enable": False},
        "*block_sparse_moe.gate*": {"enable": False},  # Skip the MOE router
        "default": {"enable": False},
    },
    "algorithm": "max",
}

INT4_AWQ_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": 4, "block_sizes": {-1: 128}, "enable": True},
        "*input_quantizer": {"enable": False},
        "*lm_head*": {"enable": False},
        "*block_sparse_moe.gate*": {"enable": False},  # Skip the MOE router
        "default": {"enable": False},
    },
    "algorithm": {"method": "awq_lite", "alpha_step": 0.1},
    # "algorithm": {"method": "awq_full", "alpha_step": 0.1, "max_co_batch_size": 1024},
    # "algorithm": {"method": "awq_clip", "max_co_batch_size": 2048},
}

W4A8_AWQ_BETA_CFG = {
    "quant_cfg": {
        "*weight_quantizer": [
            {"num_bits": 4, "block_sizes": {-1: 128}, "enable": True},
            {"num_bits": (4, 3), "axis": None, "enable": True},
        ],
        "*input_quantizer": {"num_bits": (4, 3), "axis": None, "enable": True},
        "*lm_head*": {"enable": False},
        "*block_sparse_moe.gate*": {"enable": False},  # Skip the MOE router
        "default": {"enable": False},
    },
    "algorithm": "awq_lite",
}

These configs can be accessed as attributes of modelopt.torch.quantization and can be given as input to mtq.quantize(). For example:

import modelopt.torch.quantization as mtq
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
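
The forward_loop argument is used to run calibration data through the model during quantization. A minimal sketch of such a loop, assuming a calib_dataloader that yields input batches (the dataloader name and batch handling are illustrative assumptions):

def forward_loop(model):
    # Run a small amount of calibration data through the model so the
    # quantizers can collect statistics; `calib_dataloader` is an assumed name.
    for batch in calib_dataloader:
        model(batch)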

You can also create your own config by following these examples. For instance, if you want to quantize a model with the INT4 AWQ algorithm but need to skip quantizing the layer named lm_head, you can create a custom config and quantize your model as follows:

import copy

# Create a custom config based on the predefined INT4 AWQ config
CUSTOM_INT4_AWQ_CFG = copy.deepcopy(mtq.INT4_AWQ_CFG)
CUSTOM_INT4_AWQ_CFG["quant_cfg"]["*lm_head*"] = {"enable": False}

# Quantize the model with the custom config
model = mtq.quantize(model, CUSTOM_INT4_AWQ_CFG, forward_loop)
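
As described under Quantization Configs, the "quant_cfg" keys can also be filter functions that are matched against the quantizer module names. A minimal sketch of that pattern, assuming the filter function receives a quantizer module name and returns True for a match; the function name and the "layers.0." substring are illustrative assumptions:

import copy

import modelopt.torch.quantization as mtq


def is_first_layer_quantizer(name: str) -> bool:
    # Hypothetical filter: match quantizers belonging to the first decoder layer.
    return "layers.0." in name


CUSTOM_FILTER_CFG = copy.deepcopy(mtq.FP8_DEFAULT_CFG)
# Disable quantization for every quantizer matched by the filter function.
CUSTOM_FILTER_CFG["quant_cfg"][is_first_layer_quantizer] = {"enable": False}
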
ModeloptConfig QuantizeConfig

Bases: ModeloptBaseConfig

Default configuration for quantize mode.

Default config (JSON):

{
   "quant_cfg": {
      "default": {
         "num_bits": 8,
         "axis": null
      }
   },
   "algorithm": "max"
}

field algorithm: str | Dict[str, Any]
field quant_cfg: Dict[str | Callable, Any]
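
For reference, the same settings can also be expressed through the QuantizeConfig class instead of a plain dictionary. A minimal sketch, assuming QuantizeConfig is importable from modelopt.torch.quantization.config and accepts the two fields above as keyword arguments:

from modelopt.torch.quantization.config import QuantizeConfig

# Mirrors the default config shown above: per-tensor INT8 with "max" calibration.
cfg = QuantizeConfig(
    quant_cfg={"default": {"num_bits": 8, "axis": None}},
    algorithm="max",
)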