algorithms

Module for advanced quantization algorithms.

Classes

AutoQuantizeGradientSearcher

A searcher for the AutoQuantize algorithm that uses gradient-based score estimation.

AutoQuantizeKLDivSearcher

A searcher for the AutoQuantize algorithm that uses score estimation based on the KL-Divergence loss.

AutoQuantizeSearcher

alias of AutoQuantizeGradientSearcher

QuantRecipe

A subclass of QuantizeConfig enabling auto_quantize-specific configurations.

QuantRecipeHparam

An Hparam for quantization recipes.

Functions

estimate_quant_compression

Estimate the compression ratio of a quantization configuration.

class AutoQuantizeGradientSearcher

Bases: _AutoQuantizeBaseSearcher

A searcher for the AutoQuantize algorithm that uses gradient-based score estimation.

In AutoQuantize, we search for the best per-layer quantization configuration that minimizes the sum of per-layer scores while meeting the specified constraint. AutoQuantize uses a linear programming solver to find the optimal quantization configuration.

The auto_quantize score for a layer quantization configuration is an approximation of the model loss change caused by quantizing that particular layer with that particular configuration. The approximation is based on a Taylor expansion of the loss function with respect to the quantized output of the layer, substituting the Fisher information for the Hessian. This approximation is mathematically correct for models where the loss is a log-likelihood loss, such as BERT, GPT, etc. However, the auto_quantize score can still be used as a proxy for other models such as ResNet.
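
Concretely (a sketch of the standard derivation; the exact normalization used by the implementation may differ), for a layer output y and its quantized counterpart y_q, the per-layer score is approximately

    \Delta L \approx \tfrac{1}{2} \sum_i \mathbb{E}\big[(\partial L / \partial y_i)^2\big]\,(y_{q,i} - y_i)^2

where the expected squared gradient is the diagonal Fisher approximation of the Hessian and the first-order term is dropped.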

Quant Modules:

This searcher operates on quantizable modules (quant modules), which are typically Linear or Conv layers that support quantization. Optionally, grouping rules can be applied to ensure certain layers share the same quantization format (e.g., Q, K, V projections in the same attention layer). For details on quant_grouping_rules and customization, see the auto_quantize API documentation.

Score Modules:

By default, the sensitivity score of each quant module is estimated from that module's own output perturbation. However, the sensitivity can also be estimated from the perturbation at a separate point in the network (a score module). This is helpful in cases such as MoE models, where it improves speed and lowers memory consumption: since all experts are already restricted to the same quant format by the quant grouping rules, their sensitivity can be estimated together at a single point (e.g., the MLP output).

property default_search_config

Get the default config for the searcher.

estimate_sensitivity_scores()

Estimate sensitivity scores using a Hessian approximation.

Return type:

None

classmethod register_custom_support(is_supported_checker, grad_ckpt_context, is_param_grad_enabled)

(Optional) Register custom support for AutoQuantize score estimation.

This custom support is used to enable memory- and compute-efficient backward gradient propagation. It involves:

  • grad_ckpt_context: runs the backward pass with gradient checkpointing enabled.

  • is_param_grad_enabled: AutoQuantize only needs activation gradients to be computed (not weight gradients). is_param_grad_enabled selects which parameters have gradients enabled, limiting gradient computation to what is needed for the activation gradients. For LLMs, enabling the gradient of just the embedding layer weight is sufficient to trigger computation of all downstream activation gradients.

If is_supported_checker(model) returns True, grad_ckpt_context(model) will be used to enable gradient checkpointing and is_param_grad_enabled(pname, model) will be used to select which parameters have gradients enabled, minimizing gradient computation. An illustrative usage sketch follows the parameter list below.

Parameters:
  • is_supported_checker (Callable)

  • grad_ckpt_context (Callable)

  • is_param_grad_enabled (Callable)

Return type:

None
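
A minimal usage sketch (illustrative only: the module path in the import, MyTransformer, the gradient_checkpointing_enable/disable methods, and the embed_tokens parameter name are assumptions, not part of this API):

    import contextlib

    from modelopt.torch.quantization.algorithms import AutoQuantizeGradientSearcher

    def is_my_model(model):
        # Hypothetical check; replace with a test for your own model class.
        return model.__class__.__name__ == "MyTransformer"

    @contextlib.contextmanager
    def my_grad_ckpt_context(model):
        # Assumes the model exposes HF-style gradient checkpointing toggles.
        model.gradient_checkpointing_enable()
        try:
            yield
        finally:
            model.gradient_checkpointing_disable()

    def my_param_grad_filter(pname, model):
        # Enabling only the embedding weight gradient is enough to propagate
        # activation gradients through the rest of the network.
        return "embed_tokens" in pname

    AutoQuantizeGradientSearcher.register_custom_support(
        is_supported_checker=is_my_model,
        grad_ckpt_context=my_grad_ckpt_context,
        is_param_grad_enabled=my_param_grad_filter,
    )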

run_search_with_stats(max_weight_size, verbose=False)

Linear programming solve for gradient-based auto_quantize.

AutoQuantize uses a linear programming solver to find the optimal quantization configuration that minimizes the sum of per-layer auto_quantize scores while meeting the specified constraint.
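
The search can be phrased as a 0-1 integer program. The sketch below is purely illustrative: it does not show the solver backend the library actually uses, and scores, costs, and max_weight_size are toy placeholders for the per-module scores, per-recipe compressed weight sizes, and the search constraint.

    import pulp

    # Toy placeholders: per-module scores and compressed weight sizes per recipe.
    scores = {"layer0": {"int8": 0.1, "int4": 0.9}, "layer1": {"int8": 0.2, "int4": 0.3}}
    costs = {"layer0": {"int8": 50.0, "int4": 25.0}, "layer1": {"int8": 50.0, "int4": 25.0}}
    max_weight_size = 80.0

    prob = pulp.LpProblem("auto_quantize", pulp.LpMinimize)
    x = {(m, r): pulp.LpVariable(f"x_{m}_{r}", cat="Binary")
         for m in scores for r in scores[m]}

    # Objective: minimize the total auto_quantize score.
    prob += pulp.lpSum(scores[m][r] * x[m, r] for (m, r) in x)
    # Each module gets exactly one recipe.
    for m in scores:
        prob += pulp.lpSum(x[m, r] for r in scores[m]) == 1
    # The total compressed weight size must meet the constraint.
    prob += pulp.lpSum(costs[m][r] * x[m, r] for (m, r) in x) <= max_weight_size
    prob.solve()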

sanitize_search_config(config)

Sanitize the search config dict.

Parameters:

config (dict[str, Any] | None)

Return type:

dict[str, Any]

score_module_rules = ['^(.*?\\.mlp)\\.experts\\.\\d+\\.(gate_proj|up_proj|down_proj)$', '^(.*?)\\.(\\d+\\.(w1|w2|w3))$', '^(.*?)\\.((w1_linear|w2_linear|w3_linear)\\.\\d+)$']
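
Reading the first rule as an illustration: a module named model.layers.0.mlp.experts.3.gate_proj matches the pattern, and its leading capture group, model.layers.0.mlp, identifies the shared scoring point, so all expert projections inside that MLP are scored together, consistent with the MoE example above.
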
class AutoQuantizeKLDivSearcher

Bases: _AutoQuantizeBaseSearcher

A searcher for the AutoQuantize algorithm that uses score estimation based on the KL-Divergence loss.

property default_search_config

Get the default config for the searcher.

estimate_sensitivity_scores()

Estimate the sensitivity scores for the model.

A higher score means the layer is more sensitive to quantization.

run_search_with_stats(max_weight_size, verbose=False)

Run a threshold-based binary search for KL-Divergence-loss-based auto_quantize.

We use binary search to minimize the maximum per-layer score while meeting the constraint.
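
An illustrative sketch of this strategy (not the library's internal code; the per-layer scores and the low/high-precision cost bookkeeping are simplified placeholders):

    def search_threshold(scores, cost_low, cost_high, max_weight_size, iters=30):
        # Binary-search the smallest score threshold t such that quantizing all
        # layers with score <= t still satisfies the weight-size constraint.
        lo, hi = 0.0, max(scores.values())
        best = None
        for _ in range(iters):
            t = (lo + hi) / 2
            plan = {m: ("low" if s <= t else "high") for m, s in scores.items()}
            total = sum(cost_low[m] if fmt == "low" else cost_high[m]
                        for m, fmt in plan.items())
            if total <= max_weight_size:
                best, hi = plan, t  # constraint met: try a smaller threshold
            else:
                lo = t              # still too large: quantize more layers
        return best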

sanitize_search_config(config)

Sanitize the search config dict.

Parameters:

config (dict[str, Any] | None)

Return type:

dict[str, Any]

AutoQuantizeSearcher

alias of AutoQuantizeGradientSearcher

class QuantRecipe

Bases: CustomHPType

A subclass of QuantizeConfig enabling auto_quantize-specific configurations.

Parameters:
  • quant_cfg – str, dict, or None. A dict is used for custom quantization formats.

  • name – name for a custom quantization format. Only used if the quantization format is a custom format not available in modelopt.torch.quantization.config.

__init__(quant_cfg=None, name=None)

Initialize the QuantRecipe with the quantization configuration.

Parameters:
  • quant_cfg (str | dict[str, Any] | None)

  • name (str | None)
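
For example (illustrative: the predefined format name below is assumed to exist in modelopt.torch.quantization.config, and the custom dict is only a schematic placeholder, not a validated configuration):

    from modelopt.torch.quantization.algorithms import QuantRecipe  # assumed module path

    # A recipe built from a predefined configuration name.
    recipe = QuantRecipe(quant_cfg="FP8_DEFAULT_CFG")

    # A custom format passed as a dict should be given an explicit name.
    custom = QuantRecipe(
        quant_cfg={
            "quant_cfg": {
                "*weight_quantizer": {"num_bits": 4, "axis": None},
                "*input_quantizer": {"enable": False},
            },
            "algorithm": "max",
        },
        name="w4_custom",
    )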

static disable_folding_pqs_to_weights()

Disable the folding of pre_quant_scale to weights.

static fold_pqs_to_weights(model)

Fold the pre_quant_scale in weight_quantizers to weights.

static get_auto_name_for_config(quant_cfg)

Get a name for the quantization configuration.

Parameters:

quant_cfg (str | dict[str, Any] | None)

Return type:

str | None

property num_bits: int

Get the number of bits for the quantization format.

class QuantRecipeHparam

Bases: Hparam

An Hparam for quantization recipes.

See Hparam for more details. In addition, this Hparam also:

  • Keeps a link to its quant_modules and score_modules and sets the quantizers for the quant_modules based on the active recipe.

  • Provides get_score() and get_cost() methods to evaluate recipes.

  • Registers itself with each score_module via the _hparams_for_scoring attribute.

__init__(choices=None, quant_modules=None, score_modules=None, name=None)

Initializes Hparam with original value and choices.

Parameters:
  • choices (Sequence[QuantRecipe] | None)

  • quant_modules (list[Module] | None)

  • score_modules (list[Module] | None)

  • name (str | None)

Return type:

None
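
A constructor sketch (illustrative: linear1, linear2, and mlp stand for modules of an already-converted quantizable model, and quant_cfg=None is assumed to denote the unquantized choice):

    hparam = QuantRecipeHparam(
        choices=[QuantRecipe(quant_cfg="FP8_DEFAULT_CFG"), QuantRecipe(quant_cfg=None)],
        quant_modules=[linear1, linear2],
        score_modules=[mlp],  # optional: score both modules at a shared point
        name="quant_recipe",
    )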

property active: tuple[int, ...] | int | float | CustomHPType

Return the currently active value.

property attrs: list[str]

Return the attributes of the hparam for repr.

get_cost(recipe)

Get the cost for a given recipe.

The cost is the total weight size of the quantizable modules multiplied by the compression ratio of the recipe.

Parameters:

recipe (QuantRecipe)

Return type:

float
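
For example (illustrative numbers): if the quant modules tracked by this hparam hold 100 million weights in total and the recipe's estimated compression ratio is 0.25, the cost is 100e6 * 0.25 = 25e6.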

get_score(recipe)

Get the score for a given recipe.

Parameters:

recipe (QuantRecipe)

Return type:

float

property importance: dict

Raises an error since this is not a useful abstraction for AutoQuantize.

estimate_quant_compression(quant_cfg)

Estimate the compression ratio of a quantization configuration.

Right now, we find the minimum compression ratio across all quantizer attribute configs. This is not perfect but is a good proxy for the overall compression ratio. We will improve this in future releases.

Parameters:

quant_cfg (QuantizeConfig) – The quantization configuration to estimate compression for.

Returns:

The estimated compression ratio (0.0 to 1.0).

Return type:

float
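
For example (illustrative): if one quantizer attribute config in the configuration implies a 0.25 ratio (e.g., 4-bit relative to a 16-bit baseline) and another implies 0.5 (e.g., 8-bit), the minimum, 0.25, is returned.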