algorithms

Module for advanced quantization algorithms.

Classes

AutoQuantizeSearcher

A searcher for the AutoQuantize algorithm.

QuantRecipe

A subclass of QuantizeConfig enabling auto_quantize-specific configurations.

QuantRecipeHparam

An Hparam for quantization recipes.

class AutoQuantizeSearcher

Bases: BaseSearcher

A searcher for the AutoQuantize algorithm.

In AutoQuantize, we search for the best per-layer quantization configuration that minimizes the sum of per-layer scores while meeting the specified constraint. AutoQuantize uses a linear programming (LP) solver to find the optimal quantization configuration.
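
As a sketch, the selection problem can be posed as a binary linear program. Let x_{l,r} indicate that layer l uses recipe r, let s_{l,r} be the corresponding auto_quantize score, let c_{l,r} be the cost of that choice under the constraint (e.g. effective bits), and let B be the budget. The exact cost model used by the solver is an implementation detail; this only shows the shape of the problem:

    \min_x \sum_l \sum_r s_{l,r} \, x_{l,r}
    \quad \text{s.t.} \quad \sum_r x_{l,r} = 1 \;\; \forall l,
    \qquad \sum_l \sum_r c_{l,r} \, x_{l,r} \le B,
    \qquad x_{l,r} \in \{0, 1\}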

The auto_quantize score for a layer's quantization configuration is an approximation of the change in model loss due to quantizing that layer with that configuration. The approximation is based on a Taylor expansion of the loss function with respect to the quantized output of the layer, with the Fisher information substituted for the Hessian. This approximation is mathematically sound for models trained with a log-likelihood loss, such as BERT, GPT, etc. However, the auto_quantize score can still be used as a proxy for other models such as ResNet.
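
As a hedged sketch of the derivation (notation introduced here, not part of the API): let y be a layer's output, y_q its quantized output, \delta = y_q - y, and g = \partial \mathcal{L} / \partial y. The second-order Taylor expansion of the loss is

    \Delta \mathcal{L} \approx g^\top \delta + \tfrac{1}{2} \, \delta^\top H \delta

For a log-likelihood loss, the Hessian H can be replaced by the empirical Fisher information E[g g^\top]; using a diagonal approximation, and noting that the first-order term vanishes in expectation near a (local) optimum, the score reduces, up to constants, to

    \text{score} \approx \sum_i g_i^2 \, \delta_i^2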

before_search()

Prepare the model for search by calibrating the quantizers and collecting the AutoQuantize score.

candidate_stats: Dict[str, Dict[str, List[float]]]

property default_search_config

Get the default search config for AutoQuantize.

property default_state_dict: Dict[str, Any]

Get the default state dict for AutoQuantize.

gradient_checkpointing_enable_contexts: List[Tuple[Callable, Callable]] = [(<function _is_supported_hf_model>, <function setup_model_for_gradient_checkpointing>)]

classmethod insert_hparams_after_merge_rules(model, quant_recipes)

Restrict the search space using the merge rules and insert the hparams for the model.

classmethod register_gradient_checkpointing_enable_context(is_supported_checker, context)

Register a gradient checkpointing enable context for AutoQuantize score estimation.

If is_supported_checker(model) returns True, context(model) will be used to enable gradient checkpointing; see the sketch after the parameter list below.

Parameters:
  • is_supported_checker (Callable) – Takes the model and returns True if this context should be used for it.

  • context (Callable) – Takes the model and returns a context (e.g. a context manager) under which gradient checkpointing is enabled.
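
A minimal sketch of registering a custom context. The module path, the model check, and the gradient_checkpointing_enable/disable methods are assumptions for illustration, not part of the documented API:

    import contextlib

    from modelopt.torch.quantization.algorithms import AutoQuantizeSearcher

    def _is_my_model(model):
        # Assumption: identify the supported model by its class name.
        return model.__class__.__name__ == "MyTransformer"

    @contextlib.contextmanager
    def _gradient_checkpointing_context(model):
        # Assumption: the model exposes HF-style enable/disable methods.
        model.gradient_checkpointing_enable()
        try:
            yield
        finally:
            model.gradient_checkpointing_disable()

    AutoQuantizeSearcher.register_gradient_checkpointing_enable_context(
        _is_my_model, _gradient_checkpointing_context
    )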

rules = ['^(.*?)\\.(q_proj|k_proj|v_proj)$', '^(.*?)\\.(gate_proj|up_proj)$', '^(.*?)\\.(\\d+\\.(w1|w2|w3))$', '^(.*?)\\.((w1_linear|w2_linear|w3_linear)\\.\\d+)$']
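
For illustration, the first rule merges the q/k/v projections of the same attention block into one search group because they share the captured module-path prefix (the module names below are hypothetical):

    import re

    rule = r"^(.*?)\.(q_proj|k_proj|v_proj)$"
    for name in ("model.layers.0.self_attn.q_proj",
                 "model.layers.0.self_attn.k_proj"):
        # Both names yield the prefix "model.layers.0.self_attn",
        # so both modules are restricted to the same quantization recipe.
        print(re.match(rule, name).group(1))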

run_search()

Search for the best per-layer quantization configuration and return the best model and configuration.

AutoQuantize uses a linear programming solver to find the optimal quantization configuration that minimizes the sum of per-layer auto_quantize scores while meeting the specified constraint.
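
A hedged end-to-end sketch via the top-level entry point, assumed here to be modelopt.torch.quantization.auto_quantize; the constraint key, format names, callback signatures, and the model/calib_dataloader objects are illustrative assumptions:

    import modelopt.torch.quantization as mtq

    # Assumptions: `model` and `calib_dataloader` already exist; the keyword
    # names and return value mirror this class's description and may differ.
    best_model, search_state = mtq.auto_quantize(
        model,
        constraints={"effective_bits": 4.8},
        quantization_formats=["FP8_DEFAULT_CFG", "W4A8_AWQ_BETA_CFG"],
        data_loader=calib_dataloader,
        forward_step=lambda model, batch: model(**batch),
        loss_func=lambda output, batch: output.loss,
    )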

sanitize_search_config(config)

Sanitize the search config dict.

Parameters:

config (Dict[str, Any] | None) –

Return type:

Dict[str, Any]

class QuantRecipe

Bases: CustomHPType

A subclass of QuantizeConfig enabling auto_quantize-specific configurations.

__init__(name=None)

Initialize the QuantRecipe with the name of the quantization format.

Parameters:

name (str | None) – Name of the quantization format.

property compression: float

Get the compression factor for the quantization format.

property config: QuantizeConfig

Get the quantization configuration for the quantization format.

static disable_folding_pqs_to_weights()

Disable the folding of pre_quant_scale into weights.

static fold_pqs_to_weights(model)

Fold the pre_quant_scale of the weight quantizers into the weights.
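
As a sketch of what folding does (notation assumed): for a linear layer y = x W^\top whose input is scaled by a pre_quant_scale s, the scaling can be absorbed into the weight,

    y = (x \odot s) W^\top = x \, (W \, \mathrm{diag}(s))^\top

so the scaled weight W diag(s) is what gets quantized, and no separate input-scaling op remains at runtime.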

class QuantRecipeHparam

Bases: Hparam

An Hparam for quantization recipes.

In addition, this Hparam also:

1. Keeps a link to its modules and sets the quantizers for the module based on the active recipe.
2. Keeps track of the importance of each recipe in a dict instead of a tensor.

__init__(choices, original=None, nn_modules=None)

Initialize the Hparam with its original value and choices.

Parameters:
  • choices – The quantization recipes available for this hparam.

  • original – The original (default) recipe.

  • nn_modules – The modules whose quantizers this hparam controls.

Return type:

None
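
Given the signature above, a hypothetical usage sketch; the recipe names, linear_module, and the effect of assigning to active are assumptions based on this class's description:

    from modelopt.torch.quantization.algorithms import QuantRecipe, QuantRecipeHparam

    # Assumption: one FP8 recipe plus the "no quantization" recipe (name=None).
    recipes = [QuantRecipe("FP8_DEFAULT_CFG"), QuantRecipe(None)]
    hparam = QuantRecipeHparam(recipes, original=recipes[-1],
                               nn_modules=[linear_module])

    hparam.active = recipes[0]  # sets the module's quantizers per the FP8 recipe
    print(hparam.importance)    # dict mapping each recipe to its importance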

property active: Tuple[int, ...] | int | float | CustomHPType

Return the currently active value.

property importance: Dict

Return the importance dict mapping each recipe to its importance.