tensor_quantizer

TensorQuantizer Module.

Classes

TensorQuantizer

Tensor quantizer module.

SequentialQuantizer

A sequential container for TensorQuantizer modules.

class SequentialQuantizer

Bases: Sequential

A sequential container for TensorQuantizer modules.

This module quantizes a tensor in multiple formats sequentially. It takes TensorQuantizer modules as input and wraps them in a container, similar to torch.nn.Sequential.

We delegate certain properties and methods to all contained quantizers. In the case of conflicts, the first quantizer’s property or method takes priority.

SequentialQuantizer is useful in cases such as INT4 weights with FP8 activations, where the weight quantization differs from the GEMM quantization. It allows multiple quantization formats to be applied to the same tensor in sequence.

Use SequentialQuantizer methods in lower level implementations for better code organization and readability.
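The idea of applying several formats to one tensor in sequence can be illustrated with a minimal pure-Python sketch. It does not use the library: `fake_quantize` and `ToySequentialQuantizer` are hypothetical names, and an INT4 stage followed by an INT8 stage stands in for the INT4/FP8 case mentioned above.

```python
def fake_quantize(values, amax, num_bits):
    """Quantize-dequantize floats onto a signed integer grid (illustrative only)."""
    maxbound = 2 ** (num_bits - 1) - 1          # 7 for 4 bits, 127 for 8 bits
    scale = maxbound / amax
    out = []
    for v in values:
        q = round(v * scale)
        q = max(-maxbound, min(maxbound, q))    # clamp to the representable range
        out.append(q / scale)                   # map back to floating point
    return out


class ToySequentialQuantizer:
    """Hypothetical stand-in: each stage receives the output of the previous one."""

    def __init__(self, *stages):
        self.stages = list(stages)

    def __call__(self, values):
        for stage in self.stages:
            values = stage(values)
        return values


def int4_stage(values):
    return fake_quantize(values, amax=1.0, num_bits=4)


def int8_stage(values):
    return fake_quantize(values, amax=1.0, num_bits=8)


weights = [0.9, -0.35, 0.1, -1.0]
sequential = ToySequentialQuantizer(int4_stage, int8_stage)
result = sequential(weights)   # same as int8_stage(int4_stage(weights))
```

The order of the stages matters: the second stage only ever sees values that already lie on the first stage's grid.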

Parameters:

quantizers (TensorQuantizer) – TensorQuantizer modules to be added to the container.

__init__(*quantizers)

Initialize SequentialQuantizer module.

Parameters:

quantizers (TensorQuantizer) – TensorQuantizer modules to be added to the container.

static convert_to_single_quantizer(model, indx=0)

Replace instances of SequentialQuantizer in the model with a single TensorQuantizer.

The quantizer indexed by indx from the sequential quantizer is used to replace it. This method is useful for individually calibrating the quantizers in a sequential quantizer.

Parameters:

indx (int) – Index of the quantizer within each SequentialQuantizer that is used to replace it.

get_modelopt_state()

Get meta state to be saved in checkpoint.

Return type:

dict[str, Any]

set_from_attribute_config(attributes)

Set the attributes of contained quantizers from a list of attribute_dicts.

Parameters:

attributes (list[dict[str, Any] | QuantizerAttributeConfig] | dict[str, Any] | QuantizerAttributeConfig) –

class TensorQuantizer

Bases: Module

Tensor quantizer module.

This module manages quantization and calibration of the input tensor. It can perform fake (simulated) quantization or real quantization for various precisions and formats, such as FP8 per-tensor, INT8 per-channel, and INT4 per-block.

If quantization is enabled, the module calls the appropriate quantization functional and returns the quantized tensor. For fake quantization, the quantized tensor has the same data type as the input tensor. In calibration mode, the module collects statistics using its calibrator.

The quantization parameters are as described in QuantizerAttributeConfig. They can be set at initialization using quant_attribute_cfg or later by calling set_from_attribute_config().

Parameters:
  • quant_attribute_cfg – An instance of QuantizerAttributeConfig or None. If None, default values are used.

  • if_quant – A boolean. If True, quantization is enabled in the forward path.

  • if_calib – A boolean. If True, calibration is enabled in the forward path.

  • amax – None or an array-like object (list, tuple, NumPy array, or scalar) that can be used to construct the amax tensor.

__init__(quant_attribute_cfg=None, if_quant=True, if_calib=False, amax=None)

Initialize quantizer and set up required variables.
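How the if_quant and if_calib switches interact with amax can be sketched with a small pure-Python toy. This is not the library's implementation: `ToyQuantizer` is a hypothetical class that assumes a max calibrator and signed integer fake quantization, and it only borrows the documented parameter and method names for readability.

```python
class ToyQuantizer:
    """Hypothetical, simplified model of the quantize/calibrate switches."""

    def __init__(self, num_bits=8, if_quant=True, if_calib=False, amax=None):
        self.num_bits = num_bits
        self.if_quant = if_quant
        self.if_calib = if_calib
        self.amax = amax
        self._calib_amax = None  # running statistic kept by the assumed max calibrator

    def forward(self, values):
        if self.if_calib:
            # Calibration: record the largest absolute value seen so far.
            batch_max = max(abs(v) for v in values)
            self._calib_amax = max(self._calib_amax or 0.0, batch_max)
        if not self.if_quant:
            return list(values)  # quantization disabled: pass the data through
        maxbound = 2 ** (self.num_bits - 1) - 1
        scale = maxbound / self.amax
        # Fake quantization: round onto the integer grid, then map back to floats.
        return [max(-maxbound, min(maxbound, round(v * scale))) / scale for v in values]

    def load_calib_amax(self):
        # Copy the calibrated statistic into the value used for quantization.
        self.amax = self._calib_amax


# Phase 1: calibrate only. Inputs pass through unchanged while statistics accumulate.
quantizer = ToyQuantizer(if_quant=False, if_calib=True)
passthrough = quantizer.forward([0.5, -2.0, 1.25])
quantizer.forward([0.1, 3.0])
quantizer.load_calib_amax()

# Phase 2: quantize with the calibrated amax.
quantizer.if_calib, quantizer.if_quant = False, True
fake_quantized = quantizer.forward([0.5, -2.0, 3.0])
```

The output of the second phase is still a list of floats, mirroring the statement that fake quantization keeps the input data type.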

property amax

Return amax for quantization.

property axis

Return axis for quantization.

property bias

Return bias for quantization.

property bias_axis

Return bias_axis for quantization.

property bias_calibrator

Return bias_calibrator for quantization.

property bias_method

Return bias_method for quantization.

property bias_type

Return bias_type for quantization.

property bias_value

Return the bias value for quantization.

property block_sizes

Return block_sizes for quantization.
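The motivation for per-block quantization can be shown with a short sketch. The actual format of block_sizes is defined by QuantizerAttributeConfig and is not reproduced here; `per_block_amax` and the block size of 4 are assumptions chosen for illustration.

```python
def per_block_amax(values, block_size):
    """Return one amax (largest absolute value) per contiguous block."""
    return [
        max(abs(v) for v in values[start:start + block_size])
        for start in range(0, len(values), block_size)
    ]


row = [0.02, -0.01, 0.03, 0.015, 4.0, -2.5, 1.0, 3.5]
block_amax = per_block_amax(row, block_size=4)   # one scale per 4 elements
tensor_amax = max(abs(v) for v in row)           # a single per-tensor scale
```

With a single per-tensor amax of 4.0, the small values in the first block would share a grid sized for the large ones; a per-block amax of 0.03 gives them a much finer grid.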

collect(inputs)

Collect calibration data.

Return type:

None

dequantize(inputs)

De-quantize a real quantized tensor to a given dtype.

Parameters:

inputs (BaseQuantizedTensor | QTensorWrapper) –

disable()

Bypass the module.

No calibration, clipping, or quantization is performed while the module is disabled.

disable_calib()

Disable calibration.

disable_pre_quant_scale()

Context manager to turn off pre_quant_scale inside this quantizer.
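A context manager of this kind typically switches a flag off on entry and restores the previous state on exit. The sketch below shows that general pattern with a hypothetical `ScaleToggleDemo` class and a hypothetical `pre_quant_scale_enabled` flag; it is not the library's code.

```python
from contextlib import contextmanager


class ScaleToggleDemo:
    """Hypothetical class illustrating a temporary-disable context manager."""

    def __init__(self):
        self.pre_quant_scale_enabled = True

    @contextmanager
    def disable_pre_quant_scale(self):
        previous = self.pre_quant_scale_enabled
        self.pre_quant_scale_enabled = False
        try:
            yield
        finally:
            # Restore the earlier state even if the body raised an exception.
            self.pre_quant_scale_enabled = previous


demo = ScaleToggleDemo()
with demo.disable_pre_quant_scale():
    inside = demo.pre_quant_scale_enabled   # flag is off inside the block
after = demo.pre_quant_scale_enabled        # flag is restored afterwards
```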

disable_quant()

Disable quantization.

enable()

Enable the module.

enable_calib()

Enable calibration.

enable_quant()

Enable quantization.

export_amax()

Export correctly formatted/shaped amax.

Return type:

Tensor | None

extra_repr()

Set the extra information about this module.

property fake_quant

Return True if fake quantization is used.

forward(inputs)

Apply tensor_quant function to inputs.

Parameters:

inputs – A Tensor of type float32/float16/bfloat16.

Returns:

outputs – A Tensor of type output_dtype.

get_modelopt_state(properties_only=False)

Get meta state to be saved in checkpoint.

If properties_only is True, only the quantizer properties, such as num_bits and axis, are included. To restore the quantizer fully, including its parameters and buffers, use properties_only=False.

Parameters:

properties_only (bool) –

Return type:

dict[str, Any]

property is_enabled

Return True if the module is not disabled.

property is_mx_format

Return True if the quantizer uses an MX format.

load_calib_amax(*args, **kwargs)

Load amax from calibrator.

Updates the amax buffer with the value computed by the calibrator, creating the buffer if necessary. *args and **kwargs are passed directly to compute_amax, except for "strict" in kwargs. Refer to compute_amax for more details.

load_calib_bias(*args, **kwargs)

Load affine bias for quantization.

property maxbound

Return maxbound for quantization.

property mopt_ckpt_versn

Version of the checkpoint if it is restored from a checkpoint.

property narrow_range

Return True if symmetric integer range for signed quantization is used.
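The effect of num_bits, unsigned, and narrow_range on the integer grid can be sketched as follows. The formulas are the conventional integer-quantization definitions and are an assumption here; the library's exact handling, in particular for floating-point formats such as FP8, may differ. `integer_bounds` is a hypothetical helper.

```python
def integer_bounds(num_bits, unsigned=False, narrow_range=True):
    """Return (min_bound, max_bound) of the integer grid for the given settings."""
    max_bound = 2 ** (num_bits - 1 + int(unsigned)) - 1
    if unsigned:
        min_bound = 0
    elif narrow_range:
        min_bound = -max_bound          # symmetric range, e.g. [-127, 127]
    else:
        min_bound = -max_bound - 1      # full two's-complement range, e.g. [-128, 127]
    return min_bound, max_bound


int8_narrow = integer_bounds(8)                      # (-127, 127)
int8_full = integer_bounds(8, narrow_range=False)    # (-128, 127)
uint8 = integer_bounds(8, unsigned=True)             # (0, 255)
int4_narrow = integer_bounds(4)                      # (-7, 7)
```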

property num_bits

Return num_bits for quantization.

property pre_quant_scale

Return pre_quant_scale used for smoothquant.
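SmoothQuant-style scaling moves quantization difficulty from activations to weights without changing the layer output. The sketch below assumes the input is multiplied by a per-channel pre_quant_scale and the inverse scale is folded into the weights; the values and the `dot` helper are hypothetical.

```python
def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


activations = [8.0, 0.5, -0.25]
weights = [0.1, 2.0, -4.0]
pre_quant_scale = [0.25, 2.0, 4.0]   # hypothetical per-channel scales

# Scale the activations, and apply the inverse scale to the weights.
scaled_activations = [x * s for x, s in zip(activations, pre_quant_scale)]
scaled_weights = [w / s for w, s in zip(weights, pre_quant_scale)]

original = dot(activations, weights)
smoothed = dot(scaled_activations, scaled_weights)
```

The two dot products agree, while the scaled activations span a narrower range (largest magnitude 2.0 instead of 8.0), which makes them easier to quantize.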

reset_amax()

Reset amax to None.

reset_bias()

Reset bias to None.

set_from_attribute_config(attribute_cfg)

Set quantizer attributes from attribute_cfg.

The attributes are defined in QuantizerAttributeConfig.

Parameters:

attribute_cfg (QuantizerAttributeConfig | dict) –

set_from_modelopt_state(modelopt_state, properties_only=False)

Set meta state from checkpoint.

If properties_only is True, only the quantizer properties, such as num_bits and axis, are set. To restore the quantizer fully, including its parameters and buffers, use properties_only=False.

Parameters:

properties_only (bool) –

property step_size

Return step size for integer quantization.
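For a uniform integer grid, the step size is conventionally amax divided by the largest integer level. The sketch below uses that definition as an assumption; `step_size` here is a standalone hypothetical function, not the property itself.

```python
def step_size(amax, num_bits, unsigned=False):
    """Distance between adjacent quantization levels on a uniform integer grid."""
    return amax / (2 ** (num_bits - 1 + int(unsigned)) - 1)


int8_step = step_size(amax=2.54, num_bits=8)   # 2.54 / 127
int4_step = step_size(amax=1.0, num_bits=4)    # 1.0 / 7
```

A larger amax or a smaller num_bits gives a coarser grid, so rounding error grows accordingly.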

property svdquant_lora_a

LoRA A weights for SVDQuant.

property svdquant_lora_b

LoRA B weights for SVDQuant.

sync_amax_across_distributed_group(parallel_group)

Synchronize the amax across all ranks in the given group.

Parameters:

parallel_group (DistributedProcessGroup) –
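Synchronizing amax is sketched below as a max-reduction, so that every rank ends up quantizing with the same, largest range. This is an assumption about the reduction used; the code simulates ranks with a plain list rather than calling any distributed API, and `sync_amax` is a hypothetical name.

```python
def sync_amax(per_rank_amax):
    """Simulate a max all-reduce: every rank receives the global maximum."""
    global_amax = max(per_rank_amax)
    return [global_amax] * len(per_rank_amax)


local_amax = [0.5, 2.0, 1.25]        # amax calibrated independently on three ranks
synced_amax = sync_amax(local_amax)  # identical value on every rank
```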

property trt_high_precision_dtype

Return True if FP16 AMAX is used when exporting the model.

property unsigned

Return True if unsigned quantization is used.