quant_utils

Utilities for quantization, including scaling factor adjustments.

Functions

adjust_attn_amax_values

Adjusts the amax values for the attention layers.

all_items_same

Checks if all elements in the provided list are the same.

from_quantized_weight

Converts the quantized weight to the target torch_dtype format.

get_activation_scaling_factor

Returns the activation scaling factor.

get_kv_cache_dtype

Returns the kv_cache dtype.

get_kv_cache_scaling_factor

Returns the kv_cache scaling factor if the output quantizer is set.

get_prequant_scaling_factor

Returns the prequant scaling factor.

get_quantization_format

Gets the quantization string.

get_scaling_factor

Returns scaling factor from the quantizer as torch.Tensor.

get_scaling_factor_from_weight

Calculates the weight scaling factor for a given group size.

get_weight_block_size

Returns the weight block size.

get_weight_scaling_factor

Returns the weight scaling factor.

get_weight_scaling_factor_2

Returns the secondary weight scaling factor.

postprocess_state_dict

Filters out keys related to weight quantizers and updates KV cache related keys.

preprocess_linear_fusion

Preprocesses the quantized linears that are planned for fusion.

process_layer_quant_config

Processes per layer quantization information for TRTLLM export to quant_cfg.json.

resmooth_and_get_scale

Resmooths weights from one or more ranks and gets the scaling factors and amax.

to_quantized_weight

Converts the weight to the quantized (packed) format.

adjust_attn_amax_values(module)

Adjusts the amax values for the attention layers.

all_items_same(item_list)

Checks if all elements in the provided list are the same.

from_quantized_weight(weight, weights_scaling_factor, quantization, torch_dtype)

Converts the quantized weight to the target torch_dtype format.

Parameters:
  • weight (Tensor) –

  • weights_scaling_factor (Tensor) –

  • quantization (str) –

get_activation_scaling_factor(module)

Returns the activation scaling factor.

Parameters:

module (Module) –

Return type:

Tensor

get_kv_cache_dtype(modules)

Returns the kv_cache dtype.

If num_bits of the output_quantizer is (4, 3), returns FP8; if it is 8, returns int8; otherwise, returns None.

Parameters:

modules (Union[list[nn.Module], nn.Module]) – The module or list of modules to inspect.

Returns:

The kv_cache dtype.

Return type:

str

get_kv_cache_scaling_factor(qkv_modules)

Returns the kv_cache scaling factor if the output quantizer is set; otherwise, returns None.

Parameters:

qkv_modules (list[Module]) –

Return type:

Tensor
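
Together with get_kv_cache_dtype, this can be used to inspect KV-cache quantization for an attention block. A minimal usage sketch, assuming this module is importable as quant_utils and that the block exposes q_proj/k_proj/v_proj linears (hypothetical attribute names):

    from torch import nn

    # Assumed import path; these docs only name the module "quant_utils".
    from quant_utils import get_kv_cache_dtype, get_kv_cache_scaling_factor

    def inspect_kv_cache(attn_block: nn.Module):
        """Report the KV-cache dtype and scale for one attention block."""
        # q_proj/k_proj/v_proj are illustrative attribute names for the
        # quantized attention projection linears.
        qkv = [attn_block.q_proj, attn_block.k_proj, attn_block.v_proj]
        kv_dtype = get_kv_cache_dtype(qkv)           # "FP8", "int8", or None
        kv_scale = get_kv_cache_scaling_factor(qkv)  # torch.Tensor, or None
        return kv_dtype, kv_scale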

get_prequant_scaling_factor(module)

Returns the prequant scaling factor.

Parameters:

module (Module) –

Return type:

Tensor

get_quantization_format(module)

Gets the quantization string.

Gets the quantization string by iterating through the module and its children. The first non-None quantization string is returned.

Return type:

str | None
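
The described traversal can be pictured roughly as below. This is an illustrative sketch of the documented behavior, not the actual implementation; _format_of is a hypothetical stand-in for whatever per-module check the library performs:

    from torch import nn

    def _format_of(m: nn.Module) -> str | None:
        """Hypothetical per-module check: return a quantization string
        if this module is quantized, else None."""
        ...

    def quantization_format_sketch(module: nn.Module) -> str | None:
        # Walk the module and its children; the first non-None
        # quantization string wins, mirroring the documented behavior.
        for m in module.modules():
            fmt = _format_of(m)
            if fmt is not None:
                return fmt
        return None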

get_scaling_factor(quantizer)

Returns scaling factor from the quantizer as torch.Tensor.

Parameters:

quantizer (TensorQuantizer) –

Return type:

Tensor

get_scaling_factor_from_weight(weight, group_size)

Calculates the weight scaling factor for a given group size.

Return type:

Tensor
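
For intuition, a per-group scaling factor is commonly derived from each group's max-abs value divided by the quantized range. The sketch below illustrates that idea; the layout and the 4-bit maxbound of 7 are assumptions for illustration, not details taken from this function's documentation:

    import torch

    def group_scaling_factor_sketch(weight: torch.Tensor, group_size: int) -> torch.Tensor:
        """Illustrative per-group scale: group amax over an assumed 4-bit maxbound."""
        out_features, in_features = weight.shape
        groups = weight.reshape(out_features, in_features // group_size, group_size)
        amax = groups.abs().amax(dim=-1)  # (out_features, num_groups)
        return amax / 7.0                 # assumed signed 4-bit maxbound

    w = torch.randn(8, 32)
    print(group_scaling_factor_sketch(w, group_size=16).shape)  # torch.Size([8, 2])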

get_weight_block_size(module)

Returns the weight block size.

Parameters:

module (Module) –

Return type:

int

get_weight_scaling_factor(module)

Returns the weight scaling factor.

Parameters:

module (Module) –

Return type:

Tensor

get_weight_scaling_factor_2(module)

Returns the secondary weight scaling factor.

Parameters:

module (Module) –

Return type:

Tensor
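
Taken together, these getters can collect the per-layer quantities typically needed at export time. A minimal usage sketch, assuming the import path quant_utils and that linear is a quantized linear module:

    from torch import nn

    # Assumed import path; these docs only name the module "quant_utils".
    from quant_utils import (
        get_activation_scaling_factor,
        get_weight_block_size,
        get_weight_scaling_factor,
        get_weight_scaling_factor_2,
    )

    def collect_layer_scales(linear: nn.Module) -> dict:
        """Gather the scaling factors and block size for one quantized linear."""
        return {
            "activation_scaling_factor": get_activation_scaling_factor(linear),
            "weight_scaling_factor": get_weight_scaling_factor(linear),
            "weight_scaling_factor_2": get_weight_scaling_factor_2(linear),
            "weight_block_size": get_weight_block_size(linear),
        }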

postprocess_state_dict(state_dict, maxbound, quantization)

Filters out keys related to weight quantizers and updates KV cache related keys.

Parameters:
  • state_dict (dict) – The full model state_dict.

  • maxbound (float) – The maximum bound value for the output quantizer.

  • quantization (Optional[str]) – The KV cache quantization format.

Returns:

Filtered state_dict without unnecessary keys such as '_amax' and non-KV-cache output quantizers.

Return type:

dict
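
A minimal usage sketch, assuming the import path quant_utils. The maxbound of 448.0 (the FP8-E4M3 maximum) and the "fp8" format string are assumptions for illustration; the accepted values are not listed in these docs:

    from torch import nn

    from quant_utils import postprocess_state_dict  # assumed import path

    def export_ready_state_dict(model: nn.Module) -> dict:
        """Drop weight-quantizer bookkeeping keys (e.g. '_amax') and remap
        the KV-cache related keys before saving a checkpoint."""
        return postprocess_state_dict(
            model.state_dict(), maxbound=448.0, quantization="fp8"
        )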

preprocess_linear_fusion(modules, resmooth_only=False)

Preprocesses the quantized linears that are planned for fusion.

Use resmooth_only=True for MoE experts, as the individual experts are not fused.

Parameters:

modules (list[Module]) –
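
A minimal usage sketch, assuming the import path quant_utils; the q_proj/k_proj/v_proj, experts, and fc1 attribute names are hypothetical placeholders for modules of a quantized model:

    from torch import nn

    from quant_utils import preprocess_linear_fusion  # assumed import path

    def prepare_fusion(attn: nn.Module, moe: nn.Module) -> None:
        # q_proj/k_proj/v_proj will later be fused into one QKV projection.
        preprocess_linear_fusion([attn.q_proj, attn.k_proj, attn.v_proj])
        # MoE experts remain separate modules, so only resmooth them.
        preprocess_linear_fusion(
            [expert.fc1 for expert in moe.experts], resmooth_only=True
        )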

process_layer_quant_config(layer_config_dict)

Processes per layer quantization information for TRTLLM export to quant_cfg.json.

resmooth_and_get_scale(merged_weights, pre_quant_scales, ranks, group_size, new_pre_quant_scale=None, quantization=None)

Resmooths weights from one or more ranks and gets the scaling factors and amax.

Parameters:
  • merged_weights (Tensor) – Merged weights from ranks.

  • pre_quant_scales (list[Tensor]) – List of pre-quantization scales for each rank.

  • ranks (int) – Number of ranks.

  • group_size (int) – Group size of the quantization block.

  • new_pre_quant_scale (optional) – If not provided, weights will be resmoothed using the average of pre_quant_scales.

  • quantization (str | None) –

Returns:

  • weights – Resmoothed weights.

  • weight_scaling_factors – Resmoothed scaling factors.

  • avg_pre_quant_scale – Calculated average of the pre-quantization scales.
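
A minimal usage sketch, assuming the import path quant_utils. The shapes below (two ranks whose shards were already merged, with per-input-channel pre-quant scales) are assumptions for illustration:

    import torch

    from quant_utils import resmooth_and_get_scale  # assumed import path

    merged_weights = torch.randn(256, 128)                  # merged weight from 2 ranks
    pre_quant_scales = [torch.rand(128), torch.rand(128)]   # one scale vector per rank

    weights, weight_scaling_factors, avg_pre_quant_scale = resmooth_and_get_scale(
        merged_weights, pre_quant_scales, ranks=2, group_size=64
    )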

to_quantized_weight(weight, weights_scaling_factor, quantization, weights_scaling_factor2=None, block_size=None)

Converts the weight to the quantized (packed) format.

Parameters:
  • weight (Tensor) –

  • weights_scaling_factor (Tensor) –

  • quantization (str) –

  • weights_scaling_factor2 (Tensor | None) –

  • block_size (int | None) –
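
A round-trip sketch pairing to_quantized_weight with from_quantized_weight, assuming the import path quant_utils. The "fp8" quantization string and the 448.0 FP8-E4M3 maxbound used to form the scale are assumptions for illustration:

    import torch

    from quant_utils import from_quantized_weight, to_quantized_weight  # assumed import path

    weight = torch.randn(64, 64, dtype=torch.float16)
    # Per-tensor scale from the weight's max-abs value over the assumed FP8-E4M3 maximum.
    scale = weight.abs().amax().float() / 448.0

    packed = to_quantized_weight(weight, scale, quantization="fp8")
    restored = from_quantized_weight(packed, scale, quantization="fp8", torch_dtype=torch.float16)
    print(restored.dtype)  # torch.float16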