utils

Quantization utilities.

Functions

convert_quantization_axis_to_reduce_axis

Convert the quantization axis to the reduce axis.

export_torch_mode

Context manager enabling the export mode.

is_quantized

Check if a module is quantized.

is_quantized_column_parallel_linear

Check if a module is a quantized column parallel linear module.

is_quantized_linear

Check if a module is a quantized linear module.

is_quantized_row_parallel_linear

Check if a module is a quantized row parallel linear module.

reduce_amax

Compute the absolute maximum value of a tensor.

reduce_sum

Compute the sum of a tensor along specified axes.

replace_function

Replace a function with a new one within a context.

representative_weight_quantizer

Return the representative weight quantizer for weight_name on module.

update_quant_cfg_with_kv_cache_quant

Update the quant_cfg with the kv cache quant_cfg.

weight_attr_names

Get the weight parameter attribute names in a converted module (non-recursive).

convert_quantization_axis_to_reduce_axis(input, axis)

Convert the quantization axis to the reduce axis.

Parameters:
  • input (torch.Tensor) – The input tensor.

  • axis (int, tuple, list, or None) – The quantization axis. None means per-tensor quantization.

Returns:

The axis to reduce. None suggests all dimensions should be reduced.

Return type:

list or None

export_torch_mode()

Context manager enabling the export mode.

is_quantized(module)

Check if a module is quantized.

is_quantized_column_parallel_linear(module)

Check if a module is a quantized column parallel linear module.

is_quantized_linear(module)

Check if a module is a quantized linear module.

is_quantized_row_parallel_linear(module)

Check if a module is a quantized row parallel linear module.

reduce_amax(input, axis=None, keepdims=True, squeeze_scalar=True)

Compute the absolute maximum value of a tensor.

Reduces input along the dimensions given in axis. Unless keepdims is true, the rank of the tensor is reduced by 1 for each entry in axis. If keepdims is true, the reduced dimensions are retained with length 1.

Note

Gradient computation is disabled as this function is never meant for learning.

Parameters:
  • input – Input tensor

  • axis – The dimensions to reduce. None or int or tuple of ints. If None (the default), reduces all dimensions. Must be in the range [-rank(input_tensor), rank(input_tensor)).

  • keepdims – A boolean. If true, retains reduced dimensions with length 1. Defaults to True.

Returns:

The reduced tensor.
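A minimal sketch of these semantics (an assumption pieced together from the docstring, not the library source; the squeeze_scalar handling is omitted for brevity):

```python
import torch

def reduce_amax(input, axis=None, keepdims=True):
    # Gradients are disabled: amax statistics are calibration data,
    # not something to learn through.
    with torch.no_grad():
        if axis is None:
            return input.abs().max()  # global scalar amax
        return input.abs().amax(dim=axis, keepdim=keepdims)

x = torch.tensor([[1.0, -3.0], [2.0, -0.5]])
per_row = reduce_amax(x, axis=-1)  # tensor([[3.0], [2.0]])
```

Passing a tuple for axis reduces several dimensions at once, which is what per-channel quantization scales typically need.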

reduce_sum(input, axis=None, keepdims=True)

Compute the sum of a tensor along specified axes.

Reduces input along the dimensions given in axis. Unless keepdims is true, the rank of the tensor is reduced by 1 for each entry in axis. If keepdims is true, the reduced dimensions are retained with length 1.

Note

Gradient computation is disabled as this function is never meant for learning.

Parameters:
  • input – Input tensor

  • axis – The dimensions to reduce. None or int or tuple of ints. If None (the default), reduces all dimensions. Must be in the range [-rank(input_tensor), rank(input_tensor)).

  • keepdims – A boolean. If true, retains reduced dimensions with length 1. Defaults to True.

Returns:

The reduced tensor.

replace_function(package, name, new_func, og_func_cache_name=None)

Replace a function with a new one within a context.
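The idea can be sketched with contextlib. This is a hypothetical implementation inferred from the signature; the real utility may differ, for instance in how og_func_cache_name caches the original function.

```python
import contextlib
import math

@contextlib.contextmanager
def replace_function(package, name, new_func):
    # Swap in new_func for the duration of the with-block, then restore
    # the original even if the block raises.
    og_func = getattr(package, name)
    setattr(package, name, new_func)
    try:
        yield
    finally:
        setattr(package, name, og_func)

with replace_function(math, "sqrt", lambda x: -1.0):
    patched = math.sqrt(4.0)   # -1.0 while patched
restored = math.sqrt(4.0)      # 2.0 after restore
```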

representative_weight_quantizer(module, weight_name='weight')

Return the representative weight quantizer for weight_name on module.

Handles two layouts:

  • singular <name>_weight_quantizer — standard nn.Linear / _QuantLinear.

  • plural <name>_weight_quantizers (nn.ModuleList) — fused-experts modules (_QuantFusedExperts) hold one TensorQuantizer per expert. Per-expert formats are identical, so the first element is representative.

Returns None if no matching quantizer is found.

Parameters:
  • module (Module)

  • weight_name (str)
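A plain-Python sketch of the lookup over these two layouts. The attribute-naming scheme here is an assumption pieced together from the bullets above, and the SimpleNamespace fakes stand in for real quantized modules.

```python
from types import SimpleNamespace

def representative_weight_quantizer(module, weight_name="weight"):
    # Assumed naming: "weight" -> weight_quantizer(s); a custom name like
    # "gate_up_proj" -> gate_up_proj_weight_quantizer(s).
    prefix = "weight" if weight_name == "weight" else f"{weight_name}_weight"
    quantizer = getattr(module, f"{prefix}_quantizer", None)
    if quantizer is not None:
        return quantizer       # singular layout
    quantizers = getattr(module, f"{prefix}_quantizers", None)
    if quantizers:
        return quantizers[0]   # plural layout: first expert is representative
    return None                # no matching quantizer found

linear = SimpleNamespace(weight_quantizer="q")
experts = SimpleNamespace(gate_up_proj_weight_quantizers=["q0", "q1"])
```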

update_quant_cfg_with_kv_cache_quant(quant_cfg, kv_cache_quant_cfg)

Update the quant_cfg with the kv cache quant_cfg.

Parameters:
  • quant_cfg (dict[str, Any]) – The outer quantization config dict (with "quant_cfg" and "algorithm" keys).

  • kv_cache_quant_cfg (list[QuantizerCfgEntry]) – A list of QuantizerCfgEntry dicts for KV cache quantization, typically some_kv_cfg["quant_cfg"].

Returns:

A deep copy of quant_cfg with the KV cache entries appended to quant_cfg["quant_cfg"].

Return type:

dict[str, Any]
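Sketching the documented contract (the config entries below are made-up placeholders, not real QuantizerCfgEntry contents):

```python
import copy

def update_quant_cfg_with_kv_cache_quant(quant_cfg, kv_cache_quant_cfg):
    # Deep-copy first so the caller's config is never mutated in place,
    # then append the KV cache entries to the inner "quant_cfg" list.
    merged = copy.deepcopy(quant_cfg)
    merged["quant_cfg"] = list(merged["quant_cfg"]) + list(kv_cache_quant_cfg)
    return merged

base = {"quant_cfg": [{"pattern": "*weight_quantizer"}], "algorithm": "max"}
kv = [{"pattern": "*output_quantizer"}]
merged = update_quant_cfg_with_kv_cache_quant(base, kv)
```

Because of the deep copy, the caller's base config keeps its original entries untouched.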

weight_attr_names(module)

Get the weight parameter attribute names in a converted module (non-recursive).

Covers three layouts:

  • standard nn.Linear: weight + weight_quantizer.

  • custom per-weight quantizer (e.g. Llama4TextExperts with gate_up_proj + gate_up_proj_weight_quantizer).

  • fused-experts nn.ModuleList quantizers (_QuantFusedExperts with gate_up_proj + gate_up_proj_weight_quantizers plural list).

Parameters:

module (Module)

Return type:

Generator[str, None, None]