modelopt.torch.quantization.compress

compress(quant_model, compress_config={'default': True})

Compress the weights of a quantized model.

This function compresses the weights of layers whose weight_quantizer is enabled and uses a supported quantization format. Which layers are compressed is controlled by a pattern-based configuration.

Parameters:
  • quant_model – The quantized model to compress.

  • compress_config (dict[str, bool]) –

    Dictionary mapping layer patterns to boolean compression flags. Defaults to {"default": True}, which compresses all supported layers.

    Example configuration:

    {
        "*.mlp.fc1*": False,  # Skip compression for fc1 layers
        "default": True,  # Compress all other layers
    }
    

    Note: Every pattern other than "default" is applied in order, so when several patterns match the same layer, later entries override earlier ones; "default" applies to layers matched by no pattern. See the example below.
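
    For example, with the hypothetical layer patterns below (actual patterns depend on the model's module names), MLP layers are excluded from compression except for fc2, because the later, more specific pattern wins:

    {
        "default": True,     # fallback for layers no other pattern matches
        "*.mlp*": False,     # skip compression for all MLP layers ...
        "*.mlp.fc2*": True,  # ... but re-enable it for fc2, since the later match wins
    }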

Note: This function modifies the input model in-place.
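
A minimal usage sketch follows. The model, the calibration loop, and the FP8_DEFAULT_CFG quantization preset are illustrative placeholders; only the compress call reflects the API documented above.

    import modelopt.torch.quantization as mtq

    # model: a user-provided torch.nn.Module; calibrate: a user-provided calibration forward loop.
    # Quantize first so that weight quantizers are attached and calibrated.
    quant_model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=calibrate)

    # Compress the quantized weights in-place, skipping fc1 layers.
    mtq.compress(
        quant_model,
        compress_config={
            "*.mlp.fc1*": False,  # leave fc1 weights uncompressed
            "default": True,      # compress all other supported layers
        },
    )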