conversion

Conversion and restoration utilities for sparse attention.

Functions

convert_to_sparse_attention_model

Convert model to use sparse attention.

disable_sparse_attention

Disable sparse attention for matching modules.

enable_sparse_attention

Enable sparse attention for matching modules.

export_sparse_attention_config

Extract sparse attention config for export to config.json.

is_attn_sparsified

Check if a model has sparse attention applied.

print_sparse_attention_summary

Print summary of sparse attention modules in the model.

replace_sparse_attention_modules

Replace regular attention modules with sparse attention modules.

restore_sparse_attention_model

Restore sparse attention model from saved state.

restore_sparse_attention_state

Restore sparse attention state from state dict.

set_sparse_attention_attribute

Set sparse attention attributes for modules matching pattern.

set_sparse_attention_by_cfg

Apply sparse attention configuration to model.

update_sparse_attention_metadata

Update metadata with sparse attention state.

convert_to_sparse_attention_model(model, config)

Convert model to use sparse attention.

Parameters:
  • model (ModelLikeModule) – Model to convert

  • config (SparseAttentionConfig) – Sparse attention configuration

Returns:

Tuple of (converted_model, metadata)

Return type:

tuple[Module, dict[str, Any]]

disable_sparse_attention(model, wildcard_or_filter_func)

Disable sparse attention for matching modules.

Similar to mtq.disable_quantizer for API consistency.

Parameters:
  • model (Module) – Model with sparse attention applied

  • wildcard_or_filter_func (str | Callable) – Wildcard string or filter function to match module names. For example: "lm_head", "layer_0", etc.

Example

>>> import modelopt.torch.sparsity.attention_sparsity as sparse_attn
>>> model = sparse_attn.sparsify(model, config)
>>> # Disable sparse attention for lm_head
>>> sparse_attn.disable_sparse_attention(model, "*lm_head*")

enable_sparse_attention(model, wildcard_or_filter_func)

Enable sparse attention for matching modules.

Similar to mtq.enable_quantizer for API consistency.

Parameters:
  • model (Module) – Model with sparse attention applied

  • wildcard_or_filter_func (str | Callable) – Wildcard string or filter function to match module names. For example: "attention", "attn", etc.

Example

>>> import modelopt.torch.sparsity.attention_sparsity as sparse_attn
>>> model = sparse_attn.sparsify(model, config)
>>> # Re-enable sparse attention for all attention modules
>>> sparse_attn.enable_sparse_attention(model, "*attention*")

export_sparse_attention_config(model)

Extract sparse attention config for export to config.json.

Extracts the calibration parameters (a, b) for the exponential threshold model from the first sparse attention module that has calibrated thresholds.

The exported config allows computing the threshold at runtime:

scale_factor = a * exp(b * target_sparsity)
threshold = scale_factor / seqlen

Parameters:

model (Module) – Model with sparse attention applied

Returns:

Dictionary with sparse attention config for HuggingFace config.json export. Returns None if no calibrated sparse attention modules are found.

Return type:

dict[str, Any] | None

Example output:

{
    "config_groups": {
        "group_0": {"sparse_algo": "softmax_skip", "targets": ["LlamaAttention"]}
    },
    "threshold_scale_factor": {
        "formula": "a * exp(b * target_sparsity)",
        "prefill": {"a": 7.93, "b": 8.61},
        "decode": {"a": 0.12, "b": 9.85},
    },
    "producer": {"name": "modelopt", "version": "0.37.0"},
}
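As a sanity check, the runtime threshold implied by the exported coefficients can be computed directly from the formula above. This is a minimal sketch using only the documented formula; the coefficient values are the example prefill values shown above, and the target sparsity and sequence length are illustrative:

```python
import math


def runtime_threshold(a: float, b: float, target_sparsity: float, seqlen: int) -> float:
    """Compute the attention skip threshold from calibrated (a, b) coefficients.

    Implements the documented formula:
        scale_factor = a * exp(b * target_sparsity)
        threshold = scale_factor / seqlen
    """
    scale_factor = a * math.exp(b * target_sparsity)
    return scale_factor / seqlen


# Example prefill coefficients from the config above; sparsity/seqlen are illustrative.
thr = runtime_threshold(a=7.93, b=8.61, target_sparsity=0.5, seqlen=4096)
```

Note that the threshold grows exponentially with the target sparsity and shrinks linearly with sequence length, so longer sequences skip attention entries at a lower absolute score.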

is_attn_sparsified(model)

Check if a model has sparse attention applied.

Similar to quantization’s is_quantized for API consistency.

Parameters:

model (Module) – Model to check

Returns:

True if model contains any SparseAttentionModule instances

Return type:

bool

print_sparse_attention_summary(model)

Print summary of sparse attention modules in the model.

Parameters:

model (Module) – Model with sparse attention applied

replace_sparse_attention_modules(model, version=None)

Replace regular attention modules with sparse attention modules.

Recursively replace all attention modules in the model with their sparse attention counterparts.

Parameters:
  • model (Module) – Model to process

  • version – State version for tracking (optional)

restore_sparse_attention_model(model, config, metadata)

Restore sparse attention model from saved state.

Parameters:
  • model (ModelLikeModule) – Model to restore

  • config (SparseAttentionConfig) – Sparse attention configuration

  • metadata (dict[str, Any]) – Saved metadata

Returns:

Restored model

Return type:

Module

restore_sparse_attention_state(model, state_dict)

Restore sparse attention state from state dict.

Parameters:
  • model (Module) – Model with sparse attention modules

  • state_dict (dict[str, Any]) – Saved state dictionary

set_sparse_attention_attribute(model, wildcard_or_filter, attribute_cfg)

Set sparse attention attributes for modules matching pattern.

Similar to quantization’s set_quantizer_attribute.

Parameters:
  • model (Module) – Model to configure

  • wildcard_or_filter (str | Callable) – Pattern to match module names

  • attribute_cfg (dict[str, Any]) – Attributes to apply (must include 'method')
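The wildcard matching presumably follows the same shell-style pattern semantics as the quantization API (e.g. "*lm_head*" in the examples above). A self-contained sketch of how such a matcher could resolve a wildcard string or filter callable against module names; the `matches` helper is illustrative, not part of the ModelOpt API:

```python
import fnmatch
from typing import Callable, Union


def matches(name: str, wildcard_or_filter: Union[str, Callable[[str], bool]]) -> bool:
    """Illustrative matcher: accept either a shell-style wildcard or a filter callable."""
    if callable(wildcard_or_filter):
        return wildcard_or_filter(name)
    return fnmatch.fnmatch(name, wildcard_or_filter)


# Typical module names in a HuggingFace-style transformer.
names = ["model.layers.0.self_attn", "model.layers.1.self_attn", "lm_head"]
attn_only = [n for n in names if matches(n, "*self_attn*")]
```

A filter callable gives finer control than a wildcard, e.g. `lambda n: n.endswith("self_attn") and ".0." not in n` to skip the first layer.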

set_sparse_attention_by_cfg(model, sparse_cfg)

Apply sparse attention configuration to model.

Similar to quantization’s set_quantizer_by_cfg.

Parameters:
  • model (Module) – Model with sparse attention modules

  • sparse_cfg (dict) – Sparse configuration dictionary mapping patterns to attributes
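A plausible shape for such a configuration dictionary, modeled on the quantization `set_quantizer_by_cfg` pattern: wildcard patterns map to attribute dicts. Per `set_sparse_attention_attribute` above, "method" is a required attribute key; the "enable" key and the pattern strings are illustrative assumptions:

```python
# Illustrative sparse_cfg: patterns -> attribute dicts.
# "method" is required by set_sparse_attention_attribute; "enable" is assumed here.
sparse_cfg = {
    "*self_attn*": {"method": "softmax_skip", "enable": True},
    "*lm_head*": {"enable": False},
}
```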

update_sparse_attention_metadata(model, config, metadata)

Update metadata with sparse attention state.

Parameters:
  • model (Module) – Model with sparse attention

  • config (SparseAttentionConfig) – Configuration used

  • metadata (dict[str, Any]) – Metadata dict to update

Return type:

None