conversion

Conversion and restoration utilities for sparse attention.

Functions

convert_to_sparse_attention_model

Convert model to use sparse attention.

disable_sparse_attention

Disable sparse attention for matching modules.

enable_sparse_attention

Enable sparse attention for matching modules.

export_sparse_attention_config

Extract sparse attention config for export to config.json.

is_attn_sparsified

Check if a model has sparse attention applied.

print_sparse_attention_summary

Print summary of sparse attention modules in the model.

replace_sparse_attention_modules

Replace regular attention modules with sparse attention modules.

restore_sparse_attention_model

Restore sparse attention model from saved state.

restore_sparse_attention_state

Restore sparse attention state from state dict.

set_sparse_attention_attribute

Set sparse attention attributes for modules matching pattern.

set_sparse_attention_by_cfg

Apply sparse attention configuration to model.

update_sparse_attention_metadata

Update metadata with sparse attention state.

convert_to_sparse_attention_model(model, config)

Convert model to use sparse attention.

Parameters:
  • model (ModelLikeModule) – Model to convert

  • config (SparseAttentionConfig) – Sparse attention configuration

Returns:

Tuple of (converted_model, metadata)

Return type:

tuple[Module, dict[str, Any]]

disable_sparse_attention(model, wildcard_or_filter_func)

Disable sparse attention for matching modules.

Similar to mtq.disable_quantizer for API consistency.

Parameters:
  • model (Module) – Model with sparse attention applied

  • wildcard_or_filter_func (str | Callable) – Wildcard string or filter function to match module names. For example: "lm_head", "layer_0", etc.

Example

>>> import modelopt.torch.sparsity.attention_sparsity as sparse_attn
>>> model = sparse_attn.sparsify(model, config)
>>> # Disable sparse attention for lm_head
>>> sparse_attn.disable_sparse_attention(model, "*lm_head*")

enable_sparse_attention(model, wildcard_or_filter_func)

Enable sparse attention for matching modules.

Similar to mtq.enable_quantizer for API consistency.

Parameters:
  • model (Module) – Model with sparse attention applied

  • wildcard_or_filter_func (str | Callable) – Wildcard string or filter function to match module names. For example: "attention", "attn", etc.

Example

>>> import modelopt.torch.sparsity.attention_sparsity as sparse_attn
>>> model = sparse_attn.sparsify(model, config)
>>> # Re-enable sparse attention for all attention modules
>>> sparse_attn.enable_sparse_attention(model, "*attention*")

export_sparse_attention_config(model)

Extract sparse attention config for export to config.json.

Extracts the calibration parameters (a, b) for the exponential threshold model from the first sparse attention module that has calibrated thresholds.

The exported config allows computing the threshold at runtime:

scale_factor = a * exp(b * target_sparsity)
threshold = scale_factor / seqlen

Parameters:

model (Module) – Model with sparse attention applied

Returns:

Dictionary with sparse attention config for HuggingFace config.json export. Returns None if no calibrated sparse attention modules are found.

Return type:

dict[str, Any] | None

Example output:

{
    "config_groups": {
        "group_0": {"sparse_algo": "softmax_skip", "targets": ["LlamaAttention"]}
    },
    "threshold_scale_factor": {
        "formula": "a * exp(b * target_sparsity)",
        "prefill": {"a": 7.93, "b": 8.61},
        "decode": {"a": 0.12, "b": 9.85},
    },
    "producer": {"name": "modelopt", "version": "0.37.0"},
}
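As a sanity check, the runtime threshold implied by the exported coefficients can be computed directly from the formula above. This is a minimal sketch using only the documented formula; the coefficient values are the example prefill values shown above, and the target sparsity and sequence length are illustrative:

```python
import math


def runtime_threshold(a: float, b: float, target_sparsity: float, seqlen: int) -> float:
    """Compute the attention skip threshold from calibrated (a, b) coefficients.

    Implements the documented formula:
        scale_factor = a * exp(b * target_sparsity)
        threshold = scale_factor / seqlen
    """
    scale_factor = a * math.exp(b * target_sparsity)
    return scale_factor / seqlen


# Example prefill coefficients from the config above; sparsity/seqlen are illustrative.
thr = runtime_threshold(a=7.93, b=8.61, target_sparsity=0.5, seqlen=4096)
```

Note that the threshold grows exponentially with the target sparsity and shrinks linearly with sequence length, so longer sequences skip attention entries at a lower absolute score.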

is_attn_sparsified(model)

Check if a model has sparse attention applied.

Similar to quantization’s is_quantized for API consistency.

Parameters:

model (Module) – Model to check

Returns:

True if model contains any SparseAttentionModule instances

Return type:

bool

print_sparse_attention_summary(model)

Print summary of sparse attention modules in the model.

Parameters:

model (Module) – Model with sparse attention applied

replace_sparse_attention_modules(model, version=None)

Replace regular attention modules with sparse attention modules.

Recursively replace all attention modules in the model with their sparse attention counterparts.

Parameters:
  • model (Module) – Model to process

  • version – State version for tracking (optional)

restore_sparse_attention_model(model, config, metadata)

Restore sparse attention model from saved state.

Parameters:
  • model (ModelLikeModule) – Model to restore

  • config (SparseAttentionConfig) – Sparse attention configuration

  • metadata (dict[str, Any]) – Saved metadata

Returns:

Restored model

Return type:

Module

restore_sparse_attention_state(model, state_dict)

Restore sparse attention state from state dict.

Parameters:
  • model (Module) – Model with sparse attention modules

  • state_dict (dict[str, Any]) – Saved state dictionary

set_sparse_attention_attribute(model, wildcard_or_filter, attribute_cfg)

Set sparse attention attributes for modules matching pattern.

Similar to quantization’s set_quantizer_attribute.

Parameters:
  • model (Module) – Model to configure

  • wildcard_or_filter (str | Callable) – Pattern to match module names

  • attribute_cfg (dict[str, Any]) – Attributes to apply (must include 'method')
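The wildcard matching presumably follows the same shell-style pattern semantics as the quantization API (e.g. "*lm_head*" in the examples above). A self-contained sketch of how such a matcher could resolve a wildcard string or filter callable against module names; the `matches` helper is illustrative, not part of the ModelOpt API:

```python
import fnmatch
from typing import Callable, Union


def matches(name: str, wildcard_or_filter: Union[str, Callable[[str], bool]]) -> bool:
    """Illustrative matcher: accept either a shell-style wildcard or a filter callable."""
    if callable(wildcard_or_filter):
        return wildcard_or_filter(name)
    return fnmatch.fnmatch(name, wildcard_or_filter)


# Typical module names in a HuggingFace-style transformer.
names = ["model.layers.0.self_attn", "model.layers.1.self_attn", "lm_head"]
attn_only = [n for n in names if matches(n, "*self_attn*")]
```

A filter callable gives finer control than a wildcard, e.g. `lambda n: n.endswith("self_attn") and ".0." not in n` to skip the first layer.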

set_sparse_attention_by_cfg(model, sparse_cfg)

Apply sparse attention configuration to model.

Similar to quantization’s set_quantizer_by_cfg.

Parameters:
  • model (Module) – Model with sparse attention modules

  • sparse_cfg (dict) – Sparse configuration dictionary mapping patterns to attributes
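A plausible shape for such a configuration dictionary, modeled on the quantization `set_quantizer_by_cfg` pattern: wildcard patterns map to attribute dicts. Per `set_sparse_attention_attribute` above, "method" is a required attribute key; the "enable" key and the pattern strings are illustrative assumptions:

```python
# Illustrative sparse_cfg: patterns -> attribute dicts.
# "method" is required by set_sparse_attention_attribute; "enable" is assumed here.
sparse_cfg = {
    "*self_attn*": {"method": "softmax_skip", "enable": True},
    "*lm_head*": {"enable": False},
}
```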

update_sparse_attention_metadata(model, config, metadata)

Update metadata with sparse attention state.

Parameters:
  • model (Module) – Model with sparse attention

  • config (SparseAttentionConfig) – Configuration used

  • metadata (dict[str, Any]) – Metadata dict to update

Return type:

None