conversion

Conversion and restoration utilities for sparse attention.

Functions

  • convert_to_sparse_attention_model – Convert model to use sparse attention.
  • disable_sparse_attention – Disable sparse attention for matching modules.
  • enable_sparse_attention – Enable sparse attention for matching modules.
  • export_sparse_attention_config – Extract sparse attention config for export to config.json.
  • is_attn_sparsified – Check if a model has sparse attention applied.
  • print_sparse_attention_summary – Print summary of sparse attention modules in the model.
  • replace_sparse_attention_modules – Replace regular attention modules with sparse attention modules.
  • restore_sparse_attention_model – Restore sparse attention model from saved state.
  • restore_sparse_attention_state – Restore sparse attention state from state dict.
  • set_sparse_attention_attribute – Set sparse attention attributes for modules matching pattern.
  • set_sparse_attention_by_cfg – Apply sparse attention configuration to model.
  • update_sparse_attention_metadata – Update metadata with sparse attention state.

convert_to_sparse_attention_model(model, config)

Convert model to use sparse attention.

Parameters:
  • model (ModelLikeModule) – Model to convert

  • config (SparseAttentionConfig) – Sparse attention configuration

Returns:

Tuple of (converted_model, metadata)

Return type:

tuple[Module, dict[str, Any]]

disable_sparse_attention(model, wildcard_or_filter_func)

Disable sparse attention for matching modules.

Similar to mtq.disable_quantizer for API consistency.

Parameters:
  • model (Module) – Model with sparse attention applied

  • wildcard_or_filter_func (str | Callable) – Wildcard string or filter function to match module names. For example: “lm_head”, “layer_0”, etc.

Example

>>> import modelopt.torch.sparsity.attention_sparsity as sparse_attn
>>> model = sparse_attn.sparsify(model, config)
>>> # Disable sparse attention for lm_head
>>> sparse_attn.disable_sparse_attention(model, "*lm_head*")

enable_sparse_attention(model, wildcard_or_filter_func)

Enable sparse attention for matching modules.

Similar to mtq.enable_quantizer for API consistency.

Parameters:
  • model (Module) – Model with sparse attention applied

  • wildcard_or_filter_func (str | Callable) – Wildcard string or filter function to match module names. For example: “attention”, “attn”, etc.
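The exact matching rules for wildcard_or_filter_func are not spelled out in this reference. The following is a minimal sketch of one plausible interpretation, assuming shell-style wildcards (via Python's fnmatch module) and a filter function that receives the module name and returns a bool. The helper `matches` and the module names are hypothetical and are not part of the library.

```python
from fnmatch import fnmatchcase
from typing import Callable, Union


def matches(name: str, wildcard_or_filter_func: Union[str, Callable[[str], bool]]) -> bool:
    """Return True if a module name is selected by a wildcard or a filter function."""
    if callable(wildcard_or_filter_func):
        # Assumption: the filter function is called with the module name.
        return bool(wildcard_or_filter_func(name))
    # Assumption: strings are treated as case-sensitive shell-style wildcards.
    return fnmatchcase(name, wildcard_or_filter_func)


# Hypothetical module names, in the style produced by model.named_modules().
names = [
    "model.layers.0.self_attn",
    "model.layers.1.self_attn",
    "lm_head",
]

selected_by_wildcard = [n for n in names if matches(n, "*self_attn*")]
selected_by_filter = [n for n in names if matches(n, lambda name: "layers.0." in name)]
```

Under these assumptions, a pattern without asterisks would only select a module whose full name equals the pattern, which is why the examples in this section wrap the name fragment in asterisks.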

Example

>>> import modelopt.torch.sparsity.attention_sparsity as sparse_attn
>>> model = sparse_attn.sparsify(model, config)
>>> # Re-enable sparse attention for all attention modules
>>> sparse_attn.enable_sparse_attention(model, "*attention*")

export_sparse_attention_config(model)

Extract sparse attention config for export to config.json.

Extracts calibrated skip-softmax parameters and N:M sparse-softmax metadata from sparse attention modules.

The exported config allows computing threshold at runtime:

scale_factor = a * exp(b * target_sparsity)
threshold = scale_factor / seqlen
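The runtime threshold computation above can be sketched in plain Python. This is an illustration only: the coefficients are copied from the `prefill` and `decode` entries in this function's example output, the helper `runtime_threshold` is hypothetical, and the chosen target sparsity and sequence length are arbitrary.

```python
import math

# Coefficients copied from this function's example output.
exported = {
    "threshold_scale_factor": {
        "formula": "a * exp(b * target_sparsity)",
        "prefill": {"a": 7.93, "b": 8.61},
        "decode": {"a": 0.12, "b": 9.85},
    },
}


def runtime_threshold(cfg: dict, phase: str, target_sparsity: float, seqlen: int) -> float:
    """Compute threshold = a * exp(b * target_sparsity) / seqlen for one phase."""
    params = cfg["threshold_scale_factor"][phase]
    scale_factor = params["a"] * math.exp(params["b"] * target_sparsity)
    return scale_factor / seqlen


# Arbitrary illustrative inputs: 50% target sparsity, 4096-token sequence.
prefill_threshold = runtime_threshold(exported, "prefill", target_sparsity=0.5, seqlen=4096)
decode_threshold = runtime_threshold(exported, "decode", target_sparsity=0.5, seqlen=4096)
```

Because the threshold is divided by seqlen, longer sequences yield a smaller threshold for the same target sparsity.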

Parameters:

model (Module) – Model with sparse attention applied

Returns:

Dictionary with sparse attention config for HuggingFace config.json export. Returns None if no exportable sparse attention metadata is found.

Return type:

dict[str, Any] | None

Example output:

{
    "config_groups": {
        "group_0": {"sparse_algo": "softmax_skip", "targets": ["LlamaAttention"]}
    },
    "threshold_scale_factor": {
        "formula": "a * exp(b * target_sparsity)",
        "prefill": {"a": 7.93, "b": 8.61},
        "decode": {"a": 0.12, "b": 9.85},
    },
    "sparse_softmax": {
        "sparsity_n": 2,
        "sparsity_m": 4,
        "dense_sink_tokens": 0,
        "dense_recent_tokens": 64,
    },
    "producer": {"name": "modelopt", "version": "0.37.0"},
}

is_attn_sparsified(model)

Check if a model has sparse attention applied.

Similar to quantization’s is_quantized for API consistency.

Parameters:

model (Module) – Model to check

Returns:

True if model contains any SparseAttentionModule instances

Return type:

bool

print_sparse_attention_summary(model)

Print summary of sparse attention modules in the model.

Parameters:

model (Module) – Model with sparse attention applied

replace_sparse_attention_modules(model, version=None)

Replace regular attention modules with sparse attention modules.

Recursively replace all attention modules in the model with their sparse attention counterparts.

Parameters:
  • model (Module) – Model to process

  • version – State version for tracking (optional)

restore_sparse_attention_model(model, config, metadata)

Restore sparse attention model from saved state.

Parameters:
  • model (ModelLikeModule) – Model to restore

  • config (SparseAttentionConfig) – Sparse attention configuration

  • metadata (dict[str, Any]) – Saved metadata

Returns:

Restored model

Return type:

Module

restore_sparse_attention_state(model, state_dict)

Restore sparse attention state from state dict.

Parameters:
  • model (Module) – Model with sparse attention modules

  • state_dict (dict[str, Any]) – Saved state dictionary

set_sparse_attention_attribute(model, wildcard_or_filter, attribute_cfg)

Set sparse attention attributes for modules matching pattern.

Similar to quantization’s set_quantizer_attributes_partial.

Parameters:
  • model (Module) – Model to configure

  • wildcard_or_filter (str | Callable) – Pattern to match module names

  • attribute_cfg (dict[str, Any]) – Attributes to apply (must include ‘method’)

set_sparse_attention_by_cfg(model, sparse_cfg)

Apply sparse attention configuration to model.

Similar to quantization’s set_quantizer_by_cfg.

Parameters:
  • model (Module) – Model with sparse attention modules

  • sparse_cfg (dict) – Sparse configuration dictionary mapping patterns to attributes
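To illustrate a dictionary that maps patterns to attributes, the sketch below resolves which attributes each module would receive. It rests on several assumptions that this reference does not confirm: patterns are shell-style wildcards, entries are applied in dictionary order with later matches overriding earlier ones, and the attribute names `enable` and `threshold` and the method name `example_method` are placeholders. `resolve_attributes` is a hypothetical helper, not a library function.

```python
from fnmatch import fnmatchcase
from typing import Any

# Hypothetical configuration: keys are wildcard patterns, values are attribute dicts.
# "method" is included because attribute dicts are documented as requiring it;
# every other key and all values are placeholders.
sparse_cfg: dict[str, dict[str, Any]] = {
    "*self_attn*": {"method": "example_method", "enable": True, "threshold": 1e-4},
    "*layers.0.self_attn*": {"method": "example_method", "enable": False},
}


def resolve_attributes(
    module_names: list[str], cfg: dict[str, dict[str, Any]]
) -> dict[str, dict[str, Any]]:
    """Merge the attribute dicts of every pattern that matches each module name."""
    resolved: dict[str, dict[str, Any]] = {}
    for name in module_names:
        attrs: dict[str, Any] = {}
        for pattern, attribute_cfg in cfg.items():
            if "method" not in attribute_cfg:
                raise ValueError(f"attribute_cfg for {pattern!r} must include 'method'")
            if fnmatchcase(name, pattern):
                # Assumption: later patterns override earlier ones key by key.
                attrs.update(attribute_cfg)
        if attrs:
            resolved[name] = attrs
    return resolved


# Hypothetical module names.
module_names = ["model.layers.0.self_attn", "model.layers.1.self_attn", "lm_head"]
resolved = resolve_attributes(module_names, sparse_cfg)
```

In this sketch the first layer ends up disabled while keeping the threshold set by the broader pattern, and `lm_head` is left untouched because no pattern matches it.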

update_sparse_attention_metadata(model, config, metadata)

Update metadata with sparse attention state.

Parameters:
  • model (Module) – Model with sparse attention

  • config (SparseAttentionConfig) – Configuration used

  • metadata (dict[str, Any]) – Metadata dict to update

Return type:

None