conversion
Conversion and restoration utilities for sparse attention.
Functions
- convert_to_sparse_attention_model – Convert model to use sparse attention.
- disable_sparse_attention – Disable sparse attention for matching modules.
- enable_sparse_attention – Enable sparse attention for matching modules.
- export_sparse_attention_config – Extract sparse attention config for export to config.json.
- is_attn_sparsified – Check if a model has sparse attention applied.
- print_sparse_attention_summary – Print summary of sparse attention modules in the model.
- replace_sparse_attention_modules – Replace regular attention modules with sparse attention modules.
- restore_sparse_attention_model – Restore sparse attention model from saved state.
- restore_sparse_attention_state – Restore sparse attention state from state dict.
- set_sparse_attention_attribute – Set sparse attention attributes for modules matching pattern.
- set_sparse_attention_by_cfg – Apply sparse attention configuration to model.
- update_sparse_attention_metadata – Update metadata with sparse attention state.
- convert_to_sparse_attention_model(model, config)
Convert model to use sparse attention.
- Parameters:
model (ModelLikeModule) – Model to convert
config (SparseAttentionConfig) – Sparse attention configuration
- Returns:
Tuple of (converted_model, metadata)
- Return type:
tuple[Module, dict[str, Any]]
- disable_sparse_attention(model, wildcard_or_filter_func)
Disable sparse attention for matching modules.
Similar to mtq.disable_quantizer for API consistency.
- Parameters:
model (Module) – Model with sparse attention applied
wildcard_or_filter_func (str | Callable) – Wildcard string or filter function to match module names. For example: "lm_head", "layer_0", etc.
Example
>>> import modelopt.torch.sparsity.attention_sparsity as sparse_attn
>>> model = sparse_attn.sparsify(model, config)
>>> # Disable sparse attention for lm_head
>>> sparse_attn.disable_sparse_attention(model, "*lm_head*")
- enable_sparse_attention(model, wildcard_or_filter_func)
Enable sparse attention for matching modules.
Similar to mtq.enable_quantizer for API consistency.
- Parameters:
model (Module) – Model with sparse attention applied
wildcard_or_filter_func (str | Callable) – Wildcard string or filter function to match module names. For example: "attention", "attn", etc.
Example
>>> import modelopt.torch.sparsity.attention_sparsity as sparse_attn
>>> model = sparse_attn.sparsify(model, config)
>>> # Re-enable sparse attention for all attention modules
>>> sparse_attn.enable_sparse_attention(model, "*attention*")
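For both helpers, a wildcard string is matched against fully qualified module names. A minimal sketch of shell-style glob matching with Python's fnmatch (an assumption about the exact matching rules; the actual matcher may differ):

```python
from fnmatch import fnmatch

# Hypothetical module names as they would appear in model.named_modules()
module_names = [
    "model.layers.0.self_attn",
    "model.layers.1.self_attn",
    "model.lm_head",
]

# "*lm_head*" selects the LM head but leaves attention layers untouched
matched = [name for name in module_names if fnmatch(name, "*lm_head*")]
```

A filter function (`Callable`) would instead receive the module name and return a bool, allowing arbitrary selection logic beyond globbing.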
- export_sparse_attention_config(model)
Extract sparse attention config for export to config.json.
Extracts calibration parameters, method metadata, and per-layer enable/disable state from sparse attention modules. Supports both LLM and diffusion models.
Algorithm-specific parameters (threshold_scale_factor, raw_threshold, disabled_layers) are nested inside the config group that owns them. This allows future sparse attention methods to define their own parameter schemas in separate groups without collision.
The formula in the export reflects the actual fitting mode used during calibration:
- Linear-space fit (default, LLMs): scale_factor = a * exp(b * S); exports a and b.
- Log-space fit (diffusion): log(scale_factor) = log_a + b * S; exports log_a and b.
At runtime: threshold = scale_factor / seqlen.
- Parameters:
model (Module) – Model with sparse attention applied
- Returns:
Dictionary with sparse attention config for HuggingFace config.json export. Returns None if no sparse attention modules are found, or if no calibration parameters and no raw threshold are available.
- Return type:
dict[str, Any] | None
Example output (LLM, linear-space fit):
{
    "config_groups": {
        "group_0": {
            "sparse_algo": "softmax_skip",
            "targets": ["LlamaAttention"],
            "threshold_scale_factor": {
                "formula": "a * exp(b * target_sparsity)",
                "prefill": {"a": 7.93, "b": 8.61},
                "decode": {"a": 0.12, "b": 9.85},
            },
        }
    },
    "producer": {"name": "modelopt", "version": "0.37.0"},
}
Example output (diffusion, log-space fit):
{
    "config_groups": {
        "group_0": {
            "sparse_algo": "softmax_skip",
            "targets": ["Attention"],
            "threshold_scale_factor": {
                "formula": "log_a + b * target_sparsity",
                "prefill": {"log_a": 0.21, "b": 3.45},
            },
            "disabled_layers": ["blocks.0.attn1", "blocks.39.attn1"],
        }
    },
    "producer": {"name": "modelopt", "version": "0.37.0"},
}
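The runtime threshold computation implied by the exported formulas can be sketched in plain Python. This is a minimal sketch, assuming the log-space fit yields log(scale_factor) = log_a + b * S; the actual consumer of the exported config may differ:

```python
import math

def runtime_threshold(a, b, target_sparsity, seqlen, log_space=False):
    """Recover the per-call threshold from fitted calibration parameters.

    Linear-space fit (LLMs):   scale_factor = a * exp(b * S)
    Log-space fit (diffusion): log(scale_factor) = log_a + b * S
    where S is the target sparsity. At runtime:
    threshold = scale_factor / seqlen.
    """
    if log_space:
        # In log-space mode, "a" holds log_a from the exported config
        scale_factor = math.exp(a + b * target_sparsity)
    else:
        scale_factor = a * math.exp(b * target_sparsity)
    return scale_factor / seqlen

# Using the prefill parameters from the LLM example output above
threshold = runtime_threshold(7.93, 8.61, target_sparsity=0.5, seqlen=4096)
```

Note that because the threshold scales as 1/seqlen, longer sequences use a smaller per-score threshold for the same target sparsity.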
- is_attn_sparsified(model)
Check if a model has sparse attention applied.
Similar to quantization’s is_quantized for API consistency.
- Parameters:
model (Module) – Model to check
- Returns:
True if model contains any SparseAttentionModule instances
- Return type:
bool
- print_sparse_attention_summary(model)
Print summary of sparse attention modules in the model.
- Parameters:
model (Module) – Model with sparse attention applied
- replace_sparse_attention_modules(model, version=None)
Replace regular attention modules with sparse attention modules.
Recursively replace all attention modules in the model with their sparse attention counterparts.
- Parameters:
model (Module) – Model to process
version – State version for tracking (optional)
- restore_sparse_attention_model(model, config, metadata)
Restore sparse attention model from saved state.
- Parameters:
model (ModelLikeModule) – Model to restore
config (SparseAttentionConfig) – Sparse attention configuration
metadata (dict[str, Any]) – Saved metadata
- Returns:
Restored model
- Return type:
Module
- restore_sparse_attention_state(model, state_dict)
Restore sparse attention state from state dict.
- Parameters:
model (Module) – Model with sparse attention modules
state_dict (dict[str, Any]) – Saved state dictionary
- set_sparse_attention_attribute(model, wildcard_or_filter, attribute_cfg)
Set sparse attention attributes for modules matching pattern.
Similar to quantization’s set_quantizer_attributes_partial.
- Parameters:
model (Module) – Model to configure
wildcard_or_filter (str | Callable) – Pattern to match module names
attribute_cfg (dict[str, Any]) – Attributes to apply (must include ‘method’)
- set_sparse_attention_by_cfg(model, sparse_cfg)
Apply sparse attention configuration to model.
Similar to quantization’s set_quantizer_by_cfg.
- Parameters:
model (Module) – Model with sparse attention modules
sparse_cfg (dict) – Sparse configuration dictionary mapping patterns to attributes
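A hypothetical sparse_cfg illustrating the pattern-to-attributes mapping. The pattern strings and the "softmax_skip" method name are taken from elsewhere in this page; the overall dict shape is an assumption based on set_sparse_attention_attribute requiring a "method" key in each attribute dict:

```python
# Hypothetical configuration: wildcard pattern -> attribute dict.
# Each attribute dict must include "method" (see
# set_sparse_attention_attribute above); other keys are method-specific.
sparse_cfg = {
    "*self_attn*": {"method": "softmax_skip"},
}

# Would then be applied as:
#   set_sparse_attention_by_cfg(model, sparse_cfg)
```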
- update_sparse_attention_metadata(model, config, metadata)
Update metadata with sparse attention state.
- Parameters:
model (Module) – Model with sparse attention
config (SparseAttentionConfig) – Configuration used
metadata (dict[str, Any]) – Metadata dict to update
- Return type:
None