config
Configuration classes for sparse attention optimization.
- ModeloptConfig FlashSkipSoftmaxConfig
Bases: SparseAttentionConfig
Configuration for Flash Attention-aware softmax skip sparse attention.
- Default config (JSON):
{ "sparse_cfg": { "*attention*": { "backend": "pytorch", "bc": 128, "br": 128, "enable": true, "method": "flash_skip_softmax", "threshold": { "decode": 0.0001, "prefill": 0.001 } }, "default": { "enable": false } }, "export_format": null }
- field sparse_cfg: dict[str | Callable, dict[str, Any]]
Pattern-based configuration with flash_skip_softmax specific defaults. Includes FA block sizes (br, bc) and correction factor settings.
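For illustration, the defaults above can be overridden with a dict of the same shape. Every key in this sketch mirrors the default config shown above; only the values differ, and how the resulting config is consumed downstream is not shown here:

```python
# A minimal sketch of a sparse_cfg override for flash_skip_softmax.
# Pattern keys and field names are taken from the default config above.
custom_sparse_cfg = {
    "*attention*": {
        "enable": True,
        "method": "flash_skip_softmax",
        "backend": "pytorch",
        "br": 128,  # Flash Attention block row size
        "bc": 128,  # Flash Attention block column size
        # Looser threshold during prefill, tighter during decode.
        "threshold": {"prefill": 1e-3, "decode": 1e-4},
    },
    # Modules that match no pattern fall back to this entry.
    "default": {"enable": False},
}
```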
- ModeloptConfig SparseAttentionAttributeConfig
Bases: ModeloptBaseConfig
Sparse attention attribute configuration for pattern-based module config.
- Default config (JSON):
{ "method": "flash_skip_softmax", "enable": true, "threshold": 0.001, "br": 128, "bc": 128, "backend": "pytorch", "is_causal": true, "calibration": null }
- field backend: str
Backend to use for sparse attention computation. Only 'pytorch' is supported; it works by patching F.softmax and requires the model to be loaded with attn_implementation='eager'.
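Because the 'pytorch' backend relies on patching F.softmax, the model's attention must actually route through F.softmax. A minimal sketch using Hugging Face Transformers (the checkpoint name is illustrative):

```python
from transformers import AutoModelForCausalLM

# SDPA and fused flash-attention kernels never call F.softmax, so the
# patch would be a no-op there; the eager implementation is required.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # hypothetical checkpoint
    attn_implementation="eager",
)
```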
- field bc: int
Block column size for block-wise sparsity in Flash Attention.
- field br: int
Block row size for block-wise sparsity in Flash Attention.
- field calibration: dict | None
Calibration settings for this pattern. If provided, enables automatic threshold calibration. Only one pattern should have calibration enabled.
- field enable: bool
If True, enables sparse attention. If False, bypasses sparsity.
- field is_causal: bool
Whether the model uses causal (autoregressive) attention. If True, sparsity statistics are calculated over the lower triangle only. Defaults to True for decoder-only models like GPT, LLaMA, etc.
- field method: str
The sparse attention method to use (e.g., 'flash_skip_softmax').
- field threshold: float | dict[str, float]
Threshold for determining which attention values to skip. Can be a single float applied to all phases, or a dict with phase-specific values (e.g., 'prefill' and 'decode').
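Both accepted forms, sketched with the phase keys used by the flash_skip_softmax defaults above:

```python
# A single float applies the same threshold in every phase.
uniform = {"*attention*": {"enable": True, "threshold": 1e-3}}

# A dict sets phase-specific thresholds; "prefill" and "decode" are the
# phase names appearing in the default config above.
phased = {"*attention*": {"enable": True,
                          "threshold": {"prefill": 1e-3, "decode": 1e-4}}}
```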
- ModeloptConfig SparseAttentionConfig
Bases: ModeloptBaseConfig
Base configuration for sparse attention optimization.
This base configuration provides the common structure for all sparse attention methods and supports pattern-based layer configuration.
- Default config (JSON):
{ "sparse_cfg": { "*attention*": { "enable": true, "method": "flash_skip_softmax" }, "default": { "enable": false } }, "export_format": null }
- field export_format: str | None
Export format for sparse attention (e.g., 'onnx', 'tensorrt').
- field sparse_cfg: dict[str | Callable, dict[str, Any]]
Pattern-based configuration for sparse attention. Keys are patterns to match module names, values are configuration dicts with parameters like 'threshold', 'enable', and 'calibration'.
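As a sketch of the pattern-based layout: string keys are wildcard patterns matched against module names, and, per the field type dict[str | Callable, dict[str, Any]], callables are also accepted as keys. How a callable key is evaluated is an assumption here; this sketch presumes it receives the module name:

```python
import re

def deep_layers_only(name: str) -> bool:
    # Hypothetical predicate: match modules in layer index >= 28.
    m = re.search(r"layers\.(\d+)\.", name)
    return m is not None and int(m.group(1)) >= 28

sparse_cfg = {
    "*self_attn*": {"enable": True, "threshold": 1e-3},
    deep_layers_only: {"enable": True, "threshold": 1e-4},
    # Fallback for modules matching no pattern.
    "default": {"enable": False},
}
```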