config

Configuration classes for sparse attention optimization.

ModeloptConfig FlashSkipSoftmaxConfig

Bases: SparseAttentionConfig

Configuration for Flash Attention-aware softmax skip sparse attention.

Default config (JSON):

{
   "sparse_cfg": {
      "*attention*": {
         "backend": "pytorch",
         "bc": 128,
         "br": 128,
         "enable": true,
         "method": "flash_skip_softmax",
         "threshold": {
            "decode": 0.0001,
            "prefill": 0.001
         }
      },
      "default": {
         "enable": false
      }
   },
   "export_format": null
}
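The default above can be reproduced as a plain Python dict. This is a sketch only (the variable name is illustrative; the real class validates fields on construction), but the key names and values mirror the documented defaults exactly:

```python
# Plain-dict sketch of the FlashSkipSoftmaxConfig defaults shown above.
flash_skip_cfg = {
    "sparse_cfg": {
        "*attention*": {
            "backend": "pytorch",
            "br": 128,  # Flash Attention block row size
            "bc": 128,  # Flash Attention block column size
            "enable": True,
            "method": "flash_skip_softmax",
            # Phase-specific skip thresholds (see the threshold field below)
            "threshold": {"prefill": 0.001, "decode": 0.0001},
        },
        # Modules not matching any pattern bypass sparsity entirely
        "default": {"enable": False},
    },
    "export_format": None,
}
```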

field sparse_cfg: dict[str | Callable, dict[str, Any]]

Pattern-based configuration with flash_skip_softmax-specific defaults. Includes Flash Attention block sizes (br, bc) and correction factor settings.

ModeloptConfig SparseAttentionAttributeConfig

Bases: ModeloptBaseConfig

Sparse attention attribute configuration for pattern-based module config.

Default config (JSON):

{
   "method": "flash_skip_softmax",
   "enable": true,
   "threshold": 0.001,
   "br": 128,
   "bc": 128,
   "backend": "pytorch",
   "is_causal": true,
   "calibration": null
}

field backend: str

Backend to use for sparse attention computation. Only 'pytorch' is supported; it applies sparsity by patching F.softmax, and requires the model to be loaded with attn_implementation='eager'.

field bc: int

Block column size for block-wise sparsity in Flash Attention.

field br: int

Block row size for block-wise sparsity in Flash Attention.

field calibration: dict | None

Calibration settings for this pattern. If provided, enables automatic threshold calibration. Only one pattern should have calibration enabled.

field enable: bool

If True, enables sparse attention. If False, bypasses sparsity.

field is_causal: bool

Whether the model uses causal (autoregressive) attention. If True, sparsity statistics are calculated over the lower triangle only. Defaults to True for decoder-only models like GPT, LLaMA, etc.
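Restricting the statistics to the lower triangle matters because a causal mask zeroes out the upper triangle anyway; counting those positions would overstate sparsity. A minimal pure-Python sketch (the function name is illustrative, not a ModelOpt API) of measuring the skippable fraction over causal positions only:

```python
def lower_triangle_sparsity(attn, threshold):
    """Fraction of lower-triangle (causal) attention weights below threshold.

    Illustrative only: shows why is_causal=True restricts the statistic to
    positions j <= i, which are the only ones a causal model can attend to.
    """
    total = skipped = 0
    for i, row in enumerate(attn):
        for j in range(i + 1):  # j <= i: causal positions only
            total += 1
            if row[j] < threshold:
                skipped += 1
    return skipped / total

# Toy 3x3 attention matrix; the upper triangle is masked to 0.0
attn = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.5, 0.4999, 0.0001],
]
```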

field method: str

The sparse attention method to use (e.g., 'flash_skip_softmax').

field threshold: float | dict[str, float]

Threshold for determining which attention values to skip. Can be a float or dict with phase-specific values.
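Since the field accepts either form, consuming code has to normalize it per phase. A hypothetical helper (not part of ModelOpt) showing how the two documented shapes resolve:

```python
def resolve_threshold(threshold, phase):
    """Hypothetical helper: resolve the skip threshold for a phase.

    `threshold` may be a single float applied to all phases, or a dict
    with phase-specific values such as {"prefill": 1e-3, "decode": 1e-4},
    matching the two documented forms of the field.
    """
    if isinstance(threshold, dict):
        return threshold[phase]
    return threshold
```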

ModeloptConfig SparseAttentionConfig

Bases: ModeloptBaseConfig

Base configuration for sparse attention optimization.

This base configuration provides the common structure for all sparse attention methods and supports pattern-based layer configuration.

Default config (JSON):

{
   "sparse_cfg": {
      "*attention*": {
         "enable": true,
         "method": "flash_skip_softmax"
      },
      "default": {
         "enable": false
      }
   },
   "export_format": null
}

field export_format: str | None

Export format for sparse attention (e.g., 'onnx', 'tensorrt').

field sparse_cfg: dict[str | Callable, dict[str, Any]]

Pattern-based configuration for sparse attention. Keys are wildcard patterns matched against module names; values are configuration dicts with parameters such as 'threshold', 'enable', and 'calibration'.
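The wildcard-style keys behave like shell globs; a sketch of how such matching could resolve a module name to its config, using Python's stdlib fnmatch as a stand-in (ModelOpt's actual matching logic may differ):

```python
from fnmatch import fnmatch

# Pattern keys mirror the default sparse_cfg shown above
patterns = {
    "*attention*": {"enable": True, "method": "flash_skip_softmax"},
    "default": {"enable": False},
}

def config_for(module_name):
    """Return the first pattern entry matching module_name,
    falling back to the "default" entry when nothing matches."""
    for pattern, cfg in patterns.items():
        if pattern != "default" and fnmatch(module_name, pattern):
            return cfg
    return patterns["default"]
```

For example, a submodule named `model.layers.0.self_attention` matches `*attention*` and gets sparsity enabled, while `model.layers.0.mlp` falls through to `default` and is left dense.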