block_config

Classes

BaseDataclass

A dataclass base class with several utilities: string-representation comparison, field initialization from dicts, and frozen-safe attribute setting in __post_init__.

SubblockConfig

Base configuration for a subblock (e.g. attention or FFN) within a transformer block.

MoEConfig

Configuration class for Mixture of Experts parameters.

MambaConfig

Configuration for a Mamba (state-space model) subblock.

Llama4AttentionConfig

Configuration for Llama-4-specific attention parameters.

AttentionConfig

Configuration for an attention subblock within a transformer block.

FFNConfig

Configuration for a feed-forward network subblock within a transformer block.

BlockConfig

Configuration for a single transformer block, including its attention and FFN subblocks.

Functions

maybe_cast_block_configs

Cast a list of dicts to BlockConfig objects if needed.

class AttentionConfig

Bases: SubblockConfig

Configuration for an attention subblock within a transformer block.

__init__(*, no_op=False, replace_with_linear=False, sparsify=None, weights_precision='bf16', num_key_value_heads=None, llama4=None, mamba=None)
Parameters:
  • no_op (bool)

  • replace_with_linear (bool)

  • sparsify (list[str] | None)

  • weights_precision (str | None)

  • num_key_value_heads (int | None)

  • llama4 (Llama4AttentionConfig | None)

  • mamba (MambaConfig | None)

Return type:

None

property is_llama4: bool
property is_mamba: bool
llama4: Llama4AttentionConfig | None = None
mamba: MambaConfig | None = None
num_key_value_heads: int | None = None
to_blockconfig()
Return type:

BlockConfig
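The is_llama4 and is_mamba properties plausibly just check whether the corresponding subconfig is attached. A minimal self-contained sketch of that pattern (an illustrative reimplementation with simplified fields, not the library's code):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MambaConfig:
    state_dim: int
    num_heads: int
    head_dim: int
    num_groups: int

@dataclass(frozen=True)
class AttentionConfig:
    no_op: bool = False
    replace_with_linear: bool = False
    num_key_value_heads: Optional[int] = None
    mamba: Optional[MambaConfig] = None

    @property
    def is_mamba(self) -> bool:
        # The subblock is a Mamba layer when a MambaConfig is attached
        return self.mamba is not None

gqa_attn = AttentionConfig(num_key_value_heads=8)
mamba_attn = AttentionConfig(
    mamba=MambaConfig(state_dim=128, num_heads=8, head_dim=64, num_groups=1)
)
print(gqa_attn.is_mamba, mamba_attn.is_mamba)  # False True
```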

class BaseDataclass

Bases: object

A dataclass base class with several utilities:

  1. Comparison via string representation.

  2. Initialization of dataclass fields from dicts.

  3. Setting attributes even though the class is frozen (but only inside __post_init__!).

__init__()
Return type:

None
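Utility 3 relies on a standard dataclasses idiom: a frozen dataclass blocks normal assignment, but object.__setattr__ can still set attributes from inside __post_init__. A minimal illustration (hypothetical class name, not the library's code):

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class Derived:
    x: int
    doubled: int = 0

    def __post_init__(self):
        # Normal assignment raises FrozenInstanceError on a frozen
        # dataclass; object.__setattr__ bypasses the check, which is
        # the standard way to compute derived fields at init time.
        object.__setattr__(self, "doubled", 2 * self.x)

d = Derived(3)
print(d.doubled)  # 6
try:
    d.x = 99       # mutation after init is still blocked
except FrozenInstanceError:
    print("frozen")
```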

class BlockConfig

Bases: BaseDataclass

Configuration for a single transformer block, including its attention and FFN subblocks.

__init__(*, attention=None, ffn=None, parallel_blocks=None)
Parameters:
  • attention (AttentionConfig | None)

  • ffn (FFNConfig | None)

  • parallel_blocks (list[BlockConfig] | None)

Return type:

None

attention: AttentionConfig | None = None
ffn: FFNConfig | None = None
parallel_blocks: list[BlockConfig] | None = None
to_dict()

Convert BlockConfig to a dictionary.

Return type:

dict
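Since BlockConfig nests its subblock dataclasses, to_dict likely behaves like dataclasses.asdict, recursing into nested configs. A self-contained sketch under that assumption (simplified fields, not the library's code):

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class FFNConfig:
    no_op: bool = False
    intermediate_size: Optional[int] = None

@dataclass(frozen=True)
class BlockConfig:
    ffn: Optional[FFNConfig] = None

    def to_dict(self) -> dict:
        # dataclasses.asdict recurses into nested dataclasses,
        # so subblock configs become nested dicts
        return asdict(self)

cfg = BlockConfig(ffn=FFNConfig(intermediate_size=4096))
print(cfg.to_dict())
# {'ffn': {'no_op': False, 'intermediate_size': 4096}}
```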

class FFNConfig

Bases: SubblockConfig

Configuration for a feed-forward network subblock within a transformer block.

__init__(*, no_op=False, replace_with_linear=False, sparsify=None, weights_precision='bf16', moe=None, intermediate_size=None)
Parameters:
  • no_op (bool)

  • replace_with_linear (bool)

  • sparsify (list[str] | None)

  • weights_precision (str | None)

  • moe (MoEConfig | None)

  • intermediate_size (int | None)

Return type:

None

intermediate_size: int | None = None
property is_moe: bool
moe: MoEConfig | None = None
to_blockconfig()
Return type:

BlockConfig

class Llama4AttentionConfig

Bases: BaseDataclass

Configuration for Llama-4-specific attention parameters.

__init__(*, attention_chunk_size=None, use_rope=None, use_qk_norm=None, attn_scale=None, floor_scale=None, attn_temperature_tuning=None, attention_dropout=None)
Parameters:
  • attention_chunk_size (int | None)

  • use_rope (bool | None)

  • use_qk_norm (bool | None)

  • attn_scale (float | None)

  • floor_scale (float | None)

  • attn_temperature_tuning (bool | None)

  • attention_dropout (float | None)

Return type:

None

attention_chunk_size: int | None = None
attention_dropout: float | None = None
attn_scale: float | None = None
attn_temperature_tuning: bool | None = None
floor_scale: float | None = None
use_qk_norm: bool | None = None
use_rope: bool | None = None
class MambaConfig

Bases: BaseDataclass

Configuration for a Mamba (state-space model) subblock.

__init__(*, state_dim, num_heads, head_dim, num_groups)
Parameters:
  • state_dim (int)

  • num_heads (int)

  • head_dim (int)

  • num_groups (int)

Return type:

None

head_dim: int
num_groups: int
num_heads: int
state_dim: int
class MoEConfig

Bases: BaseDataclass

Configuration class for Mixture of Experts parameters.

__init__(*, num_local_experts=8, num_experts_per_tok=1, expert_intermediate_dim=8192, shared_expert_intermediate_dim=8192)
Parameters:
  • num_local_experts (int)

  • num_experts_per_tok (int)

  • expert_intermediate_dim (int)

  • shared_expert_intermediate_dim (int)

Return type:

None

expert_intermediate_dim: int = 8192
num_experts_per_tok: int = 1
num_local_experts: int = 8
shared_expert_intermediate_dim: int = 8192
class SubblockConfig

Bases: BaseDataclass

Base configuration for a subblock (e.g. attention or FFN) within a transformer block.

__init__(*, no_op=False, replace_with_linear=False, sparsify=None, weights_precision='bf16')
Parameters:
  • no_op (bool)

  • replace_with_linear (bool)

  • sparsify (list[str] | None)

  • weights_precision (str | None)

Return type:

None

no_op: bool = False
replace_with_linear: bool = False
sparsify: list[str] | None = None
abstract to_blockconfig()

Convert to a BlockConfig containing this subblock only.

Return type:

BlockConfig

weights_precision: str | None = 'bf16'
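Each subclass overrides to_blockconfig to wrap itself in a block that contains only that subblock, leaving the other slot unset. A minimal sketch of that contract (simplified fields, hypothetical implementation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BlockConfig:
    attention: Optional["AttentionConfig"] = None
    ffn: Optional["FFNConfig"] = None

@dataclass(frozen=True)
class AttentionConfig:
    no_op: bool = False

    def to_blockconfig(self) -> BlockConfig:
        # A block holding only this attention subblock; ffn stays None
        return BlockConfig(attention=self)

@dataclass(frozen=True)
class FFNConfig:
    no_op: bool = False

    def to_blockconfig(self) -> BlockConfig:
        # A block holding only this FFN subblock; attention stays None
        return BlockConfig(ffn=self)

block = FFNConfig().to_blockconfig()
print(block.attention is None, block.ffn is not None)  # True True
```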
maybe_cast_block_configs(block_configs)

Cast a list of dicts to BlockConfig objects if needed.

Parameters:

block_configs (List[BlockConfig | dict] | None) – List of BlockConfig or dict objects, or None.

Returns:

List of BlockConfig objects, or None if input is None/empty.

Return type:

List[BlockConfig] | None
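The documented behavior (dicts cast to BlockConfig, None/empty passed through as None) can be sketched as follows. This is a simplified stand-in: the real function presumably also casts nested subblock dicts, while this version handles only top-level keys.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass(frozen=True)
class BlockConfig:
    attention: Optional[dict] = None
    ffn: Optional[dict] = None

def maybe_cast_block_configs(
    block_configs: Optional[List[Union[BlockConfig, dict]]],
) -> Optional[List[BlockConfig]]:
    # None or an empty list passes through as None
    if not block_configs:
        return None
    # Dicts are expanded into keyword arguments; existing
    # BlockConfig instances are kept as-is
    return [
        bc if isinstance(bc, BlockConfig) else BlockConfig(**bc)
        for bc in block_configs
    ]

mixed = [{"ffn": {"intermediate_size": 4096}}, BlockConfig()]
cast = maybe_cast_block_configs(mixed)
print(all(isinstance(c, BlockConfig) for c in cast))  # True
print(maybe_cast_block_configs(None))                 # None
```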