calc_subblock_params_and_memory

Calculate memory usage and parameter counts for neural network subblocks.

This module provides utilities to compute memory footprints and parameter counts for different subblock types (FFN, Attention, Mamba, MoE) in large language models, considering various data types, batch sizes, and sequence lengths.

Functions

`calc_subblock_active_params`	Calculate the number of "active" parameters for a subblock (FFN, Attention, or MoE).
`calculate_ffn_memory`	Estimate the memory usage in MiB of a feed-forward network (FFN) subblock.
`calculate_mamba_memory`	Calculate memory usage (MiB) for a Mamba subblock.
`calculate_mamba_state_size`	Calculate the total state size for a Mamba subblock.
`calculate_non_block_memory`	Estimate the memory usage in MiB of non-subblock components (e.g., embeddings, output projection).
`calculate_non_block_params`	Calculate the number of parameters for non-subblock components (e.g., embeddings, output projection).
`calculate_subblock_memory`	Calculate the memory usage of a single subblock (FFN or Attention).
`calculate_subblock_params`	Count parameters on one meta decoder layer.
`estimate_num_active_experts`	Estimate the expected number of active experts in a Mixture-of-Experts (MoE) layer.
`load_moe_stats`	Load MoE (Mixture-of-Experts) routing statistics from a file.

calc_subblock_active_params(sublayer_config, model_config, descriptor, n_embd, moe_stats_file, batch_size, block_idx)

Calculate the number of “active” parameters for a subblock (FFN, Attention, or MoE).

For non-MoE subblocks, this simply calls calculate_subblock_params and counts all parameters. For MoE (Mixture-of-Experts) FFN subblocks, it estimates the expected number of active parameters per batch: it reads expert activation statistics from the given stats file, computes the expected number of active experts for the batch size, and multiplies that by the number of parameters per expert.

Parameters:

sublayer_config (FFNConfig | AttentionConfig) – The subblock configuration (either FFNConfig or AttentionConfig).
model_config (PreTrainedConfig) – The Hugging Face model configuration.
descriptor (type[ModelDescriptor]) – The ModelDescriptor class corresponding to this model family.
n_embd (int) – The embedding size (hidden dimension).
moe_stats_file (str) – Path to file containing expert activation probabilities.
batch_size (int) – The batch size used for the estimate.
block_idx (int) – The index of the block/subblock within the network, used to index into the stats.

Returns:

The expected number of “active” parameters for the given subblock.

Return type:

int
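The two branches described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `active_params_sketch` and all of its arguments are hypothetical names, and the sketch ignores any router or shared-expert parameters that a real MoE layer may carry.

```python
def active_params_sketch(total_params, is_moe,
                         expected_active_experts=0.0, params_per_expert=0):
    """Hypothetical helper mirroring the documented behaviour.

    Non-MoE subblocks: every parameter counts as active.
    MoE subblocks: expected active experts times parameters per expert.
    """
    if not is_moe:
        return total_params
    # Assumption: router and shared parameters are ignored in this sketch.
    return int(round(expected_active_experts * params_per_expert))


dense = active_params_sketch(1000, is_moe=False)
sparse = active_params_sketch(1000, is_moe=True,
                              expected_active_experts=2.5,
                              params_per_expert=100)
```

With 2.5 expected active experts of 100 parameters each, the MoE branch reports 250 active parameters, while the dense branch returns the full count.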

calculate_ffn_memory(ffn_config, model_config, descriptor, weights_dtype, experts_dtype=None)

Estimate the memory usage in MiB of a feed-forward network (FFN) subblock.

Parameters:

ffn_config (FFNConfig) – FFN configuration for the block.
model_config (PreTrainedConfig) – The parent model configuration.
descriptor (type[ModelDescriptor]) – Model descriptor class.
weights_dtype (dtype | str) – Data type for FFN weights.
experts_dtype (dtype | str | None) – Data type for expert weights (for MoE layers, if present).

Returns:

Estimated FFN memory usage in mebibytes (MiB).

Return type:

float
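As a rough illustration of how a weight-memory estimate in MiB can be derived, the sketch below multiplies a parameter count by the element size of the dtype. It assumes a dense FFN with two or three projection matrices and no biases; `ffn_memory_mib_sketch`, `intermediate_size`, `gated`, and the `BYTES_PER_ELEMENT` table are assumptions for the example and are not taken from the library.

```python
# Assumed element sizes in bytes, keyed by dtype name.
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2, "float16": 2, "float8_e4m3fn": 1}
MIB = 1024 ** 2


def ffn_memory_mib_sketch(n_embd, intermediate_size, weights_dtype, gated=True):
    """Hypothetical estimate of dense FFN weight memory in MiB."""
    # A gated FFN has gate, up, and down projections; an ungated one has two.
    n_proj = 3 if gated else 2
    n_params = n_proj * n_embd * intermediate_size
    return n_params * BYTES_PER_ELEMENT[weights_dtype] / MIB


gated_bf16 = ffn_memory_mib_sketch(1024, 4096, "bfloat16")
plain_fp32 = ffn_memory_mib_sketch(1024, 4096, "float32", gated=False)
```

For MoE layers, the same arithmetic would be applied per expert using the expert dtype; that case is left out here.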

calculate_mamba_memory(attention_config, model_config, descriptor, batch_size, weights_dtype, kv_cache_dtype)

Calculate memory usage (MiB) for a Mamba subblock.

Parameters:

attention_config (AttentionConfig) – Mamba attention configuration, including Mamba-specific settings.
model_config (PreTrainedConfig) – Model configuration.
descriptor (type[ModelDescriptor]) – Model descriptor class.
batch_size (int) – Batch size for memory estimate.
weights_dtype (dtype) – Data type for model weights.
kv_cache_dtype (dtype) – Data type for the state/KV cache.

Returns:

Estimated memory usage in mebibytes (MiB) for the Mamba subblock.

Return type:

int

calculate_mamba_state_size(mamba_config, batch_size)

Calculate the total state size for a Mamba subblock.

Parameters:

mamba_config (MambaConfig) – Configuration object containing Mamba subblock parameters.
batch_size (int) – Batch size to estimate the memory/state requirements for.

Returns:

Total state size (number of elements) required for the Mamba subblock, including convolution and SSM state.

Return type:

int
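To make "convolution and SSM state" concrete, the following sketch counts elements under an assumed Mamba-2 style layout: a convolution state of shape (batch, d_inner + 2 * n_groups * d_state, d_conv) and an SSM state of shape (batch, num_heads, head_dim, d_state). The field names and the layout are assumptions for illustration; the actual MambaConfig fields may differ.

```python
from dataclasses import dataclass


@dataclass
class MambaConfigSketch:
    """Hypothetical stand-in for MambaConfig; field names are assumed."""
    d_inner: int
    d_state: int
    d_conv: int
    n_groups: int
    num_heads: int
    head_dim: int


def mamba_state_size_sketch(cfg, batch_size):
    """Element count of the conv state plus the SSM state for one subblock."""
    conv_channels = cfg.d_inner + 2 * cfg.n_groups * cfg.d_state
    conv_state = batch_size * conv_channels * cfg.d_conv
    ssm_state = batch_size * cfg.num_heads * cfg.head_dim * cfg.d_state
    return conv_state + ssm_state


cfg = MambaConfigSketch(d_inner=8, d_state=4, d_conv=4,
                        n_groups=1, num_heads=2, head_dim=4)
total_elements = mamba_state_size_sketch(cfg, batch_size=2)
```

Multiplying the element count by the size of the cache dtype gives the state memory in bytes.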

calculate_non_block_memory(n_embd, vocab_size, weight_dtype)

Estimate the memory usage in MiB of non-subblock components (e.g., embeddings, output projection).

Parameters:

n_embd (int)
vocab_size (int)
weight_dtype (dtype)

Return type:

float

calculate_non_block_params(n_embd, vocab_size)

Calculate the number of parameters for non-subblock components (e.g., embeddings, output projection).

Parameters:

n_embd (int)
vocab_size (int)

Return type:

int
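The two non-block helpers above can be approximated as in the sketch below. It assumes an untied input embedding and output projection, each of shape (vocab_size, n_embd), plus one final normalization weight vector of length n_embd; whether the library counts exactly these tensors is an assumption, and the function names are hypothetical.

```python
MIB = 1024 ** 2


def non_block_params_sketch(n_embd, vocab_size):
    """Assumed count: input embedding + output projection + final norm."""
    embedding = vocab_size * n_embd
    output_projection = vocab_size * n_embd
    final_norm = n_embd
    return embedding + output_projection + final_norm


def non_block_memory_mib_sketch(n_embd, vocab_size, bytes_per_element):
    """Parameter count times element size, expressed in MiB."""
    return non_block_params_sketch(n_embd, vocab_size) * bytes_per_element / MIB


params = non_block_params_sketch(n_embd=4096, vocab_size=32000)
memory_mib = non_block_memory_mib_sketch(4096, 32000, bytes_per_element=2)
```

With a 32,000-token vocabulary, a hidden size of 4096, and 2-byte weights, this comes to roughly 500 MiB, almost all of it in the two vocabulary-sized matrices.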

calculate_subblock_memory(subblock_config, batch_size, prefill_seq_len, generation_seq_len, prefill_queue_size, n_embd, n_head, weights_dtype, kv_cache_dtype, allocate_prefill_query, model_config, descriptor)

Calculate the memory usage of a single subblock (FFN or Attention).

Given the subblock configuration and runtime dimensions, returns the memory usage in bytes, or a dictionary broken down by memory type.

Parameters:

subblock_config (FFNConfig | AttentionConfig) – Subblock configuration dataclass.
batch_size (int) – Batch size for memory estimate.
prefill_seq_len (int) – Sequence length for prefill phase.
generation_seq_len (int) – Sequence length for generation phase (token-by-token).
prefill_queue_size (int) – Token queue size for prefill attention memory allocation.
n_embd (int) – Embedding (hidden) dimension.
n_head (int) – Number of attention heads (used for non-FFN).
weights_dtype (dtype) – PyTorch dtype for model weights.
kv_cache_dtype (dtype) – PyTorch dtype for KV cache.
allocate_prefill_query (bool) – Whether to allocate query cache for prefill tokens.
model_config (PreTrainedConfig) – Hugging Face-style config instance describing the model.
descriptor (type[ModelDescriptor]) – Model descriptor type (for puzzletron model types).

Returns:

Memory usage in bytes (float), or a dictionary by memory type.

Return type:

float | dict[str, float]
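For the attention case, one memory type that scales with the runtime dimensions is the KV cache. The sketch below shows the standard sizing arithmetic under stated assumptions: keys and values are both cached, the head dimension is n_embd // n_head, and the cache covers prefill plus generation tokens. `kv_cache_bytes_sketch`, `n_kv_heads`, and `bytes_per_element` are hypothetical names and do not describe how the library breaks down its result.

```python
def kv_cache_bytes_sketch(batch_size, prefill_seq_len, generation_seq_len,
                          n_embd, n_head, n_kv_heads, bytes_per_element):
    """Hypothetical KV-cache size in bytes for one attention subblock."""
    head_dim = n_embd // n_head
    total_tokens = prefill_seq_len + generation_seq_len
    # Factor 2 accounts for storing both keys and values.
    elements = 2 * batch_size * total_tokens * n_kv_heads * head_dim
    return elements * bytes_per_element


cache_bytes = kv_cache_bytes_sketch(
    batch_size=1, prefill_seq_len=1024, generation_seq_len=1024,
    n_embd=4096, n_head=32, n_kv_heads=8, bytes_per_element=2,
)
cache_mib = cache_bytes / 1024 ** 2
```

In this configuration the cache occupies 8 MiB per subblock and grows linearly with batch size and total sequence length.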

calculate_subblock_params(config, layer_config, descriptor)

Count parameters on one meta decoder layer.

The caller is responsible for adjusting per-layer config fields (e.g. hybrid_override_pattern) before passing config; see ModelDescriptor.truncate_pattern_for_subblock.

Parameters:

config (PreTrainedConfig)
layer_config (BlockConfig | FFNConfig | AttentionConfig)
descriptor (type[ModelDescriptor])

Return type:

int

estimate_num_active_experts(dist_over_experts, batch_size, num_experts)

Estimate the expected number of active experts in a Mixture-of-Experts (MoE) layer.

This function computes the expected number of distinct experts that are selected at least once when running inference with a given batch size. It assumes that each input in the batch selects an expert according to dist_over_experts (typically a vector with one probability per expert).

Parameters:

dist_over_experts (ndarray) – A 1D array of probabilities for each expert.
batch_size (int) – The number of samples in the batch.
num_experts (int) – The maximum number of experts to consider (fewer if dist_over_experts is shorter).

Returns:

The expected number of experts selected at least once across the batch.

Return type:

int
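Under the additional assumption that the inputs in a batch pick experts independently, an expert with probability p is missed by all B inputs with probability (1 - p)^B, so the expected number of active experts is the sum of 1 - (1 - p)^B over experts. The sketch below implements that formula with plain Python lists standing in for numpy arrays; the independence assumption and the function name are not taken from the library, and the rounding implied by the documented int return type is omitted.

```python
def expected_active_experts_sketch(dist_over_experts, batch_size, num_experts):
    """Expected count of experts chosen at least once by `batch_size` inputs."""
    # Consider at most `num_experts` entries (fewer if the list is shorter).
    probs = list(dist_over_experts)[:num_experts]
    return sum(1.0 - (1.0 - p) ** batch_size for p in probs)


uniform = [0.25, 0.25, 0.25, 0.25]
one_sample = expected_active_experts_sketch(uniform, batch_size=1, num_experts=4)
two_samples = expected_active_experts_sketch(uniform, batch_size=2, num_experts=4)
skewed = expected_active_experts_sketch([1.0, 0.0, 0.0, 0.0], 8, 4)
```

A single input always activates exactly one expert in expectation, and a distribution concentrated on one expert never activates more than one regardless of batch size.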

load_moe_stats(stats_file)

Load MoE (Mixture-of-Experts) routing statistics from a file.

This function reads a JSON file containing expert activation probabilities or counts for each MoE block. It returns the normalized probability distributions over experts for each block, as a list of numpy arrays.

Parameters:

stats_file (str) – Path to the JSON file containing expert routing statistics for each block.

Returns:

A list where each element is a numpy array containing the normalized probability distribution over experts for the corresponding block. If a block’s expert list is empty, its entry is 0.

Return type:

dict
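The loading and normalization step can be sketched as below. The JSON layout is an assumption made for this example (a top-level list with one list of per-expert counts per block); the real file format is not specified here. Plain lists stand in for numpy arrays, and the function names are hypothetical.

```python
import json
import os
import tempfile


def normalize_block_sketch(values):
    """Turn counts or probabilities into a distribution; empty blocks map to 0."""
    total = sum(values) if values else 0
    if total == 0:
        # Assumption: an all-zero block is treated like an empty one.
        return 0
    return [v / total for v in values]


def load_moe_stats_sketch(stats_file):
    """Read the assumed JSON layout and normalize each block."""
    with open(stats_file) as f:
        raw = json.load(f)
    return [normalize_block_sketch(block) for block in raw]


# Usage: write a small example file, then load it back.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump([[2, 2, 4], []], tmp)
stats = load_moe_stats_sketch(tmp.name)
os.remove(tmp.name)
```

The first block normalizes to 0.25, 0.25, and 0.5, and the empty second block is represented by 0.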