calc_runtime_stats

Runtime statistics calculation for neural architecture search (NAS) subblock benchmarking via vLLM.

Functions

`calc_base_runtime`	Calculate the base runtime of a model with no subblocks.
`calc_model_runtime`	Measure total runtime of a model via vLLM latency benchmark.
`calc_no_block_runtime`	Estimate the overhead runtime (embedding + LM head) with no decoder blocks.
`calc_runtime_for_subblocks`	Benchmark each unique subblock and return per-subblock runtimes and no-block overhead.
`calc_subblock_runtime`	Measure total runtime of a repeated subblock via vLLM latency benchmark.
`create_benchmark_model`	Build a small Llama model with repeated subblocks for latency benchmarking.

calc_base_runtime(runtime_config, subblock_config)

Calculate the base runtime of a model with no subblocks.

Parameters:

runtime_config (RuntimeConfig)
subblock_config (SubblockConfig)

Return type:

float

calc_model_runtime(model, runtime_config)

Measure total runtime of a model via vLLM latency benchmark.

Parameters:

model (LlamaForCausalLM)
runtime_config (RuntimeConfig)

Return type:

float
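
As a rough illustration of what a latency benchmark of this kind involves, the sketch below times a callable after a few untimed warm-up runs and averages the measured iterations. It does not call vLLM and is not this module's implementation; `measure_mean_latency`, its parameters, and the injectable `clock` are hypothetical names introduced only for this example.

```python
import time
from typing import Callable


def measure_mean_latency(
    run_once: Callable[[], None],
    num_warmup: int = 2,
    num_iters: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return the mean wall-clock seconds per call of `run_once`.

    Hypothetical helper for illustration; not part of this module or vLLM.
    """
    if num_iters <= 0:
        raise ValueError("num_iters must be positive")

    # Warm-up calls run but are not timed, so one-off setup costs
    # do not inflate the reported latency.
    for _ in range(num_warmup):
        run_once()

    total = 0.0
    for _ in range(num_iters):
        start = clock()
        run_once()
        total += clock() - start

    return total / num_iters
```

Passing the clock as a parameter keeps the helper deterministic under test; in real use the default `time.perf_counter` would apply.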

calc_no_block_runtime(runtime_config)

Estimate the overhead runtime (embedding + LM head) with no decoder blocks.

Parameters:

runtime_config (RuntimeConfig)

Return type:

float

calc_runtime_for_subblocks(subblock_config_set, runtime_stats_config, vocab_size, hidden_size, num_attention_heads, num_key_value_heads, tokenizer_path, prefill_seq_len, generation_seq_len, batch_size)

Benchmark each unique subblock and return per-subblock runtimes and no-block overhead.

Parameters:

subblock_config_set (set[SubblockConfig])
runtime_stats_config (DictConfig)
vocab_size (int)
hidden_size (int)
num_attention_heads (int)
num_key_value_heads (int)
tokenizer_path (str)
prefill_seq_len (int)
generation_seq_len (int)
batch_size (int)

Return type:

tuple[dict[SubblockConfig, float], float]
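
A minimal sketch of how the returned pair might be consumed, assuming the per-subblock runtimes are additive and can be summed with the no-block overhead to estimate a candidate architecture's runtime. That additivity is an assumption, not something this page states. `estimate_model_runtime` is a hypothetical helper, the string keys stand in for real `SubblockConfig` objects, and the numbers are made up.

```python
from typing import Hashable, Mapping, Sequence


def estimate_model_runtime(
    subblock_runtimes: Mapping[Hashable, float],
    no_block_runtime: float,
    architecture: Sequence[Hashable],
) -> float:
    """Additive estimate: fixed overhead plus one runtime per chosen subblock.

    Hypothetical helper for illustration; not part of this module.
    """
    return no_block_runtime + sum(subblock_runtimes[s] for s in architecture)


# Stand-in values shaped like the documented return type
# tuple[dict[SubblockConfig, float], float]; keys and numbers are invented.
subblock_runtimes = {"attention_full": 0.5, "ffn_small": 0.25}
no_block_runtime = 0.125

# A candidate architecture is modeled here as an ordered list of subblock keys.
candidate = ["attention_full", "ffn_small", "attention_full", "ffn_small"]
estimated = estimate_model_runtime(subblock_runtimes, no_block_runtime, candidate)
```

Because `SubblockConfig` is used as a dictionary key in the return type, looking up a runtime per subblock in this way should carry over to the real objects.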

calc_subblock_runtime(runtime_config, subblock_config)

Measure total runtime of a repeated subblock via vLLM latency benchmark.

Parameters:

runtime_config (RuntimeConfig)
subblock_config (SubblockConfig | None)

Return type:

float

create_benchmark_model(vocab_size, hidden_size, num_key_value_heads, num_attention_heads, prefill_seq_len, generation_seq_len, block_config, repeat_block_n_times=10)

Build a small Llama model with repeated subblocks for latency benchmarking.

Parameters:

vocab_size (int)
hidden_size (int)
num_key_value_heads (int)
num_attention_heads (int)
prefill_seq_len (int)
generation_seq_len (int)
block_config (BlockConfig | None)
repeat_block_n_times (int)

Return type:

LlamaForCausalLM
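
One plausible reason to repeat a block several times is to make its cost large relative to measurement noise, then divide it back out. The sketch below shows that arithmetic under the assumption that the measured total is the no-block overhead plus `repeat_block_n_times` copies of the block. This page does not say how the module converts a measurement into a per-block figure, so `per_block_runtime` is a hypothetical helper.

```python
def per_block_runtime(
    total_runtime: float,
    no_block_runtime: float,
    repeat_block_n_times: int = 10,
) -> float:
    """Amortized runtime of one block in a model built from repeated blocks.

    Hypothetical helper for illustration; assumes
    total_runtime = no_block_runtime + repeat_block_n_times * block_runtime.
    """
    if repeat_block_n_times <= 0:
        raise ValueError("repeat_block_n_times must be positive")
    return (total_runtime - no_block_runtime) / repeat_block_n_times


# Made-up measurements in seconds: 2.625 total, 0.125 of it overhead.
block_runtime = per_block_runtime(2.625, 0.125, repeat_block_n_times=10)
```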