calc_runtime_stats
Runtime statistics calculation for NAS subblock benchmarking via vLLM.
Functions
calc_base_runtime | Calculate the base runtime of a model with no subblocks.
calc_model_runtime | Measure total runtime of a model via vLLM latency benchmark.
calc_no_block_runtime | Estimate the overhead runtime (embedding + LM head) with no decoder blocks.
calc_runtime_for_subblocks | Benchmark each unique subblock and return per-subblock runtimes and no-block overhead.
calc_subblock_runtime | Measure total runtime of a repeated subblock via vLLM latency benchmark.
create_benchmark_model | Build a small Llama model with repeated subblocks for latency benchmarking.
- calc_base_runtime(runtime_config, subblock_config)
Calculate the base runtime of a model with no subblocks.
- Parameters:
runtime_config (RuntimeConfig)
subblock_config (SubblockConfig)
- Return type:
float
- calc_model_runtime(model, runtime_config)
Measure total runtime of a model via vLLM latency benchmark.
- Parameters:
model (LlamaForCausalLM)
runtime_config (RuntimeConfig)
- Return type:
float
- calc_no_block_runtime(runtime_config)
Estimate the overhead runtime (embedding + LM head) with no decoder blocks.
- Parameters:
runtime_config (RuntimeConfig)
- Return type:
float
- calc_runtime_for_subblocks(subblock_config_set, runtime_stats_config, vocab_size, hidden_size, num_attention_heads, num_key_value_heads, tokenizer_path, prefill_seq_len, generation_seq_len, batch_size)
Benchmark each unique subblock and return per-subblock runtimes and no-block overhead.
- Parameters:
subblock_config_set (set[SubblockConfig])
runtime_stats_config (DictConfig)
vocab_size (int)
hidden_size (int)
num_attention_heads (int)
num_key_value_heads (int)
tokenizer_path (str)
prefill_seq_len (int)
generation_seq_len (int)
batch_size (int)
- Return type:
tuple[dict[SubblockConfig, float], float]
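The returned tuple pairs a per-subblock runtime mapping with a single no-block overhead value. A minimal sketch of one way to consume it follows, assuming the runtimes combine additively for a candidate architecture (the source does not confirm this). `FakeSubblockConfig`, `estimate_total_runtime`, and all numbers are hypothetical stand-ins, not part of this module.

```python
from dataclasses import dataclass


# Hypothetical stand-in for SubblockConfig. It is frozen so that instances
# are hashable and can be used as dict keys, as the return type requires.
@dataclass(frozen=True)
class FakeSubblockConfig:
    kind: str
    width: int


def estimate_total_runtime(subblock_runtimes, no_block_runtime, layout):
    """Sum the no-block overhead and the runtime of every subblock in `layout`.

    Assumption: subblock runtimes are additive. This is an illustration,
    not the documented behaviour of calc_runtime_for_subblocks.
    """
    return no_block_runtime + sum(subblock_runtimes[cfg] for cfg in layout)


# Made-up values shaped like tuple[dict[SubblockConfig, float], float].
attn = FakeSubblockConfig("attention", 8)
ffn = FakeSubblockConfig("ffn", 4096)
subblock_runtimes = {attn: 2.0, ffn: 3.0}
no_block_runtime = 1.0

# A candidate architecture expressed as an ordered list of subblocks.
layout = [attn, ffn, attn, ffn]
total = estimate_total_runtime(subblock_runtimes, no_block_runtime, layout)
```

Keying the mapping by config means each unique subblock is benchmarked once and then reused for every position where it appears in a candidate layout.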
- calc_subblock_runtime(runtime_config, subblock_config)
Measure total runtime of a repeated subblock via vLLM latency benchmark.
- Parameters:
runtime_config (RuntimeConfig)
subblock_config (SubblockConfig | None)
- Return type:
float
- create_benchmark_model(vocab_size, hidden_size, num_key_value_heads, num_attention_heads, prefill_seq_len, generation_seq_len, block_config, repeat_block_n_times=10)
Build a small Llama model with repeated subblocks for latency benchmarking.
- Parameters:
vocab_size (int)
hidden_size (int)
num_key_value_heads (int)
num_attention_heads (int)
prefill_seq_len (int)
generation_seq_len (int)
block_config (BlockConfig | None)
repeat_block_n_times (int, default 10)
- Return type:
LlamaForCausalLM
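Repeating a block several times is a common way to make its cost measurable above the fixed embedding and LM-head overhead. A minimal sketch of how a per-block figure could be recovered from such a measurement follows, assuming the overhead is subtracted and the remainder is divided evenly across the repeats. The helper `per_block_runtime` is hypothetical and the source does not state that the module computes it this way.

```python
def per_block_runtime(total_runtime, no_block_runtime, repeat_block_n_times=10):
    """Average runtime of one block from a repeated-block measurement.

    Assumption: total_runtime = no_block_runtime + n * (runtime of one block).
    """
    if repeat_block_n_times <= 0:
        raise ValueError("repeat_block_n_times must be positive")
    return (total_runtime - no_block_runtime) / repeat_block_n_times


# Made-up numbers: 25.0 measured in total, 5.0 of fixed overhead, 10 repeats.
single = per_block_runtime(25.0, 5.0, repeat_block_n_times=10)
```

A larger repeat count reduces the relative influence of measurement noise on the per-block estimate, at the cost of a longer benchmark run.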