calc_runtime_stats

Runtime statistics calculation for NAS subblock benchmarking via vLLM.

Functions

calc_base_runtime

Calculate the base runtime of a model with no subblocks.

calc_model_runtime

Measure total runtime of a model via vLLM latency benchmark.

calc_no_block_runtime

Estimate the overhead runtime (embedding + LM head) with no decoder blocks.

calc_runtime_for_subblocks

Benchmark each unique subblock and return per-subblock runtimes and no-block overhead.

calc_subblock_runtime

Measure total runtime of a repeated subblock via vLLM latency benchmark.

create_benchmark_model

Build a small Llama model with repeated subblocks for latency benchmarking.

calc_base_runtime(runtime_config, subblock_config)

Calculate the base runtime of a model with no subblocks.

Parameters:
  • runtime_config

  • subblock_config

Return type:

float

calc_model_runtime(model, runtime_config)

Measure total runtime of a model via vLLM latency benchmark.

Parameters:
  • model

  • runtime_config

Return type:

float
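
The entry above states only that the runtime comes from a vLLM latency benchmark; it does not describe the measurement loop. As a rough, library-free illustration of the general pattern behind a latency measurement (warm-up runs followed by timed runs), the sketch below times an arbitrary callable. The helper name measure_latency, its warmup and iters parameters, and the use of the median are assumptions made for illustration and are not taken from this module or from vLLM.

```python
import statistics
import time


def measure_latency(fn, warmup=2, iters=5):
    """Hypothetical helper: median wall-clock time of fn() in seconds."""
    # Warm-up calls are executed but not timed, so one-off setup costs
    # (caches, lazy initialization) do not distort the samples.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)

    # The median is less sensitive to a single slow outlier than the mean.
    return statistics.median(samples)


# Stand-in workload; a real benchmark would run model inference here.
latency = measure_latency(lambda: sum(range(10_000)))
print(f"median latency: {latency:.6f} s")
```

In a real setting the callable would wrap a fixed-shape inference request, so that the samples are comparable across models.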

calc_no_block_runtime(runtime_config)

Estimate the overhead runtime (embedding + LM head) with no decoder blocks.

Parameters:

runtime_config (RuntimeConfig)

Return type:

float

calc_runtime_for_subblocks(subblock_config_set, runtime_stats_config, vocab_size, hidden_size, num_attention_heads, num_key_value_heads, tokenizer_path, prefill_seq_len, generation_seq_len, batch_size)

Benchmark each unique subblock and return per-subblock runtimes and no-block overhead.

Parameters:
  • subblock_config_set (set[SubblockConfig])

  • runtime_stats_config (DictConfig)

  • vocab_size (int)

  • hidden_size (int)

  • num_attention_heads (int)

  • num_key_value_heads (int)

  • tokenizer_path (str)

  • prefill_seq_len (int)

  • generation_seq_len (int)

  • batch_size (int)

Return type:

tuple[dict[SubblockConfig, float], float]
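
The return value pairs a per-subblock runtime table with a single no-block overhead. One plausible way to consume it, sketched below, is to estimate the runtime of a candidate architecture by summing the runtimes of its subblocks and adding the overhead once. The SubblockConfig dataclass here is a hypothetical stand-in with made-up fields, the numbers are invented, and the additive model itself is an assumption rather than something this module is documented to do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen makes instances hashable, so they can be dict keys
class SubblockConfig:
    """Hypothetical stand-in for the real SubblockConfig (fields are made up)."""
    kind: str
    width: int


def estimate_architecture_runtime(layers, subblock_runtimes, no_block_runtime):
    """Add the fixed overhead to the summed runtimes of the listed subblocks."""
    return no_block_runtime + sum(subblock_runtimes[cfg] for cfg in layers)


# Invented values shaped like the documented return type:
# tuple[dict[SubblockConfig, float], float]
attn = SubblockConfig("attention", 4096)
ffn = SubblockConfig("ffn", 11008)
subblock_runtimes = {attn: 0.5, ffn: 0.25}
no_block_runtime = 1.0

# A candidate with two attention subblocks and two FFN subblocks.
candidate = [attn, ffn, attn, ffn]
total = estimate_architecture_runtime(candidate, subblock_runtimes, no_block_runtime)
print(total)
```

Because the table is keyed by configuration, each unique subblock only needs to be benchmarked once, however many candidates reuse it.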

calc_subblock_runtime(runtime_config, subblock_config)

Measure total runtime of a repeated subblock via vLLM latency benchmark.

Parameters:
  • runtime_config

  • subblock_config

Return type:

float
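
Since the measurement covers a subblock repeated several times (create_benchmark_model defaults to repeat_block_n_times=10), a per-subblock figure has to be derived from the total. The sketch below shows one way that normalization could work: subtract a base runtime and divide by the repeat count. Whether this module performs exactly this arithmetic is an assumption; the helper name per_block_runtime and the sample numbers are hypothetical.

```python
def per_block_runtime(repeated_runtime, base_runtime, repeat_n):
    """Hypothetical helper: runtime attributable to one copy of a subblock.

    repeated_runtime: total runtime measured with the subblock repeated repeat_n times
    base_runtime:     runtime measured without the subblock (fixed overhead)
    repeat_n:         number of times the subblock was repeated
    """
    if repeat_n <= 0:
        raise ValueError("repeat_n must be a positive integer")
    return (repeated_runtime - base_runtime) / repeat_n


# Invented numbers: 6.0 total with 10 repeats, 1.0 of fixed overhead.
per_block = per_block_runtime(6.0, 1.0, 10)
print(per_block)
```

Repeating the subblock makes the measured difference large relative to timing noise, which is the usual motivation for benchmarking a repeated stack instead of a single copy.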

create_benchmark_model(vocab_size, hidden_size, num_key_value_heads, num_attention_heads, prefill_seq_len, generation_seq_len, block_config, repeat_block_n_times=10)

Build a small Llama model with repeated subblocks for latency benchmarking.

Parameters:
  • vocab_size (int)

  • hidden_size (int)

  • num_key_value_heads (int)

  • num_attention_heads (int)

  • prefill_seq_len (int)

  • generation_seq_len (int)

  • block_config (BlockConfig | None)

  • repeat_block_n_times (int)

Return type:

LlamaForCausalLM