calc_runtime_stats

Classes

RuntimeConfig

RuntimeConfig(vocab_size: int, hidden_size: int, num_attention_heads: int, master_puzzle_dir: str, tokenizer_path: str, synth_dataset_num_requests: int, repeat_block_n_times: int, prefill_seq_len: int, generation_seq_len: int, batch_size: int, num_iters: int, num_warmup_iters: int)

Functions

calc_no_block_runtime

calc_runtime_for_subblocks

calc_subblock_runtime

create_benchmark_model

run_vllm_latency_benchmark

save_model

save_model_as_anymodel

class RuntimeConfig

Bases: object

RuntimeConfig(vocab_size: int, hidden_size: int, num_attention_heads: int, master_puzzle_dir: str, tokenizer_path: str, synth_dataset_num_requests: int, repeat_block_n_times: int, prefill_seq_len: int, generation_seq_len: int, batch_size: int, num_iters: int, num_warmup_iters: int)

__init__(vocab_size, hidden_size, num_attention_heads, master_puzzle_dir, tokenizer_path, synth_dataset_num_requests, repeat_block_n_times, prefill_seq_len, generation_seq_len, batch_size, num_iters, num_warmup_iters)
Parameters:
  • vocab_size (int)

  • hidden_size (int)

  • num_attention_heads (int)

  • master_puzzle_dir (str)

  • tokenizer_path (str)

  • synth_dataset_num_requests (int)

  • repeat_block_n_times (int)

  • prefill_seq_len (int)

  • generation_seq_len (int)

  • batch_size (int)

  • num_iters (int)

  • num_warmup_iters (int)

Return type:

None

batch_size: int
generation_seq_len: int
hidden_size: int
master_puzzle_dir: str
num_attention_heads: int
num_iters: int
num_warmup_iters: int
prefill_seq_len: int
repeat_block_n_times: int
synth_dataset_num_requests: int
tokenizer_path: str
vocab_size: int
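
For reference, constructing the config might look like the sketch below. The dataclass body is a local stand-in mirroring the documented fields (in real use it would be imported from calc_runtime_stats), and all field values are illustrative, not recommended settings.

```python
from dataclasses import dataclass, fields

# Local stand-in mirroring the documented RuntimeConfig fields;
# the real class lives in calc_runtime_stats.
@dataclass
class RuntimeConfig:
    vocab_size: int
    hidden_size: int
    num_attention_heads: int
    master_puzzle_dir: str
    tokenizer_path: str
    synth_dataset_num_requests: int
    repeat_block_n_times: int
    prefill_seq_len: int
    generation_seq_len: int
    batch_size: int
    num_iters: int
    num_warmup_iters: int

# All values below are illustrative placeholders.
config = RuntimeConfig(
    vocab_size=32000,
    hidden_size=4096,
    num_attention_heads=32,
    master_puzzle_dir="/tmp/puzzle",
    tokenizer_path="/tmp/tokenizer",
    synth_dataset_num_requests=64,
    repeat_block_n_times=10,
    prefill_seq_len=2048,
    generation_seq_len=128,
    batch_size=8,
    num_iters=5,
    num_warmup_iters=2,
)
```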

calc_no_block_runtime(runtime_config)
Parameters:
  • runtime_config (RuntimeConfig)

Return type:

float

calc_runtime_for_subblocks(subblock_config_set, runtime_stats_config, vocab_size, hidden_size, num_attention_heads, master_puzzle_dir, tokenizer_path, synth_dataset_num_requests, prefill_seq_len, generation_seq_len)
Parameters:
  • subblock_config_set (set[SubblockConfig])

  • runtime_stats_config (DictConfig)

  • vocab_size (int)

  • hidden_size (int)

  • num_attention_heads (int)

  • master_puzzle_dir (str)

  • tokenizer_path (str)

  • synth_dataset_num_requests (int)

  • prefill_seq_len (int)

  • generation_seq_len (int)

Return type:

tuple[dict[SubblockConfig, float], float]
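
A consumer of calc_runtime_for_subblocks might unpack its result as sketched below. The stub here mimics only the documented return shape, tuple[dict[SubblockConfig, float], float]; the string keys standing in for SubblockConfig objects, the dummy timings, and the reading of the second element as the no-block runtime (cf. calc_no_block_runtime) are all assumptions.

```python
# Stub mimicking only the documented return shape of
# calc_runtime_for_subblocks: tuple[dict[SubblockConfig, float], float].
# Strings stand in for SubblockConfig objects; all numbers are dummies.
def calc_runtime_for_subblocks_stub(subblock_config_set):
    per_subblock = {cfg: float(len(cfg)) for cfg in subblock_config_set}
    no_block_runtime = 0.5  # assumed meaning: runtime with the block removed
    return per_subblock, no_block_runtime

runtimes, no_block = calc_runtime_for_subblocks_stub({"attention", "ffn"})

# Rank subblock variants by measured runtime, fastest first.
ranking = sorted(runtimes, key=runtimes.get)
```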

calc_subblock_runtime(runtime_config, subblock_config)
Parameters:
  • runtime_config (RuntimeConfig)

  • subblock_config (SubblockConfig)

Return type:

float

create_benchmark_model(vocab_size, hidden_size, num_attention_heads, prefill_seq_len, generation_seq_len, block_config, repeat_block_n_times=10)
Parameters:
  • vocab_size (int)

  • hidden_size (int)

  • num_attention_heads (int)

  • prefill_seq_len (int)

  • generation_seq_len (int)

  • block_config (BlockConfig | None)

  • repeat_block_n_times (int)

Return type:

LlamaForCausalLM

run_vllm_latency_benchmark(model_path, runtime_config)
Parameters:
  • model_path (Path)

  • runtime_config (RuntimeConfig)

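
The num_warmup_iters / num_iters pair on RuntimeConfig suggests the usual warmup-then-measure latency loop. A minimal sketch of that generic pattern (not the actual vLLM benchmark internals, and measure_latency is a hypothetical helper) is:

```python
import time

# Generic warmup-then-measure loop illustrating how num_warmup_iters and
# num_iters from RuntimeConfig are typically consumed. This is a sketch of
# the pattern, not the real run_vllm_latency_benchmark implementation.
def measure_latency(step, num_iters, num_warmup_iters):
    for _ in range(num_warmup_iters):
        step()  # warmup iterations are excluded from timing
    start = time.perf_counter()
    for _ in range(num_iters):
        step()
    return (time.perf_counter() - start) / num_iters  # mean seconds per iter

# Dummy workload standing in for a model forward pass.
latency = measure_latency(lambda: sum(range(1000)), num_iters=5, num_warmup_iters=2)
```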
save_model(model, tokenizer_path, output_path, num_hidden_layers)
Parameters:
  • model (LlamaForCausalLM)

  • tokenizer_path (Path)

  • output_path (Path)

  • num_hidden_layers (int)

Return type:

None

save_model_as_anymodel(model, output_dir, descriptor, num_hidden_layers)
Parameters:
  • model (LlamaForCausalLM)

  • output_dir (Path)

  • num_hidden_layers (int)