runtime_utils

Utilities for runtime benchmarking and model saving in ModelOpt NAS.

This module provides the classes and utility functions used for empirical runtime estimation of Transformer subblocks. It defines the configuration dataclass that parameterizes a runtime benchmark (e.g., a vLLM latency benchmark), along with checkpointing helpers that save models and tokenizers in the AnyModel subblock-safetensors format so that they remain compatible with downstream evaluation pipelines.

Classes

RuntimeConfig

Configuration for a vLLM latency benchmark run.

Functions

save_model

Save model weights as AnyModel and copy the tokenizer to output_path.

save_model_as_anymodel

Save a model checkpoint in AnyModel subblock-safetensors format.

class RuntimeConfig

Bases: object

Configuration for a vLLM latency benchmark run.

__init__(vocab_size, hidden_size, num_attention_heads, num_key_value_heads, tokenizer_path, repeat_block_n_times, prefill_seq_len, generation_seq_len, batch_size, num_iters, num_warmup_iters)
Parameters:
  • vocab_size (int)

  • hidden_size (int)

  • num_attention_heads (int)

  • num_key_value_heads (int)

  • tokenizer_path (str)

  • repeat_block_n_times (int)

  • prefill_seq_len (int)

  • generation_seq_len (int)

  • batch_size (int)

  • num_iters (int)

  • num_warmup_iters (int)

Return type:

None
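As an illustration of the constructor signature above, the following sketch builds a configuration and inspects it. To stay self-contained it declares a stand-in dataclass that mirrors the documented fields rather than importing the real class; every value shown (including the tokenizer path) is a hypothetical placeholder, not a recommended or default setting.

```python
from dataclasses import asdict, dataclass


@dataclass
class RuntimeConfig:
    """Stand-in that mirrors the documented fields; not the library's class."""

    vocab_size: int
    hidden_size: int
    num_attention_heads: int
    num_key_value_heads: int
    tokenizer_path: str
    repeat_block_n_times: int
    prefill_seq_len: int
    generation_seq_len: int
    batch_size: int
    num_iters: int
    num_warmup_iters: int


# All values below are illustrative assumptions.
config = RuntimeConfig(
    vocab_size=32000,
    hidden_size=4096,
    num_attention_heads=32,
    num_key_value_heads=8,
    tokenizer_path="path/to/tokenizer",  # hypothetical location
    repeat_block_n_times=4,
    prefill_seq_len=1024,
    generation_seq_len=128,
    batch_size=8,
    num_iters=10,
    num_warmup_iters=2,
)

# A dataclass converts cleanly to a dict, which is handy for logging a run.
config_dict = asdict(config)
```

In real use, the stand-in definition would be replaced by an import of RuntimeConfig from this module; the keyword arguments would stay the same.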

batch_size: int
generation_seq_len: int
hidden_size: int
num_attention_heads: int
num_iters: int
num_key_value_heads: int
num_warmup_iters: int
prefill_seq_len: int
repeat_block_n_times: int
tokenizer_path: str
vocab_size: int

save_model(model, tokenizer_path, output_path)

Save model weights as AnyModel and copy the tokenizer to output_path.

Parameters:
  • model (LlamaForCausalLM)

  • tokenizer_path (Path)

  • output_path (Path)

Return type:

None
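The tokenizer-copy half of this function can be pictured with the minimal sketch below. It is not the library's implementation: the helper name copy_tokenizer_files is hypothetical, and the choice to copy every regular file found directly under tokenizer_path is an assumption, since this page does not state which files save_model actually copies.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def copy_tokenizer_files(tokenizer_path: Path, output_path: Path) -> list[Path]:
    """Hypothetical helper: copy tokenizer files into output_path.

    Assumption: every regular file directly under tokenizer_path is a
    tokenizer artifact. Subdirectories are skipped.
    """
    output_path.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(tokenizer_path.iterdir()):
        if src.is_file():
            dst = output_path / src.name
            shutil.copy2(src, dst)  # copy2 also preserves file metadata
            copied.append(dst)
    return copied
```

Calling copy_tokenizer_files(Path("path/to/tokenizer"), Path("path/to/output")) would return the list of files written, which makes it easy to check that the output directory holds what a benchmark run expects.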

save_model_as_anymodel(model, output_dir, descriptor)

Save a model checkpoint in AnyModel subblock-safetensors format.

Parameters:

  • output_dir (Path)