runtime_vllm
vLLM Runtime Benchmark Integration for ModelOpt NAS Subblocks.
This module provides the integration logic to empirically benchmark the runtime of subblocks within transformer architectures using the vLLM latency benchmark. Each invocation is launched in a dedicated subprocess so that GPU memory and CUDA state are fully reclaimed when the subprocess exits, allowing many sequential benchmarks to run in a single Python session without leaking GPU memory.
- Usage:
Call run_vllm_latency_benchmark with a model path and a RuntimeConfig instance to run a latency benchmark and return the average latency for the configuration (in milliseconds).
Functions
- run_vllm_latency_benchmark(model_path, runtime_config)
Run vllm bench latency in a fresh subprocess and return the average latency in ms.
Spawning a subprocess per call gives OS-level isolation: GPU memory, CUDA context, and vLLM engine state are fully released on subprocess exit, so resource usage does not accumulate across many calls in one parent process.
- Parameters:
model_path (Path)
runtime_config (RuntimeConfig)
- Return type:
float
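The subprocess-per-call isolation described above can be illustrated with a minimal sketch. This is not the module's implementation: the helper name run_isolated_benchmark, the timeout, the convention of passing the output path as the last argument, and the "avg_latency" JSON field (assumed to be in seconds) are all assumptions made for illustration. A small stand-in child process replaces the real vllm bench latency command so the example runs without a GPU.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_isolated_benchmark(cmd_prefix: list[str], timeout_s: float = 600.0) -> float:
    """Run a benchmark command in a child process and return avg latency in ms.

    Assumption (hypothetical contract): the command takes an output path as its
    last argument and writes a JSON file with an "avg_latency" field in seconds.
    """
    with tempfile.TemporaryDirectory() as tmp:
        out_path = Path(tmp) / "result.json"
        # The child process owns all benchmark state (for a real run: GPU memory
        # and the CUDA context). The OS reclaims it when the child exits.
        subprocess.run([*cmd_prefix, str(out_path)], check=True, timeout=timeout_s)
        result = json.loads(out_path.read_text())
    return float(result["avg_latency"]) * 1000.0  # seconds -> milliseconds


# Stand-in for the real benchmark command: a child Python process that writes
# a fixed, fake result to the path given as its first argument.
FAKE_BENCH = [
    sys.executable,
    "-c",
    "import json, sys, pathlib; "
    "pathlib.Path(sys.argv[1]).write_text(json.dumps({'avg_latency': 0.25}))",
]

latency_ms = run_isolated_benchmark(FAKE_BENCH)
print(latency_ms)
```

Because check=True raises subprocess.CalledProcessError on a non-zero exit code, a failed benchmark surfaces as an exception in the parent instead of a silently missing result file.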