runtime_vllm

vLLM Runtime Benchmark Integration for ModelOpt NAS Subblocks.

This module provides the integration logic for empirically measuring the runtime of subblocks in transformer architectures using the vLLM latency benchmark. Each invocation is launched in a dedicated subprocess, so GPU memory and CUDA state are fully reclaimed when the subprocess exits. This allows many sequential benchmarks to run in a single Python session without leaking GPU memory between runs.

Usage:
  • Call run_vllm_latency_benchmark with a model path and a RuntimeConfig instance to run a latency benchmark and return the average latency for the configuration (in milliseconds).
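Because each call returns a single float, sweeping several configurations reduces to a loop over calls. The sketch below illustrates that pattern under stated assumptions: fastest_config and fake_benchmark are hypothetical helpers, and plain dicts stand in for RuntimeConfig instances, whose fields are not shown in this section.

```python
from typing import Any, Callable, Sequence, Tuple

# Any callable with the shape (model_path, runtime_config) -> latency in ms.
BenchmarkFn = Callable[[str, Any], float]


def fastest_config(
    benchmark_fn: BenchmarkFn,
    model_path: str,
    runtime_configs: Sequence[Any],
) -> Tuple[Any, float]:
    """Benchmark each config sequentially and return (best_config, latency_ms)."""
    if not runtime_configs:
        raise ValueError("runtime_configs must not be empty")
    # Pair each latency with its index so ties resolve to the earliest config
    # and the configs themselves never need to be comparable.
    timings = [
        (benchmark_fn(model_path, cfg), idx)
        for idx, cfg in enumerate(runtime_configs)
    ]
    latency_ms, best_idx = min(timings)
    return runtime_configs[best_idx], latency_ms


def fake_benchmark(model_path: str, runtime_config: Any) -> float:
    """Stand-in for the real benchmark: pretends latency scales with batch size."""
    return runtime_config["batch_size"] * 2.0


best, latency = fastest_config(
    fake_benchmark,
    "/path/to/model",  # placeholder path, not a real checkpoint
    [{"batch_size": 4}, {"batch_size": 1}, {"batch_size": 2}],
)
```

In real use, run_vllm_latency_benchmark would be passed in place of fake_benchmark, with actual RuntimeConfig instances in the list.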

Functions

run_vllm_latency_benchmark

Run the vllm bench latency command in a fresh subprocess and return the average latency in milliseconds.

run_vllm_latency_benchmark(model_path, runtime_config)

Spawning a subprocess per call gives OS-level isolation: GPU memory, CUDA context, and vLLM engine state are fully released on subprocess exit, so many calls in one parent process do not accumulate.
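The isolation pattern described above can be sketched generically. This is a minimal illustration, not the module's actual implementation: run_isolated_latency is a hypothetical helper, and the convention that the child writes a JSON file whose avg_latency field is in seconds is an assumption. For vLLM, the command prefix would presumably be something like vllm bench latency with a model argument and a JSON output option; verify the exact flags against the installed vLLM version.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Sequence


def run_isolated_latency(
    cmd_prefix: Sequence[str],
    result_key: str = "avg_latency",
    timeout_s: float = 600.0,
) -> float:
    """Run a benchmark command in a child process and return latency in ms.

    Assumed convention (hypothetical): the child receives an output path as
    its final argument and writes JSON there, with result_key in seconds.
    """
    with tempfile.TemporaryDirectory() as tmp:
        out_path = Path(tmp) / "result.json"
        completed = subprocess.run(
            [*cmd_prefix, str(out_path)],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        # The child has exited by now, so the OS has already reclaimed
        # every resource it held, whether or not the run succeeded.
        if completed.returncode != 0:
            raise RuntimeError(f"benchmark failed: {completed.stderr[-2000:]}")
        seconds = json.loads(out_path.read_text())[result_key]
    return float(seconds) * 1000.0


# Stand-in child process so the sketch runs without a GPU or vLLM installed.
_FAKE_CHILD = (
    "import json, sys\n"
    "with open(sys.argv[1], 'w') as f:\n"
    "    json.dump({'avg_latency': 0.25}, f)\n"
)

latency_ms = run_isolated_latency([sys.executable, "-c", _FAKE_CHILD])
```

Passing results through a file rather than parsing stdout keeps the parent independent of whatever log output the benchmark prints.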

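Parameters:
  • model_path: Path to the model to benchmark.
  • runtime_config: RuntimeConfig instance describing the configuration to benchmark.

Returns:
  Average latency for the configuration, in milliseconds.

Return type:
  float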