vLLM#

The vLLM workload (test_template_name is vllm) allows users to execute vLLM benchmarks within the CloudAI framework.

vLLM is a high-throughput and memory-efficient inference engine for LLMs. This workload supports both aggregated and disaggregated prefill/decode modes.

Usage Examples#

Test and Scenario Examples#

test.toml (test definition)#
name = "vllm_test"
description = "Example vLLM test"
test_template_name = "vllm"

[cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"

[bench_cmd_args]
random_input_len = 16
random_output_len = 128
max_concurrency = 16
num_prompts = 30

scenario.toml (scenario with one test)#
name = "vllm-benchmark"

[[Tests]]
id = "vllm.1"
num_nodes = 1
time_limit = "00:10:00"
test_name = "vllm_test"

Test-in-Scenario example#

scenario.toml (no separate test TOML needed)#
name = "vllm-benchmark"

[[Tests]]
id = "vllm.1"
num_nodes = 1
time_limit = "00:10:00"

name = "vllm_test"
description = "Example vLLM test"
test_template_name = "vllm"

[Tests.cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"

[Tests.bench_cmd_args]
random_input_len = 16
random_output_len = 128
max_concurrency = 16
num_prompts = 30

Controlling the Number of GPUs#

The number of GPUs can be controlled using the options below, listed from lowest to highest priority:

1. gpus_per_node system property (a scalar value)
2. CUDA_VISIBLE_DEVICES environment variable (a comma-separated list of GPU IDs)
3. gpu_ids command argument for the prefill and decode configurations (a comma-separated list of GPU IDs)

If disaggregated mode is used (prefill is set), either both prefill and decode should define gpu_ids, or neither should. An illustrative combination is sketched below.
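
As a minimal sketch of the priority order (values are hypothetical, and the aggregated-mode behavior is inferred from the disaggregated case described below), the following sets both CUDA_VISIBLE_DEVICES and an explicit gpu_ids; since gpu_ids has the highest priority, the decode instance is expected to use GPUs 0 and 1:

test.toml (GPU selection priority, illustrative)#
[cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"

[cmd_args.decode]
gpu_ids = "0,1"

[extra_env_vars]
CUDA_VISIBLE_DEVICES = "0,1,2,3"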

Controlling Disaggregation#

By default, vLLM runs as a single process without disaggregation. To enable disaggregation, set the prefill configuration:

test.toml (disaggregated prefill/decode)#
[cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"

[cmd_args.prefill]

[extra_env_vars]
CUDA_VISIBLE_DEVICES = "0,1,2,3"

The config above will automatically split the GPUs specified in CUDA_VISIBLE_DEVICES into two halves:

- The first half will be used for prefill
- The second half will be used for decode

For more control, users can specify the GPU IDs explicitly in prefill and decode configurations:

test.toml (disaggregated prefill/decode)#
[cmd_args.prefill]
gpu_ids = "0,1"

[cmd_args.decode]
gpu_ids = "2,3"

In this case, CUDA_VISIBLE_DEVICES is ignored and only the GPUs specified in gpu_ids are used.
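
Putting the pieces together, a complete disaggregated test definition with explicit GPU placement might look like the following sketch (assembled from the snippets above; the test name is hypothetical):

test.toml (complete disaggregated example, illustrative)#
name = "vllm_disagg_test"
description = "Disaggregated prefill/decode with explicit GPU IDs"
test_template_name = "vllm"

[cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"

[cmd_args.prefill]
gpu_ids = "0,1"

[cmd_args.decode]
gpu_ids = "2,3"

[bench_cmd_args]
num_prompts = 30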

Controlling proxy_script#

proxy_script is used to proxy requests from the client to the prefill and decode instances. It is ignored in non-disaggregated mode. The default value is shown in the API documentation below.

It can be overridden by setting proxy_script; for example, to use the latest version of the script from the vLLM repository:

test_scenario.toml (override proxy_script)#
[[Tests.git_repos]]
url = "https://github.com/vllm-project/vllm.git"
commit = "main"
mount_as = "/vllm_repo"

[Tests.cmd_args]
docker_image_url = "vllm/vllm-openai:v0.14.0-cu130"
proxy_script = "/vllm_repo/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py"

In this case, the vLLM repository is cloned locally, mounted into the container as /vllm_repo, and the proxy script from that mount is used for the test.

API Documentation#

vLLM Serve Arguments#

pydantic model cloudai.workloads.vllm.vllm.VllmArgs[source]#

Base command arguments for vLLM instances.

field nixl_threads: int | list[int] | None = None#

Set kv_connector_extra_config.num_threads for --kv-transfer-config CLI argument.
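
A hedged sketch of setting this field on the decode instance; the list form is presumed to define a sweep space for the grid_search agent, following CloudAI's DSE convention for list-typed arguments (treat that as an assumption):

test.toml fragment (nixl_threads, illustrative)#
[cmd_args.decode]
# a single value; a list such as [4, 8] is assumed to define a grid-search sweep
nixl_threads = 8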

property serve_args_exclude: set[str]#

Fields consumed internally and excluded from generic serve args.

serialize_serve_arg(key: str, value: object) → list[str][source]#

Serialize a single serve argument to CLI tokens.

property serve_args: list[str]#

field gpu_ids: str | list[str] | None = None#

Comma-separated GPU IDs. If not set, all available GPUs will be used.

Command Arguments#

class cloudai.workloads.vllm.vllm.VllmCmdArgs(
*,
docker_image_url: str,
model: str = 'Qwen/Qwen3-0.6B',
port: Annotated[int, Ge(ge=1), Le(le=65535)] = 8000,
serve_wait_seconds: int = 300,
prefill: VllmArgs | None = None,
decode: VllmArgs = <factory>,
proxy_script: str = '/opt/vllm/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py',
)[source]#

Bases: LLMServingCmdArgs[VllmArgs]

vLLM serve command arguments.
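
To make the scalar fields concrete, a minimal [cmd_args] sketch with port and serve_wait_seconds shown at their documented defaults (reading serve_wait_seconds as a server-readiness timeout is an assumption):

test.toml fragment (serve arguments, illustrative)#
[cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"
port = 8000                # must be in 1..65535
serve_wait_seconds = 300   # assumed: how long to wait for the server before benchmarking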

Benchmark Command Arguments#

class cloudai.workloads.vllm.vllm.VllmBenchCmdArgs(
*,
random_input_len: int = 16,
random_output_len: int = 128,
max_concurrency: int = 16,
num_prompts: int = 30,
**extra_data: Any,
)[source]#

Bases: CmdArgs

vLLM bench serve command arguments.
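
A sketch of the corresponding [bench_cmd_args] table with non-default, hypothetical values; since the model accepts **extra_data, keys beyond the four declared fields are also accepted (whether such extra keys are forwarded to the underlying vllm bench serve CLI is an assumption):

test.toml fragment (benchmark arguments, illustrative)#
[bench_cmd_args]
random_input_len = 128
random_output_len = 256
max_concurrency = 32
num_prompts = 100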

Test Definition#

class cloudai.workloads.vllm.vllm.VllmTestDefinition(
*,
name: str,
description: str,
test_template_name: str,
cmd_args: VllmCmdArgs,
extra_env_vars: dict[str, str | List[str]] = {},
extra_cmd_args: dict[str, str] = {},
extra_container_mounts: list[str] = [],
git_repos: list[GitRepo] = [],
nsys: NsysConfiguration | None = None,
predictor: PredictorConfig | None = None,
agent: str = 'grid_search',
agent_steps: int = 1,
agent_metrics: list[str] = ['default'],
agent_reward_function: str = 'inverse',
agent_config: dict[str, Any] | None = None,
bench_cmd_args: VllmBenchCmdArgs = VllmBenchCmdArgs(random_input_len=16, random_output_len=128, max_concurrency=16, num_prompts=30),
proxy_script_repo: GitRepo | None = None,
)[source]#

Bases: LLMServingTestDefinition[VllmCmdArgs]

Test object for vLLM.
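
Combining the fields above with the Test-in-Scenario form shown earlier, a fuller sketch follows; all snippets appear earlier on this page, and the exact nesting of extra_env_vars under Tests is an assumption:

scenario.toml (full example, illustrative)#
name = "vllm-benchmark"

[[Tests]]
id = "vllm.1"
num_nodes = 1
time_limit = "00:10:00"

name = "vllm_test"
description = "Example vLLM test"
test_template_name = "vllm"

[Tests.extra_env_vars]
CUDA_VISIBLE_DEVICES = "0,1,2,3"

[[Tests.git_repos]]
url = "https://github.com/vllm-project/vllm.git"
commit = "main"
mount_as = "/vllm_repo"

[Tests.cmd_args]
docker_image_url = "vllm/vllm-openai:v0.14.0-cu130"
model = "Qwen/Qwen3-0.6B"
proxy_script = "/vllm_repo/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py"

[Tests.cmd_args.prefill]

[Tests.bench_cmd_args]
num_prompts = 30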