vLLM#

The vLLM workload (test_template_name is vllm) lets users run vLLM benchmarks within the CloudAI framework.

vLLM is a high-throughput and memory-efficient inference engine for LLMs. This workload supports both aggregated and disaggregated prefill/decode modes.

Usage Examples#

Test and Scenario Examples#

test.toml (test definition)#
name = "vllm_test"
description = "Example vLLM test"
test_template_name = "vllm"

[cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"

[bench_cmd_args]
random_input_len = 16
random_output_len = 128
max_concurrency = 16
num_prompts = 30

[semantic_eval_cmd_args]
entrypoint = "python3 /opt/vllm/tests/evals/gsm8k/gsm8k_eval.py"
cli = "--host {host} --port {port} --num-questions 200 --save-results {output_path}/vllm-gsm8k.json"

scenario.toml (scenario with one test)#
name = "vllm-benchmark"

[[Tests]]
id = "vllm.1"
num_nodes = 1
time_limit = "00:10:00"
test_name = "vllm_test"

Test-in-Scenario example#

scenario.toml (separate test toml is not needed)#
name = "vllm-benchmark"

[[Tests]]
id = "vllm.1"
num_nodes = 1
time_limit = "00:10:00"

name = "vllm_test"
description = "Example vLLM test"
test_template_name = "vllm"

[Tests.cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"

[Tests.bench_cmd_args]
random_input_len = 16
random_output_len = 128
max_concurrency = 16
num_prompts = 30

Semantic Validation#

To run GSM8K semantic validation after the serving benchmark, add semantic_eval_cmd_args. CloudAI reports accuracy from the eval output, but does not enforce an accuracy threshold.

test.toml (semantic validation)#
[semantic_eval_cmd_args]
entrypoint = "python3 /opt/vllm/tests/evals/gsm8k/gsm8k_eval.py"
cli = "--host {host} --port {port} --num-questions 200 --save-results {output_path}/vllm-gsm8k.json"

If the runtime image does not contain the eval script, mount a vLLM repository through the existing git_repos support and point entrypoint at the mounted path.

The cli string supports {model}, {host}, {port}, {url}, {output_path}, and {result_dir} placeholders.
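
As a rough illustration of how these placeholders expand, the sketch below fills the default cli string with Python's str.format. The helper name render_eval_cli and all the values passed to it are hypothetical, and str.format is only a stand-in: the source does not show how CloudAI performs the substitution internally.

```python
def render_eval_cli(cli: str, **values: str) -> str:
    """Expand {name} placeholders in a semantic-eval cli string.

    Hypothetical helper: str.format stands in for CloudAI's own substitution.
    """
    return cli.format(**values)


cli = "--host {host} --port {port} --num-questions 200 --save-results {output_path}/vllm-gsm8k.json"

# Illustrative values only; CloudAI derives the real ones from the running test.
rendered = render_eval_cli(
    cli,
    model="Qwen/Qwen3-0.6B",
    host="127.0.0.1",
    port="8300",
    url="http://127.0.0.1:8300",
    output_path="/results/vllm.1",
    result_dir="/results",
)
print(rendered)
```

Placeholders that the cli string does not mention (here {model}, {url}, and {result_dir}) are simply unused.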

Controlling the Number of GPUs#

GPU selection priority, from lowest to highest:

  1. gpus_per_node system property (scalar value)

  2. decode.gpu_ids command argument in non-disaggregated mode when CUDA_VISIBLE_DEVICES is not set

  3. CUDA_VISIBLE_DEVICES environment variable (comma-separated list of GPU IDs)

  4. gpu_ids command argument for both prefill and decode configurations in disaggregated mode

In disaggregated mode, define both prefill.gpu_ids and decode.gpu_ids, or omit both.
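
The priority order for non-disaggregated mode, together with the both-or-neither rule for disaggregated mode, can be sketched as follows. The function names are hypothetical, and mapping a scalar gpus_per_node to IDs 0..N-1 is an assumption made for illustration rather than documented CloudAI behavior.

```python
from typing import Optional


def parse_ids(value: str) -> list[int]:
    """Turn a comma-separated string such as "0,1,2" into [0, 1, 2]."""
    return [int(part) for part in value.split(",") if part.strip()]


def resolve_aggregated_gpus(
    gpus_per_node: int,
    cuda_visible_devices: Optional[str] = None,
    decode_gpu_ids: Optional[str] = None,
) -> list[int]:
    """Pick GPU IDs for non-disaggregated mode, highest-priority source first."""
    if cuda_visible_devices is not None:
        return parse_ids(cuda_visible_devices)  # item 3 in the list above
    if decode_gpu_ids is not None:
        return parse_ids(decode_gpu_ids)  # item 2: only when the env var is unset
    # Item 1. Assumption: a scalar N is read as GPU IDs 0..N-1.
    return list(range(gpus_per_node))


def check_disaggregated_gpu_ids(
    prefill_gpu_ids: Optional[str], decode_gpu_ids: Optional[str]
) -> None:
    """Reject configs that set gpu_ids for only one of the two roles."""
    if (prefill_gpu_ids is None) != (decode_gpu_ids is None):
        raise ValueError("define both prefill.gpu_ids and decode.gpu_ids, or omit both")


print(resolve_aggregated_gpus(8, decode_gpu_ids="2,3"))
print(resolve_aggregated_gpus(8, cuda_visible_devices="0,1", decode_gpu_ids="2,3"))
```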

Controlling Disaggregation#

By default, vLLM runs without disaggregation, as a single process. To enable disaggregation, add a prefill configuration (an empty [cmd_args.prefill] table is enough):

test.toml (disaggregated prefill/decode)#
[cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"

[cmd_args.prefill]

[extra_env_vars]
CUDA_VISIBLE_DEVICES = "0,1,2,3"

The config above automatically splits the GPUs listed in CUDA_VISIBLE_DEVICES into two halves:

  • The first half is used for prefill.

  • The second half is used for decode.

For more control, users can specify the GPU IDs explicitly in prefill and decode configurations:

test.toml (disaggregated prefill/decode)#
[cmd_args.prefill]
gpu_ids = "0,1"

[cmd_args.decode]
gpu_ids = "2,3"

In this case CUDA_VISIBLE_DEVICES will be ignored and only the GPUs specified in gpu_ids will be used.
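
A minimal sketch of the two behaviors described in this section: the automatic half split, and explicit gpu_ids taking precedence over CUDA_VISIBLE_DEVICES. The function name is hypothetical. The source does not say how an odd number of GPUs is divided, so giving the extra GPU to decode is an assumption.

```python
from typing import Optional


def split_for_disaggregation(
    cuda_visible_devices: str,
    prefill_gpu_ids: Optional[str] = None,
    decode_gpu_ids: Optional[str] = None,
) -> tuple[list[str], list[str]]:
    """Return (prefill GPUs, decode GPUs) for disaggregated mode."""
    if prefill_gpu_ids is not None and decode_gpu_ids is not None:
        # Explicit gpu_ids win; CUDA_VISIBLE_DEVICES is ignored.
        return prefill_gpu_ids.split(","), decode_gpu_ids.split(",")
    visible = cuda_visible_devices.split(",")
    # Assumption: with an odd count, the extra GPU goes to decode.
    half = len(visible) // 2
    return visible[:half], visible[half:]


print(split_for_disaggregation("0,1,2,3"))
print(split_for_disaggregation("4,5,6,7", prefill_gpu_ids="0,1", decode_gpu_ids="2,3"))
```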

Multi-node serving#

For non-disaggregated runs with num_nodes > 1, CloudAI creates one Ray cluster and starts vllm serve on the head node with --distributed-executor-backend ray.

For disaggregated serving over more than two nodes, set explicit role sizes:

  • prefill.num_nodes + decode.num_nodes must equal the test num_nodes.

  • CloudAI assigns contiguous node slices: prefill first, decode second.

  • tensor_parallel_size is total per role, not per node.

  • CUDA_VISIBLE_DEVICES and gpu_ids are local GPU IDs on each serving node.

Example: four prefill nodes and four decode nodes, each with four visible GPUs:

scenario.toml (multi-node disaggregated serving)#
[[Tests]]
id = "vllm.pd_multi_node"
num_nodes = 8
test_template_name = "vllm"

[Tests.cmd_args.prefill]
num_nodes = 4
tensor_parallel_size = 16

[Tests.cmd_args.decode]
num_nodes = 4
tensor_parallel_size = 16

[Tests.extra_env_vars]
CUDA_VISIBLE_DEVICES = "0,1,2,3"
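
The sizing rules for this example can be checked with a short sketch. The function plan_roles and the node names are hypothetical. The per-role GPU count is simply nodes times locally visible GPUs, which is where the value 16 for tensor_parallel_size in the example comes from.

```python
def plan_roles(
    nodes: list[str], prefill_nodes: int, decode_nodes: int, gpus_per_node: int
) -> dict[str, dict[str, object]]:
    """Assign contiguous node slices (prefill first, decode second) and count GPUs per role."""
    if prefill_nodes + decode_nodes != len(nodes):
        raise ValueError("prefill.num_nodes + decode.num_nodes must equal num_nodes")
    return {
        "prefill": {
            "nodes": nodes[:prefill_nodes],
            "gpus": prefill_nodes * gpus_per_node,  # total for the role, not per node
        },
        "decode": {
            "nodes": nodes[prefill_nodes:],
            "gpus": decode_nodes * gpus_per_node,
        },
    }


# Hypothetical node names for an 8-node allocation with 4 visible GPUs each.
nodes = [f"node{i:02d}" for i in range(8)]
plan = plan_roles(nodes, prefill_nodes=4, decode_nodes=4, gpus_per_node=4)
print(plan["prefill"])
print(plan["decode"])
```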

Readiness health checks#

Healthcheck fields:

  • healthcheck: aggregated server endpoint, default /healthcheck.

  • serve_healthcheck: optional override for serve, prefill, and decode servers.

  • proxy_healthcheck: disaggregated proxy/router endpoint, default /healthcheck.

If serve_healthcheck is omitted, disaggregated prefill/decode servers keep the legacy /health endpoint. If a disaggregated config sets healthcheck but omits proxy_healthcheck, the proxy/router uses healthcheck for backward compatibility.
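
The fallback rules above can be summarized in a small sketch. The function name is hypothetical, and modeling an omitted field as None is an assumption: the source does not show how CloudAI detects that a field was left unset.

```python
from typing import Optional

DEFAULT_HEALTHCHECK = "/healthcheck"
LEGACY_SERVE_HEALTHCHECK = "/health"


def resolve_healthchecks(
    disaggregated: bool,
    healthcheck: Optional[str] = None,
    serve_healthcheck: Optional[str] = None,
    proxy_healthcheck: Optional[str] = None,
) -> dict[str, str]:
    """Map each server role to the endpoint polled for readiness.

    Assumption: None means the field was omitted from the config.
    """
    if not disaggregated:
        # serve_healthcheck overrides the aggregated endpoint when present.
        return {"serve": serve_healthcheck or healthcheck or DEFAULT_HEALTHCHECK}
    # Prefill/decode servers keep the legacy endpoint unless overridden.
    serve = serve_healthcheck or LEGACY_SERVE_HEALTHCHECK
    # The proxy falls back to healthcheck for backward compatibility.
    proxy = proxy_healthcheck or healthcheck or DEFAULT_HEALTHCHECK
    return {"prefill": serve, "decode": serve, "proxy": proxy}


print(resolve_healthchecks(disaggregated=False))
print(resolve_healthchecks(disaggregated=True, healthcheck="/ready"))
```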

Controlling proxy_script#

proxy_script proxies requests from the client to the prefill and decode instances. It is ignored in non-disaggregated mode. The default value is listed in the VllmCmdArgs signature under Command Arguments.

To override it, set proxy_script to another path, for example the latest version of the script from the vLLM repository:

test_scenario.toml (override proxy_script)#
[[Tests.git_repos]]
url = "https://github.com/vllm-project/vllm.git"
commit = "main"
mount_as = "/vllm_repo"

[Tests.cmd_args]
docker_image_url = "vllm/vllm-openai:v0.14.0-cu130"
proxy_script = "/vllm_repo/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py"

In this case the vLLM repository is cloned locally and mounted as /vllm_repo, and the proxy script from that mount is used for the test.

API Documentation#

vLLM Serve Arguments#

pydantic model cloudai.workloads.vllm.vllm.VllmArgs[source]#

Base command arguments for vLLM instances.

field ray_head: VllmRayStartArgs | None = None#

Arguments appended to the Ray head startup command for multi-node vLLM roles.

field ray_worker: VllmRayStartArgs | None = None#

Arguments appended to the Ray worker startup command for multi-node vLLM roles.

field nixl_threads: int | list[int] | None = None#

Set kv_connector_extra_config.num_threads for --kv-transfer-config CLI argument.
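
As a hedged illustration, the sketch below builds a JSON value for --kv-transfer-config with num_threads placed under kv_connector_extra_config. The helper name is hypothetical, and the kv_connector and kv_role values are assumptions; the exact payload CloudAI emits is not shown in the source.

```python
import json
from typing import Optional


def kv_transfer_config(nixl_threads: Optional[int] = None) -> str:
    """Build a JSON string for --kv-transfer-config (illustrative only)."""
    # Assumption: connector name and role are placeholders for this sketch.
    config: dict[str, object] = {"kv_connector": "NixlConnector", "kv_role": "kv_both"}
    if nixl_threads is not None:
        config["kv_connector_extra_config"] = {"num_threads": nixl_threads}
    return json.dumps(config)


print("--kv-transfer-config", kv_transfer_config(4))
```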

property serve_args_exclude: set[str]#

Fields consumed internally and excluded from generic serve args.

serialize_serve_arg(key: str, value: object) list[str][source]#

Serialize a single serve argument to CLI tokens.

property serve_args: list[str]#

field gpu_ids: str | list[str] | None = None#

Comma-separated GPU IDs. If not set, all available GPUs will be used.

field num_nodes: int | list[int] | None = None#

Number of Slurm nodes assigned to this role in disaggregated serving mode.

Command Arguments#

class cloudai.workloads.vllm.vllm.VllmCmdArgs(
*,
docker_image_url: str,
model: str = 'Qwen/Qwen3-0.6B',
port: ~typing.Annotated[int,
~annotated_types.Ge(ge=1),
~annotated_types.Le(le=65535)] = 8300,
host: str = '0.0.0.0',
bench_host: str | None = None,
healthcheck: str = '/healthcheck',
serve_healthcheck: str | None = None,
serve_wait_seconds: int = 300,
prefill: ~cloudai.workloads.vllm.vllm.VllmArgs | None = None,
decode: ~cloudai.workloads.vllm.vllm.VllmArgs = <factory>,
proxy_script: str = '/opt/vllm/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py',
proxy_healthcheck: str = '/healthcheck',
)[source]#

Bases: LLMServingCmdArgs[VllmArgs]

vLLM serve command arguments.

Benchmark Command Arguments#

class cloudai.workloads.vllm.vllm.VllmBenchCmdArgs(
*,
random_input_len: int = 16,
random_output_len: int = 128,
max_concurrency: int = 16,
num_prompts: int = 30,
**extra_data: Any,
)[source]#

Bases: CmdArgs

vLLM bench serve command arguments.

Semantic Eval Command Arguments#

class cloudai.workloads.vllm.vllm.VllmSemanticEvalCmdArgs(
*,
entrypoint: str = 'python3 /opt/vllm/tests/evals/gsm8k/gsm8k_eval.py',
cli: str = '--host {host} --port {port} --num-questions 200 --save-results {output_path}/vllm-gsm8k.json',
)[source]#

Bases: CmdArgs

vLLM semantic validation command arguments.

Test Definition#

class cloudai.workloads.vllm.vllm.VllmTestDefinition(*, name: str, description: str, test_template_name: str, cmd_args: ~cloudai.workloads.vllm.vllm.VllmCmdArgs, dse_excluded_args: list[str] = <factory>, extra_env_vars: dict[str, str | ~typing.List[str]] = {}, extra_cmd_args: dict[str, str] = {}, extra_container_mounts: list[str] = [], git_repos: list[~cloudai._core.installables.git_repo.GitRepo] = [], nsys: ~cloudai.models.workload.NsysConfiguration | None = None, predictor: ~cloudai.models.workload.PredictorConfig | None = None, agent: str = 'grid_search', agent_steps: int = 1, agent_metrics: list[str] = ['default'], agent_reward_function: str = 'inverse', agent_config: dict[str, ~typing.Any] | None = None, env_params: dict[str, ~cloudai.configurator.env_params.EnvParamSpec] = <factory>, bench_cmd_args: ~cloudai.workloads.vllm.vllm.VllmBenchCmdArgs = VllmBenchCmdArgs(random_input_len=16, random_output_len=128, max_concurrency=16, num_prompts=30), semantic_eval_cmd_args: ~cloudai.workloads.vllm.vllm.VllmSemanticEvalCmdArgs | None = None, proxy_script_repo: ~cloudai._core.installables.git_repo.GitRepo | None = None, custom_bash: str | dict[str, str] | None = None)[source]#

Bases: LLMServingTestDefinition[VllmCmdArgs]

Test object for vLLM.

property is_domain_randomization_enabled: bool#

Whether the config declares domain randomization: at least one env_params annotation.

is_dse_excluded_arg(path: str) bool#

Return whether a dot-separated cmd_args path should be ignored by DSE.

is_env_sampled(cmd_args_path: str) bool#

Whether a cmd_args field is env-sampled (env draws it per trial, not the agent).

validator validate_env_params  »  all fields#

Validate env_params annotations against cmd_args.

env_params is an annotation: each key names a cmd_args field whose value is the candidate set (the single source of truth), and the entry carries only how to sample. So each key must name a real cmd_args field whose value is a candidate list; a scalar is already fixed, so annotating it is a meaningless label and is rejected here. When weights are declared, the list needs >= 2 values and the weights must align 1:1 with it. Sampling, persistence, the per-trial cmd_args overlay, and the cache key all live in CloudAIGymEnv; keeping this shape check in core lets the overlay stay agent- and workload-agnostic rather than re-implemented per workload.
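
The shape check described above can be sketched in a few lines. This is not the CloudAI validator: the function name is reused only for readability, cmd_args is simplified to a flat dict, and each env_params entry is reduced to an optional list of weights, whereas the real EnvParamSpec structure is not shown in the source.

```python
from typing import Any, Optional


def validate_env_params(
    cmd_args: dict[str, Any], env_params: dict[str, Optional[list[float]]]
) -> None:
    """Check each env_params key against cmd_args (simplified, flat-dict sketch)."""
    for key, weights in env_params.items():
        if key not in cmd_args:
            raise ValueError(f"{key}: not a cmd_args field")
        candidates = cmd_args[key]
        if not isinstance(candidates, list):
            # A scalar is already fixed, so annotating it is meaningless.
            raise ValueError(f"{key}: value must be a candidate list, got a scalar")
        if weights is not None:
            if len(candidates) < 2:
                raise ValueError(f"{key}: weighted sampling needs at least 2 candidates")
            if len(weights) != len(candidates):
                raise ValueError(f"{key}: weights must align 1:1 with candidates")


# Hypothetical config: port has two candidates with weights, model is a scalar.
cmd_args = {"port": [8300, 8301], "model": "Qwen/Qwen3-0.6B"}
validate_env_params(cmd_args, {"port": [0.7, 0.3]})  # passes silently
```

Annotating model in this example would raise, because its value is a scalar rather than a candidate list.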