vLLM#
The vLLM workload (test_template_name is vllm) lets users run vLLM benchmarks within the CloudAI framework.
vLLM is a high-throughput and memory-efficient inference engine for LLMs. This workload supports both aggregated and disaggregated prefill/decode modes.
Usage Examples#
Test and Scenario Examples#
name = "vllm_test"
description = "Example vLLM test"
test_template_name = "vllm"
[cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"
[bench_cmd_args]
random_input_len = 16
random_output_len = 128
max_concurrency = 16
num_prompts = 30
[semantic_eval_cmd_args]
entrypoint = "python3 /opt/vllm/tests/evals/gsm8k/gsm8k_eval.py"
cli = "--host {host} --port {port} --num-questions 200 --save-results {output_path}/vllm-gsm8k.json"
# Scenario file referencing the test above
name = "vllm-benchmark"
[[Tests]]
id = "vllm.1"
num_nodes = 1
time_limit = "00:10:00"
test_name = "vllm_test"
Test-in-Scenario example#
name = "vllm-benchmark"
[[Tests]]
id = "vllm.1"
num_nodes = 1
time_limit = "00:10:00"
name = "vllm_test"
description = "Example vLLM test"
test_template_name = "vllm"
[Tests.cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"
[Tests.bench_cmd_args]
random_input_len = 16
random_output_len = 128
max_concurrency = 16
num_prompts = 30
Semantic Validation#
To run GSM8K semantic validation after the serving benchmark, add semantic_eval_cmd_args. CloudAI reports
accuracy from the eval output, but does not enforce an accuracy threshold.
[semantic_eval_cmd_args]
entrypoint = "python3 /opt/vllm/tests/evals/gsm8k/gsm8k_eval.py"
cli = "--host {host} --port {port} --num-questions 200 --save-results {output_path}/vllm-gsm8k.json"
If the runtime image does not contain the eval script, mount a vLLM repository with existing git_repos support and
point entrypoint at the mounted path.
The cli string supports {model}, {host}, {port}, {url}, {output_path}, and {result_dir}
placeholders.
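As a minimal sketch of how such placeholders could be expanded, the snippet below uses Python's `str.format`. Whether CloudAI uses `str.format` internally is an assumption, and every value in `values` (host, port, paths) is hypothetical; CloudAI supplies the real ones at run time.

```python
# Illustrative only: the substitution mechanism and all values are assumptions.
cli = "--host {host} --port {port} --num-questions 200 --save-results {output_path}/vllm-gsm8k.json"

# Hypothetical values for the six documented placeholders.
values = {
    "model": "Qwen/Qwen3-0.6B",
    "host": "127.0.0.1",
    "port": 8300,
    "url": "http://127.0.0.1:8300",
    "output_path": "/results/vllm.1",
    "result_dir": "/results/vllm.1",
}


def render_cli(template: str, placeholders: dict) -> str:
    """Substitute {name} placeholders; keys the template does not use are ignored."""
    return template.format(**placeholders)


rendered = render_cli(cli, values)
print(rendered)
```

A placeholder name outside the documented set would raise `KeyError` in this sketch, which is a convenient way to catch typos in a custom `cli` string before submitting a job.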
Controlling the Number of GPUs#
GPU selection priority, from lowest to highest:
- gpus_per_node system property (scalar value)
- decode.gpu_ids command argument in non-disaggregated mode when CUDA_VISIBLE_DEVICES is not set
- CUDA_VISIBLE_DEVICES environment variable (comma-separated list of GPU IDs)
- gpu_ids command argument for both prefill and decode configurations in disaggregated mode
In disaggregated mode, define both prefill.gpu_ids and decode.gpu_ids, or omit both.
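The priority order for the non-disaggregated case, together with the "both or neither" rule for disaggregated mode, can be sketched as follows. This is not CloudAI's implementation: the function names are hypothetical, and mapping a scalar gpus_per_node to IDs 0..N-1 is an assumption.

```python
from typing import Optional


def parse_ids(value: str) -> list[int]:
    """Turn a comma-separated string such as "0,1" into [0, 1]."""
    return [int(part) for part in value.split(",") if part.strip()]


def select_gpus_aggregated(
    gpus_per_node: int,
    cuda_visible_devices: Optional[str] = None,
    decode_gpu_ids: Optional[str] = None,
) -> list[int]:
    """Pick GPUs for non-disaggregated mode, highest priority first."""
    if cuda_visible_devices:
        return parse_ids(cuda_visible_devices)
    if decode_gpu_ids:
        # Only consulted when CUDA_VISIBLE_DEVICES is not set.
        return parse_ids(decode_gpu_ids)
    # Assumption: a scalar gpus_per_node means GPU IDs 0..N-1.
    return list(range(gpus_per_node))


def check_disaggregated_gpu_ids(
    prefill_gpu_ids: Optional[str], decode_gpu_ids: Optional[str]
) -> bool:
    """Return True when explicit IDs are used; reject a half-specified config."""
    if (prefill_gpu_ids is None) != (decode_gpu_ids is None):
        raise ValueError("define both prefill.gpu_ids and decode.gpu_ids, or omit both")
    return prefill_gpu_ids is not None


print(select_gpus_aggregated(8, cuda_visible_devices="4,5", decode_gpu_ids="0,1"))
```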
Controlling Disaggregation#
By default, vLLM runs without disaggregation, as a single process. To enable disaggregation, set the prefill configuration:
[cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"
[cmd_args.prefill]
[extra_env_vars]
CUDA_VISIBLE_DEVICES = "0,1,2,3"
The config above automatically splits the GPUs listed in CUDA_VISIBLE_DEVICES into two halves:
- The first half will be used for prefill
- The second half will be used for decode
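The halving described above can be sketched in a few lines of Python. The function name is hypothetical, and the handling of an odd number of GPUs (the extra GPU goes to decode) is an assumption, since only the even case is documented here.

```python
def split_visible_devices(cuda_visible_devices: str) -> tuple[list[str], list[str]]:
    """Split a CUDA_VISIBLE_DEVICES string into (prefill, decode) GPU ID lists."""
    ids = [part.strip() for part in cuda_visible_devices.split(",") if part.strip()]
    if len(ids) < 2:
        raise ValueError("disaggregation needs at least two GPUs")
    half = len(ids) // 2
    # First half serves prefill, second half serves decode.
    # Assumption: with an odd count, decode receives the extra GPU.
    return ids[:half], ids[half:]


prefill_ids, decode_ids = split_visible_devices("0,1,2,3")
print(prefill_ids, decode_ids)
```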
For more control, users can specify the GPU IDs explicitly in prefill and decode configurations:
[cmd_args.prefill]
gpu_ids = "0,1"
[cmd_args.decode]
gpu_ids = "2,3"
In this case, CUDA_VISIBLE_DEVICES is ignored and only the GPUs specified in gpu_ids are used.
Multi-node serving#
For non-disaggregated serving with num_nodes > 1, CloudAI creates one Ray cluster and starts vllm serve on the head node with
--distributed-executor-backend ray.
For disaggregated serving over more than two nodes, set explicit role sizes:
- prefill.num_nodes + decode.num_nodes must equal the test num_nodes.
- CloudAI assigns contiguous node slices: prefill first, decode second.
- tensor_parallel_size is total per role, not per node.
- CUDA_VISIBLE_DEVICES and gpu_ids are local GPU IDs on each serving node.
Example: four prefill nodes and four decode nodes, each with four visible GPUs:
[[Tests]]
id = "vllm.pd_multi_node"
num_nodes = 8
test_template_name = "vllm"
[Tests.cmd_args.prefill]
num_nodes = 4
tensor_parallel_size = 16
[Tests.cmd_args.decode]
num_nodes = 4
tensor_parallel_size = 16
[Tests.extra_env_vars]
CUDA_VISIBLE_DEVICES = "0,1,2,3"
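The arithmetic behind this example can be checked with a short Python sketch. The node names and function names are hypothetical, and the code only mirrors the documented rules (role sizes sum to num_nodes, contiguous slices with prefill first); it is not CloudAI's scheduler.

```python
def assign_role_nodes(nodes: list[str], prefill_nodes: int, decode_nodes: int) -> dict:
    """Slice the allocation contiguously: prefill first, decode second."""
    if prefill_nodes + decode_nodes != len(nodes):
        raise ValueError("prefill.num_nodes + decode.num_nodes must equal num_nodes")
    return {"prefill": nodes[:prefill_nodes], "decode": nodes[prefill_nodes:]}


def gpus_per_role(role_nodes: int, cuda_visible_devices: str) -> int:
    """Total GPUs for a role: nodes times the local GPUs visible on each node."""
    return role_nodes * len(cuda_visible_devices.split(","))


# Hypothetical node names for an 8-node allocation.
nodes = [f"node{i:02d}" for i in range(8)]
roles = assign_role_nodes(nodes, prefill_nodes=4, decode_nodes=4)
total = gpus_per_role(4, "0,1,2,3")
print(roles["prefill"], roles["decode"], total)
```

With four nodes per role and four visible GPUs per node, each role has 16 GPUs in total, which is consistent with tensor_parallel_size = 16 being a per-role total rather than a per-node value.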
Readiness health checks#
Healthcheck fields:
- healthcheck: aggregated server endpoint, default /healthcheck.
- serve_healthcheck: optional override for serve, prefill, and decode servers.
- proxy_healthcheck: disaggregated proxy/router endpoint, default /healthcheck.
If serve_healthcheck is omitted, disaggregated prefill/decode servers keep the legacy /health endpoint. If a
disaggregated config sets healthcheck but omits proxy_healthcheck, the proxy/router uses healthcheck for
backward compatibility.
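One way to read these fallback rules is the Python sketch below. It is an interpretation rather than CloudAI's code: the function name is hypothetical, and `None` stands for "not set in the config", which is an assumption about how an unset field is detected.

```python
from typing import Optional

DEFAULT_HEALTHCHECK = "/healthcheck"   # documented default for healthcheck and proxy_healthcheck
LEGACY_SERVER_HEALTHCHECK = "/health"  # legacy endpoint for disaggregated prefill/decode servers


def resolve_healthchecks(
    disaggregated: bool,
    healthcheck: Optional[str] = None,
    serve_healthcheck: Optional[str] = None,
    proxy_healthcheck: Optional[str] = None,
) -> dict:
    """Map each server role to the endpoint polled for readiness."""
    if not disaggregated:
        # Aggregated mode: serve_healthcheck overrides healthcheck.
        return {"serve": serve_healthcheck or healthcheck or DEFAULT_HEALTHCHECK}
    # Disaggregated mode: prefill/decode use the override or the legacy endpoint.
    server = serve_healthcheck or LEGACY_SERVER_HEALTHCHECK
    # The proxy falls back to healthcheck for backward compatibility.
    proxy = proxy_healthcheck or healthcheck or DEFAULT_HEALTHCHECK
    return {"prefill": server, "decode": server, "proxy": proxy}


print(resolve_healthchecks(disaggregated=True, healthcheck="/ready"))
```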
Controlling proxy_script#
proxy_script proxies requests from the client to the prefill and decode instances. It is ignored in non-disaggregated mode. The default value is listed under Command Arguments below.
To override it, set proxy_script, for example to the latest version of the script from the vLLM repository:
[[Tests.git_repos]]
url = "https://github.com/vllm-project/vllm.git"
commit = "main"
mount_as = "/vllm_repo"
[Tests.cmd_args]
docker_image_url = "vllm/vllm-openai:v0.14.0-cu130"
proxy_script = "/vllm_repo/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py"
In this case the proxy script will be mounted from the vLLM repository (cloned locally) as /vllm_repo and used for the test.
API Documentation#
vLLM Serve Arguments#
- pydantic model cloudai.workloads.vllm.vllm.VllmArgs[source]#
Base command arguments for vLLM instances.
- field ray_head: VllmRayStartArgs | None = None#
Arguments appended to the Ray head startup command for multi-node vLLM roles.
- field ray_worker: VllmRayStartArgs | None = None#
Arguments appended to the Ray worker startup command for multi-node vLLM roles.
- field nixl_threads: int | list[int] | None = None#
Set kv_connector_extra_config.num_threads for the --kv-transfer-config CLI argument.
- property serve_args_exclude: set[str]#
Fields consumed internally and excluded from generic serve args.
- serialize_serve_arg(key: str, value: object) list[str][source]#
Serialize a single serve argument to CLI tokens.
- property serve_args: list[str]#
- field gpu_ids: str | list[str] | None = None#
Comma-separated GPU IDs. If not set, all available GPUs will be used.
- field num_nodes: int | list[int] | None = None#
Number of Slurm nodes assigned to this role in disaggregated serving mode.
Command Arguments#
- class cloudai.workloads.vllm.vllm.VllmCmdArgs(
- *,
- docker_image_url: str,
- model: str = 'Qwen/Qwen3-0.6B',
- port: ~typing.Annotated[int,
- ~annotated_types.Ge(ge=1),
- ~annotated_types.Le(le=65535)] = 8300,
- host: str = '0.0.0.0',
- bench_host: str | None = None,
- healthcheck: str = '/healthcheck',
- serve_healthcheck: str | None = None,
- serve_wait_seconds: int = 300,
- prefill: ~cloudai.workloads.vllm.vllm.VllmArgs | None = None,
- decode: ~cloudai.workloads.vllm.vllm.VllmArgs = <factory>,
- proxy_script: str = '/opt/vllm/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py',
- proxy_healthcheck: str = '/healthcheck',
Bases: LLMServingCmdArgs[VllmArgs]
vLLM serve command arguments.
Benchmark Command Arguments#
Semantic Eval Command Arguments#
- class cloudai.workloads.vllm.vllm.VllmSemanticEvalCmdArgs(
- *,
- entrypoint: str = 'python3 /opt/vllm/tests/evals/gsm8k/gsm8k_eval.py',
- cli: str = '--host {host} --port {port} --num-questions 200 --save-results {output_path}/vllm-gsm8k.json',
Bases: CmdArgs
vLLM semantic validation command arguments.
Test Definition#
- class cloudai.workloads.vllm.vllm.VllmTestDefinition(*, name: str, description: str, test_template_name: str, cmd_args: ~cloudai.workloads.vllm.vllm.VllmCmdArgs, dse_excluded_args: list[str] = <factory>, extra_env_vars: dict[str, str | ~typing.List[str]] = {}, extra_cmd_args: dict[str, str] = {}, extra_container_mounts: list[str] = [], git_repos: list[~cloudai._core.installables.git_repo.GitRepo] = [], nsys: ~cloudai.models.workload.NsysConfiguration | None = None, predictor: ~cloudai.models.workload.PredictorConfig | None = None, agent: str = 'grid_search', agent_steps: int = 1, agent_metrics: list[str] = ['default'], agent_reward_function: str = 'inverse', agent_config: dict[str, ~typing.Any] | None = None, env_params: dict[str, ~cloudai.configurator.env_params.EnvParamSpec] = <factory>, bench_cmd_args: ~cloudai.workloads.vllm.vllm.VllmBenchCmdArgs = VllmBenchCmdArgs(random_input_len=16, random_output_len=128, max_concurrency=16, num_prompts=30), semantic_eval_cmd_args: ~cloudai.workloads.vllm.vllm.VllmSemanticEvalCmdArgs | None = None, proxy_script_repo: ~cloudai._core.installables.git_repo.GitRepo | None = None, custom_bash: str | dict[str, str] | None = None)[source]#
Bases: LLMServingTestDefinition[VllmCmdArgs]
Test object for vLLM.
- property is_domain_randomization_enabled: bool#
Whether the config declares domain randomization: at least one env_params annotation.
- is_dse_excluded_arg(path: str) bool#
Return whether a dot-separated cmd_args path should be ignored by DSE.
- is_env_sampled(cmd_args_path: str) bool#
Whether a cmd_args field is env-sampled (env draws it per trial, not the agent).
- validator validate_env_params » all fields#
Validate env_params annotations against cmd_args.
env_params is an annotation: each key names a cmd_args field whose value is the candidate set (the single source of truth), and the entry carries only how to sample. So each key must name a real cmd_args field whose value is a candidate list; a scalar is already fixed, so annotating it is a meaningless label and is rejected here. When weights are declared, the list needs >= 2 values and the weights must align 1:1 with it. Sampling, persistence, the per-trial cmd_args overlay, and the cache key all live in CloudAIGymEnv; keeping this shape check in core lets the overlay stay agent- and workload-agnostic rather than re-implemented per workload.
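The shape check described above can be sketched with plain dictionaries. This is a simplified illustration, not the actual validator: the function name is hypothetical, cmd_args is flattened to top-level keys, and each env_params entry is modeled as a dict with an optional `weights` list instead of an EnvParamSpec.

```python
def validate_env_params(cmd_args: dict, env_params: dict) -> None:
    """Raise ValueError when an env_params annotation does not match cmd_args."""
    for key, spec in env_params.items():
        if key not in cmd_args:
            raise ValueError(f"env_params key '{key}' is not a cmd_args field")
        candidates = cmd_args[key]
        if not isinstance(candidates, list):
            # A scalar is already fixed, so there is nothing to sample.
            raise ValueError(f"cmd_args.{key} must be a candidate list")
        weights = spec.get("weights")
        if weights is not None:
            if len(candidates) < 2:
                raise ValueError(f"cmd_args.{key} needs >= 2 values when weights are set")
            if len(weights) != len(candidates):
                raise ValueError(f"weights for '{key}' must align 1:1 with its candidates")


# Hypothetical config: serve_wait_seconds is a candidate list, port is a fixed scalar.
cmd_args = {"serve_wait_seconds": [300, 600], "port": 8300}
validate_env_params(cmd_args, {"serve_wait_seconds": {"weights": [0.7, 0.3]}})
print("annotation accepted")
```

Annotating `port` in this sketch would raise `ValueError`, because its value is a scalar rather than a candidate list.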