vLLM#
The vLLM workload (test_template_name is vllm) allows users to execute vLLM benchmarks within the CloudAI framework.
vLLM is a high-throughput and memory-efficient inference engine for LLMs. This workload supports both aggregated and disaggregated prefill/decode modes.
Usage Examples#
Test and Scenario Examples#
Test TOML:
name = "vllm_test"
description = "Example vLLM test"
test_template_name = "vllm"
[cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"
[bench_cmd_args]
random_input_len = 16
random_output_len = 128
max_concurrency = 16
num_prompts = 30
name = "vllm-benchmark"
[[Tests]]
id = "vllm.1"
num_nodes = 1
time_limit = "00:10:00"
test_name = "vllm_test"
Test-in-Scenario Example#
name = "vllm-benchmark"
[[Tests]]
id = "vllm.1"
num_nodes = 1
time_limit = "00:10:00"
name = "vllm_test"
description = "Example vLLM test"
test_template_name = "vllm"
[Tests.cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"
[Tests.bench_cmd_args]
random_input_len = 16
random_output_len = 128
max_concurrency = 16
num_prompts = 30
Controlling the Number of GPUs#
The number of GPUs can be controlled using the options below, listed from lowest to highest priority:
1. gpus_per_node system property (scalar value)
2. CUDA_VISIBLE_DEVICES environment variable (comma-separated list of GPU IDs)
3. gpu_ids command argument in the prefill and decode configurations (comma-separated list of GPU IDs). In disaggregated mode (prefill is set), both prefill and decode should define gpu_ids, or neither should set it.
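A minimal sketch combining all three levels (the system TOML lives in a separate file from the test TOML; values are illustrative):

System TOML (lowest priority):
gpus_per_node = 8

Test TOML (CUDA_VISIBLE_DEVICES overrides the system property; explicit gpu_ids override both):
[extra_env_vars]
CUDA_VISIBLE_DEVICES = "0,1,2,3"

[cmd_args.prefill]
gpu_ids = "0,1"

[cmd_args.decode]
gpu_ids = "2,3"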
Controlling Disaggregation#
By default, vLLM runs without disaggregation as a single process. To enable disaggregation, set the prefill configuration:
[cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"
[cmd_args.prefill]
[extra_env_vars]
CUDA_VISIBLE_DEVICES = "0,1,2,3"
The config above will automatically split the GPUs listed in CUDA_VISIBLE_DEVICES into two halves:
- The first half will be used for prefill
- The second half will be used for decode
For example, with CUDA_VISIBLE_DEVICES = "0,1,2,3", GPUs 0 and 1 serve prefill while GPUs 2 and 3 serve decode.
For more control, users can specify the GPU IDs explicitly in prefill and decode configurations:
[cmd_args.prefill]
gpu_ids = "0,1"
[cmd_args.decode]
gpu_ids = "2,3"
In this case, CUDA_VISIBLE_DEVICES is ignored and only the GPUs specified in gpu_ids are used.
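Putting the pieces together, a fully explicit disaggregated configuration looks like this (a sketch reusing the image and model from the examples above):
[cmd_args]
docker_image_url = "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"

[cmd_args.prefill]
gpu_ids = "0,1"

[cmd_args.decode]
gpu_ids = "2,3"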
Controlling proxy_script#
proxy_script proxies requests from the client to the prefill and decode instances. It is ignored in non-disaggregated mode; its default value is shown in the API documentation below.
It can be overridden by setting proxy_script, for example to use the latest version of the script from the vLLM repository:
[[Tests.git_repos]]
url = "https://github.com/vllm-project/vllm.git"
commit = "main"
mount_as = "/vllm_repo"
[Tests.cmd_args]
docker_image_url = "vllm/vllm-openai:v0.14.0-cu130"
proxy_script = "/vllm_repo/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py"
In this case, the vLLM repository is cloned locally, mounted into the container as /vllm_repo, and the proxy script is used from that mount for the test.
API Documentation#
vLLM Serve Arguments#
- pydantic model cloudai.workloads.vllm.vllm.VllmArgs[source]#
Base command arguments for vLLM instances.
- field nixl_threads: int | list[int] | None = None#
Set kv_connector_extra_config.num_threads for the --kv-transfer-config CLI argument.
- property serve_args_exclude: set[str]#
Fields consumed internally and excluded from generic serve args.
- serialize_serve_arg(key: str, value: object) → list[str][source]#
Serialize a single serve argument to CLI tokens.
- property serve_args: list[str]#
- field gpu_ids: str | list[str] | None = None#
Comma-separated GPU IDs. If not set, all available GPUs will be used.
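Both fields can be set per instance in a test TOML. A brief sketch (giving nixl_threads a list value assumes CloudAI's list-based parameter sweeps, as suggested by the int | list[int] type):
[cmd_args.prefill]
gpu_ids = "0,1"
nixl_threads = 8

[cmd_args.decode]
gpu_ids = "2,3"
nixl_threads = [4, 8]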
Command Arguments#
- class cloudai.workloads.vllm.vllm.VllmCmdArgs(
- *,
- docker_image_url: str,
- model: str = 'Qwen/Qwen3-0.6B',
- port: Annotated[int, Ge(ge=1), Le(le=65535)] = 8000,
- serve_wait_seconds: int = 300,
- prefill: VllmArgs | None = None,
- decode: VllmArgs = <factory>,
- proxy_script: str = '/opt/vllm/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py',
- )
Bases: LLMServingCmdArgs[VllmArgs]
vLLM serve command arguments.
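These fields map to the [cmd_args] table of a test TOML; a short sketch overriding the defaults (the port and wait values are illustrative):
[cmd_args]
docker_image_url = "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"
port = 8100
serve_wait_seconds = 600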
Benchmark Command Arguments#
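The benchmark arguments correspond to the [bench_cmd_args] table used in the usage examples; their defaults, visible in the test definition below, are:
[bench_cmd_args]
random_input_len = 16
random_output_len = 128
max_concurrency = 16
num_prompts = 30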
Test Definition#
- class cloudai.workloads.vllm.vllm.VllmTestDefinition(
- *,
- name: str,
- description: str,
- test_template_name: str,
- cmd_args: VllmCmdArgs,
- extra_env_vars: dict[str, str | List[str]] = {},
- extra_cmd_args: dict[str, str] = {},
- extra_container_mounts: list[str] = [],
- git_repos: list[GitRepo] = [],
- nsys: NsysConfiguration | None = None,
- predictor: PredictorConfig | None = None,
- agent: str = 'grid_search',
- agent_steps: int = 1,
- agent_metrics: list[str] = ['default'],
- agent_reward_function: str = 'inverse',
- agent_config: dict[str, Any] | None = None,
- bench_cmd_args: VllmBenchCmdArgs = VllmBenchCmdArgs(random_input_len=16, random_output_len=128, max_concurrency=16, num_prompts=30),
- proxy_script_repo: GitRepo | None = None,
- )
Bases: LLMServingTestDefinition[VllmCmdArgs]
Test object for vLLM.
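Most of these fields map one-to-one to test TOML keys; a closing sketch exercising a few optional ones (the container mount path is hypothetical):
name = "vllm_test"
description = "Example vLLM test"
test_template_name = "vllm"
extra_container_mounts = ["/data:/data"]

[extra_env_vars]
CUDA_VISIBLE_DEVICES = "0,1,2,3"

[cmd_args]
docker_image_url = "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"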