vLLM#
This workload (test_template_name is vllm) allows users to execute vLLM benchmarks within the CloudAI framework.
vLLM is a high-throughput and memory-efficient inference engine for LLMs. This workload supports both aggregated and disaggregated prefill/decode modes.
Usage Examples#
Test + Scenario example#
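Test TOML: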
name = "vllm_test"
description = "Example vLLM test"
test_template_name = "vllm"
[cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"
[bench_cmd_args]
random_input_len = 16
random_output_len = 128
max_concurrency = 16
num_prompts = 30
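Test Scenario TOML: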
name = "vllm-benchmark"
[[Tests]]
id = "vllm.1"
num_nodes = 1
time_limit = "00:10:00"
test_name = "vllm_test"
Test-in-Scenario example#
name = "vllm-benchmark"
[[Tests]]
id = "vllm.1"
num_nodes = 1
time_limit = "00:10:00"
name = "vllm_test"
description = "Example vLLM test"
test_template_name = "vllm"
[Tests.cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"
[Tests.bench_cmd_args]
random_input_len = 16
random_output_len = 128
max_concurrency = 16
num_prompts = 30
Control number of GPUs#
The number of GPUs can be controlled using the options below, listed from lowest to highest priority (an example follows the list):
1. gpus_per_node system property (scalar value)
2. CUDA_VISIBLE_DEVICES environment variable (comma-separated list of GPU IDs)
3. gpu_ids command argument for prefill and decode configurations (comma-separated list of GPU IDs). If disaggregated mode is used (prefill is set), either both prefill and decode should define gpu_ids, or neither should.
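For example, an aggregated run can be limited to two GPUs through the environment variable. This is a minimal sketch (the GPU IDs are illustrative; the gpus_per_node system property is configured at the system level and is not shown here):
[cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"
[extra_env_vars]
# Overrides gpus_per_node; itself overridden by gpu_ids if that is set.
CUDA_VISIBLE_DEVICES = "0,1"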
Control disaggregation#
By default, vLLM will run without disaggregation as a single process. To enable disaggregation, set the prefill configuration:
[cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"
[cmd_args.prefill]
[extra_env_vars]
CUDA_VISIBLE_DEVICES = "0,1,2,3"
The config above will automatically split the GPUs specified in CUDA_VISIBLE_DEVICES into two halves: the first half will be used for prefill and the second half for decode. In this example, prefill will get GPUs 0 and 1, and decode will get GPUs 2 and 3.
For more control, one can specify the GPU IDs explicitly in prefill and decode configurations:
[cmd_args.prefill]
gpu_ids = "0,1"
[cmd_args.decode]
gpu_ids = "2,3"
In this case CUDA_VISIBLE_DEVICES will be ignored and only the GPUs specified in gpu_ids will be used.
Control proxy_script#
proxy_script is the script used to proxy requests from the client to the prefill and decode instances. It is ignored in non-disaggregated mode; the default value can be found in the API documentation below.
It can be overridden, for example to use the latest version of the script from the vLLM repository:
[[Tests.git_repos]]
url = "https://github.com/vllm-project/vllm.git"
commit = "main"
mount_as = "/vllm_repo"
[Tests.cmd_args]
docker_image_url = "vllm/vllm-openai:v0.14.0-cu130"
proxy_script = "/vllm_repo/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py"
In this case, the vLLM repository will be cloned locally and mounted as /vllm_repo, and the proxy script from that checkout will be used for the test.
API Documentation#
vLLM Serve Arguments#
- pydantic model cloudai.workloads.vllm.vllm.VllmArgs#
Base command arguments for vLLM instances.
- field gpu_ids: str | list[str] | None = None#
Comma-separated GPU IDs. If not set, will use all available GPUs.
- field nixl_threads: int | list[int] | None = None#
Set kv_connector_extra_config.num_threads for the --kv-transfer-config CLI argument.
- property serve_args: list[str]#
Convert cmd_args_dict to command-line arguments list for vllm serve.
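In a test TOML these per-instance fields go under the prefill and decode tables. A minimal sketch with illustrative values:
[cmd_args.prefill]
gpu_ids = "0,1"
# Sets kv_connector_extra_config.num_threads in --kv-transfer-config.
nixl_threads = 8
[cmd_args.decode]
gpu_ids = "2,3"
nixl_threads = 8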
Command Arguments#
- class cloudai.workloads.vllm.vllm.VllmCmdArgs(
- *,
- docker_image_url: str,
- port: int = 8000,
- vllm_serve_wait_seconds: int = 300,
- proxy_script: str = '/opt/vllm/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py',
- model: str = 'Qwen/Qwen3-0.6B',
- prefill: ~cloudai.workloads.vllm.vllm.VllmArgs | None = None,
- decode: ~cloudai.workloads.vllm.vllm.VllmArgs = <factory>,
- )
Bases: CmdArgs
vLLM serve command arguments.
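These arguments live under [cmd_args] in a test TOML. A minimal sketch that overrides two of the defaults shown above (the override values are illustrative):
[cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"
# Defaults: port = 8000, vllm_serve_wait_seconds = 300.
port = 8100
vllm_serve_wait_seconds = 600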
Benchmark Command Arguments#
Test Definition#
- class cloudai.workloads.vllm.vllm.VllmTestDefinition(
- *,
- name: str,
- description: str,
- test_template_name: str,
- cmd_args: VllmCmdArgs,
- extra_env_vars: dict[str, str | List[str]] = {},
- extra_cmd_args: dict[str, str] = {},
- extra_container_mounts: list[str] = [],
- git_repos: list[GitRepo] = [],
- nsys: NsysConfiguration | None = None,
- predictor: PredictorConfig | None = None,
- agent: str = 'grid_search',
- agent_steps: int = 1,
- agent_metrics: list[str] = ['default'],
- agent_reward_function: str = 'inverse',
- bench_cmd_args: VllmBenchCmdArgs = VllmBenchCmdArgs(random_input_len=16, random_output_len=128, max_concurrency=16, num_prompts=30),
- proxy_script_repo: GitRepo | None = None,
- )
Bases: TestDefinition
Test object for vLLM.
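The agent fields enable parameter sweeps over the test definition. A minimal sketch, assuming that list-valued cmd_args fields (such as nixl_threads, typed int | list[int] above) are the values expanded by the grid_search agent:
name = "vllm_sweep"
description = "Illustrative sweep over NIXL thread counts"
test_template_name = "vllm"
# grid_search is the default agent; shown here for clarity.
agent = "grid_search"
[cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"
[cmd_args.prefill]
gpu_ids = "0,1"
nixl_threads = [4, 8]
[cmd_args.decode]
gpu_ids = "2,3"
nixl_threads = [4, 8]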