SGLang#
This workload (test_template_name is sglang) allows users to execute SGLang benchmarks within the CloudAI framework.
SGLang is a high-throughput and memory-efficient inference engine for LLMs. This workload supports both aggregated and disaggregated prefill/decode modes.
Usage Examples#
Test + Scenario example#
Test definition:

```toml
name = "sglang_test"
description = "Example SGLang benchmark"
test_template_name = "sglang"

[cmd_args]
docker_image_url = "lmsysorg/sglang:dev-cu13"
model = "Qwen/Qwen3-8B"

[bench_cmd_args]
random_input = 16
random_output = 128
max_concurrency = 16
num_prompts = 30
```

Scenario:

```toml
name = "sglang-benchmark"

[[Tests]]
id = "sglang.1"
num_nodes = 1
time_limit = "00:10:00"
test_name = "sglang_test"
```
Test-in-Scenario example#
```toml
name = "sglang-benchmark"

[[Tests]]
id = "sglang.1"
num_nodes = 1
time_limit = "00:10:00"
name = "sglang_test"
description = "Example SGLang benchmark"
test_template_name = "sglang"

[Tests.cmd_args]
docker_image_url = "lmsysorg/sglang:dev-cu13"
model = "Qwen/Qwen3-8B"

[Tests.bench_cmd_args]
random_input = 16
random_output = 128
max_concurrency = 16
num_prompts = 30
```
Control number of GPUs#
The number of GPUs can be controlled using the options below, listed from lowest to highest priority:
1. gpus_per_node system property (scalar value)
2. CUDA_VISIBLE_DEVICES environment variable (comma-separated list of GPU IDs)
3. gpu_ids command argument for the prefill and decode configurations (comma-separated list of GPU IDs). In disaggregated mode (when prefill is set), both prefill and decode must define gpu_ids, or neither should.
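The highest-priority option can be combined with the others; in the hypothetical test definition below, the explicit gpu_ids take precedence over the CUDA_VISIBLE_DEVICES environment variable (all values are illustrative):

```toml
[extra_env_vars]
CUDA_VISIBLE_DEVICES = "0,1,2,3,4,5,6,7"  # lower priority; overridden below

[cmd_args.prefill]
gpu_ids = "0,1"  # highest priority: these IDs are used for prefill

[cmd_args.decode]
gpu_ids = "2,3"  # and these for decode
```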
Control disaggregation#
By default, SGLang runs as a single process without disaggregation. To enable disaggregation, set the prefill configuration:
```toml
[cmd_args]
docker_image_url = "lmsysorg/sglang:dev-cu13"
model = "Qwen/Qwen3-8B"

[cmd_args.prefill]

[extra_env_vars]
CUDA_VISIBLE_DEVICES = "0,1,2,3"
```
The config above automatically splits the GPUs listed in CUDA_VISIBLE_DEVICES into two halves: the first half is used for prefill and the second half for decode.
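The splitting behavior can be sketched as follows (an illustrative sketch, not CloudAI's actual implementation; split_gpus is a hypothetical helper):

```python
def split_gpus(cuda_visible_devices: str) -> tuple[str, str]:
    """Split a comma-separated GPU ID list into prefill and decode halves."""
    ids = cuda_visible_devices.split(",")
    half = len(ids) // 2
    # First half goes to prefill, second half to decode.
    return ",".join(ids[:half]), ",".join(ids[half:])

prefill_ids, decode_ids = split_gpus("0,1,2,3")
# prefill_ids == "0,1", decode_ids == "2,3"
```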
For more control, one can specify the GPU IDs explicitly in prefill and decode configurations:
```toml
[cmd_args.prefill]
gpu_ids = "0,1"

[cmd_args.decode]
gpu_ids = "2,3"
```
In this case, CUDA_VISIBLE_DEVICES is ignored and only the GPUs specified in gpu_ids are used.
API Documentation#
SGLang Serve Arguments#
- pydantic model cloudai.workloads.sglang.sglang.SglangArgs[source]#
Base command arguments for SGLang instances.
- field disaggregation_transfer_backend: str | list[str] | None = None#
Transfer backend used in disaggregated mode. It is consumed by command generation and not emitted as a generic serve argument.
- property serve_args_exclude: set[str]#
Fields consumed internally and excluded from generic serve args.
- property serve_args: list[str]#
- field gpu_ids: str | list[str] | None = None#
Comma-separated GPU IDs. If not set, all available GPUs will be used.
Command Arguments#
- pydantic model cloudai.workloads.sglang.sglang.SglangCmdArgs[source]#
Bases: LLMServingCmdArgs[SglangArgs]
SGLang serve command arguments.
- field model: str = 'Qwen/Qwen3-8B'#
- field port: int = 8000#
- field health_endpoint: str = '/health'#
- field serve_module: str = 'sglang.launch_server'#
- field router_module: str = 'sglang_router.launch_router'#
- field bench_module: str = 'sglang.bench_serving'#
- field prefill: SglangArgs | None = None#
Prefill instance arguments. If not set, a single instance without disaggregation is used.
- field decode: SglangArgs [Optional]#
Decode instance arguments.
- field docker_image_url: str [Required]#
- field serve_wait_seconds: int = 300#
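Defaults for the serve fields above can be overridden in a test TOML; the sketch below is illustrative (the port and wait values are assumptions, not recommendations):

```toml
[cmd_args]
docker_image_url = "lmsysorg/sglang:dev-cu13"
model = "Qwen/Qwen3-8B"
port = 8001               # serve on a non-default port
serve_wait_seconds = 600  # allow more time for the server to become healthy
```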
Benchmark Command Arguments#
- pydantic model cloudai.workloads.sglang.sglang.SglangBenchCmdArgs[source]#
Bases: CmdArgs
SGLang bench_serving command arguments.
- field backend: str = 'sglang'#
- field dataset_name: str = 'random'#
- field num_prompts: int = 30#
- field max_concurrency: int = 16#
- field random_input: int = 16#
- field random_output: int = 128#
- field warmup_requests: int = 2#
- field random_range_ratio: float = 1.0#
- field output_details: bool = True#
Test Definition#
- pydantic model cloudai.workloads.sglang.sglang.SglangTestDefinition[source]#
Bases: LLMServingTestDefinition[SglangCmdArgs]
Test object for SGLang.
- field bench_cmd_args: SglangBenchCmdArgs = SglangBenchCmdArgs(backend='sglang', dataset_name='random', num_prompts=30, max_concurrency=16, random_input=16, random_output=128, warmup_requests=2, random_range_ratio=1.0, output_details=True)#
- property cmd_args_dict: Dict[str, str | List[str]]#
- constraint_check(tr: TestRun, system: System | None)#
- property docker_image: DockerImage#
- property extra_args_str: str#
- property extra_installables: list[Installable]#
- property hf_model: HFModel#
- property installables: list[Installable]#
- property is_dse_job: bool#
- field cmd_args: LLMServingCmdArgsT [Required]#
- field name: str [Required]#
- field description: str [Required]#
- field test_template_name: str [Required]#
- field extra_env_vars: dict[str, str | List[str]] = {}#
- field extra_cmd_args: dict[str, str] = {}#
- field extra_container_mounts: list[str] = []#
- field git_repos: list[GitRepo] = []#
- field nsys: NsysConfiguration | None = None#
- field predictor: PredictorConfig | None = None#
- field agent: str = 'grid_search'#
- field agent_steps: int = 1#
- field agent_metrics: list[str] = ['default']#
- field agent_reward_function: str = 'inverse'#
- field agent_config: dict[str, Any] | None = None#
Agent configuration.