SGLang#

This workload (test_template_name is sglang) runs SGLang benchmarks within the CloudAI framework.

SGLang is a high-throughput and memory-efficient inference engine for LLMs. This workload supports both aggregated and disaggregated prefill/decode modes.

Usage Examples#

Test + Scenario example#

test.toml (test definition)#
name = "sglang_test"
description = "Example SGLang benchmark"
test_template_name = "sglang"

[cmd_args]
docker_image_url = "lmsysorg/sglang:dev-cu13"
model = "Qwen/Qwen3-8B"

[bench_cmd_args]
random_input = 16
random_output = 128
max_concurrency = 16
num_prompts = 30

[semantic_eval_cmd_args]
entrypoint = "python3 -m sglang.test.run_eval"
cli = "--host {host} --port {port} --eval-name gsm8k --num-examples 200 --num-threads 128 --model {model}"

scenario.toml (scenario with one test)#
name = "sglang-benchmark"

[[Tests]]
id = "sglang.1"
num_nodes = 1
time_limit = "00:10:00"
test_name = "sglang_test"

Test-in-Scenario example#

scenario.toml (separate test toml is not needed)#
name = "sglang-benchmark"

[[Tests]]
id = "sglang.1"
num_nodes = 1
time_limit = "00:10:00"

name = "sglang_test"
description = "Example SGLang benchmark"
test_template_name = "sglang"

[Tests.cmd_args]
docker_image_url = "lmsysorg/sglang:dev-cu13"
model = "Qwen/Qwen3-8B"

[Tests.bench_cmd_args]
random_input = 16
random_output = 128
max_concurrency = 16
num_prompts = 30

Semantic Validation#

To run GSM8K semantic validation after the serving benchmark, add semantic_eval_cmd_args. CloudAI reports accuracy from the eval output, but does not enforce an accuracy threshold.

test.toml (semantic validation)#
[semantic_eval_cmd_args]
entrypoint = "python3 -m sglang.test.run_eval"
cli = "--host {host} --port {port} --eval-name gsm8k --num-examples 200 --num-threads 128 --model {model}"

For images that still use the legacy SGLang GSM8K runner, override the entrypoint and raw CLI:

[semantic_eval_cmd_args]
entrypoint = "python3 -m sglang.test.few_shot_gsm8k"
cli = "--host {host} --port {port} --num-questions 200"

The cli string supports {model}, {host}, {port}, {url}, {output_path}, and {result_dir} placeholders.
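
As a rough illustration of placeholder expansion, the sketch below fills a cli template with Python's str.format. It assumes the placeholders behave like format fields. The helper name render_cli and the host value are hypothetical, and CloudAI's actual substitution code is not shown on this page.

```python
def render_cli(template: str, **values: object) -> str:
    """Fill {name} placeholders in a cli template (illustrative helper only)."""
    return template.format(**values)


cli = (
    "--host {host} --port {port} --eval-name gsm8k "
    "--num-examples 200 --num-threads 128 --model {model}"
)

# Hypothetical values; in a real run CloudAI supplies them.
rendered = render_cli(cli, host="127.0.0.1", port=8300, model="Qwen/Qwen3-8B")
print(rendered)
```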

Readiness health checks#

Healthcheck fields:

  • healthcheck: aggregated server and disaggregated router endpoint, default /v1/models.

  • serve_healthcheck: optional override for serve, prefill, and decode servers.

If serve_healthcheck is omitted, disaggregated prefill/decode servers keep the legacy /health endpoint.
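
The rules above can be summarized as a small lookup. This is a minimal sketch under stated assumptions: the function name and the role labels ("serve", "router", "prefill", "decode") are hypothetical and are not CloudAI identifiers.

```python
from typing import Optional


def readiness_endpoint(
    role: str,
    healthcheck: str = "/v1/models",
    serve_healthcheck: Optional[str] = None,
) -> str:
    """Pick the readiness endpoint for a process role (illustrative only)."""
    if role == "router":
        # The disaggregated router always uses `healthcheck`.
        return healthcheck
    if serve_healthcheck is not None:
        # Explicit override for serve, prefill, and decode servers.
        return serve_healthcheck
    if role in ("prefill", "decode"):
        # Without an override, disaggregated servers keep the legacy endpoint.
        return "/health"
    # Aggregated server without an override.
    return healthcheck


for role in ("serve", "router", "prefill", "decode"):
    print(role, readiness_endpoint(role))
```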

Control number of GPUs#

GPU selection priority, from lowest to highest:

  1. gpus_per_node system property (scalar value)

  2. decode.gpu_ids command argument in non-disaggregated mode when CUDA_VISIBLE_DEVICES is not set

  3. CUDA_VISIBLE_DEVICES environment variable (comma-separated list of GPU IDs)

  4. gpu_ids command argument for both prefill and decode configurations in disaggregated mode

In disaggregated mode, define both prefill.gpu_ids and decode.gpu_ids, or omit both.
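
For the non-disaggregated case, the priority order above can be sketched as follows. The function names are hypothetical, and mapping a scalar gpus_per_node value N to GPU IDs 0..N-1 is an assumption made for illustration. CloudAI's real resolution logic may differ in detail.

```python
from typing import List, Optional


def parse_gpu_ids(value: str) -> List[int]:
    """Turn a comma-separated string such as "0,1,2" into a list of ints."""
    return [int(part) for part in value.split(",") if part.strip()]


def aggregated_gpu_ids(
    gpus_per_node: Optional[int] = None,
    decode_gpu_ids: Optional[str] = None,
    cuda_visible_devices: Optional[str] = None,
) -> List[int]:
    """Resolve GPUs for non-disaggregated mode, highest priority first."""
    if cuda_visible_devices:
        return parse_gpu_ids(cuda_visible_devices)
    if decode_gpu_ids:
        # Only consulted when CUDA_VISIBLE_DEVICES is not set.
        return parse_gpu_ids(decode_gpu_ids)
    if gpus_per_node:
        # Assumption: a scalar count N means GPU IDs 0..N-1.
        return list(range(gpus_per_node))
    return []


print(aggregated_gpu_ids(gpus_per_node=8))
print(aggregated_gpu_ids(gpus_per_node=8, decode_gpu_ids="0,1"))
print(aggregated_gpu_ids(gpus_per_node=8, decode_gpu_ids="0,1", cuda_visible_devices="4,5,6,7"))
```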

Control disaggregation#

By default, SGLang runs as a single process without disaggregation. To enable disaggregation, set the prefill configuration:

test.toml (disaggregated prefill/decode)#
[cmd_args]
docker_image_url = "lmsysorg/sglang:dev-cu13"
model = "Qwen/Qwen3-8B"

[cmd_args.prefill]

[extra_env_vars]
CUDA_VISIBLE_DEVICES = "0,1,2,3"

The config above automatically splits the GPUs listed in CUDA_VISIBLE_DEVICES into two halves: the first half is used for prefill and the second half for decode.
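
The automatic split can be pictured with a few lines of Python. This is a sketch rather than CloudAI's code: the function name is hypothetical, and the handling of an odd GPU count (integer division, so decode gets the extra GPU) is an assumption that the documentation does not specify.

```python
from typing import List, Tuple


def split_visible_gpus(cuda_visible_devices: str) -> Tuple[List[str], List[str]]:
    """Split a CUDA_VISIBLE_DEVICES string into (prefill, decode) halves."""
    ids = [gpu.strip() for gpu in cuda_visible_devices.split(",") if gpu.strip()]
    mid = len(ids) // 2  # assumption: integer division for odd counts
    return ids[:mid], ids[mid:]


prefill_gpus, decode_gpus = split_visible_gpus("0,1,2,3")
print("prefill:", ",".join(prefill_gpus))  # prefill: 0,1
print("decode:", ",".join(decode_gpus))    # decode: 2,3
```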

For more control, specify the GPU IDs explicitly in the prefill and decode configurations:

test.toml (disaggregated prefill/decode)#
[cmd_args.prefill]
gpu_ids = "0,1"

[cmd_args.decode]
gpu_ids = "2,3"

In this case, CUDA_VISIBLE_DEVICES is ignored and only the GPUs listed in gpu_ids are used.
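
A minimal sketch of the explicit case, including the rule that prefill.gpu_ids and decode.gpu_ids must be set together or not at all. The function name and the error message are hypothetical. Returning None here simply stands for "no explicit IDs, use the automatic split".

```python
from typing import List, Optional, Tuple


def explicit_role_gpus(
    prefill_gpu_ids: Optional[str],
    decode_gpu_ids: Optional[str],
) -> Optional[Tuple[List[str], List[str]]]:
    """Return explicit (prefill, decode) GPU lists, or None if neither is set."""
    if (prefill_gpu_ids is None) != (decode_gpu_ids is None):
        # Exactly one side was provided, which the documentation disallows.
        raise ValueError("set both prefill.gpu_ids and decode.gpu_ids, or omit both")
    if prefill_gpu_ids is None:
        return None
    # Explicit IDs take precedence; CUDA_VISIBLE_DEVICES is not consulted.
    return prefill_gpu_ids.split(","), decode_gpu_ids.split(",")


print(explicit_role_gpus("0,1", "2,3"))
print(explicit_role_gpus(None, None))
```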

Multi-node serving#

For non-disaggregated runs with num_nodes > 1, CloudAI starts one sglang.launch_server task per serving node. All tasks share the same --dist-init-addr and --nnodes values, and each passes --node-rank "$SLURM_PROCID".
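
To make the shape of the per-node command concrete, the sketch below assembles the three distributed flags named above. It is an illustration only: the function name, the python3 -m launcher form, and the address node0:5000 are assumptions, and a real command generated by CloudAI carries further arguments (model, port, and so on).

```python
from typing import List


def launch_server_tokens(dist_init_addr: str, nnodes: int) -> List[str]:
    """Build the distributed part of a sglang.launch_server command line."""
    return [
        "python3", "-m", "sglang.launch_server",  # assumed launcher form
        "--dist-init-addr", dist_init_addr,       # same value on every node
        "--nnodes", str(nnodes),                  # same value on every node
        # Left unexpanded on purpose: the shell on each node resolves it,
        # so every task gets its own rank.
        "--node-rank", '"$SLURM_PROCID"',
    ]


# Hypothetical head-node address and node count.
command = " ".join(launch_server_tokens("node0:5000", 2))
print(command)
```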

For disaggregated serving over more than two nodes, set explicit role sizes:

  • prefill.num_nodes + decode.num_nodes must equal the test num_nodes.

  • CloudAI assigns contiguous node slices: prefill first, decode second.

  • tp is the total tensor-parallel size per role, not per node.

  • CUDA_VISIBLE_DEVICES and gpu_ids are local GPU IDs on each serving node.

Example: four prefill nodes and four decode nodes, each with four visible GPUs:

scenario.toml (multi-node disaggregated serving)#
[[Tests]]
id = "sglang.pd_multi_node"
num_nodes = 8
test_template_name = "sglang"

[Tests.cmd_args.prefill]
num_nodes = 4
tp = 16

[Tests.cmd_args.decode]
num_nodes = 4
tp = 16

[Tests.extra_env_vars]
CUDA_VISIBLE_DEVICES = "0,1,2,3"
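
The sizing rules for this example can be checked with a short sketch. The function name and the node names are hypothetical. The last lines only show that tp = 16 in the example matches 4 nodes times 4 visible GPUs per role; whether CloudAI enforces that product is not stated here.

```python
from typing import List, Tuple


def assign_role_nodes(
    nodes: List[str],
    prefill_num_nodes: int,
    decode_num_nodes: int,
) -> Tuple[List[str], List[str]]:
    """Assign contiguous node slices: prefill first, decode second."""
    if prefill_num_nodes + decode_num_nodes != len(nodes):
        raise ValueError("prefill.num_nodes + decode.num_nodes must equal num_nodes")
    return nodes[:prefill_num_nodes], nodes[prefill_num_nodes:]


# Hypothetical allocation of eight nodes (num_nodes = 8).
nodes = [f"node{i}" for i in range(8)]
prefill_nodes, decode_nodes = assign_role_nodes(nodes, 4, 4)

# CUDA_VISIBLE_DEVICES lists local GPU IDs on each serving node.
gpus_per_node = len("0,1,2,3".split(","))
tp_per_role = len(prefill_nodes) * gpus_per_node

print("prefill:", prefill_nodes)
print("decode:", decode_nodes)
print("tp per role:", tp_per_role)
```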

API Documentation#

SGLang Serve Arguments#

pydantic model cloudai.workloads.sglang.sglang.SglangArgs[source]#

Base command arguments for SGLang instances.

field disaggregation_transfer_backend: str | list[str] | None = 'nixl'#

Transfer backend used in disaggregated mode. It is consumed by command generation and not emitted as a generic serve argument.

property serve_args_exclude: set[str]#

Fields consumed internally and excluded from generic serve args.

serialize_serve_arg(
key: str,
value: Any,
) list[str]#

Serialize a single serve argument to CLI tokens.

property serve_args: list[str]#
field gpu_ids: str | list[str] | None = None#

Comma-separated GPU IDs. If not set, all available GPUs will be used.

field num_nodes: int | list[int] | None = None#

Number of Slurm nodes assigned to this role in disaggregated serving mode.

Command Arguments#

pydantic model cloudai.workloads.sglang.sglang.SglangCmdArgs[source]#

Bases: LLMServingCmdArgs[SglangArgs]

SGLang serve command arguments.

field model: str = 'Qwen/Qwen3-8B'#
field serve_module: str = 'sglang.launch_server'#
field router_module: str = 'sglang_router.launch_router'#
field bench_module: str = 'sglang.bench_serving'#
field healthcheck: str = '/v1/models'#

Health check router endpoint.

field prefill: SglangArgs | None = None#

Prefill instance arguments. If not set, a single instance without disaggregation is used.

field decode: SglangArgs [Optional]#

Decode instance arguments.

field docker_image_url: str [Required]#
field port: int = 8300#
Constraints:
  • ge = 1

  • le = 65535

field host: str = '0.0.0.0'#

Host/interface for serve or router processes to bind to.

field bench_host: str | None = None#

Hostname used by the benchmark client. Defaults to the allocated node hostname.

field serve_healthcheck: str | None = None#

Readiness endpoint for serve, prefill, and decode server processes. Defaults to healthcheck.

field serve_wait_seconds: int = 300#

Benchmark Command Arguments#

pydantic model cloudai.workloads.sglang.sglang.SglangBenchCmdArgs[source]#

Bases: CmdArgs

SGLang bench_serving command arguments.

field backend: str = 'sglang'#
field dataset_name: str = 'random'#
field num_prompts: int = 30#
field max_concurrency: int = 16#
field random_input: int = 16#
field random_output: int = 128#
field warmup_requests: int = 2#
field random_range_ratio: float = 1.0#
field output_details: bool = True#

Semantic Eval Command Arguments#

pydantic model cloudai.workloads.sglang.sglang.SglangSemanticEvalCmdArgs[source]#

Bases: CmdArgs

SGLang semantic validation command arguments.

field entrypoint: str = 'python3 -m sglang.test.run_eval'#
field cli: str = '--host {host} --port {port} --eval-name gsm8k --num-examples 200 --num-threads 128 --model {model}'#

Test Definition#

pydantic model cloudai.workloads.sglang.sglang.SglangTestDefinition[source]#

Bases: LLMServingTestDefinition[SglangCmdArgs]

Test object for SGLang.

field bench_cmd_args: SglangBenchCmdArgs = SglangBenchCmdArgs(backend='sglang', dataset_name='random', num_prompts=30, max_concurrency=16, random_input=16, random_output=128, warmup_requests=2, random_range_ratio=1.0, output_details=True)#
field custom_bash: CustomBash | None = None#
field semantic_eval_cmd_args: SglangSemanticEvalCmdArgs | None = None#
was_run_successful(
tr: TestRun,
) JobStatusResult[source]#
property cmd_args_dict: Dict[str, str | List[str]]#
constraint_check(
tr: TestRun,
system: System | None,
) bool#
property docker_image: DockerImage#
property extra_args_str: str#
property extra_installables: list[Installable]#
property hf_model: HFModel#
property installables: list[Installable]#
property is_domain_randomization_enabled: bool#

Whether the config declares domain randomization: at least one env_params annotation.

is_dse_excluded_arg(path: str) bool#

Return whether a dot-separated cmd_args path should be ignored by DSE.

property is_dse_job: bool#
is_env_sampled(cmd_args_path: str) bool#

Whether a cmd_args field is env-sampled (env draws it per trial, not the agent).

field cmd_args: LLMServingCmdArgsT [Required]#
field name: str [Required]#
field description: str [Required]#
field test_template_name: str [Required]#
field dse_excluded_args: list[str] [Optional]#
field extra_env_vars: dict[str, str | List[str]] = {}#
field extra_cmd_args: dict[str, str] = {}#
field extra_container_mounts: list[str] = []#
field git_repos: list[GitRepo] = []#
field nsys: NsysConfiguration | None = None#
field predictor: PredictorConfig | None = None#
field agent: str = 'grid_search'#
field agent_steps: int = 1#
field agent_metrics: list[str] = ['default']#
field agent_reward_function: str = 'inverse'#
field agent_config: dict[str, Any] | None = None#

Agent configuration.

field env_params: dict[str, EnvParamSpec] [Optional]#

Environment parameters sampled by the env per trial. Sibling to cmd_args; not part of the agent’s action space. CloudAIGymEnv samples, persists to env.csv, and includes them in the trajectory cache key.