NIXL EP#

This workload (test_template_name is NixlEP) runs the NIXL Elastic EP benchmark through a Slurm-managed multi-node elastic launcher flow.

Overview#

The Slurm launch model is:

  • one elastic.py process per node, started in sequence as the plan progresses

  • the master node starts first and exposes a TCPStore for rank coordination

  • follower nodes connect via --tcp-server $master_ip once the master is ready

  • the benchmark runtime comes from the container image

  • each run serializes its plan JSON into the output directory

Plan Format#

The plan field is a JSON-encoded list of phases. Each phase is a list of rank indices passed directly to the benchmark. CloudAI uses the following convention to drive the elastic launcher:

  • Positive rank index — the rank is active. A rank that is new relative to the previous phase causes CloudAI to fire an additional srun for that worker.

  • Negative rank index (e.g. -6) — signals a contraction: the benchmark sees the absolute value and treats it as temporarily removed. No new srun is launched for negative indices.

  • Omitted rank — a rank present in an earlier phase but absent from the current phase list is not relaunched. The benchmark’s own phase logic handles its inactivity.

Example:

[[0, 1, 2, 3],              # phase 0: ranks 0–3 start
 [0, 1, 2, 3, 4, 5, 6, 7], # phase 1: ranks 4–7 join (expansion)
 [0, 1, 2, 3, 4, -6, 7],   # phase 2: rank 6 contracted (no new launch)
 [0, 1, 2, 3, 4, 5, 6, 7]] # phase 3: rank 6 rejoins (new launch for rank 6)

Phase completion is detected by polling the primary log for -> end phase N markers.

Usage Examples#

Test TOML example:

name = "nixl-ep-expansion-contraction"
description = "NIXL Elastic EP expansion/contraction benchmark"
test_template_name = "NixlEP"

[cmd_args]
docker_image_url = "<docker container url here>"
elastic_script = "/workspace/nixl/examples/device/ep/tests/elastic/elastic.py"
plan = "[[0, 1, 2, 3], [0, 1, 2, 3, 4, 5, 6, 7], [0, 1, 2, 3, 4, -6, 7], [0, 1, 2, 3, 4, 5, 6, 7]]"
num_processes_per_node = 4
num_tokens = 256
num_experts_per_rank = 4
hidden_dim = 8192
num_topk = 6
disable_ll_nvlink = true

Test-in-Scenario example:

name = "nixl-ep-expansion-contraction"

[[Tests]]
id = "nixl_ep.expansion_contraction"
num_nodes = 3
time_limit = "00:30:00"

name = "nixl-ep-expansion-contraction"
description = "NIXL Elastic EP expansion/contraction benchmark"
test_template_name = "NixlEP"

  [Tests.cmd_args]
  docker_image_url = "<docker container url here>"
  elastic_script = "/workspace/nixl/examples/device/ep/tests/elastic/elastic.py"
  plan = "[[0, 1, 2, 3], [0, 1, 2, 3, 4, 5, 6, 7], [0, 1, 2, 3, 4, -6, 7], [0, 1, 2, 3, 4, 5, 6, 7]]"
  num_processes_per_node = 4
  num_tokens = 256
  num_experts_per_rank = 4
  hidden_dim = 8192
  num_topk = 6
  disable_ll_nvlink = true

Reporting#

After a run completes, CloudAI prints a single table with one row per (node, rank) measurement. The Phases column shows each phase index colour-coded green (passed) or red (failed). Bandwidth columns report dispatch+combine throughput and timing per rank.

The reported metric (default) is the mean dispatch+combine bandwidth in GB/s across all ranks.

API Documentation#

Command Arguments#

pydantic model cloudai.workloads.nixl_ep.nixl_ep.NixlEPCmdArgs[source]#

Command line arguments for the NIXL Elastic EP benchmark.

field docker_image_url: str [Required]#

URL of the Docker image that contains the NIXL EP benchmark.

field elastic_script: str = '/workspace/nixl/examples/device/ep/tests/elastic/elastic.py'#

Path to the benchmark entrypoint, relative to the container’s NIXL runtime root or absolute in the container.

field python_executable: str = 'python3'#

Python executable to use inside the container.

field plan: str | list[str] [Required]#

Serialized phase plan to write into a per-run JSON file. Use a single string such as “[[0, 1], [0, 1, 2, 3]]” for a single run, or a list of such strings to enable DSE mode (one run per plan).

field num_processes_per_node: int | list[int] [Required]#

Number of local worker processes to spawn on each allocated node.

field service_startup_timeout_seconds: int = 60#

Seconds to wait for the master node’s TCPStore to accept connections.

Constraints:
  • ge = 1

field store_port: int = 9999#

TCPStore port used by the benchmark.

Constraints:
  • ge = 1

  • le = 65535

parse_plan() list[list[int]][source]#

Test Definition#

class cloudai.workloads.nixl_ep.nixl_ep.NixlEPTestDefinition(
*,
name: str,
description: str,
test_template_name: str,
cmd_args: NixlEPCmdArgs,
extra_env_vars: dict[str, str | List[str]] = {},
extra_cmd_args: dict[str, str] = {},
extra_container_mounts: list[str] = [],
git_repos: list[GitRepo] = [],
nsys: NsysConfiguration | None = None,
predictor: PredictorConfig | None = None,
agent: str = 'grid_search',
agent_steps: int = 1,
agent_metrics: list[str] = ['default'],
agent_reward_function: str = 'inverse',
agent_config: dict[str, Any] | None = None,
)[source]#

Bases: TestDefinition

Test definition for the NIXL Elastic EP benchmark.