NIXL Bench

This workload (test_template_name is NIXLBench) runs the NIXL benchmarking suite for network and interconnect performance testing.

Usage Examples

Test TOML example:

name = "my_nixl_bench_test"
description = "Example NIXL Bench test"
test_template_name = "NIXLBench"

[cmd_args]
docker_image_url = "<docker container url here>"
path_to_benchmark = "/workspace/nixlbench/build/nixlbench"
backend = "UCX"
initiator_seg_type = "VRAM"
target_seg_type = "VRAM"
op_type = "READ"
filepath = "/data"
device_list = "11:F:/store0.bin"
# one could also use <num>kb, <num>mb, <num>gb shortcuts
total_buffer_size = 8000000000

Test Scenario example:

name = "nixl-bench-test"

[[Tests]]
id = "bench.1"
num_nodes = 1
time_limit = "00:10:00"

test_name = "my_nixl_bench_test"

Test-in-Scenario example:

name = "nixl-bench-test"

[[Tests]]
id = "bench.1"
num_nodes = 1
time_limit = "00:10:00"

name = "my_nixl_bench_test"
description = "Example NIXL Bench test"
test_template_name = "NIXLBench"

  [Tests.cmd_args]
  docker_image_url = "<docker container url here>"
  path_to_benchmark = "/workspace/nixlbench/build/nixlbench"
  backend = "UCX"
  initiator_seg_type = "DRAM"
  target_seg_type = "DRAM"
  op_type = "WRITE"

API Documentation

Command Arguments

pydantic model cloudai.workloads.nixl_bench.nixl_bench.NIXLBenchCmdArgs

Command line arguments for a NIXL Bench test.

field path_to_benchmark: str [Required]
field etcd_endpoints: str = 'http://$NIXL_ETCD_ENDPOINTS'
field docker_image_url: str [Required]

URL of the Docker image to use for the benchmark.

field etcd_path: str = 'etcd'

Path to the etcd executable.

field wait_etcd_for: int = 60

Number of seconds to wait for etcd to become healthy.
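A wait of this kind is usually a poll loop with a deadline. The following is only a minimal sketch of that pattern, not the code CloudAI uses: the function name wait_until_healthy, the is_healthy callable, and the polling interval are all hypothetical.

```python
import time


def wait_until_healthy(is_healthy, timeout_s=60, interval_s=1.0):
    """Poll is_healthy() until it returns True or timeout_s seconds elapse.

    Hypothetical helper: is_healthy stands in for whatever health probe is
    used (for example, a call against the etcd endpoint).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_healthy():
            return True
        if time.monotonic() >= deadline:
            # Timed out without ever seeing a healthy response.
            return False
        time.sleep(interval_s)
```

With the default shown above, timeout_s=60 mirrors wait_etcd_for = 60.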

field etcd_image_url: str | None = None

Optional URL of the Docker image to use for etcd. By default, etcd runs from the same image as the benchmark.

field filepath: str | None = None

Directory path (in container) for storage operations. Example: /data

field total_buffer_size: str | list[str] | None = None

Total buffer size in bytes. Examples: 1024, 1kb, 1mb, 1gb. Use with device_list. The size is passed to NIXL as an integer number of bytes.
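To illustrate how a shortcut such as 1kb could be turned into an integer byte count, here is a minimal sketch. It is not CloudAI's converter: parse_size is a hypothetical name, and the choice of binary multiples (1kb = 1024 bytes) is an assumption, since the documentation does not say whether the units are base 1000 or base 1024.

```python
import re

# Assumption: binary multiples. The source does not specify the base.
_UNITS = {"": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}


def parse_size(value):
    """Convert 1024, '1kb', '1mb' or '1gb' into an integer number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*(kb|mb|gb)?\s*", str(value).lower())
    if match is None:
        raise ValueError(f"unrecognized size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit or ""]
```

A plain integer such as 8000000000 passes through unchanged, so both spellings can be handled by the same path.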

field device_list: str | list[str] | None = None

Comma-separated device specs, each in the format ‘id:type:path’ (e.g., ‘11:F:/store0.bin,27:K:/dev/nvme0n1’)
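The ‘id:type:path’ format can be split mechanically. The sketch below shows one way to do that; DeviceSpec and parse_device_list are hypothetical names, and treating the id as an integer is an assumption based only on the example values 11 and 27.

```python
from typing import NamedTuple


class DeviceSpec(NamedTuple):
    device_id: int    # assumed to be an integer, as in the documented example
    device_type: str  # for example "F" or "K"
    path: str


def parse_device_list(spec: str) -> list[DeviceSpec]:
    """Split a comma-separated list of 'id:type:path' entries."""
    devices = []
    for entry in spec.split(","):
        # maxsplit=2 keeps any further colons inside the path component.
        parts = entry.strip().split(":", 2)
        if len(parts) != 3:
            raise ValueError(f"expected 'id:type:path', got {entry!r}")
        dev_id, dev_type, path = parts
        devices.append(DeviceSpec(int(dev_id), dev_type, path))
    return devices
```

Calling parse_device_list("11:F:/store0.bin,27:K:/dev/nvme0n1") yields two entries, one per device.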

Test Definition

class cloudai.workloads.nixl_bench.nixl_bench.NIXLBenchTestDefinition(*, name: str, description: str, test_template_name: str, cmd_args: ~cloudai.workloads.nixl_bench.nixl_bench.NIXLBenchCmdArgs, dse_excluded_args: list[str] = <factory>, extra_env_vars: dict[str, str | ~typing.List[str]] = {}, extra_cmd_args: dict[str, str] = {}, extra_container_mounts: list[str] = [], git_repos: list[~cloudai._core.installables.git_repo.GitRepo] = [], nsys: ~cloudai.models.workload.NsysConfiguration | None = None, predictor: ~cloudai.models.workload.PredictorConfig | None = None, agent: str = 'grid_search', agent_steps: int = 1, agent_metrics: list[str] = ['default'], agent_reward_function: str = 'inverse', agent_config: dict[str, ~typing.Any] | None = None, env_params: dict[str, ~cloudai.configurator.env_params.EnvParamSpec] = <factory>)

Bases: NIXLBaseTestDefinition[NIXLBenchCmdArgs]

Test definition for a NIXL Bench test.

property is_domain_randomization_enabled: bool

Whether the config declares domain randomization, that is, whether it has at least one env_params annotation.

is_dse_excluded_arg(path: str) → bool

Return whether a dot-separated cmd_args path should be ignored by DSE.

is_env_sampled(cmd_args_path: str) → bool

Return whether a cmd_args field is env-sampled, meaning the environment draws its value per trial rather than the agent.

validator validate_env_params  »  all fields

Validate env_params annotations against cmd_args.

env_params is an annotation layer over cmd_args: each key names a cmd_args field whose value is the candidate set (the single source of truth), and the entry itself carries only how to sample from that set. Each key must therefore name a real cmd_args field whose value is a candidate list; a scalar is already fixed, so annotating it would be a meaningless label and is rejected here. When weights are declared, the list needs at least 2 values and the weights must align 1:1 with it.

Sampling, persistence, the per-trial cmd_args overlay, and the cache key all live in CloudAIGymEnv. Keeping this shape check in core lets the overlay stay agent- and workload-agnostic rather than being re-implemented per workload.
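The shape rules described for this validator can be sketched in a few lines. This is an illustration under stated assumptions, not the cloudai implementation: check_env_params is a hypothetical function, cmd_args and env_params are modeled as plain dicts with flat keys, and the "weights" entry name is assumed from the wording above rather than taken from EnvParamSpec.

```python
def check_env_params(cmd_args: dict, env_params: dict) -> None:
    """Raise ValueError if an env_params annotation violates the shape rules."""
    for key, spec in env_params.items():
        # Each key must name a real cmd_args field.
        if key not in cmd_args:
            raise ValueError(f"env_params key {key!r} is not a cmd_args field")
        candidates = cmd_args[key]
        # A scalar is already fixed, so annotating it is rejected.
        if not isinstance(candidates, list):
            raise ValueError(f"cmd_args.{key} must be a candidate list")
        weights = spec.get("weights")  # hypothetical entry name
        if weights is not None:
            if len(candidates) < 2:
                raise ValueError(f"cmd_args.{key} needs at least 2 values")
            if len(weights) != len(candidates):
                raise ValueError(f"weights for {key!r} must align 1:1")
```

For example, annotating op_type = ["READ", "WRITE"] with two weights passes, while annotating op_type = "READ" is rejected because the value is a scalar.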