AI Dynamo#

This workload (test_template_name is AIDynamo) runs AI inference benchmarks using the Dynamo framework with distributed prefill and decode workers.

Run using Kubernetes#

Prepare cluster#

Before running the AI Dynamo workload on a Kubernetes cluster, set up the cluster as described in the official Dynamo documentation. The required steps are summarized below:

export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.6.1  # replace with the desired release version

# Install the Dynamo CRDs (cluster-scoped; the chart itself goes into the default namespace)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default

# Install the Dynamo platform into the target namespace
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
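
Before launching jobs, it is worth verifying the installation with standard kubectl commands (exact CRD and pod names vary by release):

# Confirm the Dynamo CRDs were registered
kubectl get crds | grep -i dynamo

# Confirm the platform pods come up in the target namespace
kubectl get pods --namespace ${NAMESPACE}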

Launch and Monitor the Job#

uv run cloudai run --system-config <k8s system toml> \
   --tests-dir conf/experimental/ai_dynamo/test \
   --test-scenario conf/experimental/ai_dynamo/test_scenario/vllm_k8s.toml
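
To monitor the run, standard kubectl commands work; substitute the namespace configured in the system TOML and a pod name taken from the first command's output:

kubectl get pods --namespace <namespace> --watch
kubectl logs --namespace <namespace> --follow <pod name>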

Run using Slurm#

Node Configuration for AI Dynamo#

AI Dynamo jobs use three distinct types of nodes:

  • Frontend node: Hosts the coordination services (etcd, NATS), the frontend server, the request generator (genai-perf), and the first decode worker.

  • Prefill node(s): Handle the prefill stage of inference.

  • Decode node(s): Handle the decode stage of inference (optional, depending on the model and setup).

The total number of nodes required must be:

num_prefill_nodes + num_decode_nodes
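
For example, a scenario with num_prefill_nodes = 1 and num_decode_nodes = 2 must request 3 nodes in total: one prefill node and two decode nodes, the first of which also serves as the frontend node.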

If the number of nodes in the test schema and the test scenario disagree, CloudAI uses the value from the test schema and ignores the one in the test scenario.

All node role assignments and orchestration are automatically managed by CloudAI.

Launch and Monitor the Job#

To run the job:

uv run cloudai run --system-config <slurm system toml> \
   --tests-dir conf/experimental/ai_dynamo/test \
   --test-scenario conf/experimental/ai_dynamo/test_scenario/vllm_slurm.toml

Monitor job progress with either of the following:

watch squeue --me
watch tail -n 4 ./results/<scenario name>/*.txt

The frontend node initially waits for model weights to load on all nodes. Once ready, it launches genai-perf, which begins sending requests to the frontend server. All servers cooperate to complete inference, and the output appears in stdout.txt.
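
Once genai-perf is running, its output can also be followed directly; the path below assumes stdout.txt lands in the scenario's results directory, matching the glob above:

tail -f ./results/<scenario name>/stdout.txt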

Review genai-perf benchmark results#

After job completion, CloudAI will place the output logs and result files in the designated results directory. To analyze performance metrics and validate inference outcomes:

  • Navigate to the results directory (e.g., ./results/...)

  • Most importantly, open the profile_genai_perf.csv file to examine the final benchmarking results

This CSV file includes detailed metrics collected by genai-perf, such as request latency, throughput, and system utilization statistics. Use this data to evaluate the model’s performance and identify bottlenecks or optimization opportunities. The file has two sections, per-request statistics followed by aggregate values, as in the example below:

Metric,avg,min,max,p99,p95,p90,p75,p50,p25
Time To First Token (ms),"1,146.31",249.48,"3,485.23","3,457.97","3,349.56","3,215.06","1,330.93",640.07,286.52
Time To Second Token (ms),26.05,0.00,133.51,96.12,36.56,34.88,34.35,33.55,1.78
Request Latency (ms),"6,406.20","5,371.47","9,608.72","9,436.13","9,046.58","9,028.16","6,549.60","5,690.23","5,493.63"
Inter Token Latency (ms),30.35,27.59,35.60,35.23,33.88,32.53,31.05,30.13,29.04
Output Sequence Length (tokens),174.45,164.00,187.00,186.22,183.10,180.10,177.00,174.00,171.75
Input Sequence Length (tokens),"3,000.05","2,999.00","3,001.00","3,001.00","3,001.00","3,000.00","3,000.00","3,000.00","3,000.00"

Metric,Value
Output Token Throughput (per sec),261.25
Request Throughput (per sec),1.50
Request Count (count),40.00
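
For a quick look at the headline numbers from the shell, the aggregate rows can be pulled straight out of the CSV; a minimal sketch, assuming the file sits in the scenario's results directory as shown above:

# Aggregate throughput and request count (second section of the file)
grep -E 'Throughput|Request Count' ./results/<scenario name>/profile_genai_perf.csv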

API Documentation#

Command Arguments#

class cloudai.workloads.ai_dynamo.ai_dynamo.AIDynamoCmdArgs(
*,
docker_image_url: str,
huggingface_home_container_path: Path = PosixPath('/root/.cache/huggingface'),
dynamo: AIDynamoArgs,
genai_perf: GenAIPerfArgs,
run_script: str = '',
**extra_data: Any,
)[source]#

Bases: CmdArgs

Arguments for AI Dynamo.

docker_image_url: str#
huggingface_home_container_path: Path#
dynamo: AIDynamoArgs#
genai_perf: GenAIPerfArgs#
run_script: str#

Test Definition#

class cloudai.workloads.ai_dynamo.ai_dynamo.AIDynamoTestDefinition(
*,
name: str,
description: str,
test_template_name: str,
cmd_args: AIDynamoCmdArgs,
extra_env_vars: dict[str, str | List[str]] = {},
extra_cmd_args: dict[str, str] = {},
extra_container_mounts: list[str] = [],
git_repos: list[GitRepo] = [],
nsys: NsysConfiguration | None = None,
predictor: PredictorConfig | None = None,
agent: str = 'grid_search',
agent_steps: int = 1,
agent_metrics: list[str] = ['default'],
agent_reward_function: str = 'inverse',
script: File = File(src=PosixPath('/home/runner/work/cloudai/cloudai/src/cloudai/workloads/ai_dynamo/ai_dynamo.sh')),
dynamo_repo: GitRepo = GitRepo(url=https://github.com/ai-dynamo/dynamo.git, commit=f7e468c7e8ff0d1426db987564e60572167e8464),
genai_perf_repo: GitRepo = GitRepo(url=https://github.com/triton-inference-server/perf_analyzer.git, commit=3c0bc9efa1844a82dfcc911f094f5026e6dd9214),
)[source]#

Bases: TestDefinition

Test definition for AI Dynamo.

cmd_args: AIDynamoCmdArgs#
script: File#
dynamo_repo: GitRepo#
genai_perf_repo: GitRepo#
property docker_image: DockerImage#
property hf_model: HFModel#
property installables: list[Installable]#
property python_executable: PythonExecutable#
was_run_successful(
tr: TestRun,
) → JobStatusResult[source]#