# Sample Workflows
This page contains sample workflow configurations that you can use as starting points for your own workflows. You can also access these samples using the `sflow sample` command.

📁 View original sample files: `src/sflow/samples`
## Listing Available Samples
```bash
# List all available samples
sflow sample --list

# Copy a sample to your current directory
sflow sample local_hello_world

# Copy with custom output path
sflow sample local_dag --output my_workflow.yaml
```
## Local Samples
These samples run locally without requiring a Slurm cluster.
### Hello World
A minimal example that demonstrates basic sflow concepts.
```yaml
version: "0.1"

variables:
  WHO:
    description: "who to greet"
    value: Nvidia

workflow:
  name: local_hello_world
  tasks:
    - name: hello
      script:
        - echo "Hello ${WHO}"
```
Run it:
```bash
sflow sample local_hello_world
sflow run -f local_hello_world.yaml
```
### DAG Workflow
A multi-task workflow demonstrating task dependencies, data flow between tasks, and parallel execution.
```yaml
version: "0.1"

variables:
  - name: MODEL_NAME
    type: string
    value: tiny-transformer

workflow:
  name: quickstart_dag
  tasks:
    - name: prepare_data
      script:
        - echo "prepare_data start"
        - echo "model(jinja)=${{ variables.MODEL_NAME }}" > ${SFLOW_WORKFLOW_OUTPUT_DIR}/dataset.txt
        - echo "model(shell)=${MODEL_NAME}" >> ${SFLOW_WORKFLOW_OUTPUT_DIR}/dataset.txt
    - name: preprocess
      depends_on: [prepare_data]
      script:
        - test -f ${SFLOW_WORKFLOW_OUTPUT_DIR}/dataset.txt
        - grep -q "model(jinja)=tiny-transformer" ${SFLOW_WORKFLOW_OUTPUT_DIR}/dataset.txt
        - grep -q "model(shell)=tiny-transformer" ${SFLOW_WORKFLOW_OUTPUT_DIR}/dataset.txt
        - echo "encoded_data ok" > ${SFLOW_WORKFLOW_OUTPUT_DIR}/encoded.txt
    - name: train
      depends_on: [preprocess]
      script:
        - test -f ${SFLOW_WORKFLOW_OUTPUT_DIR}/encoded.txt
        - echo "checkpoint for ${MODEL_NAME}" > ${SFLOW_WORKFLOW_OUTPUT_DIR}/checkpoint.pt
    - name: evaluate_on_dataset1
      depends_on: [train]
      script:
        - test -f ${SFLOW_WORKFLOW_OUTPUT_DIR}/checkpoint.pt
        - echo "accuracy=0.99 dataset=dataset1" > ${SFLOW_TASK_OUTPUT_DIR}/metrics.txt
    - name: evaluate_on_dataset2
      depends_on: [train]
      script:
        - test -f ${SFLOW_WORKFLOW_OUTPUT_DIR}/checkpoint.pt
        - echo "accuracy=0.88 dataset=dataset2" > ${SFLOW_TASK_OUTPUT_DIR}/metrics.txt
    - name: export_model
      depends_on: [evaluate_on_dataset1, evaluate_on_dataset2]
      script:
        - test -f ${SFLOW_WORKFLOW_OUTPUT_DIR}/evaluate_on_dataset1/metrics.txt
        - test -f ${SFLOW_WORKFLOW_OUTPUT_DIR}/evaluate_on_dataset2/metrics.txt
        - echo "exported ${MODEL_NAME}" > ${SFLOW_WORKFLOW_OUTPUT_DIR}/model.onnx
```
Run it:
```bash
sflow sample local_dag
sflow run -f local_dag.yaml --dry-run  # Validate
sflow run -f local_dag.yaml            # Execute
```
## Slurm Samples
These samples require a Slurm cluster with GPU resources.
### SGLang Server + Benchmark (Single Node)
Deploys an SGLang inference server with AIPerf benchmarking on Slurm.
Features:
- SGLang server with FP8 inference
- GPU monitoring
- AIPerf benchmarking client
- Readiness probes for service orchestration
```yaml
version: "0.1"

variables:
  # Slurm Configuration
  SLURM_ACCOUNT:
    description: "SLURM account"
    value: your_account
  SLURM_PARTITION:
    description: "SLURM partition"
    value: your_partition
  SLURM_TIMELIMIT:
    description: "SLURM time limit"
    value: 60
  GPUS_PER_NODE:
    description: "GPUs per node"
    value: 4
  SLURM_NODES:
    description: "Number of nodes"
    value: 1

  # Model Configuration
  HF_MODEL_NAME:
    description: "HF model name"
    value: Qwen/Qwen3-0.6B-FP8
  SERVED_MODEL_NAME:
    description: "Served model name"
    value: Qwen3-0-6B-FP8
  LOCAL_MODEL_PATH:
    description: "Local model path"
    value: /tmp/models/Qwen3-0.6B-FP8

  # SGLang Server Configuration
  NUM_SERVERS:
    description: "Number of servers"
    value: 1
  TP_SIZE:
    description: "Tensor parallel size"
    value: 4
  MAX_RUNNING_REQUESTS:
    description: "Max running requests"
    value: 32

  # Benchmark Configuration
  ISL:
    description: "Input sequence length"
    value: 1024
  OSL:
    description: "Output sequence length"
    value: 1024
  MULTI_ROUND:
    description: "Number of benchmark rounds"
    value: 8
  CONCURRENCY:
    description: "Concurrency"
    value: 32

  # Container Images
  SGLANG_IMAGE:
    description: "SGLang image"
    value: "lmsysorg/sglang:v0.5.7-cu130-runtime"
  AIPERF_IMAGE:
    description: "AIPerf container image"
    value: python:3.12-slim

backends:
  - name: slurm_cluster
    type: slurm
    default: true
    time: ${{ variables.SLURM_TIMELIMIT }}
    nodes: ${{ variables.SLURM_NODES }}
    partition: ${{ variables.SLURM_PARTITION }}
    account: ${{ variables.SLURM_ACCOUNT }}
    gpus_per_node: ${{ variables.GPUS_PER_NODE }}

operators:
  - name: sglang_runtime
    type: srun
    container_name: sglang_runtime
    container_writable: true
    container_mount_home: false
    ntasks_per_node: 1
    mpi: pmix
    extra_args:
      - --container-image=${{ variables.SGLANG_IMAGE }}
  - name: aiperf
    type: srun
    container_name: aiperf
    container_writable: true
    mpi: pmix
    extra_args:
      - --container-image=${{ variables.AIPERF_IMAGE }}

workflow:
  name: sglang_qwen3_0_6b
  timeout: 60m
  variables:
    HEAD_NODE_IP:
      description: "Head node IP"
      value: "${{ backends.slurm_cluster.nodes[0].ip_address }}"
  tasks:
    - name: load_image
      operator:
        name: sglang_runtime
        ntasks_per_node: 1
      script:
        - echo "Image Loaded"
        - sleep 3600
      probes:
        readiness:
          log_watch:
            regex_pattern: "Image Loaded"
            timeout: 1200
            interval: 2
    - name: install_aiperf
      operator:
        name: aiperf
        ntasks_per_node: 1
      script:
        - pip install aiperf==0.3.0
        - hf download ${{ variables.HF_MODEL_NAME }} --local-dir ${{ variables.LOCAL_MODEL_PATH }}
        - echo "AIPerf installed"
        - sleep 3600
      probes:
        readiness:
          log_watch:
            regex_pattern: "AIPerf installed"
            timeout: 1200
            interval: 2
    - name: gpu_monitor
      operator: sglang_runtime
      script:
        - echo "Starting gpu monitor"
        - >
          nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,temperature.gpu,temperature.memory,power.draw,clocks.sm,clocks.mem,memory.total,memory.used
          --format=csv,noheader,nounits -lms 2000 |
          while IFS= read -r input || [ -n "$input" ] ;
          do timestamp=$(date +%s%3N);
          printf "%s.%s,%s\n" "${timestamp:0:10}" "${timestamp:10:3}" "${input}";
          done
          >> ${SFLOW_TASK_OUTPUT_DIR}/gpu_monitor_node_${SLURM_NODEID}_${SLURMD_NODENAME}.log
      probes:
        readiness:
          log_watch:
            regex_pattern: "Starting gpu monitor"
      resources:
        nodes:
          indices: [0]
      depends_on:
        - load_image
        - install_aiperf
    - name: sglang_server
      operator: sglang_runtime
      replicas:
        count: ${{ variables.NUM_SERVERS }}
        policy: parallel
      resources:
        gpus:
          count: ${{ variables.TP_SIZE }}
        nodes:
          indices: [0]
      script:
        - set -x
        - export SGLANG_DISABLE_WATCHDOG=1
        - >
          python -m sglang_router.launch_server --model ${{ variables.HF_MODEL_NAME }}
          --host 0.0.0.0
          --port 8000
          --fp8-gemm-backend flashinfer_trtllm
          --moe-runner-backend flashinfer_trtllm
          --served-model-name ${{ variables.SERVED_MODEL_NAME }}
          --tensor-parallel-size ${{ variables.TP_SIZE }}
          --trust-remote-code
          --max-running-requests ${{ variables.MAX_RUNNING_REQUESTS }}
      probes:
        readiness:
          log_watch:
            regex_pattern: "Workflow completed"
      depends_on:
        - load_image
    - name: benchmark
      operator:
        name: aiperf
        ntasks: 1
      script:
        - set -x
        - >
          aiperf profile --artifact-dir ${SFLOW_WORKFLOW_OUTPUT_DIR}/aiperf_concurrency_${CONCURRENCY}
          --model ${{ variables.SERVED_MODEL_NAME }}
          --tokenizer ${{ variables.LOCAL_MODEL_PATH }}
          --endpoint-type chat
          --endpoint /v1/chat/completions
          --streaming
          --url http://${{ variables.HEAD_NODE_IP }}:8000
          --synthetic-input-tokens-mean ${{ variables.ISL }}
          --synthetic-input-tokens-stddev 0
          --output-tokens-mean ${{ variables.OSL }}
          --output-tokens-stddev 0
          --extra-inputs "max_tokens:${{ variables.OSL }}"
          --extra-inputs "min_tokens:${{ variables.OSL }}"
          --extra-inputs "ignore_eos:true"
          --concurrency ${CONCURRENCY}
          --request-count $((${{ variables.MULTI_ROUND }}*${CONCURRENCY}))
          --warmup-request-count ${CONCURRENCY}
          --num-dataset-entries $((${{ variables.MULTI_ROUND }}*${CONCURRENCY}))
          --random-seed 100
          --ui simple
        - echo "Benchmarking finished"
      resources:
        nodes:
          indices: [0]
      depends_on:
        - sglang_server
        - install_aiperf
```
Run it:
```bash
sflow sample slurm_sglang_server_client

# Validate configuration
sflow run -f slurm_sglang_server_client.yaml \
  --set SLURM_ACCOUNT=your_account \
  --set SLURM_PARTITION=your_partition \
  --dry-run

# Submit to Slurm
sflow batch -f slurm_sglang_server_client.yaml \
  -A your_account -p your_partition -N 1 -G 4 \
  --sbatch-path sglang_job.sh --submit
```
### Dynamo TRT-LLM Disaggregated Inference (Single Node)
Deploys a disaggregated inference setup with separate prefill and decode servers using NVIDIA Dynamo and TensorRT-LLM.
Features:
- Disaggregated prefill/decode architecture
- NATS and etcd for service discovery
- Configurable tensor parallelism
- Sequential benchmark sweeps with variable domains
- Retry policies for server reliability
- File-type artifacts for dynamic configuration
```yaml
version: "0.1"

variables:
  # Slurm Configuration
  SLURM_ACCOUNT:
    description: "SLURM account"
    value: your_account
  SLURM_PARTITION:
    description: "SLURM partition"
    value: your_partition
  SLURM_TIMELIMIT:
    description: "SLURM time limit"
    value: 120
  GPUS_PER_NODE:
    description: "GPUs per node"
    value: 4
  SLURM_NODES:
    description: "Number of nodes"
    value: 1

  # Model Configuration
  SERVED_MODEL_NAME:
    description: "Served model name"
    value: Qwen3-0-6B-FP8
  MODEL_NAME:
    description: "Model path"
    value: Qwen/Qwen3-0.6B-FP8
  LOCAL_MODEL_PATH:
    description: "Local model path"
    value: /tmp/models/Qwen3-0.6B-FP8

  # Prefill Server Configuration
  NUM_CTX_SERVERS:
    description: "Number of context/prefill servers"
    value: 1
  CTX_TP_SIZE:
    description: "Context tensor parallel size"
    value: 2

  # Decode Server Configuration
  NUM_GEN_SERVERS:
    description: "Number of generation/decode servers"
    value: 1
  GEN_TP_SIZE:
    description: "Generation tensor parallel size"
    value: 2

  # Benchmark Configuration with Domain Sweep
  CONCURRENCY:
    description: "Concurrency"
    value: 64
    domain: [64, 128]  # Will create sequential benchmark runs

  # Container Images
  DYNAMO_IMAGE:
    description: "Dynamo TRTLLM container image"
    value: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0

artifacts:
  # File-type artifacts are generated by sflow with dynamic content
  - name: PREFILL_CONFIG
    uri: file://prefill_config.yaml
    content: |
      max_batch_size: 128
      tensor_parallel_size: ${{ variables.CTX_TP_SIZE }}
      # ... additional configuration
  - name: DECODE_CONFIG
    uri: file://decode_config.yaml
    content: |
      tensor_parallel_size: ${{ variables.GEN_TP_SIZE }}
      # ... additional configuration

backends:
  - name: slurm_cluster
    type: slurm
    default: true
    time: ${{ variables.SLURM_TIMELIMIT }}
    nodes: ${{ variables.SLURM_NODES }}
    partition: ${{ variables.SLURM_PARTITION }}
    account: ${{ variables.SLURM_ACCOUNT }}
    gpus_per_node: ${{ variables.GPUS_PER_NODE }}

operators:
  - name: dynamo_trtllm
    type: srun
    container_name: dynamo_trtllm
    container_writable: true
    container_mount_home: false
    mpi: pmix
    extra_args:
      - --container-image=${{ variables.DYNAMO_IMAGE }}

workflow:
  name: dynamo
  timeout: 115m
  variables:
    HEAD_NODE_IP:
      value: "${{ backends.slurm_cluster.nodes[0].ip_address }}"
    ETCD_ENDPOINTS:
      value: "${{ backends.slurm_cluster.nodes[0].ip_address }}:2379"
    NATS_SERVER:
      value: "nats://${{ backends.slurm_cluster.nodes[0].ip_address }}:4222"
  tasks:
    - name: nats_server
      operator: dynamo_trtllm
      script:
        - nats-server -js
      probes:
        readiness:
          tcp_port:
            port: 4222
            timeout: 60
    - name: etcd_server
      operator: dynamo_trtllm
      script:
        - etcd --listen-client-urls "http://0.0.0.0:2379" ...
      probes:
        readiness:
          tcp_port:
            port: 2379
            timeout: 60
    - name: frontend_server
      operator: dynamo_trtllm
      script:
        - python3 -m dynamo.frontend --http-port 8000
      probes:
        readiness:
          tcp_port:
            port: 8000
            timeout: 120
      depends_on:
        - nats_server
        - etcd_server
    - name: prefill_server
      operator:
        name: dynamo_trtllm
        ntasks: ${{ variables.CTX_TP_SIZE }}
      replicas:
        count: ${{ variables.NUM_CTX_SERVERS }}
        policy: parallel
      script:
        - trtllm-llmapi-launch python3 -m dynamo.trtllm --disaggregation-mode prefill ...
      resources:
        gpus:
          count: ${{ variables.CTX_TP_SIZE }}
      probes:
        readiness:
          log_watch:
            regex_pattern: "Setting PyTorch memory fraction"
            timeout: 600
        failure:
          log_watch:
            # Parentheses are escaped so the regex matches the literal traceback header
            regex_pattern: 'Traceback \(most recent call last\)'
      retries:
        count: 3
        interval: 30
        backoff: 2
      depends_on:
        - frontend_server
    - name: decode_server
      operator:
        name: dynamo_trtllm
        ntasks: ${{ variables.GEN_TP_SIZE }}
      replicas:
        count: ${{ variables.NUM_GEN_SERVERS }}
        policy: parallel
      script:
        - trtllm-llmapi-launch python3 -m dynamo.trtllm --disaggregation-mode decode ...
      resources:
        gpus:
          count: ${{ variables.GEN_TP_SIZE }}
      retries:
        count: 3
        interval: 30
        backoff: 2
      depends_on:
        - frontend_server
    - name: benchmark
      operator:
        name: aiperf
        ntasks: 1
      replicas:
        variables:
          - CONCURRENCY  # Sweeps over domain [64, 128]
        policy: sequential
      script:
        - aiperf profile --concurrency ${CONCURRENCY} ...
      depends_on:
        - prefill_server
        - decode_server
        - frontend_server
```
Run it:
```bash
sflow sample slurm_dynamo_trtllm_disagg

# Validate configuration
sflow run -f slurm_dynamo_trtllm_disagg.yaml \
  --set SLURM_ACCOUNT=your_account \
  --set SLURM_PARTITION=your_partition \
  --dry-run

# Submit to Slurm
sflow batch -f slurm_dynamo_trtllm_disagg.yaml \
  -A your_account -p your_partition -N 1 -G 4 \
  --sbatch-path dynamo_job.sh --submit
```
### TRT-LLM Serve Disaggregated Inference (Single Node)
Deploys a disaggregated inference setup with separate prefill and decode servers using TensorRT-LLM's native `trtllm-serve disaggregated` command.
Features:
- Disaggregated prefill/decode architecture with `trtllm-serve`
- Dynamic configuration using file-type artifacts with backend node IP resolution
- Configurable tensor parallelism for prefill and decode servers
- GPU monitoring task
- Sequential benchmark sweeps with variable domains
- Failure probes for error detection
```yaml
version: "0.1"

variables:
  # Slurm Configuration
  SLURM_ACCOUNT:
    description: "SLURM account"
    value: your_account
  SLURM_PARTITION:
    description: "SLURM partition"
    value: your_partition
  SLURM_TIMELIMIT:
    description: "SLURM time limit"
    value: 120
  GPUS_PER_NODE:
    description: "GPUs per node"
    value: 4
  SLURM_NODES:
    description: "Number of nodes"
    value: 1

  # Model Configuration
  SERVED_MODEL_NAME:
    description: "Served model name"
    value: Qwen3-0-6B-FP8
  MODEL_NAME:
    description: "Model path"
    value: Qwen/Qwen3-0.6B-FP8
  LOCAL_MODEL_PATH:
    description: "Local model path"
    value: /tmp/models/Qwen3-0.6B-FP8

  # Prefill Server Configuration
  NUM_CTX_SERVERS:
    description: "Number of context/prefill servers"
    value: 1
  CTX_TP_SIZE:
    description: "Context tensor parallel size"
    value: 2

  # Decode Server Configuration
  NUM_GEN_SERVERS:
    description: "Number of generation/decode servers"
    value: 1
  GEN_TP_SIZE:
    description: "Generation tensor parallel size"
    value: 2

  # Benchmark Configuration with Domain Sweep
  CONCURRENCY:
    description: "Concurrency"
    value: 128
    domain: [128, 256]  # Will create sequential benchmark runs

  # Container Images
  TRTLLM_IMAGE:
    description: "TRT-LLM container image"
    value: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6.post2
  AIPERF_IMAGE:
    description: "AIPerf container image"
    value: python:3.12-slim

artifacts:
  # File-type artifacts with dynamic backend node IP resolution
  - name: SERVER_CONFIG
    uri: file://server_config.yaml
    content: |
      hostname: ${{ backends.slurm_cluster.nodes[0].ip_address }}
      port: 8000
      backend: pytorch
      context_servers:
        num_instances: ${{ variables.NUM_CTX_SERVERS }}
        urls:
          - ${{ backends.slurm_cluster.nodes[0].ip_address }}:8536
      generation_servers:
        num_instances: ${{ variables.NUM_GEN_SERVERS }}
        urls:
          - ${{ backends.slurm_cluster.nodes[0].ip_address }}:8336
  - name: PREFILL_CONFIG
    uri: file://prefill_config.yaml
    content: |
      max_batch_size: 128
      tensor_parallel_size: ${{ variables.CTX_TP_SIZE }}
      # ... additional configuration
  - name: DECODE_CONFIG
    uri: file://decode_config.yaml
    content: |
      tensor_parallel_size: ${{ variables.GEN_TP_SIZE }}
      # ... additional configuration

backends:
  - name: slurm_cluster
    type: slurm
    default: true
    time: ${{ variables.SLURM_TIMELIMIT }}
    nodes: ${{ variables.SLURM_NODES }}
    partition: ${{ variables.SLURM_PARTITION }}
    account: ${{ variables.SLURM_ACCOUNT }}
    gpus_per_node: ${{ variables.GPUS_PER_NODE }}

operators:
  - name: trtllm_container
    type: srun
    container_name: trtllm_container
    container_writable: true
    container_mount_home: false
    mpi: pmix
    extra_args:
      - --container-image=${{ variables.TRTLLM_IMAGE }}
  - name: aiperf
    type: srun
    container_name: aiperf
    container_writable: true
    mpi: pmix
    extra_args:
      - --container-image=${{ variables.AIPERF_IMAGE }}

workflow:
  name: trtllm_server_disagg
  timeout: 115m
  variables:
    HEAD_NODE_IP:
      description: "Head node IP (resolved after allocation)"
      value: "${{ backends.slurm_cluster.nodes[0].ip_address }}"
  tasks:
    - name: load_image
      operator:
        name: trtllm_container
        ntasks_per_node: 1
      script:
        - hf download ${{ variables.MODEL_NAME }} --local-dir ${{ variables.LOCAL_MODEL_PATH }}
        - echo "Image Loaded"
        - sleep 3600
      probes:
        readiness:
          log_watch:
            regex_pattern: "Image Loaded"
            timeout: 1200
    - name: frontend_server
      operator: trtllm_container
      script:
        - cat ${{ artifacts.SERVER_CONFIG.path }}
        - trtllm-serve disaggregated -c ${{ artifacts.SERVER_CONFIG.path }} -t 7200 -r 7200
      resources:
        nodes:
          indices: [0]
      probes:
        readiness:
          log_watch:
            regex_pattern: "Application startup complete"
            timeout: 120
      depends_on:
        - prefill_server
        - decode_server
    - name: prefill_server
      operator:
        name: trtllm_container
        ntasks: ${{ variables.CTX_TP_SIZE }}
        ntasks_per_node: ${{ [ variables.CTX_TP_SIZE, variables.GPUS_PER_NODE ] | min }}
      replicas:
        count: ${{ variables.NUM_CTX_SERVERS }}
        policy: parallel
      script:
        - cat ${{ artifacts.PREFILL_CONFIG.path }}
        - >
          trtllm-llmapi-launch trtllm-serve ${LOCAL_MODEL_PATH}
          --host ${HEAD_NODE_IP}
          --port $((8536 + ${SFLOW_REPLICA_INDEX}))
          --extra_llm_api_options ${{ artifacts.PREFILL_CONFIG.path }}
      resources:
        gpus:
          count: ${{ variables.CTX_TP_SIZE }}
      probes:
        readiness:
          log_watch:
            regex_pattern: "Application startup complete"
            timeout: 600
        failure:
          log_watch:
            # Parentheses are escaped so the regex matches the literal traceback header
            regex_pattern: 'Traceback \(most recent call last\)'
      depends_on:
        - load_image
    - name: decode_server
      operator:
        name: trtllm_container
        ntasks: ${{ variables.GEN_TP_SIZE }}
        ntasks_per_node: ${{ [ variables.GEN_TP_SIZE, variables.GPUS_PER_NODE ] | min }}
      replicas:
        count: ${{ variables.NUM_GEN_SERVERS }}
        policy: parallel
      script:
        - cat ${{ artifacts.DECODE_CONFIG.path }}
        - >
          trtllm-llmapi-launch trtllm-serve ${LOCAL_MODEL_PATH}
          --host ${HEAD_NODE_IP}
          --port $((8336 + ${SFLOW_REPLICA_INDEX}))
          --extra_llm_api_options ${{ artifacts.DECODE_CONFIG.path }}
      resources:
        gpus:
          count: ${{ variables.GEN_TP_SIZE }}
      probes:
        readiness:
          log_watch:
            regex_pattern: "Application startup complete"
            timeout: 600
        failure:
          log_watch:
            regex_pattern: 'Traceback \(most recent call last\)'
      depends_on:
        - load_image
    - name: benchmark
      operator:
        name: aiperf
        ntasks: 1
      replicas:
        variables:
          - CONCURRENCY  # Sweeps over domain [128, 256]
        policy: sequential
      script:
        - aiperf profile --concurrency ${CONCURRENCY} --url http://${HEAD_NODE_IP}:8000 ...
      depends_on:
        - prefill_server
        - decode_server
        - frontend_server
```
Run it:
```bash
sflow sample slurm_trtllm_serve_disagg

# Validate configuration
sflow run -f slurm_trtllm_serve_disagg.yaml \
  --set SLURM_ACCOUNT=your_account \
  --set SLURM_PARTITION=your_partition \
  --dry-run

# Submit to Slurm
sflow batch -f slurm_trtllm_serve_disagg.yaml \
  -A your_account -p your_partition -N 1 -G 4 \
  --sbatch-path trtllm_disagg_job.sh --submit
```
### InfMax Multi-Node Disaggregated Inference (DS-R1)
A production-ready multi-node disaggregated inference setup optimized for large models like DeepSeek-R1 using NVIDIA Dynamo and TensorRT-LLM.
Features:
- Multi-node deployment (default 3 nodes with 4 GPUs each)
- Disaggregated prefill/decode architecture with configurable parallelism
- NATS and etcd for service discovery
- GPU monitoring across all nodes
- MoE (Mixture of Experts) optimization parameters
- Sequential benchmark sweeps with variable domains
- File-type artifacts for dynamic server configuration
- Failure probes for error detection
```yaml
version: "0.1"

variables:
  # Slurm Configuration
  SLURM_ACCOUNT:
    description: "SLURM account"
    value: your_account
  SLURM_PARTITION:
    description: "SLURM partition"
    value: your_partition
  SLURM_TIMELIMIT:
    description: "SLURM time limit"
    value: 120
  GPUS_PER_NODE:
    description: "GPUs per node"
    value: 4
  SLURM_NODES:
    description: "Number of nodes"
    value: 3

  # Model Configuration
  SERVED_MODEL_NAME:
    description: "Served model name"
    value: DS-R1

  # Prefill Server Configuration
  NUM_CTX_SERVERS:
    description: "Number of context/prefill servers"
    value: 1
  CTX_TP_SIZE:
    description: "Context tensor parallel size"
    value: 4
  CTX_BATCH_SIZE:
    description: "Context batch size"
    value: 1
  CTX_MAX_NUM_TOKENS:
    description: "Context max number of tokens"
    value: 8448

  # Decode Server Configuration
  NUM_GEN_SERVERS:
    description: "Number of generation/decode servers"
    value: 1
  GEN_TP_SIZE:
    description: "Generation tensor parallel size"
    value: 8
  GEN_BATCH_SIZE:
    description: "Generation batch size"
    value: 128

  # Benchmark Configuration with Domain Sweep
  CONCURRENCY:
    description: "Concurrency"
    value: 64
    domain: [32, 64]  # Will create sequential benchmark runs

  # Container Images
  DYNAMO_IMAGE:
    description: "Dynamo TRTLLM container image"
    value: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0

artifacts:
  - name: LOCAL_MODEL_PATH
    uri: fs:///path/to/your/model
  - name: PREFILL_CONFIG
    uri: file://prefill_config.yaml
    content: |
      max_batch_size: ${{ variables.CTX_BATCH_SIZE }}
      tensor_parallel_size: ${{ variables.CTX_TP_SIZE }}
      moe_expert_parallel_size: ${{ variables.CTX_TP_SIZE }}
      # ... additional configuration
  - name: DECODE_CONFIG
    uri: file://decode_config.yaml
    content: |
      tensor_parallel_size: ${{ variables.GEN_TP_SIZE }}
      max_batch_size: ${{ variables.GEN_BATCH_SIZE }}
      # ... additional configuration

backends:
  - name: slurm_cluster
    type: slurm
    default: true
    time: ${{ variables.SLURM_TIMELIMIT }}
    nodes: ${{ variables.SLURM_NODES }}
    partition: ${{ variables.SLURM_PARTITION }}
    account: ${{ variables.SLURM_ACCOUNT }}
    gpus_per_node: ${{ variables.GPUS_PER_NODE }}

operators:
  - name: dynamo_trtllm
    type: srun
    container_image: ${{ variables.DYNAMO_IMAGE }}
    container_writable: true
    container_mount_home: false
    mpi: pmix

workflow:
  name: infmax
  timeout: 115m
  variables:
    HEAD_NODE_IP:
      value: "${{ backends.slurm_cluster.nodes[0].ip_address }}"
    ETCD_ENDPOINTS:
      value: "${{ backends.slurm_cluster.nodes[0].ip_address }}:2379"
    NATS_SERVER:
      value: "nats://${{ backends.slurm_cluster.nodes[0].ip_address }}:4222"
  tasks:
    - name: load_image
      operator:
        name: dynamo_trtllm
        ntasks: ${{ variables.SLURM_NODES }}
        ntasks_per_node: 1
      script:
        - echo "Image Loaded"
      probes:
        readiness:
          log_watch:
            regex_pattern: "Image Loaded"
            timeout: 1200
    - name: gpu_monitor
      operator:
        name: dynamo_trtllm
        ntasks_per_node: 1
      resources:
        nodes:
          count: ${{ variables.SLURM_NODES }}
      script:
        - nvidia-smi monitoring...
      depends_on:
        - load_image
    - name: nats_server
      operator: dynamo_trtllm
      script:
        - nats-server -js
      resources:
        nodes:
          indices: [0]
      probes:
        readiness:
          tcp_port:
            port: 4222
      depends_on:
        - load_image
    - name: etcd_server
      operator: dynamo_trtllm
      script:
        - etcd --listen-client-urls "http://0.0.0.0:2379" ...
      resources:
        nodes:
          indices: [0]
      probes:
        readiness:
          tcp_port:
            port: 2379
      depends_on:
        - load_image
    - name: frontend_server
      operator: dynamo_trtllm
      script:
        - python3 -m dynamo.frontend --http-port 8000
      resources:
        nodes:
          indices: [0]
      probes:
        readiness:
          tcp_port:
            port: 8000
      depends_on:
        - nats_server
        - etcd_server
    - name: prefill_server
      operator:
        name: dynamo_trtllm
        ntasks: ${{ variables.CTX_TP_SIZE }}
        ntasks_per_node: ${{ [ variables.CTX_TP_SIZE, variables.GPUS_PER_NODE ] | min }}
      replicas:
        count: ${{ variables.NUM_CTX_SERVERS }}
        policy: parallel
      script:
        - trtllm-llmapi-launch python3 -m dynamo.trtllm --disaggregation-mode prefill ...
      resources:
        gpus:
          count: ${{ variables.CTX_TP_SIZE }}
      probes:
        readiness:
          log_watch:
            regex_pattern: "Setting PyTorch memory fraction"
        failure:
          log_watch:
            # Parentheses are escaped so the regex matches the literal traceback header
            regex_pattern: 'Traceback \(most recent call last\)'
      depends_on:
        - frontend_server
    - name: decode_server
      operator:
        name: dynamo_trtllm
        ntasks: ${{ variables.GEN_TP_SIZE }}
        ntasks_per_node: ${{ [ variables.GEN_TP_SIZE, variables.GPUS_PER_NODE ] | min }}
      replicas:
        count: ${{ variables.NUM_GEN_SERVERS }}
        policy: parallel
      script:
        - trtllm-llmapi-launch python3 -m dynamo.trtllm --disaggregation-mode decode ...
      resources:
        gpus:
          count: ${{ variables.GEN_TP_SIZE }}
      probes:
        readiness:
          log_watch:
            regex_pattern: "Setting PyTorch memory fraction"
        failure:
          log_watch:
            regex_pattern: 'Traceback \(most recent call last\)'
      depends_on:
        - frontend_server
    - name: benchmark
      operator:
        name: aiperf
        ntasks: 1
      replicas:
        variables:
          - CONCURRENCY  # Sweeps over domain [32, 64]
        policy: sequential
      script:
        - aiperf profile --concurrency ${CONCURRENCY} ...
      depends_on:
        - prefill_server
        - decode_server
        - frontend_server
```
Run it:
```bash
sflow sample slurm_infmax_v1_ds_r1

# Validate configuration
sflow run -f slurm_infmax_v1_ds_r1.yaml \
  --set SLURM_ACCOUNT=your_account \
  --set SLURM_PARTITION=your_partition \
  --dry-run

# Submit to Slurm (multi-node)
sflow batch -f slurm_infmax_v1_ds_r1.yaml \
  -A your_account -p your_partition -N 3 -G 4 \
  --sbatch-path infmax_job.sh --submit
```
## Key Concepts Demonstrated
| Sample | Concepts |
|---|---|
| `local_hello_world` | Variables, basic task execution |
| `local_dag` | Task dependencies, parallel execution, built-in env vars |
| `slurm_sglang_server_client` | Slurm backend, operators, probes, replicas, GPU resources |
| `slurm_dynamo_trtllm_disagg` | Service discovery (NATS/etcd), retry policies, multi-process tasks |
| `slurm_trtllm_serve_disagg` | Artifacts with backend IP resolution, failure probes, variable sweeps |
| `slurm_infmax_v1_ds_r1` | Multi-node deployment, MoE optimization, GPU monitoring, file artifacts |
| `slurm_auto_replica` | Auto replica detection, task context, node/GPU assignment |
| `slurm_aiperf_template` | AIPerf benchmarking template, simple single-task workflow |
## Modular Samples (Folder-based)
Modular samples are folders containing multiple composable YAML files. Instead of one monolithic config, the workflow is split into reusable building blocks.
### inference_x_v2
A modular inference benchmark setup supporting multiple frameworks (SGLang, vLLM, TensorRT-LLM) with disaggregated prefill/decode servers.
Structure:
```
inference_x_v2/
├── slurm_config.yaml       # Slurm backend configuration
├── common_workflow.yaml    # Shared tasks (load_image, nats, etcd, frontend)
├── benchmark_aiperf.yaml   # AIPerf benchmark task
├── benchmark_infmax.yaml   # InfMax benchmark task
├── bulk_input.csv          # CSV for bulk batch jobs (disagg + agg rows)
├── sglang/
│   ├── prefill.yaml        # SGLang prefill server task (disaggregated)
│   ├── decode.yaml         # SGLang decode server task (disaggregated)
│   └── agg.yaml            # SGLang aggregated server task
├── vllm/
│   ├── prefill.yaml        # vLLM prefill server task (disaggregated)
│   ├── decode.yaml         # vLLM decode server task (disaggregated)
│   └── agg.yaml            # vLLM aggregated server task
└── trtllm/
    ├── prefill.yaml        # TRT-LLM prefill server task (disaggregated)
    ├── decode.yaml         # TRT-LLM decode server task (disaggregated)
    └── agg.yaml            # TRT-LLM aggregated server task
```
The `bulk_input.csv` supports both disaggregated and aggregated workflows using the `missable_tasks` column:

- Disagg rows include `prefill.yaml` + `decode.yaml` and set `missable_tasks=agg_server`
- Agg rows include `agg.yaml` and set `missable_tasks=prefill_server decode_server`
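To make the two row types above concrete, here is a purely hypothetical sketch of what a disagg row and an agg row could look like (the column names are invented for illustration; check the shipped `bulk_input.csv` for the real schema):

```csv
config_files,missable_tasks
"common_workflow.yaml trtllm/prefill.yaml trtllm/decode.yaml benchmark_aiperf.yaml",agg_server
"common_workflow.yaml trtllm/agg.yaml benchmark_aiperf.yaml",prefill_server decode_server
```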
Copy the modular sample:

```bash
sflow sample inference_x_v2
```
Usage Option A: Bulk batch (CSV-driven)
Each row in `bulk_input.csv` defines a job with its own config files and variable overrides:

```bash
# Preview (no submission)
sflow batch --bulk-input inference_x_v2/bulk_input.csv \
  -a LOCAL_MODEL_PATH=fs:///path/to/model -G 4 -A ACCOUNT -p PARTITION

# Submit all jobs
sflow batch --bulk-input inference_x_v2/bulk_input.csv \
  -a LOCAL_MODEL_PATH=fs:///path/to/model -G 4 -A ACCOUNT -p PARTITION --submit
```
Usage Option B: Compose + Submit (step-by-step)
```bash
# Step 1: Compose modular files into a complete config
sflow compose inference_x_v2/slurm_config.yaml \
  inference_x_v2/common_workflow.yaml \
  inference_x_v2/trtllm/prefill.yaml \
  inference_x_v2/trtllm/decode.yaml \
  inference_x_v2/benchmark_aiperf.yaml \
  -o composed.yaml

# Step 2: Validate, run, or submit
sflow run -f composed.yaml --dry-run   # validate
sflow run -f composed.yaml --tui       # run interactively
sflow batch -f composed.yaml -N 1 -G 4 -p PARTITION -A ACCOUNT \
  -o run.sh --submit                   # submit to Slurm
```
Computed variables:
The modular samples use chained computed variables to simplify GPU/node calculations:
```yaml
variables:
  CTX_TP_SIZE:
    type: integer
    value: 2
  CTX_DP_SIZE:
    type: integer
    value: 1
  CTX_PP_SIZE:
    type: integer
    value: 1
  CTX_GPUS_PER_WORKER:
    type: integer
    value: ${{ variables.CTX_TP_SIZE * variables.CTX_DP_SIZE * variables.CTX_PP_SIZE }}
  CTX_NODES_PER_WORKER:
    type: integer
    value: ${{ [variables.CTX_GPUS_PER_WORKER // variables.GPUS_PER_NODE, 1] | max }}
```
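For intuition, the same arithmetic in plain shell (not sflow syntax), using the default values from the fragment above; `//` is floor division and `| max` clamps the result to at least one node:

```shell
# Defaults from the computed-variables fragment above
CTX_TP_SIZE=2; CTX_DP_SIZE=1; CTX_PP_SIZE=1
GPUS_PER_NODE=4

# CTX_GPUS_PER_WORKER = TP * DP * PP
CTX_GPUS_PER_WORKER=$(( CTX_TP_SIZE * CTX_DP_SIZE * CTX_PP_SIZE ))

# CTX_NODES_PER_WORKER = max(CTX_GPUS_PER_WORKER // GPUS_PER_NODE, 1)
CTX_NODES_PER_WORKER=$(( CTX_GPUS_PER_WORKER / GPUS_PER_NODE ))
if [ "$CTX_NODES_PER_WORKER" -lt 1 ]; then CTX_NODES_PER_WORKER=1; fi

echo "gpus_per_worker=$CTX_GPUS_PER_WORKER nodes_per_worker=$CTX_NODES_PER_WORKER"
```

With these defaults a worker needs 2 GPUs, which fits on a single 4-GPU node, so the node count is clamped to 1.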
## Tips
- Always validate first: Use `--dry-run` before actual execution
- Override variables: Use `--set KEY=VALUE` to customize configurations
- Override model path: Use `--artifact LOCAL_MODEL_PATH=fs:///path/to/model` to point to your actual model
- Use `--resolve`: Add `--resolve` to `sflow compose` or `sflow batch --bulk-input` to inline all variables into literal values for a fully baked config
- Check sample source: Samples are located in `src/sflow/samples/` in the sflow package