# Sample Workflows
This page contains sample workflow configurations that you can use as starting points for your own workflows. You can also access these samples using the `sflow sample` command.

📁 View original sample files: `src/sflow/samples`
## Listing Available Samples
```bash
# List all available samples
sflow sample --list

# Copy a sample to your current directory
sflow sample local_hello_world

# Copy with custom output path
sflow sample local_dag --output my_workflow.yaml
```
## Local Samples
These samples run locally without requiring a Slurm cluster.
### Hello World
A minimal example that demonstrates basic sflow concepts.
```yaml
version: "0.1"

variables:
  WHO:
    description: "who to greet"
    value: Nvidia

workflow:
  name: local_hello_world
  tasks:
    - name: hello
      script:
        - echo "Hello ${WHO}"
```
Run it:
```bash
sflow sample local_hello_world
sflow run -f local_hello_world.yaml
```
### DAG Workflow
A multi-task workflow demonstrating task dependencies, data flow between tasks, and parallel execution.
```yaml
version: "0.1"

variables:
  - name: MODEL_NAME
    type: string
    value: tiny-transformer

workflow:
  name: quickstart_dag
  tasks:
    - name: prepare_data
      script:
        - echo "prepare_data start"
        - echo "model(jinja)=${{ variables.MODEL_NAME }}" > ${SFLOW_WORKFLOW_OUTPUT_DIR}/dataset.txt
        - echo "model(shell)=${MODEL_NAME}" >> ${SFLOW_WORKFLOW_OUTPUT_DIR}/dataset.txt
    - name: preprocess
      depends_on: [prepare_data]
      script:
        - test -f ${SFLOW_WORKFLOW_OUTPUT_DIR}/dataset.txt
        - grep -q "model(jinja)=tiny-transformer" ${SFLOW_WORKFLOW_OUTPUT_DIR}/dataset.txt
        - grep -q "model(shell)=tiny-transformer" ${SFLOW_WORKFLOW_OUTPUT_DIR}/dataset.txt
        - echo "encoded_data ok" > ${SFLOW_WORKFLOW_OUTPUT_DIR}/encoded.txt
    - name: train
      depends_on: [preprocess]
      script:
        - test -f ${SFLOW_WORKFLOW_OUTPUT_DIR}/encoded.txt
        - echo "checkpoint for ${MODEL_NAME}" > ${SFLOW_WORKFLOW_OUTPUT_DIR}/checkpoint.pt
    - name: evaluate_on_dataset1
      depends_on: [train]
      script:
        - test -f ${SFLOW_WORKFLOW_OUTPUT_DIR}/checkpoint.pt
        - echo "accuracy=0.99 dataset=dataset1" > ${SFLOW_TASK_OUTPUT_DIR}/metrics.txt
    - name: evaluate_on_dataset2
      depends_on: [train]
      script:
        - test -f ${SFLOW_WORKFLOW_OUTPUT_DIR}/checkpoint.pt
        - echo "accuracy=0.88 dataset=dataset2" > ${SFLOW_TASK_OUTPUT_DIR}/metrics.txt
    - name: export_model
      depends_on: [evaluate_on_dataset1, evaluate_on_dataset2]
      script:
        - test -f ${SFLOW_WORKFLOW_OUTPUT_DIR}/evaluate_on_dataset1/metrics.txt
        - test -f ${SFLOW_WORKFLOW_OUTPUT_DIR}/evaluate_on_dataset2/metrics.txt
        - echo "exported ${MODEL_NAME}" > ${SFLOW_WORKFLOW_OUTPUT_DIR}/model.onnx
```
Run it:
```bash
sflow sample local_dag
sflow run -f local_dag.yaml --dry-run  # Validate
sflow run -f local_dag.yaml            # Execute
```
## Slurm Samples
These samples require a Slurm cluster with GPU resources.
### SGLang Server + Benchmark (Single Node)
Deploys an SGLang inference server with AIPerf benchmarking on Slurm.
Features:
- SGLang server with FP8 inference
- GPU monitoring
- AIPerf benchmarking client
- Readiness probes for service orchestration
```yaml
version: "0.1"

variables:
  # Slurm Configuration
  SLURM_ACCOUNT:
    description: "SLURM account"
    value: your_account
  SLURM_PARTITION:
    description: "SLURM partition"
    value: your_partition
  SLURM_TIMELIMIT:
    description: "SLURM time limit"
    value: 60
  GPUS_PER_NODE:
    description: "GPUs per node"
    value: 4
  SLURM_NODES:
    description: "Number of nodes"
    value: 1

  # Model Configuration
  HF_MODEL_NAME:
    description: "HF model name"
    value: Qwen/Qwen3-0.6B-FP8
  SERVED_MODEL_NAME:
    description: "Served model name"
    value: Qwen3-0-6B-FP8
  LOCAL_MODEL_PATH:
    description: "Local model path"
    value: /tmp/models/Qwen3-0.6B-FP8

  # SGLang Server Configuration
  NUM_SERVERS:
    description: "Number of servers"
    value: 1
  TP_SIZE:
    description: "Tensor parallel size"
    value: 4
  MAX_RUNNING_REQUESTS:
    description: "Max running requests"
    value: 32

  # Benchmark Configuration
  ISL:
    description: "Input sequence length"
    value: 1024
  OSL:
    description: "Output sequence length"
    value: 1024
  MULTI_ROUND:
    description: "Number of benchmark rounds"
    value: 8
  CONCURRENCY:
    description: "Concurrency"
    value: 32

  # Container Images
  SGLANG_IMAGE:
    description: "SGLang image"
    value: "lmsysorg/sglang:v0.5.7-cu130-runtime"
  AIPERF_IMAGE:
    description: "AIPerf container image"
    value: python:3.12-slim

backends:
  - name: slurm_cluster
    type: slurm
    default: true
    time: ${{ variables.SLURM_TIMELIMIT }}
    nodes: ${{ variables.SLURM_NODES }}
    partition: ${{ variables.SLURM_PARTITION }}
    account: ${{ variables.SLURM_ACCOUNT }}
    gpus_per_node: ${{ variables.GPUS_PER_NODE }}

operators:
  - name: sglang_runtime
    type: srun
    container_name: sglang_runtime
    container_writable: true
    container_mount_home: false
    ntasks_per_node: 1
    mpi: pmix
    extra_args:
      - --container-image=${{ variables.SGLANG_IMAGE }}
  - name: aiperf
    type: srun
    container_name: aiperf
    container_writable: true
    mpi: pmix
    extra_args:
      - --container-image=${{ variables.AIPERF_IMAGE }}

workflow:
  name: sglang_qwen3_0_6b
  timeout: 60m
  variables:
    HEAD_NODE_IP:
      description: "Head node IP"
      value: "${{ backends.slurm_cluster.nodes[0].ip_address }}"
  tasks:
    - name: load_image
      operator:
        name: sglang_runtime
        ntasks_per_node: 1
      script:
        - echo "Image Loaded"
        - sleep 3600
      probes:
        readiness:
          log_watch:
            regex_pattern: "Image Loaded"
            timeout: 1200
            interval: 2
    - name: install_aiperf
      operator:
        name: aiperf
        ntasks_per_node: 1
      script:
        - pip install aiperf==0.3.0
        - hf download ${{ variables.HF_MODEL_NAME }} --local-dir ${{ variables.LOCAL_MODEL_PATH }}
        - echo "AIPerf installed"
        - sleep 3600
      probes:
        readiness:
          log_watch:
            regex_pattern: "AIPerf installed"
            timeout: 1200
            interval: 2
    - name: gpu_monitor
      operator: sglang_runtime
      script:
        - echo "Starting gpu monitor"
        - >
          nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,temperature.gpu,temperature.memory,power.draw,clocks.sm,clocks.mem,memory.total,memory.used
          --format=csv,noheader,nounits -lms 2000 |
          while IFS= read -r input || [ -n "$input" ] ;
          do timestamp=$(date +%s%3N);
          printf "%s.%s,%s\n" "${timestamp:0:10}" "${timestamp:10:3}" "${input}";
          done
          >> ${SFLOW_TASK_OUTPUT_DIR}/gpu_monitor_node_${SLURM_NODEID}_${SLURMD_NODENAME}.log
      probes:
        readiness:
          log_watch:
            regex_pattern: "Starting gpu monitor"
      resources:
        nodes:
          indices: [0]
      depends_on:
        - load_image
        - install_aiperf
    - name: sglang_server
      operator: sglang_runtime
      replicas:
        count: ${{ variables.NUM_SERVERS }}
        policy: parallel
      resources:
        gpus:
          count: ${{ variables.TP_SIZE }}
        nodes:
          indices: [0]
      script:
        - set -x
        - export SGLANG_DISABLE_WATCHDOG=1
        - >
          python -m sglang_router.launch_server --model ${{ variables.HF_MODEL_NAME }}
          --host 0.0.0.0
          --port 8000
          --fp8-gemm-backend flashinfer_trtllm
          --moe-runner-backend flashinfer_trtllm
          --served-model-name ${{ variables.SERVED_MODEL_NAME }}
          --tensor-parallel-size ${{ variables.TP_SIZE }}
          --trust-remote-code
          --max-running-requests ${{ variables.MAX_RUNNING_REQUESTS }}
      probes:
        readiness:
          log_watch:
            regex_pattern: "Workflow completed"
      depends_on:
        - load_image
    - name: benchmark
      operator:
        name: aiperf
        ntasks: 1
      script:
        - set -x
        - >
          aiperf profile --artifact-dir ${SFLOW_WORKFLOW_OUTPUT_DIR}/aiperf_concurrency_${CONCURRENCY}
          --model ${{ variables.SERVED_MODEL_NAME }}
          --tokenizer ${{ variables.LOCAL_MODEL_PATH }}
          --endpoint-type chat
          --endpoint /v1/chat/completions
          --streaming
          --url http://${{ variables.HEAD_NODE_IP }}:8000
          --synthetic-input-tokens-mean ${{ variables.ISL }}
          --synthetic-input-tokens-stddev 0
          --output-tokens-mean ${{ variables.OSL }}
          --output-tokens-stddev 0
          --extra-inputs "max_tokens:${{ variables.OSL }}"
          --extra-inputs "min_tokens:${{ variables.OSL }}"
          --extra-inputs "ignore_eos:true"
          --concurrency ${CONCURRENCY}
          --request-count $((${{ variables.MULTI_ROUND }}*${CONCURRENCY}))
          --warmup-request-count ${CONCURRENCY}
          --num-dataset-entries $((${{ variables.MULTI_ROUND }}*${CONCURRENCY}))
          --random-seed 100
          --ui simple
        - echo "Benchmarking finished"
      resources:
        nodes:
          indices: [0]
      depends_on:
        - sglang_server
        - install_aiperf
```
Run it:
```bash
sflow sample slurm_sglang_server_client

# Validate configuration
sflow run -f slurm_sglang_server_client.yaml \
  --set SLURM_ACCOUNT=your_account \
  --set SLURM_PARTITION=your_partition \
  --dry-run

# Submit to Slurm
sflow batch -f slurm_sglang_server_client.yaml \
  -A your_account -p your_partition -N 1 -G 4 \
  --sbatch-path sglang_job.sh --submit
```
### Dynamo TRT-LLM Disaggregated Inference (Single Node)
Deploys a disaggregated inference setup with separate prefill and decode servers using NVIDIA Dynamo and TensorRT-LLM.
Features:
- Disaggregated prefill/decode architecture
- NATS and etcd for service discovery
- Configurable tensor parallelism
- Sequential benchmark sweeps with variable domains
- Retry policies for server reliability
- File-type artifacts for dynamic configuration
```yaml
version: "0.1"

variables:
  # Slurm Configuration
  SLURM_ACCOUNT:
    description: "SLURM account"
    value: your_account
  SLURM_PARTITION:
    description: "SLURM partition"
    value: your_partition
  SLURM_TIMELIMIT:
    description: "SLURM time limit"
    value: 120
  GPUS_PER_NODE:
    description: "GPUs per node"
    value: 4
  SLURM_NODES:
    description: "Number of nodes"
    value: 1

  # Model Configuration
  SERVED_MODEL_NAME:
    description: "Served model name"
    value: Qwen3-0-6B-FP8
  MODEL_NAME:
    description: "Model path"
    value: Qwen/Qwen3-0.6B-FP8
  LOCAL_MODEL_PATH:
    description: "Local model path"
    value: /tmp/models/Qwen3-0.6B-FP8

  # Prefill Server Configuration
  NUM_CTX_SERVERS:
    description: "Number of context/prefill servers"
    value: 1
  CTX_TP_SIZE:
    description: "Context tensor parallel size"
    value: 2

  # Decode Server Configuration
  NUM_GEN_SERVERS:
    description: "Number of generation/decode servers"
    value: 1
  GEN_TP_SIZE:
    description: "Generation tensor parallel size"
    value: 2

  # Benchmark Configuration with Domain Sweep
  CONCURRENCY:
    description: "Concurrency"
    value: 64
    domain: [64, 128]  # Will create sequential benchmark runs

  # Container Images
  DYNAMO_IMAGE:
    description: "Dynamo TRTLLM container image"
    value: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0

artifacts:
  # File-type artifacts are generated by sflow with dynamic content
  - name: PREFILL_CONFIG
    uri: file://prefill_config.yaml
    content: |
      max_batch_size: 128
      tensor_parallel_size: ${{ variables.CTX_TP_SIZE }}
      # ... additional configuration
  - name: DECODE_CONFIG
    uri: file://decode_config.yaml
    content: |
      tensor_parallel_size: ${{ variables.GEN_TP_SIZE }}
      # ... additional configuration

backends:
  - name: slurm_cluster
    type: slurm
    default: true
    time: ${{ variables.SLURM_TIMELIMIT }}
    nodes: ${{ variables.SLURM_NODES }}
    partition: ${{ variables.SLURM_PARTITION }}
    account: ${{ variables.SLURM_ACCOUNT }}
    gpus_per_node: ${{ variables.GPUS_PER_NODE }}

operators:
  - name: dynamo_trtllm
    type: srun
    container_name: dynamo_trtllm
    container_writable: true
    container_mount_home: false
    mpi: pmix
    extra_args:
      - --container-image=${{ variables.DYNAMO_IMAGE }}

workflow:
  name: dynamo
  timeout: 115m
  variables:
    HEAD_NODE_IP:
      value: "${{ backends.slurm_cluster.nodes[0].ip_address }}"
    ETCD_ENDPOINTS:
      value: "${{ backends.slurm_cluster.nodes[0].ip_address }}:2379"
    NATS_SERVER:
      value: "nats://${{ backends.slurm_cluster.nodes[0].ip_address }}:4222"
  tasks:
    - name: nats_server
      operator: dynamo_trtllm
      script:
        - nats-server -js
      probes:
        readiness:
          tcp_port:
            port: 4222
            timeout: 60
    - name: etcd_server
      operator: dynamo_trtllm
      script:
        - etcd --listen-client-urls "http://0.0.0.0:2379" ...
      probes:
        readiness:
          tcp_port:
            port: 2379
            timeout: 60
    - name: frontend_server
      operator: dynamo_trtllm
      script:
        - python3 -m dynamo.frontend --http-port 8000
      probes:
        readiness:
          tcp_port:
            port: 8000
            timeout: 120
      depends_on:
        - nats_server
        - etcd_server
    - name: prefill_server
      operator:
        name: dynamo_trtllm
        ntasks: ${{ variables.CTX_TP_SIZE }}
      replicas:
        count: ${{ variables.NUM_CTX_SERVERS }}
        policy: parallel
      script:
        - trtllm-llmapi-launch python3 -m dynamo.trtllm --disaggregation-mode prefill ...
      resources:
        gpus:
          count: ${{ variables.CTX_TP_SIZE }}
      probes:
        readiness:
          log_watch:
            regex_pattern: "Setting PyTorch memory fraction"
            timeout: 600
        failure:
          log_watch:
            # Parentheses are escaped so the regex matches the literal traceback header
            regex_pattern: 'Traceback \(most recent call last\)'
      retries:
        count: 3
        interval: 30
        backoff: 2
      depends_on:
        - frontend_server
    - name: decode_server
      operator:
        name: dynamo_trtllm
        ntasks: ${{ variables.GEN_TP_SIZE }}
      replicas:
        count: ${{ variables.NUM_GEN_SERVERS }}
        policy: parallel
      script:
        - trtllm-llmapi-launch python3 -m dynamo.trtllm --disaggregation-mode decode ...
      resources:
        gpus:
          count: ${{ variables.GEN_TP_SIZE }}
      retries:
        count: 3
        interval: 30
        backoff: 2
      depends_on:
        - frontend_server
    - name: benchmark
      operator:
        name: aiperf
        ntasks: 1
      replicas:
        variables:
          - CONCURRENCY  # Sweeps over domain [64, 128]
        policy: sequential
      script:
        - aiperf profile --concurrency ${CONCURRENCY} ...
      depends_on:
        - prefill_server
        - decode_server
        - frontend_server
```
Run it:
```bash
sflow sample slurm_dynamo_trtllm_disagg

# Validate configuration
sflow run -f slurm_dynamo_trtllm_disagg.yaml \
  --set SLURM_ACCOUNT=your_account \
  --set SLURM_PARTITION=your_partition \
  --dry-run

# Submit to Slurm
sflow batch -f slurm_dynamo_trtllm_disagg.yaml \
  -A your_account -p your_partition -N 1 -G 4 \
  --sbatch-path dynamo_job.sh --submit
```
### TRT-LLM Serve Disaggregated Inference (Single Node)
Deploys a disaggregated inference setup with separate prefill and decode servers using TensorRT-LLM's native `trtllm-serve disaggregated` command.
Features:
- Disaggregated prefill/decode architecture with `trtllm-serve`
- Dynamic configuration using file-type artifacts with backend node IP resolution
- Configurable tensor parallelism for prefill and decode servers
- GPU monitoring task
- Sequential benchmark sweeps with variable domains
- Failure probes for error detection
```yaml
version: "0.1"

variables:
  # Slurm Configuration
  SLURM_ACCOUNT:
    description: "SLURM account"
    value: your_account
  SLURM_PARTITION:
    description: "SLURM partition"
    value: your_partition
  SLURM_TIMELIMIT:
    description: "SLURM time limit"
    value: 120
  GPUS_PER_NODE:
    description: "GPUs per node"
    value: 4
  SLURM_NODES:
    description: "Number of nodes"
    value: 1

  # Model Configuration
  SERVED_MODEL_NAME:
    description: "Served model name"
    value: Qwen3-0-6B-FP8
  MODEL_NAME:
    description: "Model path"
    value: Qwen/Qwen3-0.6B-FP8
  LOCAL_MODEL_PATH:
    description: "Local model path"
    value: /tmp/models/Qwen3-0.6B-FP8

  # Prefill Server Configuration
  NUM_CTX_SERVERS:
    description: "Number of context/prefill servers"
    value: 1
  CTX_TP_SIZE:
    description: "Context tensor parallel size"
    value: 2

  # Decode Server Configuration
  NUM_GEN_SERVERS:
    description: "Number of generation/decode servers"
    value: 1
  GEN_TP_SIZE:
    description: "Generation tensor parallel size"
    value: 2

  # Benchmark Configuration with Domain Sweep
  CONCURRENCY:
    description: "Concurrency"
    value: 128
    domain: [128, 256]  # Will create sequential benchmark runs

  # Container Images
  TRTLLM_IMAGE:
    description: "TRT-LLM container image"
    value: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6.post2
  AIPERF_IMAGE:
    description: "AIPerf container image"
    value: python:3.12-slim

artifacts:
  # File-type artifacts with dynamic backend node IP resolution
  - name: SERVER_CONFIG
    uri: file://server_config.yaml
    content: |
      hostname: ${{ backends.slurm_cluster.nodes[0].ip_address }}
      port: 8000
      backend: pytorch
      context_servers:
        num_instances: ${{ variables.NUM_CTX_SERVERS }}
        urls:
          - ${{ backends.slurm_cluster.nodes[0].ip_address }}:8536
      generation_servers:
        num_instances: ${{ variables.NUM_GEN_SERVERS }}
        urls:
          - ${{ backends.slurm_cluster.nodes[0].ip_address }}:8336
  - name: PREFILL_CONFIG
    uri: file://prefill_config.yaml
    content: |
      max_batch_size: 128
      tensor_parallel_size: ${{ variables.CTX_TP_SIZE }}
      # ... additional configuration
  - name: DECODE_CONFIG
    uri: file://decode_config.yaml
    content: |
      tensor_parallel_size: ${{ variables.GEN_TP_SIZE }}
      # ... additional configuration

backends:
  - name: slurm_cluster
    type: slurm
    default: true
    time: ${{ variables.SLURM_TIMELIMIT }}
    nodes: ${{ variables.SLURM_NODES }}
    partition: ${{ variables.SLURM_PARTITION }}
    account: ${{ variables.SLURM_ACCOUNT }}
    gpus_per_node: ${{ variables.GPUS_PER_NODE }}

operators:
  - name: trtllm_container
    type: srun
    container_name: trtllm_container
    container_writable: true
    container_mount_home: false
    mpi: pmix
    extra_args:
      - --container-image=${{ variables.TRTLLM_IMAGE }}
  - name: aiperf
    type: srun
    container_name: aiperf
    container_writable: true
    mpi: pmix
    extra_args:
      - --container-image=${{ variables.AIPERF_IMAGE }}

workflow:
  name: trtllm_server_disagg
  timeout: 115m
  variables:
    HEAD_NODE_IP:
      description: "Head node IP (resolved after allocation)"
      value: "${{ backends.slurm_cluster.nodes[0].ip_address }}"
  tasks:
    - name: load_image
      operator:
        name: trtllm_container
        ntasks_per_node: 1
      script:
        - hf download ${{ variables.MODEL_NAME }} --local-dir ${{ variables.LOCAL_MODEL_PATH }}
        - echo "Image Loaded"
        - sleep 3600
      probes:
        readiness:
          log_watch:
            regex_pattern: "Image Loaded"
            timeout: 1200
    - name: frontend_server
      operator: trtllm_container
      script:
        - cat ${{ artifacts.SERVER_CONFIG.path }}
        - trtllm-serve disaggregated -c ${{ artifacts.SERVER_CONFIG.path }} -t 7200 -r 7200
      resources:
        nodes:
          indices: [0]
      probes:
        readiness:
          log_watch:
            regex_pattern: "Application startup complete"
            timeout: 120
      depends_on:
        - prefill_server
        - decode_server
    - name: prefill_server
      operator:
        name: trtllm_container
        ntasks: ${{ variables.CTX_TP_SIZE }}
        ntasks_per_node: ${{ [ variables.CTX_TP_SIZE, variables.GPUS_PER_NODE ] | min }}
      replicas:
        count: ${{ variables.NUM_CTX_SERVERS }}
        policy: parallel
      script:
        - cat ${{ artifacts.PREFILL_CONFIG.path }}
        - >
          trtllm-llmapi-launch trtllm-serve ${LOCAL_MODEL_PATH}
          --host ${HEAD_NODE_IP}
          --port $((8536 + ${SFLOW_REPLICA_INDEX}))
          --extra_llm_api_options ${{ artifacts.PREFILL_CONFIG.path }}
      resources:
        gpus:
          count: ${{ variables.CTX_TP_SIZE }}
      probes:
        readiness:
          log_watch:
            regex_pattern: "Application startup complete"
            timeout: 600
        failure:
          log_watch:
            # Parentheses are escaped so the regex matches the literal traceback header
            regex_pattern: 'Traceback \(most recent call last\)'
      depends_on:
        - load_image
    - name: decode_server
      operator:
        name: trtllm_container
        ntasks: ${{ variables.GEN_TP_SIZE }}
        ntasks_per_node: ${{ [ variables.GEN_TP_SIZE, variables.GPUS_PER_NODE ] | min }}
      replicas:
        count: ${{ variables.NUM_GEN_SERVERS }}
        policy: parallel
      script:
        - cat ${{ artifacts.DECODE_CONFIG.path }}
        - >
          trtllm-llmapi-launch trtllm-serve ${LOCAL_MODEL_PATH}
          --host ${HEAD_NODE_IP}
          --port $((8336 + ${SFLOW_REPLICA_INDEX}))
          --extra_llm_api_options ${{ artifacts.DECODE_CONFIG.path }}
      resources:
        gpus:
          count: ${{ variables.GEN_TP_SIZE }}
      probes:
        readiness:
          log_watch:
            regex_pattern: "Application startup complete"
            timeout: 600
        failure:
          log_watch:
            regex_pattern: 'Traceback \(most recent call last\)'
      depends_on:
        - load_image
    - name: benchmark
      operator:
        name: aiperf
        ntasks: 1
      replicas:
        variables:
          - CONCURRENCY  # Sweeps over domain [128, 256]
        policy: sequential
      script:
        - aiperf profile --concurrency ${CONCURRENCY} --url http://${HEAD_NODE_IP}:8000 ...
      depends_on:
        - prefill_server
        - decode_server
        - frontend_server
```
Run it:
```bash
sflow sample slurm_trtllm_serve_disagg

# Validate configuration
sflow run -f slurm_trtllm_serve_disagg.yaml \
  --set SLURM_ACCOUNT=your_account \
  --set SLURM_PARTITION=your_partition \
  --dry-run

# Submit to Slurm
sflow batch -f slurm_trtllm_serve_disagg.yaml \
  -A your_account -p your_partition -N 1 -G 4 \
  --sbatch-path trtllm_disagg_job.sh --submit
```
### InfMax Multi-Node Disaggregated Inference (DS-R1)
A production-ready multi-node disaggregated inference setup optimized for large models like DeepSeek-R1 using NVIDIA Dynamo and TensorRT-LLM.
Features:
- Multi-node deployment (default 3 nodes with 4 GPUs each)
- Disaggregated prefill/decode architecture with configurable parallelism
- NATS and etcd for service discovery
- GPU monitoring across all nodes
- MoE (Mixture of Experts) optimization parameters
- Sequential benchmark sweeps with variable domains
- File-type artifacts for dynamic server configuration
- Failure probes for error detection
```yaml
version: "0.1"

variables:
  # Slurm Configuration
  SLURM_ACCOUNT:
    description: "SLURM account"
    value: your_account
  SLURM_PARTITION:
    description: "SLURM partition"
    value: your_partition
  SLURM_TIMELIMIT:
    description: "SLURM time limit"
    value: 120
  GPUS_PER_NODE:
    description: "GPUs per node"
    value: 4
  SLURM_NODES:
    description: "Number of nodes"
    value: 3

  # Model Configuration
  SERVED_MODEL_NAME:
    description: "Served model name"
    value: DS-R1

  # Prefill Server Configuration
  NUM_CTX_SERVERS:
    description: "Number of context/prefill servers"
    value: 1
  CTX_TP_SIZE:
    description: "Context tensor parallel size"
    value: 4
  CTX_BATCH_SIZE:
    description: "Context batch size"
    value: 1
  CTX_MAX_NUM_TOKENS:
    description: "Context max number of tokens"
    value: 8448

  # Decode Server Configuration
  NUM_GEN_SERVERS:
    description: "Number of generation/decode servers"
    value: 1
  GEN_TP_SIZE:
    description: "Generation tensor parallel size"
    value: 8
  GEN_BATCH_SIZE:
    description: "Generation batch size"
    value: 128

  # Benchmark Configuration with Domain Sweep
  CONCURRENCY:
    description: "Concurrency"
    value: 64
    domain: [32, 64]  # Will create sequential benchmark runs

  # Container Images
  DYNAMO_IMAGE:
    description: "Dynamo TRTLLM container image"
    value: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.0

artifacts:
  - name: LOCAL_MODEL_PATH
    uri: fs:///path/to/your/model
  - name: PREFILL_CONFIG
    uri: file://prefill_config.yaml
    content: |
      max_batch_size: ${{ variables.CTX_BATCH_SIZE }}
      tensor_parallel_size: ${{ variables.CTX_TP_SIZE }}
      moe_expert_parallel_size: ${{ variables.CTX_TP_SIZE }}
      # ... additional configuration
  - name: DECODE_CONFIG
    uri: file://decode_config.yaml
    content: |
      tensor_parallel_size: ${{ variables.GEN_TP_SIZE }}
      max_batch_size: ${{ variables.GEN_BATCH_SIZE }}
      # ... additional configuration

backends:
  - name: slurm_cluster
    type: slurm
    default: true
    time: ${{ variables.SLURM_TIMELIMIT }}
    nodes: ${{ variables.SLURM_NODES }}
    partition: ${{ variables.SLURM_PARTITION }}
    account: ${{ variables.SLURM_ACCOUNT }}
    gpus_per_node: ${{ variables.GPUS_PER_NODE }}

operators:
  - name: dynamo_trtllm
    type: srun
    container_image: ${{ variables.DYNAMO_IMAGE }}
    container_writable: true
    container_mount_home: false
    mpi: pmix

workflow:
  name: infmax
  timeout: 115m
  variables:
    HEAD_NODE_IP:
      value: "${{ backends.slurm_cluster.nodes[0].ip_address }}"
    ETCD_ENDPOINTS:
      value: "${{ backends.slurm_cluster.nodes[0].ip_address }}:2379"
    NATS_SERVER:
      value: "nats://${{ backends.slurm_cluster.nodes[0].ip_address }}:4222"
  tasks:
    - name: load_image
      operator:
        name: dynamo_trtllm
        ntasks: ${{ variables.SLURM_NODES }}
        ntasks_per_node: 1
      script:
        - echo "Image Loaded"
      probes:
        readiness:
          log_watch:
            regex_pattern: "Image Loaded"
            timeout: 1200
    - name: gpu_monitor
      operator:
        name: dynamo_trtllm
        ntasks_per_node: 1
      resources:
        nodes:
          count: ${{ variables.SLURM_NODES }}
      script:
        - nvidia-smi monitoring...
      depends_on:
        - load_image
    - name: nats_server
      operator: dynamo_trtllm
      script:
        - nats-server -js
      resources:
        nodes:
          indices: [0]
      probes:
        readiness:
          tcp_port:
            port: 4222
      depends_on:
        - load_image
    - name: etcd_server
      operator: dynamo_trtllm
      script:
        - etcd --listen-client-urls "http://0.0.0.0:2379" ...
      resources:
        nodes:
          indices: [0]
      probes:
        readiness:
          tcp_port:
            port: 2379
      depends_on:
        - load_image
    - name: frontend_server
      operator: dynamo_trtllm
      script:
        - python3 -m dynamo.frontend --http-port 8000
      resources:
        nodes:
          indices: [0]
      probes:
        readiness:
          tcp_port:
            port: 8000
      depends_on:
        - nats_server
        - etcd_server
    - name: prefill_server
      operator:
        name: dynamo_trtllm
        ntasks: ${{ variables.CTX_TP_SIZE }}
        ntasks_per_node: ${{ [ variables.CTX_TP_SIZE, variables.GPUS_PER_NODE ] | min }}
      replicas:
        count: ${{ variables.NUM_CTX_SERVERS }}
        policy: parallel
      script:
        - trtllm-llmapi-launch python3 -m dynamo.trtllm --disaggregation-mode prefill ...
      resources:
        gpus:
          count: ${{ variables.CTX_TP_SIZE }}
      probes:
        readiness:
          log_watch:
            regex_pattern: "Setting PyTorch memory fraction"
        failure:
          log_watch:
            # Parentheses are escaped so the regex matches the literal traceback header
            regex_pattern: 'Traceback \(most recent call last\)'
      depends_on:
        - frontend_server
    - name: decode_server
      operator:
        name: dynamo_trtllm
        ntasks: ${{ variables.GEN_TP_SIZE }}
        ntasks_per_node: ${{ [ variables.GEN_TP_SIZE, variables.GPUS_PER_NODE ] | min }}
      replicas:
        count: ${{ variables.NUM_GEN_SERVERS }}
        policy: parallel
      script:
        - trtllm-llmapi-launch python3 -m dynamo.trtllm --disaggregation-mode decode ...
      resources:
        gpus:
          count: ${{ variables.GEN_TP_SIZE }}
      probes:
        readiness:
          log_watch:
            regex_pattern: "Setting PyTorch memory fraction"
        failure:
          log_watch:
            regex_pattern: 'Traceback \(most recent call last\)'
      depends_on:
        - frontend_server
    - name: benchmark
      operator:
        name: aiperf
        ntasks: 1
      replicas:
        variables:
          - CONCURRENCY  # Sweeps over domain [32, 64]
        policy: sequential
      script:
        - aiperf profile --concurrency ${CONCURRENCY} ...
      depends_on:
        - prefill_server
        - decode_server
        - frontend_server
```
Run it:
```bash
sflow sample slurm_infmax_v1_ds_r1

# Validate configuration
sflow run -f slurm_infmax_v1_ds_r1.yaml \
  --set SLURM_ACCOUNT=your_account \
  --set SLURM_PARTITION=your_partition \
  --dry-run

# Submit to Slurm (multi-node)
sflow batch -f slurm_infmax_v1_ds_r1.yaml \
  -A your_account -p your_partition -N 3 -G 4 \
  --sbatch-path infmax_job.sh --submit
```
## Key Concepts Demonstrated
| Sample | Concepts |
|---|---|
| `local_hello_world` | Variables, basic task execution |
| `local_dag` | Task dependencies, parallel execution, built-in env vars |
| `slurm_sglang_server_client` | Slurm backend, operators, probes, replicas, GPU resources |
| `slurm_dynamo_trtllm_disagg` | Service discovery (NATS/etcd), retry policies, multi-process tasks |
| `slurm_trtllm_serve_disagg` | Artifacts with backend IP resolution, failure probes, variable sweeps |
| `slurm_infmax_v1_ds_r1` | Multi-node deployment, MoE optimization, GPU monitoring, file artifacts |
| `slurm_auto_replica` | Auto replica detection, task context, node/GPU assignment |
| `slurm_aiperf_template` | AIPerf benchmarking template, simple single-task workflow |
## Modular Samples (Folder-based)
Modular samples are folders containing multiple composable YAML files. Instead of one monolithic config, the workflow is split into reusable building blocks.
### inference_x_v2
A modular inference benchmark setup supporting multiple frameworks (SGLang, vLLM, TensorRT-LLM) with disaggregated prefill/decode servers.
Structure:
```
inference_x_v2/
├── slurm_config.yaml       # Slurm backend configuration
├── common_workflow.yaml    # Shared tasks (load_image, nats, etcd, frontend)
├── benchmark_aiperf.yaml   # AIPerf benchmark task
├── benchmark_infmax.yaml   # InfMax benchmark task
├── bulk_input.csv          # CSV for bulk batch jobs (disagg + agg rows)
├── sglang/
│   ├── prefill.yaml        # SGLang prefill server task (disaggregated)
│   ├── decode.yaml         # SGLang decode server task (disaggregated)
│   └── agg.yaml            # SGLang aggregated server task
├── vllm/
│   ├── prefill.yaml        # vLLM prefill server task (disaggregated)
│   ├── decode.yaml         # vLLM decode server task (disaggregated)
│   └── agg.yaml            # vLLM aggregated server task
└── trtllm/
    ├── prefill.yaml        # TRT-LLM prefill server task (disaggregated)
    ├── decode.yaml         # TRT-LLM decode server task (disaggregated)
    └── agg.yaml            # TRT-LLM aggregated server task
```
The `bulk_input.csv` supports both disaggregated and aggregated workflows using the `missable_tasks` column:

- Disagg rows include `prefill.yaml` + `decode.yaml` and set `missable_tasks=agg_server`
- Agg rows include `agg.yaml` and set `missable_tasks=prefill_server decode_server`
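To make the two row types above concrete, here is a purely hypothetical sketch of what a disagg row and an agg row could look like (the column names are invented for illustration; check the shipped `bulk_input.csv` for the real schema):

```csv
config_files,missable_tasks
"common_workflow.yaml trtllm/prefill.yaml trtllm/decode.yaml benchmark_aiperf.yaml",agg_server
"common_workflow.yaml trtllm/agg.yaml benchmark_aiperf.yaml",prefill_server decode_server
```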
Copy the modular sample:

```bash
sflow sample inference_x_v2
```
Usage Option A: Bulk batch (CSV-driven)
Each row in `bulk_input.csv` defines a job with its own config files and variable overrides:

```bash
# Preview (no submission)
sflow batch --bulk-input inference_x_v2/bulk_input.csv \
  -a LOCAL_MODEL_PATH=fs:///path/to/model -G 4 -A ACCOUNT -p PARTITION

# Submit all jobs
sflow batch --bulk-input inference_x_v2/bulk_input.csv \
  -a LOCAL_MODEL_PATH=fs:///path/to/model -G 4 -A ACCOUNT -p PARTITION --submit
```
Usage Option B: Compose + Submit (step-by-step)
```bash
# Step 1: Compose modular files into a complete config
sflow compose inference_x_v2/slurm_config.yaml \
  inference_x_v2/common_workflow.yaml \
  inference_x_v2/trtllm/prefill.yaml \
  inference_x_v2/trtllm/decode.yaml \
  inference_x_v2/benchmark_aiperf.yaml \
  -o composed.yaml

# Step 2: Validate, run, or submit
sflow run -f composed.yaml --dry-run   # validate
sflow run -f composed.yaml --tui       # run interactively
sflow batch -f composed.yaml -N 1 -G 4 -p PARTITION -A ACCOUNT \
  -o run.sh --submit                   # submit to Slurm
```
Computed variables:
The modular samples use chained computed variables to simplify GPU/node calculations:
```yaml
variables:
  CTX_TP_SIZE:
    type: integer
    value: 2
  CTX_DP_SIZE:
    type: integer
    value: 1
  CTX_PP_SIZE:
    type: integer
    value: 1
  CTX_GPUS_PER_WORKER:
    type: integer
    value: ${{ variables.CTX_TP_SIZE * variables.CTX_DP_SIZE * variables.CTX_PP_SIZE }}
  CTX_NODES_PER_WORKER:
    type: integer
    value: ${{ [variables.CTX_GPUS_PER_WORKER // variables.GPUS_PER_NODE, 1] | max }}
```
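For intuition, the same arithmetic in plain shell (not sflow syntax), using the default values from the fragment above; `//` is floor division and `| max` clamps the result to at least one node:

```shell
# Defaults from the computed-variables fragment above
CTX_TP_SIZE=2; CTX_DP_SIZE=1; CTX_PP_SIZE=1
GPUS_PER_NODE=4

# CTX_GPUS_PER_WORKER = TP * DP * PP
CTX_GPUS_PER_WORKER=$(( CTX_TP_SIZE * CTX_DP_SIZE * CTX_PP_SIZE ))

# CTX_NODES_PER_WORKER = max(CTX_GPUS_PER_WORKER // GPUS_PER_NODE, 1)
CTX_NODES_PER_WORKER=$(( CTX_GPUS_PER_WORKER / GPUS_PER_NODE ))
if [ "$CTX_NODES_PER_WORKER" -lt 1 ]; then CTX_NODES_PER_WORKER=1; fi

echo "gpus_per_worker=$CTX_GPUS_PER_WORKER nodes_per_worker=$CTX_NODES_PER_WORKER"
```

With these defaults a worker needs 2 GPUs, which fits on a single 4-GPU node, so the node count is clamped to 1.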
## Tips
- Always validate first: Use `--dry-run` before actual execution
- Override variables: Use `--set KEY=VALUE` to customize configurations
- Override model path: Use `--artifact LOCAL_MODEL_PATH=fs:///path/to/model` to point to your actual model
- Use `--resolve`: Add `--resolve` to `sflow compose` or `sflow batch --bulk-input` to inline all variables into literal values for a fully baked config
- Check sample source: Samples are located in `src/sflow/samples/` in the sflow package