01 / 12

NV-sflow

Declarative Workflow Descriptor

Describe once. Run anywhere.

02 / 12
The Problem

Same workflow. Different infra. Rewrite everything.

Take NVIDIA Dynamo — start etcd & NATS, launch a frontend, spin up workers, and the service is up. That logical flow never changes.

But making it run on Slurm, Docker Compose, or Kubernetes requires platform-specific scripts, networking, and resource management — repeated for every new platform.

03 / 12
The Solution

Separate what to deploy from where

Describe Once

Portable YAML — tasks, deps, resources, launch methods

Swappable Backends

Slurm now. Docker & K8S planned.

Pluggable Plugins

Probes, artifacts, replicas — no platform coupling.

[Diagram: one sflow descriptor targeting Slurm, Docker, and K8S backends]
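The swap illustrated above can be sketched as a thin launcher interface. This is a hypothetical Python sketch (class and field names like `SlurmBackend`, `task["cmd"]`, and the image name are illustrative, not sflow's actual internals):

```python
from typing import Protocol

class Backend(Protocol):
    """Anything that can launch a task from the portable descriptor."""
    def launch(self, task: dict) -> str: ...

class SlurmBackend:
    # Sketch: render an srun command from the task spec.
    def launch(self, task: dict) -> str:
        return f"srun --gres=gpu:{task.get('gpus', 0)} {task['cmd']}"

class DockerBackend:
    # Planned backend: same descriptor, different launcher.
    def launch(self, task: dict) -> str:
        return f"docker run --gpus {task.get('gpus', 0)} img {task['cmd']}"

def deploy(tasks: list[dict], backend: Backend) -> list[str]:
    """The workflow description never changes; only the backend is swapped."""
    return [backend.launch(t) for t in tasks]
```

The point is the shape of the separation: tasks stay declarative, and each backend owns only the platform-specific rendering.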
04 / 12
DAG Orchestration

Workflow DAG

load_image
install_dependency
gpu_monitor
nats_server
etcd_server
frontend_server_0
frontend_server_1
frontend_server_2
nginx_server
prefill_server_0
prefill_server_1
prefill_server_2
prefill_server_3
decode_server_0
benchmark_infmax_16
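A DAG like the one above can be scheduled in parallel "waves" with Kahn's algorithm: each wave contains every task whose dependencies are already satisfied. A minimal sketch, using a few of the tasks above with illustrative dependency edges (the real edges live in the workflow YAML):

```python
def schedule_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves: every task in a wave has all of its
    dependencies satisfied by earlier waves, so a wave can launch in
    parallel (Kahn's algorithm)."""
    indeg = {t: len(d) for t, d in deps.items()}
    children: dict[str, list[str]] = {t: [] for t in deps}
    for task, parents in deps.items():
        for p in parents:
            children[p].append(task)
    wave = sorted(t for t, n in indeg.items() if n == 0)
    order = []
    while wave:
        order.append(wave)
        ready = []
        for t in wave:
            for c in children[t]:
                indeg[c] -= 1
                if indeg[c] == 0:
                    ready.append(c)
        wave = sorted(ready)
    return order

# Illustrative dependency edges for a subset of the DAG above.
deps = {
    "load_image": set(),
    "nats_server": {"load_image"},
    "etcd_server": {"load_image"},
    "frontend_server_0": {"nats_server", "etcd_server"},
    "prefill_server_0": {"frontend_server_0"},
    "decode_server_0": {"frontend_server_0"},
    "benchmark_infmax_16": {"prefill_server_0", "decode_server_0"},
}
```

Here `schedule_waves(deps)` launches `load_image` first, then NATS and etcd together, and the benchmark only after every server is up.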
05 / 12
Resource Planning

Topology-aware GPU Allocation

Allocation map (finalized node/GPU assignment):
- backend 'slurm_cluster':
  ├─ node slurm_cluster-node0
  │    GPU 0: prefill_server_0
  │    GPU 1: prefill_server_1
  │    GPU 2: prefill_server_2
  │    GPU 3: prefill_server_3
  │    Tasks: load_image, gpu_monitor, nats_server, etcd_server, frontend_server_0, ...
  ├─ node slurm_cluster-node1
  │    GPU 0: decode_server_0
  │    GPU 1: decode_server_0
  │    GPU 2: decode_server_0
  │    GPU 3: decode_server_0
  │    Tasks: load_image, gpu_monitor, frontend_server_1, decode_server_0
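The placement above can be reproduced by a simple first-fit rule: give each server its tensor-parallel (TP) group of GPUs and never split a group across nodes. A simplified sketch of that idea (greedy first-fit; sflow's real planner is presumably more sophisticated):

```python
def allocate(servers: dict[str, int], gpus_per_node: int = 4):
    """Greedy first-fit GPU placement: each server gets `tp` GPUs on a
    single node; if the current node cannot fit the group, move to the
    next node. Returns {server: [(node, gpu), ...]}."""
    placement: dict[str, list[tuple[int, int]]] = {}
    node, free = 0, gpus_per_node
    for name, tp in servers.items():
        if tp > gpus_per_node:
            raise ValueError(f"{name}: TP={tp} exceeds node size")
        if tp > free:                      # group doesn't fit -> next node
            node, free = node + 1, gpus_per_node
        start = gpus_per_node - free
        placement[name] = [(node, start + i) for i in range(tp)]
        free -= tp
    return placement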
06 / 12

Core Features

Probes

Readiness & failure gates — TCP, HTTP, log watch

Replicas

Parallel / sequential, Cartesian sweeps

{{}}

Expressions

Jinja2 ${{}} — variables, backends

📦

Artifacts

Named URIs: fs://, file://, http://

Live TUI

Rich terminal — task status, logs

📋

Batch Mode

sbatch, CSV-driven bulk sweeps
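A readiness gate like the TCP probe can be sketched in a few lines: poll until a connection succeeds or a deadline passes. This is an illustration of the idea, not sflow's actual probe implementation:

```python
import socket
import time

def tcp_probe(host: str, port: int, timeout: float = 30.0,
              interval: float = 0.2) -> bool:
    """Readiness gate: poll until a TCP connect succeeds or the overall
    timeout expires. Downstream tasks launch only after this returns True."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

HTTP and log-watch probes follow the same pattern, only the success check differs.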

07 / 12
Modular Composition

Split. Reuse. Swap.

Input file composition (5 files → merged workflow):
├─ inference_x_v2/slurm_config.yaml
│    variables: [SLURM_ACCOUNT, SLURM_PARTITION, SLURM_TIMELIMIT, ...]
│    backends: [slurm_cluster]
├─ inference_x_v2/common_workflow.yaml
│    variables: [SERVED_MODEL_NAME, MODEL_NAME, ... (+7)]
│    artifacts: [LOCAL_MODEL_PATH]
│    operators: [dynamo, nginx]
│    workflow.tasks: [load_image, install_dependency, gpu_monitor, ...]
├─ vllm/prefill.yaml
│    variables: [NUM_CTX_SERVERS, CTX_TP_SIZE, ... (+4)]
│    workflow.tasks: [prefill_server]
├─ vllm/decode.yaml
│    variables: [NUM_GEN_SERVERS, GEN_TP_SIZE, ... (+4)]
│    workflow.tasks: [decode_server]
└─ benchmark_infmax.yaml
     operators: [dynamo_benchmark]
     variables: [ISL, OSL, CONCURRENCY, ... (+1)]
     workflow.tasks: [benchmark_infmax]
Benefit         Description
Reuse           Shared components written once, used by all variants
Swap            Change benchmark by swapping benchmark_aiperf for benchmark_infmax
Mix frameworks  sglang + aiperf, vllm + infmax, trtllm + aiperf, etc.
Smaller diffs   Changes to one component only touch one file
Bulk testing    CSV combinations, generate/submit all at once
Computed vars   GPUS_PER_WORKER chains across modules
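Merging split files comes down to a recursive merge of their parsed trees. A minimal sketch using plain dicts (the merge rules here are assumptions for illustration; sflow compose's actual rules may differ):

```python
from functools import reduce

def deep_merge(base: dict, overlay: dict) -> dict:
    """Merge overlay into base: nested dicts merge recursively, lists
    concatenate, and overlay scalars win."""
    out = dict(base)
    for key, value in overlay.items():
        if isinstance(out.get(key), dict) and isinstance(value, dict):
            out[key] = deep_merge(out[key], value)
        elif isinstance(out.get(key), list) and isinstance(value, list):
            out[key] = out[key] + value
        else:
            out[key] = value
    return out

# Two toy "files" already parsed from YAML (contents illustrative).
common = {"variables": {"MODEL_NAME": "deepseek-r1"},
          "workflow": {"tasks": ["load_image"]}}
prefill = {"variables": {"CTX_TP_SIZE": 8},
           "workflow": {"tasks": ["prefill_server"]}}
merged = reduce(deep_merge, [common, prefill])
```

After the merge, `merged["workflow"]["tasks"]` contains both files' tasks, which is why swapping one file swaps one component of the workflow.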
08 / 12
Real-world Debugging

Structured Error Analysis

Workflow: b200-fp8-low-latency-tep8-1p-1d
Model: DeepSeek R1 FP8 | 2 nodes × 8 GPUs | ISL=8192, OSL=1024

Allocation Map
├─ slurm-node-01 (node 0)
│    GPU 0-7: prefill_server_0 (TP=8)
│    Also: load_image, nats, etcd, frontend, benchmark_*
└─ slurm-node-02 (node 1)
     GPU 0-7: decode_server_0 (TP=8)
     Also: load_image, gpu_monitor

Timeline
01:57:08 — load_image + install_aiperf submitted
01:59:10 — load_image COMPLETED on both nodes
01:59:43 — nats_server READY
01:59:45 — etcd_server READY
02:00:31 — frontend_server_0 READY (10.52.32.8)
02:05:14 — prefill + decode READY → benchmark_4 starts
02:05:20 — HTTP 500 — all benchmark requests fail
02:05:41 — Workflow finished (8m 33s)

Error from frontend logs:
Invalid TCP address 'dynamo_prefill.generate-58b49ce145f56609'
Invalid TCP address 'dynamo_backend.generate-58b49ce145f5660b'

Diagnosis — sflow orchestration vs application error

Layer      Status  Detail
sflow      ✓ OK    DAG executed, all tasks launched, probes passed
GPU alloc  ✓ OK    8 GPUs/node, no overlap, TP=8 per server
Infra      ✓ OK    etcd, NATS, frontend all READY
Routing    ✗ FAIL  Frontend gets service names, not host:port
Benchmark  ✗ FAIL  0/800 requests succeed (all HTTP 500)
Root Cause
Frontend receives discovery service names (e.g. dynamo_prefill.generate-58b49...) instead of host:port addresses. The NATS/etcd service registry returns internal identifiers that the TCP router can't parse.

Fix: verify that DYN_REQUEST_PLANE and the frontend networking config match SGLang disagg routing.
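The failure mode is easy to see in miniature: a router that expects host:port rejects anything without a numeric port. An illustrative parser (not the actual Dynamo or sflow code) that reproduces the log message above:

```python
def parse_tcp_address(addr: str) -> tuple[str, int]:
    """Split 'host:port' into its parts; reject bare service names,
    which have no numeric port suffix."""
    host, sep, port = addr.rpartition(":")
    if not sep or not port.isdigit():
        raise ValueError(f"Invalid TCP address {addr!r}")
    return host, int(port)
```

A discovery identifier such as `dynamo_prefill.generate-58b49ce145f56609` has no `:port` suffix, so it fails exactly this kind of check, even though the services behind it are healthy.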
09 / 12

CLI at a Glance

Command          Purpose                  Key Flags
sflow run        Execute a workflow       --dry-run --tui --set
sflow batch      Generate sbatch scripts  --submit --bulk-input
sflow compose    Merge multiple YAMLs     --resolve --validate
sflow visualize  Render DAG image         --format png/svg/mermaid
sflow sample     List / copy examples     --list -o
sflow skill      Export AI agent skills   --list -o
10 / 12
AI-Native

Built-in Agent Skills

sflow ships with AI agent skills that teach coding assistants (Cursor, Copilot) how to write sflow YAML and debug errors — no training required.

$ sflow skill --list
Available AI agent skills:
  - writing-sflow-yaml      Schema, examples, validation
  - sflow-error-analysis    Error catalog, diagnostics
  + AGENTS.md (agent workflow guidelines)

$ sflow skill -o .cursor/skills
Skills will be copied to: .cursor/skills/
Note: directory exists — files will be merged.
✓ Skills copied to: .cursor/skills/

What the agent learns from AGENTS.md

Capability       Detail
Write YAML       Schema-aware config authoring with validation
Debug errors     Categorize errors, read logs, suggest fixes
GPU planning     TP/DP/PP sizing, node allocation math
Modular compose  Split configs, missable tasks, CSV sweeps
Step-by-step     Hardcode first → validate → parameterize
AGENTS.md workflow:
1. Gather cluster info (account, partition, GPUs)
2. Write minimal plain-text config
3. sflow run --dry-run to validate
4. Run and debug with --tui
5. Extract variables, parameterize
6. Modularize for multi-framework support
11 / 12
Vision

Heterogeneous Computing

As a top-level orchestration abstraction, sflow is uniquely positioned to unify heterogeneous compute — allocating the right accelerator from the right resource pool for each stage of a disaggregated pipeline.

[Diagram: sflow routes prefill_server to a Vera Rubin GPU pool and decode_server to an LPU accelerator cluster]

Example: PD disaggregation — one sflow.yaml routes prefill to Vera Rubin GPU pools for maximum throughput, and decode to LPU accelerators for low-latency token generation. Each task lands on the optimal hardware automatically.
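That routing can be sketched as a capacity-tracked assignment of tasks to named pools. A toy Python sketch (pool names and sizes are illustrative, not a real sflow API):

```python
def route(requests: list[tuple[str, str, int]],
          pools: dict[str, int]) -> dict[str, str]:
    """Assign each (task, pool, accelerator_count) request to its declared
    pool, tracking remaining capacity per pool."""
    free = dict(pools)
    placed: dict[str, str] = {}
    for task, pool, count in requests:
        if free.get(pool, 0) < count:
            raise RuntimeError(f"pool {pool!r} exhausted for {task}")
        free[pool] -= count
        placed[task] = pool
    return placed

# Toy PD-disaggregation request: prefill on GPUs, decode on LPUs.
placed = route(
    [("prefill_server", "vera_rubin_gpu", 8), ("decode_server", "lpu", 16)],
    {"vera_rubin_gpu": 8, "lpu": 16},
)
```

The descriptor only names the pool each stage wants; the orchestrator does the bookkeeping, which is what makes per-stage heterogeneous placement declarative.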

12 / 12
What’s Next

Slurm today. Docker & Kubernetes tomorrow.

Slurm lacks a workflow orchestration layer — that’s where sflow starts. Docker and K8S backends are planned, leveraging native ecosystems (Helm charts, Argo Workflows).

Slurm ✓ · Docker (planned) · Kubernetes (planned)

$ uv pip install "sflow @ git+https://github.com/NVIDIA/nv-sflow.git@main"

github.com/NVIDIA/nv-sflow