Introduction

sflow is a declarative workflow descriptor that separates what to deploy from where to deploy it.

An application's deployment steps are usually logically the same regardless of the underlying infrastructure. Take NVIDIA Dynamo as an example: you start etcd and NATS, launch a frontend server, spin up workers that register to the frontend, and the service is up. That logical flow never changes — but making it actually run on Slurm, Docker Compose, or Kubernetes requires a different set of infrastructure-specific scripts, resource management, and networking tweaks each time, and the effort must be repeated for every new platform.

sflow eliminates this duplication. You describe the workflow once in a portable YAML format — tasks, dependencies, resources, and launch methods — and sflow delegates execution to the target infrastructure through swappable backends, leveraging each platform's native ecosystem (e.g. Kubernetes, Helm charts, Argo Workflows) rather than reimplementing it.

Pluggable extensions such as probes and artifacts integrate naturally without coupling your workflow to any specific platform. Write one sflow.yaml and run it across environments with minimal changes.
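As an illustrative sketch of that idea, the Dynamo flow described above might look like the following. Only `version`, `workflow`, `tasks`, `name`, `script`, and `depends_on` are fields documented in this guide; the task names, commands, and the `FRONTEND_ADDR` variable are hypothetical:

```yaml
version: 1

workflow:
  name: dynamo-service
  tasks:
    - name: etcd
      script:
        - etcd --listen-client-urls http://0.0.0.0:2379
    - name: nats
      script:
        - nats-server
    - name: frontend
      depends_on: [etcd, nats]
      script:
        - python -m dynamo.frontend
    - name: worker
      depends_on: [frontend]
      script:
        - python -m dynamo.worker --frontend-url ${FRONTEND_ADDR}
```

The logical flow (stores, then frontend, then workers) lives entirely in the YAML; the backend decides where each task actually runs.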

The current focus is Slurm, which — unlike Kubernetes or Docker — lacks a built-in workflow orchestration layer, making multi-step deployments especially cumbersome. Docker and Kubernetes backends are planned to follow.

sflow TUI

Use Cases

Complex Slurm Workflows

sflow streamlines orchestration within Slurm clusters with built-in support for:

  • Automatic hostname/IP detection after allocation
  • Workload distribution across nodes and GPUs
  • Runtime readiness and failure checks (probes)
  • Replica scaling (parallel workers, sweeps)

Define what you want to run — no more hand-crafted bash scripts to manage resource placement or ensure processes land on the right nodes and GPUs. Below is an example DAG for a Dynamo PD disaggregated LLM inference service:

Cross-Environment Orchestration

Codify startup order, replica scale, readiness probes, and log capture in YAML — then run the same file locally or on a cluster by switching the backend.

Benchmarking & Experiment Automation

Standardize how you launch runs, capture logs/artifacts, and structure outputs so results are reproducible across teams and machines.

Local Development & Testing

Use the local backend with the bash operator to validate your DAG and scripts on your laptop before moving to a Slurm cluster.
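A minimal sketch of that pattern (the per-task `operator` key and the empty `local` backend entry are assumptions about the schema; see the Configuration page for the authoritative layout):

```yaml
version: 1

backends:
  local: {}          # assumption: a local backend needs no extra settings

workflow:
  name: smoke-test
  tasks:
    - name: validate
      operator: bash  # assumption: the operator is selected per task
      script:
        - echo "DAG wiring and scripts look OK"
```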

Core Concepts

| Concept | Description |
| --- | --- |
| Workflow | A set of tasks wired into a DAG via `depends_on`. |
| Task | An executable unit. The key field is `script` — a list of lines joined into a bash script. |
| Backend | Where compute comes from. Built-ins: `slurm` (allocates via `salloc`) and `local` (simulates nodes on the local machine). |
| Operator | How a task is launched. Built-ins: `bash`, `srun`, `docker`, `ssh`, `python`. Named operators let you preset flags and reuse them across tasks. |
| Variable | A named value referenced as `${{ variables.NAME }}` in YAML or `${NAME}` in scripts. Override from the CLI with `--set`. |
| Expression | Jinja2-based `${{ ... }}` syntax inside YAML to reference variables, backend info, task metadata, and more (e.g. `${{ backends.slurm.nodes[0].ip_address }}`). Supports filters (`${{ [a, b] \| min }}`), conditionals, and list indexing. |
| Artifact | A named external resource (model, config, dataset) referenced by URI and resolved to a local path at runtime. |
| Probe | A health-check gate. Readiness probes block dependents until a service is live; failure probes terminate the workflow when a fatal condition is detected. |
| Replica | A task can be replicated N times (parallel or sequential) with per-replica variable overrides for sweeps. |
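Several of these concepts combined in one hedged sketch — the probe and replica field names below are assumptions; the Probes and Replicas pages document the real schema:

```yaml
variables:
  NUM_WORKERS: 4

workflow:
  name: concept-demo
  tasks:
    - name: server
      script:
        - ./serve --port 8000
      probes:                  # assumption: how probes attach to a task
        readiness:
          script:
            - curl -sf http://localhost:8000/health
    - name: worker
      depends_on: [server]
      replicas: ${{ variables.NUM_WORKERS }}  # assumption: replica count field
      script:
        - ./worker --server localhost:8000
```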

For detailed architecture diagrams, execution flow, assembly pipeline, orchestrator internals, plugin reference, and output structure, see Architecture.

How to Use sflow (General Workflow)

Modular Workflow

For larger projects, split config into composable modules and pass them directly to sflow run or sflow batch — no separate compose step required. This enables framework swapping, benchmark mixing, and CSV-driven parameter sweeps. See Modular Workflows for details.

Config Merging Rules

When multiple YAML files are provided:

| Section | Merge Strategy |
| --- | --- |
| `version` | Must match across all files |
| `variables` | Merge by name (later overrides earlier) |
| `artifacts` | Merge by name |
| `backends` | Merge by name |
| `operators` | Merge by name |
| `workflow.tasks` | Concatenated (later files append tasks) |
| `workflow.name` | Last non-null wins |
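For example, given two files (contents illustrative), merging follows the rules above:

```yaml
# base.yaml
version: 1
variables:
  MODEL_NAME: base-model
workflow:
  name: serve-and-bench
  tasks:
    - name: server
      script:
        - ./serve ${MODEL_NAME}

# extra.yaml
version: 1                 # must match base.yaml
variables:
  MODEL_NAME: tuned-model  # merged by name: overrides base.yaml
workflow:
  tasks:
    - name: bench          # concatenated: appended after server
      depends_on: [server]
      script:
        - ./bench
```

The merged config keeps `workflow.name: serve-and-bench` (extra.yaml leaves it unset, so the last non-null value wins), resolves `MODEL_NAME` to `tuned-model`, and runs `server` followed by `bench`.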

Expression System

The ${{ ... }} expression syntax (powered by Jinja2) provides access to the full runtime context:

| Namespace | Example | Description |
| --- | --- | --- |
| `variables` | `${{ variables.MODEL_NAME }}` | Resolved variable value |
| `artifacts` | `${{ artifacts.MODEL.path }}` | Artifact local path |
| `backends` | `${{ backends.slurm.nodes[0].ip_address }}` | Backend node info |
| `task` | `${{ task.assigned_nodes }}` | Current task's node assignment |
| Filters | `${{ [a, b] \| min }}` | Jinja2 filters |

Expressions are resolved in phases — variables first, then backends, then artifacts, then task-level — so later phases can reference earlier results.
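A small illustration of that phase order (the commands and values are hypothetical; the namespaces match the table above):

```yaml
variables:
  BASE_PORT: 8000

workflow:
  tasks:
    - name: frontend
      script:
        # variables resolve first, then backends, so both are available here
        - ./frontend --host ${{ backends.slurm.nodes[0].ip_address }} --port ${{ variables.BASE_PORT }}
        # Jinja2 filters also work inside expressions
        - echo "lowest port: ${{ [variables.BASE_PORT, 9000] | min }}"
```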

Known Limitations

The following features are not yet implemented in the current release:

  • `sflow run --resume` — raises `NotImplementedError`
  • `sflow run --task` — raises `BadParameter`
  • `hf://` and `docker://` artifact materialization — raises `NotImplementedError`

This user guide reflects actual code behavior. Not all planned features may be available yet.

Next Steps

| Topic | Page |
| --- | --- |
| Architecture, execution flow, plugins | Architecture |
| Run a minimal example | Quickstart |
| Variables, expressions, env injection | Variables |
| Named inputs (paths, images, etc.) | Artifacts |
| Compute backends (local, Slurm) | Backends |
| Task launch methods (bash, srun, containers) | Operators |
| Node/GPU placement, `CUDA_VISIBLE_DEVICES` | Resources |
| Parallel/sequential replicas, sweeps | Replicas |
| Composable configs, sweeps, missable tasks | Modular Workflows |
| Readiness/failure gates for services | Probes |
| Log and output directory structure | Outputs & Logs |
| Full `sflow.yaml` schema | Configuration |
| CLI options | CLI Reference |
| Frequently asked questions | FAQ |
Frequently asked questionsFAQ