Introduction

sflow is a declarative workflow descriptor that separates what to deploy from where to deploy it.

An application's deployment steps are usually logically the same regardless of the underlying infrastructure. Take NVIDIA Dynamo as an example: you start etcd and NATS, launch a frontend server, spin up workers that register to the frontend, and the service is up. That logical flow never changes — but making it actually run on Slurm, Docker Compose, or Kubernetes requires a different set of infrastructure-specific scripts, resource management, and networking tweaks each time, and the effort must be repeated for every new platform.

sflow eliminates this duplication. You describe the workflow once in a portable YAML format — tasks, dependencies, resources, and launch methods — and sflow delegates execution to the target infrastructure through swappable backends, leveraging each platform's native ecosystem (e.g. Kubernetes, Helm charts, Argo Workflows) rather than reimplementing it.

Pluggable extensions such as probes and artifacts integrate naturally without coupling your workflow to any specific platform. Write one sflow.yaml and run it across environments with minimal changes.
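As an illustrative sketch of that idea, the Dynamo flow described above might look like the following. Only `version`, `workflow`, `tasks`, `name`, `script`, and `depends_on` are fields documented in this guide; the task names, commands, and the `FRONTEND_ADDR` variable are hypothetical:

```yaml
version: 1

workflow:
  name: dynamo-service
  tasks:
    - name: etcd
      script:
        - etcd --listen-client-urls http://0.0.0.0:2379
    - name: nats
      script:
        - nats-server
    - name: frontend
      depends_on: [etcd, nats]
      script:
        - python -m dynamo.frontend
    - name: worker
      depends_on: [frontend]
      script:
        - python -m dynamo.worker --frontend-url ${FRONTEND_ADDR}
```

The logical flow (stores, then frontend, then workers) lives entirely in the YAML; the backend decides where each task actually runs.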

The current focus is Slurm, which — unlike Kubernetes or Docker — lacks a built-in workflow orchestration layer, making multi-step deployments especially cumbersome. Docker and Kubernetes backends are planned to follow.

sflow TUI

Use Cases

Complex Slurm Workflows

sflow streamlines orchestration within Slurm clusters with built-in support for:

  • Automatic hostname/IP detection after allocation
  • Workload distribution across nodes and GPUs
  • Runtime readiness and failure checks (probes)
  • Replica scaling (parallel workers, sweeps)

Define what you want to run — no more hand-crafted bash scripts to manage resource placement or ensure processes land on the right nodes and GPUs. Below is an example DAG for a Dynamo PD disaggregated LLM inference service:

Cross-Environment Orchestration

Codify startup order, replica scale, readiness probes, and log capture in YAML — then run the same file locally or on a cluster by switching the backend.

Benchmarking & Experiment Automation

Standardize how you launch runs, capture logs/artifacts, and structure outputs so results are reproducible across teams and machines.

Local Development & Testing

Use the local backend with the bash operator to validate your DAG and scripts on your laptop before moving to a Slurm cluster.
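A minimal sketch of that pattern (the per-task `operator` key and the empty `local` backend entry are assumptions about the schema; see the Configuration page for the authoritative layout):

```yaml
version: 1

backends:
  local: {}          # assumption: a local backend needs no extra settings

workflow:
  name: smoke-test
  tasks:
    - name: validate
      operator: bash  # assumption: the operator is selected per task
      script:
        - echo "DAG wiring and scripts look OK"
```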

Core Concepts

| Concept | Description |
| --- | --- |
| Workflow | A set of tasks wired into a DAG via `depends_on`. |
| Task | An executable unit. The key field is `script` — a list of lines joined into a bash script. |
| Backend | Where compute comes from. Built-ins: `slurm` (allocates via `salloc`) and `local` (simulates nodes on the local machine). |
| Operator | How a task is launched. Built-ins: `bash`, `srun`, `docker`, `ssh`, `python`. Named operators let you preset flags and reuse them across tasks. |
| Variable | A named value referenced as `${{ variables.NAME }}` in YAML or `${NAME}` in scripts. Override from the CLI with `--set`. |
| Expression | Jinja2-based `${{ ... }}` syntax inside YAML to reference variables, backend info, task metadata, and more (e.g. `${{ backends.slurm.nodes[0].ip_address }}`). Supports filters (`${{ [a, b] \| min }}`), conditionals, and list indexing. |
| Artifact | A named external resource (model, config, dataset) referenced by URI and resolved to a local path at runtime. |
| Probe | A health-check gate. Readiness probes block dependents until a service is live; failure probes terminate the workflow when a fatal condition is detected. |
| Replica | A task can be replicated N times (parallel or sequential) with per-replica variable overrides for sweeps. |
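Several of these concepts combined in one hedged sketch — the probe and replica field names below are assumptions; the Probes and Replicas pages document the real schema:

```yaml
variables:
  NUM_WORKERS: 4

workflow:
  name: concept-demo
  tasks:
    - name: server
      script:
        - ./serve --port 8000
      probes:                  # assumption: how probes attach to a task
        readiness:
          script:
            - curl -sf http://localhost:8000/health
    - name: worker
      depends_on: [server]
      replicas: ${{ variables.NUM_WORKERS }}  # assumption: replica count field
      script:
        - ./worker --server localhost:8000
```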

For detailed architecture diagrams, execution flow, assembly pipeline, orchestrator internals, plugin reference, and output structure, see Architecture.

How to Use sflow (General Workflow)

Modular Workflow

For larger projects, split config into composable modules and pass them directly to sflow run or sflow batch — no separate compose step required. This enables framework swapping, benchmark mixing, and CSV-driven parameter sweeps. See Modular Workflows for details.

Config Merging Rules

When multiple YAML files are provided:

| Section | Merge Strategy |
| --- | --- |
| `version` | Must match across all files |
| `variables` | Merge by name (later overrides earlier) |
| `artifacts` | Merge by name |
| `backends` | Merge by name |
| `operators` | Merge by name |
| `workflow.tasks` | Concatenated (later files append tasks) |
| `workflow.name` | Last non-null wins |
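For example, given two files (contents illustrative), merging follows the rules above:

```yaml
# base.yaml
version: 1
variables:
  MODEL_NAME: base-model
workflow:
  name: serve-and-bench
  tasks:
    - name: server
      script:
        - ./serve ${MODEL_NAME}

# extra.yaml
version: 1                 # must match base.yaml
variables:
  MODEL_NAME: tuned-model  # merged by name: overrides base.yaml
workflow:
  tasks:
    - name: bench          # concatenated: appended after server
      depends_on: [server]
      script:
        - ./bench
```

The merged config keeps `workflow.name: serve-and-bench` (extra.yaml leaves it unset, so the last non-null value wins), resolves `MODEL_NAME` to `tuned-model`, and runs `server` followed by `bench`.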

Expression System

The ${{ ... }} expression syntax (powered by Jinja2) provides access to the full runtime context:

| Namespace | Example | Description |
| --- | --- | --- |
| `variables` | `${{ variables.MODEL_NAME }}` | Resolved variable value |
| `artifacts` | `${{ artifacts.MODEL.path }}` | Artifact local path |
| `backends` | `${{ backends.slurm.nodes[0].ip_address }}` | Backend node info |
| `task` | `${{ task.assigned_nodes }}` | Current task's node assignment |
| Filters | `${{ [a, b] \| min }}` | Jinja2 filters |

Expressions are resolved in phases — variables first, then backends, then artifacts, then task-level — so later phases can reference earlier results.
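A small illustration of that phase order (the commands and values are hypothetical; the namespaces match the table above):

```yaml
variables:
  BASE_PORT: 8000

workflow:
  tasks:
    - name: frontend
      script:
        # variables resolve first, then backends, so both are available here
        - ./frontend --host ${{ backends.slurm.nodes[0].ip_address }} --port ${{ variables.BASE_PORT }}
        # Jinja2 filters also work inside expressions
        - echo "lowest port: ${{ [variables.BASE_PORT, 9000] | min }}"
```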

Known Limitations

The following features are not yet implemented in the current release:

  • `sflow run --resume` — raises `NotImplementedError`
  • `sflow run --task` — raises `BadParameter`
  • `hf://` and `docker://` artifact materialization — raises `NotImplementedError`

This user guide reflects actual code behavior. Not all planned features may be available yet.

Next Steps

| Topic | Page |
| --- | --- |
| Architecture, execution flow, plugins | Architecture |
| Run a minimal example | Quickstart |
| Variables, expressions, env injection | Variables |
| Named inputs (paths, images, etc.) | Artifacts |
| Compute backends (local, Slurm) | Backends |
| Task launch methods (bash, srun, containers) | Operators |
| Node/GPU placement, `CUDA_VISIBLE_DEVICES` | Resources |
| Parallel/sequential replicas, sweeps | Replicas |
| Composable configs, sweeps, missable tasks | Modular Workflows |
| Readiness/failure gates for services | Probes |
| Log and output directory structure | Outputs & Logs |
| Full `sflow.yaml` schema | Configuration |
| CLI options | CLI Reference |
| Frequently asked questions | FAQ |
Frequently asked questionsFAQ