Configuration
sflow uses a YAML config file (default name: sflow.yaml). Top-level structure:
version: "0.1"
variables: ...
artifacts: ...
backends: ...
operators: ...
workflow: ...
This page follows the current schema in code (src/sflow/config/schema.py).
Looking for a quick lookup of all config fields? See the Quick Reference table.
version
Currently supported:
version: "0.1"
variables
Variables can be written as a dict or a list (they are normalized internally).
Dict form (recommended):
variables:
SLURM_PARTITION:
description: "Slurm partition"
value: debug
List form:
variables:
- name: SLURM_PARTITION
description: "Slurm partition"
value: debug
How to use:
- In YAML expressions:
${{ variables.SLURM_PARTITION }} - In task scripts (as env var):
${SLURM_PARTITION}
Override variables via CLI (--set)
sflow run --file sflow.yaml --set SLURM_PARTITION=debug --set NUM_GPUS=4
Notes:
--setcan only override variables that already exist in the config; otherwise it errors.- Values use simple type inference (int/float/bool/string).
artifacts
artifacts are “named resources” you can reference via ${{ artifacts.NAME.path }} in expressions.
In v0.1, the code resolves uri to a local path only for:
fs://<path>file://<path>
Other schemes are kept as-is (path remains the URI string). No automatic download/pull happens today.
Example:
artifacts:
model_dir:
uri: fs://./models/qwen
Override artifacts via CLI (--artifact)
sflow run --file sflow.yaml --artifact model_dir=fs:///mnt/models/qwen
Same requirement: the artifact must already be defined in artifacts, otherwise it errors.
backends
local backend
backends:
local:
type: local
default: true
nodes: 1
slurm backend
backends:
slurm_cluster:
type: slurm
default: true
account: ${{ variables.SLURM_ACCOUNT }}
partition: ${{ variables.SLURM_PARTITION }}
time: 00:30:00
nodes: 2
extra_args:
- "--gpus-per-node=8"
If you are already inside a Slurm allocation (e.g. via salloc or sbatch), you can use:
sflow run --file sflow.yaml
This skips salloc and attempts to infer node info from the current environment (SLURM_JOB_ID/SLURM_JOB_NODELIST).
operators
An operator defines how a task is launched. Two common ones:
bash: run locally via bashsrun: run via Slurmsrun(supports common Pyxis--container-*flags)
Example (bash):
operators:
local_bash:
type: bash
Example (srun + container):
operators:
runtime:
type: srun
container_image: nvcr.io/xxx/yyy:tag
container_mount_home: false
container_mounts:
- "/mnt:/mnt:rw"
extra_args:
- "--shm-size=64g"
workflow / tasks
Minimal task:
workflow:
name: demo
tasks:
- name: step1
script:
- echo "hello"
depends_on
- name: step2
depends_on: [step1]
script:
- echo "step2"
resources (nodes / GPUs)
- name: server
resources:
nodes:
indices: [0]
gpus:
count: 4
script:
- echo "server on node0, 4 gpus"
replicas
Run multiple instances of a task in parallel or sequentially:
- name: worker
replicas:
count: 4
policy: parallel
script:
- echo "replica $SFLOW_REPLICA_INDEX"
See Replicas for details on policies, variable sweeps, and GPU allocation.
probes (readiness / failure)
Probes are useful for service-style tasks (e.g. start a server, then run a client):
- name: api_server
script:
- python -m http.server 8000
probes:
readiness:
tcp_port:
port: 8000