# Probes
Probes let you gate task execution on an external condition, like:
- “wait until a TCP port is open”
- “wait until an HTTP endpoint returns success”
- “wait until a log line appears”
- “fail the workflow early when an error pattern appears”
You can use probes under:
- `probes.readiness`: wait before treating the task as ready (so dependents can run)
- `probes.failure`: mark the task as failed early if a failure condition is met
Common timing options:
- `delay`: seconds before the first check (default `0`)
- `timeout`: overall readiness deadline in seconds (default `1200`). Only readiness probes time out the task.
- `each_check_timeout`: per-check timeout in seconds (default `30`)
- `interval`: seconds between checks (default `5`)
- `success_threshold`: consecutive successful readiness checks required (default `1`)
- `failure_threshold`: consecutive matching failure checks required (default `3`)
`readiness` may be a single probe or a list of probes. When multiple readiness probes are configured, the task becomes ready only after every readiness probe has triggered.
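For example, a task could wait on both an open port and a log line. This fragment is a sketch: the exact list syntax, with each probe nested under its type key, is assumed from the description above, and the field values are illustrative:

```yaml
probes:
  readiness:
    - tcp_port:
        port: 8000
    - log_watch:
        match_pattern: "server ready"
```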
## Readiness: TCP port probe
Example:
```yaml
version: "0.1"
workflow:
  name: http_echo
  tasks:
    - name: echo_server
      script:
        - python3 -m http.server 8000
      probes:
        readiness:
          tcp_port:
            port: 8000
            timeout: 30
            interval: 1
    - name: echo_client
      depends_on: [echo_server]
      script:
        - curl -sf http://127.0.0.1:8000/ > /dev/null
```
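Conceptually, each `tcp_port` check succeeds when a TCP connection can be opened within the per-check timeout. A minimal Python sketch of that semantics (an illustration, not sflow's implementation):

```python
import socket

def tcp_check(host, port, timeout=30.0):
    """One probe check: True iff a TCP connection to host:port
    opens within `timeout` seconds (mirrors each_check_timeout)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

sflow repeats such a check every `interval` seconds until it passes or the overall `timeout` expires.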
## Readiness: HTTP probes
Use `http_get` or `http_post` when an HTTP endpoint is a better health signal than an open port:
```yaml
workflow:
  name: http_ready
  tasks:
    - name: api_server
      script:
        - python -m my_server --port 8000
      probes:
        readiness:
          http_get:
            url: "http://127.0.0.1:8000/health"
            headers:
              Accept: application/json
            timeout: 120
            interval: 2
    - name: client
      depends_on: [api_server]
      script:
        - curl -sf http://127.0.0.1:8000/health
```
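What an `http_get` check amounts to can be sketched with the standard library, taking any 2xx response as success (an illustration of the idea, not sflow's code):

```python
from urllib.request import Request, urlopen
from urllib.error import URLError

def http_get_check(url, headers=None, timeout=30.0):
    """One probe check: True iff the endpoint answers with a 2xx status."""
    try:
        req = Request(url, headers=headers or {})
        with urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (URLError, OSError):
        # connection refused, DNS failure, timeout, or non-2xx HTTPError
        return False
```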
`http_post` supports the same `url` and `headers` fields plus an optional `body`:
```yaml
probes:
  readiness:
    http_post:
      url: "http://127.0.0.1:8000/v1/health"
      headers:
        Content-Type: application/json
      body: '{"ping": true}'
```
## Readiness: log watch probe (+ retries)
`log_watch` scans a task's log file for a matching string.
Pattern field: use exactly one of the two, not both:

| Field | Description |
|---|---|
| `regex_pattern` | Original field name |
| `match_pattern` | Alias (identical behavior, for forward compatibility) |
Matching behavior:
- By default the pattern is treated as a literal string match: characters like `(`, `)`, `.`, `*` are matched as-is, no escaping needed.
- To use a real regex, prefix the pattern with `re:` (or `regex:`).
| Pattern value | What it matches |
|---|---|
| `"server started"` | Literal text `server started` |
| `"Traceback (most recent call last)"` | Literal text including the parentheses |
| `"re:worker_\\d+ ready"` | Regex: `worker_` followed by one or more digits, then ` ready` |
| `"regex:ERROR\|FATAL"` | Regex: `ERROR` or `FATAL` |
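The dispatch in the table can be sketched in Python (an illustration of the matching rules, not sflow's implementation):

```python
import re

def pattern_matches(pattern, line):
    """Literal substring match by default; a real regex search when the
    pattern carries a re: or regex: prefix."""
    for prefix in ("re:", "regex:"):
        if pattern.startswith(prefix):
            return re.search(pattern[len(prefix):], line) is not None
    return pattern in line  # literal: (, ), ., * need no escaping
```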
Other options:
- `logger`: watch another task's log instead of the current task's (must be a valid task name)
- `match_count`: number of times the pattern must appear before the probe passes (default `1`)
```yaml
workflow:
  name: wf
  tasks:
    - name: worker
      script:
        - echo "Setting PyTorch memory fraction"
        - sleep 999
      probes:
        readiness:
          log_watch:
            regex_pattern: "Setting PyTorch memory fraction"
            timeout: 600
            interval: 10
      retries:
        count: 3
        interval: 10
        backoff: 2
```
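If `backoff` is an exponential multiplier on the retry `interval` (an assumption; the example alone does not spell out the semantics), the three retries above would wait 10, 20, and 40 seconds:

```python
def retry_delays(count, interval, backoff=1.0):
    """Hypothetical sketch: seconds to wait before each retry,
    assuming `backoff` multiplies the interval after every attempt."""
    return [interval * backoff ** i for i in range(count)]

# count: 3, interval: 10, backoff: 2 (values from the example above)
print(retry_delays(3, 10, 2))  # [10, 20, 40]
```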
## Failure probes
Failure probes watch for conditions that should stop the workflow early. A common pattern is to watch long-running server logs for tracebacks or fatal errors:
```yaml
workflow:
  name: wf
  tasks:
    - name: server
      script:
        - start_server.sh
      probes:
        readiness:
          log_watch:
            match_pattern: "server ready"
            timeout: 600
        failure:
          log_watch:
            match_pattern: "Traceback (most recent call last)"
            match_count: 1
            interval: 2
            failure_threshold: 1
    - name: benchmark
      depends_on: [server]
      script:
        - run_benchmark.sh
```
When a failure probe triggers, sflow marks the task as failed by probe and cancels downstream work through fail-fast. Failure probes do not use the overall `timeout` as a deadline; they keep checking for as long as the task is running. `each_check_timeout` still applies to each individual check.
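The `failure_threshold` counting described earlier (consecutive matching checks, default 3) can be sketched as:

```python
def failure_probe_fires(check_results, failure_threshold=3):
    """True once `failure_threshold` consecutive checks match;
    a non-matching check resets the streak (sketch, not sflow's code)."""
    streak = 0
    for matched in check_results:
        streak = streak + 1 if matched else 0
        if streak >= failure_threshold:
            return True
    return False
```

Setting `failure_threshold: 1`, as in the example above, fails the task on the first matching check.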
## Replicas and HTTP probe deduplication
For parallel replicas, identical HTTP probes that do not reference per-replica values are checked once on the first replica and propagated to follower replicas. This avoids sending the same health check N times when all replicas share one service endpoint.
sflow keeps a separate HTTP probe on every replica when the probe references a per-replica value such as:
- a swept variable from `replicas.variables`
- `SFLOW_REPLICA_INDEX`
TCP probes always stay per replica because each replica may expose a different port or node binding.