Workflow Lifecycle#
When you submit a workflow, it progresses through distinct phases from submission to completion. This page explains what happens at each stage and what you should expect to see.
Overview#
Every successful workflow follows this path:
PENDING ⏱️
Setting up resources and preparing to run
RUNNING ▶️
Your tasks are executing
COMPLETED ✓
All tasks finished successfully
Tip
Most workflows follow this simple progression. If you see other statuses like WAITING or various FAILED states, see the sections below to understand what’s happening.
Task Lifecycle#
SUBMITTING
What’s happening:
Workflow YAML is being validated (syntax, names, resources)
Credentials are checked (registry and data access)
Resource requests are matched against pool capacity
Common issues:
Invalid/missing credentials → Configure with `osmo credential set`
Resource requests too large → Reduce GPU/CPU/memory or verify with `osmo pool list` or `osmo resource list`
WAITING
What’s happening:
Task is waiting for upstream tasks to complete
No resources are being consumed during this phase
Common issues:
Upstream task failures will cause dependent tasks to fail
Long wait times if upstream tasks are slow or queued
PROCESSING
What’s happening:
Converting task specification to backend format
Submitting task to backend scheduler
Common issues:
Rare - usually internal processing errors
If stuck here, please contact your administrator
SCHEDULING
What’s happening:
Task is in the backend queue
Waiting for nodes with requested resources (CPU, GPU, memory)
Priority and queue position determine the order in which tasks are scheduled
Common issues:
Insufficient resources in pool → Check pool capacity: `osmo pool list`
Resource requests too large → Reduce GPU/CPU/memory requests or request a larger pool
Queue timeout exceeded → Increase `queue_timeout` (see Timeouts)
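If the queue timeout keeps being exceeded, you can raise it in the workflow spec. A minimal sketch follows; only the `queue_timeout` name comes from this page, while its placement and the duration format shown are assumptions, so check Timeouts for the authoritative schema.

```yaml
# Sketch only: queue_timeout is named on this page, but its placement in the
# spec and the duration format shown here are assumptions -- see Timeouts.
tasks:
  - name: train
    image: my-image
    ...
    queue_timeout: 2h   # hypothetical value format
```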
INITIALIZING
What’s happening:
Pulling Docker image (if not already cached on the node)
Running preflight tests
Preparing container environment
Common issues:
Image doesn’t exist → FAILED_IMAGE_PULL - Check image name and registry
No pull credentials → FAILED_IMAGE_PULL - Verify registry credentials
Image pull timeout → FAILED_START_TIMEOUT - Image too large or network issues
Preflight test failures → FAILED_START_ERROR - Container startup problems
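Many of these failures trace back to an image reference the backend cannot resolve or authenticate against. The sketch below uses a hypothetical registry, repository, and tag purely to illustrate a fully qualified `image` value that matches the registry credentials you configured; adapt it to your own registry.

```yaml
tasks:
  - name: train
    # Hypothetical registry/repository/tag, shown only to illustrate a fully
    # qualified image reference; it must match a registry your credentials cover.
    image: registry.example.com/my-team/my-image:1.2.0
    ...
```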
RUNNING
What’s happening:
Three sequential activities occur during the RUNNING phase:
Input download - Sidecar container downloads any specified inputs from:
Upstream task outputs
Datasets
Cloud storage URLs
Command execution - Your code runs in the container:
Standard output/error is captured in logs
Exec and port-forwarding are available during this time
You can interact with the running task
Output upload - After your command completes, sidecar uploads outputs:
Files from the output directory are uploaded
Uploads to specified locations or intermediate storage
Happens before status changes to COMPLETED
Common issues:
Execution timeout exceeded → FAILED_EXEC_TIMEOUT - Increase `exec_timeout` or optimize code
Memory limits exceeded → FAILED_EVICTED - Request more memory or reduce usage
Storage limits exceeded → FAILED_EVICTED - Clean up intermediate files or request more storage
Node failures → FAILED_BACKEND_ERROR - Infrastructure issue, consider auto-reschedule
Command exits with error → FAILED - Check logs: `osmo workflow logs <workflow-id> <task-name>`
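When a command legitimately needs more time than the execution timeout allows, raising `exec_timeout` is usually the right fix. As with the `queue_timeout` sketch above, the field name comes from this page but its placement and value format here are assumptions; see Timeouts for the real schema.

```yaml
tasks:
  - name: train
    image: my-image
    ...
    exec_timeout: 12h   # hypothetical placement and duration format -- see Timeouts
```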
COMPLETED
Task finished successfully (exit code 0)
All outputs have been uploaded
Task is done and cannot transition to any other state
FAILED
Task encountered an error and stopped
See Status Reference for all failure types
Check logs to diagnose: `osmo workflow logs <workflow-id> <task-name>`
Check the exit code (see Exit Codes)
Task’s Output Behavior
When are outputs uploaded?
Outputs are uploaded when the task completes successfully. However:
Attention
If the task is canceled or terminated (due to backend error, eviction, or preemption), outputs are NOT uploaded.
Where are outputs uploaded?
OSMO determines the upload destination based on your configuration:
Custom location - If you specify `outputs` in the task spec → See Outputs
Intermediate storage - In these cases:
Task has downstream dependencies (outputs needed by other tasks)
Task has no downstream dependencies AND no `outputs` specified (automatic backup)
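A minimal sketch of a task that uploads to a custom location is shown here. Only the `outputs` field name is taken from this page; the destination format is illustrative, so consult Outputs for the actual schema.

```yaml
tasks:
  - name: export-results
    image: my-image
    ...
    outputs:
      - s3://my-bucket/results/   # hypothetical destination format -- see Outputs
```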
How to recover from task failures?
When tasks fail, you can configure automatic recovery using Exit Actions:
Reschedule (creates new task) - Use for:
Node failures (FAILED_BACKEND_ERROR)
Preemption (FAILED_PREEMPTED)
Image pull issues (FAILED_IMAGE_PULL)
Restart (re-runs command on same task) - Use for:
Code crashes that can resume from checkpoints
Timeouts where work can continue
Temporary failures that don’t require a fresh start
Example configuration:
tasks:
  - name: resilient-task
    image: my-image
    ...
    exitActions:
      COMPLETE: 0-10
      RESTART: 11-20
      RESCHEDULE: 21-255
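In this configuration, exit codes 0-10 are treated as success, codes 11-20 re-run the command on the same task (useful when the code can resume from a checkpoint), and codes 21-255 create a new task via reschedule.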
Note
Please contact your administrator to enable or configure the maximum number of retries.
Group Lifecycle#
Groups allow multiple tasks to run together and communicate. Understanding group lifecycle is important when using distributed training or multi-task coordination.
Groups follow a similar lifecycle to tasks, but represent the collective state of all tasks within the group:
SUBMITTING → Group is being submitted
WAITING → Group waits for upstream groups (if any)
PROCESSING → Service is preparing the group
SCHEDULING → Group is waiting to be scheduled
INITIALIZING → Tasks are pulling images
RUNNING → At least one task in the group is running
COMPLETED or FAILED → Group finished
How ignoreNonleadStatus Affects Group Behavior
Every group must have exactly one lead task. The ignoreNonleadStatus field (default: true)
determines how non-lead task failures affect the group:
| Value | Finished Status | Reschedule Behavior |
|---|---|---|
| `true` (default) | The group’s status is dependent only on the lead task. | When a task is rescheduled, other tasks in the group continue running. |
| `false` | The group’s status is dependent on all the tasks in the group. If any task fails, the group will fail. | When a task is rescheduled, the other tasks in the group are restarted and the group status does not change. |
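A sketch of how this might look in a workflow spec follows. The `ignoreNonleadStatus` field comes from this page, but the surrounding group fields and layout are assumptions; see Group for the actual schema.

```yaml
groups:
  - name: training-group
    # Assumed layout; only ignoreNonleadStatus is documented on this page.
    # false: the group fails if any task in it fails, not just the lead task.
    ignoreNonleadStatus: false
    tasks:
      ...
```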
Learn more about group fields at Group.
Status Reference#
Workflow Statuses
| Status | Description |
|---|---|
| PENDING | Workflow is waiting for a group to start running |
| WAITING | Workflow has started but doesn’t have any tasks running. Either a downstream task is waiting to be scheduled, or a task is waiting to be rescheduled |
| RUNNING | Workflow is running at least one group |
| COMPLETED | Workflow execution was successful and all tasks have completed |
| FAILED | Workflow failed to complete. One or more tasks have failed |
| FAILED_EXEC_TIMEOUT | Workflow was running longer than the set execution timeout (see Timeouts) |
| FAILED_QUEUE_TIMEOUT | Workflow was queued longer than the set queue timeout (see Timeouts) |
| FAILED_SUBMISSION | Workflow failed to submit due to resource or credential validation failure |
| FAILED_SERVER_ERROR | Workflow failed due to an internal server error |
| FAILED_CANCELED | Workflow was canceled by a user |
Task Statuses
| Status | Description |
|---|---|
| SUBMITTING | Task is being submitted |
| WAITING | Task is waiting for an upstream task to complete |
| PROCESSING | Task is being processed by the service to be sent to the backend |
| SCHEDULING | Task is in the backend queue waiting to run |
| INITIALIZING | Task is pulling images and running preflight tests |
| RUNNING | Task is running (downloading inputs → executing command → uploading outputs) |
| RESCHEDULED | Task has finished and a new task with the same spec has been created |
| COMPLETED | Task has finished successfully |
| FAILED | Task has failed (your command returned a non-zero exit code) |
| FAILED_CANCELED | Task was canceled by the user |
| FAILED_SERVER_ERROR | Task has failed due to an internal service error |
| FAILED_BACKEND_ERROR | Task has failed due to a backend error, such as the node entering a Not Ready state |
| FAILED_EXEC_TIMEOUT | Task ran longer than the set execution timeout (see Timeouts) |
| FAILED_QUEUE_TIMEOUT | Task was queued longer than the set queue timeout (see Timeouts) |
| FAILED_IMAGE_PULL | Task has failed to pull the Docker image |
| FAILED_UPSTREAM | Task has failed due to failed upstream dependencies |
| FAILED_EVICTED | Task was evicted due to memory or storage usage exceeding limits |
| FAILED_PREEMPTED | Task was preempted to make space for a higher-priority task |
| FAILED_START_ERROR | Task failed to start up properly during the initialization process |
| FAILED_START_TIMEOUT | Task timed out while initializing |
Group Statuses
| Status | Description |
|---|---|
| SUBMITTING | Group is being submitted |
| WAITING | Group is waiting for an upstream group to complete |
| PROCESSING | Group is being processed by the service to be sent to the backend |
| SCHEDULING | Group is waiting to be scheduled in the backend |
| INITIALIZING | All tasks in the group are initializing |
| RUNNING | Any task in the group is running |
| COMPLETED | The group completed, as determined by the `ignoreNonleadStatus` field |
| FAILED | The lead task has failed, or `ignoreNonleadStatus` is false and any task in the group has failed |
| FAILED_UPSTREAM | Upstream group has failed |
| FAILED_SERVER_ERROR | Some OSMO internal error occurred |
| FAILED_PREEMPTED | Any of the tasks in the group were preempted |
| FAILED_EVICTED | Any of the tasks in the group were evicted |
See also
Related Documentation:
Group - Group configuration and behavior
Timeouts - Set execution and queue timeouts
Exit Codes - Understanding exit codes
Exit Actions - Configure automatic retry behavior