Workflow Lifecycle#

When you submit a workflow, it progresses through distinct phases from submission to completion. This page explains what happens at each stage and what you should expect to see.

Overview#

Every successful workflow follows this path:

PENDING ⏱️ → RUNNING ▶️ → COMPLETED

  • PENDING - Setting up resources and preparing to run

  • RUNNING - Your tasks are executing

  • COMPLETED - All tasks finished successfully

Tip

Most workflows follow this simple progression. If you see other statuses like WAITING or various FAILED states, see the sections below to understand what’s happening.


Task Lifecycle#

Each task in a workflow moves through its own sequence of statuses. The stages below describe what happens at each one and the most common issues to watch for.

SUBMITTING

What’s happening:

  • Workflow YAML is being validated (syntax, names, resources)

  • Credentials are checked (registry and data access)

  • Resource requests are matched against pool capacity

Common issues:

  • Invalid/missing credentials → Configure with osmo credential set

  • Resource requests too large → Reduce GPU/CPU/memory or verify capacity with osmo pool list or osmo resource list (see the check below)
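
If submission keeps failing on credentials or resources, the commands referenced above can confirm the fix before you resubmit. A minimal check, assuming the osmo CLI is configured for your deployment (exact arguments vary by installation):

# Confirm registry and data-access credentials are configured
osmo credential set

# Confirm the pool can satisfy the requested GPU/CPU/memory
osmo pool list
osmo resource list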

WAITING

What’s happening:

  • Task is waiting for upstream tasks to complete

  • No resources are being consumed during this phase

Common issues:

  • Upstream task failures will cause dependent tasks to fail

  • Long wait times if upstream tasks are slow or queued

PROCESSING

What’s happening:

  • Converting task specification to backend format

  • Submitting task to backend scheduler

Common issues:

  • Rare - usually internal processing errors

  • If stuck here, please contact your administrator

SCHEDULING

What’s happening:

  • Task is in the backend queue

  • Waiting for nodes with requested resources (CPU, GPU, memory)

  • Priority and queue position determine the order in which tasks are scheduled

Common issues:

  • Insufficient resources in pool → Check pool capacity: osmo pool list

  • Resource requests too large → Reduce GPU/CPU/memory requests or request a larger pool

  • Queue timeout exceeded → Increase queue_timeout (see Timeouts and the sketch below)
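
If tasks routinely wait in the queue longer than the default allows, the queue timeout can be raised in the workflow spec. A hedged sketch, assuming a task-level queue_timeout field and an hours-style duration; the authoritative field name, placement, and format are on the Timeouts page:

tasks:
  - name: long-queue-task
    image: my-image
    ...

    # Assumed placement and duration format for illustration; see Timeouts.
    queue_timeout: 4h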

INITIALIZING

What’s happening:

  • Pulling Docker image (if not already cached on the node)

  • Running preflight tests

  • Preparing container environment

Common issues:

  • Image doesn’t exist → FAILED_IMAGE_PULL - Check image name and registry (see the check below)

  • No pull credentials → FAILED_IMAGE_PULL - Verify registry credentials

  • Image pull timeout → FAILED_START_TIMEOUT - Image too large or network issues

  • Preflight test failures → FAILED_START_ERROR - Container startup problems
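
A quick way to rule out image-name and credential problems is to pull the image outside OSMO with the same credentials. A minimal sketch, assuming Docker is available locally and using a hypothetical registry and image name:

# Log in with the same registry credentials OSMO uses
docker login nvcr.io

# If this pull fails, OSMO's pull will fail for the same reason
docker pull nvcr.io/my-org/my-image:latest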

RUNNING

What’s happening:

Three sequential activities occur during the RUNNING phase:

  1. Input download - Sidecar container downloads any specified inputs from:

    • Upstream task outputs

    • Datasets

    • Cloud storage URLs

  2. Command execution - Your code runs in the container:

    • Standard output/error is captured in logs

    • Exec and port-forwarding are available during this time

    • You can interact with the running task

  3. Output upload - After your command completes, sidecar uploads outputs:

    • Files from the output directory are uploaded

    • Uploads to specified locations or intermediate storage

    • Happens before status changes to COMPLETED

Common issues:

  • Execution timeout exceeded → FAILED_EXEC_TIMEOUT - Increase exec_timeout or optimize code

  • Memory limits exceeded → FAILED_EVICTED - Request more memory or reduce usage

  • Storage limits exceeded → FAILED_EVICTED - Clean up intermediate files or request more storage

  • Node failures → FAILED_BACKEND_ERROR - Infrastructure issue, consider auto-reschedule

  • Command exits with error → FAILED - Check logs (see the example below): osmo workflow logs <workflow-id> <task-name>
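
For example, to inspect a failed task’s captured stdout/stderr, using placeholder identifiers you would substitute with your own:

# Placeholder workflow ID and task name - substitute your own
osmo workflow logs wf-12345 train-task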

COMPLETED

  • Task finished successfully (exit code 0)

  • All outputs have been uploaded

  • Task is done and cannot transition to any other state

FAILED

  • Task encountered an error and stopped

  • See Status Reference for all failure types

  • Check logs to diagnose: osmo workflow logs <workflow-id> <task-name>

  • Check exit code (see Exit Codes)

Task’s Output Behavior

When are outputs uploaded?

Outputs are uploaded when the task completes successfully. However:

Attention

If the task is canceled or terminated (due to backend error, eviction, or preemption), outputs are NOT uploaded.

Where are outputs uploaded?

OSMO determines the upload destination based on your configuration:

  • Custom location - If you specify outputs in the task spec → See Outputs (and the sketch below)

  • Intermediate storage - In these cases:

    • Task has downstream dependencies (outputs needed by other tasks)

    • Task has no downstream dependencies AND no outputs specified (automatic backup)
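
A hedged sketch of specifying a custom output destination, assuming an outputs list in the task spec and a hypothetical bucket URL; the exact schema is on the Outputs page:

tasks:
  - name: train
    image: my-image
    ...

    # Assumed field shape for illustration; see Outputs for the real schema.
    outputs:
      - s3://my-bucket/experiments/run-001/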

How to recover from task failures?

When tasks fail, you can configure automatic recovery using Exit Actions:

  • Reschedule (creates new task) - Use for:

    • Node failures (FAILED_BACKEND_ERROR)

    • Preemption (FAILED_PREEMPTED)

    • Image pull issues (FAILED_IMAGE_PULL)

  • Restart (re-runs the command on the same task) - Use for:

    • Code crashes that can resume from checkpoints

    • Timeouts where work can continue

    • Temporary failures that don’t require a fresh start

Example configuration:

tasks:
  - name: resilient-task
    image: my-image
    ...

    # Map your command's exit codes to recovery actions:
    exitActions:
      COMPLETE: 0-10     # treat exit codes 0-10 as success
      RESTART: 11-20     # re-run the command on the same task
      RESCHEDULE: 21-255 # create a new task with the same spec

Note

Please contact your administrator to enable or configure the maximum number of retries.


Group Lifecycle#

Groups allow multiple tasks to run together and communicate. Understanding group lifecycle is important when using distributed training or multi-task coordination.

Groups follow a similar lifecycle to tasks, but represent the collective state of all tasks within the group:

  1. SUBMITTING → Group is being submitted

  2. WAITING → Group waits for upstream groups (if any)

  3. PROCESSING → Service is preparing the group

  4. SCHEDULING → Group is waiting to be scheduled

  5. INITIALIZING → Tasks are pulling images

  6. RUNNING → At least one task in the group is running

  7. COMPLETED or FAILED → Group finished

How ignoreNonleadStatus Affects Group Behavior

Every group must have exactly one lead task. The ignoreNonleadStatus field (default: true) determines how non-lead task failures affect the group:

| Value | Finished Status | Reschedule Behavior |
|-------|-----------------|---------------------|
| true | The group’s status depends only on the lead task. | When a task is rescheduled, the other tasks in the group continue running. |
| false | The group’s status depends on all tasks in the group. If any task fails, the group fails. | When a task is rescheduled, the other tasks in the group are restarted and the group status stays at RUNNING. |

Learn more about group fields at Group.
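
A hedged sketch of setting this field in a group spec, assuming a groups block with group-level fields; the task entries and how the lead task is designated are hypothetical (see Group for the authoritative fields):

groups:
  - name: distributed-training
    # false: the group fails if any task in it fails
    ignoreNonleadStatus: false
    tasks:
      - name: lead-trainer   # hypothetical lead task
        image: my-image
        ...
      - name: worker         # hypothetical non-lead task
        image: my-image
        ...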

Status Reference#

Workflow Statuses

| Status | Description |
|--------|-------------|
| PENDING | Workflow is waiting for a group to start running |
| WAITING | Workflow has started but doesn’t have any tasks running. Either a downstream task is waiting to be scheduled, or a task is waiting to be rescheduled |
| RUNNING | Workflow is running at least one group |
| COMPLETED | Workflow execution was successful and all tasks have completed |
| FAILED | Workflow failed to complete. One or more tasks have failed |
| FAILED_EXEC_TIMEOUT | Workflow ran longer than the set execution timeout (see Timeouts) |
| FAILED_QUEUE_TIMEOUT | Workflow was queued longer than the set queue timeout (see Timeouts) |
| FAILED_SUBMISSION | Workflow failed to submit due to resource or credential validation failure |
| FAILED_SERVER_ERROR | Workflow failed due to an internal server error |
| FAILED_CANCELED | Workflow was canceled by a user |

Task Statuses

| Status | Description |
|--------|-------------|
| SUBMITTING | Task is being submitted |
| WAITING | Task is waiting for an upstream task to complete |
| PROCESSING | Task is being processed by the service to be sent to the backend |
| SCHEDULING | Task is in the backend queue waiting to run |
| INITIALIZING | Task is pulling images and running preflight tests |
| RUNNING | Task is running (downloading inputs → executing command → uploading outputs) |
| RESCHEDULED | Task has finished and a new task with the same spec has been created |
| COMPLETED | Task has finished successfully |
| FAILED | Task has failed (your command returned a non-zero exit code) |
| FAILED_CANCELED | Task was canceled by the user |
| FAILED_SERVER_ERROR | Task has failed due to an internal service error |
| FAILED_BACKEND_ERROR | Task has failed due to a backend error, such as the node entering a Not Ready state |
| FAILED_EXEC_TIMEOUT | Task ran longer than the set execution timeout (see Timeouts) |
| FAILED_QUEUE_TIMEOUT | Task was queued longer than the set queue timeout (see Timeouts) |
| FAILED_IMAGE_PULL | Task has failed to pull its Docker image |
| FAILED_UPSTREAM | Task has failed due to failed upstream dependencies |
| FAILED_EVICTED | Task was evicted due to memory or storage usage exceeding limits |
| FAILED_PREEMPTED | Task was preempted to make space for a higher-priority task |
| FAILED_START_ERROR | Task failed to start up properly during initialization |
| FAILED_START_TIMEOUT | Task timed out while initializing |

Group Statuses

| Status | Description |
|--------|-------------|
| SUBMITTING | Group is being submitted |
| WAITING | Group is waiting for an upstream group to complete |
| PROCESSING | Group is being processed by the service to be sent to the backend |
| SCHEDULING | Group is waiting to be scheduled in the backend |
| INITIALIZING | All tasks in the group are initializing |
| RUNNING | At least one task in the group is running |
| COMPLETED | Group finished successfully, as determined by the ignoreNonleadStatus field. See Group Lifecycle for more information |
| FAILED | The lead task has failed, or ignoreNonleadStatus is set to false and a non-lead task has failed |
| FAILED_UPSTREAM | An upstream group has failed |
| FAILED_SERVER_ERROR | An OSMO internal error occurred |
| FAILED_PREEMPTED | One or more tasks in the group were preempted |
| FAILED_EVICTED | One or more tasks in the group were evicted |

