Advanced Patterns#

This tutorial covers advanced workflow patterns and optimization techniques for building production-grade workflows in OSMO.

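You’ll learn how to:

  • Create reusable workflows with Jinja templates

  • Inject local files with localpath

  • Checkpoint task data periodically

  • Run the OSMO CLI inside a workflow

  • Handle errors with exit actions

  • Exclude specific nodes from scheduling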

These patterns help you build scalable, maintainable, and efficient production workflows.

Workflow Templates#

OSMO supports Jinja templates for creating reusable and configurable workflows. You can define variables in your workflow and override them at submission time.

workflow:
  name: "{{workflow_name}}"

  resources:
    training:
      cpu: 8
      memory: 32Gi
      gpu: {{gpu_count}}

  tasks:
  {% for i in range(num_tasks) %}
  - name: train-model-{{i}}
    image: {{training_image}}
    command: ["python", "train.py"]
    args:
    - "--dataset={{dataset_name}}"
    - "--model={{model_type}}"
    - "--fold={{i}}"
    resource: training
    outputs:
    - dataset:
        name: "{{model_type}}_model_fold_{{i}}"
  {% endfor %}

default-values:
  workflow_name: ml-training
  dataset_name: imagenet
  model_type: resnet50
  num_tasks: 3
  gpu_count: 1
  training_image: nvcr.io/nvidia/pytorch:24.01-py3

This template uses a Jinja {% for %} loop to create multiple training tasks dynamically. Each task gets a unique name with an index (e.g., train-model-0, train-model-1) and produces separate output datasets. This is useful for cross-validation, hyperparameter sweeps, or parallel training runs.
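To make the loop expansion concrete, the following is a minimal shell sketch that prints the task and output dataset names the template would produce for a given num_tasks and model_type. The expand_tasks function is a hypothetical helper written for illustration; it does not call OSMO or Jinja and only mimics how range(num_tasks) counts from 0 to num_tasks - 1.

```shell
#!/bin/sh
# Hypothetical helper: print the task name and output dataset name for each
# loop iteration. Jinja's range(n) yields 0 .. n-1, so the loop counts the same way.
expand_tasks() {
    num_tasks=$1
    model_type=$2
    i=0
    while [ "$i" -lt "$num_tasks" ]; do
        echo "train-model-${i} -> ${model_type}_model_fold_${i}"
        i=$((i + 1))
    done
}

# Default values from the template above: 3 tasks, model_type resnet50.
expand_tasks 3 resnet50
```

With the defaults this lists three tasks, train-model-0 through train-model-2, each paired with its own output dataset name.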

Submit with custom values:

# Use defaults (creates 3 tasks)
$ osmo workflow submit template-workflow.yaml

# Override values to create 5 tasks
$ osmo workflow submit template-workflow.yaml \
    --set model_type=efficientnet \
    --set gpu_count=4 \
    --set num_tasks=5

See also

See Templates and Special Tokens for complete template documentation.

Injecting Local Files with localpath#

When developing workflows, you often need to inject local configuration files or scripts into your tasks. OSMO supports the localpath attribute in the files section to inject files from your local machine into the container.

This is particularly useful for:

  • Including configuration files without hard-coding them inline

  • Reusing existing scripts across multiple workflows

  • Keeping workflow specifications clean and readable

To inject a local file, use the localpath attribute in the files section:

tasks:
- name: run-local-script
  image: ubuntu:24.04
  command: [sh]
  args: [/tmp/run.sh]
  files:
  - localpath: scripts/my_script.sh   # (1)
    path: /tmp/run.sh                 # (2)
  1. The localpath field designates the path of the file on your local machine (relative to the workflow spec).

  2. The path field designates where to create this file in the task’s container.

Warning

The localpath field in the files section supports only files, NOT directories. If you need to transfer an entire directory, see Folder for more information.
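Because localpath accepts only files, it can help to check a path locally before submitting. The sketch below is one possible pre-submit check; is_injectable_file is a hypothetical helper, not part of the OSMO CLI, and the path it checks is the one from the example spec above.

```shell
#!/bin/sh
# Hypothetical pre-submit check: succeed only if the path is a regular file.
is_injectable_file() {
    if [ -d "$1" ]; then
        echo "error: $1 is a directory; localpath only supports files" >&2
        return 1
    elif [ ! -f "$1" ]; then
        echo "error: $1 does not exist or is not a regular file" >&2
        return 1
    fi
}

# Example: verify the script referenced by localpath before submitting.
is_injectable_file scripts/my_script.sh || echo "fix localpath before submitting"
```

Run the check from the directory that contains the workflow spec, since localpath is resolved relative to the spec.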

See also

See File Injection for complete file injection documentation, including how to inject directories using dataset inputs.

Periodic Data Checkpointing#

OSMO supports automatic checkpointing to periodically save your task’s working data to a remote data store. This is useful for long-running training tasks where you want to preserve intermediate results.

Basic Checkpointing#

workflow:
  name: train-with-checkpointing
  tasks:
  - name: train-with-checkpointing
    image: ubuntu:24.04
    command: [/bin/bash]
    args: [/tmp/run.sh]
    files:
    - path: /tmp/run.sh
      contents: |-
        #!/bin/bash
        set -ex

        mkdir -p /tmp/data
        for i in {1..30}; do
            filename="/tmp/data/file_$i.txt"
            dd if=/dev/urandom of=$filename bs=1M count=5
            sleep 1s
        done
        sleep 60s
    checkpoint:
    - path: /tmp/data
      url: s3://my-bucket/model-checkpoints
      frequency: 10s

This configuration uploads the contents of /tmp/data to s3://my-bucket/model-checkpoints every 10 seconds while the task is running. When the task completes, OSMO automatically uploads a final checkpoint.
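As a rough sanity check on the example above, the arithmetic below estimates how long the script runs and how many periodic checkpoints fit in that time. It is only a back-of-the-envelope sketch: it assumes the time spent in dd and in the uploads themselves is negligible, and it says nothing about how OSMO actually schedules uploads.

```shell
#!/bin/sh
# Rough estimate for the example above; ignores time spent in dd and in uploads.
iterations=30        # loop count in run.sh
sleep_per_iter=1     # seconds slept per iteration
final_sleep=60       # trailing sleep in run.sh, in seconds
frequency=10         # checkpoint frequency, in seconds
file_size_mib=5      # each dd call writes 5 blocks of 1 MiB

runtime=$((iterations * sleep_per_iter + final_sleep))
periodic_uploads=$((runtime / frequency))
total_mib=$((iterations * file_size_mib))

echo "minimum runtime: ${runtime}s"
echo "approximate periodic checkpoints: ${periodic_uploads}"
echo "data written to /tmp/data: ${total_mib} MiB"
```

Under these assumptions the task runs for at least 90 seconds, which leaves room for about 9 periodic checkpoints before the final one, covering 150 MiB of data.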

Checkpointing Specific Files#

You can use regex patterns to checkpoint only specific files:

workflow:
  name: train-with-selective-checkpointing
  tasks:
  - name: train-with-selective-checkpointing
    image: ubuntu:24.04
    command: [/bin/bash]
    args: [/tmp/run.sh]
    files:
    - path: /tmp/run.sh
      contents: |-
        #!/bin/bash
        set -ex

        mkdir -p /tmp/data
        for i in {1..30}; do
            # Alternate file type for odd/even files
            if (( $i % 2 == 0 )); then
                filename="/tmp/data/file_$i.bin"
            else
                filename="/tmp/data/file_$i.txt"
            fi
            dd if=/dev/urandom of=$filename bs=1M count=5
            sleep 1s
        done
        sleep 60s
    checkpoint:
    - path: /tmp/data
      url: s3://my-bucket/model-selective-checkpoints
      frequency: 10s
      regex: .*\.(bin)$

This checkpoints only files whose names end in .bin every 10 seconds; the .txt files that the script also writes are skipped.
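To preview which files a pattern would select, you can try it against sample names locally. The sketch below uses grep -E as a stand-in matcher; this is an assumption, since OSMO's regex engine and whether it matches against file names or full paths may differ. The matching_files function is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print the names that match the given regex.
# Assumption: grep -E stands in for OSMO's matcher, whose semantics may differ.
matching_files() {
    pattern=$1
    shift
    printf '%s\n' "$@" | grep -E "$pattern"
}

# Sample names like those written by run.sh in the example above.
matching_files '.*\.(bin)$' file_1.txt file_2.bin file_3.txt file_4.bin
```

Here only file_2.bin and file_4.bin are printed, which matches the intent of checkpointing the .bin files alone.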

See also

See Checkpointing for complete checkpointing documentation.

Running OSMO CLI in a Workflow#

You can call the OSMO CLI from within a workflow task. OSMO always injects the CLI into the workflow, so you do not need to install it in your image.

workflow:
  name: osmo-cli
  tasks:
  - name: task1
    resource: default
    image: ubuntu:24.04
    command: ['sh']
    args: ['/tmp/run.sh']
    files:
    - contents: |
        echo "Invoking OSMO client from a script"
        osmo version # (1)
      path: /tmp/run.sh
  1. You can run any OSMO CLI command here.

Error Handling with Exit Actions#

OSMO supports exit actions that allow you to control task behavior based on exit codes. This is useful for handling failures and retries.

tasks:
- name: resilient-task
  image: ubuntu:24.04
  command: ["bash", "-c", "curl https://api.example.com/data"]
  exitActions:
    COMPLETE: 0
    RESCHEDULE: 1-255

This configuration marks the task as complete on exit code 0 and reschedules it on any non-zero exit code (1-255).
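The mapping from exit codes to actions can be pictured with a small shell sketch. The action_for_exit_code function below is a hypothetical illustration that mirrors the two ranges in the spec above (COMPLETE: 0 and RESCHEDULE: 1-255); it is not how OSMO implements exit actions.

```shell
#!/bin/sh
# Hypothetical helper: map an exit code to the action configured in the spec above.
# Mirrors COMPLETE: 0 and RESCHEDULE: 1-255; not OSMO's implementation.
action_for_exit_code() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo COMPLETE
    elif [ "$code" -ge 1 ] && [ "$code" -le 255 ]; then
        echo RESCHEDULE
    else
        echo UNKNOWN
    fi
}

action_for_exit_code 0   # success path
action_for_exit_code 7   # any non-zero code in 1-255
```

Narrowing the RESCHEDULE range to the codes you consider transient keeps permanent failures from being retried.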

See also

See Exit Actions for complete exit actions documentation.

Excluding Specific Nodes#

If some nodes in your pool have poor performance or network issues, you can exclude them from scheduling with the nodesExcluded field:

workflow:
  name: exclude-nodes-demo
  resources:
    default:
      cpu: 1
      memory: 16Gi
      storage: 1Gi
      nodesExcluded:
      - worker1
      - worker2
  tasks:
  - name: my-task
    image: ubuntu:24.04
    command: ["bash", "-c", "echo 'Running on a good node'"]

Warning

Excluding too many nodes can leave tasks stuck in PENDING indefinitely. Only exclude nodes when absolutely necessary.

Next Steps#

Congratulations! You have completed the OSMO tutorials. Continue to our how-to guides for more real-world examples.