Working with Data#

OSMO makes it easy to upload and download data for your workflows. This tutorial covers how data moves into and out of a task, storage URL patterns, uploading and downloading data, and filtering files with regex patterns.

Prerequisites

Before you start, please make sure you have configured your data credentials. See Data for more details.

Hint

The examples below demonstrate reading from and writing to remote storage. Please replace any URLs with your own storage URLs.

Inside a Workflow#

OSMO provides two directories for data management in every task:

/osmo/
├── input/              ← Read input data here
│   ├── 0/
│   └── 1/
└── output/             ← Write results here
    └── (user outputs)

How it works:

  1. Before the task starts → OSMO downloads the data specified in inputs: to /osmo/input/

  2. During task execution → Your code reads from {{input:#}}/ and writes to {{output}}/

  3. After the task completes → OSMO uploads /osmo/output/ to the locations specified in outputs:

Example:

tasks:
- name: process
  image: ubuntu:24.04
  command: ["bash", "-c"]
  args:
  - |
    cat {{input:0}}/data.txt                # Read the first input
    echo "Result" > {{output}}/result.txt   # Write the output

  inputs:
  - url: s3://my-bucket/inputs/             # ← Downloads here
  outputs:
  - url: s3://my-bucket/outputs/            # ← Uploads here

See also

The above explains the fundamentals of how a workflow can read/write data. For more details on how data flows between tasks in a workflow, see Serial Workflows.

Storage URLs#

URL Patterns#

Storage Provider       URL Pattern
AWS S3                 s3://<bucket>/<path>
Google Cloud Storage   gs://<bucket>/<path>
Azure Blob Storage     azure://<account>/<container>/<path>
Torch Object Storage   tos://<endpoint>/<bucket>
OpenStack Swift        swift://<endpoint>/<account>/<bucket>
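
Any of these schemes can appear in inputs: and outputs:. As a minimal sketch, assuming credentials are configured for each provider (the account, container, and bucket names below are placeholders):

workflow:
  name: cross-cloud-copy

  tasks:
  - name: copy
    image: ubuntu:24.04
    command: ["bash", "-c"]
    args: ["cp -r {{input:0}}/. {{output}}/"]

    inputs:
    - url: azure://my-account/my-container/data/   # downloaded before the task starts
    outputs:
    - url: gs://my-bucket/copies/                  # uploaded after the task completes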

Uploading Data#

Upload data directly to cloud storage (S3, GCS, Azure) using URLs:

workflow:
  name: upload-to-s3

  tasks:
  - name: save-to-cloud
    image: ubuntu:24.04
    command: ["bash", "-c"]
    args:
    - |
      mkdir -p {{output}}/results
      echo "Model checkpoint" > {{output}}/results/model.pth
      echo "Upload complete"

    outputs:
    - url: s3://my-bucket/models/ # (1)
  1. Files from {{output}} are uploaded to the S3 bucket after task completion.
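
Everything written under {{output}} is uploaded, including subdirectories. A minimal sketch, assuming the directory tree is preserved relative to {{output}} on upload (bucket and path names are placeholders):

workflow:
  name: upload-tree

  tasks:
  - name: save-run
    image: ubuntu:24.04
    command: ["bash", "-c"]
    args:
    - |
      mkdir -p {{output}}/checkpoints {{output}}/logs
      echo "Model checkpoint" > {{output}}/checkpoints/model.pth
      echo "epoch=10 loss=0.02" > {{output}}/logs/train.log

    outputs:
    - url: s3://my-bucket/runs/   # checkpoints/ and logs/ land under this prefix (assumed)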

Downloading Data#

Download data directly from cloud storage using URLs:

workflow:
  name: download-from-s3

  tasks:
  - name: load-from-cloud
    image: ubuntu:24.04
    command: ["bash", "-c"]
    args:
    - |
      echo "Loading data from S3..."
      ls -la {{input:0}}/ # (1)
      echo "Download complete"

    inputs:
    - url: s3://my-bucket/data/ # (2)
  1. Access downloaded files at {{input:0}}/.

  2. Files are downloaded from S3 before the task starts.
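
Inputs are indexed in the order they are declared, matching the 0/ and 1/ directories shown earlier. A minimal sketch with two inputs (URLs are placeholders):

workflow:
  name: two-inputs

  tasks:
  - name: load-both
    image: ubuntu:24.04
    command: ["bash", "-c"]
    args:
    - |
      ls -la {{input:0}}/   # first declared input  → /osmo/input/0/
      ls -la {{input:1}}/   # second declared input → /osmo/input/1/

    inputs:
    - url: s3://my-bucket/train/
    - url: s3://my-bucket/eval/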

Filtering Data#

Filter which files to download or upload using regex patterns:

workflow:
  name: filtered-io

  tasks:
  - name: selective-download
    image: ubuntu:24.04
    command: ["bash", "-c"]
    args: ["ls -la {{input:0}}/"]

    inputs:
    - dataset:
        name: large_dataset
        regex: .*\.txt$ # (1)

    outputs:
    - dataset:
        name: output_dataset
        regex: .*\.(json|yaml)$ # (2)
  1. Only download .txt files from the input.

  2. Only upload .json and .yaml files to the output.
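
Note that these patterns are regular expressions, not shell globs: *.txt is a glob, while .*\.txt$ is the regex equivalent. Given the output filter above, for example:

result.json    → uploaded  (matches .*\.(json|yaml)$)
config.yaml    → uploaded  (matches .*\.(json|yaml)$)
debug.log      → skipped   (no match)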

Next Steps#

Now that you understand data management, you’re ready to build more complex workflows. Continue to Serial Workflows to learn about task dependencies.