Working with Data#

OSMO makes it easy to upload and download data for your workflows. This tutorial covers how data flows into and out of a task, storage URLs, datasets, and filtering which files are transferred.

Prerequisites

Before you start, please make sure you have configured your data credentials. See Data for more details.

Hint

The examples below demonstrate reading from and writing to remote storage. Please replace the example URLs with your own storage URLs.

Inside a Workflow#

OSMO provides two directories for data management in every task:

/osmo/
├── input/              ← Read input data here
│   ├── 0/
│   └── 1/
└── output/             ← Write results here
    └── (user outputs)

How it works:

  1. Before the task starts → OSMO downloads the data specified in inputs: to /osmo/input/

  2. During task execution → Your code reads from {{input:#}}/ and writes results to {{output}}/

  3. After the task completes → OSMO uploads /osmo/output/ to the locations specified in outputs:

Example:

tasks:
- name: process
  command: ["bash", "-c"]
  args:
  - |
    cat {{input:0}}/my_data/data.txt         # Read from the first input (the my_data dataset)
    echo "Result" > {{output}}/result.txt    # Write output

  inputs:
  - dataset: {name: my_data}     # ← Downloads here
  outputs:
  - dataset: {name: my_results}  # ← Uploads here

See also

The above explains the fundamentals of how a workflow can read/write data. For more details on how data flows between tasks in a workflow, see Serial Workflows.

Storage URLs#

URL Patterns#

Storage Providers        URL Pattern
----------------------   ------------------------------------
AWS S3                   s3://<bucket>/<path>
GCP Google Storage       gs://<bucket>/<path>
Azure Blob Storage       azure://<account>/<container>/<path>
Torch Object Storage     tos://<endpoint>/<bucket>
OpenStack Swift          swift://<endpoint>/<account>/<bucket>
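
For example, a single task can read from more than one provider by listing several URLs under inputs:. The sketch below is a minimal example, assuming your credentials for each provider are already configured (see Data); the bucket, account, and container names are placeholders.

workflow:
  name: multi-provider-input

  tasks:
  - name: list-remote-inputs
    image: ubuntu:24.04
    command: ["bash", "-c"]
    args:
    - |
      ls -la {{input:0}}/   # files downloaded from the gs:// URL
      ls -la {{input:1}}/   # files downloaded from the azure:// URL

    inputs:
    - url: gs://my-bucket/raw-data/              # placeholder GCS bucket
    - url: azure://myaccount/mycontainer/data/   # placeholder Azure account and container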

Uploading Data#

Upload data directly to cloud storage (S3, GCS, Azure) using URLs:

workflow:
  name: upload-to-s3

  tasks:
  - name: save-to-cloud
    image: ubuntu:24.04
    command: ["bash", "-c"]
    args:
    - |
      mkdir -p {{output}}/results
      echo "Model checkpoint" > {{output}}/results/model.pth
      echo "Upload complete"

    outputs:
    - url: s3://my-bucket/models/ # (1)
  1. Files from {{output}} are uploaded to the S3 bucket after task completion.
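
To confirm that the files arrived, you can list the destination with your storage provider's own tooling. The command below is not an OSMO command; it assumes the AWS CLI is installed locally and configured with access to the same bucket.

# List everything uploaded under the destination prefix (AWS CLI, not OSMO)
$ aws s3 ls --recursive s3://my-bucket/models/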

Downloading Data#

Download data directly from cloud storage using URLs:

workflow:
  name: download-from-s3

  tasks:
  - name: load-from-cloud
    image: ubuntu:24.04
    command: ["bash", "-c"]
    args:
    - |
      echo "Loading data from S3..."
      ls -la {{input:0}}/ # (1)
      echo "Download complete"

    inputs:
    - url: s3://my-bucket/data/ # (2)
  1. Access downloaded files at {{input:0}}/.

  2. Files are downloaded from S3 before the task starts.

Datasets#

See also

Before you start, please make sure you have set a Default Dataset Bucket.

What is a Dataset?#

Important

A dataset is a versioned collection of files managed by OSMO. Datasets persist beyond workflow execution and can be shared across workflows and teams.

Key characteristics:

  • Datasets are versioned - each upload creates a new version

  • Content-addressed for efficient storage and deduplication

  • Accessible via CLI, workflows, and Web UI

  • Support filtering and metadata

Dataset Naming Convention

Datasets use the pattern dataset_name:version:

  • training_data - Latest version

  • training_data:v1 - Specific version

  • training_data:baseline - Named version
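
When a workflow needs something other than the latest version, one option that follows the same name:version pattern is sketched below; treat appending the version to the name field as an assumption and verify it against your OSMO workflow schema.

inputs:
- dataset:
    name: training_data:v1   # assumed syntax: pin a specific version via name:version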

Uploading a Dataset#

To upload a dataset from a workflow task, write files to the {{output}} directory and specify a dataset in the outputs:

workflow:
  name: create-dataset

  tasks:
  - name: generate-data
    image: ubuntu:24.04
    command: ["bash", "-c"]
    args:
    - |
      echo "Generating data..."
      mkdir -p {{output}}/data
      for i in {1..10}; do
        echo "Sample data $i" > {{output}}/data/file_$i.txt
      done
      echo "Data generation complete"

    outputs:
    - dataset:
        name: my_dataset # (1)
  1. Everything in {{output}} is uploaded to my_dataset after the task completes successfully.

Once uploaded, you can download a dataset to your local machine using the CLI:

# Download latest version
$ osmo dataset download my_dataset /tmp

# Download specific version
$ osmo dataset download my_dataset:1 /tmp

Downloading a Dataset#

To download a dataset in a workflow, add it to the task’s inputs. To reference the dataset, use the Special Tokens {{input:#}} where # is the zero-based index of the input.

workflow:
  name: read-dataset

  tasks:
  - name: process-data
    image: ubuntu:24.04
    command: ["bash", "-c"]
    args:
    - |
      echo "Reading dataset..."
      ls -la {{input:0}}/my_dataset/ # (1)
      cat {{input:0}}/my_dataset/data/file_1.txt
      echo "Processing complete"

    inputs:
    - dataset:
        name: my_dataset # (2)
  1. Access the dataset at {{input:0}}/my_dataset/ where {{input:0}} is the first input.

  2. The dataset is downloaded before the task starts.

Combining URLs and Datasets#

You can mix URLs and datasets in the same workflow:

workflow:
  name: mixed-storage

  tasks:
  - name: process-multiple-sources
    image: ubuntu:24.04
    command: ["bash", "-c"]
    args:
    - |
      echo "Processing data from multiple sources..."
      cat {{input:0}}/my_dataset/data.txt
      cat {{input:1}}/s3_data.txt

      # Generate outputs
      echo "Processed results" > {{output}}/results.txt

    inputs:
    - dataset:
        name: my_dataset # (1)
    - url: s3://my-bucket/raw-data/ # (2)

    outputs:
    - dataset:
        name: processed_data # (3)
    - url: s3://my-bucket/outputs/ # (4)
  1. Download from OSMO dataset at {{input:0}}/my_dataset/.

  2. Download from S3 at {{input:1}}/.

  3. Upload to OSMO dataset.

  4. Also upload to S3 bucket.

Filtering Data#

Filter which files to download or upload using regex patterns:

workflow:
  name: filtered-io

  tasks:
  - name: selective-download
    image: ubuntu:24.04
    command: ["bash", "-c"]
    args: ["ls -la {{input:0}}/"]

    inputs:
    - dataset:
        name: large_dataset
        regex: .*\.txt$ # (1)

    outputs:
    - dataset:
        name: output_dataset
        regex: .*\.(json|yaml)$ # (2)
  1. Only download .txt files from the input dataset.

  2. Only upload .json and .yaml files to the output dataset.
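
Because dataset files keep their relative paths (the my_dataset example above stores files under data/), a pattern can also target a path prefix rather than just an extension. The sketch below assumes the regex is matched against each file's relative path; verify this against your own data before relying on it.

inputs:
- dataset:
    name: large_dataset
    regex: data/.*\.txt$   # assumed: only .txt files under the data/ subdirectory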

Next Steps#

Now that you understand data management, you’re ready to build more complex workflows. Continue to Serial Workflows to learn about task dependencies.