Working with Data#
OSMO makes it easy to upload and download data for your workflows. This tutorial will cover:
How data is used inside a workflow
How to work with storage URLs
How to work with datasets
Prerequisites
Before you start, please make sure you have configured your data credentials. See Data for more details.
Hint
The examples below demonstrate reading and writing from remote storage. Please replace any URLs with your own storage URLs.
Inside a Workflow#
OSMO provides two directories for data management in every task:
/osmo/
├── input/ ← Read input data here
│ ├── 0/
│ └── 1/
└── output/ ← Write results here
└── (user outputs)
How it works:
Before the task starts → OSMO downloads data specified in inputs: to /osmo/input/
During task execution → Your code reads from {{input:#}}/
After the task completes → OSMO uploads /osmo/output/ to locations specified in outputs:
Example:
tasks:
  - name: process
    command: ["bash", "-c"]
    args:
      - |
        cat {{input:0}}/data.txt               # Reads the first input
        echo "Result" > {{output}}/result.txt  # Write output
    inputs:
      - dataset: {name: my_data}     # ← Downloads here
    outputs:
      - dataset: {name: my_results}  # ← Uploads here
See also
The above explains the fundamentals of how a workflow can read/write data. For more details on how data flows between tasks in a workflow, see Serial Workflows.
Storage URLs#
URL Patterns#
| Storage Providers | URL Pattern |
|---|---|
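To make the patterns concrete, here is a rough sketch of how provider URLs appear in a task's inputs. The s3:// form matches the examples later in this tutorial; the gs:// form shown for Google Cloud Storage is an assumption, so replace any of these with your own storage URLs:
inputs:
  - url: s3://my-bucket/raw-data/   # Amazon S3 (same form as the examples below)
  - url: gs://my-bucket/raw-data/   # Google Cloud Storage (assumed scheme; verify for your deployment)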
Uploading Data#
Upload data directly to cloud storage (S3, GCS, Azure) using URLs:
workflow:
  name: upload-to-s3
  tasks:
    - name: save-to-cloud
      image: ubuntu:24.04
      command: ["bash", "-c"]
      args:
        - |
          mkdir -p {{output}}/results
          echo "Model checkpoint" > {{output}}/results/model.pth
          echo "Upload complete"
      outputs:
        - url: s3://my-bucket/models/  # (1)
1. Files from {{output}} are uploaded to the S3 bucket after task completion.
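To confirm the upload from your local machine, you can list the destination with a standard S3 client. This is a sketch that assumes the AWS CLI is installed and has credentials for my-bucket:
# List the uploaded files (AWS CLI and credentials assumed)
$ aws s3 ls s3://my-bucket/models/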
Downloading Data#
Download data directly from cloud storage using URLs:
workflow:
  name: download-from-s3
  tasks:
    - name: load-from-cloud
      image: ubuntu:24.04
      command: ["bash", "-c"]
      args:
        - |
          echo "Loading data from S3..."
          ls -la {{input:0}}/  # (1)
          echo "Download complete"
      inputs:
        - url: s3://my-bucket/data/  # (2)
1. Access downloaded files at {{input:0}}/.
2. Files are downloaded from S3 before the task starts.
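If the bucket prefix is empty, stage some test data first so the workflow has something to download. As a sketch, assuming the AWS CLI is configured for my-bucket:
# Copy a local directory into the prefix the workflow reads from (AWS CLI assumed)
$ aws s3 cp ./local-data/ s3://my-bucket/data/ --recursive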
Datasets#
See also
Before you start, please make sure you have set a Default Dataset Bucket.
What is a Dataset?#
Important
A dataset is a versioned collection of files managed by OSMO. Datasets persist beyond workflow execution and can be shared across workflows and teams.
Key characteristics:
Datasets are versioned - each upload creates a new version
Content-addressed for efficient storage and deduplication
Accessible via CLI, workflows, and Web UI
Support filtering and metadata
Dataset Naming Convention
Datasets use the pattern dataset_name:version:
training_data - Latest version
training_data:v1 - Specific version
training_data:baseline - Named version
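For example, the download command covered later in this tutorial accepts the same name:version pattern; the version labels below are just the illustrative ones from the convention above:
# Download by version using the name:version pattern (labels are illustrative)
$ osmo dataset download training_data /tmp            # latest version
$ osmo dataset download training_data:v1 /tmp         # specific version
$ osmo dataset download training_data:baseline /tmp   # named version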
Uploading a Dataset#
To upload a dataset from a workflow task, write files to the {{output}} directory and
specify a dataset in the outputs:
workflow:
  name: create-dataset
  tasks:
    - name: generate-data
      image: ubuntu:24.04
      command: ["bash", "-c"]
      args:
        - |
          echo "Generating data..."
          mkdir -p {{output}}/data
          for i in {1..10}; do
            echo "Sample data $i" > {{output}}/data/file_$i.txt
          done
          echo "Data generation complete"
      outputs:
        - dataset:
            name: my_dataset  # (1)
1. Everything in {{output}} is uploaded to my_dataset after the task completes successfully.
Once uploaded, you can download a dataset to your local machine using the CLI:
# Download latest version
$ osmo dataset download my_dataset /tmp
# Download specific version
$ osmo dataset download my_dataset:1 /tmp
Downloading a Dataset#
To download a dataset in a workflow, add it to the task's inputs. To reference the dataset inside the task, use the {{input:#}} special token (see Special Tokens), where # is the zero-based index of the input.
workflow:
  name: read-dataset
  tasks:
    - name: process-data
      image: ubuntu:24.04
      command: ["bash", "-c"]
      args:
        - |
          echo "Reading dataset..."
          ls -la {{input:0}}/my_dataset/  # (1)
          cat {{input:0}}/my_dataset/data/file_1.txt
          echo "Processing complete"
      inputs:
        - dataset:
            name: my_dataset  # (2)
1. Access the dataset at {{input:0}}/my_dataset/, where {{input:0}} is the first input.
2. The dataset is downloaded before the task starts.
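The example above always pulls the latest version of my_dataset. If you need a particular version, one possible approach, assuming the name field accepts the dataset_name:version pattern described earlier (an assumption, not something this tutorial confirms), is to pin it in the input:
inputs:
  - dataset:
      name: my_dataset:v1  # assumption: version pinned via the name:version pattern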
Combining URLs and Datasets#
You can mix URLs and datasets in the same workflow:
workflow:
  name: mixed-storage
  tasks:
    - name: process-multiple-sources
      image: ubuntu:24.04
      command: ["bash", "-c"]
      args:
        - |
          echo "Processing data from multiple sources..."
          cat {{input:0}}/my_dataset/data.txt
          cat {{input:1}}/s3_data.txt
          # Generate outputs
          echo "Processed results" > {{output}}/results.txt
      inputs:
        - dataset:
            name: my_dataset             # (1)
        - url: s3://my-bucket/raw-data/  # (2)
      outputs:
        - dataset:
            name: processed_data         # (3)
        - url: s3://my-bucket/outputs/   # (4)
1. Download from the OSMO dataset at {{input:0}}/my_dataset/.
2. Download from S3 at {{input:1}}/.
3. Upload to the OSMO dataset.
4. Also upload to the S3 bucket.
Filtering Data#
Filter which files to download or upload using regex patterns:
workflow:
  name: filtered-io
  tasks:
    - name: selective-download
      image: ubuntu:24.04
      command: ["bash", "-c"]
      args: ["ls -la {{input:0}}/"]
      inputs:
        - dataset:
            name: large_dataset
            regex: .*\.txt$  # (1)
      outputs:
        - dataset:
            name: output_dataset
            regex: .*\.(json|yaml)$  # (2)
1. Only download .txt files from the input dataset.
2. Only upload .json and .yaml files to the output dataset.
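Regex patterns can also select a subdirectory rather than a file extension. The sketch below assumes patterns are matched against paths relative to the dataset root, which may not hold for your deployment, so adjust it to your layout:
inputs:
  - dataset:
      name: large_dataset
      regex: ^checkpoints/.*\.pth$  # assumption: matched against paths relative to the dataset root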
Next Steps#
Now that you understand data management, you’re ready to build more complex workflows. Continue to Serial Workflows to learn about task dependencies.