Inputs and Outputs#
Inputs#
An input is a source of data to be downloaded into the task’s input directory. Three types of inputs are supported:

- `task`: Specifies an upstream task that the current task depends on. The dependency means the current task cannot be scheduled until the upstream task has COMPLETED. All files uploaded from the upstream task’s output directory are downloaded.
- `url`: Downloads files from an external object storage bucket using a URI. Learn more about the URI syntax at Storage URLs.
- `dataset`: Downloads the files from a dataset. Learn more about datasets at Working with Data.
Note
`dataset` can also be used to download the user’s local files/directories with the `localpath` attribute. For more information, see File Injection.
For example:
```yaml
workflow:
  name: "input-example"
  tasks:
    - name: task1
      inputs:
        - url: s3://bucket/path # (1)
        - dataset:
            name: workflow_example # (2)
      ...
    - name: task2
      inputs:
        - task: task1 # (3)
      ...
```
1. Downloads the files from the URI `s3://bucket/path`.
2. Downloads the files from the dataset `workflow_example`.
3. Downloads the files outputted by `task1`.
All input types also allow regex filtering of the files to include. For example, a filter that only includes `.txt` files:
```yaml
workflow:
  name: "input-example"
  tasks:
    - name: task1
      image: ubuntu
      command: [echo]
      args: ["Hello!"]
      inputs:
        - task: task1
          regex: .*\.txt$
        - url: s3://bucket/path
          regex: .*\.txt$
        - dataset:
            name: workflow_example
            regex: .*\.txt$
```
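As a minimal sketch of how such a filter behaves, the snippet below applies the `.*\.txt$` pattern from the example to an illustrative file listing. This assumes Python-compatible `re.search`-style matching; the platform’s exact matching semantics may differ, and the file names are hypothetical.

```python
import re

# Hypothetical file listing from an input source; names are illustrative.
files = ["report.txt", "logs/run1.txt", "image.png", "notes.txt.bak"]

# The filter from the example above; assumes re.search-style matching.
txt_filter = re.compile(r".*\.txt$")

# Keep only paths matching the filter, mirroring the `regex` field.
selected = [f for f in files if txt_filter.search(f)]
print(selected)  # → ['report.txt', 'logs/run1.txt']
```

Note that `notes.txt.bak` is excluded because the `$` anchor requires `.txt` at the very end of the path.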
These inputs can be referenced in the task using Special Tokens.
Dataset#
Dataset and collection inputs have the following additional fields:

| Field | Description |
|---|---|
| `name` | The name of the dataset/collection. |
| `regex` | A regex to filter the files to download. |
| `localpath` | When this is specified, this path is taken from the user’s local machine and uploaded as a dataset to be downloaded in the task. Learn more about `localpath` at File Injection. |
Examples of some regex usage:
```yaml
inputs:
  - dataset:
      name: my_dataset
      regex: .*\.txt$ # (1)
  - dataset:
      name: my_dataset
      regex: .*\.(yaml|json)$ # (2)
  - dataset:
      name: my_dataset
      regex: ^(.*\/my_folder\/|my_folder\/.*) # (3)
  - dataset:
      name: my_dataset
      regex: ^(.*\/my_folder\/|my_folder\/.*)(\.jpg)$ # (4)
```
1. Downloads all files ending with `.txt`.
2. Downloads all files ending with `.yaml` or `.json`.
3. Downloads all files inside the folder or subfolder `my_folder`.
4. Downloads all files inside the folder or subfolder `my_folder` ending with `.jpg`.
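Pattern (3) combines two alternatives: paths that contain `/my_folder/` anywhere, and paths that start with `my_folder/`. The sketch below demonstrates this on hypothetical paths, again assuming Python-style `re.search` semantics rather than the platform’s exact matcher.

```python
import re

# Folder filter from example (3); paths are illustrative.
folder_filter = re.compile(r"^(.*\/my_folder\/|my_folder\/.*)")

paths = [
    "my_folder/a.txt",       # starts with my_folder/  -> matches
    "data/my_folder/b.csv",  # contains /my_folder/    -> matches
    "other/c.txt",           # outside my_folder       -> no match
]

inside = [p for p in paths if folder_filter.search(p)]
print(inside)  # → ['my_folder/a.txt', 'data/my_folder/b.csv']
```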
Outputs#
An output folder is uploaded once the task has finished. To define a task output, use the outputs field when defining a task. Three output types are supported:

- `url`: Uploads files to an external object storage bucket using a URI. Learn more about the URI syntax at Storage URLs.
- `dataset`: Uploads the files to a dataset. Learn more about datasets at Working with Data.
- `update_dataset`: Creates a new dataset version with the combined files from the task’s output folder and the existing dataset version. Learn more about datasets at Update Dataset.
For example:
```yaml
workflow:
  name: "output-example"
  tasks:
    - name: task1
      image: ubuntu
      command: [echo]
      args: ["Hello!"]
      outputs:
        - url: s3://bucket/path # (1)
        - dataset:
            name: workflow_example # (2)
        - update_dataset:
            name: workflow_example:1 # (3)
```
1. Uploads the files to the URI `s3://bucket/path`.
2. Uploads the files to the dataset `workflow_example`.
3. Creates a new dataset version with the combined files from the task’s output folder and the existing dataset version `workflow_example:1`.
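The `update_dataset` combine step can be pictured as a union of the existing version’s files and the task’s output files. The sketch below is purely illustrative: it models files as a path-to-content mapping, and it lets the task’s output take precedence on name collisions, which is an assumption the documentation above does not specify.

```python
# Files in the existing dataset version and in the task's output folder.
# Both mappings and the collision rule are illustrative assumptions.
existing_version = {"a.txt": "old-a", "b.txt": "old-b"}
task_output = {"b.txt": "new-b", "c.txt": "new-c"}

# New version = combined files; task output wins on collisions (assumed).
new_version = {**existing_version, **task_output}
print(sorted(new_version))  # → ['a.txt', 'b.txt', 'c.txt']
```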
The `url` and `dataset` types allow regex filtering of the files to include. For example, a filter that only includes `.txt` files:
```yaml
workflow:
  name: "output-example"
  tasks:
    - name: task1
      image: ubuntu
      command: [echo]
      args: ["Hello!"]
      outputs:
        - url: s3://bucket/path
          regex: .*\.txt$
        - dataset:
            name: workflow_example
            regex: .*\.txt$
```
To learn how to specify which files are uploaded, see Templates and Special Tokens.
Dataset#
`dataset` has the following additional fields:

| Field | Description |
|---|---|
| `name` | The name of the dataset/collection. |
| `path` | A relative path from |
| `regex` | A regex to filter the files to upload. |
| `metadata` | A list of metadata files to apply to the dataset version. Learn more at Dataset Metadata CLI Command. |
| `labels` | A list of labels files to apply to the dataset. Learn more at Dataset Labels CLI Command. |
`update_dataset` has the following additional fields:

| Field | Description |
|---|---|
| `name` | The name of the dataset/collection. |
| `path` | A relative path from |
| `metadata` | A list of metadata files to apply to the dataset version. Learn more at Dataset Metadata CLI Command. |
| `labels` | A list of labels files to apply to the dataset. Learn more at Dataset Labels CLI Command. |
Note
The `update_dataset` type does not support the `regex` field.
An example of how to use the `metadata` and `labels` fields:

```yaml
outputs:
  - dataset:
      name: my_dataset
      metadata:
        - path/to/metadata.yaml
      labels:
        - path/to/labels.yaml
```