Combination Workflows#
This tutorial teaches you how to combine serial and parallel execution patterns by creating groups with dependencies—enabling sophisticated multi-stage workflows.
So far, you have learned:
Serial workflows (Tutorial #5) - Tasks run one after another with dependencies
Parallel workflows (Tutorial #6) - Tasks run simultaneously using groups
Combination workflows merge both patterns by creating groups with dependencies.
By the end, you’ll understand:
How to create workflows with groups that depend on each other
How data flows between groups
How to build complex multi-stage pipelines
Tip
Combination workflows are ideal for:
Data processing pipelines - Preprocess → train/validate in parallel → aggregate
ML workflows - Data prep → train multiple models → compare results
Testing workflows - Build → test on multiple configs → report
ETL pipelines - Extract → transform in parallel → load
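At its core the pattern is simple: a later group becomes dependent on an earlier group as soon as any of its tasks lists a task from that earlier group as an input. A minimal sketch of the shape (the workflow, group, and task names here are illustrative placeholders, not part of the examples below):
workflow:
  name: minimal-combination
  groups:
    - name: stage-one                    # runs first
      tasks:
        - name: produce
          lead: true
          image: ubuntu:24.04
          command: ["bash", "-c"]
          args:
            - |
              mkdir -p {{output}}/data
              echo "hello" > {{output}}/data/result.txt
    - name: stage-two                    # waits because of the input below
      tasks:
        - name: consume
          lead: true
          image: ubuntu:24.04
          command: ["bash", "-c"]
          args:
            - |
              cat {{input:0}}/data/result.txt
          inputs:
            - task: produce              # cross-group dependency: stage-two waits for stage-one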
Simple Example#
Let’s build a data processing pipeline with multiple stages by downloading the workflow definition
here: combination_workflow_simple.yaml.
workflow:
  name: data-pipeline
  groups:
    ##################################################
    # Group 1: Data Preparation (runs first)
    ##################################################
    - name: prepare-data
      tasks:
        - name: generate-dataset
          lead: true
          image: ubuntu:24.04
          command: ["bash", "-c"]
          args:
            - |
              echo "Generating training dataset..."
              mkdir -p {{output}}/data
              for i in {1..10}; do
                echo "sample_$i,value_$i" >> {{output}}/data/dataset.csv
              done
              echo "✓ Dataset generation complete!"
        - name: validate-data
          image: ubuntu:24.04
          command: ["bash", "-c"]
          args:
            - |
              echo "Validating dataset..."
              sleep 3
              echo "✓ Validation passed!"
    ##################################################
    # Group 2: Training (depends on Group 1)
    ##################################################
    - name: train-models
      tasks:
        - name: train-model-a
          lead: true
          image: ubuntu:24.04
          command: ["bash", "-c"]
          args:
            - |
              echo "Training Model A..."
              cat {{input:0}}/data/dataset.csv
              echo "✓ Model A complete!"
          inputs:
            - task: generate-dataset  # (1)
        - name: train-model-b
          image: ubuntu:24.04
          command: ["bash", "-c"]
          args:
            - |
              echo "Training Model B..."
              wc -l {{input:0}}/data/dataset.csv
              echo "✓ Model B complete!"
          inputs:
            - task: generate-dataset
(1) The generate-dataset task is an input task for the train-model-a task. Therefore, the entire train-models group waits for the prepare-data group to complete.
Execution Flow:
1. Group prepare-data starts → generate-dataset and validate-data run in parallel.
2. Task generate-dataset completes → Group train-models dependencies are satisfied.
3. Group train-models starts → train-model-a and train-model-b run in parallel.
Important
Group dependencies are established through task dependencies.
If any task in a group depends on a task from another group, the entire group waits for the other group to complete.
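In the simple example, this is exactly what links the two groups: both training tasks declare generate-dataset, a task from the prepare-data group, as an input (fragment excerpted from the workflow above):
        - name: train-model-a
          # ... image, command, and args as shown above ...
          inputs:
            - task: generate-dataset   # belongs to the prepare-data group
Because of this entry (and the identical one on train-model-b), the train-models group cannot start until prepare-data has finished.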
Key Characteristics:
✅ Serial execution between groups
✅ Parallel execution within groups
✅ Data flows from Group 1 to Group 2
✅ All tasks access the same data from the previous group
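If you inspect the task logs after a run, the two training tasks should print output along these lines (the concrete path that {{input:0}} resolves to depends on your deployment):
train-model-a:
  Training Model A...
  sample_1,value_1
  ...
  sample_10,value_10
  ✓ Model A complete!
train-model-b:
  Training Model B...
  10 <resolved input path>/data/dataset.csv
  ✓ Model B complete!
Both tasks read the same dataset.csv produced by generate-dataset, which is what the last characteristic above refers to.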
Complex Example#
Let’s build a more complex data processing pipeline by downloading the workflow definition
here: combination_workflow_complex.yaml.
workflow:
  name: complex-pipeline
  groups:
    ##################################################
    # Group 1: Fetch data
    ##################################################
    - name: fetch
      tasks:
        - name: download
          lead: true
          image: ubuntu:24.04
          command: ["bash", "-c"]
          args:
            - |
              echo 'Downloading data...'
              mkdir -p {{output}}/data
              echo "apple" > {{output}}/data/fruits.txt
              echo "banana" >> {{output}}/data/fruits.txt
              echo "cherry" >> {{output}}/data/fruits.txt
              echo "Data downloaded!"
    ##################################################
    # Group 2: Process (depends on Group 1)
    ##################################################
    - name: process
      tasks:
        - name: transform-a
          lead: true
          image: ubuntu:24.04
          command: ["bash", "-c"]
          args:
            - |
              echo 'Transform A: Converting to uppercase...'
              mkdir -p {{output}}/transformed
              tr '[:lower:]' '[:upper:]' < {{input:0}}/data/fruits.txt > {{output}}/transformed/uppercase.txt
              sleep 120  # (1)
              echo "✓ Transform A complete!"
          inputs:
            - task: download
        - name: transform-b
          image: ubuntu:24.04
          command: ["bash", "-c"]
          args:
            - |
              echo 'Transform B: Adding line numbers...'
              mkdir -p {{output}}/transformed
              cat -n {{input:0}}/data/fruits.txt > {{output}}/transformed/numbered.txt
              echo "✓ Transform B complete!"
          inputs:
            - task: download
    ##################################################
    # Group 3: Aggregate (depends on Group 2)
    ##################################################
    - name: aggregate
      tasks:
        - name: combine
          lead: true
          image: ubuntu:24.04
          command: ["bash", "-c"]
          args:
            - |
              echo 'Combining results from both transforms...'
              mkdir -p {{output}}/final
              echo "=== UPPERCASE VERSION ===" > {{output}}/final/combined.txt
              cat {{input:0}}/transformed/uppercase.txt >> {{output}}/final/combined.txt
              echo "" >> {{output}}/final/combined.txt
              echo "=== NUMBERED VERSION ===" >> {{output}}/final/combined.txt
              cat {{input:1}}/transformed/numbered.txt >> {{output}}/final/combined.txt
              echo "✓ Results combined!"
              cat {{output}}/final/combined.txt
          inputs:  # (2)
            - task: transform-a
            - task: transform-b
(1) The transform-a task is intentionally longer than the transform-b task to ensure that the lead task doesn't prematurely terminate the non-lead tasks.
(2) Group 3 depends on both tasks from Group 2, so it waits for Group 2 to complete. The {{input:0}} and {{input:1}} placeholders resolve to the outputs of transform-a and transform-b respectively, matching the order in which they are listed under inputs.
Execution:
1. Group fetch starts → download task runs.
2. download task completes → Group process dependencies are satisfied.
3. Group process starts → transform-a and transform-b run in parallel.
4. transform-b task completes → Group aggregate dependencies not yet satisfied.
5. transform-a task completes → Group aggregate dependencies are satisfied.
6. Group aggregate starts → combine task runs with outputs from both transforms.
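When the workflow completes, the combine task prints {{output}}/final/combined.txt to its log; it should look roughly like this (cat -n right-aligns line numbers, so the exact spacing may differ):
=== UPPERCASE VERSION ===
APPLE
BANANA
CHERRY

=== NUMBERED VERSION ===
     1  apple
     2  banana
     3  cherry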
Caution
To ensure that all tasks in a group run to completion, make sure that the lead task does not terminate before the non-lead tasks. This can be done by coordinating task completion through a barrier script (osmo_barrier.py) or by ensuring that the lead task's duration is longer than that of the non-lead tasks.
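The complex example above uses the second approach: the sleep 120 in transform-a exists only so the lead task outlives transform-b. A minimal sketch of that pattern (the 120-second value is an assumption; choose a duration comfortably longer than your slowest non-lead task):
          args:
            - |
              # ... the lead task's real work goes here ...
              sleep 120   # assumed padding so the lead task finishes after the non-lead tasks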
Important
Best Practices:
✅ Always designate one task as lead: true in each group
✅ Use clear group names that reflect their purpose (e.g., prepare-data, train-models)
✅ Make dependencies explicit through task inputs
✅ Consider which tasks should run in parallel vs. serially
Next Steps#
Continue Learning:
Gang Scheduling - Run tasks across different hardware platforms (x86, ARM, GPU) simultaneously
Advanced Patterns - Workflow templates, checkpointing, error handling, and more
See also
Related Documentation:
Groups - Full specification for groups
Inputs and Outputs - Data flow between tasks