Architecture#
OSMO follows a modular, cloud-native architecture designed to orchestrate complex AI and robotics workflows across heterogeneous compute resources.
Control Plane#
The OSMO control plane manages the entire lifecycle of your workflows:
- API Server
Provides RESTful APIs for workflow submission, monitoring, and management, accessible through both the CLI and the Web UI (see the submission sketch after this list).
- Workflow Engine
Parses YAML workflow specifications, orchestrates task execution, handles dependencies, and manages the workflow state machine.
- Authentication & Authorization
Integrates with external identity providers (OIDC, SAML) to manage user access and permissions.
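Because the API Server is REST-based, submission can be scripted directly. The sketch below is purely illustrative: the endpoint path, bearer-token auth, and response field are assumptions, not OSMO's documented API (which the CLI and Web UI wrap for you).

```python
# Minimal sketch of submitting a workflow spec over REST.
# ASSUMPTIONS: the endpoint path, auth header, and the "workflow_id"
# response field are illustrative, not OSMO's real API surface.
import requests

def submit_workflow(base_url: str, token: str, spec_yaml: str) -> str:
    resp = requests.post(
        f"{base_url}/api/workflows",  # hypothetical endpoint
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/yaml"},
        data=spec_yaml.encode("utf-8"),
    )
    resp.raise_for_status()
    return resp.json()["workflow_id"]  # hypothetical response field
```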
Compute Layer#
OSMO connects to multiple Kubernetes clusters as compute backends:
- Pools & Platforms
Resources are organized into pools (logical groupings) and platforms (specific hardware types), allowing precise targeting of workloads (see the matching sketch after this list).
- Heterogeneous Support
Connect cloud clusters (AKS, EKS, GKE), on-premise bare-metal clusters, and edge devices (NVIDIA Jetson) simultaneously.
- Scheduler
Leverages NVIDIA Run:AI to intelligently allocate GPU and CPU resources across workflows, optimizing for utilization and fairness.
- Container Orchestration
Each task in a workflow runs as a Kubernetes pod with specified container images, resource requirements, and environment configurations.
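To make the pools-and-platforms model concrete, here is a toy matcher over an in-memory view of capacity. All names and fields are invented for illustration; they are not OSMO's schema.

```python
# Toy pool/platform matcher. Pools are logical groupings; platforms are
# specific hardware types within them. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class Platform:
    name: str        # e.g. "a100-80gb" or "jetson-orin"
    free_gpus: int

@dataclass
class Pool:
    name: str        # e.g. "training" or "edge"
    platforms: list[Platform]

def find_placement(pools: list[Pool], pool: str, platform: str, gpus: int):
    """Return the first matching platform with capacity, or None."""
    for p in pools:
        if p.name != pool:
            continue
        for plat in p.platforms:
            if plat.name == platform and plat.free_gpus >= gpus:
                return plat
    return None
```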
Data Layer#
OSMO manages data through an abstraction layer:
- Storage Backends
Supports S3-compatible object storage and Azure Blob Storage with configurable credentials.
- Data Injection
Automatically injects data into task containers at specified paths, enabling seamless access to inputs and outputs.
- Control Plane Integration
Integrates with the control plane’s Dataset Service to provide version-controlled storage for training data, models, and artifacts using content-addressable storage (see the sketch after this list).
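Content-addressable storage keys each object by a hash of its bytes, so identical content is stored once and dataset versions are immutable. A generic sketch of the idea (the key layout is illustrative, not OSMO's internal format):

```python
# Generic content-addressable key derivation: the storage key is the
# SHA-256 digest of the bytes, so identical blobs deduplicate naturally.
import hashlib

def content_address(data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    return f"cas/{digest[:2]}/{digest}"  # prefix sharding (illustrative)

print(content_address(b"training-shard-0001"))
```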
How It Works#
Submitting#
%%{init: {'theme':'base'}}%%
sequenceDiagram
autonumber
participant User
box Control Plane
participant API as API Server
participant Workflow as Workflow Engine
end
box Compute Layer
participant Scheduler as Scheduler
end
User->>API: Submit workflow<br/>(CLI or Web UI)
API->>API: Validate<br/>specification
API->>Workflow: Generates<br/>execution graph
Workflow->>Scheduler: Enqueues task(s)
1. User submits a workflow via the CLI or Web UI
2. API Server authenticates the user and validates the workflow specification
3. Workflow Engine parses the specification and builds the execution graph (a DAG of tasks; see the sketch below)
4. Workflow Engine enqueues tasks to the Scheduler for resource allocation
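A minimal sketch of step 3, assuming a simplified spec shape in which each task names its dependencies (this is not OSMO's YAML schema):

```python
# Build the execution graph from a simplified spec and compute a valid
# execution order; graphlib raises CycleError if the spec is not a DAG.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

spec = {
    "preprocess": {"depends_on": []},
    "train":      {"depends_on": ["preprocess"]},
    "evaluate":   {"depends_on": ["train"]},
}

graph = {name: set(t["depends_on"]) for name, t in spec.items()}
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['preprocess', 'train', 'evaluate']
```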
Scheduling#
%%{init: {'theme':'base'}}%%
sequenceDiagram
box Control Plane
participant Workflow as Workflow Engine
end
box Compute Layer
participant Scheduler as Scheduler
participant Cluster as Compute Nodes
end
Workflow-->>Scheduler: Enqueues a task
autonumber 1
Scheduler->>Scheduler: Evaluates<br/>requirements
Scheduler->>Cluster: Assigns resources
Cluster->>Cluster: Provisions pod
After a task is enqueued to the Scheduler:
1. Scheduler evaluates the task’s resource requirements
2. Scheduler assigns the task to appropriate pools and platforms with available capacity (see the placement sketch below)
3. Compute Cluster creates a pod with the specified container image, credentials, and environment variables
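A toy greedy placement loop illustrating step 2. The real scheduler (NVIDIA Run:AI) also weighs fairness and utilization; everything below is a simplified assumption.

```python
# Place each queued task on the first platform with enough free GPUs,
# or leave it waiting. Greedy and fairness-blind: illustration only.
def schedule(queue: list[dict], platforms: list[dict]):
    placed, waiting = [], []
    for task in queue:                 # task: {"name": str, "gpus": int}
        for plat in platforms:         # plat: {"name": str, "free_gpus": int}
            if plat["free_gpus"] >= task["gpus"]:
                plat["free_gpus"] -= task["gpus"]
                placed.append((task["name"], plat["name"]))
                break
        else:
            waiting.append(task)
    return placed, waiting
```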
Executing#
%%{init: {'theme':'base'}}%%
sequenceDiagram
box Control Plane
participant Dataset as Dataset Service
end
box Compute Layer
participant Cluster as Compute Nodes
end
box Data Layer
participant Storage as Storage
end
opt When using a dataset⁺
Cluster-->>Dataset: Fetches data metadata
end
autonumber 1
Storage->>Cluster: Injects inputs
Cluster->>Cluster: Executes task
Cluster->>Storage: Persists outputs
autonumber off
opt When using a dataset⁺
Cluster-->>Dataset: Writes data metadata
end
1. Storage injects task inputs into the container (see the sketch below)
2. Compute Nodes execute the task using the injected inputs
3. Task outputs are persisted to Storage for downstream tasks and external access
⁺ For datasets, a task uses the Dataset Service to index the data for efficient storage and retrieval.
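A sketch of what input injection amounts to inside the container: each declared input is materialized at its target path before the task starts. Here `fetch_object` is a stand-in for an S3/Azure Blob download, and the mapping shape is an assumption.

```python
# Materialize declared inputs at their container paths before the task
# runs. fetch_object(key) -> bytes is a placeholder for a storage client.
from pathlib import Path

def inject_inputs(inputs: dict[str, str], fetch_object) -> None:
    """inputs maps storage keys to container paths, e.g.
    {"datasets/cityscapes/shard-0": "/inputs/shard-0"} (illustrative)."""
    for key, dest in inputs.items():
        path = Path(dest)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(fetch_object(key))
```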
Orchestrating#
Serial Workflows#
Tasks with dependencies execute sequentially in topological order:
%%{init: {'theme':'base'}}%%
sequenceDiagram
box Control Plane
participant Workflow as Workflow Engine
end
box Compute Layer
participant Scheduler as Scheduler
participant Cluster as Compute Nodes
end
Workflow->>Scheduler: Enqueues Task 1 first
Scheduler->>Cluster: Executing Task 1
activate Cluster
Cluster-->>Workflow: Task 1 completed
deactivate Cluster
Workflow->>Scheduler: Enqueues Task 2 next
Scheduler->>Cluster: Executing Task 2
activate Cluster
Cluster-->>Workflow: Task 2 completed
deactivate Cluster
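In code, serial orchestration is just the topological order from the submission step replayed one task at a time. A toy driver, assuming a `run_task` callable that returns True on success:

```python
# Toy serial driver: run tasks one at a time in dependency order and
# halt at the first failure. run_task(name) -> bool is a placeholder.
def run_serial(order: list[str], run_task) -> None:
    for name in order:
        if not run_task(name):
            raise RuntimeError(f"task {name} failed; downstream tasks blocked")
```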
Parallel Workflows#
Independent tasks execute in parallel for maximum throughput:
%%{init: {'theme':'base'}}%%
sequenceDiagram
box Control Plane
participant Workflow as Workflow Engine
end
box Compute Layer
participant Scheduler as Scheduler
participant Cluster as Compute Nodes
end
Workflow->>Scheduler: Enqueues Task 1 & Task 2
par
Scheduler->>Cluster: Executing Task 1
activate Cluster
Cluster-->>Workflow: Task 1 completed
deactivate Cluster
and
Scheduler->>Cluster: Executing Task 2
activate Cluster
Cluster-->>Workflow: Task 2 completed
deactivate Cluster
end
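The parallel case fans independent tasks out concurrently. A toy driver using the same placeholder `run_task` as the serial sketch:

```python
# Toy parallel driver: independent tasks run concurrently and the
# results are gathered by task name. run_task(name) -> bool as before.
from concurrent.futures import ThreadPoolExecutor

def run_parallel(tasks: list[str], run_task) -> dict[str, bool]:
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(run_task, name) for name in tasks}
        return {name: fut.result() for name, fut in futures.items()}
```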
Status Monitoring#
Workflow Engine monitors task states and orchestrates execution based on dependencies and outcomes:
%%{init: {'theme':'base'}}%%
sequenceDiagram
box Control Plane
participant Workflow as Workflow Engine
end
box Compute Layer
participant Scheduler as Scheduler
participant Cluster as Compute Nodes
end
alt
autonumber 1
Cluster-->>Workflow: Task completed
autonumber off
Workflow->>Scheduler: Enqueues<br/>downstream tasks
else
autonumber 2
Cluster-->>Workflow: Task failed (retryable)
autonumber off
Workflow->>Scheduler: Re-enqueues task
Scheduler->>Cluster: Task is rescheduled
else
autonumber 3
Cluster-->>Workflow: Task failed (non-retryable)
autonumber off
Workflow-xWorkflow: Blocks downstream tasks
end
1. Completed tasks trigger their dependent downstream tasks
2. Failed retryable tasks are re-enqueued for automatic recovery (see the sketch below)
3. Failed non-retryable tasks block their downstream dependencies
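The three branches above reduce to a small outcome handler. The retry budget and field names below are illustrative assumptions, not OSMO configuration keys.

```python
# Toy outcome handler mirroring the three monitoring branches.
def handle_outcome(task: dict, outcome: str, enqueue, block_downstream,
                   max_retries: int = 3) -> None:
    if outcome == "completed":
        for child in task["children"]:     # trigger downstream tasks
            enqueue(child)
    elif outcome == "failed_retryable" and task["attempts"] < max_retries:
        task["attempts"] += 1              # automatic recovery
        enqueue(task["name"])
    else:                                  # non-retryable, or budget spent
        block_downstream(task["name"])
```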
Interacting#
Observability#
Tasks publish logs and metrics to the API Server, enabling real-time observability.
%%{init: {'theme':'base'}}%%
sequenceDiagram
participant User
box Control Plane
participant API as API Server
end
box Compute Layer
participant Cluster as Compute Nodes
end
Cluster->>API: Publishes logs & metrics
API-->>User: Streams logs & metrics
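Log streaming can be consumed client-side as a long-lived HTTP response. The endpoint path below is a hypothetical illustration; in practice the CLI and Web UI wrap this for you.

```python
# Sketch of client-side log streaming. stream=True keeps the connection
# open and iter_lines() yields lines as the server flushes them.
# The endpoint path and auth header are illustrative assumptions.
import requests

def stream_logs(base_url: str, token: str, workflow_id: str) -> None:
    url = f"{base_url}/api/workflows/{workflow_id}/logs"  # hypothetical
    with requests.get(url, headers={"Authorization": f"Bearer {token}"},
                      stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                print(line.decode("utf-8", errors="replace"))
```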
Interactive Access#
Users can access RUNNING tasks for real-time interactive development and debugging:
%%{init: {'theme':'base'}}%%
sequenceDiagram
autonumber
participant User
box Control Plane
participant API as API Server
end
box Compute Layer
participant Cluster as Compute Nodes
end
User->>API: Initiates handshake
API->>Cluster: Establishes handshake
User<<-->>Cluster: Bi-directional streaming
1. User initiates interactive access through the API Server
2. API Server coordinates with the Compute Nodes to establish a secured connection
3. User interacts with the Compute Nodes over the bi-directional streaming connection (e.g., SSH, VSCode Remote, Jupyter; see the sketch below)
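A minimal sketch of the bi-directional channel using a WebSocket client. The URL and plain-text framing are assumptions for illustration; real SSH/VSCode/Jupyter tunnels multiplex binary traffic over the stream, and authentication is omitted for brevity.

```python
# Minimal bi-directional exchange over a WebSocket (pip install websockets).
# The URL and message framing are illustrative assumptions.
import asyncio
import websockets

async def interactive_session(ws_url: str) -> None:
    async with websockets.connect(ws_url) as ws:
        await ws.send("echo hello")   # client -> running task
        print(await ws.recv())        # running task -> client

# asyncio.run(interactive_session("wss://osmo.example.com/tasks/123/attach"))
```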
Summary#
This architecture enables OSMO to scale from a single developer workstation to massive cloud deployments while maintaining a consistent interface and workflow experience.
See also
For detailed deployment procedures, refer to the Deployment Guide.