Resource Pools#

After successfully configuring the default pool, you can create additional pools to organize and control how users access your compute resources.

Why Create Multiple Pools?#

Pools divide your compute backend into logical resource groupings that enable:

Simplified User Experience

Apply Pod Templates to pools so users don’t repeat Kubernetes specifications in every workflow. Templates automatically handle node selectors, tolerations, and other scheduling requirements.

Resource Guardrails

Use Resource Validation rules to reject workflows that request more resources than available on your nodes, preventing scheduling failures.

Hardware Differentiation

For heterogeneous clusters with multiple GPU types, create platforms within pools to route workflows to specific hardware (A100, H100, L40S, etc.).

User Access Control

Integrate pools with user groups and roles to manage permissions; see Authentication and Authorization for details. For example, control which user groups can access specific compute resources based on workload type (training, simulation, inference) or project team.

Pool Architecture#

Pools organize compute resources in a hierarchical structure:

Backend (Kubernetes Cluster)
├── Pool: training-pool
│   ├── Platform: a100
│   └── Platform: h100
├── Pool: simulation-pool
│   ├── Platform: l40s
│   └── Platform: l40
└── Pool: inference-pool
    └── Platform: jetson-agx-orin

Workflow Submission Flow:

1. Access Control 🔐: check user permissions
2. Resource Check ⚖️: validate resource requests
3. Apply Templates 📋: build Kubernetes specs
4. Select Platform 🎯: route to the target hardware
5. Schedule & Run ▶️: identify a node in the cluster

Note

For detailed pool and platform configuration fields, see /api/configs/pool in the API reference documentation.

Practical Guide#

📄 Edit in your Helm values file

Everything in this section goes into your Helm values file under services.configs. Apply changes with helm upgrade.
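
As a quick orientation, every block used in this guide sits side by side under services.configs. A minimal skeleton (keys only; the steps below fill them in):

services:
  configs:
    enabled: true
    podTemplates: {}          # reusable pod fragments (Step 2)
    resourceValidations: {}   # request guardrails referenced by pools
    pools: {}                 # pools and their platforms (Step 3)
    roles: {}                 # access roles for pools (Step 5)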

Heterogeneous Pools#

For clusters with multiple GPU types (L40S, A100, H100, etc.), use platforms to route workflows to specific hardware.

Step 1: Identify Node Labels

Discover the node labels and taints for your hardware (the pod templates in the next step use tolerations to match those taints):

$ kubectl get nodes -o jsonpath='{.items[*].metadata.labels}' | jq -r 'to_entries[] | select(.key | startswith("nvidia.com/gpu.product")) | .value'
$ kubectl get nodes -o jsonpath='{.items[*].spec.taints}'
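
On a cluster matching this guide's examples, the two commands would print something like the following (illustrative output):

NVIDIA-L40S
NVIDIA-A100
[{"effect":"NoSchedule","key":"nvidia.com/gpu.product","value":"NVIDIA-A100"}]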

Step 2: Define Pod Templates for Each GPU Type

Pod templates are reusable Kubernetes pod fragments — they provide node selectors, tolerations, and other scheduling hints. Define them under services.configs.podTemplates:

services:
  configs:
    enabled: true
    podTemplates:
      l40s:
        spec:
          nodeSelector:
            nvidia.com/gpu.product: NVIDIA-L40S
      a100:
        spec:
          nodeSelector:
            nvidia.com/gpu.product: NVIDIA-A100
          tolerations:
            - key: nvidia.com/gpu.product
              operator: Equal
              value: NVIDIA-A100
              effect: NoSchedule

Step 3: Define the Pool with Platforms

Add the pool under services.configs.pools and reference the pod templates via override_pod_template. The example below also references the default_user and default_ctrl pod templates and the default_cpu, default_memory, and default_storage resource validations. These built-ins ship with the Helm chart defaults (see services.configs.podTemplates and services.configs.resourceValidations in the chart’s values.yaml). You can use them as-is, override them by re-declaring the same name in your values, or drop the references from the pool.

services:
  configs:
    pools:
      heterogeneous_pool:
        backend: default
        default_platform: l40s_platform
        description: Simulation and training pool
        common_default_variables:
          USER_CPU: 1
          USER_GPU: 0
          USER_MEMORY: 1Gi
          USER_STORAGE: 1Gi
        common_resource_validations:
          - default_cpu
          - default_memory
          - default_storage
        common_pod_template:
          - default_user
          - default_ctrl
        platforms:
          l40s_platform:
            description: L40S platform
            host_network_allowed: false
            privileged_allowed: false
            default_variables: {}
            resource_validations: []
            override_pod_template:
              - l40s
            allowed_mounts: []
          a100_platform:
            description: A100 platform
            host_network_allowed: false
            privileged_allowed: false
            default_variables: {}
            resource_validations: []
            override_pod_template:
              - a100
            allowed_mounts: []
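
With this configuration, a workflow submitted to a100_platform picks up the a100 template from Step 2 on top of the common templates. Assuming the override template is applied last, the pod carries the A100 scheduling fields, roughly:

# Scheduling fields contributed by the a100 pod template
# (a sketch; exact merge semantics are determined by the chart)
nodeSelector:
  nvidia.com/gpu.product: NVIDIA-A100
tolerations:
  - key: nvidia.com/gpu.product
    operator: Equal
    value: NVIDIA-A100
    effect: NoSchedule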

Step 4: Apply the Helm values

helm upgrade osmo deployments/charts/service -f my-values.yaml

Once the rollout completes, verify the pool:

$ osmo resource list --pool heterogeneous_pool

Step 5: Create a Role for the Pool

Add a role under services.configs.roles to grant access to the pool:

services:
  configs:
    roles:
      osmo-heterogeneous_pool:
        description: Submit workflows to heterogeneous_pool
        policies:
          - action: workflow:Create
            resource: pool/heterogeneous_pool
          - action: workflow:Read
            resource: pool/heterogeneous_pool

Users with this role can submit workflows to the new pool.

Note

For more info on role conventions, see Pool Role Naming Convention.

Step 6: Assign the role to users

Assign the role osmo-heterogeneous_pool to users so they can access the pool:

  • Without an IdP: Use the OSMO user and role APIs (e.g. create users with POST /api/auth/user, then assign the role with POST /api/auth/user/{id}/roles). See Roles and Policies, and the curl sketch after this list.

  • With an IdP: Map an IdP group to this role using IdP Role Mapping and Sync Modes so that users in that group get the role when they log in.
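
For the no-IdP path, the calls might look like the following sketch. The endpoint paths come from above, but the hostname, request payloads, and auth header are assumptions; check the API reference for the exact schema:

# Create a user (payload shape is an assumption, not the documented schema)
$ curl -X POST https://<osmo-host>/api/auth/user \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"username": "alice"}'

# Assign the pool role from Step 5 (payload shape is an assumption)
$ curl -X POST https://<osmo-host>/api/auth/user/<user-id>/roles \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"roles": ["osmo-heterogeneous_pool"]}'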

Additional Examples#

Training Pool - High-Performance GPU Pool

Configure a pool for training workloads with a GB200 platform:

services:
  configs:
    pools:
      robotics-training:
        description: High-performance GPU pool for robotics model training
        backend: gpu-cluster-01
        default_platform: gb200-platform
        common_default_variables:
          USER_CPU: 16
          USER_GPU: 1
          USER_MEMORY: 64Gi
          USER_STORAGE: 500Gi
        common_resource_validations:
          - default_cpu
          - default_memory
          - default_storage
          - gpu_training_validation
        common_pod_template:
          - default_amd64
          - training_optimized
          - high_memory
        platforms:
          gb200-platform:
            description: GB200 GPUs for high performance training
            override_pod_template:
              - training_gb200_template
            default_variables:
              USER_MEMORY: 80Gi

Simulation Pool - Graphics-Optimized Pool

Configure a pool for simulation workloads with L40/L40S platforms:

services:
  configs:
    pools:
      robotics-simulation:
        description: Graphics-optimized pool for robotics simulation
        backend: graphics-cluster-01
        default_platform: l40-platform
        common_default_variables:
          USER_CPU: 8
          USER_GPU: 1
          USER_MEMORY: 32Gi
          USER_STORAGE: 200Gi
        common_resource_validations:
          - default_cpu
          - default_memory
          - default_storage
          - simulation_gpu_validation
        common_pod_template:
          - default_amd64
          - simulation_optimized
          - graphics_drivers
        platforms:
          l40-platform:
            description: L40 GPUs for standard simulation
            override_pod_template:
              - simulation_l40_template
          l40s-platform:
            description: L40S GPUs for high-fidelity simulation
            override_pod_template:
              - simulation_l40s_template
            default_variables:
              USER_MEMORY: 48Gi

Inference Pool - NVIDIA Jetson Pool

Configure a pool for inference workloads with NVIDIA Jetson devices:

services:
  configs:
    pools:
      robotics-inference:
        description: NVIDIA Jetson pool for model inference
        backend: inference-cluster-01
        default_platform: jetson-thor-platform
        common_default_variables:
          USER_CPU: 4
          USER_GPU: 0
          USER_MEMORY: 16Gi
          USER_STORAGE: 50Gi
        common_resource_validations:
          - default_cpu
          - default_memory
          - default_storage
          - inference_validation
        common_pod_template:
          - default_amd64
          - inference_optimized
          - low_latency
        platforms:
          jetson-thor-platform:
            description: Jetson Thor platform for edge AI inference
            override_pod_template:
              - inference_jetson_thor_template
            default_variables:
              USER_GPU: 1
              USER_MEMORY: 8Gi
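
Note how platform-level default_variables layer over the pool-wide common_default_variables. Assuming the platform values take precedence, a workflow on jetson-thor-platform that sets nothing explicitly runs with:

USER_CPU: 4        # from common_default_variables
USER_GPU: 1        # overridden by jetson-thor-platform
USER_MEMORY: 8Gi   # overridden by jetson-thor-platform
USER_STORAGE: 50Gi # from common_default_variables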

Enabling Topology-Aware Scheduling#

Topology-aware scheduling ensures that tasks requiring high-bandwidth or low-latency communication are placed on physically co-located nodes—such as the same NVLink rack, spine switch, or availability zone. This requires KAI Scheduler v0.12 or later and nodes with the appropriate Kubernetes labels applied.

Note

topology_keys can only be configured on pools backed by a KAI Scheduler backend. Configuring it on a pool with an unsupported scheduler will be rejected.

📄 Edit in your Helm values file

Everything in this section goes into your Helm values file under services.configs. Apply changes with helm upgrade.

Step 1: Verify Node Labels

Confirm that your cluster nodes have labels for each topology level you want to expose. The label keys must match what you will configure in topology_keys:

$ kubectl get nodes -o jsonpath='{.items[*].metadata.labels}' | jq -r 'to_entries[] | select(.key | test("topology.kubernetes.io|nvidia.com/gpu-clique")) | "\(.key)=\(.value)"'
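
On a correctly labeled cluster, the command prints one key=value pair per matching label per node, for example (values are illustrative):

topology.kubernetes.io/zone=us-east-1a
topology.kubernetes.io/spine=spine-2
topology.kubernetes.io/rack=rack-07
nvidia.com/gpu-clique=clique-3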

Step 2: Add Topology Keys to the Pool Config

Add a topology_keys list to your pool configuration, ordered from coarsest to finest granularity. Each entry maps a user-friendly key name (which users reference in their workflow specs) to the actual Kubernetes node label:

services:
  configs:
    pools:
      my-pool:
        backend: my-backend
        topology_keys:
          - key: zone
            label: topology.kubernetes.io/zone
          - key: spine
            label: topology.kubernetes.io/spine
          - key: rack
            label: topology.kubernetes.io/rack
          - key: gpu-clique
            label: nvidia.com/gpu-clique

Step 3: Apply the Helm values

helm upgrade osmo deployments/charts/service -f my-values.yaml

OSMO will create a KAI Topology custom resource in the cluster for this pool. Users can then reference the configured key names when specifying topology requirements in their workflow specs.
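
To confirm the Topology resource was created for the pool, you can discover the exact resource name instead of guessing it (the grep pattern is an assumption; substitute whatever the first command returns):

$ kubectl get crds | grep -i topology
$ kubectl get <topology-resource-name>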

See also

See Topology-Aware Scheduling for how users specify topology requirements in their workflows.

Troubleshooting#

Pool Access Denied
  • Verify user’s group membership matches pool naming convention

  • Check role configuration includes correct pool path

Resource Validation Failures
  • Ensure validation rules match node capacity

  • Verify resource requests don’t exceed platform limits

Template Conflicts
  • Review template merge order (later templates override earlier ones); see the sketch after this list

  • Check for conflicting fields in merged templates
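
A minimal illustration of the merge-order rule, using two hypothetical templates that set the same field:

# templateA and templateB are hypothetical. If a pool merges them in this
# order, templateB wins and the effective value is NVIDIA-H100.
templateA:
  spec:
    nodeSelector:
      nvidia.com/gpu.product: NVIDIA-A100
templateB:
  spec:
    nodeSelector:
      nvidia.com/gpu.product: NVIDIA-H100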

Platform Not Available
  • Verify platform name is correctly specified in pool configuration

  • Ensure referenced pod templates exist

Debugging Tips
  • Start with simple configurations and add complexity gradually

  • Test access with different user accounts

  • Examine OSMO service logs for detailed error messages

Warning

Deleting or modifying pools used by running workflows may cause scheduling issues. Always verify pools are not in use before making changes.