Resource Pools#

After successfully configuring the default pool, you can create additional pools to organize and control how users access your compute resources.

Why Create Multiple Pools?#

Pools divide your compute backend into logical resource groupings that enable:

Simplified User Experience

Apply Pod Templates to pools so users don’t repeat Kubernetes specifications in every workflow. Templates automatically handle node selectors, tolerations, and other scheduling requirements.

Resource Guardrails

Use Resource Validation rules to reject workflows that request more resources than available on your nodes, preventing scheduling failures.

Hardware Differentiation

For heterogeneous clusters with multiple GPU types, create platforms within pools to route workflows to specific hardware (A100, H100, L40S, etc.).

User Access Control

Integrate pools with user groups and roles to manage permissions; see Authentication and Authorization for details. For example, control which user groups can access specific compute resources based on workload type (training, simulation, inference) or project team.

Pool Architecture#

Pools organize compute resources in a hierarchical structure:

Backend (Kubernetes Cluster)
├── Pool: training-pool
│   ├── Platform: a100
│   └── Platform: h100
├── Pool: simulation-pool
│   ├── Platform: l40s
│   └── Platform: l40
└── Pool: inference-pool
    └── Platform: jetson-agx-orin

Workflow Submission Flow:

1. Access Control 🔐: check user permissions
2. Resource Check ⚖️: validate resource requests
3. Apply Templates 📋: build Kubernetes specs
4. Select Platform 🎯: route to the target hardware
5. Schedule & Run ▶️: identify a node in the cluster

Note

For detailed pool and platform configuration fields, see /api/configs/pool in the API reference documentation.

Practical Guide#

📄 Edit in your Helm values file

Everything in this section goes into your Helm values file under services.configs. Apply changes with helm upgrade.
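
As a quick orientation, every block used in this guide sits side by side under services.configs. A minimal skeleton (keys only; the steps below fill them in):

services:
  configs:
    enabled: true
    podTemplates: {}          # reusable pod fragments (Step 2)
    resourceValidations: {}   # request guardrails referenced by pools
    pools: {}                 # pools and their platforms (Step 3)
    roles: {}                 # access roles for pools (Step 5)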

Heterogeneous Pools#

For clusters with multiple GPU types (L40S, A100, H100, etc.), use platforms to route workflows to specific hardware.

Step 1: Identify Node Labels

Discover the node labels and taints for your hardware (the pod templates in the next step use tolerations to match those taints):

$ kubectl get nodes -o jsonpath='{.items[*].metadata.labels}' | jq -r 'to_entries[] | select(.key | startswith("nvidia.com/gpu.product")) | .value'
$ kubectl get nodes -o jsonpath='{.items[*].spec.taints}'
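
On a cluster matching this guide's examples, the two commands would print something like the following (illustrative output):

NVIDIA-L40S
NVIDIA-A100
[{"effect":"NoSchedule","key":"nvidia.com/gpu.product","value":"NVIDIA-A100"}]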

Step 2: Define Pod Templates for Each GPU Type

Pod templates are reusable Kubernetes pod fragments — they provide node selectors, tolerations, and other scheduling hints. Define them under services.configs.podTemplates:

services:
  configs:
    enabled: true
    podTemplates:
      l40s:
        spec:
          nodeSelector:
            nvidia.com/gpu.product: NVIDIA-L40S
      a100:
        spec:
          nodeSelector:
            nvidia.com/gpu.product: NVIDIA-A100
          tolerations:
            - key: nvidia.com/gpu.product
              operator: Equal
              value: NVIDIA-A100
              effect: NoSchedule

Step 3: Define the Pool with Platforms

Add the pool under services.configs.pools and reference the pod templates via override_pod_template. The example below also references the default_user and default_ctrl pod templates and the default_cpu, default_memory, and default_storage resource validations. These built-ins ship with the Helm chart defaults (see services.configs.podTemplates and services.configs.resourceValidations in the chart’s values.yaml). You can use them as-is, override them by re-declaring the same name in your values, or drop the references from the pool.

services:
  configs:
    pools:
      heterogeneous_pool:
        backend: default
        default_platform: l40s_platform
        description: Simulation and training pool
        common_default_variables:
          USER_CPU: 1
          USER_GPU: 0
          USER_MEMORY: 1Gi
          USER_STORAGE: 1Gi
        common_resource_validations:
          - default_cpu
          - default_memory
          - default_storage
        common_pod_template:
          - default_user
          - default_ctrl
        platforms:
          l40s_platform:
            description: L40S platform
            host_network_allowed: false
            privileged_allowed: false
            default_variables: {}
            resource_validations: []
            override_pod_template:
              - l40s
            allowed_mounts: []
          a100_platform:
            description: A100 platform
            host_network_allowed: false
            privileged_allowed: false
            default_variables: {}
            resource_validations: []
            override_pod_template:
              - a100
            allowed_mounts: []
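
With this configuration, a workflow submitted to a100_platform picks up the a100 template from Step 2 on top of the common templates. Assuming the override template is applied last, the pod carries the A100 scheduling fields, roughly:

# Scheduling fields contributed by the a100 pod template
# (a sketch; exact merge semantics are determined by the chart)
nodeSelector:
  nvidia.com/gpu.product: NVIDIA-A100
tolerations:
  - key: nvidia.com/gpu.product
    operator: Equal
    value: NVIDIA-A100
    effect: NoSchedule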

Step 4: Apply the Helm values

helm upgrade osmo deployments/charts/service -f my-values.yaml

Once the rollout completes, verify the pool:

$ osmo resource list --pool heterogeneous_pool

Step 5: Create a Role for the Pool

Add a role under services.configs.roles to grant access to the pool:

services:
  configs:
    roles:
      osmo-heterogeneous_pool:
        description: Submit workflows to heterogeneous_pool
        policies:
          - action: workflow:Create
            resource: pool/heterogeneous_pool
          - action: workflow:Read
            resource: pool/heterogeneous_pool

Users with this role can submit workflows to the new pool.

Note

For more info on role conventions, see Pool Role Naming Convention.

Step 6: Assign the role to users

Assign the role osmo-heterogeneous_pool to users so they can access the pool:

  • Without an IdP: Use the OSMO user and role APIs (e.g. create users with POST /api/auth/user, then assign the role with POST /api/auth/user/{id}/roles). See Roles and Policies, and the curl sketch after this list.

  • With an IdP: Map an IdP group to this role using IdP Role Mapping and Sync Modes so that users in that group get the role when they log in.
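
For the no-IdP path, the calls might look like the following sketch. The endpoint paths come from above, but the hostname, request payloads, and auth header are assumptions; check the API reference for the exact schema:

# Create a user (payload shape is an assumption, not the documented schema)
$ curl -X POST https://<osmo-host>/api/auth/user \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"username": "alice"}'

# Assign the pool role from Step 5 (payload shape is an assumption)
$ curl -X POST https://<osmo-host>/api/auth/user/<user-id>/roles \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"roles": ["osmo-heterogeneous_pool"]}'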

Additional Examples#

Training Pool - High-Performance GPU Pool

Configure a pool for training workloads with a GB200 platform:

services:
  configs:
    pools:
      robotics-training:
        description: High-performance GPU pool for robotics model training
        backend: gpu-cluster-01
        default_platform: gb200-platform
        common_default_variables:
          USER_CPU: 16
          USER_GPU: 1
          USER_MEMORY: 64Gi
          USER_STORAGE: 500Gi
        common_resource_validations:
          - default_cpu
          - default_memory
          - default_storage
          - gpu_training_validation
        common_pod_template:
          - default_amd64
          - training_optimized
          - high_memory
        platforms:
          gb200-platform:
            description: GB200 GPUs for high performance training
            override_pod_template:
              - training_gb200_template
            default_variables:
              USER_MEMORY: 80Gi

Simulation Pool - Graphics-Optimized Pool

Configure a pool for simulation workloads with L40/L40S platforms:

services:
  configs:
    pools:
      robotics-simulation:
        description: Graphics-optimized pool for robotics simulation
        backend: graphics-cluster-01
        default_platform: l40-platform
        common_default_variables:
          USER_CPU: 8
          USER_GPU: 1
          USER_MEMORY: 32Gi
          USER_STORAGE: 200Gi
        common_resource_validations:
          - default_cpu
          - default_memory
          - default_storage
          - simulation_gpu_validation
        common_pod_template:
          - default_amd64
          - simulation_optimized
          - graphics_drivers
        platforms:
          l40-platform:
            description: L40 GPUs for standard simulation
            override_pod_template:
              - simulation_l40_template
          l40s-platform:
            description: L40S GPUs for high-fidelity simulation
            override_pod_template:
              - simulation_l40s_template
            default_variables:
              USER_MEMORY: 48Gi

Inference Pool - NVIDIA Jetson Pool

Configure a pool for inference workloads with NVIDIA Jetson devices:

services:
  configs:
    pools:
      robotics-inference:
        description: NVIDIA Jetson pool for model inference
        backend: inference-cluster-01
        default_platform: jetson-thor-platform
        common_default_variables:
          USER_CPU: 4
          USER_GPU: 0
          USER_MEMORY: 16Gi
          USER_STORAGE: 50Gi
        common_resource_validations:
          - default_cpu
          - default_memory
          - default_storage
          - inference_validation
        common_pod_template:
          - default_amd64
          - inference_optimized
          - low_latency
        platforms:
          jetson-thor-platform:
            description: Jetson Thor platform for edge AI inference
            override_pod_template:
              - inference_jetson_thor_template
            default_variables:
              USER_GPU: 1
              USER_MEMORY: 8Gi
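
Note how platform-level default_variables layer over the pool-wide common_default_variables. Assuming the platform values take precedence, a workflow on jetson-thor-platform that sets nothing explicitly runs with:

USER_CPU: 4        # from common_default_variables
USER_GPU: 1        # overridden by jetson-thor-platform
USER_MEMORY: 8Gi   # overridden by jetson-thor-platform
USER_STORAGE: 50Gi # from common_default_variables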

Enabling Topology-Aware Scheduling#

Topology-aware scheduling ensures that tasks requiring high-bandwidth or low-latency communication are placed on physically co-located nodes—such as the same NVLink rack, spine switch, or availability zone. This requires KAI Scheduler v0.12 or later and nodes with the appropriate Kubernetes labels applied.

Note

topology_keys can only be configured on pools backed by a KAI Scheduler backend. Configuring it on a pool with an unsupported scheduler will be rejected.

📄 Edit in your Helm values file

Everything in this section goes into your Helm values file under services.configs. Apply changes with helm upgrade.

Step 1: Verify Node Labels

Confirm that your cluster nodes have labels for each topology level you want to expose. The label keys must match what you will configure in topology_keys:

$ kubectl get nodes -o jsonpath='{.items[*].metadata.labels}' | jq -r 'to_entries[] | select(.key | test("topology.kubernetes.io|nvidia.com/gpu-clique")) | "\(.key)=\(.value)"'
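
On a correctly labeled cluster, the command prints one key=value pair per matching label per node, for example (values are illustrative):

topology.kubernetes.io/zone=us-east-1a
topology.kubernetes.io/spine=spine-2
topology.kubernetes.io/rack=rack-07
nvidia.com/gpu-clique=clique-3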

Step 2: Add Topology Keys to the Pool Config

Add a topology_keys list to your pool configuration, ordered from coarsest to finest granularity. Each entry maps a user-friendly key name (which users reference in their workflow specs) to the actual Kubernetes node label:

services:
  configs:
    pools:
      my-pool:
        backend: my-backend
        topology_keys:
          - key: zone
            label: topology.kubernetes.io/zone
          - key: spine
            label: topology.kubernetes.io/spine
          - key: rack
            label: topology.kubernetes.io/rack
          - key: gpu-clique
            label: nvidia.com/gpu-clique

Step 3: Apply the Helm values

helm upgrade osmo deployments/charts/service -f my-values.yaml

OSMO will create a KAI Topology custom resource in the cluster for this pool. Users can then reference the configured key names when specifying topology requirements in their workflow specs.
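
To confirm the Topology resource was created for the pool, you can discover the exact resource name instead of guessing it (the grep pattern is an assumption; substitute whatever the first command returns):

$ kubectl get crds | grep -i topology
$ kubectl get <topology-resource-name>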

See also

See Topology-Aware Scheduling for how users specify topology requirements in their workflows.

Troubleshooting#

Pool Access Denied
  • Verify user’s group membership matches pool naming convention

  • Check role configuration includes correct pool path

Resource Validation Failures
  • Ensure validation rules match node capacity

  • Verify resource requests don’t exceed platform limits

Template Conflicts
  • Review template merge order (later templates override earlier ones); see the sketch after this list

  • Check for conflicting fields in merged templates
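
A minimal illustration of the merge-order rule, using two hypothetical templates that set the same field:

# templateA and templateB are hypothetical. If a pool merges them in this
# order, templateB wins and the effective value is NVIDIA-H100.
templateA:
  spec:
    nodeSelector:
      nvidia.com/gpu.product: NVIDIA-A100
templateB:
  spec:
    nodeSelector:
      nvidia.com/gpu.product: NVIDIA-H100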

Platform Not Available
  • Verify platform name is correctly specified in pool configuration

  • Ensure referenced pod templates exist

Debugging Tips
  • Start with simple configurations and add complexity gradually

  • Test access with different user accounts

  • Examine OSMO service logs for detailed error messages

Warning

Deleting or modifying pools used by running workflows may cause scheduling issues. Always verify pools are not in use before making changes.