Resource Pools#
After successfully configuring the default pool, you can create additional pools to organize and control how users access your compute resources.
Why Create Multiple Pools?#
Pools divide your compute backend into logical resource groupings that enable:
- ✓ Simplified User Experience
Apply Pod Templates to pools so users don’t repeat Kubernetes specifications in every workflow. Templates automatically handle node selectors, tolerations, and other scheduling requirements.
- ✓ Resource Guardrails
Use Resource Validation rules to reject workflows that request more resources than available on your nodes, preventing scheduling failures.
- ✓ Hardware Differentiation
For heterogeneous clusters with multiple GPU types, create platforms within pools to route workflows to specific hardware (A100, H100, L40S, etc.).
- ✓ User Access Control
Integrate pools with user groups and roles to manage permissions; see Authentication and Authorization for details. For example, control which user groups can access specific compute resources based on workload type (training, simulation, inference) or project team.
Pool Architecture#
Pools organize compute resources in a hierarchical structure:
Backend (Kubernetes Cluster)
├── Pool: training-pool
│   ├── Platform: a100
│   └── Platform: h100
├── Pool: simulation-pool
│   ├── Platform: l40s
│   └── Platform: l40
└── Pool: inference-pool
    └── Platform: jetson-agx-orin
Workflow Submission Flow:
1. Access Control 🔐
Check user permissions
2. Resource Check ⚖️
Validate requests
3. Apply Templates 📋
Build K8s specs
4. Select Platform 🎯
Route to hardware
5. Schedule & Run ▶️
Identify a node in the cluster
Note
For detailed pool and platform configuration fields, see /api/configs/pool in the API reference documentation.
Practical Guide#
📄 Edit in your Helm values file
Everything in this section is added to your Helm values file under services.configs.
Apply changes with helm upgrade.
Heterogeneous Pools#
For clusters with multiple GPU types (L40S, A100, H100, etc.), use platforms to route workflows to specific hardware.
Step 1: Identify Node Labels
Discover node labels and taints for your hardware:
$ kubectl get nodes -o jsonpath='{.items[*].metadata.labels}' | jq -r 'to_entries[] | select(.key | startswith("nvidia.com/gpu.product")) | .value'
$ kubectl get nodes -o jsonpath='{.items[*].spec.taints}'
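If you prefer a per-node view, kubectl's -L flag prints a label as an extra column and custom-columns can show each node's taints alongside its name. This assumes the nvidia.com/gpu.product label set by GPU Feature Discovery; substitute whatever labels your nodes actually carry:
$ kubectl get nodes -L nvidia.com/gpu.product
$ kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints'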
Step 2: Define Pod Templates for Each GPU Type
Pod templates are reusable Kubernetes pod fragments — they provide node selectors, tolerations, and other scheduling hints. Define them under services.configs.podTemplates:
services:
  configs:
    enabled: true
    podTemplates:
      l40s:
        spec:
          nodeSelector:
            nvidia.com/gpu.product: NVIDIA-L40S
      a100:
        spec:
          nodeSelector:
            nvidia.com/gpu.product: NVIDIA-A100
          tolerations:
            - key: nvidia.com/gpu.product
              operator: Equal
              value: NVIDIA-A100
              effect: NoSchedule
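Many clusters taint every GPU node with one shared key rather than a per-product one. A minimal sketch for that case, assuming the common nvidia.com/gpu taint; the gpu-shared template name is illustrative, and the key, value, and effect should match whatever Step 1 reported for your nodes:
services:
  configs:
    podTemplates:
      gpu-shared:
        spec:
          tolerations:
            - key: nvidia.com/gpu   # assumed shared taint key; verify with Step 1
              operator: Exists
              effect: NoSchedule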
Step 3: Define the Pool with Platforms
Add the pool under services.configs.pools and reference the pod templates via override_pod_template. The example below also references the built-in pod templates default_user and default_ctrl and the built-in resource validations default_cpu, default_memory, and default_storage, which ship with the Helm chart defaults (see services.configs.podTemplates and services.configs.resourceValidations in the chart's values.yaml). You can use them as-is, override them by re-declaring them under the same name in your values, or drop the references from the pool.
services:
  configs:
    pools:
      heterogeneous_pool:
        backend: default
        default_platform: l40s_platform
        description: Simulation and training pool
        common_default_variables:
          USER_CPU: 1
          USER_GPU: 0
          USER_MEMORY: 1Gi
          USER_STORAGE: 1Gi
        common_resource_validations:
          - default_cpu
          - default_memory
          - default_storage
        common_pod_template:
          - default_user
          - default_ctrl
        platforms:
          l40s_platform:
            description: L40S platform
            host_network_allowed: false
            privileged_allowed: false
            default_variables: {}
            resource_validations: []
            override_pod_template:
              - l40s
            allowed_mounts: []
          a100_platform:
            description: A100 platform
            host_network_allowed: false
            privileged_allowed: false
            default_variables: {}
            resource_validations: []
            override_pod_template:
              - a100
            allowed_mounts: []
Step 4: Apply the Helm values
helm upgrade osmo deployments/charts/service -f my-values.yaml
Once the rollout completes, verify the pool:
$ osmo resource list --pool heterogeneous_pool
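You can also confirm that the pool definition actually landed in the release before testing access; helm get values prints the user-supplied values the running release was rendered with:
$ helm get values osmo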
Step 5: Create a Role for the Pool
Add a role under services.configs.roles to grant access to the pool:
services:
  configs:
    roles:
      osmo-heterogeneous_pool:
        description: Submit workflows to heterogeneous_pool
        policies:
          - action: workflow:Create
            resource: pool/heterogeneous_pool
          - action: workflow:Read
            resource: pool/heterogeneous_pool
Users with this role can submit workflows to the new pool.
Note
For more info on role conventions, see Pool Role Naming Convention.
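If some users only need to monitor workflows in the pool (for example, reviewers or dashboard users), a narrower role can reuse just the workflow:Read action shown above. This is a sketch; the role name is illustrative, so check the naming convention before adopting it:
services:
  configs:
    roles:
      osmo-heterogeneous_pool-viewer:
        description: Read-only access to heterogeneous_pool
        policies:
          - action: workflow:Read
            resource: pool/heterogeneous_pool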
Step 6: Assign the role to users
Assign the role osmo-heterogeneous_pool to users so they can access the pool:
- Without an IdP: Use the OSMO user and role APIs (e.g. create users with POST /api/auth/user, then assign the role with POST /api/auth/user/{id}/roles; a curl sketch follows after this list). See Roles and Policies.
- With an IdP: Map an IdP group to this role using IdP Role Mapping and Sync Modes so that users in that group get the role when they log in.
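For the non-IdP path, a request against the role endpoint above might look like the following. This is a hypothetical sketch: the host, token, user id, and especially the JSON body shape are assumptions, so check the Roles and Policies API reference for the exact payload:
# Hypothetical sketch -- the request body shape is an assumption; see the API reference.
$ curl -X POST "https://<osmo-host>/api/auth/user/<user-id>/roles" \
    -H "Authorization: Bearer <token>" \
    -H "Content-Type: application/json" \
    -d '{"roles": ["osmo-heterogeneous_pool"]}'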
Additional Examples#
Training Pool - High-Performance GPU Pool
Configure a pool for training workloads with GB200 platform:
services:
  configs:
    pools:
      robotics-training:
        description: High-performance GPU pool for robotics model training
        backend: gpu-cluster-01
        default_platform: gb200-platform
        common_default_variables:
          USER_CPU: 16
          USER_GPU: 1
          USER_MEMORY: 64Gi
          USER_STORAGE: 500Gi
        common_resource_validations:
          - default_cpu
          - default_memory
          - default_storage
          - gpu_training_validation
        common_pod_template:
          - default_amd64
          - training_optimized
          - high_memory
        platforms:
          gb200-platform:
            description: GB200 GPUs for high performance training
            override_pod_template:
              - training_gb200_template
            default_variables:
              USER_MEMORY: 80Gi
Simulation Pool - Graphics-Optimized Pool
Configure a pool for simulation workloads with L40/L40S platforms:
services:
  configs:
    pools:
      robotics-simulation:
        description: Graphics-optimized pool for robotics simulation
        backend: graphics-cluster-01
        default_platform: l40-platform
        common_default_variables:
          USER_CPU: 8
          USER_GPU: 1
          USER_MEMORY: 32Gi
          USER_STORAGE: 200Gi
        common_resource_validations:
          - default_cpu
          - default_memory
          - default_storage
          - simulation_gpu_validation
        common_pod_template:
          - default_amd64
          - simulation_optimized
          - graphics_drivers
        platforms:
          l40-platform:
            description: L40 GPUs for standard simulation
            override_pod_template:
              - simulation_l40_template
          l40s-platform:
            description: L40S GPUs for high-fidelity simulation
            override_pod_template:
              - simulation_l40s_template
            default_variables:
              USER_MEMORY: 48Gi
Inference Pool - NVIDIA Jetsons Pool
Configure a pool for inference workloads with NVIDIA Jetsons:
services:
  configs:
    pools:
      robotics-inference:
        description: NVIDIA Jetsons pool for model inference
        backend: inference-cluster-01
        default_platform: jetson-thor-platform
        common_default_variables:
          USER_CPU: 4
          USER_GPU: 0
          USER_MEMORY: 16Gi
          USER_STORAGE: 50Gi
        common_resource_validations:
          - default_cpu
          - default_memory
          - default_storage
          - inference_validation
        common_pod_template:
          - default_amd64
          - inference_optimized
          - low_latency
        platforms:
          jetson-thor-platform:
            description: Jetson Thor platform for edge AI inference
            override_pod_template:
              - inference_jetson_thor_template
            default_variables:
              USER_GPU: 1
              USER_MEMORY: 8Gi
Enabling Topology-Aware Scheduling#
Topology-aware scheduling ensures that tasks requiring high-bandwidth or low-latency communication are placed on physically co-located nodes—such as the same NVLink rack, spine switch, or availability zone. This requires KAI Scheduler v0.12 or later and nodes with the appropriate Kubernetes labels applied.
Note
topology_keys can only be configured on pools backed by a KAI Scheduler backend.
Configuring it on a pool with an unsupported scheduler will be rejected.
📄 Edit in your Helm values file
Everything in this section is added to your Helm values file under services.configs.
Apply changes with helm upgrade.
Step 1: Verify Node Labels
Confirm that your cluster nodes have labels for each topology level you want to expose. The
label keys must match what you will configure in topology_keys:
$ kubectl get nodes -o jsonpath='{.items[*].metadata.labels}' | jq -r 'to_entries[] | select(.key | test("topology.kubernetes.io|nvidia.com/gpu-clique")) | "\(.key)=\(.value)"'
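As in the heterogeneous-pool walkthrough, the -L flag gives a quick per-node view of the labels you care about (this assumes the standard topology.kubernetes.io labels shown below; substitute your own keys if your fabric uses different ones):
$ kubectl get nodes -L topology.kubernetes.io/zone -L topology.kubernetes.io/rack -L nvidia.com/gpu-clique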
Step 2: Add Topology Keys to the Pool Config
Add a topology_keys list to your pool configuration, ordered from coarsest to finest
granularity. Each entry maps a user-friendly key name (which users reference in their
workflow specs) to the actual Kubernetes node label:
services:
  configs:
    pools:
      my-pool:
        backend: my-backend
        topology_keys:
          - key: zone
            label: topology.kubernetes.io/zone
          - key: spine
            label: topology.kubernetes.io/spine
          - key: rack
            label: topology.kubernetes.io/rack
          - key: gpu-clique
            label: nvidia.com/gpu-clique
Step 3: Apply the Helm values
helm upgrade osmo deployments/charts/service -f my-values.yaml
OSMO will create a KAI Topology CRD in the cluster for this pool. Users can then reference the configured key names when specifying topology requirements in their workflow specs.
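To confirm the object was created, you can look for the topology resource that the KAI Scheduler registers; the exact API group and kind depend on your KAI Scheduler version, so use a discovery step rather than a hard-coded resource name:
$ kubectl api-resources | grep -i topology
$ kubectl get crd | grep -i topology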
See also
See Topology-Aware Scheduling for how users specify topology requirements in their workflows.
Troubleshooting#
- Pool Access Denied
Verify user’s group membership matches pool naming convention
Check role configuration includes correct pool path
- Resource Validation Failures
Ensure validation rules match node capacity
Verify resource requests don’t exceed platform limits
- Template Conflicts
Review template merge order (later templates override earlier ones)
Check for conflicting fields in merged templates
- Platform Not Available
Verify platform name is correctly specified in pool configuration
Ensure referenced pod templates exist
- Debugging Tips
Start with simple configurations and add complexity gradually
Test access with different user accounts
Examine OSMO service logs for detailed error messages (a log-tailing sketch follows below)
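A minimal log-tailing sketch for that last tip; the namespace and deployment name are placeholders, since they depend on how you installed the chart:
$ kubectl -n <osmo-namespace> get deploy
$ kubectl -n <osmo-namespace> logs deploy/<osmo-service-deployment> --tail=200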
Warning
Deleting or modifying pools used by running workflows may cause scheduling issues. Always verify pools are not in use before making changes.