Resource Pools#
After successfully configuring the default pool, you can create additional pools to organize and control how users access your compute resources.
Why Create Multiple Pools?#
Pools divide your compute backend into logical resource groupings that enable:
- ✓ Simplified User Experience
Apply Pod Templates to pools so users don’t repeat Kubernetes specifications in every workflow. Templates automatically handle node selectors, tolerations, and other scheduling requirements.
- ✓ Resource Guardrails
Use Resource Validation rules to reject workflows that request more resources than available on your nodes, preventing scheduling failures.
- ✓ Hardware Differentiation
For heterogeneous clusters with multiple GPU types, create platforms within pools to route workflows to specific hardware (A100, H100, L40S, etc.).
- ✓ User Access Control
Integrate pools with user groups and roles to manage permissions; see Authentication and Authorization for details. For example, control which user groups can access specific compute resources based on workload type (training, simulation, inference) or project team.
Pool Architecture#
Pools organize compute resources in a hierarchical structure:
Backend (Kubernetes Cluster)
├── Pool: training-pool
│ ├── Platform: a100
│ └── Platform: h100
├── Pool: simulation-pool
│ ├── Platform: l40s
│ └── Platform: l40
└── Pool: inference-pool
└── Platform: jetson-agx-orin
Workflow Submission Flow:
1. Access Control 🔐
Check user permissions
2. Resource Check ⚖️
Validate requests
3. Apply Templates 📋
Build K8s specs
4. Select Platform 🎯
Route to hardware
5. Schedule & Run ▶️
Identify a node in the cluster
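The five stages above can be sketched as a simple pipeline. All names and structures below (fields like `allowed_pools` and `limits`) are illustrative placeholders, not OSMO's actual implementation:

```python
# Illustrative sketch of the five submission stages; field and function names
# are hypothetical, not OSMO's real data model.

def submit_workflow(user, request, pool):
    # 1. Access control: the user must hold a role granting access to the pool.
    if pool["name"] not in user["allowed_pools"]:
        raise PermissionError("pool access denied")
    # 2. Resource check: reject requests exceeding the pool's validation limits.
    for resource, limit in pool["limits"].items():
        if request.get(resource, 0) > limit:
            raise ValueError(f"{resource} request exceeds pool limit")
    # 3. Apply templates: common templates first, then any overrides.
    spec = {}
    for template in pool["common_pod_template"] + request.get("overrides", []):
        spec.update(template)
    # 4. Select platform: fall back to the pool default when none is given.
    platform = request.get("platform", pool["default_platform"])
    # 5. Schedule & run: hand the built spec to the backend scheduler.
    return {"platform": platform, "spec": spec}

pool = {
    "name": "training-pool",
    "default_platform": "a100",
    "limits": {"gpu": 8, "cpu": 64},
    "common_pod_template": [{"restartPolicy": "Never"}],
}
user = {"allowed_pools": ["training-pool"]}
print(submit_workflow(user, {"gpu": 2}, pool)["platform"])  # a100
```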
Note
For detailed pool and platform configuration fields, see /api/configs/pool in the API reference documentation.
Practical Guide#
Heterogeneous Pools#
For clusters with multiple GPU types (L40S, A100, H100, etc.), use platforms to route workflows to specific hardware.
Step 1: Identify Node Labels
Discover the node labels and taints for your hardware (pods use tolerations to match node taints):
$ kubectl get nodes -o jsonpath='{.items[*].metadata.labels}' | jq -r 'to_entries[] | select(.key | startswith("nvidia.com/gpu.product")) | .value'
$ kubectl get nodes -o jsonpath='{.items[*].spec.taints}'
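If you prefer post-processing a full `kubectl get nodes -o json` dump in a script instead of jsonpath, the idea looks like this. The node data is inlined sample output so the snippet runs without a cluster; real label values depend on your GPU operator setup:

```python
# Map node name -> GPU product label from `kubectl get nodes -o json` output.
# The JSON below is a trimmed, hypothetical sample of that output.
import json

nodes_json = """
{"items": [
  {"metadata": {"name": "gpu-node-1",
                "labels": {"nvidia.com/gpu.product": "NVIDIA-L40S"}}},
  {"metadata": {"name": "gpu-node-2",
                "labels": {"nvidia.com/gpu.product": "NVIDIA-A100"}}}
]}
"""

def gpu_products(doc):
    # Nodes without the label map to None, which is useful for spotting
    # unlabeled hardware.
    return {n["metadata"]["name"]: n["metadata"]["labels"].get("nvidia.com/gpu.product")
            for n in json.loads(doc)["items"]}

print(gpu_products(nodes_json))
# {'gpu-node-1': 'NVIDIA-L40S', 'gpu-node-2': 'NVIDIA-A100'}
```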
Step 2: Create Pod Templates for Each GPU Type
Create pod templates that target specific hardware using node selectors and tolerations:
# L40S pod template
$ cat << EOF > l40s_pod_template.json
{
  "l40s": {
    "spec": {
      "nodeSelector": {
        "nvidia.com/gpu.product": "NVIDIA-L40S"
      }
    }
  }
}
EOF
# A100 pod template with tolerations
$ cat << EOF > a100_pod_template.json
{
  "a100": {
    "spec": {
      "nodeSelector": {
        "nvidia.com/gpu.product": "NVIDIA-A100"
      },
      "tolerations": [
        {
          "key": "nvidia.com/gpu.product",
          "operator": "Equal",
          "value": "NVIDIA-A100",
          "effect": "NoSchedule"
        }
      ]
    }
  }
}
EOF
$ osmo config update POD_TEMPLATE l40s --file l40s_pod_template.json
$ osmo config update POD_TEMPLATE a100 --file a100_pod_template.json
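Before uploading, a quick local sanity check can catch malformed template files. This is not an OSMO feature; it only assumes templates follow the shape shown above (a top-level template name mapping to an object with a spec block):

```python
# Hypothetical pre-upload check: confirm a pod template document is valid JSON
# and that every template entry carries a "spec" block.
import json

def check_pod_template(text):
    data = json.loads(text)  # raises ValueError on malformed JSON
    for name, body in data.items():
        if "spec" not in body:
            raise ValueError(f"template '{name}' is missing a 'spec' block")
    return sorted(data)

template = '{"l40s": {"spec": {"nodeSelector": {"nvidia.com/gpu.product": "NVIDIA-L40S"}}}}'
print(check_pod_template(template))  # ['l40s']
```

In practice you would read the file contents (e.g. `open("l40s_pod_template.json").read()`) instead of an inline string.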
Step 3: Create Pool with Platforms
Configure the pool that references both pod templates via platforms:
$ cat << EOF > platform_config.json
{
  "pools": {
    "heterogeneous_pool": {
      "name": "heterogeneous_pool",
      "backend": "default",
      "default_platform": "l40s_platform",
      "description": "Simulation and training pool",
      "common_default_variables": {
        "USER_CPU": 1,
        "USER_GPU": 0,
        "USER_MEMORY": "1Gi",
        "USER_STORAGE": "1Gi"
      },
      "common_resource_validations": [
        "default_cpu",
        "default_memory",
        "default_storage"
      ],
      "common_pod_template": [
        "default_user",
        "default_ctrl"
      ],
      "platforms": {
        "l40s_platform": {
          "description": "L40S platform",
          "host_network_allowed": false,
          "privileged_allowed": false,
          "default_variables": {},
          "resource_validations": [],
          "override_pod_template": ["l40s"],
          "allowed_mounts": []
        },
        "a100_platform": {
          "description": "A100 platform",
          "host_network_allowed": false,
          "privileged_allowed": false,
          "default_variables": {},
          "resource_validations": [],
          "override_pod_template": ["a100"],
          "allowed_mounts": []
        }
      }
    }
  }
}
EOF
Apply the pool configuration:
$ osmo config update POOL --file platform_config.json
Validate the pool configuration:
$ osmo resource list --pool heterogeneous_pool
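A common misconfiguration is a default_platform that none of the pool's platforms entries defines. A hypothetical local lint, based only on the config shape shown above, can catch this before upload:

```python
# Hypothetical lint for pool configs: every default_platform must be defined
# under that pool's "platforms" map. Not an OSMO tool; a local sanity check.
import json

def lint_pools(text):
    errors = []
    for name, pool in json.loads(text)["pools"].items():
        default = pool.get("default_platform")
        if default and default not in pool.get("platforms", {}):
            errors.append(f"{name}: default_platform '{default}' is not defined")
    return errors

config = """
{
  "pools": {
    "heterogeneous_pool": {
      "default_platform": "l40s_platform",
      "platforms": {"l40s_platform": {}, "a100_platform": {}}
    }
  }
}
"""
print(lint_pools(config))  # []
```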
Step 4: Create a Role for the Pool
Create a role to allow submitting to the pool using the osmo config set CLI:
$ osmo config set ROLE osmo-heterogeneous_pool pool
Users who have this role can now submit workflows to the newly created pool.
Note
For more info, see Auto-Generating Pool Roles.
Step 5: Assign the Role to Users
Assign the role osmo-heterogeneous_pool to users so they can access the pool:
- Without an IdP: Use the OSMO user and role APIs (e.g., create users with POST /api/auth/user, then assign the role with POST /api/auth/user/{id}/roles). See Roles and Policies and the user management design (e.g., external/projects/PROJ-148-auth-rework/PROJ-148-user-management.md).
- With an IdP: You can assign the role via the same APIs, and/or map an IdP group to this role using IdP Role Mapping and Sync Modes so that users in that group receive the role when they log in.
Additional Examples#
Training Pool - High-Performance GPU Pool
Configure a pool for training workloads with a GB200 platform:
{
  "robotics-training": {
    "description": "High-performance GPU pool for robotics model training",
    "backend": "gpu-cluster-01",
    "default_platform": "gb200-platform",
    "common_default_variables": {
      "USER_CPU": 16,
      "USER_GPU": 1,
      "USER_MEMORY": "64Gi",
      "USER_STORAGE": "500Gi"
    },
    "common_resource_validations": [
      "default_cpu",
      "default_memory",
      "default_storage",
      "gpu_training_validation"
    ],
    "common_pod_template": [
      "default_amd64",
      "training_optimized",
      "high_memory"
    ],
    "platforms": {
      "gb200-platform": {
        "description": "GB200 GPUs for high-performance training",
        "override_pod_template": [
          "training_gb200_template"
        ],
        "default_variables": {
          "USER_MEMORY": "80Gi"
        }
      }
    }
  }
}
Simulation Pool - Graphics-Optimized Pool
Configure a pool for simulation workloads with L40/L40S platforms:
{
  "robotics-simulation": {
    "description": "Graphics-optimized pool for robotics simulation",
    "backend": "graphics-cluster-01",
    "default_platform": "l40-platform",
    "common_default_variables": {
      "USER_CPU": 8,
      "USER_GPU": 1,
      "USER_MEMORY": "32Gi",
      "USER_STORAGE": "200Gi"
    },
    "common_resource_validations": [
      "default_cpu",
      "default_memory",
      "default_storage",
      "simulation_gpu_validation"
    ],
    "common_pod_template": [
      "default_amd64",
      "simulation_optimized",
      "graphics_drivers"
    ],
    "platforms": {
      "l40-platform": {
        "description": "L40 GPUs for standard simulation",
        "override_pod_template": [
          "simulation_l40_template"
        ]
      },
      "l40s-platform": {
        "description": "L40S GPUs for high-fidelity simulation",
        "override_pod_template": [
          "simulation_l40s_template"
        ],
        "default_variables": {
          "USER_MEMORY": "48Gi"
        }
      }
    }
  }
}
Inference Pool - NVIDIA Jetson Pool
Configure a pool for inference workloads on NVIDIA Jetson devices:
{
  "robotics-inference": {
    "description": "NVIDIA Jetson pool for model inference",
    "backend": "inference-cluster-01",
    "default_platform": "jetson-thor-platform",
    "common_default_variables": {
      "USER_CPU": 4,
      "USER_GPU": 0,
      "USER_MEMORY": "16Gi",
      "USER_STORAGE": "50Gi"
    },
    "common_resource_validations": [
      "default_cpu",
      "default_memory",
      "default_storage",
      "inference_validation"
    ],
    "common_pod_template": [
      "default_amd64",
      "inference_optimized",
      "low_latency"
    ],
    "platforms": {
      "jetson-thor-platform": {
        "description": "Jetson Thor platform for edge AI inference",
        "override_pod_template": [
          "inference_jetson_thor_template"
        ],
        "default_variables": {
          "USER_GPU": 1,
          "USER_MEMORY": "8Gi"
        }
      }
    }
  }
}
Enabling Topology-Aware Scheduling#
Topology-aware scheduling ensures that tasks requiring high-bandwidth or low-latency communication are placed on physically co-located nodes—such as the same NVLink rack, spine switch, or availability zone. This requires KAI Scheduler v0.12 or later and nodes with the appropriate Kubernetes labels applied.
Note
topology_keys can only be configured on pools backed by a KAI Scheduler backend.
Configuring it on a pool with an unsupported scheduler will be rejected.
Step 1: Verify Node Labels
Confirm that your cluster nodes have labels for each topology level you want to expose. The
label keys must match what you will configure in topology_keys:
$ kubectl get nodes -o jsonpath='{.items[*].metadata.labels}' | jq -r 'to_entries[] | select(.key | test("topology.kubernetes.io|nvidia.com/gpu-clique")) | "\(.key)=\(.value)"'
Step 2: Add Topology Keys to the Pool Config
Add a topology_keys list to your pool configuration, ordered from coarsest to finest
granularity. Each entry maps a user-friendly key name (which users reference in their
workflow specs) to the actual Kubernetes node label:
$ cat << EOF > topology_pool.json
{
  "pools": {
    "my-pool": {
      "name": "my-pool",
      "backend": "my-backend",
      "topology_keys": [
        {"key": "zone", "label": "topology.kubernetes.io/zone"},
        {"key": "spine", "label": "topology.kubernetes.io/spine"},
        {"key": "rack", "label": "topology.kubernetes.io/rack"},
        {"key": "gpu-clique", "label": "nvidia.com/gpu-clique"}
      ]
    }
  }
}
EOF
Step 3: Apply the Pool Configuration
$ osmo config update POOL --file topology_pool.json
OSMO will create a KAI Topology CRD in the cluster for this pool. Users can then reference the configured key names when specifying topology requirements in their workflow specs.
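To build intuition for what the scheduler does with these keys, the sketch below groups nodes into placement domains by their topology labels, ordered coarsest to finest. It is illustrative only, not KAI Scheduler's implementation; node names and label values are hypothetical:

```python
# Group nodes by a coarse-to-fine topology key list, the way a topology-aware
# scheduler might identify co-located placement domains.
from collections import defaultdict

topology_labels = ["topology.kubernetes.io/zone", "topology.kubernetes.io/rack"]

nodes = [
    {"name": "node-a", "labels": {"topology.kubernetes.io/zone": "z1", "topology.kubernetes.io/rack": "r1"}},
    {"name": "node-b", "labels": {"topology.kubernetes.io/zone": "z1", "topology.kubernetes.io/rack": "r1"}},
    {"name": "node-c", "labels": {"topology.kubernetes.io/zone": "z1", "topology.kubernetes.io/rack": "r2"}},
]

def group_by_topology(nodes, labels):
    groups = defaultdict(list)
    for node in nodes:
        # A node's placement domain is the tuple of its label values,
        # ordered from coarsest to finest.
        domain = tuple(node["labels"].get(k, "") for k in labels)
        groups[domain].append(node["name"])
    return dict(groups)

groups = group_by_topology(nodes, topology_labels)
print(groups[("z1", "r1")])  # ['node-a', 'node-b']
```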
See also
See Topology-Aware Scheduling for how users specify topology requirements in their workflows.
Troubleshooting#
- Pool Access Denied
Verify user’s group membership matches pool naming convention
Check role configuration includes correct pool path
- Resource Validation Failures
Ensure validation rules match node capacity
Verify resource requests don’t exceed platform limits
- Template Conflicts
Review template merge order (later templates override earlier ones)
Check for conflicting fields in merged templates
- Platform Not Available
Verify platform name is correctly specified in pool configuration
Ensure referenced pod templates exist
- Debugging Tips
Start with simple configurations and add complexity gradually
Test access with different user accounts
Examine OSMO service logs for detailed error messages
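The template merge order mentioned under Template Conflicts can be illustrated with a naive deep merge, where later templates override earlier ones on conflicting fields. OSMO's actual merge strategy may differ in details; this is a sketch for reasoning about conflicts:

```python
# Naive deep merge: values from `override` win on conflicts, and nested dicts
# are merged recursively rather than replaced wholesale.
def deep_merge(base, override):
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

common = {"spec": {"nodeSelector": {"kubernetes.io/arch": "amd64"},
                   "restartPolicy": "Never"}}
override = {"spec": {"nodeSelector": {"nvidia.com/gpu.product": "NVIDIA-L40S"}}}

merged = deep_merge(common, override)
print(merged["spec"]["nodeSelector"])
# {'kubernetes.io/arch': 'amd64', 'nvidia.com/gpu.product': 'NVIDIA-L40S'}
```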
Warning
Deleting or modifying pools used by running workflows may cause scheduling issues. Always verify pools are not in use before making changes.