Resource Pools#

After successfully configuring the default pool, you can create additional pools to organize and control how users access your compute resources.

Why Create Multiple Pools?#

Pools divide your compute backend into logical resource groupings that enable:

Simplified User Experience

Apply Pod Templates to pools so users don’t repeat Kubernetes specifications in every workflow. Templates automatically handle node selectors, tolerations, and other scheduling requirements.

Resource Guardrails

Use Resource Validation rules to reject workflows that request more resources than available on your nodes, preventing scheduling failures.

Hardware Differentiation

For heterogeneous clusters with multiple GPU types, create platforms within pools to route workflows to specific hardware (A100, H100, L40S, etc.).

User Access Control

Integrate pools with user groups and roles to manage permissions. See Authentication and Authorization for details. For example, control which user groups can access specific compute resources based on workload type (training, simulation, inference) or project team.

Pool Architecture#

Pools organize compute resources in a hierarchical structure:

Backend (Kubernetes Cluster)
├── Pool: training-pool
│   ├── Platform: a100
│   └── Platform: h100
├── Pool: simulation-pool
│   ├── Platform: l40s
│   └── Platform: l40
└── Pool: inference-pool
    └── Platform: jetson-agx-orin

Workflow Submission Flow:

1. Access Control 🔐

Check user permissions

2. Resource Check ⚖️

Validate requests

3. Apply Templates 📋

Build K8s specs

4. Select Platform 🎯

Route to hardware

5. Schedule & Run ▶️

Identify a node in the cluster

Note

For detailed pool and platform configuration fields, see /api/configs/pool in the API reference documentation.

Practical Guide#

Heterogeneous Pools#

For clusters with multiple GPU types (L40S, A100, H100, etc.), use platforms to route workflows to specific hardware.

Step 1: Identify Node Labels

Discover the node labels and taints for your hardware (taints are set on nodes; the matching tolerations go in your pod templates):

$ kubectl get nodes -o jsonpath='{.items[*].metadata.labels}' | jq -r 'to_entries[] | select(.key | startswith("nvidia.com/gpu.product")) | .value'
$ kubectl get nodes -o jsonpath='{.items[*].spec.taints}'
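The same information can be collected in a single per-node view; a sketch assuming GPU Feature Discovery is labeling your nodes (the GPU column will be empty on nodes without that label):

```shell
# One row per node: name, GPU product label, and taint keys.
# Dots inside label keys must be escaped in custom-columns expressions.
$ kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.metadata.labels.nvidia\.com/gpu\.product,TAINTS:.spec.taints[*].key'
```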

Step 2: Create Pod Templates for Each GPU Type

Create pod templates that target specific hardware using node selectors and tolerations:

# L40S pod template
$ cat << EOF > l40s_pod_template.json
{
  "l40s": {
    "spec": {
      "nodeSelector": {
        "nvidia.com/gpu.product": "NVIDIA-L40S"
      }
    }
  }
}
EOF
# A100 pod template with tolerations
$ cat << EOF > a100_pod_template.json
{
  "a100": {
    "spec": {
      "nodeSelector": {
        "nvidia.com/gpu.product": "NVIDIA-A100"
      },
      "tolerations": [
        {
          "key": "nvidia.com/gpu.product",
          "operator": "Equal",
          "value": "NVIDIA-A100",
          "effect": "NoSchedule"
        }
      ]
    }
  }
}
EOF
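Before applying the templates, it can help to validate the JSON files locally; a quick sketch using jq (any JSON validator works):

```shell
# jq exits non-zero on malformed JSON, catching syntax errors before any osmo call
$ jq empty l40s_pod_template.json && jq empty a100_pod_template.json && echo "templates OK"
```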
$ osmo config update POD_TEMPLATE l40s --file l40s_pod_template.json

$ osmo config update POD_TEMPLATE a100 --file a100_pod_template.json

Step 3: Create Pool with Platforms

Configure the pool that references both pod templates via platforms:

$ cat << EOF > platform_config.json
{
  "pools": {
    "heterogeneous_pool": {
      "name": "heterogeneous_pool",
      "backend": "default",
      "default_platform": "l40s_platform",
      "description": "Simulation and training pool",
      "common_default_variables": {
          "USER_CPU": 1,
          "USER_GPU": 0,
          "USER_MEMORY": "1Gi",
          "USER_STORAGE": "1Gi"
      },
      "common_resource_validations": [
          "default_cpu",
          "default_memory",
          "default_storage"
      ],
      "common_pod_template": [
          "default_user",
          "default_ctrl"
      ],
      "platforms": {
          "l40s_platform": {
              "description": "L40S platform",
              "host_network_allowed": false,
              "privileged_allowed": false,
              "default_variables": {},
              "resource_validations": [],
              "override_pod_template": ["l40s"],
              "allowed_mounts": []
          },
          "a100_platform": {
              "description": "A100 platform",
              "host_network_allowed": false,
              "privileged_allowed": false,
              "default_variables": {},
              "resource_validations": [],
              "override_pod_template": ["a100"],
              "allowed_mounts": []
          }
      }
    }
  }
}
EOF

Apply the pool configuration:

$ osmo config update POOL --file platform_config.json

Validate the pool configuration:

$ osmo resource list --pool heterogeneous_pool

Step 4: Create a Role for the Pool

Create a role that allows users to submit to the pool using the osmo config set command:

$ osmo config set ROLE osmo-heterogeneous_pool pool

Users with this role can now submit workflows to the newly created pool.

Note

For more info, see Auto-Generating Pool Roles.

Step 5: Assign the Role to Users

Assign the role osmo-heterogeneous_pool to users so they can access the pool:

  • Without an IdP: Use the OSMO user and role APIs (e.g. create users with POST /api/auth/user, then assign the role with POST /api/auth/user/{id}/roles). See Roles and Policies for details.

  • With an IdP: You can assign the role via the same APIs, and/or map an IdP group to this role using IdP Role Mapping and Sync Modes so that users in that group get the role when they log in.
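The API calls mentioned above can be sketched with curl. The endpoint paths come from the text; the request bodies, the $OSMO_URL and $ADMIN_TOKEN variables, and the user id "alice" are assumptions for illustration, not the documented schema:

```shell
# Create a user (body shape is an assumption; consult the API reference for the real schema)
$ curl -X POST "$OSMO_URL/api/auth/user" \
    -H "Authorization: Bearer $ADMIN_TOKEN" -H "Content-Type: application/json" \
    -d '{"username": "alice"}'
# Assign the pool role to that user (again, body shape is an assumption)
$ curl -X POST "$OSMO_URL/api/auth/user/alice/roles" \
    -H "Authorization: Bearer $ADMIN_TOKEN" -H "Content-Type: application/json" \
    -d '{"roles": ["osmo-heterogeneous_pool"]}'
```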

Additional Examples#

Training Pool - High-Performance GPU Pool

Configure a pool for training workloads with a GB200 platform:

{
  "robotics-training": {
    "description": "High-performance GPU pool for robotics model training",
    "backend": "gpu-cluster-01",
    "default_platform": "gb200-platform",
    "common_default_variables": {
      "USER_CPU": 16,
      "USER_GPU": 1,
      "USER_MEMORY": "64Gi",
      "USER_STORAGE": "500Gi"
    },
    "common_resource_validations": [
      "default_cpu",
      "default_memory",
      "default_storage",
      "gpu_training_validation"
    ],
    "common_pod_template": [
      "default_amd64",
      "training_optimized",
      "high_memory"
    ],
    "platforms": {
      "gb200-platform": {
        "description": "GB200 GPUs for high performance training",
        "override_pod_template": [
          "training_gb200_template"
        ],
        "default_variables": {
          "USER_MEMORY": "80Gi"
        }
      }
    }
  }
}

Simulation Pool - Graphics-Optimized Pool

Configure a pool for simulation workloads with L40/L40S platforms:

{
  "robotics-simulation": {
    "description": "Graphics-optimized pool for robotics simulation",
    "backend": "graphics-cluster-01",
    "default_platform": "l40-platform",
    "common_default_variables": {
      "USER_CPU": 8,
      "USER_GPU": 1,
      "USER_MEMORY": "32Gi",
      "USER_STORAGE": "200Gi"
    },
    "common_resource_validations": [
      "default_cpu",
      "default_memory",
      "default_storage",
      "simulation_gpu_validation"
    ],
    "common_pod_template": [
      "default_amd64",
      "simulation_optimized",
      "graphics_drivers"
    ],
    "platforms": {
      "l40-platform": {
        "description": "L40 GPUs for standard simulation",
        "override_pod_template": [
          "simulation_l40_template"
        ]
      },
      "l40s-platform": {
        "description": "L40S GPUs for high-fidelity simulation",
        "override_pod_template": [
          "simulation_l40s_template"
        ],
        "default_variables": {
          "USER_MEMORY": "48Gi"
        }
      }
    }
  }
}

Inference Pool - NVIDIA Jetsons Pool

Configure a pool for inference workloads with NVIDIA Jetsons:

{
  "robotics-inference": {
    "description": "NVIDIA Jetsons pool for model inference",
    "backend": "inference-cluster-01",
    "default_platform": "jetson-thor-platform",
    "common_default_variables": {
      "USER_CPU": 4,
      "USER_GPU": 0,
      "USER_MEMORY": "16Gi",
      "USER_STORAGE": "50Gi"
    },
    "common_resource_validations": [
      "default_cpu",
      "default_memory",
      "default_storage",
      "inference_validation"
    ],
    "common_pod_template": [
      "default_amd64",
      "inference_optimized",
      "low_latency"
    ],
    "platforms": {
      "jetson-thor-platform": {
        "description": "Jetson Thor platform for edge AI inference",
        "override_pod_template": [
          "inference_jetson_thor_template"
        ],
        "default_variables": {
          "USER_GPU": 1,
          "USER_MEMORY": "8Gi"
        }
      }
    }
  }
}

Enabling Topology-Aware Scheduling#

Topology-aware scheduling ensures that tasks requiring high-bandwidth or low-latency communication are placed on physically co-located nodes, such as nodes in the same NVLink rack, behind the same spine switch, or in the same availability zone. This requires KAI Scheduler v0.10 or later and nodes with the appropriate Kubernetes labels applied.

Note

topology_keys can only be configured on pools backed by a KAI Scheduler backend. Configuring it on a pool with an unsupported scheduler will be rejected.

Step 1: Verify Node Labels

Confirm that your cluster nodes have labels for each topology level you want to expose. The label keys must match what you will configure in topology_keys:

$ kubectl get nodes -o jsonpath='{.items[*].metadata.labels}' | jq -r 'to_entries[] | select(.key | test("topology.kubernetes.io|nvidia.com/gpu-clique")) | "\(.key)=\(.value)"'
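If a topology label is missing, you can apply it manually; a sketch assuming a node named worker-01 and a rack value of rack-a1 (in production these labels are typically set by your provisioning tooling):

```shell
# Label a node with its rack for topology-aware placement (node and rack names are illustrative)
$ kubectl label node worker-01 topology.kubernetes.io/rack=rack-a1
# Confirm the label is present
$ kubectl get node worker-01 --show-labels | grep rack
```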

Step 2: Add Topology Keys to the Pool Config

Add a topology_keys list to your pool configuration, ordered from coarsest to finest granularity. Each entry maps a user-friendly key name (which users reference in their workflow specs) to the actual Kubernetes node label:

$ cat << EOF > topology_pool.json
{
  "pools": {
    "my-pool": {
      "name": "my-pool",
      "backend": "my-backend",
      "topology_keys": [
        {"key": "zone",       "label": "topology.kubernetes.io/zone"},
        {"key": "spine",      "label": "topology.kubernetes.io/spine"},
        {"key": "rack",       "label": "topology.kubernetes.io/rack"},
        {"key": "gpu-clique", "label": "nvidia.com/gpu-clique"}
      ]
    }
  }
}
EOF

Step 3: Apply the Pool Configuration

$ osmo config update POOL --file topology_pool.json

OSMO will create a KAI Topology CRD in the cluster for this pool. Users can then reference the configured key names when specifying topology requirements in their workflow specs.
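To confirm that the Topology object was created, you can look for it in the cluster; a sketch (the exact CRD group and kind depend on your KAI Scheduler version):

```shell
# List topology-related CRDs; resource names vary by KAI release
$ kubectl get crd | grep -i topology
```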

See also

See Topology-Aware Scheduling for how users specify topology requirements in their workflows.

Troubleshooting#

Pool Access Denied
  • Verify user’s group membership matches pool naming convention

  • Check role configuration includes correct pool path

Resource Validation Failures
  • Ensure validation rules match node capacity

  • Verify resource requests don’t exceed platform limits

Template Conflicts
  • Review template merge order (later templates override earlier ones)

  • Check for conflicting fields in merged templates

Platform Not Available
  • Verify platform name is correctly specified in pool configuration

  • Ensure referenced pod templates exist

Debugging Tips
  • Start with simple configurations and add complexity gradually

  • Test access with different user accounts

  • Examine OSMO service logs for detailed error messages

Warning

Deleting or modifying pools used by running workflows may cause scheduling issues. Always verify pools are not in use before making changes.