Resource Pools#

After successfully configuring the default pool, you can create additional pools to organize and control how users access your compute resources.

Why Create Multiple Pools?#

Pools divide your compute backend into logical resource groupings that enable:

Simplified User Experience

Apply Pod Templates to pools so users don’t repeat Kubernetes specifications in every workflow. Templates automatically handle node selectors, tolerations, and other scheduling requirements.

Resource Guardrails

Use Resource Validation rules to reject workflows that request more resources than available on your nodes, preventing scheduling failures.

Hardware Differentiation

For heterogeneous clusters with multiple GPU types, create platforms within pools to route workflows to specific hardware (A100, H100, L40S, etc.).

User Access Control

Integrate pools with user groups and roles to manage permissions. See Authentication and Authorization for details. For example, control which user groups can access specific compute resources based on workload type (training, simulation, inference) or project team.

Pool Architecture#

Pools organize compute resources in a hierarchical structure:

Backend (Kubernetes Cluster)
├── Pool: training-pool
│   ├── Platform: a100
│   └── Platform: h100
├── Pool: simulation-pool
│   ├── Platform: l40s
│   └── Platform: l40
└── Pool: inference-pool
    └── Platform: jetson-agx-orin

Workflow Submission Flow:

1. Access Control 🔐

Check user permissions

2. Resource Check ⚖️

Validate requests

3. Apply Templates 📋

Build K8s specs

4. Select Platform 🎯

Route to hardware

5. Schedule & Run ▶️

Identify a node in the cluster

Note

For detailed pool and platform configuration fields, see /api/configs/pool in the API reference documentation.

Practical Guide#

Heterogeneous Pools#

For clusters with multiple GPU types (L40S, A100, H100, etc.), use platforms to route workflows to specific hardware.

Step 1: Identify Node Labels

Discover the node labels and taints for your hardware (taints are set on nodes; the matching tolerations go in your pod templates):

$ kubectl get nodes -o jsonpath='{.items[*].metadata.labels}' | jq -r 'to_entries[] | select(.key | startswith("nvidia.com/gpu.product")) | .value'
$ kubectl get nodes -o jsonpath='{.items[*].spec.taints}'
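The same information can be collected in a single per-node view; a sketch assuming GPU Feature Discovery is labeling your nodes (the GPU column will be empty on nodes without that label):

```shell
# One row per node: name, GPU product label, and taint keys.
# Dots inside label keys must be escaped in custom-columns expressions.
$ kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.metadata.labels.nvidia\.com/gpu\.product,TAINTS:.spec.taints[*].key'
```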

Step 2: Create Pod Templates for Each GPU Type

Create pod templates that target specific hardware using node selectors and tolerations:

# L40S pod template
$ cat << EOF > l40s_pod_template.json
{
  "l40s": {
    "spec": {
      "nodeSelector": {
        "nvidia.com/gpu.product": "NVIDIA-L40S"
      }
    }
  }
}
EOF
# A100 pod template with tolerations
$ cat << EOF > a100_pod_template.json
{
  "a100": {
    "spec": {
      "nodeSelector": {
        "nvidia.com/gpu.product": "NVIDIA-A100"
      },
      "tolerations": [
        {
          "key": "nvidia.com/gpu.product",
          "operator": "Equal",
          "value": "NVIDIA-A100",
          "effect": "NoSchedule"
        }
      ]
    }
  }
}
EOF
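Before applying the templates, it can help to validate the JSON files locally; a quick sketch using jq (any JSON validator works):

```shell
# jq exits non-zero on malformed JSON, catching syntax errors before any osmo call
$ jq empty l40s_pod_template.json && jq empty a100_pod_template.json && echo "templates OK"
```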
$ osmo config update POD_TEMPLATE l40s --file l40s_pod_template.json

$ osmo config update POD_TEMPLATE a100 --file a100_pod_template.json

Step 3: Create Pool with Platforms

Configure the pool that references both pod templates via platforms:

$ cat << EOF > platform_config.json
{
  "pools": {
    "heterogeneous_pool": {
      "name": "heterogeneous_pool",
      "backend": "default",
      "default_platform": "l40s_platform",
      "description": "Simulation and training pool",
      "common_default_variables": {
          "USER_CPU": 1,
          "USER_GPU": 0,
          "USER_MEMORY": "1Gi",
          "USER_STORAGE": "1Gi"
      },
      "common_resource_validations": [
          "default_cpu",
          "default_memory",
          "default_storage"
      ],
      "common_pod_template": [
          "default_user",
          "default_ctrl"
      ],
      "platforms": {
          "l40s_platform": {
              "description": "L40S platform",
              "host_network_allowed": false,
              "privileged_allowed": false,
              "default_variables": {},
              "resource_validations": [],
              "override_pod_template": ["l40s"],
              "allowed_mounts": []
          },
          "a100_platform": {
              "description": "A100 platform",
              "host_network_allowed": false,
              "privileged_allowed": false,
              "default_variables": {},
              "resource_validations": [],
              "override_pod_template": ["a100"],
              "allowed_mounts": []
          }
      }
    }
  }
}
EOF

Apply the pool configuration:

$ osmo config update POOL --file platform_config.json

Validate the pool configuration:

$ osmo resource list --pool heterogeneous_pool

Step 4: Create a Role for the Pool

Create a role that allows users to submit to the pool using the osmo config set command:

$ osmo config set ROLE osmo-heterogeneous_pool pool

Users with this role can now submit workflows to the newly created pool.

Note

For more info, see Auto-Generating Pool Roles.

Step 5: Assign the Role to Users

Assign the role osmo-heterogeneous_pool to users so they can access the pool:

  • Without an IdP: Use the OSMO user and role APIs (e.g. create users with POST /api/auth/user, then assign the role with POST /api/auth/user/{id}/roles). See Roles and Policies for details.

  • With an IdP: You can assign the role via the same APIs, and/or map an IdP group to this role using IdP Role Mapping and Sync Modes so that users in that group get the role when they log in.
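The API calls mentioned above can be sketched with curl. The endpoint paths come from the text; the request bodies, the $OSMO_URL and $ADMIN_TOKEN variables, and the user id "alice" are assumptions for illustration, not the documented schema:

```shell
# Create a user (body shape is an assumption; consult the API reference for the real schema)
$ curl -X POST "$OSMO_URL/api/auth/user" \
    -H "Authorization: Bearer $ADMIN_TOKEN" -H "Content-Type: application/json" \
    -d '{"username": "alice"}'
# Assign the pool role to that user (again, body shape is an assumption)
$ curl -X POST "$OSMO_URL/api/auth/user/alice/roles" \
    -H "Authorization: Bearer $ADMIN_TOKEN" -H "Content-Type: application/json" \
    -d '{"roles": ["osmo-heterogeneous_pool"]}'
```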

Additional Examples#

Training Pool - High-Performance GPU Pool

Configure a pool for training workloads with a GB200 platform:

{
  "robotics-training": {
    "description": "High-performance GPU pool for robotics model training",
    "backend": "gpu-cluster-01",
    "default_platform": "gb200-platform",
    "common_default_variables": {
      "USER_CPU": 16,
      "USER_GPU": 1,
      "USER_MEMORY": "64Gi",
      "USER_STORAGE": "500Gi"
    },
    "common_resource_validations": [
      "default_cpu",
      "default_memory",
      "default_storage",
      "gpu_training_validation"
    ],
    "common_pod_template": [
      "default_amd64",
      "training_optimized",
      "high_memory"
    ],
    "platforms": {
      "gb200-platform": {
        "description": "GB200 GPUs for high performance training",
        "override_pod_template": [
          "training_gb200_template"
        ],
        "default_variables": {
          "USER_MEMORY": "80Gi"
        }
      }
    }
  }
}

Simulation Pool - Graphics-Optimized Pool

Configure a pool for simulation workloads with L40/L40S platforms:

{
  "robotics-simulation": {
    "description": "Graphics-optimized pool for robotics simulation",
    "backend": "graphics-cluster-01",
    "default_platform": "l40-platform",
    "common_default_variables": {
      "USER_CPU": 8,
      "USER_GPU": 1,
      "USER_MEMORY": "32Gi",
      "USER_STORAGE": "200Gi"
    },
    "common_resource_validations": [
      "default_cpu",
      "default_memory",
      "default_storage",
      "simulation_gpu_validation"
    ],
    "common_pod_template": [
      "default_amd64",
      "simulation_optimized",
      "graphics_drivers"
    ],
    "platforms": {
      "l40-platform": {
        "description": "L40 GPUs for standard simulation",
        "override_pod_template": [
          "simulation_l40_template"
        ]
      },
      "l40s-platform": {
        "description": "L40S GPUs for high-fidelity simulation",
        "override_pod_template": [
          "simulation_l40s_template"
        ],
        "default_variables": {
          "USER_MEMORY": "48Gi"
        }
      }
    }
  }
}

Inference Pool - NVIDIA Jetsons Pool

Configure a pool for inference workloads with NVIDIA Jetsons:

{
  "robotics-inference": {
    "description": "NVIDIA Jetsons pool for model inference",
    "backend": "inference-cluster-01",
    "default_platform": "jetson-thor-platform",
    "common_default_variables": {
      "USER_CPU": 4,
      "USER_GPU": 0,
      "USER_MEMORY": "16Gi",
      "USER_STORAGE": "50Gi"
    },
    "common_resource_validations": [
      "default_cpu",
      "default_memory",
      "default_storage",
      "inference_validation"
    ],
    "common_pod_template": [
      "default_amd64",
      "inference_optimized",
      "low_latency"
    ],
    "platforms": {
      "jetson-thor-platform": {
        "description": "Jetson Thor platform for edge AI inference",
        "override_pod_template": [
          "inference_jetson_thor_template"
        ],
        "default_variables": {
          "USER_GPU": 1,
          "USER_MEMORY": "8Gi"
        }
      }
    }
  }
}

Enabling Topology-Aware Scheduling#

Topology-aware scheduling ensures that tasks requiring high-bandwidth or low-latency communication are placed on physically co-located nodes, such as nodes in the same NVLink rack, behind the same spine switch, or in the same availability zone. This requires KAI Scheduler v0.10 or later and nodes with the appropriate Kubernetes labels applied.

Note

topology_keys can only be configured on pools backed by a KAI Scheduler backend. Configuring it on a pool with an unsupported scheduler will be rejected.

Step 1: Verify Node Labels

Confirm that your cluster nodes have labels for each topology level you want to expose. The label keys must match what you will configure in topology_keys:

$ kubectl get nodes -o jsonpath='{.items[*].metadata.labels}' | jq -r 'to_entries[] | select(.key | test("topology.kubernetes.io|nvidia.com/gpu-clique")) | "\(.key)=\(.value)"'
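If a topology label is missing, you can apply it manually; a sketch assuming a node named worker-01 and a rack value of rack-a1 (in production these labels are typically set by your provisioning tooling):

```shell
# Label a node with its rack for topology-aware placement (node and rack names are illustrative)
$ kubectl label node worker-01 topology.kubernetes.io/rack=rack-a1
# Confirm the label is present
$ kubectl get node worker-01 --show-labels | grep rack
```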

Step 2: Add Topology Keys to the Pool Config

Add a topology_keys list to your pool configuration, ordered from coarsest to finest granularity. Each entry maps a user-friendly key name (which users reference in their workflow specs) to the actual Kubernetes node label:

$ cat << EOF > topology_pool.json
{
  "pools": {
    "my-pool": {
      "name": "my-pool",
      "backend": "my-backend",
      "topology_keys": [
        {"key": "zone",       "label": "topology.kubernetes.io/zone"},
        {"key": "spine",      "label": "topology.kubernetes.io/spine"},
        {"key": "rack",       "label": "topology.kubernetes.io/rack"},
        {"key": "gpu-clique", "label": "nvidia.com/gpu-clique"}
      ]
    }
  }
}
EOF

Step 3: Apply the Pool Configuration

$ osmo config update POOL --file topology_pool.json

OSMO will create a KAI Topology CRD in the cluster for this pool. Users can then reference the configured key names when specifying topology requirements in their workflow specs.
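To confirm that the Topology object was created, you can look for it in the cluster; a sketch (the exact CRD group and kind depend on your KAI Scheduler version):

```shell
# List topology-related CRDs; resource names vary by KAI release
$ kubectl get crd | grep -i topology
```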

See also

See Topology-Aware Scheduling for how users specify topology requirements in their workflows.

Troubleshooting#

Pool Access Denied
  • Verify user’s group membership matches pool naming convention

  • Check role configuration includes correct pool path

Resource Validation Failures
  • Ensure validation rules match node capacity

  • Verify resource requests don’t exceed platform limits

Template Conflicts
  • Review template merge order (later templates override earlier ones)

  • Check for conflicting fields in merged templates

Platform Not Available
  • Verify platform name is correctly specified in pool configuration

  • Ensure referenced pod templates exist

Debugging Tips
  • Start with simple configurations and add complexity gradually

  • Test access with different user accounts

  • Examine OSMO service logs for detailed error messages

Warning

Deleting or modifying pools used by running workflows may cause scheduling issues. Always verify pools are not in use before making changes.