Scheduler Configuration#
After configuring pools, you can enable advanced scheduling features using the KAI scheduler. This configuration controls how workflows compete for resources, enabling co-scheduling, preemption, and fair sharing across teams.
Why Use KAI Scheduler?#
The KAI scheduler provides enterprise-grade resource management capabilities:
- ✓ Co-Scheduling: Schedule multiple tasks together for distributed training, hardware-in-the-loop simulations, and parallel synthetic data generation.
- ✓ Priority & Preemption: High-priority workflows can preempt low-priority ones, ensuring critical work proceeds even when clusters are fully utilized.
- ✓ Fair Resource Sharing: Guarantee minimum resources per pool while allowing teams to burst above their baseline when capacity is available.
- ✓ Maximize Utilization: Reclaim idle resources and redistribute them across pools based on configurable weights, minimizing waste.
How It Works#
GPU Allocation Model#
Each pool's allocation is governed by three settings:

- Guarantee 🔒: minimum resources
- Weight ⚖️: fair share ratio
- Maximum 🚧: upper limit
Key Concepts#
Guarantee: Minimum GPUs/resources reserved for a pool (non-preemptible workflows)
Weight: Proportional share when pools exceed their guarantee (e.g., 1:3 ratio)
Maximum: Hard cap on total resources a pool can use (-1 means unlimited)
Preemptible Workflows: Use LOW priority; can be stopped to free resources
Non-Preemptible Workflows: Use HIGH/NORMAL priority; protected from preemption
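The three constraint fields can be sketched in Python. The class, field, and method names below are illustrative only, not the actual OSMO API (see the Resource Constraint API reference for the real fields):

```python
from dataclasses import dataclass

# Hypothetical model of a pool's resource constraint; names are
# illustrative, not the actual OSMO configuration schema.
@dataclass
class PoolConstraint:
    guarantee: int  # minimum GPUs reserved for the pool
    weight: int     # proportional share above the guarantee
    maximum: int    # hard cap on total usage; -1 means unlimited

    def within_cap(self, gpus: int) -> bool:
        """True if a total allocation of `gpus` respects the maximum."""
        return self.maximum == -1 or gpus <= self.maximum

training = PoolConstraint(guarantee=30, weight=1, maximum=70)
simulation = PoolConstraint(guarantee=50, weight=3, maximum=-1)

print(training.within_cap(80))    # False: exceeds the 70-GPU cap
print(simulation.within_cap(80))  # True: -1 means no cap
```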
Note
For detailed configuration fields, see Resource Constraint in the API reference.
Warning
To enable preemption, ALL pools sharing the same nodes must configure guarantee, weight, and maximum. Partial configuration disables preemption.
Practical Guide#
GPU Allocation#
Example Cluster: Assume a cluster with 100 GPUs in total, divided into two pools: Training (A) and Simulation (B).
| Pool | Guarantee | Weight | Maximum |
|---|---|---|---|
| Training (A) | 30 GPUs | 1 | 70 GPUs |
| Simulation (B) | 50 GPUs | 3 | Unlimited (-1) |
Basic Allocation Behavior:
Pool A gets 30 GPUs guaranteed (non-preemptible workflows)
Pool B gets 50 GPUs guaranteed (non-preemptible workflows)
Pool A can burst up to 70 GPUs total (including preemptible)
Pool B can use unlimited GPUs (including preemptible)
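Each pool's burst headroom (GPUs usable above its guarantee, all preemptible) follows from these numbers. A quick sketch, assuming the 100-GPU cluster bounds the unlimited pool; the function name is hypothetical:

```python
def burst_headroom(guarantee: int, maximum: int, cluster_total: int) -> int:
    """GPUs a pool can use above its guarantee (preemptible burst).
    A maximum of -1 means the pool is bounded only by cluster size."""
    cap = cluster_total if maximum == -1 else maximum
    return cap - guarantee

print(burst_headroom(30, 70, 100))  # Pool A: 40 GPUs of burst
print(burst_headroom(50, -1, 100))  # Pool B: 50 GPUs of burst
```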
Warning
When both pools exceed guarantees, Pool B gets 3x Pool A’s allocation (weight ratio 1:3)
Weight Ratio Example:
When 20 GPUs become available and both pools want more:
Pool A gets 5 GPUs (1 part)
Pool B gets 15 GPUs (3 parts)
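A minimal sketch of this weighted split (integer division, ignoring per-pool maximums; the function name is illustrative):

```python
def split_by_weight(surplus: int, weights: dict[str, int]) -> dict[str, int]:
    """Divide surplus GPUs across pools in proportion to their weights.
    Simplified: assumes the split is exact and no pool hits its cap."""
    total = sum(weights.values())
    return {pool: surplus * w // total for pool, w in weights.items()}

print(split_by_weight(20, {"A": 1, "B": 3}))  # {'A': 5, 'B': 15}
```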
Preemption Scenarios#
Scenario 1: Pool Below Guarantee Preempts
Current State:
Pool A: 70 GPUs (30 non-preemptible + 40 preemptible)
Pool B: 30 GPUs (30 non-preemptible)
Cluster: Fully utilized (100/100)
New Workflow: Pool B submits 4-GPU non-preemptible workflow
Result: Pool B preempts 4 GPUs from Pool A’s preemptible workflows
Why?
Pool B is below its guarantee (30/50)
Non-preemptible workflows have priority over preemptible
Cluster is full, so preemption is necessary
Scenario 2: Pool Cannot Preempt Itself with Low Priority
Current State:
Pool A: 65 GPUs (25 non-preemptible + 40 preemptible)
Pool B: 35 GPUs (35 non-preemptible)
Cluster: Fully utilized (100/100)
New Workflow: Pool A submits 5-GPU preemptible workflow
Result: Workflow stays pending
Why?
Preemptible workflows cannot preempt any other workflows
Must wait for resources to free up naturally
Scenario 3: Pool Preempts Own Workflows
Current State:
Pool A: 65 GPUs (25 non-preemptible + 40 preemptible)
Pool B: 35 GPUs (35 non-preemptible)
Cluster: Fully utilized (100/100)
New Workflow: Pool A submits 5-GPU non-preemptible workflow
Result: Pool A preempts 5 GPUs from its own preemptible workflows
Why?
Pool A is below its guarantee (25/30 non-preemptible)
Non-preemptible workflows take priority
Pool preempts its own low-priority work
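The three scenarios reduce to a small decision rule. The sketch below is a simplification with hypothetical parameter names; real KAI scheduling also weighs pool weights, maximums, and node placement:

```python
def schedule_outcome(new_is_preemptible: bool,
                     pool_nonpreemptible_gpus: int,
                     pool_guarantee: int,
                     free_gpus: int,
                     preemptible_gpus_running: int) -> str:
    """Simplified outcome for a new workflow on a shared cluster."""
    if free_gpus > 0:
        return "schedule on free capacity"
    if new_is_preemptible:
        # Scenario 2: preemptible workflows never preempt anything.
        return "pending"
    if pool_nonpreemptible_gpus < pool_guarantee and preemptible_gpus_running > 0:
        # Scenarios 1 and 3: a pool below its guarantee reclaims GPUs
        # from preemptible workflows (another pool's or its own).
        return "preempt preemptible workflows"
    return "pending"

print(schedule_outcome(False, 30, 50, 0, 40))  # Scenario 1: preempt
print(schedule_outcome(True, 25, 30, 0, 40))   # Scenario 2: pending
print(schedule_outcome(False, 25, 30, 0, 40))  # Scenario 3: preempt
```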
Troubleshooting#
- Preemption Not Working
Verify ALL pools have guarantee, weight, and maximum configured
Check pools share the same compute nodes
Ensure workflows use correct priority levels (HIGH/NORMAL/LOW)
- Unfair Resource Distribution
Review weight ratios across pools
Verify guarantee values don’t exceed cluster capacity
Check if pools are hitting their maximum limits
- Workflows Stuck in Pending
Confirm total guarantees don’t exceed cluster capacity
Check if pool has reached its maximum limit
Verify preemptible workflows are marked with LOW priority
Tip
Best Practices
Set guarantees to cover baseline workload for each team
Use weights to reflect team priorities (higher weight = more burst capacity)
Set reasonable maximums to prevent one team from monopolizing resources
Mark exploratory/dev work as LOW priority (preemptible)
Reserve HIGH/NORMAL priority for production workloads
Monitor pool utilization and adjust settings quarterly
See also
Learn more about KAI scheduler
Learn more about scheduling in OSMO