Install Dependencies

Prerequisites

An operational Kubernetes cluster with recommended instance types (see Create K8s Cluster)
Helm CLI installed
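
A quick way to confirm the prerequisites before continuing is to check that the required CLIs are on the PATH. The sketch below is illustrative only: the helper name require_cli is hypothetical, and the check for kubectl is an assumption (this page lists only Helm explicitly, but the later verification steps are easiest with kubectl).

```shell
# Report whether a required CLI is on PATH; returns non-zero if it is missing.
# require_cli is a hypothetical helper name used only for this sketch.
require_cli() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Check each tool and remember whether any of them was not found.
missing=0
for tool in kubectl helm; do
  require_cli "$tool" || missing=1
done
```

If both tools are found, running kubectl get nodes and helm version against the cluster is a reasonable next check that the cluster is reachable.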

Install KAI Scheduler

OSMO uses the KAI scheduler to run AI workflows at very large scale with Workflow Groups.

For more information on the scheduler, see Scheduler Configuration.

Create a file called kai-selectors.yaml with the node selectors and tolerations that determine which nodes the KAI scheduler runs on.

global:
  # Modify the node selectors and tolerations to match your cluster
  nodeSelector: {}
  tolerations: []

scheduler:
  additionalArgs:
  - --default-staleness-grace-period=-1s  # Disable stale gang eviction
  - --update-pod-eviction-condition=true  # Enable OSMO to read preemption conditions
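
As an illustration of what a filled-in values file might look like, the sketch below writes an example file with one node selector and one toleration. Everything specific in it is an assumption: the file name kai-selectors.example.yaml, the label example.com/node-pool, and the taint key dedicated are hypothetical placeholders, so substitute the labels and taints that actually exist on your nodes.

```shell
# Write an example values file. The label and taint below are hypothetical
# placeholders, not values required by KAI or OSMO.
cat > kai-selectors.example.yaml <<'EOF'
global:
  nodeSelector:
    example.com/node-pool: "system"   # hypothetical node label
  tolerations:
  - key: "dedicated"                  # hypothetical taint key
    operator: "Equal"
    value: "system"
    effect: "NoSchedule"

scheduler:
  additionalArgs:
  - --default-staleness-grace-period=-1s
  - --update-pod-eviction-condition=true
EOF
```

To see which labels and taints are available, kubectl get nodes --show-labels and kubectl describe node <node-name> list them for each node.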

Next, install KAI using Helm:

helm fetch oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler --version <insert-chart-version>
helm upgrade --install kai-scheduler kai-scheduler-<insert-chart-version>.tgz \
  --create-namespace -n kai-scheduler \
  --values kai-selectors.yaml
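
After the install command returns, it can help to confirm that the release deployed and that its pods are starting. The following is a minimal sketch, assuming kubectl is configured for the same cluster; the function name show_release_state is hypothetical.

```shell
# Print the Helm release status, then the pods in the release namespace.
# Usage: show_release_state <release-name> <namespace>
# show_release_state is a hypothetical helper name used only for this sketch.
show_release_state() {
  release="$1"
  namespace="$2"
  helm status "$release" -n "$namespace" &&
    kubectl get pods -n "$namespace"
}
```

With the release and namespace names used in the command above, the call would be show_release_state kai-scheduler kai-scheduler.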

Note

Replace <insert-chart-version> with the actual chart version in both commands. OSMO supports up-to-date chart versions. For more information on chart versions, refer to the official KAI scheduler release notes.

Install the GPU Operator

The NVIDIA GPU Operator is required so that GPUs on the Kubernetes cluster can be discovered and GPU workloads can be scheduled onto them.

helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install gpu-operator nvidia/gpu-operator --namespace gpu-operator --create-namespace
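
Once the operator pods have had time to start, one way to check that GPUs were discovered is to list the nvidia.com/gpu resource each node advertises. This is a sketch under the assumption that kubectl is configured for the cluster; the function name list_gpu_capacity and the column headers are hypothetical.

```shell
# List each node with the number of allocatable GPUs it advertises.
# The dot in nvidia.com is escaped so it is read as part of the resource name.
# list_gpu_capacity is a hypothetical helper name used only for this sketch.
list_gpu_capacity() {
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

Calling list_gpu_capacity should show a GPU count for GPU nodes; nodes without the resource typically show <none>. Running kubectl get pods -n gpu-operator shows whether the operator components themselves have finished starting.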

Note

For optional observability components such as Grafana, Prometheus, and Kubernetes Dashboard, see Add Observability (Optional).