Install Dependencies
Prerequisites
An operational Kubernetes cluster with recommended instance types (see Create K8s Cluster)
Helm CLI installed
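A quick way to confirm the prerequisites before continuing is sketched below. The helper name `check_prereqs` is hypothetical, and the sketch assumes `kubectl` is the tool you use to reach the cluster; it only checks that the CLIs are on your PATH, not that the cluster itself is healthy.

```shell
# Return 0 if every named command is on PATH; report each missing one on stderr.
check_prereqs() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# helm is required by this guide; kubectl is assumed for cluster access.
check_prereqs helm kubectl || echo "Install the missing tools before continuing."
```

If both tools are present, `kubectl get nodes` is a reasonable follow-up check that the cluster is reachable.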
Install KAI Scheduler
OSMO uses KAI scheduler to run AI workflows at very large scale with Workflow Groups.
For more information on the scheduler, see Scheduler Configuration.
Create a file called kai-selectors.yaml with node selectors and tolerations that specify which
nodes the KAI scheduler should run on:
global:
  # Modify the node selectors and tolerations to match your cluster
  nodeSelector: {}
  tolerations: []
scheduler:
  additionalArgs:
    - --default-staleness-grace-period=-1s  # Disable stale gang eviction
    - --update-pod-eviction-condition=true  # Enable OSMO to read preemption conditions
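As an illustration of a filled-in file, the sketch below writes kai-selectors.yaml with a non-empty node selector and one toleration. The label `example.com/node-pool` and the taint `example.com/dedicated=system:NoSchedule` are hypothetical placeholders, not values OSMO or KAI require; substitute the labels and taints that exist in your cluster.

```shell
# Write kai-selectors.yaml with example scheduling constraints.
# The label and taint below are hypothetical; replace them with your own.
cat > kai-selectors.yaml <<'EOF'
global:
  nodeSelector:
    example.com/node-pool: system      # hypothetical node label
  tolerations:
    - key: example.com/dedicated       # hypothetical taint key
      operator: Equal
      value: system
      effect: NoSchedule
scheduler:
  additionalArgs:
    - --default-staleness-grace-period=-1s
    - --update-pod-eviction-condition=true
EOF
```

Leaving `nodeSelector` and `tolerations` empty, as in the original file, places no extra constraints on where the scheduler runs.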
Next, install KAI using Helm:
helm fetch oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler --version <insert-chart-version>
helm upgrade --install kai-scheduler kai-scheduler-<insert-chart-version>.tgz \
  --create-namespace -n kai-scheduler \
  --values kai-selectors.yaml
Note
Replace <insert-chart-version> with the actual chart version.
OSMO supports up-to-date chart versions.
For more information on the chart version, refer to the official KAI scheduler release notes.
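After the install completes, you may want to confirm that the release deployed and its pods started. The sketch below assumes the release name and namespace used above (`kai-scheduler`) and that `kubectl` is configured for the cluster; the function name `verify_kai_scheduler` is hypothetical.

```shell
# Show the Helm release status, then list the pods in the kai-scheduler namespace.
verify_kai_scheduler() {
  helm status kai-scheduler -n kai-scheduler || return 1
  kubectl get pods -n kai-scheduler
}

# Run the check only when both CLIs are available on this machine.
if command -v helm >/dev/null 2>&1 && command -v kubectl >/dev/null 2>&1; then
  verify_kai_scheduler || echo "KAI scheduler release not found or cluster unreachable" >&2
else
  echo "helm and kubectl are required to verify the install" >&2
fi
```

The pods should normally reach the Running state before you continue.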
Install the GPU Operator
The NVIDIA GPU Operator is required so that GPUs can be discovered and GPU workloads scheduled on the Kubernetes cluster.
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install gpu-operator nvidia/gpu-operator --namespace gpu-operator --create-namespace
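To check that the operator is running and that nodes advertise GPUs, something like the following can be used. It assumes the `gpu-operator` namespace from the command above and a configured `kubectl`; the function name `verify_gpu_operator` is hypothetical.

```shell
# List the operator pods, then show the allocatable nvidia.com/gpu count per node.
verify_gpu_operator() {
  kubectl get pods -n gpu-operator || return 1
  kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}

# Run the check only when kubectl is available on this machine.
if command -v kubectl >/dev/null 2>&1; then
  verify_gpu_operator || echo "could not query the cluster" >&2
else
  echo "kubectl is required to verify the GPU Operator" >&2
fi
```

A node whose GPUS column is empty or shows `<none>` is not yet advertising GPUs, which can simply mean the operator's components are still starting on that node.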
Note
For optional observability components such as Grafana, Prometheus, and Kubernetes Dashboard, see Add Observability (Optional).