TorchRun: Training Deep Learning Models

These tutorials guide you through training deep neural networks (DNNs) with OSMO, starting from single-node training and progressing to distributed multi-node training and fault-tolerant setups:

Single Node Training: Learn the fundamentals of running DNN training jobs on OSMO, including launching training scripts, selecting resources, managing data with OSMO datasets, saving outputs and checkpoints, and monitoring training progress with TensorBoard or Weights & Biases.
Multi-Node Distributed Training: Scale your training across multiple nodes using TorchRun. This tutorial covers configuring task groups for distributed training, using OSMO tokens to coordinate master and worker nodes, and templating workflows so the same definition scales to an arbitrary number of nodes.
Fault-Tolerant Training with Rescheduling: Implement production-ready training that recovers automatically from transient backend errors (e.g., NCCL failures). Learn how to configure checkpoint resumption, catch reschedulable errors, and use exit actions to restart failed tasks without losing training progress.
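
The checkpoint-resumption idea that ties these tutorials together can be sketched without any OSMO-specific configuration. The example below is a minimal illustration, not the tutorials' actual training script. It reads the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables that `torchrun` sets for each worker, lets only rank 0 write checkpoints, and resumes from the newest checkpoint when the task is restarted. The function names (`distributed_env`, `save_checkpoint`, `load_latest_checkpoint`, `train`), the `step_XXXXXXXX.json` file layout, and the use of JSON as a stand-in for `torch.save`/`torch.load` are all assumptions made for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def distributed_env() -> dict:
    """Read the rank variables torchrun exports, with single-process defaults."""
    return {
        "rank": int(os.environ.get("RANK", "0")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
    }


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Write a checkpoint atomically so a crash never leaves a partial file."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"step_{step:08d}.json"  # hypothetical naming scheme
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_name, final_path)  # atomic rename within the same directory
    return final_path


def load_latest_checkpoint(ckpt_dir: Path):
    """Return the newest checkpoint payload, or None if there is none."""
    candidates = sorted(ckpt_dir.glob("step_*.json"))  # zero-padded names sort by step
    if not candidates:
        return None
    with open(candidates[-1]) as f:
        return json.load(f)


def train(ckpt_dir: Path, total_steps: int, save_every: int = 2) -> int:
    """Run (or resume) a toy training loop; return the step it started from."""
    env = distributed_env()
    resumed = load_latest_checkpoint(ckpt_dir)
    start_step = resumed["step"] + 1 if resumed else 0
    state = resumed["state"] if resumed else {"loss": None}
    for step in range(start_step, total_steps):
        state["loss"] = 1.0 / (step + 1)  # placeholder for a real training step
        is_last = step == total_steps - 1
        # Only rank 0 writes, so workers do not race on the same file.
        if env["rank"] == 0 and (step % save_every == 0 or is_last):
            save_checkpoint(ckpt_dir, step, state)
    return start_step
```

Calling `train` a second time with the same checkpoint directory continues from the step after the last saved one, which is the behavior a rescheduled task relies on. In a real job the checkpoint directory would need to live on storage that survives the restart.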