Usage guide
Terms
Fault Tolerance, FT
    is the fault_tolerance package.
FT callback, FaultToleranceCallback
    is a PTL callback that integrates FT with PTL.
ft_launcher
    is a launcher tool included in FT, which is based on torchrun.
heartbeat
    is a lightweight message sent from a rank to its rank monitor that indicates that the rank is alive.
section
    is a user code segment with a custom name assigned.
rank monitor
    is a special side process started by ft_launcher that monitors its rank.
timeouts
    are time intervals used by a rank monitor to detect that a rank is not alive.
launcher script
    is a bash script that invokes ft_launcher.
PTL
    is PyTorch Lightning.
Design Overview
- Each node runs a single ft_launcher.
- FT configuration is passed to ft_launcher and propagated to other FT components.
- ft_launcher spawns rank monitors (once).
- ft_launcher spawns ranks (and can also respawn them if --max-restarts is greater than 0).
- Each rank uses RankMonitorClient to connect to its rank monitor (RankMonitorServer).
- Each rank periodically sends updates to its rank monitor (e.g., during each training and evaluation step).
- In case of a hang, the rank monitor detects missing updates from its rank and terminates it.
- If any ranks disappear, ft_launcher detects that and terminates or restarts the workload.
- ft_launcher instances communicate via the torchrun "rendezvous" mechanism.
- Rank monitors do not communicate with each other.
# Processes structure on a single node.
# NOTE: each rank has its own separate rank monitor.
[Rank_N]----(IPC)----[Rank Monitor_N]
    |                       |
    |                       |
(re/spawns)              (spawns)
    |                       |
    |                       |
[ft_launcher]----------------
Usage Overview
FT launcher
Fault tolerance includes a launcher tool called ft_launcher, which is based on torchrun and supports most torchrun command-line parameters. FT configuration can be specified either via a YAML file using --fault-tol-cfg-path or through command-line parameters using --ft-param-<parameter-name>.
If --max-restarts is specified, the launcher restarts failed workers. The restart behavior depends on the --restart-policy parameter, which supports two modes:

any-failed (default)
    All workers are restarted if any worker fails.
min-healthy
    Workers are restarted when the number of healthy nodes (nodes where all worker processes are running) falls below the minimum specified in --nnodes. This allows some worker failures to be handled without restarting the remaining workers, e.g., with Inprocess Restart. For details on how the min-healthy policy interacts with Inprocess Restart, see FT Launcher & Inprocess integration.
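For illustration, a launcher invocation could look like the sketch below. The script name, node and process counts, and the config file path are placeholders, and the exact set of supported torchrun options depends on the installed torchrun version:

    # Illustrative invocation: 2 nodes x 8 workers, FT config read from a YAML file,
    # up to 3 in-job restarts using the default any-failed policy.
    ft_launcher \
        --nnodes=2 --nproc-per-node=8 \
        --fault-tol-cfg-path=./ft_cfg.yaml \
        --max-restarts=3 \
        --restart-policy=any-failed \
        train.py

Individual FT settings from the YAML file could instead be given on the command line using the --ft-param-<parameter-name> syntax described above.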
Hang detection
The FT package provides two fully independent mechanisms for detecting hangs in user code. Users can choose the API that is best suited for their needs, or use both APIs at the same time.
Heartbeats API
The training script periodically sends heartbeats to its rank monitor. If no heartbeat arrives within the defined timeout, the workload is considered hung. This API is the simplest to use, but it might require coarse timeouts that cover a wide range of possible intervals between heartbeats. Please find more details in Heartbeats API Integration.
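The sketch below illustrates this flow. It assumes the RankMonitorClient workflow from the Heartbeats API Integration guide (initialize the client once per rank, send heartbeats periodically, shut it down at the end); the method names should be checked against that guide, and the training-loop names are placeholders:

    # Illustrative Heartbeats API sketch; see Heartbeats API Integration for
    # authoritative usage. Assumes this rank was started by ft_launcher.
    from nvidia_resiliency_ext.fault_tolerance.rank_monitor_client import RankMonitorClient

    ft_client = RankMonitorClient()
    ft_client.init_workload_monitoring()      # connect to this rank's monitor

    for batch in dataloader:                  # placeholder training loop
        train_step(batch)
        ft_client.send_heartbeat()            # tell the monitor this rank is alive

    ft_client.shutdown_workload_monitoring()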
Sections API
Selected parts of the training script are wrapped in sections. If any section stays open for too long, the workload is considered hung. The section-based API requires more changes to the user code, but timeouts can be defined more precisely and hangs can be detected more quickly. Please find more details in Sections API Integration.
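A corresponding sketch for sections is shown below. The section name is an arbitrary user-chosen label, the method names follow the Sections API Integration guide and should be verified there, and the training-loop names are placeholders:

    # Illustrative Sections API sketch; see Sections API Integration for
    # authoritative usage. A per-section timeout applies while a section is open.
    from nvidia_resiliency_ext.fault_tolerance.rank_monitor_client import RankMonitorClient

    ft_client = RankMonitorClient()
    ft_client.init_workload_monitoring()

    for batch in dataloader:                  # placeholder training loop
        ft_client.start_section("step")       # a hang is reported if "step" stays open too long
        train_step(batch)
        ft_client.end_section("step")

    ft_client.shutdown_workload_monitoring()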
Workload control
In some cases, it might be useful to control the ft_launcher behavior based on the state of a rank. For example, if an irrecoverable error is encountered in a rank, it might be reasonable to break the launcher's restart loop and exit instead of restarting; for other exception types, one might want to exclude the current node from subsequent restart attempts. RankMonitorClient exposes the nvidia_resiliency_ext.fault_tolerance.rank_monitor_client.RankMonitorClient.send_workload_control_request() API, which can be used to control the workload restart logic implemented in the launcher.
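As a sketch of how this could be wired into a training script, the snippet below catches a fatal error and asks the launcher to stop restarting. UnrecoverableError, run_training(), and build_shutdown_request() are placeholders; the exact request object passed to send_workload_control_request() is defined by the fault_tolerance package, so consult its API reference for the concrete type and fields:

    # Illustrative workload-control sketch; the request construction below is a
    # placeholder for the package-defined request object.
    from nvidia_resiliency_ext.fault_tolerance.rank_monitor_client import RankMonitorClient

    ft_client = RankMonitorClient()
    ft_client.init_workload_monitoring()

    try:
        run_training()                        # placeholder for the training loop
    except UnrecoverableError:                # placeholder "do not restart" exception
        request = build_shutdown_request()    # placeholder: build the package-defined request
        ft_client.send_workload_control_request(request)
        raise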
Note
Only the ft_launcher behavior is affected by this call. The fault tolerance package is job-scheduler-agnostic, i.e., it does not control the underlying SLURM job allocations.