Usage guide
Terms
- PTL is PyTorch Lightning.
- Fault Tolerance, FT is the fault_tolerance package.
- FT callback, FaultToleranceCallback is a PTL callback that integrates FT with PTL.
- ft_launcher is a launcher tool included in FT, which is based on torchrun.
- heartbeat is a lightweight message sent from a rank to its rank monitor, indicating that the rank is alive.
- rank monitor is a special side-process started by ft_launcher that monitors heartbeats from its rank.
- timeouts are time intervals used by a rank monitor to detect that a rank is not alive. There are two separate timeouts: one for the initial heartbeat and one for subsequent heartbeats.
- launcher script is a bash script that invokes ft_launcher.
FT Package Design Overview
- Each node runs a single ft_launcher.
- ft_launcher spawns rank monitors (once).
- ft_launcher spawns ranks (and can also respawn them if --max-restarts is greater than 0).
- Each rank uses RankMonitorClient to connect to its monitor (RankMonitorServer).
- Each rank periodically sends heartbeats to its rank monitor (e.g., after each training and evaluation step).
- In case of a hang, the rank monitor detects missing heartbeats from its rank and terminates it.
- If any ranks disappear, ft_launcher detects that and terminates or restarts the workload.
- ft_launcher instances communicate via the torchrun "rendezvous" mechanism.
- Rank monitors do not communicate with each other.
# Processes structure on a single node.
# NOTE: each rank has its separate rank monitor.

[Rank_N]----(IPC)----[Rank Monitor_N]
   |                        |
   |                        |
(re/spawns)              (spawns)
   |                        |
   |                        |
[ft_launcher]----------------
FT Integration Guide for PyTorch
1. Prerequisites:
Run ranks using ft_launcher. The command line is mostly compatible with torchrun.
Note
Some clusters (e.g. SLURM) use SIGTERM as a default method of requesting a graceful workload shutdown. It is recommended to implement appropriate signal handling in a fault-tolerant workload. To avoid deadlocks and other unintended side effects, signal handling should be synchronized across all ranks. Please refer to the train_ddp.py example for a basic signal handling implementation.
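The note above recommends synchronized signal handling. Below is a minimal, illustrative sketch; it is not the train_ddp.py code, and the helper names and the all-reduce-based synchronization are assumptions of this sketch:

import signal

import torch
import torch.distributed as dist

# Flag set by the local SIGTERM handler; checked (and synchronized) at step boundaries.
_sigterm_received = False


def _handle_sigterm(signum, frame):
    global _sigterm_received
    _sigterm_received = True


signal.signal(signal.SIGTERM, _handle_sigterm)


def should_stop() -> bool:
    """Return True on all ranks if any rank received SIGTERM."""
    if not dist.is_initialized():
        return _sigterm_received
    # NCCL requires CUDA tensors; gloo works with CPU tensors.
    device = torch.device("cuda") if dist.get_backend() == "nccl" else torch.device("cpu")
    flag = torch.tensor([1 if _sigterm_received else 0], device=device)
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)  # MAX: stop everywhere if any rank saw the signal
    return bool(flag.item())


# In the training loop, e.g. once per step:
#     if should_stop():
#         # save a checkpoint, clean up, and exit gracefully on all ranks
#         ...

Checking the synchronized flag only at step boundaries keeps all ranks leaving the training loop together and avoids collectives hanging on a subset of ranks.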
2. FT configuration:
FT configuration is passed to ft_launcher either via a YAML file (--fault-tol-cfg-path) or via CLI arguments (--ft-param-...), from where it is propagated to the other FT components.
- Timeouts for fault detection need to be adjusted for a given workload:
  - initial_rank_heartbeat_timeout should be long enough to allow for workload initialization.
  - rank_heartbeat_timeout should be at least as long as the longest possible interval between steps. Importantly, heartbeats are not sent during checkpoint loading and saving, so the time needed for checkpointing-related operations should be taken into account.
Summary of all FT configuration items:
- class nvidia_resiliency_ext.fault_tolerance.config.FaultToleranceConfig(workload_check_interval=5.0, initial_rank_heartbeat_timeout=3600.0, rank_heartbeat_timeout=2700.0, safety_factor=5.0, rank_termination_signal=Signals.SIGKILL, log_level=20)[source]

  Configuration of fault tolerance.

  - workload_check_interval [float] – periodic rank check interval (in seconds) in rank monitors.
  - initial_rank_heartbeat_timeout [float] – timeout (in seconds) for the first heartbeat from a rank. Usually, it takes a bit longer for the first heartbeat to be sent, as the rank needs to initialize. If a rank does not send the first heartbeat within initial_rank_heartbeat_timeout, a failure is detected. If None, this timeout is deduced and set at runtime, based on the observed heartbeat intervals.
  - rank_heartbeat_timeout [float] – timeout (in seconds) for subsequent heartbeats from a rank. If no rank heartbeat is received within rank_heartbeat_timeout, a failure is detected. If None, this timeout is deduced and set at runtime, based on the observed heartbeat intervals.
  - safety_factor [float] – when deducing the timeouts, observed heartbeat intervals are multiplied by this factor to obtain the timeouts.
  - rank_termination_signal – signal used to terminate the rank when a failure is detected.
  - log_level – log level of fault tolerance components.
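For illustration, the same settings can be expressed by constructing the config object directly in Python. The values below are arbitrary examples, not recommendations; in practice the configuration is usually given to ft_launcher via the YAML file or --ft-param-... arguments described above:

import signal

from nvidia_resiliency_ext.fault_tolerance.config import FaultToleranceConfig

# Arbitrary example values; tune the timeouts to your workload.
ft_cfg = FaultToleranceConfig(
    initial_rank_heartbeat_timeout=3600.0,   # allow up to 1 h for workload initialization
    rank_heartbeat_timeout=2700.0,           # longest expected gap between heartbeats
    safety_factor=5.0,                       # margin applied when timeouts are deduced
    rank_termination_signal=signal.SIGKILL,  # how a hung rank is terminated
)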
3. Integration with a PyTorch workload:
- Initialize a RankMonitorClient instance on each rank with RankMonitorClient.init_workload_monitoring().
- (Optional) Restore the state of RankMonitorClient instances using RankMonitorClient.load_state_dict().
- Periodically send heartbeats from ranks using RankMonitorClient.send_heartbeat().
- (Optional) After a sufficient range of heartbeat intervals has been observed, call RankMonitorClient.calculate_and_set_timeouts() to estimate the timeouts.
- (Optional) Save the RankMonitorClient instance's state_dict() to a file so that the computed timeouts can be reused in the next run.
- Shut down RankMonitorClient instances using RankMonitorClient.shutdown_workload_monitoring().
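The steps above can be condensed into the following sketch. Only the listed method names come from the FT API described here; the import path, the no-argument RankMonitorClient() construction, the state-file location, and the toy training loop are assumptions of this sketch:

import os

import torch
from nvidia_resiliency_ext.fault_tolerance import RankMonitorClient  # import path assumed

STATE_PATH = "/shared/ft_state/rank_monitor_client.pt"  # hypothetical path, shared across runs

ft_client = RankMonitorClient()  # no-argument construction assumed
ft_client.init_workload_monitoring()

# (Optional) restore previously computed timeouts.
if os.path.exists(STATE_PATH):
    ft_client.load_state_dict(torch.load(STATE_PATH))

num_steps = 1000  # placeholder training schedule
for step in range(num_steps):
    # ... run one training or evaluation step here ...
    ft_client.send_heartbeat()  # tell the rank monitor this rank is alive

# (Optional) estimate timeouts from the observed heartbeat intervals
# and persist them so that the next run can reuse them.
ft_client.calculate_and_set_timeouts()
torch.save(ft_client.state_dict(), STATE_PATH)

ft_client.shutdown_workload_monitoring()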
FT Integration Guide for PyTorch Lightning
This section describes Fault Tolerance integration with a PTL-based workload (e.g., NeMo) using FaultToleranceCallback.
1. Use ft_launcher to start the workload
Fault tolerance relies on a special launcher (ft_launcher), which is a modified torchrun.

If you are using NeMo, the NeMo-Framework-Launcher can be used to generate SLURM batch scripts with FT support.
2. Add FT callback to the PTL trainer
Add the FT callback to PTL callbacks.
import pytorch_lightning as pl

from nvidia_resiliency_ext.ptl_resiliency import FaultToleranceCallback

fault_tol_cb = FaultToleranceCallback(
    autoresume=True,
    calculate_timeouts=True,
    logger_name="test_logger",
    exp_dir=tmp_path,  # directory for the FT state, provided by the caller
)

trainer = pl.Trainer(
    ...
    callbacks=[..., fault_tol_cb],
    resume_from_checkpoint=True,
)
- Core FT callback functionality includes:
  - Establishing a connection with a rank monitor.
  - Sending heartbeats during training and evaluation steps.
  - Disconnecting from a rank monitor.
- Optionally, it can also:
  - Compute timeouts that will be used instead of the timeouts defined in the FT config.
  - Create a flag file when the training is completed.
FT callback initialization parameters:
- FaultToleranceCallback.__init__(autoresume, calculate_timeouts, simulated_fault_params=None, exp_dir=None, logger_name='nemo_logger.FaultToleranceCallback')[source]
Initialize callback instance.
This is a lightweight initialization. Most of the initialization is conducted in the ‘setup’ hook.
- Parameters:
  - autoresume (bool) – Set to True if the FT auto-resume feature is used (e.g., there are multiple training jobs to be run).
  - calculate_timeouts (bool) – Set to True if FT timeouts should be calculated based on observed heartbeat intervals. Calculated timeouts overwrite the timeouts from the FT config. Timeouts are computed at the end of a training job, if there was checkpoint loading and saving. For example, for training started from scratch, the timeouts are computed at the end of the second job.
  - simulated_fault_params (SimulatedFaultParams, dict, DictConfig, None) – Simulated fault spec. It's for debugging only. Defaults to None. Should be a SimulatedFaultParams instance or any object that can be used for SimulatedFaultParams initialization with SimulatedFaultParams(**obj).
  - exp_dir (Union[str, pathlib.Path, None], optional) – Directory where the FT state should be saved. Must be available to all training jobs. NOTE: Beware that PTL can move files written to its trainer.log_dir. Defaults to None, in which case it defaults to trainer.log_dir/ft_state.
  - logger_name (Optional[str], optional) – Logger name to be used. Defaults to "nemo_logger.FaultToleranceCallback".
3. Implementing auto-resume
Auto-resume is a feature that simplifies running a training session that consists of multiple subsequent training jobs.
Note
Auto-resume is not a part of the FT package. It is entirely implemented in a launcher script and the FaultToleranceCallback.
FaultToleranceCallback exposes an "interface" that allows implementing an auto-resume launcher script. Specifically, if autoresume=True, the FT callback creates a special marker file when training is completed. The marker file location is expected to be set in the FAULT_TOL_FINISHED_FLAG_FILE environment variable.
- The following mechanism can be used to implement an auto-resuming launcher script:
  - The launcher script starts ranks with ft_launcher. FAULT_TOL_FINISHED_FLAG_FILE should be passed to the rank processes.
  - When ft_launcher exits, the launcher script checks whether the FAULT_TOL_FINISHED_FLAG_FILE file was created.
  - If FAULT_TOL_FINISHED_FLAG_FILE exists, the auto-resume loop can be broken, as the training is completed.
  - If FAULT_TOL_FINISHED_FLAG_FILE does not exist, a continuation job can be issued (other conditions can also be checked, e.g., that the maximum number of failures has not been reached).

A minimal sketch of this loop is shown below.
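For illustration only, the loop logic might look like the following Python sketch. Real launcher scripts are typically bash/SLURM scripts; the flag-file path, the retry limit, and the elided ft_launcher arguments are placeholders:

import os
import subprocess

FLAG_FILE = "/shared/run/ft_finished.flag"  # hypothetical path, visible to all jobs

env = dict(os.environ, FAULT_TOL_FINISHED_FLAG_FILE=FLAG_FILE)
max_attempts = 3  # hypothetical failure budget

for attempt in range(max_attempts):
    # Start the workload; real ft_launcher arguments are elided here.
    subprocess.run(["ft_launcher", "..."], env=env)

    if os.path.exists(FLAG_FILE):
        # The FT callback created the marker file: training is completed.
        break
    # Otherwise issue a continuation job; additional conditions (e.g., a failure
    # budget) can be checked here. This sketch simply retries up to max_attempts times.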