Usage guide

Terms

  • PTL is PyTorch Lightning.

  • Fault Tolerance (FT) is the fault_tolerance package.

  • FT callback (FaultToleranceCallback) is a PTL callback that integrates FT with PTL.

  • ft_launcher is a launcher tool included in FT, which is based on torchrun.

  • heartbeat is a lightweight message sent from a rank to its rank monitor that indicates that a rank is alive.

  • rank monitor is a special side-process started by ft_launcher that monitors heartbeats from its rank.

  • timeouts are time intervals used by a rank monitor to detect that a rank is not alive. There are two separate timeouts: for the initial heartbeat and the subsequent heartbeats.

  • launcher script is a bash script that invokes ft_launcher.

FT Package Design Overview

  • Each node runs a single ft_launcher.

  • ft_launcher spawns rank monitors (once).

  • ft_launcher spawns ranks (can also respawn if --max-restarts is greater than 0).

  • Each rank uses RankMonitorClient to connect to its monitor (RankMonitorServer).

  • Each rank periodically sends heartbeats to its rank monitor (e.g. after each training and evaluation step).

  • In case of a hang, the rank monitor detects missing heartbeats from its rank and terminates it.

  • If any ranks disappear, ft_launcher detects that and terminates or restarts the workload.

  • ft_launcher instances communicate via the torchrun “rendezvous” mechanism.

  • Rank monitors do not communicate with each other.

# Process structure on a single node.
# NOTE: each rank has its own rank monitor.

[Rank_N]----(IPC)----[Rank Monitor_N]
   |                      |
   |                      |
(re/spawns)            (spawns)
   |                      |
   |                      |
[ft_launcher]-------------

FT Integration Guide for PyTorch

1. Prerequisites:

Run ranks using ft_launcher. The command line is mostly compatible with torchrun.
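
For example, a minimal single-node invocation could look like the following sketch; the script name, flag values, and the hyphenated torchrun-style options are illustrative:

ft_launcher \
    --nnodes=1 \
    --nproc-per-node=8 \
    --max-restarts=3 \
    --fault-tol-cfg-path=./ft_cfg.yaml \
    train.py

Since the command line is mostly torchrun-compatible, the usual torchrun arguments (e.g., rendezvous settings) can be passed alongside the FT-specific ones.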

Note

Some clusters (e.g. SLURM) use SIGTERM as a default method of requesting a graceful workload shutdown. It is recommended to implement appropriate signal handling in a fault-tolerant workload. To avoid deadlocks and other unintended side effects, signal handling should be synchronized across all ranks. Please refer to the train_ddp.py example for a basic signal handling implementation.
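
A minimal sketch of one possible approach, assuming a torch.distributed process group is already initialized: the handler only records that SIGTERM arrived, and the shutdown decision is synchronized across ranks at a step boundary (the helper names below are illustrative and not part of the FT package):

import signal

import torch
import torch.distributed as dist

_sigterm_received = False


def _sigterm_handler(signum, frame):
    # Only record the signal here; defer any action to a safe point in the training loop.
    global _sigterm_received
    _sigterm_received = True


signal.signal(signal.SIGTERM, _sigterm_handler)


def should_stop_training() -> bool:
    # Agree on the shutdown across all ranks, so that they stop at the same step.
    device = torch.device("cuda") if dist.get_backend() == "nccl" else torch.device("cpu")
    flag = torch.tensor([1.0 if _sigterm_received else 0.0], device=device)
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item() > 0)

The training loop can call should_stop_training() at each step boundary and, if it returns True, save a checkpoint and exit cleanly.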

2. FT configuration:

FT configuration is passed to ft_launcher either via a YAML file (--fault-tol-cfg-path) or via CLI arguments (--ft-param-...), and from there it is propagated to the other FT components.

Timeouts for fault detection need to be adjusted for a given workload:
  • initial_rank_heartbeat_timeout should be long enough to allow for workload initialization.

  • rank_heartbeat_timeout should be at least as long as the longest possible interval between steps.

Importantly, heartbeats are not sent during checkpoint loading and saving, so time for checkpointing-related operations should be taken into account.
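
A minimal configuration sketch; the timeout values are placeholders, and the assumption that the fields live under a fault_tolerance section in the YAML file should be verified against the config loader:

fault_tolerance:
  initial_rank_heartbeat_timeout: 600
  rank_heartbeat_timeout: 180

Assuming the --ft-param- prefix maps directly onto the config field names, the same values could also be passed on the command line as --ft-param-initial_rank_heartbeat_timeout=600 --ft-param-rank_heartbeat_timeout=180.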

Summary of all FT configuration items:

class nvidia_resiliency_ext.fault_tolerance.config.FaultToleranceConfig(workload_check_interval=5.0, initial_rank_heartbeat_timeout=3600.0, rank_heartbeat_timeout=2700.0, safety_factor=5.0, rank_termination_signal=Signals.SIGKILL, log_level=20)[source]
Configuration of fault tolerance
  • workload_check_interval [float] periodic rank check interval (in seconds) in rank monitors.

  • initial_rank_heartbeat_timeout [float] timeout (in seconds) for the first heartbeat from a rank. Usually, it takes a bit longer for the first heartbeat to be sent, as the rank needs to initialize. If a rank does not send its first heartbeat within initial_rank_heartbeat_timeout, a failure is detected. If None, this timeout needs to be deduced and set at runtime, based on the observed heartbeat intervals.

  • rank_heartbeat_timeout [float] timeout (in seconds) for subsequent heartbeats from a rank. If no heartbeat is received within rank_heartbeat_timeout, a failure is detected. If None, this timeout needs to be deduced and set at runtime, based on the observed heartbeat intervals.

  • safety_factor [float] when deducing the timeouts, observed heartbeat intervals are multiplied by this factor to obtain the timeouts.

  • rank_termination_signal signal used to terminate the rank when a failure is detected.

  • log_level log level of fault tolerance components.

Parameters:
  • workload_check_interval (float)

  • initial_rank_heartbeat_timeout (float | None)

  • rank_heartbeat_timeout (float | None)

  • safety_factor (float)

  • rank_termination_signal (Signals)

  • log_level (int)
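
For programmatic use, the configuration can also be constructed directly from these parameters; a minimal sketch with illustrative values (the remaining fields keep the defaults listed above):

from nvidia_resiliency_ext.fault_tolerance.config import FaultToleranceConfig

ft_cfg = FaultToleranceConfig(
    initial_rank_heartbeat_timeout=600.0,
    rank_heartbeat_timeout=180.0,
)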

3. Integration with a PyTorch workload (a minimal end-to-end sketch follows the steps below):

  1. Initialize a RankMonitorClient instance on each rank with RankMonitorClient.init_workload_monitoring().

  2. (Optional) Restore the state of RankMonitorClient instances using RankMonitorClient.load_state_dict().

  3. Periodically send heartbeats from ranks using RankMonitorClient.send_heartbeat().

  4. (Optional) After a sufficient range of heartbeat intervals has been observed, call RankMonitorClient.calculate_and_set_timeouts() to estimate timeouts.

  5. (Optional) Save the RankMonitorClient instance’s state_dict() to a file so that computed timeouts can be reused in the next run.

  6. Shut down RankMonitorClient instances using RankMonitorClient.shutdown_workload_monitoring().
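
Putting these steps together, a minimal per-rank sketch; the RankMonitorClient import path, the use of torch.save for the state dict, and the loop scaffolding (num_steps, train_step) are illustrative assumptions:

import os

import torch

# NOTE: the exact import path is an assumption; adjust it to the installed package layout.
from nvidia_resiliency_ext.fault_tolerance import RankMonitorClient

ft_client = RankMonitorClient()  # constructor arguments, if any, are omitted here
ft_client.init_workload_monitoring()

# (Optional) restore previously computed timeouts, if a state file from an earlier run exists.
ft_state_path = "./ft_client_state.pt"
if os.path.exists(ft_state_path):
    ft_client.load_state_dict(torch.load(ft_state_path))

for step in range(num_steps):
    train_step()
    ft_client.send_heartbeat()  # tell the rank monitor that this rank is alive

# (Optional) after enough heartbeat intervals were observed, compute and persist the timeouts.
ft_client.calculate_and_set_timeouts()
torch.save(ft_client.state_dict(), ft_state_path)

ft_client.shutdown_workload_monitoring()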

FT Integration Guide for PyTorch Lightning

This section describes Fault Tolerance integration with a PTL-based workload (e.g., NeMo) using FaultToleranceCallback.

1. Use ft_launcher to start the workload

Fault tolerance relies on a special launcher (ft_launcher), which is a modified torchrun. If you are using NeMo, the NeMo-Framework-Launcher can be used to generate SLURM batch scripts with FT support.

2. Add FT callback to the PTL trainer

Add the FT callback to PTL callbacks.

import pytorch_lightning as pl

from nvidia_resiliency_ext.ptl_resiliency import FaultToleranceCallback

fault_tol_cb = FaultToleranceCallback(
   autoresume=True,
   calculate_timeouts=True,
   logger_name="test_logger",
   exp_dir=ft_state_dir,  # placeholder: directory where the FT state is saved
)

trainer = pl.Trainer(
   ...,  # other trainer arguments
   callbacks=[..., fault_tol_cb],
)

# Resuming from the latest checkpoint is configured separately,
# e.g. by passing ckpt_path to trainer.fit() on recent PTL versions.
Core FT callback functionality includes:
  • Establishing a connection with a rank monitor.

  • Sending heartbeats during training and evaluation steps.

  • Disconnecting from a rank monitor.

Optionally, it can also:
  • Compute timeouts that will be used instead of timeouts defined in the FT config.

  • Create a flag file when the training is completed.

FT callback initialization parameters:

FaultToleranceCallback.__init__(autoresume, calculate_timeouts, simulated_fault_params=None, exp_dir=None, logger_name='nemo_logger.FaultToleranceCallback')[source]

Initialize callback instance.

This is a lightweight initialization. Most of the initialization is conducted in the ‘setup’ hook.

Parameters:
  • autoresume (bool) – Set to True if the FT auto-resume feature is used (e.g., there are multiple training jobs to be run).

  • calculate_timeouts (bool) – Set to True if FT timeouts should be calculated based on observed heartbeat intervals. Calculated timeouts overwrite the timeouts from the FT config. Timeouts are computed at the end of a training job, provided that checkpoint loading and saving were observed during that job. For example, for a training started from scratch, the timeouts are computed at the end of the second job.

  • simulated_fault_params (SimulatedFaultParams, dict, DictConfig, None) – Simulated fault spec. It’s for debugging only. Defaults to None. Should be a SimulatedFaultParams instance or any object that can be used for SimulatedFaultParams initialization with SimulatedFaultParams(**obj).

  • exp_dir (Union[str, pathlib.Path, None], optional) – Directory where the FT state should be saved. Must be available for all training jobs. NOTE: Beware that PTL can move files written to its trainer.log_dir. Defaults to None, in which case it defaults to trainer.log_dir/ft_state.

  • logger_name (Optional[str], optional) – Logger name to be used. Defaults to “nemo_logger.FaultToleranceCallback”.

3. Implementing auto-resume

Auto-resume is a feature that simplifies running a training workload that consists of multiple subsequent training jobs.

Note

Auto-resume is not a part of the FT package. It is entirely implemented in a launcher script and the FaultToleranceCallback.

FaultToleranceCallback exposes an “interface” that allows implementing an auto-resume launcher script. Specifically, if autoresume=True, the FT callback creates a special marker file when training is completed. The marker file location is expected to be set in the FAULT_TOL_FINISHED_FLAG_FILE environment variable.

The following mechanism can be used to implement an auto-resuming launcher script (a minimal sketch follows the list):
  • The launcher script starts ranks with ft_launcher.

  • FAULT_TOL_FINISHED_FLAG_FILE should be passed to rank processes.

  • When ft_launcher exits, the launcher script checks whether the FAULT_TOL_FINISHED_FLAG_FILE file was created.

    • If FAULT_TOL_FINISHED_FLAG_FILE exists, the auto-resume loop can be broken, as the training is completed.

    • If FAULT_TOL_FINISHED_FLAG_FILE does not exist, the continuation job can be issued (other conditions can be checked, e.g., if the maximum number of failures is not reached).
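
A minimal launcher-script sketch of this loop; the file paths, flag values, number of jobs, and the simple in-place restart (instead of, e.g., re-submitting a SLURM batch job) are illustrative:

#!/bin/bash

# Marker file created by the FT callback when the training is complete.
export FAULT_TOL_FINISHED_FLAG_FILE="$(pwd)/finished.flag"
rm -f "${FAULT_TOL_FINISHED_FLAG_FILE}"

MAX_JOBS=10

for job_idx in $(seq 1 "${MAX_JOBS}"); do
    # Start the ranks; FAULT_TOL_FINISHED_FLAG_FILE is inherited by the rank processes.
    ft_launcher --nproc-per-node=8 --fault-tol-cfg-path=./ft_cfg.yaml train.py

    if [ -f "${FAULT_TOL_FINISHED_FLAG_FILE}" ]; then
        echo "Training is complete."
        break
    fi

    echo "Training is not complete; issuing continuation job $((job_idx + 1))."
done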