Usage guide
############

Terms
*****

* ``PTL`` is PyTorch Lightning.
* ``Fault Tolerance``, ``FT`` is the ``fault_tolerance`` package.
* ``FT callback``, ``FaultToleranceCallback`` is a PTL callback that integrates FT with PTL.
* ``ft_launcher`` is a launcher tool included in FT, which is based on ``torchrun``.
* ``heartbeat`` is a lightweight message sent from a rank to its rank monitor, indicating that the rank is alive.
* ``rank monitor`` is a special side-process started by ``ft_launcher`` that monitors heartbeats from its rank.
* ``timeouts`` are time intervals used by a rank monitor to detect that a rank is not alive. There are two separate timeouts: one for the initial heartbeat and one for subsequent heartbeats.
* ``launcher script`` is a bash script that invokes ``ft_launcher``.

FT Package Design Overview
**************************

* Each node runs a single ``ft_launcher``.
* ``ft_launcher`` spawns rank monitors (once).
* ``ft_launcher`` spawns ranks (and can respawn them if ``--max-restarts`` is greater than 0).
* Each rank uses ``RankMonitorClient`` to connect to its monitor (``RankMonitorServer``).
* Each rank periodically sends heartbeats to its rank monitor (e.g., after each training and evaluation step).
* In case of a hang, the rank monitor detects missing heartbeats from its rank and terminates it.
* If any ranks disappear, ``ft_launcher`` detects that and terminates or restarts the workload.
* ``ft_launcher`` instances communicate via the ``torchrun`` "rendezvous" mechanism.
* Rank monitors do not communicate with each other.

.. code-block:: text

   # Processes structure on a single node.
   # NOTE: each rank has its separate rank monitor.

   [Rank_N]----(IPC)----[Rank Monitor_N]
       |                     |
       |                     |
   (re/spawns)            (spawns)
       |                     |
       |                     |
   [ft_launcher]--------------

FT Integration Guide for PyTorch
********************************

1. Prerequisites:
=================

Run ranks using ``ft_launcher``. Its command line is mostly compatible with ``torchrun``.

.. note::
   Some clusters (e.g., SLURM) use SIGTERM as the default method of requesting a graceful workload shutdown.
   It is recommended to implement appropriate signal handling in a fault-tolerant workload.
   To avoid deadlocks and other unintended side effects, signal handling should be synchronized across all ranks.
   Please refer to the :doc:`train_ddp.py example ` for a basic signal handling implementation.

2. FT configuration:
====================

FT configuration is passed to ``ft_launcher`` via a YAML file (``--fault-tol-cfg-path``) or CLI arguments (``--ft-param-...``), and is propagated from there to the other FT components.

Timeouts for fault detection need to be adjusted for a given workload:

* ``initial_rank_heartbeat_timeout`` should be long enough to allow for workload initialization.
* ``rank_heartbeat_timeout`` should be at least as long as the longest possible interval between steps.

**Importantly, heartbeats are not sent during checkpoint loading and saving**, so the time needed for checkpointing-related operations should be taken into account.

Summary of all FT configuration items:

.. autoclass:: nvidia_resiliency_ext.fault_tolerance.config.FaultToleranceConfig

3. Integration with a PyTorch workload:
=======================================

1. Initialize a ``RankMonitorClient`` instance on each rank with ``RankMonitorClient.init_workload_monitoring()``.
2. *(Optional)* Restore the state of ``RankMonitorClient`` instances using ``RankMonitorClient.load_state_dict()``.
3. Periodically send heartbeats from ranks using ``RankMonitorClient.send_heartbeat()``.
4. *(Optional)* After a sufficient range of heartbeat intervals has been observed, call ``RankMonitorClient.calculate_and_set_timeouts()`` to estimate the timeouts.
5. *(Optional)* Save the ``RankMonitorClient`` instance's ``state_dict()`` to a file so that the computed timeouts can be reused in the next run.
6. Shut down ``RankMonitorClient`` instances using ``RankMonitorClient.shutdown_workload_monitoring()``.
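A minimal sketch of these steps is shown below. The import path, the ``RankMonitorClient()`` constructor call, the training loop, and the ``ft_state.pt`` file name are illustrative assumptions; only the method names come from the steps above. The script is expected to be started via ``ft_launcher``, so that a rank monitor exists for each rank.

.. code-block:: python

   import os

   import torch

   from nvidia_resiliency_ext import fault_tolerance  # assumed import path


   def train_step(batch):
       """Placeholder for the real forward/backward/optimizer step."""
       pass


   def main():
       dataloader = range(100)  # placeholder for a real DataLoader

       # Step 1: connect this rank to its rank monitor.
       ft_client = fault_tolerance.RankMonitorClient()
       ft_client.init_workload_monitoring()

       # Step 2 (optional): restore timeouts computed in a previous run.
       if os.path.exists("ft_state.pt"):
           ft_client.load_state_dict(torch.load("ft_state.pt"))

       for batch in dataloader:
           train_step(batch)
           # Step 3: one heartbeat per training (and evaluation) step.
           ft_client.send_heartbeat()

       # Step 4 (optional): estimate timeouts from the observed heartbeat intervals.
       ft_client.calculate_and_set_timeouts()

       # Step 5 (optional): persist the computed timeouts for the next run.
       torch.save(ft_client.state_dict(), "ft_state.pt")

       # Step 6: disconnect from the rank monitor.
       ft_client.shutdown_workload_monitoring()


   if __name__ == "__main__":
       main()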
FT Integration Guide for PyTorch Lightning
******************************************

This section describes Fault Tolerance integration with a PTL-based workload (e.g., NeMo) using ``FaultToleranceCallback``.

1. Use ``ft_launcher`` to start the workload
============================================

Fault tolerance relies on a special launcher (``ft_launcher``), which is a modified ``torchrun``.
If you are using NeMo, the `NeMo-Framework-Launcher `_ can be used to generate SLURM batch scripts with FT support.

2. Add FT callback to the PTL trainer
=====================================

Add the FT callback to the PTL callbacks.

.. code-block:: python

   import pytorch_lightning as pl

   from nvidia_resiliency_ext.ptl_resiliency import FaultToleranceCallback

   fault_tol_cb = FaultToleranceCallback(
       autoresume=True,
       calculate_timeouts=True,
       logger_name="test_logger",
       exp_dir=tmp_path,
   )

   trainer = pl.Trainer(
       ...
       callbacks=[..., fault_tol_cb],
       resume_from_checkpoint=True,
   )

Core FT callback functionality includes:

* Establishing a connection with a rank monitor.
* Sending heartbeats during training and evaluation steps.
* Disconnecting from a rank monitor.

Optionally, it can also:

* Compute timeouts that are used instead of the timeouts defined in the FT config.
* Create a flag file when the training is completed.

FT callback initialization parameters:

.. automethod:: nvidia_resiliency_ext.ptl_resiliency.fault_tolerance_callback.FaultToleranceCallback.__init__

3. Implementing auto-resume
===========================

Auto-resume simplifies running a training that consists of multiple subsequent training jobs.

.. note::
   Auto-resume is not a part of the FT package. It is entirely implemented in a launcher script and the ``FaultToleranceCallback``.

``FaultToleranceCallback`` exposes an "interface" that allows implementing an auto-resume launcher script.
Specifically, if ``autoresume=True``, the FT callback creates a special marker file when training is completed.
The marker file location is expected to be set in the ``FAULT_TOL_FINISHED_FLAG_FILE`` environment variable.

The following mechanism can be used to implement an auto-resuming launcher script (a sketch follows the list):

* The launcher script starts ranks with ``ft_launcher``.
* ``FAULT_TOL_FINISHED_FLAG_FILE`` should be passed to the rank processes.
* When ``ft_launcher`` exits, the launcher script checks whether the ``FAULT_TOL_FINISHED_FLAG_FILE`` file was created.
* If ``FAULT_TOL_FINISHED_FLAG_FILE`` exists, the auto-resume loop can be broken, as the training is completed.
* If ``FAULT_TOL_FINISHED_FLAG_FILE`` does not exist, a continuation job can be issued (other conditions can also be checked, e.g., whether the maximum number of failures has been reached).
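The sketch below illustrates this loop. It is written in Python for consistency with the other examples in this guide; in practice the same logic typically lives in a bash launcher script, and each continuation is usually a new scheduler job rather than an in-process retry. The ``ft_launcher`` arguments, the flag file path, and the retry limit are illustrative assumptions.

.. code-block:: python

   import os
   import subprocess

   FINISHED_FLAG = "/tmp/ft_finished_flag"  # assumed marker file location
   MAX_ATTEMPTS = 3                         # assumed failure budget

   # FAULT_TOL_FINISHED_FLAG_FILE must be visible to the rank processes.
   env = dict(os.environ, FAULT_TOL_FINISHED_FLAG_FILE=FINISHED_FLAG)

   for attempt in range(MAX_ATTEMPTS):
       # Start the ranks with ft_launcher; the arguments are illustrative.
       subprocess.run(
           ["ft_launcher", "--nproc-per-node=8", "--max-restarts=3", "train.py"],
           env=env,
           check=False,
       )
       if os.path.exists(FINISHED_FLAG):
           print("Training completed - breaking the auto-resume loop.")
           break
       print("Training did not complete - issuing a continuation job.")
   else:
       print("Maximum number of attempts reached - giving up.")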