Heartbeats API Integration
**************************

1. Prerequisites
=================

* Run ranks using ``ft_launcher``. The command line is mostly compatible with ``torchrun``.
* Pass the FT config to the ``ft_launcher``.

.. note::
   Some clusters (e.g., SLURM) use SIGTERM as a default method of requesting a graceful workload shutdown.
   It is recommended to implement appropriate signal handling in a fault-tolerant workload.
   To avoid deadlocks and other unintended side effects, signal handling should be synchronized across all ranks.

2. FT configuration
====================

Timeouts for fault detection need to be adjusted for each workload:

* ``initial_rank_heartbeat_timeout`` should be long enough to allow for workload initialization.
* ``rank_heartbeat_timeout`` should be at least as long as the longest possible interval between steps.

**Importantly, heartbeats are not sent during checkpoint loading and saving**, so the time for checkpoint-related operations should be taken into account.

Fixed timeout values can be used throughout the training runs, or timeouts can be calculated based on observed heartbeat intervals.
``null`` timeout values are interpreted as infinite timeouts. In such cases, values need to be calculated to make the FT usable.

.. note::
   When ``--ft-param-initial_rank_heartbeat_timeout`` and ``--ft-param-rank_heartbeat_timeout`` are not
   provided in the command-line arguments, the launcher defaults to FT's predefined values. These are
   not null/None; currently, the defaults are 60 minutes for ``--ft-param-initial_rank_heartbeat_timeout``
   and 45 minutes for ``--ft-param-rank_heartbeat_timeout``.

Configuration file example:

.. literalinclude:: ../../../../examples/fault_tolerance/fault_tol_cfg_heartbeats.yaml
   :language: yaml
   :linenos:

A summary of all FT configuration items can be found in :class:`nvidia_resiliency_ext.fault_tolerance.config.FaultToleranceConfig`.

3. Integration with PyTorch workload code
============================================

1. Initialize a ``RankMonitorClient`` instance on each rank with ``RankMonitorClient.init_workload_monitoring()``.
2. *(Optional)* Restore the state of ``RankMonitorClient`` instances using ``RankMonitorClient.load_state_dict()``.
3. Periodically send heartbeats from ranks using ``RankMonitorClient.send_heartbeat()``.
4. *(Optional)* After a sufficient range of heartbeat intervals has been observed, call ``RankMonitorClient.calculate_and_set_hb_timeouts()`` to estimate timeouts.
5. *(Optional)* Save the ``RankMonitorClient`` instance's ``state_dict()`` to a file so that computed timeouts can be reused in the next run.
6. Shut down ``RankMonitorClient`` instances using ``RankMonitorClient.shutdown_workload_monitoring()``.

Please refer to the :doc:`../examples/train_ddp_heartbeats` for an implementation example.
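
The order of the six integration steps can be illustrated with a minimal sketch. It is not the library's implementation. ``StubRankMonitorClient`` is a hypothetical stand-in that only mirrors the method names listed above, so the control flow can run without ``ft_launcher``. The ``run_workload`` helper, the ``calc_after_steps`` parameter, and saving the state as JSON are likewise assumptions made for illustration. A real workload would use ``RankMonitorClient`` itself and would typically store its state together with the training checkpoint.

```python
import json
import os
import tempfile


class StubRankMonitorClient:
    """Hypothetical stand-in mirroring the method names listed above.

    It only records calls; it does not talk to any rank monitor.
    """

    def __init__(self):
        self.calls = []
        self.state = {}

    def init_workload_monitoring(self):
        self.calls.append("init")

    def load_state_dict(self, state):
        self.state = dict(state)
        self.calls.append("load_state_dict")

    def send_heartbeat(self):
        self.calls.append("heartbeat")

    def calculate_and_set_hb_timeouts(self):
        # Placeholder: the real client estimates timeouts from observed intervals.
        self.state["timeouts_calculated"] = True
        self.calls.append("calculate")

    def state_dict(self):
        return dict(self.state)

    def shutdown_workload_monitoring(self):
        self.calls.append("shutdown")


def run_workload(client, num_steps, state_path, calc_after_steps=3):
    """Walk through the six integration steps around a placeholder loop."""
    client.init_workload_monitoring()  # step 1
    try:
        if os.path.exists(state_path):  # step 2 (optional)
            with open(state_path) as f:
                client.load_state_dict(json.load(f))
        for step in range(1, num_steps + 1):
            # ... one training step would run here ...
            client.send_heartbeat()  # step 3
            if step == calc_after_steps:
                client.calculate_and_set_hb_timeouts()  # step 4 (optional)
                with open(state_path, "w") as f:  # step 5 (optional)
                    json.dump(client.state_dict(), f)
    finally:
        client.shutdown_workload_monitoring()  # step 6, also on errors


# First run: no saved state exists yet, so step 2 is skipped.
state_path = os.path.join(tempfile.mkdtemp(), "ft_state.json")
client = StubRankMonitorClient()
run_workload(client, num_steps=5, state_path=state_path)

# Second run: the state saved by the first run is restored in step 2.
second_client = StubRankMonitorClient()
run_workload(second_client, num_steps=2, state_path=state_path)
```

Placing the shutdown call in a ``finally`` block is a design choice of this sketch, so that monitoring is shut down even if a training step raises. Because heartbeats are not sent during checkpoint loading and saving, the interval observed around a checkpoint will be longer than a regular step interval.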