Sections API Integration ************************ 1. Prerequisites ================= * Run ranks using ``ft_launcher``. The command line is mostly compatible with ``torchrun``. * Pass the FT config to the ``ft_launcher``. .. note:: Some clusters (e.g., SLURM) use SIGTERM as a default method of requesting a graceful workload shutdown. It is recommended to implement appropriate signal handling in a fault-tolerant workload. To avoid deadlocks and other unintended side effects, signal handling should be synchronized across all ranks. 2. FT configuration ==================== With the section-based API, timeouts must be set for all defined sections, which wrap operations like training/eval steps, checkpoint saving, and initialization. Additionally, an out-of-section timeout applies when no section is active. .. note:: Ensure out-of-section timeout is long enough to accommodate restart overhead, as excessively small values can cause imbalance. If needed, consider merging sections (e.g., moving 'init' into 'out-of-section') to provide more buffer time. Relevant FT configuration items are: * ``rank_section_timeouts`` is a map of a section name to its timeout. * ``rank_out_of_section_timeout`` is the out-of-section timeout. Fixed timeout values can be used throughout the training runs, or timeouts can be calculated based on observed intervals. `null` timeout values are interpreted as infinite timeouts. In such cases, values need to be calculated to make the FT usable. Config file example: .. literalinclude:: ../../../../examples/fault_tolerance/fault_tol_cfg_sections.yaml :language: yaml :linenos: A summary of all FT configuration items can be found in :class:`nvidia_resiliency_ext.fault_tolerance.config.FaultToleranceConfig` 3. Integration with PyTorch workload code ============================================ 1. Initialize a ``RankMonitorClient`` instance on each rank with ``RankMonitorClient.init_workload_monitoring()``. 2. *(Optional)* Restore the state of ``RankMonitorClient`` instances using ``RankMonitorClient.load_state_dict()``. 3. Mark some sections of the code with ``RankMonitorClient.start_section('
')`` and ``RankMonitorClient.end_section('
')``. 4. *(Optional)* After a sufficient range of section intervals has been observed, call ``RankMonitorClient.calculate_and_set_section_timeouts()`` to estimate timeouts. 5. *(Optional)* Save the ``RankMonitorClient`` instance's ``state_dict()`` to a file so that computed timeouts can be reused in the next run. 6. Shut down ``RankMonitorClient`` instances using ``RankMonitorClient.shutdown_workload_monitoring()``. Please refer to the :doc:`../examples/train_ddp_sections` for an implementation example.