Heartbeats API Integration

1. Prerequisites

  • Run ranks using ft_launcher. The command line is mostly compatible with torchrun.

  • Pass the FT configuration to ft_launcher.

Note

Some clusters (e.g., SLURM) use SIGTERM as a default method of requesting a graceful workload shutdown. It is recommended to implement appropriate signal handling in a fault-tolerant workload. To avoid deadlocks and other unintended side effects, signal handling should be synchronized across all ranks.
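
One possible shape for such synchronized handling is sketched below. It is an illustration, not part of the FT API: StopRequest, all_ranks_should_stop and the reduce_any callable are hypothetical names. The handler only records the request; ranks then agree on stopping at a step boundary through a collective "any rank wants to stop" reduction that the workload supplies.

```python
import signal


class StopRequest:
    """Records a shutdown request; the handler itself does no other work."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        # Keep the handler trivial: just remember that a stop was requested.
        self.requested = True

    def install(self, signum=signal.SIGTERM):
        # Must be called from the main thread of the rank process.
        signal.signal(signum, self.handle)


def all_ranks_should_stop(stop, reduce_any):
    """Return True on every rank if at least one rank received the signal.

    `reduce_any` is a hypothetical callable supplied by the workload. It takes
    the local boolean and returns the logical OR across all ranks, for example
    by running a MAX all-reduce over a 0/1 flag.
    """
    return bool(reduce_any(stop.requested))
```

Calling all_ranks_should_stop() once per training step, at the same point on every rank, lets all ranks leave the loop on the same step, so no rank is left waiting in a collective operation.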

2. FT configuration

Timeouts for fault detection need to be adjusted for each workload:
  • initial_rank_heartbeat_timeout should be long enough to allow for workload initialization.

  • rank_heartbeat_timeout should be at least as long as the longest possible interval between steps.

Heartbeats are not sent during checkpoint loading and saving, so the timeouts must also cover the time spent on checkpoint-related operations.

Fixed timeout values can be used throughout the training runs, or timeouts can be calculated from observed heartbeat intervals. A null timeout value is interpreted as an infinite timeout, so in that case the timeouts must be calculated at runtime for FT to be usable.
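
As a rough illustration of how fixed values might be chosen from measurements, the sketch below applies the sizing rules above. It is not the algorithm FT uses when it calculates timeouts; the function name, its parameters and the safety factor of 2 are assumptions made for this example.

```python
def estimate_timeouts(init_time_s, step_intervals_s,
                      checkpoint_time_s=0.0, safety_factor=2.0):
    """Suggest timeout values (in seconds) from observed timings.

    init_time_s:       time from launch to the first heartbeat, including
                       any checkpoint loading.
    step_intervals_s:  observed intervals between consecutive heartbeats.
    checkpoint_time_s: longest observed checkpoint save, during which no
                       heartbeats are sent.
    safety_factor:     hypothetical margin applied to both values.
    """
    if not step_intervals_s:
        raise ValueError("at least one heartbeat interval is required")
    # The longest gap is a slow step that is followed by a checkpoint save.
    longest_gap_s = max(step_intervals_s) + checkpoint_time_s
    return {
        "initial_rank_heartbeat_timeout": safety_factor * init_time_s,
        "rank_heartbeat_timeout": safety_factor * longest_gap_s,
    }
```

For example, with a 300 s initialization, steps of at most 12 s and a 60 s checkpoint save, this yields 600 s and 144 s respectively.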

Note

When --ft-param-initial_rank_heartbeat_timeout and --ft-param-rank_heartbeat_timeout are not provided in the command-line arguments, the launcher defaults to FT’s predefined values. These are not null/None; currently, the defaults are 60 minutes for --ft-param-initial_rank_heartbeat_timeout and 45 minutes for --ft-param-rank_heartbeat_timeout.

Configuration file example:

fault_tolerance:
    initial_rank_heartbeat_timeout: null
    rank_heartbeat_timeout: null
    log_level: "DEBUG"

A summary of all FT configuration items can be found in nvidia_resiliency_ext.fault_tolerance.config.FaultToleranceConfig.

3. Integration with PyTorch workload code

  1. Initialize a RankMonitorClient instance on each rank with RankMonitorClient.init_workload_monitoring().

  2. (Optional) Restore the state of RankMonitorClient instances using RankMonitorClient.load_state_dict().

  3. Periodically send heartbeats from ranks using RankMonitorClient.send_heartbeat().

  4. (Optional) After a sufficient range of heartbeat intervals has been observed, call RankMonitorClient.calculate_and_set_hb_timeouts() to estimate timeouts.

  5. (Optional) Save the RankMonitorClient instance’s state_dict() to a file so that computed timeouts can be reused in the next run.

  6. Shut down RankMonitorClient instances using RankMonitorClient.shutdown_workload_monitoring().
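
The steps above can be sketched as a single loop. This is a minimal illustration, not the official example: train_with_heartbeats, train_step, state_path and timeouts_after_steps are hypothetical names, and pickle is just one assumed way to persist the state. The client argument is expected to be a RankMonitorClient instance, and only the methods named in the steps are called on it.

```python
import os
import pickle


def train_with_heartbeats(client, num_steps, train_step, state_path,
                          timeouts_after_steps=100):
    """Run `train_step` for `num_steps` steps while reporting heartbeats."""
    # Step 1: start monitoring for this rank.
    client.init_workload_monitoring()

    # Step 2 (optional): reuse timeouts saved by a previous run.
    # Assumption: a saved state already contains usable timeouts.
    timeouts_ready = False
    if os.path.exists(state_path):
        with open(state_path, "rb") as f:
            client.load_state_dict(pickle.load(f))
        timeouts_ready = True

    try:
        for step in range(num_steps):
            train_step(step)

            # Step 3: one heartbeat per completed step.
            client.send_heartbeat()

            # Steps 4 and 5 (optional): once enough intervals were observed,
            # estimate timeouts and persist them for the next run. Every rank
            # reaches this branch on the same step; in a multi-rank job it is
            # usually enough for a single rank to write the file.
            if not timeouts_ready and step + 1 >= timeouts_after_steps:
                client.calculate_and_set_hb_timeouts()
                with open(state_path, "wb") as f:
                    pickle.dump(client.state_dict(), f)
                timeouts_ready = True
    finally:
        # Step 6: always shut monitoring down, even if a step raised.
        client.shutdown_workload_monitoring()
```

Taking the client as a parameter keeps the loop independent of how the client is constructed, and allows the control flow to be exercised with a stand-in object outside of ft_launcher.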

For a complete implementation, refer to the Heartbeat API usage example with DDP.