PyTorch Lightning Integration

This section describes how to integrate Fault Tolerance (FT) with a PyTorch Lightning (PTL) workload, such as NeMo, using FaultToleranceCallback.

1. Use `ft_launcher` to start the workload

Fault tolerance relies on a special launcher, ft_launcher, which is a modified version of torchrun. If you are using NeMo, the NeMo-Framework-Launcher can generate SLURM batch scripts with FT support.

2. Add the FT callback to the PTL trainer

Create a FaultToleranceCallback instance and pass it in the trainer's callbacks list.

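import pytorch_lightning as pl
from nvidia_resiliency_ext.ptl_resiliency import FaultToleranceCallback

# tmp_path stands for the experiment directory; replace it with your own path.
fault_tol_cb = FaultToleranceCallback(
    autoresume=True,          # create the "finished" flag file when training completes
    calculate_timeouts=True,  # compute timeouts instead of using those from the FT config
    logger_name="test_logger",
    exp_dir=tmp_path,
)

trainer = pl.Trainer(
    ...
    callbacks=[..., fault_tol_cb],  # append the FT callback to any existing callbacks
    resume_from_checkpoint=True,
)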

Core FT callback functionality includes:

- Establishing a connection with a rank monitor.
- Sending heartbeats during training and evaluation steps.
- Disconnecting from a rank monitor.

Optionally, it can also:

- Compute timeouts that will be used instead of the timeouts defined in the FT config.
- Create a flag file when the training is completed.

The FT callback's initialization parameters are described in the FaultToleranceCallback constructor docstring (nvidia_resiliency_ext.ptl_resiliency.fault_tolerance_callback.FaultToleranceCallback).

3. Implement auto-resume

Auto-resume simplifies running training jobs that consist of multiple sequential runs.

Note

Auto-resume is not part of the FT package. It is entirely implemented in a launcher script and the FaultToleranceCallback.

FaultToleranceCallback exposes an “interface” that allows implementing an auto-resume launcher script. Specifically, if autoresume=True, the FT callback creates a special marker file when training is completed. The marker file location is expected to be set in the FAULT_TOL_FINISHED_FLAG_FILE environment variable.
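
To make the marker-file contract concrete, the sketch below shows one way a process could create such a file from the environment variable. It is an illustration only, not the callback's actual code: the helper name mark_training_finished and its behavior when the variable is unset are assumptions made for this example.

```python
import os
from pathlib import Path

FLAG_ENV_VAR = "FAULT_TOL_FINISHED_FLAG_FILE"


def mark_training_finished(environ=os.environ):
    """Hypothetical helper: create the marker file named by the env variable.

    Returns the marker path, or None when the variable is unset or empty.
    """
    flag_path = environ.get(FLAG_ENV_VAR)
    if not flag_path:
        # Assumption for this sketch: without a configured location, do nothing.
        return None
    path = Path(flag_path)
    # Make sure the parent directory exists before creating the empty file.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()
    return path
```

With autoresume=True the FT callback handles this itself, so a helper like this is only useful for understanding what the launcher script should look for.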

The following steps can be used to implement an auto-resume launcher script:

- The launcher script starts ranks with ft_launcher.
- FAULT_TOL_FINISHED_FLAG_FILE should be passed to rank processes.
- When ft_launcher exits, the launcher script checks if the FAULT_TOL_FINISHED_FLAG_FILE file was created.
  - If FAULT_TOL_FINISHED_FLAG_FILE exists, the auto-resume loop stops, as training is complete.
  - If FAULT_TOL_FINISHED_FLAG_FILE does not exist, the continuation job can be issued (other conditions can be checked, e.g., if the maximum number of failures is not reached).
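
The steps above can be sketched as a small Python launcher loop. This is a minimal sketch under stated assumptions, not a script shipped with the package: the function name run_with_autoresume and the max_attempts limit are hypothetical, and on a SLURM cluster the "continuation job" would more likely be a new batch submission than an in-process retry.

```python
import os
import subprocess
from pathlib import Path


def run_with_autoresume(launch_cmd, flag_file, max_attempts=3):
    """Re-run launch_cmd until the finished-flag file exists.

    launch_cmd is the full command line as a list (for example, one that
    starts with "ft_launcher"). Returns the number of attempts used when
    training finished, or None when max_attempts (a hypothetical limit
    for this sketch) was exhausted first.
    """
    flag_file = Path(flag_file)
    # Pass the flag location to the launcher, and thus to the rank processes.
    env = dict(os.environ, FAULT_TOL_FINISHED_FLAG_FILE=str(flag_file))
    for attempt in range(1, max_attempts + 1):
        # The exit code is not inspected: the flag file is the completion signal.
        subprocess.run(launch_cmd, env=env, check=False)
        if flag_file.exists():
            return attempt  # training is complete, stop the loop
    return None  # flag never appeared within the allowed attempts
```

A call might look like run_with_autoresume(["ft_launcher", "--nproc-per-node=8", "train.py"], "/path/to/finished.flag"), where the script name, process count, and flag path are placeholders. Checking the flag file rather than the exit code keeps the loop aligned with the contract described above, where the marker file alone signals completion.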
