Fault Tolerance

Fault Tolerance is a Python package that features:
  • Workload hang detection.

  • Automatic calculation of timeouts used for hang detection.

  • Detection of rank(s) terminated due to an error.

  • Workload respawning in case of a failure.

The nvidia-resiliency-ext package also includes the PTL callback FaultToleranceCallback that simplifies FT package integration with PyTorch Lightning-based workloads.

Fault Tolerance is included in the nvidia_resiliency_ext.fault_tolerance package. FaultToleranceCallback is included in the nvidia_resiliency_ext.ptl_resiliency package.