Straggler Detection

The Straggler Detection package detects slower ranks (stragglers) participating in a PyTorch distributed workload. The nvidia-resiliency-ext package also includes a PyTorch Lightning (PTL) callback, StragglerDetectionCallback, that simplifies integration with PyTorch Lightning-based workloads.
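
To make the idea of a "slower rank" concrete, here is a minimal, self-contained sketch. It is not the package's API or its actual algorithm. The function names, the relative-score formula (fastest time divided by a rank's time), and the 0.7 threshold are all assumptions chosen for illustration.

```python
def relative_scores(elapsed_by_rank):
    """Score each rank against the fastest one (hypothetical helper).

    elapsed_by_rank maps rank -> seconds spent in the same code section.
    The fastest rank scores 1.0; slower ranks score proportionally lower.
    """
    if not elapsed_by_rank:
        raise ValueError("need timings from at least one rank")
    if any(t <= 0 for t in elapsed_by_rank.values()):
        raise ValueError("elapsed times must be positive")
    fastest = min(elapsed_by_rank.values())
    return {rank: fastest / t for rank, t in elapsed_by_rank.items()}


def find_stragglers(elapsed_by_rank, threshold=0.7):
    """Return the sorted ranks whose score falls below the threshold.

    The 0.7 default is an assumed value for this sketch only.
    """
    scores = relative_scores(elapsed_by_rank)
    return sorted(rank for rank, score in scores.items() if score < threshold)


# Illustrative timings (seconds per step); rank 2 is noticeably slower.
timings = {0: 1.00, 1: 1.02, 2: 1.65, 3: 0.98}
print(find_stragglers(timings))
```

In a real distributed job the per-rank timings would have to be gathered across processes first; this sketch assumes they are already collected in one dictionary.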

Straggler Detection is included in the nvidia_resiliency_ext.straggler package. StragglerDetectionCallback is included in the nvidia_resiliency_ext.ptl_resiliency package.
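
As a quick check that both packages are available in an environment, the module paths named above can be imported defensively. This is only a sketch: `try_import` is a hypothetical helper, and the snippet assumes nothing about the packages beyond the names given in this section.

```python
import importlib


def try_import(module_name):
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# Module paths taken from the text above.
straggler = try_import("nvidia_resiliency_ext.straggler")
ptl_resiliency = try_import("nvidia_resiliency_ext.ptl_resiliency")

if straggler is None:
    print("nvidia_resiliency_ext.straggler is not available")
if ptl_resiliency is None:
    print("nvidia_resiliency_ext.ptl_resiliency is not available")
else:
    # The callback class named in this section, if the package exposes it.
    callback_cls = getattr(ptl_resiliency, "StragglerDetectionCallback", None)
    print("StragglerDetectionCallback found:", callback_cls is not None)
```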