Straggler Detection
The Straggler Detection package detects slower ranks (stragglers) participating in a PyTorch distributed workload.
The nvidia-resiliency-ext package also includes StragglerDetectionCallback, a PyTorch Lightning (PTL) callback that simplifies integration with PyTorch Lightning-based workloads.
Straggler Detection is included in the nvidia_resiliency_ext.straggler package.
StragglerDetectionCallback is included in the nvidia_resiliency_ext.ptl_resiliency package.
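To make the idea of a "slower rank" concrete, the following is a minimal, library-independent sketch of how a straggler could be flagged from per-rank step timings. It does not use or reproduce the nvidia_resiliency_ext API. The function names (relative_scores, find_stragglers), the scoring rule (fastest time divided by each rank's time), and the 0.7 threshold are all assumptions chosen for illustration only.

```python
from typing import Dict, List


def relative_scores(step_times: Dict[int, float]) -> Dict[int, float]:
    """Score each rank against the fastest rank (hypothetical scoring rule).

    A score of 1.0 means the rank matched the fastest observed step time;
    lower scores mean the rank was proportionally slower.
    """
    if not step_times:
        return {}
    if any(t <= 0 for t in step_times.values()):
        raise ValueError("step times must be positive")
    fastest = min(step_times.values())
    return {rank: fastest / t for rank, t in step_times.items()}


def find_stragglers(step_times: Dict[int, float], threshold: float = 0.7) -> List[int]:
    """Return the ranks whose relative score falls below `threshold`.

    The default threshold is an illustrative assumption, not a library default.
    """
    scores = relative_scores(step_times)
    return sorted(rank for rank, score in scores.items() if score < threshold)


if __name__ == "__main__":
    # Made-up average step times in seconds, keyed by rank.
    times = {0: 1.00, 1: 1.02, 2: 1.65, 3: 0.99}
    print(find_stragglers(times))  # prints [2]
```

In a real distributed job the timings would have to be gathered from every rank before such a comparison could run; the sketch assumes they are already available in one dictionary.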