nvidia-resiliency-ext v0.2.0
nvidia-resiliency-ext is a set of tools developed by NVIDIA to improve large-scale distributed training resiliency.
Features
Hang detection and automatic in-job restarting
In-process restarting
Async checkpointing
Local checkpointing
Straggler (slower ranks) detection