Abort
- class nvidia_resiliency_ext.inprocess.abort.AbortTorchDistributed
Aborts PyTorch distributed collectives and destroys all PyTorch distributed process groups.
This functionality is implemented by invoking torch.distributed.destroy_process_group() in a separate Python thread for each distributed group that has been created.
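
The thread-per-group pattern described above can be sketched with the standard library alone. This is an illustrative sketch, not the library's actual implementation: the helper name destroy_groups_in_threads, its timeout parameter, and the injected destroy_fn callable are all hypothetical, with destroy_fn standing in for torch.distributed.destroy_process_group.

```python
import threading
from typing import Callable, Iterable, List


def destroy_groups_in_threads(
    groups: Iterable[object],
    destroy_fn: Callable[[object], None],
    timeout: float = 5.0,
) -> List[str]:
    """Call destroy_fn(group) for every group, each in its own thread.

    Hypothetical helper: destroy_fn is a stand-in for
    torch.distributed.destroy_process_group. Returns the names of the
    threads that were still running once their join timeout expired.
    """
    threads = [
        threading.Thread(
            target=destroy_fn,
            args=(group,),
            name=f"destroy-group-{index}",
            daemon=True,  # a stuck destroy call must not keep the process alive
        )
        for index, group in enumerate(groups)
    ]
    # Start every thread before joining any, so one slow group
    # does not delay the teardown of the others.
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join(timeout)
    return [thread.name for thread in threads if thread.is_alive()]


if __name__ == "__main__":
    destroyed: List[object] = []
    unfinished = destroy_groups_in_threads(["dp", "tp", "pp"], destroyed.append)
    print(sorted(destroyed), unfinished)
```

With PyTorch installed, destroy_fn could be torch.distributed.destroy_process_group, which accepts the process group to destroy as its group argument.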