Abort

class nvidia_resiliency_ext.inprocess.abort.AbortTorchDistributed

Aborts PyTorch distributed collectives and destroys all PyTorch distributed process groups.

This functionality is implemented by invoking torch.distributed.destroy_process_group() in a separate Python thread for each distributed group that has been created.
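The thread-per-group pattern described above can be illustrated with a minimal sketch. This is not the library's implementation. The helper name `destroy_groups_in_threads`, the `destroy_fn` parameter, and the group names are hypothetical, and `fake_destroy` stands in for `torch.distributed.destroy_process_group()` so that the example runs without an initialized distributed backend. Joining the threads before returning is also an assumption made for the sake of a deterministic example.

```python
import threading
from typing import Any, Callable, Iterable, List


def destroy_groups_in_threads(
    groups: Iterable[Any],
    destroy_fn: Callable[[Any], None],
) -> List[threading.Thread]:
    """Call destroy_fn(group) in a separate thread for each group.

    Hypothetical helper: with PyTorch, destroy_fn would be
    torch.distributed.destroy_process_group.
    """
    threads = [
        threading.Thread(target=destroy_fn, args=(group,), name=f"abort-{i}")
        for i, group in enumerate(groups)
    ]
    # Start every thread first so that one slow or blocked teardown
    # does not delay the teardown of the other groups.
    for thread in threads:
        thread.start()
    # Assumption: wait for all teardowns to finish before returning.
    for thread in threads:
        thread.join()
    return threads


# Stand-in for torch.distributed.destroy_process_group (assumption),
# recording which groups were "destroyed".
destroyed: List[str] = []
destroyed_lock = threading.Lock()


def fake_destroy(group: str) -> None:
    with destroyed_lock:
        destroyed.append(group)


# Hypothetical group names, used only for illustration.
threads = destroy_groups_in_threads(
    ["world", "tensor_parallel", "data_parallel"], fake_destroy
)
```

In a real job, `destroy_fn` could be `torch.distributed.destroy_process_group`, which accepts the group to destroy as its `group` argument. Running each call in its own thread keeps a teardown that blocks on one group from holding up the others.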