Abort

class nvidia_resiliency_ext.inprocess.abort.Abort

Abstract base class for the abort argument of inprocess.Wrapper.

An instance of Abort is invoked by a separate monitoring thread within inprocess.Wrapper as part of the termination mechanism when a fault is detected. Its primary purpose is to unblock the main thread, which may be waiting for results from other distributed ranks that have already terminated or become unresponsive. For example, this can happen when a distributed collective operation tries to communicate with a terminated rank.

Multiple instances of Abort can be composed with inprocess.Compose to achieve the desired behavior.
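As a minimal sketch of what such composition might look like, the following uses stand-in classes rather than the library's own. The names AbortSketch, LogAbort, ReleaseLockAbort, and compose are hypothetical, and both the call signature (a state object in, a state object out) and the left-to-right execution order are assumptions made for illustration, not the documented behavior of Abort or inprocess.Compose.

```python
import abc


class AbortSketch(abc.ABC):
    """Hypothetical stand-in for an abort callable (not the library class)."""

    @abc.abstractmethod
    def __call__(self, state):
        """Run the abort action and return the (possibly updated) state."""


class LogAbort(AbortSketch):
    """Records that a fault was detected."""

    def __call__(self, state):
        state["log"].append("fault detected")
        return state


class ReleaseLockAbort(AbortSketch):
    """Records that some blocking resource was released."""

    def __call__(self, state):
        state["log"].append("lock released")
        return state


def compose(*aborts):
    """Chain abort callables left to right (an assumed order for this sketch)."""

    def composed(state):
        for abort in aborts:
            state = abort(state)
        return state

    return composed


# Build one combined abort action out of two independent ones.
combined = compose(LogAbort(), ReleaseLockAbort())
result = combined({"log": []})
```

Each stand-in performs one narrowly scoped action, so the combined behavior is determined only by which instances are passed to the composition helper and in what order.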

class nvidia_resiliency_ext.inprocess.abort.AbortTorchDistributed

Aborts PyTorch distributed collectives and destroys all PyTorch distributed process groups.

This functionality is implemented by invoking torch.distributed.destroy_process_group() in a separate Python thread for each distributed group that has been created.
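The thread-per-group pattern described above can be sketched as follows. To keep the example runnable without PyTorch or a live distributed job, destroy_fn is a hypothetical stand-in for torch.distributed.destroy_process_group(); the helper destroy_all_groups, the group names, and the join timeout are likewise assumptions made for illustration and do not reproduce the library's implementation.

```python
import threading


def destroy_all_groups(groups, destroy_fn, timeout=5.0):
    """Call destroy_fn(group) in its own thread for every group.

    Returns the groups whose thread was still running after `timeout`
    seconds, for example because the teardown call itself blocked.
    """
    threads = []
    for group in groups:
        thread = threading.Thread(
            target=destroy_fn,
            args=(group,),
            name=f"destroy-{group}",
            daemon=True,  # a hung teardown must not keep the process alive
        )
        thread.start()
        threads.append((group, thread))

    still_running = []
    for group, thread in threads:
        thread.join(timeout)
        if thread.is_alive():
            still_running.append(group)
    return still_running


destroyed = []
destroyed_lock = threading.Lock()


def fake_destroy(group):
    # Stand-in for torch.distributed.destroy_process_group(group).
    with destroyed_lock:
        destroyed.append(group)


pending = destroy_all_groups(["world", "tensor_parallel"], fake_destroy)
```

In this sketch, giving each group its own thread means that one teardown call that blocks does not prevent the remaining groups from being destroyed.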