Abort
- class nvidia_resiliency_ext.inprocess.abort.Abort[source]
Abstract base class for the abort argument of inprocess.Wrapper.
An instance of Abort is triggered by a separate monitoring thread within inprocess.Wrapper as part of the termination mechanism when a fault is detected. Its primary purpose is to unblock the main thread, which might be waiting for results from other distributed ranks that are either already terminated or unresponsive. For example, this could occur during a distributed collective operation attempting to communicate with a terminated rank.
Multiple instances of Abort can be composed with inprocess.Compose to achieve the desired behavior.
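The sketch below shows one way an Abort instance might be supplied to inprocess.Wrapper. It is a minimal sketch, assuming Wrapper is applied as a decorator with an abort keyword argument and that inprocess.Compose accepts the instances to compose as positional arguments; the train function and any other Wrapper arguments are illustrative placeholders, not part of this class's documentation.
```python
# A minimal sketch, assuming inprocess.Wrapper is used as a decorator with an
# ``abort`` keyword argument and that inprocess.Compose accepts the instances
# to compose as positional arguments; other Wrapper arguments are omitted.
from nvidia_resiliency_ext import inprocess

@inprocess.Wrapper(
    abort=inprocess.Compose(
        inprocess.abort.AbortTorchDistributed(),
        # additional Abort instances could be composed here
    ),
)
def train():
    # distributed training step; aborted and restarted when a fault is detected
    ...
```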
- class nvidia_resiliency_ext.inprocess.abort.AbortTorchDistributed(torch_fr_trace_path=None)[source]
Aborts PyTorch distributed collectives, and destroys all PyTorch distributed process groups.
This functionality is implemented by invoking
torch.distributed.destroy_process_group()
in a separate Python thread for each distributed process group that has been created.
It also collects PyTorch Flight Recorder traces, which can be used to analyze the behavior of the distributed collectives.
PyTorch Flight Recorder traces are collected by setting the TORCH_NCCL_TRACE_BUFFER_SIZE environment variable to a non-zero value. Collection of Python stack traces is disabled by default because it requires the GIL and could lead to a deadlock. This feature is still experimental and needs to be used with care.
The traces are collected in the directory specified by torch_fr_trace_path. If the directory does not exist, it will be created. If the directory is not provided, the traces will not be collected.
- Parameters:
torch_fr_trace_path (str, optional) – Directory in which to collect PyTorch Flight Recorder traces; if None, traces are not collected.
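A minimal usage sketch for trace collection, assuming the environment variable is set before torch.distributed is initialized; the buffer size and the trace directory path are illustrative values, not defaults of this class.
```python
import os

from nvidia_resiliency_ext import inprocess

# Flight Recorder only records collectives when the ring buffer size is
# non-zero; set this before initializing torch.distributed. The value is
# an illustrative choice.
os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"

# Dump traces into this directory (created if it does not exist) when
# collectives are aborted; passing None disables trace collection.
abort = inprocess.abort.AbortTorchDistributed(
    torch_fr_trace_path="/tmp/fr_traces",
)
```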