Health Check
- class nvidia_resiliency_ext.inprocess.health_check.CudaHealthCheck(timeout=datetime.timedelta(seconds=30))[source]
Ensures that the CUDA context for the current process is in a healthy state.
Synchronizes with the GPU. Uses the device corresponding to the LOCAL_RANK environment variable, or the main thread's default CUDA device if LOCAL_RANK was not specified in the environment.
- Parameters:
timeout – timeout for synchronization with the GPU
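The device-selection rule described above can be sketched in isolation. This is a minimal illustration, not the library's implementation: the function name `select_device_index` and the `default_index` argument (standing in for the main thread's default CUDA device) are hypothetical, and no GPU call is made.

```python
import os


def select_device_index(default_index: int = 0) -> int:
    """Return the device index implied by LOCAL_RANK, else a fallback.

    Hypothetical helper: `default_index` stands in for the main thread's
    default CUDA device, which the real check would query from the runtime.
    """
    local_rank = os.environ.get("LOCAL_RANK")
    if local_rank is None:
        # LOCAL_RANK was not specified in the environment.
        return default_index
    return int(local_rank)
```

With the library installed, a longer synchronization timeout would presumably be requested as `CudaHealthCheck(timeout=datetime.timedelta(seconds=60))`, following the signature shown above.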
- class nvidia_resiliency_ext.inprocess.health_check.FaultCounter(max_rank_faults=None)[source]
FaultCounter counts faults caused by the current process. The process is terminated if the total number of faults exceeds the max_rank_faults threshold.
- Parameters:
max_rank_faults – maximum number of faults caused by the process
- exception nvidia_resiliency_ext.inprocess.health_check.FaultCounterExceeded[source]
Exception raised by FaultCounter when the number of faults on the current rank exceeds the threshold.
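The counting behavior can be illustrated with a small stand-in. This sketch does not use the library: `SimpleFaultCounter`, `FaultLimitExceeded`, and `record_fault` are hypothetical names, and treating `max_rank_faults=None` as "no limit" is an assumption based only on the default value in the signature.

```python
from typing import Optional


class FaultLimitExceeded(RuntimeError):
    """Hypothetical stand-in for FaultCounterExceeded."""


class SimpleFaultCounter:
    """Hypothetical stand-in for FaultCounter, not the library class."""

    def __init__(self, max_rank_faults: Optional[int] = None) -> None:
        self.max_rank_faults = max_rank_faults
        self.faults_count = 0

    def record_fault(self) -> int:
        """Count one fault; raise once the total exceeds the threshold."""
        self.faults_count += 1
        # Assumption: None means the number of faults is unlimited.
        if (
            self.max_rank_faults is not None
            and self.faults_count > self.max_rank_faults
        ):
            raise FaultLimitExceeded(
                f"{self.faults_count} faults exceed limit {self.max_rank_faults}"
            )
        return self.faults_count
```

In this sketch a threshold of 2 tolerates two faults and raises on the third, matching the "exceeds the threshold" wording above.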
- class nvidia_resiliency_ext.inprocess.health_check.HealthCheck[source]
Abstract base class for the health_check argument of inprocess.Wrapper. HealthCheck ensures the worker is in a healthy state and can execute the workload.
Health checks are executed after the target function failure was discovered (on the local rank or on other distributed ranks), the local distributed group was destroyed, and the user-provided inprocess.finalize.Finalize finished.
HealthCheck is executed to filter out unhealthy ranks (e.g. due to a corrupted CUDA context). The execution should be local to a given rank; other ranks may have already been terminated, may be lost, or may still be executing the wrapped function.
Unhealthy state is reported by raising an Exception. The exception is reraised by inprocess.Wrapper and should lead to termination of the main Python interpreter process.
Multiple instances of HealthCheck can be composed with inprocess.Compose to achieve the desired behavior.
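The contract described above (rank-local checks that signal an unhealthy state by raising) can be sketched without the library. Everything here is hypothetical: the `State` dictionary, both check functions, and `compose_checks` are illustrative stand-ins, and the in-order execution shown is an assumption that may not match the ordering used by inprocess.Compose.

```python
import os
from typing import Any, Callable, Dict

# Hypothetical stand-in for whatever state object the library passes around.
State = Dict[str, Any]
Check = Callable[[State], State]


def gpu_flag_check(state: State) -> State:
    """Fail if the caller marked the CUDA context as unhealthy."""
    if not state.get("cuda_ok", True):
        raise RuntimeError("CUDA context reported as unhealthy")
    return state


def scratch_dir_check(state: State) -> State:
    """Fail if the rank-local scratch directory is missing."""
    path = state.get("scratch_dir", ".")
    if not os.path.isdir(path):
        raise RuntimeError(f"scratch directory not found: {path}")
    return state


def compose_checks(*checks: Check) -> Check:
    """Chain checks in the listed order; the first exception propagates."""

    def composed(state: State) -> State:
        for check in checks:
            state = check(state)
        return state

    return composed
```

Each check touches only local resources and never waits on another rank, in line with the requirement that execution stay local because other ranks may already be gone.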