Health Check
- class nvidia_resiliency_ext.inprocess.health_check.CudaHealthCheck(timeout=datetime.timedelta(seconds=30))[source]
Ensures that CUDA context for the current process is in a healthy state.
Synchronizes with the GPU. Uses the device corresponding to
LOCAL_RANK
environment variable, or the main thread’s default CUDA device ifLOCAL_RANK
was not specified in the environment.- Parameters:
timeout – timeout for synchronization with the GPU
- class nvidia_resiliency_ext.inprocess.health_check.FaultCounter(max_rank_faults=None)[source]
FaultCounter
counts faults caused by the current process. The process is terminated if total number of faults exceeds themax_rank_faults
threshold.- Parameters:
max_rank_faults – maximum number of faults cause by the process
- exception nvidia_resiliency_ext.inprocess.health_check.FaultCounterExceeded[source]
Exception raised by
FaultCounter
when number of faults on the current rank exceeds the threshold.