Health Check

class nvidia_resiliency_ext.inprocess.health_check.CudaHealthCheck(timeout=datetime.timedelta(seconds=30))[source]

Ensures that CUDA context for the current process is in a healthy state.

Synchronizes with the GPU. Uses the device corresponding to LOCAL_RANK environment variable, or the main thread’s default CUDA device if LOCAL_RANK was not specified in the environment.

Parameters:

timeout – timeout for synchronization with the GPU

class nvidia_resiliency_ext.inprocess.health_check.FaultCounter(max_rank_faults=None)[source]

FaultCounter counts faults caused by the current process. The process is terminated if total number of faults exceeds the max_rank_faults threshold.

Parameters:

max_rank_faults – maximum number of faults cause by the process

exception nvidia_resiliency_ext.inprocess.health_check.FaultCounterExceeded[source]

Exception raised by FaultCounter when number of faults on the current rank exceeds the threshold.