Health Check
- class nvidia_resiliency_ext.inprocess.health_check.HealthCheck[source]
Abstract base class for the health_check argument of inprocess.Wrapper. HealthCheck ensures the worker is in a healthy state and can execute the workload.
Health checks are executed after a failure of the target function was discovered (on the local rank or on other distributed ranks), after the local distributed group was destroyed, and after the user-provided inprocess.finalize.Finalize finished.
HealthCheck is executed to filter out unhealthy ranks (e.g. due to a corrupted CUDA context). The execution should be local to a given rank; other ranks may have already been terminated, may be lost, or may still be executing the wrapped function.
An unhealthy state is reported by raising an Exception. The exception is reraised by inprocess.Wrapper and should lead to termination of the main Python interpreter process.
Multiple instances of HealthCheck can be composed with inprocess.Compose to achieve the desired behavior.
- abstract __call__(state)[source]
Implementation of a HealthCheck.
- Parameters:
state (FrozenState) – read-only Wrapper state
- Returns:
Forwarded read-only input state.
- Return type:
FrozenState
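To illustrate the contract described for HealthCheck, here is a minimal sketch of a custom check that fails when local disk space is low. It deliberately does not import the library: HealthCheck and FrozenState below are local stand-ins that mirror only the documented interface (a callable that receives a read-only state and returns it), and DiskSpaceHealthCheck, min_free_bytes, and path are hypothetical names chosen for this example. In real use the class would presumably derive from nvidia_resiliency_ext.inprocess.health_check.HealthCheck instead.

```python
import abc
import shutil
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenState:
    """Stand-in for the library's read-only Wrapper state (field is hypothetical)."""

    rank: int = 0


class HealthCheck(abc.ABC):
    """Stand-in mirroring the documented interface of the abstract base class."""

    @abc.abstractmethod
    def __call__(self, state):
        raise NotImplementedError


class DiskSpaceHealthCheck(HealthCheck):
    """Hypothetical check: the rank is unhealthy if free disk space is too low."""

    def __init__(self, min_free_bytes, path="."):
        self.min_free_bytes = min_free_bytes
        self.path = path

    def __call__(self, state):
        free = shutil.disk_usage(self.path).free
        if free < self.min_free_bytes:
            # An unhealthy state is reported by raising an exception.
            raise RuntimeError(f"only {free} bytes free on {self.path!r}")
        # Healthy: forward the read-only input state unchanged.
        return state


state = FrozenState(rank=0)
returned = DiskSpaceHealthCheck(min_free_bytes=0)(state)
```

The check only inspects resources local to the rank, in line with the requirement that execution stays local because other ranks may already be gone.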
- class nvidia_resiliency_ext.inprocess.health_check.ChainedGPUHealthCheck(device_index=None)[source]
Ensures that GPU devices are in a healthy state by checking GPU recovery actions.
Uses the GPUHealthCheck from shared_utils to perform comprehensive GPU health checks. This health check is executed after a fault to ensure the GPU is in a recoverable state.
- Parameters:
device_index – Optional GPU device index to check. If None, checks all GPUs.
- class nvidia_resiliency_ext.inprocess.health_check.ChainedNVLHealthCheck(device_index=None)[source]
Ensures that NVL (NVLink) connections are in a healthy state.
Uses the NVLHealthCheck from shared_utils to perform comprehensive NVL link health checks. This health check is executed after a fault to ensure NVL links are functioning properly.
- Parameters:
device_index – Optional GPU device index to check. If None, checks all GPUs.
- class nvidia_resiliency_ext.inprocess.health_check.ChainedNicHealthCheck(device_index=None)[source]
Ensures that NIC (Network Interface Card) connections are in a healthy state.
Uses the NicHealthCheck from shared_utils to perform comprehensive NIC health checks. This health check is executed after a fault to ensure NIC links are functioning properly.
The NicHealthCheck constructor handles device_index and baseline initialization automatically, so this wrapper requires no additional setup.
- Parameters:
device_index – Optional GPU device index to check. If None, checks all GPUs.
- class nvidia_resiliency_ext.inprocess.health_check.CudaHealthCheck(timeout=datetime.timedelta(seconds=30))[source]
Ensures that CUDA context for the current process is in a healthy state.
Synchronizes with the GPU. Uses the device corresponding to the LOCAL_RANK environment variable, or the main thread’s default CUDA device if LOCAL_RANK was not specified in the environment.
- Parameters:
timeout – timeout for synchronization with the GPU
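The idea of synchronizing under a timeout can be sketched without any GPU dependency. The example below is not the library's implementation. It assumes one possible approach, running the blocking call in a helper thread and giving up after the timeout, and check_with_timeout and sync_fn are hypothetical names. In a real check, sync_fn would be a blocking device synchronization such as torch.cuda.synchronize.

```python
import datetime
import threading


def check_with_timeout(sync_fn, timeout=datetime.timedelta(seconds=30)):
    """Run sync_fn in a helper thread; raise if it fails or does not finish in time."""
    errors = []

    def target():
        try:
            sync_fn()
        except Exception as exc:
            # Collected here and re-raised in the calling thread below.
            errors.append(exc)

    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    thread.join(timeout.total_seconds())
    if thread.is_alive():
        # The call is still blocked: treat the device as unhealthy.
        raise TimeoutError(f"synchronization did not finish within {timeout}")
    if errors:
        raise errors[0]


# A no-op stands in for the real synchronization call.
check_with_timeout(lambda: None, timeout=datetime.timedelta(seconds=1))
```

A thread that never returns cannot be cancelled from Python, which is one reason a failed check is expected to end with termination of the process rather than a retry.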
- class nvidia_resiliency_ext.inprocess.health_check.FaultCounter(max_rank_faults=None)[source]
FaultCounter counts faults caused by the current process. The process is terminated if the total number of faults exceeds the max_rank_faults threshold.
- Parameters:
max_rank_faults – maximum number of faults caused by the process
- exception nvidia_resiliency_ext.inprocess.health_check.FaultCounterExceeded[source]
Exception raised by FaultCounter when the number of faults on the current rank exceeds the threshold.
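The counting behavior can be pictured with a short sketch. It uses local stand-ins rather than the library classes, and it makes two assumptions that the text does not confirm: every call corresponds to one fault caused by this process, and max_rank_faults=None means no limit. SketchFaultCounter is a hypothetical name.

```python
class FaultCounterExceeded(Exception):
    """Stand-in for the library exception of the same name."""


class SketchFaultCounter:
    """Counts calls (assumed: one per fault) and raises once the limit is exceeded."""

    def __init__(self, max_rank_faults=None):
        self.max_rank_faults = max_rank_faults
        self.faults_count = 0

    def __call__(self, state):
        self.faults_count += 1
        limit = self.max_rank_faults
        # Assumption: None disables the threshold entirely.
        if limit is not None and self.faults_count > limit:
            raise FaultCounterExceeded(
                f"{self.faults_count} faults exceed max_rank_faults={limit}"
            )
        return state


counter = SketchFaultCounter(max_rank_faults=2)
state = object()  # placeholder for the read-only Wrapper state
counter(state)  # first fault: within the threshold
counter(state)  # second fault: still within the threshold
try:
    counter(state)  # third fault: exceeds the threshold of 2
    exceeded = False
except FaultCounterExceeded:
    exceeded = True
```

Because the comparison is "exceeds", a threshold of 2 in this sketch tolerates two faults and fails on the third.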
Enhanced Health Check Features
The InProcess wrapper automatically includes three types of health checks when LOCAL_RANK is available:
- ChainedGPUHealthCheck: Monitors GPU device health and recovery actions.
- ChainedNVLHealthCheck: Monitors NVLink connectivity and link health.
- ChainedNicHealthCheck: Monitors network interface card connectivity and link down events.
All chained health checks automatically use the device_index from LOCAL_RANK and are ready to use
immediately after construction. The underlying health check classes handle device assignment and baseline
initialization automatically.
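The LOCAL_RANK-dependent selection can be sketched as follows. This is only an illustration of the described behavior, not the wrapper's code: device_index_from_env and default_check_names are hypothetical helpers, and returning no checks when LOCAL_RANK is missing is an assumption. The sketch returns class names paired with the device index instead of constructing the real classes, which per the signatures listed earlier would be created as, for example, ChainedGPUHealthCheck(device_index=device_index).

```python
import os


def device_index_from_env(environ=os.environ):
    """Return LOCAL_RANK as an int, or None if it is not set."""
    value = environ.get("LOCAL_RANK")
    return int(value) if value is not None else None


def default_check_names(environ=os.environ):
    """List which chained checks would be enabled for the given environment."""
    device_index = device_index_from_env(environ)
    if device_index is None:
        # Assumption: no chained checks are added without LOCAL_RANK.
        return []
    names = [
        "ChainedGPUHealthCheck",
        "ChainedNVLHealthCheck",
        "ChainedNicHealthCheck",
    ]
    return [(name, device_index) for name in names]


enabled = default_check_names({"LOCAL_RANK": "3"})
disabled = default_check_names({})
```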
Note
The NicHealthCheck constructor now accepts an optional device_index parameter that automatically
sets the NIC device and initializes the baseline link down counter during construction. This eliminates
the need for manual setup and ensures accurate health monitoring from the first check.
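Why a baseline taken at construction matters can be shown with a small sketch. It is not the NicHealthCheck implementation: LinkDownMonitor and read_counter are hypothetical, and the counter source is simulated with a dictionary, whereas a real check would read a driver-provided link down counter.

```python
class LinkDownMonitor:
    """Hypothetical monitor: compares a link down counter against a baseline."""

    def __init__(self, read_counter):
        self.read_counter = read_counter
        # The baseline is captured during construction, so link down events
        # that happened before the monitor existed are not reported as faults.
        self.baseline = read_counter()

    def is_healthy(self):
        """Healthy as long as the counter has not grown past the baseline."""
        return self.read_counter() <= self.baseline


# Simulated counter source with four historical link down events.
counter = {"link_downed": 4}
monitor = LinkDownMonitor(lambda: counter["link_downed"])

before = monitor.is_healthy()  # no new events since construction
counter["link_downed"] += 1  # simulate a link going down
after = monitor.is_healthy()
```

Without the baseline, the four historical events in this example would make the very first check fail even though nothing went wrong during the current run.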