Server
- class nvidia_resiliency_ext.fault_tolerance.rank_monitor_server.RankMonitorLogger(name='RankMonServer', level=20, connected_rank=None, is_restarter_logger=False)[source]
Bases: Logger
Logger used in a rank monitor process
Initialize the logger with a name and an optional level.
- log_restarter_event(message, *args, **kwargs)[source]
Log a restart event that should always be visible, but only if restarter logging is enabled.
- Parameters:
is_restarter_logger – Whether restarter logging is enabled.
message – The message to log.
*args – Additional positional arguments for logging.
**kwargs – Additional keyword arguments for logging.
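The gating behavior described for log_restarter_event can be sketched with the standard logging module alone. This is a minimal illustration, not the library's implementation: the class name RestarterAwareLogger is hypothetical, and the way "always visible" is achieved here (emitting at the logger's own effective level, or INFO if that is higher) is an assumption.

```python
import logging


class RestarterAwareLogger(logging.Logger):
    """Hypothetical stand-in used only to illustrate the gating idea."""

    def __init__(self, name="RankMonServer", level=logging.INFO,
                 is_restarter_logger=False):
        super().__init__(name, level)
        self.is_restarter_logger = is_restarter_logger

    def log_restarter_event(self, message, *args, **kwargs):
        # Do nothing unless this logger was designated as the restarter logger.
        if not self.is_restarter_logger:
            return
        # Assumption: "always visible" is approximated by logging at a level
        # that is never below this logger's own threshold.
        level = max(logging.INFO, self.getEffectiveLevel())
        self.log(level, message, *args, **kwargs)
```

With this sketch, a logger created with is_restarter_logger=True still emits restart events when its level is raised to ERROR, while one created with is_restarter_logger=False drops them silently.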
- class nvidia_resiliency_ext.fault_tolerance.rank_monitor_server.RankMonitorServer(cfg, ipc_socket_path, rank_monitor_ready_event, logger, is_restarter_logger)[source]
Bases: object
RankMonitorServer runs in a separate process and is responsible for monitoring the ranks. A RankMonitorClient is initialized in each rank and is used to communicate with the RankMonitorServer.
Initializes the RankMonitorServer object.
- Parameters:
cfg (FaultToleranceConfig) – The configuration object for fault tolerance.
ipc_socket_path (str) – Path of the IPC socket connecting this monitor with its rank.
rank_monitor_ready_event (mp.Event) – The event indicating that the rank monitor is ready.
logger (logging.Logger) – The logger object for logging.
is_restarter_logger (bool) – True if this monitor writes state transition logs.
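The roles of ipc_socket_path and rank_monitor_ready_event can be illustrated with a small, generic handshake: a monitor binds a Unix domain socket, signals readiness through an event, and a client connects only after that signal. This is a sketch under stated assumptions, not the RankMonitorServer protocol. It uses a thread and threading.Event in place of a separate process and mp.Event to stay short, and the function names, socket file name, and message contents are all hypothetical.

```python
import os
import socket
import tempfile
import threading


def monitor_loop(ipc_socket_path, ready_event, received):
    """Accept one connection on a Unix domain socket and record one message."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
        server.bind(ipc_socket_path)
        server.listen(1)
        # Signal that the socket is listening, so clients may connect now.
        ready_event.set()
        conn, _ = server.accept()
        with conn:
            received.append(conn.recv(1024).decode())
            conn.sendall(b"ack")


def run_demo():
    """Start the monitor, wait until it is ready, then send one message."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "rank_monitor.sock")  # hypothetical file name
        ready = threading.Event()
        received = []
        monitor = threading.Thread(target=monitor_loop,
                                   args=(path, ready, received))
        monitor.start()
        # The client side blocks here until the monitor reports readiness.
        if not ready.wait(timeout=5):
            raise RuntimeError("monitor did not become ready")
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
            client.connect(path)
            client.sendall(b"heartbeat from rank 0")  # hypothetical message
            reply = client.recv(1024).decode()
        monitor.join(timeout=5)
        return received, reply
```

Waiting on the event before connecting avoids a race in which the client tries to reach a socket path that has not been bound yet.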