Server
- class nvidia_resiliency_ext.fault_tolerance.rank_monitor_server.RankMonitorLogger(name='RankMonServer', level=20, connected_rank=None, is_restarter_logger=False)
Bases: Logger
Logger used in a rank monitor process.
Initialize the logger with a name and an optional level.
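The default `level=20` in the signature above is the numeric value of `logging.INFO` in the Python standard library. The following minimal sketch illustrates what that default implies for a `Logger` subclass. `DemoLogger` is a hypothetical stand-in written for this example; it is not the library's `RankMonitorLogger` and does not reproduce the `connected_rank` or `is_restarter_logger` behavior.

```python
import logging


class DemoLogger(logging.Logger):
    """Hypothetical Logger subclass used only for illustration."""

    def __init__(self, name="RankMonServer", level=logging.INFO):
        # logging.INFO == 20, matching the documented default level.
        super().__init__(name, level)


demo = DemoLogger()

# Translate the numeric level back to its symbolic name.
level_name = logging.getLevelName(demo.level)

# With level 20, DEBUG (10) records are filtered out and INFO (20) records pass.
debug_enabled = demo.isEnabledFor(logging.DEBUG)
info_enabled = demo.isEnabledFor(logging.INFO)
```

Passing `level=logging.DEBUG` (or `10`) instead would let debug records through as well.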
- class nvidia_resiliency_ext.fault_tolerance.rank_monitor_server.RankMonitorServer(cfg, ipc_socket_path, rank_monitor_ready_event, is_restarter_logger, logger)
Bases: object
RankMonitorServer, running in a separate process, is responsible for monitoring the ranks. RankMonitorClient is initialized in each rank and is used to communicate with the RankMonitorServer.
Initializes the RankMonitorServer object.
- Parameters:
cfg (FaultToleranceConfig) – The configuration object for fault tolerance.
ipc_socket_path (str) – Path of the IPC socket connecting this monitor with its rank.
rank_monitor_ready_event (mp.Event) – The event indicating that the rank monitor is ready.
is_restarter_logger (bool) – True if this monitor writes state transition logs.
logger (Logger.Logger) – The logger object for logging.
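The role of a readiness event such as `rank_monitor_ready_event` can be pictured with the general pattern below: a parent starts a monitor in a separate process and blocks until the child signals that it is ready. This is a sketch using only the standard `multiprocessing` module, not the library's implementation. The names `_monitor_main`, `start_monitor`, and `stop_event` are hypothetical, and the `"fork"` start method is an assumption that restricts the sketch to POSIX systems.

```python
import multiprocessing as mp


def _monitor_main(ready_event, stop_event):
    # Hypothetical stand-in for the monitor's startup work
    # (for example, binding an IPC socket) before it reports readiness.
    ready_event.set()
    # Stay alive until asked to stop, or give up after a few seconds.
    stop_event.wait(timeout=5.0)


def start_monitor(timeout=5.0):
    # Assumption: the "fork" start method is available (POSIX only).
    ctx = mp.get_context("fork")
    ready_event = ctx.Event()
    stop_event = ctx.Event()
    proc = ctx.Process(
        target=_monitor_main,
        args=(ready_event, stop_event),
        daemon=True,
    )
    proc.start()
    # Block until the child signals readiness, or fail after the timeout.
    if not ready_event.wait(timeout):
        proc.terminate()
        proc.join()
        raise RuntimeError("monitor did not become ready in time")
    return proc, stop_event
```

A caller would invoke `start_monitor()`, use the returned process handle while the monitor runs, then call `stop_event.set()` followed by `proc.join()` to shut it down cleanly.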