Server

class nvidia_resiliency_ext.fault_tolerance.rank_monitor_server.RankMonitorLogger(name='RankMonServer', level=20, connected_rank=None, is_restarter_logger=False)

Bases: Logger

Logger used in a rank monitor process.

Initialize the logger with a name and an optional level.
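The default `level=20` in the signature above is the numeric value of the standard library's `logging.INFO`. The snippet below is a minimal sketch using only the standard `logging` module. `DemoRankLogger` is a hypothetical stand-in that mirrors some of the documented defaults; it is not the library's `RankMonitorLogger` and makes no claim about how that class uses `connected_rank`.

```python
import logging


class DemoRankLogger(logging.Logger):
    """Hypothetical Logger subclass mirroring the documented defaults."""

    def __init__(self, name="RankMonServer", level=logging.INFO, connected_rank=None):
        # logging.Logger.__init__ takes a name and an optional level.
        super().__init__(name, level)
        # Assumption: the rank is simply stored; the real class may use it differently.
        self.connected_rank = connected_rank


demo_logger = DemoRankLogger()
info_enabled = demo_logger.isEnabledFor(logging.INFO)    # level 20 passes
debug_enabled = demo_logger.isEnabledFor(logging.DEBUG)  # level 10 is filtered out
```

With these defaults, messages at INFO and above are emitted and DEBUG messages are dropped.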

class nvidia_resiliency_ext.fault_tolerance.rank_monitor_server.RankMonitorServer(cfg, ipc_socket_path, rank_monitor_ready_event, is_restarter_logger, logger)

Bases: object

RankMonitorServer runs in a separate process and is responsible for monitoring the ranks. A RankMonitorClient is initialized in each rank and is used to communicate with the RankMonitorServer.

Initializes the RankMonitorServer object.

Parameters:
  • cfg (FaultToleranceConfig) – The configuration object for fault tolerance.

  • ipc_socket_path (str) – Path of the IPC socket connecting this monitor with its rank.

  • rank_monitor_ready_event (mp.Event) – The event indicating that the rank monitor is ready.

  • is_restarter_logger (bool) – True if this monitor writes state transition logs.

  • logger (Logger.Logger) – The logger object for logging.
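
The `ipc_socket_path` and `rank_monitor_ready_event` parameters suggest a common pattern: a monitor binds a Unix domain socket, signals readiness through an event, and only then does the rank connect. The sketch below illustrates that general pattern with the standard library only. It is not the library's implementation: `monitor_server`, `run_handshake`, the message contents, and the socket file name are all hypothetical, and a `threading.Event` in a thread stands in for an `mp.Event` in a separate process (both expose the same `set()` and `wait()` methods).

```python
import os
import socket
import tempfile
import threading


def monitor_server(ipc_socket_path, ready_event, received):
    """Hypothetical monitor: bind the socket, signal readiness, handle one message."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as srv:
        srv.bind(ipc_socket_path)
        srv.listen(1)
        ready_event.set()  # the socket now accepts connections
        conn, _ = srv.accept()
        with conn:
            received.append(conn.recv(1024).decode())
            conn.sendall(b"ack")


def run_handshake():
    """Start the monitor, wait for readiness, then connect as the 'rank'."""
    tmp_dir = tempfile.mkdtemp()
    ipc_socket_path = os.path.join(tmp_dir, "rank_monitor.sock")
    ready_event = threading.Event()
    received = []
    server = threading.Thread(
        target=monitor_server, args=(ipc_socket_path, ready_event, received)
    )
    server.start()
    # Connecting before the event is set could fail because the socket may not exist yet.
    if not ready_event.wait(timeout=5.0):
        raise TimeoutError("monitor did not become ready")
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(ipc_socket_path)
        client.sendall(b"hello from rank 0")
        reply = client.recv(1024).decode()
    server.join(timeout=5.0)
    os.unlink(ipc_socket_path)
    os.rmdir(tmp_dir)
    return received, reply
```

Calling `run_handshake()` returns the message the monitor received together with the monitor's reply. Waiting on the event before connecting avoids a race in which the client tries to reach a socket that has not been bound yet.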