Server

class nvidia_resiliency_ext.fault_tolerance.rank_monitor_server.RankMonitorServer(cfg, parent_rank, rank_monitor_ready_event, logger)

Bases: object

RankMonitorServer runs in a separate process and is responsible for monitoring the ranks. A RankMonitorClient is initialized in each rank and is used to communicate with the RankMonitorServer.

Initializes the RankMonitorServer object.

Parameters:
  • cfg (FaultToleranceConfig) – The configuration object for fault tolerance.

  • parent_rank (int) – The rank monitored by this RankMonitorServer instance.

  • rank_monitor_ready_event (mp.Event) – The event indicating that the rank monitor is ready.

  • logger (logging.Logger) – The logger object for logging.
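
The arrangement described above (a monitor in its own process, a readiness event, and a per-rank channel to the monitor) can be illustrated with a minimal sketch. This is not the library's implementation and does not call RankMonitorServer or RankMonitorClient. The names `monitor_main`, `start_monitor`, and `demo` are hypothetical, the pipe-based messaging is an assumption made for illustration, and the sketch uses only Python's standard `multiprocessing` and `logging` modules on a POSIX platform where the `fork` start method is available.

```python
import logging
import multiprocessing as mp


def monitor_main(parent_rank, ready_event, conn):
    """Hypothetical stand-in for a monitor process (not the library's code)."""
    logger = logging.getLogger(f"monitor.rank{parent_rank}")
    ready_event.set()      # signal the parent that the monitor is ready
    msg = conn.recv()      # block until the monitored rank sends a message
    logger.info("rank %d sent %r", parent_rank, msg)
    conn.send(("ack", parent_rank, msg))
    conn.close()


def start_monitor(parent_rank, timeout=10.0):
    """Start the monitor process and wait until it reports readiness."""
    ctx = mp.get_context("fork")   # assumption: POSIX platform with fork
    ready_event = ctx.Event()
    rank_end, monitor_end = ctx.Pipe()
    proc = ctx.Process(
        target=monitor_main,
        args=(parent_rank, ready_event, monitor_end),
        daemon=True,
    )
    proc.start()
    if not ready_event.wait(timeout):
        proc.terminate()
        raise RuntimeError("monitor process did not become ready in time")
    return proc, rank_end


def demo():
    """Play the role of the monitored rank: send one message, read the reply."""
    proc, conn = start_monitor(parent_rank=0)
    conn.send("heartbeat")
    reply = conn.recv()
    proc.join(timeout=10.0)
    return reply


if __name__ == "__main__":
    print(demo())
```

Waiting on the event before sending anything keeps the rank from talking to a monitor that has not finished starting, which is the role the rank_monitor_ready_event parameter appears to play here.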