Server
- class nvidia_resiliency_ext.fault_tolerance.rank_monitor_server.RankMonitorServer(cfg, parent_rank, rank_monitor_ready_event, logger)
Bases: object
RankMonitorServer runs in a separate process and is responsible for monitoring the ranks. A RankMonitorClient is initialized in each rank and is used to communicate with the RankMonitorServer.
Initializes the RankMonitorServer object.
- Parameters:
cfg (FaultToleranceConfig) – The configuration object for fault tolerance.
parent_rank (int) – The rank monitored by this RankMonitorServer instance.
rank_monitor_ready_event (mp.Event) – The event indicating that the rank monitor is ready.
logger (Logger.Logger) – The logger object for logging.
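
The role of rank_monitor_ready_event can be illustrated with a minimal, generic sketch built only on the Python standard library. It does not use or reproduce the RankMonitorServer implementation: `monitor_main` and `start_monitor` are hypothetical names, and the "fork" start method is an assumption made to keep the example short (POSIX only). The pattern shown is a parent process creating an event, handing it to a monitor process, and blocking until the monitor signals that it is ready.

```python
import logging
import multiprocessing as mp


def monitor_main(parent_rank, ready_event):
    """Hypothetical stand-in for a monitor process (not the library's code)."""
    logger = logging.getLogger(f"rank_monitor.{parent_rank}")
    logger.info("monitor for rank %d starting", parent_rank)
    # A real monitor would open its communication channel here and only
    # then report readiness. This sketch reports readiness immediately.
    ready_event.set()


def start_monitor(parent_rank, timeout=10.0):
    """Start the stand-in monitor and wait until it reports readiness.

    Returns (is_ready, exit_code).
    """
    # Assumption: the "fork" start method, which is available on POSIX only.
    ctx = mp.get_context("fork")
    ready_event = ctx.Event()
    proc = ctx.Process(
        target=monitor_main,
        args=(parent_rank, ready_event),
        daemon=True,
    )
    proc.start()
    # Event.wait returns True if the event was set before the timeout expired.
    is_ready = ready_event.wait(timeout)
    # The stand-in exits right after signalling, so join returns quickly.
    proc.join(timeout)
    return is_ready, proc.exitcode


if __name__ == "__main__":
    print(start_monitor(parent_rank=0))
```

Waiting on the event with a timeout lets the parent detect a monitor that never becomes ready instead of blocking indefinitely.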