Client

class nvidia_resiliency_ext.fault_tolerance.rank_monitor_client.HeartbeatTimeouts(initial=None, subsequent=None, were_calculated=None)[source]

Bases: object

Contains heartbeat-related timeouts used by FT.

  • initial is the timeout for the first heartbeat.

  • subsequent is the timeout for the subsequent heartbeats. Usually, the first heartbeat takes longer to be sent, hence there are two separate timeouts.

  • were_calculated indicates whether the timeouts were calculated from the observed heartbeat intervals or defined in the (YAML) config.

Parameters:
  • initial (float | None)

  • subsequent (float | None)

  • were_calculated (bool | None)
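To make the role of the two timeouts concrete, the following is a minimal sketch of how a monitor might choose between them. It does not use the library: TimeoutsSketch, applicable_timeout and is_overdue are hypothetical names introduced for illustration, and treating a None timeout as "no limit" is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimeoutsSketch:
    """Hypothetical stand-in that mirrors the three documented fields."""

    initial: Optional[float] = None
    subsequent: Optional[float] = None
    were_calculated: Optional[bool] = None


def applicable_timeout(timeouts: TimeoutsSketch, heartbeats_seen: int) -> Optional[float]:
    """Return the timeout that applies, given how many heartbeats have arrived."""
    # Until the first heartbeat arrives, the (usually longer) initial timeout applies.
    return timeouts.initial if heartbeats_seen == 0 else timeouts.subsequent


def is_overdue(timeouts: TimeoutsSketch, heartbeats_seen: int, seconds_waited: float) -> bool:
    """Return True if the wait for the next heartbeat exceeds the applicable timeout."""
    limit = applicable_timeout(timeouts, heartbeats_seen)
    # Assumption: a missing (None) timeout means no limit is enforced.
    return limit is not None and seconds_waited > limit


# Example values only; real values come from the FT config or from calculation.
cfg = TimeoutsSketch(initial=600.0, subsequent=60.0, were_calculated=False)
```

With these example values, a 120-second wait is tolerated before the first heartbeat but would count as overdue once heartbeats have started.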

class nvidia_resiliency_ext.fault_tolerance.rank_monitor_client.RankMonitorClient[source]

Bases: object

RankMonitorClient is a client for RankMonitorServer. Its instances are created in each rank process. After creation, an IPC connection can be established with RankMonitorServer using .init_workload_monitoring. The client should send heartbeats to the server, which monitors its health. Heartbeats are sent with .send_heartbeat.

RankMonitorServer monitors the time between heartbeats and can detect hangs. RankMonitorClient can estimate suitable timeouts for the heartbeats, which are then used instead of the values provided in the FT config. If timeouts are both predefined in the FT config and calculated, the calculated timeouts always take precedence. The timeouts currently in use can be read from the timeouts field. New timeouts can be calculated and set with .calculate_and_set_timeouts.

The stateful protocol (.state_dict(), .load_state_dict()) is used to persist the state of the client, e.g. calculated timeouts.

RankMonitorClient logger is used for logging.

Basic initialization of RankMonitorClient instance. .init_workload_monitoring() and .load_state_dict() need to be called to fully initialize. Full FT configuration will be obtained from the server via IPC.
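The call order described above can be sketched as follows. This is an illustration under stated assumptions, not the library's recommended integration: RecordingClient is a hypothetical stand-in that only records which methods were called, and run_training is an invented helper. Only the method names are taken from this page.

```python
from typing import Any, List, Mapping, Optional


class RecordingClient:
    """Hypothetical stand-in: records calls instead of talking to a server."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def init_workload_monitoring(self) -> None:
        self.calls.append("init_workload_monitoring")

    def load_state_dict(self, state: Mapping[str, Any]) -> None:
        self.calls.append("load_state_dict")

    def send_heartbeat(self) -> None:
        self.calls.append("send_heartbeat")

    def state_dict(self) -> Mapping[str, Any]:
        self.calls.append("state_dict")
        return {}

    def shutdown_workload_monitoring(self) -> None:
        self.calls.append("shutdown_workload_monitoring")


def run_training(
    client: Any, num_steps: int, saved_state: Optional[Mapping[str, Any]] = None
) -> Mapping[str, Any]:
    """Drive a client through connect, restore, heartbeat loop, and shutdown."""
    client.init_workload_monitoring()
    if saved_state is not None:
        # Restore previously persisted state, e.g. calculated timeouts.
        client.load_state_dict(saved_state)
    try:
        for _ in range(num_steps):
            # ... one training step would run here ...
            client.send_heartbeat()
        # State to be stored alongside the training checkpoint.
        return client.state_dict()
    finally:
        # Close the connection even if a training step raised.
        client.shutdown_workload_monitoring()
```

In a real job, a RankMonitorClient instance created in the rank process would presumably take the place of RecordingClient; the try/finally is a design choice of this sketch so that the connection is closed on errors as well.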

calculate_and_set_timeouts(skip_if_not_ready=False)[source]

Calculates and sets the timeouts used for hang detection.

NOTE: this call synchronizes the calculated timeouts across all ranks.

NOTE: if a calculated timeout value is smaller than the one currently in use, the new value is ignored.
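The second note amounts to a simple rule, sketched below. keep_or_replace is a hypothetical helper written for illustration, not the library's code, and the handling of a missing current value (None) is an assumption.

```python
from typing import Optional


def keep_or_replace(current: Optional[float], calculated: float) -> float:
    """Return the timeout that stays in effect after a new calculation."""
    # Assumption: with no timeout in use yet, the calculated value is adopted.
    if current is None:
        return calculated
    # Documented rule: a calculated value smaller than the current one is ignored.
    return current if calculated < current else calculated
```

Under this rule, repeated calculations can only keep or raise a timeout, never lower it.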

Parameters:

skip_if_not_ready (bool, optional) – If True, silently skips the calculation if not enough data has been collected. Otherwise, an error is raised. Defaults to False.

Returns:

True if the timeouts were calculated and set successfully. False is returned only if calculation was not possible and skip_if_not_ready was True.

Return type:

bool
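One way to use the return value is to retry with skip_if_not_ready=True until the calculation succeeds. The sketch below assumes a hypothetical FakeClient whose readiness threshold (ready_after) and RuntimeError are invented placeholders; the real client decides for itself when enough data has been collected, and the error type it raises is not stated here.

```python
from typing import Any, Optional


class FakeClient:
    """Hypothetical stand-in: 'ready' after a made-up number of heartbeats."""

    def __init__(self, ready_after: int = 3) -> None:
        self.heartbeats = 0
        self.ready_after = ready_after  # invented threshold, not a library setting

    def send_heartbeat(self) -> None:
        self.heartbeats += 1

    def calculate_and_set_timeouts(self, skip_if_not_ready: bool = False) -> bool:
        if self.heartbeats < self.ready_after:
            if skip_if_not_ready:
                return False  # silently skipped, as documented
            raise RuntimeError("not enough data collected")  # placeholder error type
        return True


def steps_until_timeouts_set(client: Any, max_steps: int) -> Optional[int]:
    """Return the first step at which timeouts were set, or None if never."""
    for step in range(1, max_steps + 1):
        # ... one training step would run here ...
        client.send_heartbeat()
        if client.calculate_and_set_timeouts(skip_if_not_ready=True):
            return step
    return None
```

Because the call synchronizes timeouts across all ranks, it seems safest for every rank to call it at the same step; that is an inference from the note above, not a documented requirement.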

init_workload_monitoring()[source]

Initializes the fault tolerance and connects to the RankMonitorServer.

Return type:

None

load_state_dict(state)[source]

Loads the state of the RankMonitorClient from a dictionary.

Can be called at any moment, e.g. before or after init_workload_monitoring.

Parameters:

state (Mapping[str, Any]) – The state as returned from the state_dict method.

Return type:

None

send_heartbeat()[source]

Sends an empty (not containing a state) heartbeat message to the rank monitor server.

Return type:

None

shutdown_workload_monitoring()[source]

Shuts down the workload monitoring and closes the connection to the RankMonitorServer.

state_dict()[source]

Returns the state dictionary of this RankMonitorClient object.

NOTE: this method returns the same values on all ranks; there are no rank-specific values in RankMonitorClient state.

Returns:

The state dictionary containing the current state.

Return type:

Mapping[str, Any]
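Since the state is identical on all ranks, a single copy can be stored with the training checkpoint. The following sketch shows one possible way to do that. The "ft_state" key, the helper functions and StubClient are all hypothetical; only state_dict and load_state_dict come from this page.

```python
from typing import Any, Dict, Mapping

FT_KEY = "ft_state"  # hypothetical checkpoint key chosen for this sketch


class StubClient:
    """Hypothetical stand-in that keeps its state in a plain dict."""

    def __init__(self) -> None:
        self._state: Dict[str, Any] = {}

    def state_dict(self) -> Mapping[str, Any]:
        return dict(self._state)

    def load_state_dict(self, state: Mapping[str, Any]) -> None:
        self._state = dict(state)


def add_ft_state(checkpoint: Mapping[str, Any], client: Any) -> Dict[str, Any]:
    """Return a copy of the checkpoint with the client state attached."""
    out = dict(checkpoint)
    out[FT_KEY] = dict(client.state_dict())
    return out


def restore_ft_state(checkpoint: Mapping[str, Any], client: Any) -> bool:
    """Load client state from the checkpoint; return False if none was stored."""
    state = checkpoint.get(FT_KEY)
    if state is None:
        return False
    client.load_state_dict(state)
    return True
```

Returning False for a checkpoint without stored state lets a job resume from older checkpoints without special handling.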

exception nvidia_resiliency_ext.fault_tolerance.rank_monitor_client.RankMonitorClientError[source]

Bases: Exception