Client

class nvidia_resiliency_ext.fault_tolerance.rank_monitor_client.RankMonitorClient[source]

Bases: object

RankMonitorClient is a client for RankMonitorServer. Its instances are created in each rank process. After creation, IPC connection can be established with RankMonitorServer using .init_workload_monitoring.

RankMonitorClient provides 2 independent functionalities that can be used for hang detection:

1. Heartbeat mechanism: .send_heartbeat method sends a heartbeat to the server. RankMonitorServer monitors time elapsed between heartbeats and detects hangs, based on two timeouts: - initial timeout, is a maximum time between client initialization and the first heartbeat - subsequent timeout, is a maximum time between two consecutive heartbeats

2. Section mechanism: .start_section and .end_section are used to wrap some sections of user code. Sections are identified by a user-provided name. User can configure timeouts for each section in the FT config. Hang is detected if a section is open for too long. To ensure that code that is not wrapped in any section is not running for too long (hung), there is additional “out-of-section” timeout, which is active while no section is open.

RankMonitorClient can estimate suitable timeouts for the heartbeats and sections, that will be used instead of values provided in the FT config. If there are timeouts predefined in the FT config and timeouts calculated, the calculated timeouts always take precedence.

If a timeout value is None it means it’s not used (as if it was +inf).

Currently used timeouts can be read from .hb_timeouts and .section_timeouts fields. New timeouts can be calculated and set with .calculate_and_set_hb_timeouts and .calculate_and_set_section_timeouts.

Stateful protocol (.state_dict() .load_state_dict()) is used to persist the state of the client, e.g. calculated timeouts.

RankMonitorClient logger is used for logging.

Basic initialization of RankMonitorClient instance. .init_workload_monitoring() and .load_state_dict() need to be called to fully initialize. Full FT configuration will be obtained from the server via IPC.

calculate_and_set_hb_timeouts(skip_if_not_ready=False)[source]

Calculates and sets heartbeat timeouts used for hang detection.

NOTE: this call synchronizes the calculated timeouts across all ranks.

Parameters:

skip_if_not_ready (bool, optional) – If True, silently skips the calculation if there is not enough data collected. Otherwise error will be raised. Defaults to False.

Returns:

True if the timeouts were calculated and set successfully. False is returned only

if calculation was not possible and skip_if_not_ready was True.

Return type:

bool

calculate_and_set_section_timeouts(selected_sections=None, calc_out_of_section=True, skip_if_not_ready=False)[source]

Calculates and sets section timeouts used for hang detection.

NOTE: this call synchronizes the calculated timeouts across all ranks.

Parameters:
  • selected_sections (List[str], optional) – List of sections which timeouts should be updated, If not provided (None) all section timeouts will be updated

  • calc_out_of_section (bool) – (bool): Determines if “out of section” timeout should be updated. Defaults to True.

  • skip_if_not_ready (bool, optional) – If True, silently skips the calculation if there is not enough data collected. Otherwise error will be raised. Defaults to False.

Returns:

True if the timeouts were calculated and set successfully. False is returned only

if calculation was not possible and skip_if_not_ready was True.

Return type:

bool

end_all_sections()[source]

Closes all currently opened sections. Does nothing if there are no sections open.

Return type:

None

end_section(section)[source]

Close the section with the given name.

NOTE: The section must be opened.

Parameters:

section (str) – User defined name of the section.

Return type:

None

init_workload_monitoring()[source]

Initializes the fault tolerance and connects to the RankMonitorServer.

Return type:

None

load_state_dict(state)[source]

Loads the state of the RankMonitorClient from a dictionary.

Can be called at any momemnt e.g. before init_workload_monitoring or after.

Parameters:

state (Mapping[str, Any]) – (Mapping[str, Any]): The state as returend from the state_dict method.

Return type:

None

send_heartbeat()[source]

Sends a empty (not containing a state) heartbeat message to the rank monitor server.

Return type:

None

send_workload_control_request(req)[source]

Request an workload related action. It is sent to the ft_launcher and affects the subsequent rendezvous.

Parameters:

req (WorkloadControlRequest) – request specification

shutdown_workload_monitoring()[source]

Shutdown the workload monitoring and close the connection to the RankMonitorServer.

start_section(section)[source]

Starts a new timed section with the given name.

NOTE: Different sections can be arbitraly aranged (nested, partially or fully overlapping).

but the same section name cannot be opened twice without closing it first.

Parameters:

section (str) – User defined name of the section.

Return type:

None

state_dict()[source]

Returns the state dictionary of this RankMonitorClient object.

NOTE: this method returns the same values on all ranks,

there are no rank-specific values in RankMonitorClient state.

Returns:

The state dictionary containing the current state.

Return type:

Mapping[str, Any]

exception nvidia_resiliency_ext.fault_tolerance.rank_monitor_client.RankMonitorClientError[source]

Bases: Exception