Config

class nvidia_resiliency_ext.fault_tolerance.config.FaultToleranceConfig(workload_check_interval=5.0, initial_rank_heartbeat_timeout=3600.0, rank_heartbeat_timeout=2700.0, rank_section_timeouts=<factory>, rank_out_of_section_timeout=None, node_health_check_interval=5.0, safety_factor=5.0, rank_termination_signal=Signals.SIGKILL, log_level=20, restart_check_interval=60.0, enable_nic_monitor=False, pci_topo_file=None, link_down_path_template=None, skip_section_response=True, use_infra_group_rank=True)

Bases: object

Configuration of the fault tolerance components.

  • workload_check_interval [float] periodic rank check interval (in seconds) in rank monitors.

  • initial_rank_heartbeat_timeout [float|None] timeout (in seconds) for the first heartbeat from a rank.

The first heartbeat usually takes longer to arrive because the rank needs to initialize. If a rank does not send its first heartbeat within initial_rank_heartbeat_timeout, a failure is detected.

  • rank_heartbeat_timeout [float|None] timeout (in seconds) for subsequent heartbeats from a rank.

If no rank heartbeat is received within rank_heartbeat_timeout, a failure is detected.

  • safety_factor [float] when deducing the timeouts, observed intervals are multiplied by this factor to obtain the timeouts.

  • rank_termination_signal [Signals] signal used to terminate the rank when a failure is detected.

  • log_level [int] log level of the fault tolerance components.

  • rank_section_timeouts [Mapping[str, float|None]] timeouts for specific sections in user code. Only sections listed here will send IPC messages to the monitor server and collect timing data. Sections not in this mapping will have near-zero overhead (no IPC, no timing collection).

  • rank_out_of_section_timeout [float|None] timeout used for the implicit (default) section, which spans code not wrapped in any other section.

  • restart_check_interval [float] interval (in seconds) between checks of whether a restart is in progress; needed for the layered restart protocol.

  • enable_nic_monitor [bool] enable NIC health monitoring in training. Default: False.

  • pci_topo_file [str|None] PCI topology file that describes the GPU and NIC topology.

  • link_down_path_template [str|None] template path for NIC link down files. Should contain a ‘{dev_name}’ placeholder, which is replaced with the actual NIC device name.

  • skip_section_response [bool] if True, section and heartbeat messages are sent without waiting for a server response (unidirectional communication). This significantly reduces latency for high-frequency operations. The server logs errors instead of sending them back. Default: True (recommended for production). Set to False during development to catch errors immediately.

  • use_infra_group_rank [bool] if True, always use the infrastructure group rank for rank assignment. The rank is read from SLURM_PROCID (in SLURM environments) or GROUP_RANK (set by the launcher). Previous rank assignments are ignored to ensure consistency with the infrastructure’s rank assignment. Note: Hot spare/redundancy is NOT supported with this setting. Default: True.

A timeout set to None has no effect (as if it were +INF). All timeouts can be deduced and set at runtime.
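The timeout semantics described above can be sketched in a few lines of plain Python. This is an illustration only, not the library's implementation: the helper names effective_timeout, deduce_timeout and is_failure are hypothetical, and scaling the largest observed interval is an assumption (the documentation only states that observed intervals are multiplied by safety_factor).

```python
import math
from typing import Optional, Sequence


def effective_timeout(timeout: Optional[float]) -> float:
    """Treat a timeout of None as 'never expires' (+INF)."""
    return math.inf if timeout is None else timeout


def deduce_timeout(observed_intervals: Sequence[float], safety_factor: float = 5.0) -> float:
    """Hypothetical deduction rule: scale the largest observed interval.

    Which interval is scaled is an assumption made for this sketch.
    """
    if not observed_intervals:
        raise ValueError("need at least one observed interval")
    return max(observed_intervals) * safety_factor


def is_failure(seconds_since_last_heartbeat: float, timeout: Optional[float]) -> bool:
    """A failure is flagged once the elapsed time exceeds the timeout."""
    return seconds_since_last_heartbeat > effective_timeout(timeout)
```

With this sketch, a None timeout can never trigger a failure, and a larger safety_factor trades slower failure detection for fewer false positives.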

Parameters:
  • workload_check_interval (float)

  • initial_rank_heartbeat_timeout (float | None)

  • rank_heartbeat_timeout (float | None)

  • rank_section_timeouts (Mapping[str, float | None])

  • rank_out_of_section_timeout (float | None)

  • node_health_check_interval (float)

  • safety_factor (float)

  • rank_termination_signal (Signals)

  • log_level (int)

  • restart_check_interval (float)

  • enable_nic_monitor (bool)

  • pci_topo_file (str | None)

  • link_down_path_template (str | None)

  • skip_section_response (bool)

  • use_infra_group_rank (bool)

static from_args(args)

Initialize the FT config object from parsed CLI arguments.

Implements the following logic:
  • Use the default FT config as a base.

  • If a config file argument is defined, first try to read the FT config from the file.

  • Update the FT config with the FT args provided via the CLI.

  • If the config cannot be read from a file and there are no related args in the CLI, raise an exception.

Parameters:

args (argparse.Namespace) – Parsed arguments
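The precedence described for from_args (defaults, then file, then CLI) might be sketched as below. This is a stand-in built on plain dictionaries, not the library's code: resolve_config and DEFAULTS are hypothetical names, treating a None attribute as "not supplied on the CLI" is an assumption, and ValueError is chosen only because the documentation does not name the exception type.

```python
import argparse
from typing import Any, Dict, Optional

# Stand-in defaults for two documented fields (values taken from the class signature).
DEFAULTS: Dict[str, Any] = {"rank_heartbeat_timeout": 2700.0, "log_level": 20}


def resolve_config(
    args: argparse.Namespace,
    file_cfg: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Merge defaults, file values, and CLI values, in that order.

    `file_cfg` is None when no config file could be read.
    """
    # Assumption for this sketch: CLI fields that were not supplied are None.
    cli_cfg = {k: v for k, v in vars(args).items() if k in DEFAULTS and v is not None}
    if file_cfg is None and not cli_cfg:
        # The exception type is an assumption; the docs only say "an exception".
        raise ValueError("no FT config file and no FT arguments on the command line")
    merged = dict(DEFAULTS)        # 1. defaults as the base
    merged.update(file_cfg or {})  # 2. values read from the config file
    merged.update(cli_cfg)         # 3. CLI arguments take precedence
    return merged
```

The ordering means a value given on the command line always overrides the same key in the file, and the file overrides the built-in default.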

static from_kwargs(ignore_not_recognized=True, **kwargs)

Create a FaultToleranceConfig object from keyword arguments.

Parameters:
  • ignore_not_recognized (bool, optional) – Whether to ignore unrecognized arguments. Defaults to True.

  • **kwargs – Keyword arguments representing the fields of the FaultToleranceConfig object.

Returns:

The created FaultToleranceConfig object.

Return type:

FaultToleranceConfig

Raises:

ValueError – If there are unrecognized arguments and ignore_not_recognized is False.
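The effect of ignore_not_recognized can be illustrated with a small stand-in dataclass. DemoConfig below is hypothetical and carries only two of the documented fields; it mimics the documented behaviour (drop unknown keys by default, raise ValueError when ignore_not_recognized is False) without claiming to match the real implementation.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class DemoConfig:
    """Stand-in with two of the documented fields; not the real class."""

    workload_check_interval: float = 5.0
    rank_heartbeat_timeout: Optional[float] = 2700.0

    @staticmethod
    def from_kwargs(ignore_not_recognized: bool = True, **kwargs: Any) -> "DemoConfig":
        known = {f.name for f in fields(DemoConfig)}
        unknown = sorted(set(kwargs) - known)
        if unknown and not ignore_not_recognized:
            raise ValueError(f"unrecognized arguments: {unknown}")
        # Keep only recognized fields; anything else is silently dropped.
        return DemoConfig(**{k: v for k, v in kwargs.items() if k in known})
```

Passing ignore_not_recognized=False is useful while developing, since a misspelled field name then fails loudly instead of being silently dropped.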

static from_yaml_file(cfg_path, ignore_not_recognized=True)

Load the fault tolerance configuration from a YAML file.

The YAML file should contain a fault_tolerance section, which can be at the top level or nested in any other section.

Parameters:
  • cfg_path (str) – The path to the YAML configuration file.

  • ignore_not_recognized (bool, optional) – Whether to ignore unrecognized configuration options. Defaults to True.

Returns:

The fault tolerance configuration object.

Return type:

FaultToleranceConfig

Raises:

ValueError – If the ‘fault_tolerance’ section is not found in the config file.
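Because the fault_tolerance section may sit at any nesting depth, a loader has to search the parsed YAML tree for it. One way this lookup could work is sketched below on an already-parsed dictionary; find_section is a hypothetical helper, and the depth-first, first-match strategy is an assumption rather than the library's documented behaviour.

```python
from typing import Any, Dict, Optional


def find_section(tree: Any, name: str = "fault_tolerance") -> Optional[Dict[str, Any]]:
    """Depth-first search for the first mapping stored under key `name`."""
    if not isinstance(tree, dict):
        return None
    if isinstance(tree.get(name), dict):
        return tree[name]
    for value in tree.values():
        found = find_section(value, name)
        if found is not None:
            return found
    return None


# Both layouts resolve to the same kind of section.
top_level = {"fault_tolerance": {"rank_heartbeat_timeout": 600}}
nested = {"trainer": {"plugins": {"fault_tolerance": {"log_level": 10}}}}
print(find_section(top_level))
print(find_section(nested))
```

A caller following the documented contract would raise ValueError when the search returns None.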

to_yaml_file(cfg_path)

Serialize the configuration object to YAML and save it to the specified path.

Parameters:

cfg_path (str) – The path to save the YAML file.

Returns:

None

Return type:

None