Config

class nvidia_resiliency_ext.fault_tolerance.config.FaultToleranceConfig(workload_check_interval=5.0, initial_rank_heartbeat_timeout=3600.0, rank_heartbeat_timeout=2700.0, rank_section_timeouts=<factory>, rank_out_of_section_timeout=None, node_health_check_interval=5.0, safety_factor=5.0, rank_termination_signal=Signals.SIGKILL, log_level=20, restart_check_interval=60.0, enable_nic_monitor=True, pci_topo_file=None, link_down_path_template=None)[source]

Bases: object

Configuration of the fault tolerance components.

  • workload_check_interval [float] periodic rank check interval (in seconds) in rank monitors.

  • initial_rank_heartbeat_timeout [float|None] timeout (in seconds) for the first heartbeat from a rank.

The first heartbeat usually takes longer to arrive because the rank needs to initialize. If a rank does not send its first heartbeat within initial_rank_heartbeat_timeout, a failure is detected.

  • rank_heartbeat_timeout [float|None] timeout (in seconds) for subsequent heartbeats from a rank.

If no rank heartbeat is received within rank_heartbeat_timeout, a failure is detected.

  • safety_factor [float] when timeouts are deduced, the observed intervals are multiplied by this factor to obtain them.

  • rank_termination_signal [Signals] signal used to terminate the rank when a failure is detected.

  • log_level [int] log level of the fault tolerance components.

  • rank_section_timeouts [Mapping[str, float|None]] timeouts (in seconds) for specific sections in user code.

  • rank_out_of_section_timeout [float|None] timeout (in seconds) for the implicit (default) section, which spans all code not wrapped in any other section.

  • restart_check_interval [float] interval between checks whether a restart is in progress; needed for the layered restart protocol.

  • enable_nic_monitor [bool] enable NIC health monitoring in training.

  • pci_topo_file [str|None] PCI topology file that describes the GPU and NIC topology.

  • link_down_path_template [str|None] template path for NIC link-down files. It should contain a ‘{dev_name}’ placeholder, which is replaced with the actual NIC device name.

A timeout set to None has no effect (as if it were +INF). All timeouts can be deduced and set at runtime.
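The two rules above (scaling observed intervals by safety_factor, and treating None as "never expires") can be illustrated with a small sketch. The helpers deduce_timeout and is_timed_out are hypothetical and are not part of the package; in particular, scaling the largest observed interval is an assumption, since the text only says that observed intervals are multiplied by the factor.

```python
import math
from typing import Iterable, Optional


def deduce_timeout(observed_intervals: Iterable[float],
                   safety_factor: float = 5.0) -> Optional[float]:
    """Hypothetical helper: scale the largest observed interval.

    Assumption: the maximum interval is the one being scaled.
    """
    intervals = list(observed_intervals)
    if not intervals:
        return None  # nothing observed yet, so leave the timeout unset
    return safety_factor * max(intervals)


def is_timed_out(elapsed: float, timeout: Optional[float]) -> bool:
    """A None timeout behaves like +INF, so it can never expire."""
    effective = math.inf if timeout is None else timeout
    return elapsed > effective


# Heartbeat gaps of 10 s, 12.5 s and 11 s with the default factor of 5.0.
timeout = deduce_timeout([10.0, 12.5, 11.0], safety_factor=5.0)
assert timeout == 62.5
assert is_timed_out(63.0, timeout)
assert not is_timed_out(1e9, None)
```

A larger safety_factor trades slower failure detection for fewer false positives when heartbeat intervals fluctuate.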

Parameters:
  • workload_check_interval (float)

  • initial_rank_heartbeat_timeout (float | None)

  • rank_heartbeat_timeout (float | None)

  • rank_section_timeouts (Mapping[str, float | None])

  • rank_out_of_section_timeout (float | None)

  • node_health_check_interval (float)

  • safety_factor (float)

  • rank_termination_signal (Signals)

  • log_level (int)

  • restart_check_interval (float)

  • enable_nic_monitor (bool)

  • pci_topo_file (str | None)

  • link_down_path_template (str | None)

static from_args(args, cfg_file_arg=None, ft_args_prefix='')[source]

Initialize an FT config object from parsed CLI args.

Implements the following logic:

  • Use the default FT config as a base.

  • If a config file argument is defined, first try to read the FT config from the file.

  • Update the FT config with the FT args provided via CLI.

  • If the config cannot be read from the file and there are no related args in CLI, raise an exception.

Parameters:
  • args (argparse.Namespace) – Parsed arguments

  • cfg_file_arg (str, optional) – Name of the argument that contains the path to the FT config YAML file. Defaults to None, in which case no file is read.

  • ft_args_prefix (str, optional) – Prefix of the FT related args. Defaults to an empty string, meaning no prefix.
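The precedence described for from_args (defaults, then file, then CLI) can be sketched without the package installed. Everything named here is hypothetical: SketchConfig is a two-field stand-in for FaultToleranceConfig, config_from_args is not the library function, load_file is an injected reader that replaces YAML parsing, and ValueError is an assumed exception type, since the text only says "raise an exception".

```python
import argparse
from dataclasses import dataclass, fields, replace
from typing import Callable, Mapping, Optional


@dataclass
class SketchConfig:
    # Hypothetical stand-in with two of the documented fields and defaults.
    workload_check_interval: float = 5.0
    safety_factor: float = 5.0


def config_from_args(args: argparse.Namespace,
                     cfg_file_arg: Optional[str] = None,
                     ft_args_prefix: str = "",
                     load_file: Optional[Callable[[str], Mapping]] = None):
    names = {f.name for f in fields(SketchConfig)}
    cfg = SketchConfig()  # 1. start from the defaults

    # 2. If a config file argument is defined, try the file first.
    read_from_file = False
    path = getattr(args, cfg_file_arg, None) if cfg_file_arg else None
    if path is not None and load_file is not None:
        try:
            values = load_file(path)
            cfg = replace(cfg, **{k: v for k, v in values.items() if k in names})
            read_from_file = True
        except (OSError, ValueError):
            pass  # fall through to the CLI args

    # 3. CLI args (prefix + field name) override the file values.
    cli = {}
    for name in names:
        value = getattr(args, ft_args_prefix + name, None)
        if value is not None:
            cli[name] = value
    cfg = replace(cfg, **cli)

    # 4. Neither source provided anything: report it.
    if not read_from_file and not cli:
        raise ValueError("no readable FT config file and no FT args in CLI")
    return cfg


args = argparse.Namespace(cfg="ft.yaml", ft_safety_factor=None,
                          ft_workload_check_interval=1.0)
cfg = config_from_args(args, cfg_file_arg="cfg", ft_args_prefix="ft_",
                       load_file=lambda p: {"safety_factor": 3.0,
                                            "workload_check_interval": 9.0})
assert cfg == SketchConfig(workload_check_interval=1.0, safety_factor=3.0)
```

In the usage above, the file sets both fields, and the CLI value for workload_check_interval then overrides the file value.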

static from_kwargs(ignore_not_recognized=True, **kwargs)[source]

Create a FaultToleranceConfig object from keyword arguments.

Parameters:
  • ignore_not_recognized (bool, optional) – Whether to ignore unrecognized arguments. Defaults to True.

  • **kwargs – Keyword arguments representing the fields of the FaultToleranceConfig object.

Returns:

The created FaultToleranceConfig object.

Return type:

FaultToleranceConfig

Raises:

ValueError – If there are unrecognized arguments and ignore_not_recognized is False.
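The effect of ignore_not_recognized can be illustrated with a minimal sketch. SketchConfig and sketch_from_kwargs are hypothetical stand-ins, not the library code; only the documented behaviour (drop unknown keys by default, raise ValueError otherwise) is taken from the description above.

```python
from dataclasses import dataclass, fields


@dataclass
class SketchConfig:
    # Hypothetical stand-in with two of the documented fields and defaults.
    workload_check_interval: float = 5.0
    safety_factor: float = 5.0


def sketch_from_kwargs(ignore_not_recognized: bool = True, **kwargs) -> SketchConfig:
    known = {f.name for f in fields(SketchConfig)}
    unknown = sorted(set(kwargs) - known)
    if unknown and not ignore_not_recognized:
        raise ValueError(f"Unrecognized arguments: {unknown}")
    # Unknown keys are silently dropped; known keys override the defaults.
    return SketchConfig(**{k: v for k, v in kwargs.items() if k in known})


# A misspelled key is ignored by default, so the default value stays in place.
cfg = sketch_from_kwargs(safety_factor=3.0, workload_chek_interval=1.0)
assert cfg.safety_factor == 3.0
assert cfg.workload_check_interval == 5.0
```

Passing ignore_not_recognized=False is the safer choice when the keyword arguments come from hand-written configuration, because a misspelled field name then fails loudly instead of leaving the default in place.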

static from_yaml_file(cfg_path, ignore_not_recognized=True)[source]

Load the fault tolerance configuration from a YAML file.

The YAML file must contain a fault_tolerance section, which can be at the top level or nested in any other section.

Parameters:
  • cfg_path (str) – The path to the YAML configuration file.

  • ignore_not_recognized (bool, optional) – Whether to ignore unrecognized configuration options. Defaults to True.

Returns:

The fault tolerance configuration object.

Return type:

FaultToleranceConfig

Raises:

ValueError – If the ‘fault_tolerance’ section is not found in the config file.
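The "top level or nested" lookup can be sketched as a recursive search over the parsed YAML, represented here as plain dicts so that no YAML library is needed. find_section and load_ft_section are hypothetical helpers, and the depth-first search order is an assumption; the library may locate the section differently.

```python
from collections.abc import Mapping
from typing import Any, Optional


def find_section(node: Any, name: str = "fault_tolerance") -> Optional[Mapping]:
    """Depth-first search for a mapping stored under the key `name`."""
    if isinstance(node, Mapping):
        section = node.get(name)
        if isinstance(section, Mapping):
            return section
        for child in node.values():
            found = find_section(child, name)
            if found is not None:
                return found
    return None


def load_ft_section(parsed_yaml: Any) -> Mapping:
    section = find_section(parsed_yaml)
    if section is None:
        raise ValueError("'fault_tolerance' section not found in config file")
    return section


# The section may sit at the top level or inside another section.
top_level = {"fault_tolerance": {"safety_factor": 3.0}}
nested = {"trainer": {"plugins": {"fault_tolerance": {"safety_factor": 3.0}}}}
assert load_ft_section(top_level) == {"safety_factor": 3.0}
assert load_ft_section(nested) == {"safety_factor": 3.0}
```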

to_yaml_file(cfg_path)[source]

Serialize the configuration object to YAML and save it to the specified path.

Parameters:

cfg_path (str) – The path to save the YAML file.

Returns:

None

Return type:

None