Config
- class nvidia_resiliency_ext.fault_tolerance.config.FaultToleranceConfig(workload_check_interval=5.0, initial_rank_heartbeat_timeout=3600.0, rank_heartbeat_timeout=2700.0, rank_section_timeouts=<factory>, rank_out_of_section_timeout=None, node_health_check_interval=5.0, safety_factor=5.0, rank_termination_signal=Signals.SIGKILL, log_level=20, restart_check_interval=60.0, enable_nic_monitor=True, pci_topo_file=None, link_down_path_template=None)
Bases: object
Configuration of the fault tolerance components.
workload_check_interval [float] interval (in seconds) between periodic rank checks in rank monitors.
initial_rank_heartbeat_timeout [float|None] timeout (in seconds) for the first heartbeat from a rank.
The first heartbeat usually takes longer to arrive, because the rank needs to initialize. If a rank does not send its first heartbeat within initial_rank_heartbeat_timeout, a failure is detected.
rank_heartbeat_timeout [float|None] timeout (in seconds) for subsequent heartbeats from a rank.
If no heartbeat is received from a rank within rank_heartbeat_timeout, a failure is detected.
safety_factor [float] when timeouts are deduced, observed intervals are multiplied by this factor to obtain the timeouts.
rank_termination_signal [Signals] signal used to terminate a rank when a failure is detected.
log_level [int] log level of the fault tolerance components.
rank_section_timeouts [Mapping[str,float|None]] timeouts for specific sections in user code.
rank_out_of_section_timeout [float|None] timeout for the implicit/default section, which spans code not wrapped in any other section.
restart_check_interval [float] interval between checks of whether a restart is in progress; needed for the layered restart protocol.
enable_nic_monitor [bool] enables NIC health monitoring during training.
pci_topo_file [str|None] PCI topology file that describes the GPU and NIC topology.
link_down_path_template [str|None] template path for NIC link-down files. Should contain a ‘{dev_name}’ placeholder, which is replaced with the actual NIC device name.
A timeout set to None has no effect (as if it were +INF). All timeouts can be deduced and set at runtime.
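The two timeout rules described above (scaling observed intervals by safety_factor, and treating None as "never fires") can be sketched as follows. This is an illustration only, not the library's implementation: the helper names deduce_timeout and has_timed_out are hypothetical, and taking the largest observed interval before scaling is an assumption, since the text only says that observed intervals are multiplied by the factor.

```python
import math
from typing import Optional, Sequence


def deduce_timeout(
    observed_intervals: Sequence[float], safety_factor: float = 5.0
) -> Optional[float]:
    """Hypothetical helper: scale the largest observed interval by safety_factor.

    Using max() is an assumption made for this sketch.
    """
    if not observed_intervals:
        return None  # nothing observed yet, so leave the timeout disabled
    return safety_factor * max(observed_intervals)


def has_timed_out(elapsed: float, timeout: Optional[float]) -> bool:
    """A timeout of None never fires, as if it were +INF."""
    effective = math.inf if timeout is None else timeout
    return elapsed > effective
```

With observed heartbeat intervals of 10, 12 and 11 seconds and the default safety_factor of 5.0, this sketch yields a 60-second timeout; a None timeout is never exceeded, however long the rank stays silent.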
- Parameters:
workload_check_interval (float)
initial_rank_heartbeat_timeout (float | None)
rank_heartbeat_timeout (float | None)
rank_section_timeouts (Mapping[str, float | None])
rank_out_of_section_timeout (float | None)
node_health_check_interval (float)
safety_factor (float)
rank_termination_signal (Signals)
log_level (int)
restart_check_interval (float)
enable_nic_monitor (bool)
pci_topo_file (str | None)
link_down_path_template (str | None)
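As a rough illustration of how a link_down_path_template value might be expanded, the sketch below substitutes the ‘{dev_name}’ placeholder for one NIC. The function name link_down_path, the example path under /tmp, and the device name mlx5_0 are all hypothetical; how the library itself performs the substitution is not documented here.

```python
from typing import Optional


def link_down_path(template: Optional[str], dev_name: str) -> Optional[str]:
    """Hypothetical helper: fill the '{dev_name}' placeholder for one NIC."""
    if template is None:
        return None  # no template configured
    if "{dev_name}" not in template:
        raise ValueError("template must contain the '{dev_name}' placeholder")
    # str.replace leaves any other braces in the path untouched.
    return template.replace("{dev_name}", dev_name)


# Hypothetical template and device name, for illustration only.
example = link_down_path("/tmp/nic_state/{dev_name}/link_down", "mlx5_0")
```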
- static from_args(args, cfg_file_arg=None, ft_args_prefix='')
Init FT config object from parsed CLI args.
Implements the following logic:
    - Use the default FT config as a base.
    - If a config file argument is defined, first try to read the FT config from the file.
    - Update the FT config with the FT args provided via the CLI.
    - If the config cannot be read from the file and there are no related args in the CLI, raise an exception.
- Parameters:
args (argparse.Namespace) – Parsed arguments
cfg_file_arg (str, optional) – Name of the argument that contains the path to the FT config YAML file. Defaults to None, in which case no file is read.
ft_args_prefix (str, optional) – Prefix of the FT-related args. Defaults to an empty string, meaning no prefix.
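The precedence described for from_args (defaults, then file, then CLI) can be sketched without the library as below. This is not the actual implementation: resolve_ft_settings and DEFAULTS are hypothetical stand-ins, the YAML file is represented by an already-parsed dict, treating a CLI value of None as "not provided" is an assumption, and ValueError is used only because the text does not say which exception is raised.

```python
import argparse
from typing import Any, Dict, Optional

# Stand-in defaults: two fields only, values taken from the class signature.
DEFAULTS: Dict[str, Any] = {"workload_check_interval": 5.0, "safety_factor": 5.0}


def resolve_ft_settings(
    args: argparse.Namespace,
    file_cfg: Optional[Dict[str, Any]] = None,
    ft_args_prefix: str = "",
) -> Dict[str, Any]:
    """Hypothetical sketch of the defaults -> file -> CLI precedence."""
    cli: Dict[str, Any] = {}
    for name, value in vars(args).items():
        # Assumption: a value of None means the flag was not given.
        if not name.startswith(ft_args_prefix) or value is None:
            continue
        field = name[len(ft_args_prefix):]
        if field in DEFAULTS:
            cli[field] = value
    if file_cfg is None and not cli:
        # Exception type is an assumption; the text only says "an exception".
        raise ValueError("no FT config file and no FT-related CLI args")
    merged = dict(DEFAULTS)         # 1. start from the defaults
    merged.update(file_cfg or {})   # 2. overlay values read from the file
    merged.update(cli)              # 3. CLI args take precedence
    return merged
```

In this sketch, a namespace holding ft_safety_factor=3.0 combined with a file that sets workload_check_interval to 10.0 and the prefix "ft_" resolves to both overrides, with the CLI value taking precedence over any file value for the same field.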
- static from_kwargs(ignore_not_recognized=True, **kwargs)
Create a FaultToleranceConfig object from keyword arguments.
- Parameters:
ignore_not_recognized (bool, optional) – Whether to ignore unrecognized arguments. Defaults to True.
**kwargs – Keyword arguments representing the fields of the FaultToleranceConfig object.
- Returns:
The created FaultToleranceConfig object.
- Return type:
FaultToleranceConfig
- Raises:
ValueError – If there are unrecognized arguments and ignore_not_recognized is False.
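The documented from_kwargs behaviour (silently dropping unknown keys unless ignore_not_recognized is False) might look roughly like the sketch below. DemoConfig is a two-field stand-in written for this example, not the real FaultToleranceConfig, and the error message text is made up.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class DemoConfig:
    """Stand-in with two fields; not the real FaultToleranceConfig."""

    workload_check_interval: float = 5.0
    rank_heartbeat_timeout: Optional[float] = 2700.0

    @staticmethod
    def from_kwargs(ignore_not_recognized: bool = True, **kwargs: Any) -> "DemoConfig":
        known = {f.name for f in fields(DemoConfig)}
        unknown = sorted(set(kwargs) - known)
        if unknown and not ignore_not_recognized:
            raise ValueError(f"Unrecognized arguments: {unknown}")
        # Keep only the keys that match dataclass fields.
        return DemoConfig(**{k: v for k, v in kwargs.items() if k in known})


# The unknown key is dropped because ignore_not_recognized defaults to True.
cfg = DemoConfig.from_kwargs(rank_heartbeat_timeout=None, typo_field=1)
```

Passing ignore_not_recognized=False is the stricter choice when the keyword arguments come from hand-written configuration, since a misspelled field name then fails loudly instead of being ignored.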
- static from_yaml_file(cfg_path, ignore_not_recognized=True)
Load the fault tolerance configuration from a YAML file.
The YAML file should contain a fault_tolerance section, which can be at the top level or nested in any other section.
- Parameters:
cfg_path – Path to the YAML config file.
ignore_not_recognized (optional) – Defaults to True.
- Returns:
The fault tolerance configuration object.
- Return type:
FaultToleranceConfig
- Raises:
ValueError – If the ‘fault_tolerance’ section is not found in the config file.
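Because the fault_tolerance section may sit at the top level or be nested inside other sections, locating it amounts to a recursive search over the parsed YAML tree. The sketch below works on an already-parsed dict so it needs no YAML library; find_section and require_section are hypothetical names, and the depth-first search order is an assumption about how ties between multiple matches would be resolved.

```python
from typing import Any, Dict, Optional


def find_section(tree: Any, name: str = "fault_tolerance") -> Optional[Dict[str, Any]]:
    """Hypothetical helper: depth-first search for a named mapping."""
    if not isinstance(tree, dict):
        return None
    if isinstance(tree.get(name), dict):
        return tree[name]  # found at this level
    for value in tree.values():
        found = find_section(value, name)
        if found is not None:
            return found
    return None


def require_section(tree: Any) -> Dict[str, Any]:
    """Raise ValueError when the section is missing, as documented above."""
    section = find_section(tree)
    if section is None:
        raise ValueError("'fault_tolerance' section not found in the config")
    return section


# Parsed form of a config where the section is nested two levels deep.
parsed = {"trainer": {"plugins": {"fault_tolerance": {"safety_factor": 3.0}}}}
section = require_section(parsed)
```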