Config

class nvidia_resiliency_ext.fault_tolerance.config.FaultToleranceConfig(workload_check_interval=5.0, initial_rank_heartbeat_timeout=3600.0, rank_heartbeat_timeout=2700.0, rank_section_timeouts=<factory>, rank_out_of_section_timeout=None, node_health_check_interval=5.0, safety_factor=5.0, rank_termination_signal=Signals.SIGKILL, log_level=20, restart_check_interval=60.0, enable_nic_monitor=True, pci_topo_file=None, link_down_path_template=None)

Bases: object

Configuration of the fault tolerance components.

workload_check_interval [float] interval (in seconds) between periodic rank checks in rank monitors.
initial_rank_heartbeat_timeout [float|None] timeout (in seconds) for the first heartbeat from a rank.

The first heartbeat usually takes longer to arrive because the rank needs to initialize. If a rank does not send its first heartbeat within initial_rank_heartbeat_timeout, a failure is detected.

rank_heartbeat_timeout [float|None] timeout (in seconds) for subsequent heartbeats from a rank.

If no heartbeat is received from a rank within rank_heartbeat_timeout, a failure is detected.

safety_factor [float] when timeouts are deduced, observed intervals are multiplied by this factor to obtain the timeouts.
rank_termination_signal [Signals] signal used to terminate a rank when a failure is detected.
log_level [int] log level of the fault tolerance components.
rank_section_timeouts [Mapping[str,float|None]] timeouts for specific sections in user code.
rank_out_of_section_timeout [float|None] timeout for the implicit (default) section, which spans code not wrapped in any other section.
restart_check_interval [float] interval between checks for whether a restart is in progress; needed for the layered restart protocol.
enable_nic_monitor [bool] enables NIC health monitoring during training.
pci_topo_file [str|None] PCI topology file that describes the GPU and NIC topology.
link_down_path_template [str|None] template path for NIC link-down files. Should contain a ‘{dev_name}’ placeholder, which is replaced with the actual NIC device name.

A timeout set to None has no effect (as if it were +INF). All timeouts can be deduced and set at runtime.
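
The interplay between safety_factor and None timeouts can be illustrated with a minimal sketch. This is not the library's implementation: the helper names deduce_timeout and is_timed_out are hypothetical, and taking the longest observed interval as the basis for deduction is an assumption (the description above only says that observed intervals are multiplied by the factor).

```python
import math
from typing import Optional, Sequence


def deduce_timeout(observed_intervals: Sequence[float],
                   safety_factor: float = 5.0) -> Optional[float]:
    """Hypothetical helper: scale the longest observed interval by safety_factor.

    Using max() here is an assumption made for illustration only.
    """
    if not observed_intervals:
        return None  # nothing observed yet, so the timeout stays disabled
    return max(observed_intervals) * safety_factor


def is_timed_out(elapsed: float, timeout: Optional[float]) -> bool:
    """A None timeout never fires, as if it were +INF."""
    limit = math.inf if timeout is None else timeout
    return elapsed > limit


timeout = deduce_timeout([10.0, 12.0], safety_factor=5.0)
print(timeout)
print(is_timed_out(61.0, timeout), is_timed_out(1e9, None))
```

A larger safety_factor makes false positives less likely at the cost of slower failure detection.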

Parameters:

workload_check_interval (float)
initial_rank_heartbeat_timeout (float | None)
rank_heartbeat_timeout (float | None)
rank_section_timeouts (Mapping[str, float | None])
rank_out_of_section_timeout (float | None)
node_health_check_interval (float)
safety_factor (float)
rank_termination_signal (Signals)
log_level (int)
restart_check_interval (float)
enable_nic_monitor (bool)
pci_topo_file (str | None)
link_down_path_template (str | None)

static from_args(args)

Init FT config object from parsed CLI args.

Implements the following logic:

- Use the default FT config as a base.
- If a config file argument is defined, first try to read the FT config from the file.
- Update the FT config with FT args provided via CLI.
- If the config cannot be read from the file and there are no related args in CLI, raise an exception.

Parameters:

args (argparse.Namespace) – Parsed arguments.
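
The precedence described for from_args (defaults, then config file, then CLI) might look roughly like the sketch below. It uses plain dictionaries instead of the real class; resolve_config and DEFAULTS are hypothetical names, treating a CLI value of None as "not provided" is an assumption, and ValueError is a placeholder because the documentation does not name the exception type.

```python
from typing import Any, Dict, Optional

# Subset of the defaults shown in the class signature.
DEFAULTS: Dict[str, Any] = {
    "workload_check_interval": 5.0,
    "rank_heartbeat_timeout": 2700.0,
    "safety_factor": 5.0,
}


def resolve_config(file_cfg: Optional[Dict[str, Any]],
                   cli_args: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper mirroring the documented precedence.

    file_cfg is None when no config file could be read.
    A CLI value of None is assumed to mean "flag not given".
    """
    cli_overrides = {k: v for k, v in cli_args.items() if v is not None}
    if file_cfg is None and not cli_overrides:
        # Exception type is an assumption; the docs only say "an exception".
        raise ValueError("no FT config file and no FT args on the command line")
    cfg = dict(DEFAULTS)          # 1. start from the defaults
    cfg.update(file_cfg or {})    # 2. apply values read from the file
    cfg.update(cli_overrides)     # 3. CLI args take precedence
    return cfg


cfg = resolve_config({"safety_factor": 3.0},
                     {"rank_heartbeat_timeout": 600.0, "safety_factor": None})
print(cfg)
```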

static from_kwargs(ignore_not_recognized=True, **kwargs)

Create a FaultToleranceConfig object from keyword arguments.

Parameters:

ignore_not_recognized (bool, optional) – Whether to ignore unrecognized arguments. Defaults to True.
**kwargs – Keyword arguments representing the fields of the FaultToleranceConfig object.

Returns:

The created FaultToleranceConfig object.

Return type:

FaultToleranceConfig

Raises:

ValueError – If there are unrecognized arguments and ignore_not_recognized is False.
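
The effect of ignore_not_recognized can be sketched with a small stand-in dataclass. MiniConfig is hypothetical and carries only three of the real fields (with the defaults from the class signature); the function below imitates the documented behavior and is not the library's code.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class MiniConfig:
    """Stand-in for FaultToleranceConfig with a subset of its fields."""
    workload_check_interval: float = 5.0
    rank_heartbeat_timeout: Optional[float] = 2700.0
    safety_factor: float = 5.0


def from_kwargs(ignore_not_recognized: bool = True, **kwargs) -> MiniConfig:
    """Build a MiniConfig, dropping or rejecting unknown keys."""
    known = {f.name for f in fields(MiniConfig)}
    unknown = sorted(set(kwargs) - known)
    if unknown and not ignore_not_recognized:
        raise ValueError(f"unrecognized arguments: {unknown}")
    return MiniConfig(**{k: v for k, v in kwargs.items() if k in known})


# "bogus" is silently dropped because ignore_not_recognized defaults to True.
cfg = from_kwargs(safety_factor=2.0, bogus=1)
print(cfg)
```

Passing ignore_not_recognized=False is useful for catching misspelled option names early.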

static from_yaml_file(cfg_path, ignore_not_recognized=True)

Load the fault tolerance configuration from a YAML file.

The YAML file should contain a fault_tolerance section, either at the top level or nested in any other section.

Parameters:

cfg_path (str) – The path to the YAML configuration file.
ignore_not_recognized (bool, optional) – Whether to ignore unrecognized configuration options. Defaults to True.

Returns:

The fault tolerance configuration object.

Return type:

FaultToleranceConfig

Raises:

ValueError – If the ‘fault_tolerance’ section is not found in the config file.
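
Locating a fault_tolerance section that may sit at any nesting depth can be sketched as a recursive search over the parsed YAML content, represented here as nested dictionaries. The function names, the depth-first search order, and the exp_manager section name are assumptions made for illustration; the library's actual lookup may differ.

```python
from typing import Any, Optional


def find_fault_tolerance_section(node: Any) -> Optional[dict]:
    """Depth-first search for a 'fault_tolerance' key in nested dicts."""
    if isinstance(node, dict):
        if "fault_tolerance" in node:
            return node["fault_tolerance"]
        for value in node.values():
            found = find_fault_tolerance_section(value)
            if found is not None:
                return found
    return None


def load_section(parsed: Any) -> dict:
    """Return the section or raise, matching the documented ValueError."""
    section = find_fault_tolerance_section(parsed)
    if section is None:
        raise ValueError("'fault_tolerance' section not found in the config file")
    return section


# Parsed content of two hypothetical YAML files.
top_level = {"fault_tolerance": {"safety_factor": 3.0}}
nested = {"exp_manager": {"fault_tolerance": {"rank_heartbeat_timeout": 600.0}}}

print(load_section(top_level))
print(load_section(nested))
```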

to_yaml_file(cfg_path)

Serialize the configuration object to YAML and save it to the specified path.

Parameters:

cfg_path (str) – The path to save the YAML file.

Returns:

None

Return type:

None