Config
- class nvidia_resiliency_ext.fault_tolerance.config.FaultToleranceConfig(workload_check_interval=5.0, initial_rank_heartbeat_timeout=3600.0, rank_heartbeat_timeout=2700.0, safety_factor=5.0, rank_termination_signal=Signals.SIGKILL, log_level=20)
Bases:
object
- Configuration of fault tolerance.
- Parameters:
workload_check_interval [float] interval (in seconds) between periodic rank checks in rank monitors.
initial_rank_heartbeat_timeout [float] timeout (in seconds) for the first heartbeat from a rank. The first heartbeat usually takes longer to arrive, because the rank needs to initialize. If a rank does not send its first heartbeat within initial_rank_heartbeat_timeout, a failure is detected. If None, this timeout must be deduced and set at runtime, based on the observed heartbeat intervals.
rank_heartbeat_timeout [float] timeout (in seconds) for subsequent heartbeats from a rank. If no rank heartbeat is received within rank_heartbeat_timeout, a failure is detected. If None, this timeout must be deduced and set at runtime, based on the observed heartbeat intervals.
safety_factor [float] when deducing the timeouts, observed heartbeat intervals are multiplied by this factor to obtain the timeouts.
rank_termination_signal signal used to terminate the rank when a failure is detected.
log_level log level of the fault tolerance components.
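The role of safety_factor can be illustrated with a small worked example. This is a minimal sketch, not the library's implementation: the helper `deduce_timeout` is hypothetical, and using the largest observed interval as the base is an assumption (the description only states that observed heartbeat intervals are multiplied by the factor).

```python
from typing import Sequence


def deduce_timeout(observed_intervals: Sequence[float], safety_factor: float = 5.0) -> float:
    """Hypothetical helper: scale the largest observed heartbeat interval."""
    if not observed_intervals:
        raise ValueError("need at least one observed heartbeat interval")
    # Assumption: the longest gap seen so far is the base for the timeout.
    return safety_factor * max(observed_intervals)


# Heartbeats observed 1.0 s, 2.5 s and 2.0 s apart, with the default factor of 5.0.
timeout = deduce_timeout([1.0, 2.5, 2.0])
```

A larger factor tolerates more jitter between heartbeats, at the cost of detecting a hung rank later.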
- static from_args(args, cfg_file_arg=None, ft_args_prefix='')
Init FT config object from parsed CLI args.
Implements the following logic:
- Use the default FT config as a base.
- If a config file argument is defined, first try to read the FT config from the file.
- Update the FT config with the FT args provided via CLI.
- If the config cannot be read from the file and there are no related args in the CLI, raise an exception.
- Parameters:
args (argparse.Namespace) – Parsed arguments
cfg_file_arg (str, optional) – Name of the argument that contains the path to the FT config YAML file. Defaults to None, in which case no file is read.
ft_args_prefix (str, optional) – Prefix of the FT related args. Defaults to an empty string, meaning no prefix.
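The precedence described for from_args (defaults, then file, then CLI) can be sketched without the library. Everything named here is hypothetical: `SketchConfig` is a two-field stand-in, `merge_config` takes already-parsed file values instead of reading YAML, the `ft_` prefix is only an example, and raising ValueError is an assumption since the source only says "an exception".

```python
import argparse
from dataclasses import dataclass, fields, replace
from typing import Any, Mapping, Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in with two of the documented fields."""
    workload_check_interval: float = 5.0
    safety_factor: float = 5.0


def merge_config(
    args: argparse.Namespace,
    file_values: Optional[Mapping[str, Any]] = None,
    ft_args_prefix: str = "",
) -> SketchConfig:
    cfg = SketchConfig()  # 1. start from the defaults
    found_any = False
    if file_values is not None:  # 2. values read from a config file, if any
        cfg = replace(cfg, **file_values)
        found_any = True
    # 3. CLI values override whatever came from the file
    cli = {}
    for f in fields(SketchConfig):
        value = getattr(args, ft_args_prefix + f.name, None)
        if value is not None:
            cli[f.name] = value
    if cli:
        cfg = replace(cfg, **cli)
        found_any = True
    if not found_any:  # 4. nothing from the file and nothing from the CLI
        raise ValueError("no fault tolerance config in file or CLI args")
    return cfg


args = argparse.Namespace(ft_safety_factor=3.0)
cfg = merge_config(args, {"workload_check_interval": 10.0}, ft_args_prefix="ft_")
```

Here the file supplies workload_check_interval and the CLI supplies safety_factor, so the result combines both sources.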
- static from_kwargs(ignore_not_recognized=True, **kwargs)
Create a FaultToleranceConfig object from keyword arguments.
- Parameters:
ignore_not_recognized (bool, optional) – Whether to ignore unrecognized arguments. Defaults to True.
**kwargs – Keyword arguments representing the fields of the FaultToleranceConfig object.
- Returns:
The created FaultToleranceConfig object.
- Return type:
FaultToleranceConfig
- Raises:
ValueError – If there are unrecognized arguments and ignore_not_recognized is False.
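The documented behavior of ignore_not_recognized can be sketched with a stand-in dataclass. `SketchConfig` and `config_from_kwargs` are hypothetical names used only for illustration; the real class has more fields and its internals may differ.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class SketchConfig:
    """Hypothetical stand-in with three of the documented fields."""
    workload_check_interval: float = 5.0
    safety_factor: float = 5.0
    log_level: int = 20


def config_from_kwargs(ignore_not_recognized: bool = True, **kwargs: Any) -> SketchConfig:
    known = {f.name for f in fields(SketchConfig)}
    unknown = sorted(set(kwargs) - known)
    if unknown and not ignore_not_recognized:
        raise ValueError(f"unrecognized arguments: {unknown}")
    # Keep only the keys that match a field of the config.
    return SketchConfig(**{k: v for k, v in kwargs.items() if k in known})


# The unknown key is silently dropped because ignore_not_recognized defaults to True.
cfg = config_from_kwargs(safety_factor=3.0, not_a_field=1)
```

Passing ignore_not_recognized=False turns a misspelled field name into an immediate error instead of a silently ignored setting.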
- static from_yaml_file(cfg_path, ignore_not_recognized=True)
Load the fault tolerance configuration from a YAML file.
The YAML file should contain a fault_tolerance section, which can be at the top level or nested in any other section.
- Parameters:
- Returns:
The fault tolerance configuration object.
- Return type:
FaultToleranceConfig
- Raises:
ValueError – If the ‘fault_tolerance’ section is not found in the config file.
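Locating a fault_tolerance section that may be nested at any depth can be sketched as a recursive search. This is an illustration under assumptions, not the library's code: `find_section` and `load_section` are hypothetical, the input is an already-parsed YAML document represented as nested dicts, and the depth-first, first-match search order is a guess.

```python
from typing import Any, Mapping, Optional


def find_section(data: Any, name: str = "fault_tolerance") -> Optional[Any]:
    """Return the first section called `name`, searching nested mappings depth-first."""
    if isinstance(data, Mapping):
        if name in data:
            return data[name]
        for value in data.values():
            found = find_section(value, name)
            if found is not None:
                return found
    return None


def load_section(parsed: Mapping[str, Any]) -> Any:
    section = find_section(parsed)
    # In this sketch an empty section (None) is treated the same as a missing one.
    if section is None:
        raise ValueError("'fault_tolerance' section not found in the config")
    return section


# Example document with the section nested under another key.
parsed = {"trainer": {"plugins": {"fault_tolerance": {"safety_factor": 4.0}}}}
section = load_section(parsed)
```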