Config

class nvidia_resiliency_ext.fault_tolerance.config.FaultToleranceConfig(workload_check_interval=5.0, initial_rank_heartbeat_timeout=3600.0, rank_heartbeat_timeout=2700.0, safety_factor=5.0, rank_termination_signal=Signals.SIGKILL, log_level=20)

Bases: object

Configuration of fault tolerance
  • workload_check_interval [float] periodic rank check interval (in seconds) in rank monitors.

  • initial_rank_heartbeat_timeout [float] timeout (in seconds) for the first heartbeat from a rank. The first heartbeat usually takes longer to arrive, because the rank needs to initialize. If a rank does not send its first heartbeat within initial_rank_heartbeat_timeout, a failure is detected. If None, this timeout must be deduced and set at runtime, based on the observed heartbeat intervals.

  • rank_heartbeat_timeout [float] timeout (in seconds) for subsequent heartbeats from a rank. If no heartbeat is received from a rank within rank_heartbeat_timeout, a failure is detected. If None, this timeout must be deduced and set at runtime, based on the observed heartbeat intervals.

  • safety_factor [float] when the timeouts are deduced, the observed heartbeat intervals are multiplied by this factor to obtain the timeouts.

  • rank_termination_signal signal used to terminate the rank when a failure is detected.

  • log_level log level of the fault tolerance components.

Parameters:
  • workload_check_interval (float)

  • initial_rank_heartbeat_timeout (float | None)

  • rank_heartbeat_timeout (float | None)

  • safety_factor (float)

  • rank_termination_signal (Signals)

  • log_level (int)
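The safety_factor description above says that deduced timeouts are observed heartbeat intervals multiplied by the factor. The sketch below illustrates that arithmetic only. The helper name deduce_timeout and the choice to scale the longest observed interval are assumptions made for illustration; they are not taken from the library.

```python
from typing import Optional, Sequence


def deduce_timeout(
    observed_intervals: Sequence[float], safety_factor: float = 5.0
) -> Optional[float]:
    """Hypothetical helper: scale the longest observed heartbeat interval.

    Using the maximum interval is an assumption of this sketch; the
    documentation only states that observed intervals are multiplied
    by safety_factor.
    """
    if not observed_intervals:
        return None  # nothing observed yet, so no timeout can be deduced
    return safety_factor * max(observed_intervals)
```

With intervals of 10, 12 and 11 seconds and the default factor of 5.0, this sketch yields a timeout of 60 seconds.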

static from_args(args, cfg_file_arg=None, ft_args_prefix='')

Init FT config object from parsed CLI args.

Implements the following logic:
  • Use the default FT config as a base.

  • If a config file argument is defined, first try to read the FT config from the file.

  • Update the FT config with the FT args provided via the CLI.

  • If the config cannot be read from a file and there are no related args in the CLI, raise an exception.

Parameters:
  • args (argparse.Namespace) – Parsed arguments

  • cfg_file_arg (str, optional) – Name of the argument that contains the path to the FT config YAML file. Defaults to None, in which case no file is read.

  • ft_args_prefix (str, optional) – Prefix of the FT-related args. Defaults to an empty string, meaning no prefix.
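The precedence that from_args documents (defaults, then file, then CLI args, with an error if neither source is available) can be sketched on plain dictionaries. This is not the library's code: resolve_ft_settings, the load_file callback, the DEFAULTS table and the use of ValueError are all hypothetical, and a CLI value of None is assumed to mean "not given".

```python
import argparse
from typing import Callable, Optional

# Stand-in defaults, copied from the class signature in this reference.
DEFAULTS = {
    "workload_check_interval": 5.0,
    "initial_rank_heartbeat_timeout": 3600.0,
    "rank_heartbeat_timeout": 2700.0,
    "safety_factor": 5.0,
}


def resolve_ft_settings(
    args: argparse.Namespace,
    cfg_file_arg: Optional[str] = None,
    ft_args_prefix: str = "",
    load_file: Callable[[str], dict] = lambda path: {},
) -> dict:
    """Hypothetical re-creation of the documented precedence."""
    settings = dict(DEFAULTS)  # 1. start from the defaults
    read_from_file = False
    cfg_path = getattr(args, cfg_file_arg, None) if cfg_file_arg else None
    if cfg_path:
        try:
            settings.update(load_file(cfg_path))  # 2. file overrides defaults
            read_from_file = True
        except (OSError, ValueError):
            pass  # the CLI args may still supply a config
    # 3. CLI values override the file; None is treated as "not given"
    cli_values = {
        name: getattr(args, ft_args_prefix + name)
        for name in DEFAULTS
        if getattr(args, ft_args_prefix + name, None) is not None
    }
    settings.update(cli_values)
    # 4. neither source produced anything: report an error
    if not read_from_file and not cli_values:
        raise ValueError("no FT config file and no FT args on the CLI")
    return settings
```

The file loader is injected as a callback so that the sketch runs without a YAML parser; in real use, from_args reads the YAML file itself.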

static from_kwargs(ignore_not_recognized=True, **kwargs)

Create a FaultToleranceConfig object from keyword arguments.

Parameters:
  • ignore_not_recognized (bool, optional) – Whether to ignore unrecognized arguments. Defaults to True.

  • **kwargs – Keyword arguments representing the fields of the FaultToleranceConfig object.

Returns:

The created FaultToleranceConfig object.

Return type:

FaultToleranceConfig

Raises:

ValueError – If there are unrecognized arguments and ignore_not_recognized is False.
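The effect of ignore_not_recognized can be illustrated with a small stand-in. DemoConfig and demo_from_kwargs below are hypothetical names and carry only two of the documented fields; the sketch shows the filtering behaviour described above, not the library's implementation.

```python
from dataclasses import dataclass, fields


@dataclass
class DemoConfig:
    """Hypothetical stand-in with two of the documented fields."""

    workload_check_interval: float = 5.0
    safety_factor: float = 5.0


def demo_from_kwargs(ignore_not_recognized: bool = True, **kwargs) -> DemoConfig:
    """Build a DemoConfig, dropping or rejecting unknown keys."""
    known = {f.name for f in fields(DemoConfig)}
    unknown = sorted(set(kwargs) - known)
    if unknown and not ignore_not_recognized:
        raise ValueError(f"unrecognized arguments: {unknown}")
    # Only recognized keys reach the constructor; the rest are dropped.
    return DemoConfig(**{k: v for k, v in kwargs.items() if k in known})
```

Calling demo_from_kwargs(safety_factor=2.0, typo_field=1) returns a config with safety_factor set to 2.0 and silently drops typo_field, while passing ignore_not_recognized=False with the same arguments raises ValueError.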

static from_yaml_file(cfg_path, ignore_not_recognized=True)

Load the fault tolerance configuration from a YAML file.

The YAML file should contain a fault_tolerance section, which can be at the top level or nested in any other section.

Parameters:
  • cfg_path (str) – The path to the YAML configuration file.

  • ignore_not_recognized (bool, optional) – Whether to ignore unrecognized configuration options. Defaults to True.

Returns:

The fault tolerance configuration object.

Return type:

FaultToleranceConfig

Raises:

ValueError – If the ‘fault_tolerance’ section is not found in the config file.
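Because the fault_tolerance section may sit at any nesting depth, locating it amounts to a search over the parsed YAML tree. The sketch below works on an already-parsed dictionary so that it needs no YAML library. find_section is a hypothetical helper, and its depth-first order and its choice to descend only into mappings (not lists) are assumptions of this example.

```python
from typing import Any, Optional


def find_section(tree: Any, name: str = "fault_tolerance") -> Optional[dict]:
    """Depth-first search for a mapping stored under the key `name`."""
    if not isinstance(tree, dict):
        return None  # scalars and lists are not searched in this sketch
    if isinstance(tree.get(name), dict):
        return tree[name]
    for value in tree.values():
        found = find_section(value, name)
        if found is not None:
            return found
    return None
```

A caller that gets None back would then raise ValueError, matching the documented behaviour when the section is missing.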

to_yaml_file(cfg_path)

Serialize the configuration object to YAML and save it to the specified path.

Parameters:

cfg_path (str) – The path to save the YAML file.

Returns:

None

Return type:

None