Config

class nvidia_resiliency_ext.fault_tolerance.config.FaultToleranceConfig(workload_check_interval=5.0, initial_rank_heartbeat_timeout=3600.0, rank_heartbeat_timeout=2700.0, rank_section_timeouts=<factory>, rank_out_of_section_timeout=None, node_health_check_interval=5.0, safety_factor=5.0, rank_termination_signal=Signals.SIGKILL, log_level=20, restart_check_interval=60.0, enable_nic_monitor=False, enable_nic_healthcheck=False, enable_dist_storage_healthcheck=False, storage_healthcheck_path=None, pci_topo_file=None, link_down_path_template=None, link_state_path_template=None, skip_section_response=True, use_infra_group_rank=True, numa_bind_strict=False, gpu_memory_reclaim_timeout=50.0, gpu_memory_tolerance_mb=512.0, gpu_memory_poll_interval=2.0, check_remaining_processes=False, max_no_progress_restarts=3, min_progress_iterations=200, progress_update_interval=30.0, install_exception_hook=False)

Bases: object

Configuration of the fault tolerance components.

  • workload_check_interval [float] periodic rank check interval (in seconds) in rank monitors.

  • initial_rank_heartbeat_timeout [float|None] timeout (in seconds) for the first heartbeat from a rank.

The first heartbeat usually takes longer to arrive, because the rank needs to initialize. If a rank does not send its first heartbeat within initial_rank_heartbeat_timeout, a failure is detected.

  • rank_heartbeat_timeout [float|None] timeout (in seconds) for subsequent heartbeats from a rank.

If no rank heartbeat is received within rank_heartbeat_timeout, a failure is detected.

  • safety_factor [float] factor by which observed intervals are multiplied when timeouts are deduced.

  • rank_termination_signal [Signals] signal used to terminate the rank when a failure is detected.

  • log_level [int] log level of fault tolerance components.

  • rank_section_timeouts [Mapping[str, float|None]] timeouts for specific sections in user code. Only sections listed here will send IPC messages to the monitor server and collect timing data. Sections not in this mapping will have near-zero overhead (no IPC, no timing collection).

  • rank_out_of_section_timeout [float|None] timeout used for the implicit/default section, which spans code not wrapped in any other section.

  • restart_check_interval - interval between checks for whether a restart is in progress; needed for the layered restart protocol.

  • enable_nic_monitor - Enable NIC health monitoring in training. Default: False.

  • enable_nic_healthcheck - Enable NIC link state health check before rendezvous. This checks if network interface ports (RDMA/InfiniBand and Ethernet) are in ACTIVE state and fails if any port transitioned from ACTIVE to non-ACTIVE. Unlike enable_nic_monitor (which periodically monitors link_downed counters), this performs a one-time state check during rendezvous. Can be used independently or together with enable_nic_monitor. Default: False.

  • pci_topo_file - PCI topo file that describes GPU and NIC topology.

  • link_down_path_template - Template path for NIC link down files. Should contain ‘{dev_name}’ placeholder which will be replaced with actual NIC device name.

  • link_state_path_template - Template path for NIC link state files. Should contain ‘{nic}’ placeholder which will be replaced with actual NIC device name. Default: /sys/class/infiniband/{nic}/ports/1/state

  • enable_dist_storage_healthcheck - Enable distributed storage health check (Lustre + NFS) before rendezvous. Checks Lustre health and reachability of Lustre/NFS mounts. Default: False.

  • storage_healthcheck_path - Comma-separated absolute paths to validate for existence/readability before rendezvous. Used by the storage path health check. Default: None.

  • skip_section_response - If True, section and heartbeat messages are sent without waiting for server response (unidirectional communication). This significantly reduces latency for high-frequency operations. Server logs errors instead of sending them back. Default: True (recommended for production). Set to False during development to catch errors immediately.

  • use_infra_group_rank - If True, always use infrastructure group rank for rank assignment. Reads from SLURM_PROCID (in SLURM environments) or GROUP_RANK (set by launcher). Previous rank assignments are ignored to ensure consistency with infrastructure’s rank assignment. Note: Hot spare/redundancy is NOT supported with this setting. Default: True.

  • numa_bind_strict - If True, use strict NUMA binding with both CPU and memory bound to the same NUMA node (--cpunodebind=N --membind=N). If False (default), only bind CPU to NUMA node and allow local memory allocation (--cpunodebind=N --localalloc). Default: False.

  • gpu_memory_reclaim_timeout [float] timeout (in seconds) to wait for GPU memory to be reclaimed after worker shutdown before starting new workers. Default: 50.0.

  • gpu_memory_tolerance_mb [float] maximum allowed GPU memory usage (in MB) when checking if memory has been reclaimed. Default: 512.0.

  • gpu_memory_poll_interval [float] poll interval (in seconds) for checking GPU memory during reclaim process. Default: 2.0.

  • check_remaining_processes [bool] if True, check for and log any remaining worker processes after termination. Useful for debugging process cleanup issues. Default: False.

  • install_exception_hook [bool] if True, installs sys.excepthook to capture uncaught exceptions in training worker processes, format and log the traceback, and use os._exit() to exit the process reliably. Default: False.

If any timeout is None, it has no effect (as if it were +INF). All timeouts can be deduced and set at runtime.
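The interaction between safety_factor and a None timeout can be sketched as follows. This is only an illustration of the arithmetic described above, not the library's implementation: the helper names deduce_timeout and is_timed_out are hypothetical, and taking the largest observed interval as the base value is an assumption.

```python
import math
from typing import Optional, Sequence


def deduce_timeout(observed_intervals: Sequence[float],
                   safety_factor: float = 5.0) -> Optional[float]:
    """Hypothetical helper: scale an observed interval by safety_factor.

    Assumption: the largest observed interval is used as the base value.
    """
    if not observed_intervals:
        return None  # nothing observed yet, so the timeout stays unset
    return safety_factor * max(observed_intervals)


def is_timed_out(elapsed: float, timeout: Optional[float]) -> bool:
    """A timeout of None never fires, as if it were +INF."""
    effective = math.inf if timeout is None else timeout
    return elapsed > effective


# Heartbeat gaps (in seconds) observed during a run.
timeout = deduce_timeout([10.0, 12.5, 11.0], safety_factor=5.0)
```

With these inputs the deduced timeout is 5.0 * 12.5 = 62.5 seconds, while an empty observation list leaves the timeout at None, which is_timed_out treats as never expiring.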

Parameters:
  • workload_check_interval (float)

  • initial_rank_heartbeat_timeout (float | None)

  • rank_heartbeat_timeout (float | None)

  • rank_section_timeouts (Mapping[str, float | None])

  • rank_out_of_section_timeout (float | None)

  • node_health_check_interval (float)

  • safety_factor (float)

  • rank_termination_signal (Signals)

  • log_level (int)

  • restart_check_interval (float)

  • enable_nic_monitor (bool)

  • enable_nic_healthcheck (bool)

  • enable_dist_storage_healthcheck (bool)

  • storage_healthcheck_path (str | None)

  • pci_topo_file (str | None)

  • link_down_path_template (str | None)

  • link_state_path_template (str | None)

  • skip_section_response (bool)

  • use_infra_group_rank (bool)

  • numa_bind_strict (bool)

  • gpu_memory_reclaim_timeout (float)

  • gpu_memory_tolerance_mb (float)

  • gpu_memory_poll_interval (float)

  • check_remaining_processes (bool)

  • max_no_progress_restarts (int)

  • min_progress_iterations (int)

  • progress_update_interval (float)

  • install_exception_hook (bool)

static from_args(args)

Init FT config object from parsed CLI args.

Implements the following logic:
  • Use the default FT config as a base.

  • If a config file argument is defined, first try to read the FT config from the file.

  • Update the FT config with the FT args provided via the CLI.

  • If the config cannot be read from a file and there are no related args on the CLI, raise an exception.

Parameters:

args (argparse.Namespace) – Parsed arguments
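The layering order described for from_args (defaults, then file, then CLI) might look roughly like the sketch below. It uses a hypothetical two-field stand-in class (DemoConfig) and a hypothetical build_config function rather than the real method; the pre-parsed file_values dictionary, the use of None to mark a CLI argument that was not given, and the ValueError type are all assumptions made for illustration.

```python
import argparse
from dataclasses import dataclass, fields, replace
from typing import Any, Dict, Optional


@dataclass
class DemoConfig:
    """Hypothetical stand-in carrying two of the documented fields."""
    workload_check_interval: float = 5.0
    safety_factor: float = 5.0


def build_config(args: argparse.Namespace,
                 file_values: Optional[Dict[str, Any]]) -> DemoConfig:
    """Layer defaults, then file values, then CLI values that were given."""
    names = {f.name for f in fields(DemoConfig)}
    # Assumption: an argument left at None was not provided on the CLI.
    cli_values = {k: v for k, v in vars(args).items()
                  if k in names and v is not None}
    if file_values is None and not cli_values:
        raise ValueError("no config file could be read and no FT args were given")
    cfg = DemoConfig()  # 1. defaults as the base
    from_file = {k: v for k, v in (file_values or {}).items() if k in names}
    cfg = replace(cfg, **from_file)  # 2. values read from the file
    return replace(cfg, **cli_values)  # 3. CLI values take precedence


args = argparse.Namespace(safety_factor=2.0, workload_check_interval=None)
cfg = build_config(args, {"workload_check_interval": 10.0, "safety_factor": 3.0})
```

Here the file sets both fields, and the CLI overrides only safety_factor, so cfg ends up with workload_check_interval=10.0 and safety_factor=2.0.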

static from_kwargs(ignore_not_recognized=True, **kwargs)

Create a FaultToleranceConfig object from keyword arguments.

Parameters:
  • ignore_not_recognized (bool, optional) – Whether to ignore unrecognized arguments. Defaults to True.

  • **kwargs – Keyword arguments representing the fields of the FaultToleranceConfig object.

Returns:

The created FaultToleranceConfig object.

Return type:

FaultToleranceConfig

Raises:

ValueError – If there are unrecognized arguments and ignore_not_recognized is False.
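The effect of ignore_not_recognized can be illustrated with a small stand-in. DemoConfig and demo_from_kwargs are hypothetical names that mimic the documented behaviour on two fields; the error message text is an assumption.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class DemoConfig:
    """Hypothetical stand-in carrying two of the documented fields."""
    rank_heartbeat_timeout: Optional[float] = 2700.0
    log_level: int = 20


def demo_from_kwargs(ignore_not_recognized: bool = True, **kwargs: Any) -> DemoConfig:
    """Build the config from keyword arguments, filtering unknown names."""
    known = {f.name for f in fields(DemoConfig)}
    unknown = sorted(set(kwargs) - known)
    if unknown and not ignore_not_recognized:
        raise ValueError(f"Unrecognized arguments: {unknown}")
    # Unknown names are silently dropped when ignore_not_recognized is True.
    return DemoConfig(**{k: v for k, v in kwargs.items() if k in known})


# The misspelled option is dropped because ignore_not_recognized defaults to True.
cfg = demo_from_kwargs(rank_heartbeat_timeout=120.0, typo_option=1)
```

Passing ignore_not_recognized=False is the stricter choice when a misspelled option should fail loudly instead of being dropped.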

static from_yaml_file(cfg_path, ignore_not_recognized=True)

Load the fault tolerance configuration from a YAML file.

The YAML file should contain a fault_tolerance section, which can be at the top level or nested in any other section.

Parameters:
  • cfg_path (str) – The path to the YAML configuration file.

  • ignore_not_recognized (bool, optional) – Whether to ignore unrecognized configuration options. Defaults to True.

Returns:

The fault tolerance configuration object.

Return type:

FaultToleranceConfig

Raises:

ValueError – If the ‘fault_tolerance’ section is not found in the config file.
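One way to locate a fault_tolerance section at any nesting depth is a depth-first search over the parsed YAML data. The sketch below works on an already-parsed dictionary so that it needs no YAML library; find_section is a hypothetical helper, and the traversal order (and the fact that it only descends into mappings, not lists) is an assumption rather than the library's documented behaviour.

```python
from typing import Any, Dict, Optional


def find_section(data: Any, name: str = "fault_tolerance") -> Optional[Dict[str, Any]]:
    """Depth-first search for a mapping stored under the key `name`."""
    if not isinstance(data, dict):
        return None
    if isinstance(data.get(name), dict):
        return data[name]
    for value in data.values():
        found = find_section(value, name)
        if found is not None:
            return found
    return None


# Parsed YAML with the section nested two levels down (illustrative keys).
parsed = {"trainer": {"plugins": {"fault_tolerance": {"safety_factor": 4.0}}}}
section = find_section(parsed)
if section is None:
    raise ValueError("'fault_tolerance' section not found in the config file")
```

The resulting section dictionary could then be passed on as keyword arguments, for example to from_kwargs.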

property is_progress_tracking_enabled: bool

Check if progress tracking is enabled (controlled by max_no_progress_restarts > 0).

to_yaml_file(cfg_path)

Serialize the configuration object to YAML and save it to the specified path.

Parameters:

cfg_path (str) – The path to save the YAML file.

Returns:

None

Return type:

None