Usage guide
Terms
Fault Tolerance, FT is the fault_tolerance package.
FT callback, FaultToleranceCallback is a PTL callback that integrates FT with PTL.
ft_launcher is a launcher tool included in FT, which is based on torchrun.
heartbeat is a lightweight message sent from a rank to its rank monitor that indicates that the rank is alive.
section is a user code segment with a custom name assigned.
rank monitor is a special side process started by ft_launcher that monitors its rank.
timeouts are time intervals used by a rank monitor to detect that a rank is not alive.
launcher script is a bash script that invokes ft_launcher.
PTL is PyTorch Lightning.
Design Overview
Each node runs a single ft_launcher.
FT configuration is passed to ft_launcher and propagated to the other FT components.
ft_launcher spawns rank monitors (once).
ft_launcher spawns ranks (and can also respawn them if --max-restarts is greater than 0).
Each rank uses RankMonitorClient to connect to its monitor (RankMonitorServer).
Each rank periodically sends updates to its rank monitor (e.g., during each training and evaluation step).
In case of a hang, the rank monitor detects missing updates from its rank and terminates it.
If any ranks disappear, ft_launcher detects that and terminates or restarts the workload.
ft_launcher instances communicate via the torchrun "rendezvous" mechanism.
Rank monitors do not communicate with each other.
# Processes structure on a single node.
# NOTE: each rank has its own separate rank monitor.
[Rank_N]----(IPC)----[Rank Monitor_N]
    |                       |
    |                       |
(re/spawns)              (spawns)
    |                       |
    |                       |
[ft_launcher]----------------
Usage Overview
FT launcher
Fault tolerance includes a launcher tool called ft_launcher, which is based on torchrun
and supports most torchrun command-line parameters. FT configuration can be specified either
via a YAML file using --ft-cfg-path or through command-line parameters
using --ft-<parameter-name>.
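For illustration, here is a minimal launch sketch showing both forms. The config file path, node counts, and the rank_heartbeat_timeout parameter name in the second command are assumptions used only to illustrate the --ft-<parameter-name> pattern; check the FT configuration reference for the actual parameter names.
# Sketch: FT settings provided via a YAML config file (path is a placeholder).
ft_launcher --nnodes=2 --nproc-per-node=8 --ft-cfg-path=./ft_cfg.yaml train.py
# Sketch: the same launch with an FT parameter passed directly on the command
# line; the parameter name is illustrative and should be verified.
ft_launcher --nnodes=2 --nproc-per-node=8 --ft-rank-heartbeat-timeout=300 train.py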
Details:
--ft-node-health-check-endpoint (alias: --ft-node_health_check_endpoint) sets the node health check service endpoint used by InJob. It accepts a Unix domain socket (UDS) path: /var/run/nodehealth.sock or unix:///var/run/nodehealth.sock.
The rendezvous implementations call NodeHealthCheck, which connects to the provided endpoint.
Connectivity errors are treated as non-fatal (health passes); explicit RPC failures reported by the service mark the node unhealthy.
If --max-restarts is specified, the launcher restarts failed workers.
The restart behavior depends on the --ft-restart-policy parameter, which supports two modes:
any-failed (default): all workers are restarted if any worker fails.
min-healthy: workers are restarted when the number of healthy nodes (nodes where all worker processes are running) falls below the minimum specified in --nnodes. This allows some worker failures to be handled without restarting the remaining workers, e.g., together with Inprocess Restart. For details on how the min-healthy policy interacts with Inprocess Restart, see FT Launcher & Inprocess integration.
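For example, the following sketch (node and restart counts are placeholders) enables restarts and selects the min-healthy policy:
# Allow up to 3 restarts; restart only when the number of healthy nodes
# drops below the minimum given in --nnodes.
ft_launcher --nnodes=4 --nproc-per-node=8 --max-restarts=3 \
    --ft-restart-policy=min-healthy train.py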
Node health check service
The launcher accepts an optional argument to point to the node health check service endpoint. When provided, the launcher exports the socket path to child processes and the rendezvous handlers will use it in their node health checks.
--ft-node-health-check-endpoint (alias: --ft-node_health_check_endpoint) sets the node health check service endpoint (UDS).
The rendezvous implementations call NodeHealthCheck, which connects to this UDS endpoint.
Connectivity errors are treated as non-fatal (health passes); explicit RPC failures reported by the service mark the node unhealthy.
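A minimal sketch of pointing the launcher at a node health check service over UDS (socket path and node counts are placeholders):
ft_launcher --nnodes=2 --nproc-per-node=8 \
    --ft-node-health-check-endpoint=unix:///var/run/nodehealth.sock \
    train.py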
GPU Memory Reclaim
When --max-restarts is specified, ft_launcher can optionally wait for GPU memory to be
released before starting new workers after a restart. This helps ensure that GPU memory from
terminated workers has been fully reclaimed before starting new processes.
This feature is controlled by three parameters:
--ft-gpu-memory-reclaim-timeout (default: 50.0 seconds): timeout for waiting for GPU memory to drop below the tolerance threshold. Set to 0 to disable the feature.
--ft-gpu-memory-tolerance-mb (default: 512.0 MB): maximum allowed GPU memory usage. The launcher waits until GPU memory drops below this threshold.
--ft-gpu-memory-poll-interval (default: 2.0 seconds): poll interval for checking GPU memory usage during the reclaim process.
On restarts, the launcher periodically checks GPU memory usage and waits until it drops below the tolerance threshold or the timeout is reached. Memory statistics for each GPU are collected and logged after the reclaim process completes. If the timeout is reached, an error is logged but the restart proceeds on a best-effort basis.
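For example, a restartable job that waits up to 120 seconds for per-GPU usage to fall below 256 MB, polling every 5 seconds, could be launched as in this sketch (all values are placeholders):
ft_launcher --nproc-per-node=8 --max-restarts=3 \
    --ft-gpu-memory-reclaim-timeout=120 \
    --ft-gpu-memory-tolerance-mb=256 \
    --ft-gpu-memory-poll-interval=5 \
    train.py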
Rank assignment
The ft_launcher assigns ranks to workers during the rendezvous process.
Infrastructure-based assignment (default):
By default (--ft-use-infra-group-rank=True), rank assignments always come from the infrastructure:
The launcher first checks SLURM_PROCID (automatically set in SLURM environments).
If SLURM_PROCID is not available, it falls back to GROUP_RANK (set by ft_launcher itself).
Infrastructure ranks are used for every rendezvous, including after failures/restarts. Previous rank assignments are ignored. This ensures consistency with the infrastructure’s rank assignment, which is important for static deployments and proper resource allocation.
Note
Hot spare/redundancy is NOT supported with use_infra_group_rank=True because dynamic
rendezvous cannot guarantee that lower infrastructure ranks will join as participants first.
Deterministic assignment (alternative):
Set --ft-use-infra-group-rank=False (or use_infra_group_rank: false in config) to use
deterministic sorted assignment based on node descriptors. In this mode:
Previous rank assignments are preserved when possible
New workers fill gaps left by failed workers
Ranks are reassigned based on sorted node descriptors
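The two assignment modes can be selected as in this sketch (node counts are placeholders):
# Default: infrastructure-based assignment (SLURM_PROCID, falling back to GROUP_RANK).
ft_launcher --nnodes=4 --nproc-per-node=8 train.py
# Alternative: deterministic sorted assignment based on node descriptors.
ft_launcher --nnodes=4 --nproc-per-node=8 --ft-use-infra-group-rank=False train.py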
NUMA binding
The ft_launcher supports automatic NUMA node binding for workers through the NVRX_GPUS_PER_NUMA
environment variable. When set, the launcher automatically wraps each worker process with numactl
to bind it to the appropriate NUMA node based on its local rank.
Important
Prerequisites: This feature requires the numactl command-line tool to be installed and
available in the system PATH. The launcher will fail to start workers if numactl is not found.
To install on common Linux distributions:
Ubuntu/Debian: sudo apt-get install numactl
RHEL/CentOS/Rocky: sudo yum install numactl
How it works:
Set NVRX_GPUS_PER_NUMA to the number of GPUs per NUMA node on your system.
The launcher calculates the NUMA node as: numa_node = local_rank // gpus_per_numa
Each worker is automatically wrapped with: numactl --cpunodebind=<numa_node> --membind=<numa_node>
This applies only to binary/script entrypoints (not Python function entrypoints).
Example usage:
# For a system with 4 GPUs per NUMA node (8 GPUs total, 2 NUMA nodes)
export NVRX_GPUS_PER_NUMA=4
ft_launcher --nproc-per-node=8 train.py
# In this configuration:
# - Ranks 0-3 will be bound to NUMA node 0
# - Ranks 4-7 will be bound to NUMA node 1
Benefits:
Proper NUMA binding can significantly improve performance by ensuring memory locality and reducing cross-NUMA memory access overhead, which is especially important for multi-GPU training workloads.
Hang detection
The FT package provides two fully independent mechanisms for detecting hangs in user code. Users can choose the API that is best suited for their needs, or use both APIs at the same time.
Heartbeats API
The training script periodically sends heartbeats to the monitor. If no heartbeat arrives in a defined time, the workload is considered hung. This API is the simplest to use but might require coarse timeouts that need to cover a wide range of possible intervals between heartbeats. Please find more details in Heartbeats API Integration.
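Heartbeat timeouts are part of the FT configuration and can be supplied at launch time, e.g., via --ft-<parameter-name> flags. The parameter names in the sketch below (initial_rank_heartbeat_timeout, rank_heartbeat_timeout) and the timeout values are assumptions; verify them against Heartbeats API Integration and the FT configuration reference.
# Sketch: coarse heartbeat timeouts (in seconds) passed to the launcher.
# Parameter names are assumed, not verified.
ft_launcher --nproc-per-node=8 \
    --ft-initial-rank-heartbeat-timeout=3600 \
    --ft-rank-heartbeat-timeout=1800 \
    train.py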
Sections API
Some parts of the training scripts are wrapped in sections. If any section is opened for too long, the workload is considered hung. The sections-based API requires more changes in the user code, but timeouts can be defined more precisely, and hangs can be detected quicker. Please find more details in Sections API Integration.
Workload control
In some cases, it might be useful to control the ft_launcher behavior based on a rank's state.
For example, if an irrecoverable error is encountered in a rank, it might be reasonable to break
out of the launcher's restart loop and exit instead of restarting; for other exception types, one might
want to exclude the current node from subsequent restart attempts. RankMonitorClient exposes the
nvidia_resiliency_ext.fault_tolerance.rank_monitor_client.RankMonitorClient.send_workload_control_request()
API, which can be used to control the workload restarting logic implemented in the launcher.
Note
Please note that only the ft_launcher behavior is affected by this call. The fault tolerance package is job scheduler-agnostic, i.e., it does not control underlying SLURM job allocations.