Usage guide
Terms
- Fault Tolerance, FT is the fault_tolerance package.
- FT callback, FaultToleranceCallback is a PTL callback that integrates FT with PTL.
- ft_launcher is a launcher tool included in FT, which is based on torchrun.
- heartbeat is a lightweight message sent from a rank to its rank monitor that indicates that the rank is alive.
- section is a user code segment with a custom name assigned.
- rank monitor is a special side process started by ft_launcher that monitors its rank.
- timeouts are time intervals used by a rank monitor to detect that a rank is not alive.
- launcher script is a bash script that invokes ft_launcher.
- PTL is PyTorch Lightning.
Design Overview
- Each node runs a single ft_launcher.
- FT configuration is passed to ft_launcher and propagated to other FT components.
- ft_launcher spawns rank monitors (once).
- ft_launcher spawns ranks (and can respawn them if --max-restarts is greater than 0).
- Each rank uses RankMonitorClient to connect to its monitor (RankMonitorServer).
- Each rank periodically sends updates to its rank monitor (e.g., during each training and evaluation step).
- In case of a hang, the rank monitor detects missing updates from its rank and terminates it.
- If any ranks disappear, ft_launcher detects that and terminates or restarts the workload.
- ft_launcher instances communicate via the torchrun "rendezvous" mechanism.
- Rank monitors do not communicate with each other.
# Processes structure on a single node.
# NOTE: each rank has its own separate rank monitor.
[Rank_N]----(IPC)----[Rank Monitor_N]
    |                      |
    |                      |
(re/spawns)            (spawns)
    |                      |
    |                      |
[ft_launcher]--------------
Usage Overview
FT launcher
Fault tolerance includes a launcher tool called ft_launcher, which is based on torchrun
and supports most torchrun command-line parameters. FT configuration can be specified either
via a YAML file using --ft-cfg-path or through command-line parameters
using --ft-<parameter-name>.
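For example (a minimal sketch; the file name ft_cfg.yaml and the chosen setting are illustrative):
# Pass FT settings from a YAML file whose fault_tolerance section holds the options,
# e.g. a ft_cfg.yaml containing:
#   fault_tolerance:
#     storage_healthcheck_path: "/data/checkpoints"
ft_launcher --ft-cfg-path ft_cfg.yaml --nnodes=2 --nproc-per-node=8 train.py
# Equivalent setting passed directly on the command line:
ft_launcher --ft-storage-health-check-path /data/checkpoints --nnodes=2 --nproc-per-node=8 train.py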
Details:
- --ft-node-health-check-endpoint (alias: --ft-node_health_check_endpoint) sets the optional node health check service endpoint used by InJob. Accepts a Unix domain socket (UDS): /var/run/nvhcd.sock or unix:///var/run/nvhcd.sock. See Node health check service for the BCM-backed and compatible-service usage model.
- If --max-restarts is specified, the launcher restarts failed workers.
- The --ft-restart-policy parameter is deprecated; only any-failed is supported: all workers are restarted if any worker fails (torchrun-style behavior). This option may be removed in a future release.
Node health check service
The launcher can query an optional node-local health check service before workers enter
rendezvous. A practical public deployment model is to reuse NVIDIA Base Command
Manager (BCM) Slurm prolog or epilog health checks behind an nvhcd-compatible
daemon. The NVRx integration point is service-compatible: any equivalent daemon
can be used if it implements the expected gRPC API over a Unix domain socket
(UDS).
To enable the external node health check with BCM:
1. Build and deploy an nvhcd-compatible daemon on every allocated node.
2. Configure the daemon to invoke a BCM health check script, or a wrapper around an existing BCM prolog or epilog health check.
3. Ensure the wrapper translates the BCM result into JSON with fail_count == 0 for a healthy node and a nonzero fail_count for an unhealthy node.
4. Make the daemon's UDS visible from the job environment or training container.
5. Pass the socket path to ft_launcher with --ft-node-health-check-endpoint (alias: --ft-node_health_check_endpoint).
For protocol details, see the nvhcd protobuf schema at
src/nvidia_resiliency_ext/shared_utils/proto/nvhcd.proto. The functional test
server at tests/fault_tolerance/func/nodehc_service.py is a minimal example
of a UDS gRPC service that implements this API.
Example:
ft_launcher \
--ft-node-health-check-endpoint unix:///var/run/nvhcd.sock \
train.py
Endpoint behavior:
- UDS endpoints are supported. The value can be a path such as /var/run/nvhcd.sock or a unix:// URI such as unix:///var/run/nvhcd.sock.
- If the endpoint is omitted, NVRx skips the external node health check.
- If the gRPC client dependencies are unavailable, the UDS socket is missing, or a connectivity error occurs, NVRx treats the external check as unavailable and does not fail the job for that reason.
- Explicit failures reported by the service mark the node unhealthy.
Compatible service contract:
- The service must implement HealthCheckService.RunHealthCheck from the NVRx nvhcd protobuf API and listen on the configured UDS.
- NVRx calls the service with args=["--no-slurm"].
- The response must set success and return JSON in output. A healthy node is reported with success=true and {"fail_count": 0} (see the illustration below).
- If success is false, fail_count is nonzero, or output cannot be parsed as JSON with a fail_count field, NVRx treats the node as unhealthy.
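For illustration, under this contract a healthy and an unhealthy response would carry fields along these lines (a conceptual sketch; the exact message layout is defined in the nvhcd.proto schema referenced above):
# Healthy node
success: true
output: '{"fail_count": 0, "failed_checks": []}'
# Unhealthy node
success: false
output: '{"fail_count": 1, "failed_checks": ["bcm_healthcheck"]}'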
Example nvhcd configuration for BCM:
socket_path: /var/run/nvhcd.sock
healthcheck_path: /usr/local/sbin/nvrx-bcm-healthcheck-wrapper.sh
log_level: info
timeout: 120
Start the daemon on each node with this configuration, for example as a node-level service or directly with:
nvhcd -config /etc/nvhcd/config.yaml
If the training job runs inside a container, bind mount the UDS path into the
container so that ft_launcher can reach the daemon.
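For example, with Docker the mount could look like this (an illustrative sketch; my-training-image is a placeholder, and with Slurm/Pyxis the equivalent is --container-mounts):
docker run --gpus all \
  -v /var/run/nvhcd.sock:/var/run/nvhcd.sock \
  my-training-image \
  ft_launcher --ft-node-health-check-endpoint unix:///var/run/nvhcd.sock train.py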
The wrapper can call the same reusable health check entry point that BCM uses for
Slurm prolog or epilog validation, then normalize the result for NVRx. When using
nvhcd, the gRPC request args are forwarded to the configured
healthcheck_path as command-line arguments, so NVRx’s --no-slurm argument
will appear in the wrapper’s "$@". For example:
#!/usr/bin/env bash
set -euo pipefail
if /path/to/bcm-healthcheck "$@"; then
    printf '{"fail_count": 0, "failed_checks": []}\n'
    exit 0
else
    printf '{"fail_count": 1, "failed_checks": ["bcm_healthcheck"]}\n'
    exit 1
fi
Because NVRx invokes this check during rendezvous rather than during Slurm’s actual prolog or epilog phase, the wrapper should also provide any inputs that the BCM script expects from the Slurm lifecycle environment, or call a reusable health check entry point from the local BCM deployment that does not depend on those lifecycle-only variables.
This lets NVRx run the same class of node validation during in-job restart rendezvous that cluster administrators may already run at Slurm prolog or epilog time.
Distributed storage health check (Lustre + NFS)
The launcher can perform a distributed storage health check before rendezvous. By default it is disabled. When enabled (via CLI or YAML), it:
- Verifies Lustre health via /sys/fs/lustre/health_check (fails if not healthy).
- Discovers distributed mount targets and checks that each mount is reachable.
CLI: --ft-enable-dist-storage-healthcheck (alias: --ft_enable_dist_storage_healthcheck) accepts a boolean-like value to enable the mount checks (e.g., --ft-enable-dist-storage-healthcheck true).
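For example, to run the check before rendezvous (a minimal sketch):
ft_launcher \
  --ft-enable-dist-storage-healthcheck true \
  train.py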
Storage path health check
Validate specific absolute paths for existence and basic readability before rendezvous.
CLI: --ft-storage-health-check-path (alias: --ft_storage_health_check_path) accepts a comma-separated list of absolute paths (each starting with /), for example --ft-storage-health-check-path '/data/checkpoints,/mnt/dataset'. A full command example follows below.
YAML: storage_healthcheck_path under the fault_tolerance section, for example:
fault_tolerance:
  # Comma-separated absolute paths
  storage_healthcheck_path: "/data/checkpoints,/mnt/dataset"
Validation behavior:
- Files: attempts to read a small block (up to 4 KB).
- Directories: lists directory contents.
- Other existing types (e.g., devices/symlinks): performs a stat access.
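For example, to validate a checkpoint directory and a dataset mount before rendezvous (paths are illustrative):
ft_launcher \
  --ft-storage-health-check-path '/data/checkpoints,/mnt/dataset' \
  train.py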
Attribution service integration
Per-cycle application logs do not enable attribution by themselves. To enable attribution, set
--ft-attribution-endpoint. The endpoint value localhost makes ft_launcher run the
attribution service on the TCPStore host; other endpoints are treated as externally managed
attribution services.
External endpoints may use schemes such as http://, grpc://, or unix://. The current
in-job attribution client submits logs over HTTP(S); non-HTTP endpoint strings are preserved but do
not add a new transport implementation.
If --ft-attribution-endpoint is set, --ft-per-cycle-applog-prefix is required because the
attribution service analyzes the per-cycle application logs.
The service code is included in the NVRx wheel, but the service dependencies are optional.
Install the wheel with the attribution extra before running a launcher-managed attribution
service:
python -m pip install 'nvidia_resiliency_ext-<version>-<tags>.whl[attribution]'
Plain python -m pip install nvidia_resiliency_ext-*.whl does not install the attribution
service dependencies.
CLI:
- --ft-attribution-endpoint <ENDPOINT> (alias: --ft_attribution_endpoint), default disabled
- --ft-attribution-llm-api-key-file <PATH> (alias: --ft_attribution_llm_api_key_file)
- --ft-attribution-llm-base-url <URL> (alias: --ft_attribution_llm_base_url)
- --ft-attribution-llm-model <MODEL> (alias: --ft_attribution_llm_model)
- --ft-attribution-startup-timeout <SECONDS> (alias: --ft_attribution_startup_timeout), default 20
- --ft-attribution-export-url <URL> (alias: --ft_attribution_export_url)
- The managed attribution app-log directory is derived from dirname(realpath(--ft-per-cycle-applog-prefix)). Its stdout/stderr log is derived from --ft-per-cycle-applog-prefix as *_attribution.log. The managed service listens on 127.0.0.1:50050 and is exposed to the in-job client as http://localhost:50050.
- The managed attribution API key must come from --ft-attribution-llm-api-key-file or an inherited LLM_API_KEY_FILE. If neither points to a readable file, the TCPStore-host launcher fails before starting the attribution service.
- To export managed attribution results, pass --ft-attribution-export-url or set attribution_export_url in the fault tolerance YAML config.
- ft_launcher sends job metadata with each attribution submission: user is read from SLURM_JOB_USER or USER, and job_id is read from SLURM_ARRAY_JOB_ID or SLURM_JOB_ID. If no corresponding environment variable is set, that field is omitted from the submission payload.
Example:
ft_launcher \
  --ft-per-cycle-applog-prefix /lustre/job123/train.log \
  --ft-attribution-endpoint localhost \
  --ft-attribution-llm-api-key-file /secure/llm_api_key \
  --ft-attribution-llm-base-url https://integrate.api.nvidia.com/v1 \
  --ft-attribution-llm-model nvidia/nemotron-3-super-120b-a12b \
  --ft-attribution-export-url https://dataflow.example.test/dataflow2/example-index/posting \
  train.py
To use an externally managed attribution service instead, specify an explicit endpoint:
ft_launcher \
  --ft-per-cycle-applog-prefix /lustre/job123/train.log \
  --ft-attribution-endpoint http://attribution.service.internal:8000 \
  train.py
GPU Memory Reclaim
When --max-restarts is specified, ft_launcher can optionally wait for GPU memory to be
released before starting new workers after a restart. This helps ensure that GPU memory from
terminated workers has been fully reclaimed before starting new processes.
This feature is controlled by three parameters:
- --ft-gpu-memory-reclaim-timeout (default: 50.0 seconds): timeout for waiting for GPU memory to drop below the tolerance threshold. Set to 0 to disable the feature.
- --ft-gpu-memory-tolerance-mb (default: 512.0 MB): maximum allowed GPU memory usage. The launcher waits until GPU memory drops below this threshold.
- --ft-gpu-memory-poll-interval (default: 2.0 seconds): poll interval for checking GPU memory usage during the reclaim process.
On restarts, the launcher periodically checks GPU memory usage and waits until it drops below the tolerance threshold or the timeout is reached. Memory statistics for each GPU are collected and logged after the reclaim process completes. If the timeout is reached, an error is logged but the restart proceeds as a best effort.
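For example, to wait up to two minutes for per-GPU memory usage to drop below 1024 MB while polling every 5 seconds (values are illustrative):
ft_launcher --max-restarts=3 \
  --ft-gpu-memory-reclaim-timeout 120 \
  --ft-gpu-memory-tolerance-mb 1024 \
  --ft-gpu-memory-poll-interval 5 \
  train.py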
Per-cycle logging and gRPC log aggregation
When using consolidated per-cycle application logs (for example via --ft-per-cycle-applog-prefix)
with optional gRPC log funneling (--ft-enable-log-server), worker and launcher output can be
merged through pipes and streamed to one or more aggregators on the rendezvous host before a single
writer appends to shared storage (for example Lustre).
Important
Best-effort semantics. Per-cycle gRPC log aggregation is best-effort. Logs around failure and restart may be incomplete; crash stack traces are not guaranteed to appear there. For critical diagnostics, use rank monitor logs (launcher log) for failure/timeout correlation and OS-level core dumps for reliable crash post-mortem. Do not assume aggregated logs are complete or reliable.
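For example, a minimal sketch that combines consolidated per-cycle logs with gRPC log funneling (the log prefix path is illustrative):
ft_launcher \
  --ft-per-cycle-applog-prefix /lustre/job123/train.log \
  --ft-enable-log-server \
  train.py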
Rank assignment
The ft_launcher assigns ranks to workers during the rendezvous process.
Rank assignments always use infrastructure-based ordering when available:
1. The launcher first checks SLURM_PROCID (automatically set in SLURM environments).
2. If not available, it falls back to GROUP_RANK (set by ft_launcher itself).
3. If neither environment variable is set, ranks are assigned deterministically based on sorted node descriptors.
This ensures consistency with the infrastructure’s rank assignment, which is important for static deployments and proper resource allocation.
Hot Spare Nodes and Segment-Aware Rank Assignment
The ft_launcher supports hot spare nodes, which are standby nodes that can replace failed nodes
during restart. Hot spare functionality is always enabled and works with --max-restarts.
By default (--ft-segment=None), the launcher uses simple hot spare mode, which is suitable
for most deployments including H100-based systems where NVLink domain segmentation is not required:
- The first min_nodes (from --nnodes) are assigned as active workers.
- Any additional nodes beyond min_nodes become hot spares with standby ranks.
- Hot spares do not require GPU ClusterUUID or NVLink domain awareness.
- This mode effectively treats each node independently for rank assignment.
For large-scale NVSwitch-based systems (e.g., DGX H200, HGX B200), you can enable
segment-aware hot spare mode using --ft-segment=N:
- N specifies the minimum number of nodes required per NVLink domain (identified by GPU ClusterUUID).
- Only domains with at least N nodes participate in training.
- From each valid domain, as many complete segments as possible are selected.
- Nodes in the same segment receive contiguous group ranks for optimal performance.
- The min_nodes parameter (from --nnodes) must be divisible by segment.
- GPU ClusterUUID is automatically queried via nvidia-smi to identify NVLink domains.
Key Differences:
- --ft-segment=None (default): simple mode without domain awareness, suitable for H100 systems.
- --ft-segment=1: each node is a segment; similar to simple mode but requires ClusterUUID.
- --ft-segment=4 or higher: multi-node segments for NVSwitch-based systems.
Example for H100 deployment (8 nodes requested, 6 needed for training):
ft_launcher --nnodes=6:8 --nproc-per-node=8 \
--max-restarts=3 \
training_script.py
# Nodes 0-5: Active workers (ranks 0-47)
# Nodes 6-7: Hot spares (standby ranks 48-63)
Example for NVSwitch deployment with segment=4 (12 nodes requested, 8 needed):
ft_launcher --nnodes=8:12 --nproc-per-node=8 \
--ft-segment=4 --max-restarts=3 \
training_script.py
# Requires domains with at least 4 nodes each
# 8 active nodes = 2 complete segments
# 4 hot spare nodes available for restart
NUMA binding
The ft_launcher supports automatic NUMA node binding for workers through the NVRX_GPUS_PER_NUMA
environment variable. When set, the launcher automatically wraps each worker process with numactl
to bind it to the appropriate NUMA node based on its local rank.
Important
Prerequisites: This feature requires the numactl command-line tool to be installed and
available in the system PATH. The launcher will fail to start workers if numactl is not found.
To install on common Linux distributions:
- Ubuntu/Debian: sudo apt-get install numactl
- RHEL/CentOS/Rocky: sudo yum install numactl
How it works:
- Set NVRX_GPUS_PER_NUMA to the number of GPUs per NUMA node on your system.
- The launcher calculates the NUMA node as: numa_node = local_rank // gpus_per_numa.
- Each worker is automatically wrapped with: numactl --cpunodebind=<numa_node> --membind=<numa_node>.
- This applies only to binary/script entrypoints (not Python function entrypoints).
Example usage:
# For a system with 4 GPUs per NUMA node (8 GPUs total, 2 NUMA nodes)
export NVRX_GPUS_PER_NUMA=4
ft_launcher --nproc-per-node=8 train.py
# In this configuration:
# - Ranks 0-3 will be bound to NUMA node 0
# - Ranks 4-7 will be bound to NUMA node 1
Benefits:
Proper NUMA binding can significantly improve performance by ensuring memory locality and reducing cross-NUMA memory access overhead, which is especially important for multi-GPU training workloads.
Hang detection
The FT package provides two fully independent mechanisms for detecting hangs in user code. Users can choose the API that is best suited for their needs, or use both APIs at the same time.
Heartbeats API
The training script periodically sends heartbeats to the monitor. If no heartbeat arrives in a defined time, the workload is considered hung. This API is the simplest to use but might require coarse timeouts that need to cover a wide range of possible intervals between heartbeats. Please find more details in Heartbeats API Integration.
Sections API
Some parts of the training scripts are wrapped in sections. If any section is opened for too long, the workload is considered hung. The sections-based API requires more changes in the user code, but timeouts can be defined more precisely, and hangs can be detected quicker. Please find more details in Sections API Integration.
Workload control
In some cases, it might be useful to control the ft_launcher behavior based on a rank state.
For example, if an irrecoverable error is encountered in a rank, it might be reasonable to break
the launcher restarting loop and exit instead of restarting; for other exception types, one might
want to exclude the current node from subsequent restart attempts. RankMonitorClient exposes the
nvidia_resiliency_ext.fault_tolerance.rank_monitor_client.RankMonitorClient.send_workload_control_request()
API, which can be used to control the workload restarting logic implemented in the launcher.
Note
Please note that only the ft_launcher behavior is affected by this call. The fault tolerance package is job scheduler-agnostic, i.e., it does not control underlying SLURM job allocations.