FT Launcher & Inprocess integration
FT launcher integrates with Inprocess recovery mechanisms, improving fault tolerance by coordinating injob and inprocess fault recovery.
1. Heartbeat Mechanism
The FT launcher heartbeat remains active throughout execution to detect and mitigate potential hangs.
Users must configure timeouts manually, ensuring they exceed inprocess operational timeouts to prevent conflicts.
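The timeout ordering described above can be checked before a job is submitted. The following is a minimal sketch, not part of the library: TimeoutPlan, its field names, and launcher_timeout_is_safe are hypothetical helpers, and the values are whatever timeouts you configured yourself for the launcher heartbeat and for inprocess.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TimeoutPlan:
    """Hypothetical container; the field names are illustrative, not library options."""

    launcher_heartbeat_timeout_s: float
    inprocess_timeouts_s: Tuple[float, ...]


def launcher_timeout_is_safe(plan: TimeoutPlan, margin_s: float = 0.0) -> bool:
    """Return True when the launcher timeout exceeds every inprocess timeout.

    margin_s is an optional extra gap required between the two layers.
    """
    longest_inprocess = max(plan.inprocess_timeouts_s)
    return plan.launcher_heartbeat_timeout_s > longest_inprocess + margin_s


if __name__ == "__main__":
    plan = TimeoutPlan(launcher_heartbeat_timeout_s=600.0, inprocess_timeouts_s=(60.0, 120.0))
    print(launcher_timeout_is_safe(plan, margin_s=30.0))
```

Keeping the launcher timeout strictly larger gives inprocess the first chance to recover before the launcher treats the job as hung.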
2. Worker Monitoring & Restart Policy
A new --restart-policy argument in ft_launcher modifies the default worker monitor logic for better compatibility with inprocess restart.
Policy Options
min-healthy: Restarts workers only when the number of healthy worker groups falls below the minimum specified in --nnodes, as set in ft_launcher.
Note
For proper behavior, the minimum specified in --nnodes should match the inprocess restarter setting by either:
Ensuring inprocess operates at the node level like injob by adding a rank_assignment filter to the wrapper, or
Making injob operate at the rank level like inprocess by specifying one rank per agent.
See the rank assignment guide for more details.
Example of rank_assignment:
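import socket

import nvidia_resiliency_ext.inprocess as inprocess

# Keep only ranks on hosts that contribute exactly 8 ranks, so that
# inprocess operates at node granularity, like injob.
rank_assignment = inprocess.Compose(
    inprocess.rank_assignment.ShiftRanks(),
    inprocess.rank_assignment.FilterGroupedByKey(
        # Both arguments are ignored; the hostname is the grouping key.
        key_or_fn=lambda _, __: socket.gethostname(),
        condition=lambda count: count == 8,
    ),
)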
Behavior in min-healthy mode:
If enough nodes remain healthy, the worker monitor stays inactive while collaborating with inprocess restart.
If the threshold is breached, FT launcher takes over and restarts the training process.
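The min-healthy decision described above can be sketched as a single comparison. This is an illustration of the rule only, not the launcher's actual code: parse_min_nodes and launcher_should_restart are hypothetical names, and the sketch assumes --nnodes is given either as a single count or in a MIN:MAX form.

```python
def parse_min_nodes(nnodes: str) -> int:
    """Return the minimum node count from 'MIN:MAX' or from a single 'N' value."""
    minimum = nnodes.split(":", 1)[0]
    return int(minimum)


def launcher_should_restart(healthy_groups: int, nnodes: str) -> bool:
    """Apply the min-healthy rule.

    While enough worker groups are healthy, the launcher stays out of the way
    and recovery is left to inprocess. Once the count drops below the minimum,
    the launcher takes over and restarts the workers.
    """
    return healthy_groups < parse_min_nodes(nnodes)


if __name__ == "__main__":
    print(launcher_should_restart(healthy_groups=4, nnodes="4:8"))
    print(launcher_should_restart(healthy_groups=3, nnodes="4:8"))
```

With a minimum of 4, losing nodes down to 4 healthy groups leaves recovery to inprocess, and dropping to 3 hands control back to the launcher.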
Supported & Unsupported Configurations
To ensure correct behavior with inprocess:
✅ Supported:
restart-policy=min-healthy (Required):
Prevents unintended upscaling.
Disables any-failed worker monitoring.
❌ Unsupported:
any-failed with inprocess (Not allowed):
Incompatible with inprocess restarts.
Causes FT launcher to misinterpret terminated processes as failures, triggering unnecessary restarts.
Enables upscaling, allowing FT launcher to restart training when a new node becomes available.
Can lead to undefined behavior when combined with inprocess restarts.
In short, any-failed must not be used with inprocess, as it disrupts the intended fault recovery process.
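A job submission script could enforce this constraint before it invokes ft_launcher. The sketch below is one possible guard, under the assumption that the script knows whether the training code uses inprocess; check_restart_policy and SUPPORTED_WITH_INPROCESS are hypothetical names, not part of ft_launcher.

```python
# Policies that this page lists as supported together with inprocess.
SUPPORTED_WITH_INPROCESS = frozenset({"min-healthy"})


def check_restart_policy(policy: str, uses_inprocess: bool) -> None:
    """Raise ValueError when the restart policy is unsupported with inprocess."""
    if uses_inprocess and policy not in SUPPORTED_WITH_INPROCESS:
        raise ValueError(
            f"--restart-policy={policy} is not supported with inprocess; "
            "use --restart-policy=min-healthy"
        )


if __name__ == "__main__":
    check_restart_policy("min-healthy", uses_inprocess=True)  # passes silently
    try:
        check_restart_policy("any-failed", uses_inprocess=True)
    except ValueError as err:
        print(err)
```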
See the FT Launcher & Inprocess integration example for a complete implementation.