FT Launcher & Inprocess integration
FT launcher integrates with Inprocess recovery mechanisms, improving fault tolerance by coordinating injob and inprocess fault recovery.
1. Heartbeat Mechanism
The FT launcher heartbeat remains active throughout execution to detect and mitigate potential hangs.
Users must configure timeouts manually, ensuring they exceed inprocess operational timeouts to prevent conflicts.
2. Worker Monitoring & Restart Policy
A new --ft-restart-policy argument in ft_launcher modifies the default worker monitor logic for better compatibility with Inprocess Restart.
Policy Options
min-healthy: Restarts workers only when the number of healthy worker groups falls below minimum specified in--nnodes, as set inft_launcher.
Note
For proper behavior, minimum specified in --nnodes should match the inprocess restarter setting by either:
Ensuring
inprocessoperates at the node level likeinjobby adding arank_assignmentfilter to the wrapper, orMaking
injoboperate at the rank level likeinprocessby specifying one rank per agent.
See the rank assignment guide for more details.
Example of rank_assignment:
rank_assignment = (
inprocess.Compose(
inprocess.rank_assignment.ShiftRanks(),
inprocess.rank_assignment.FilterGroupedByKey(
key_or_fn=lambda _, _: socket.gethostname(),
condition=lambda count: count == 8,
),
),
)
Behavior in min-healthy mode:
If enough nodes remain healthy, the worker monitor stays inactive while collaborating with Inprocess Restart..
If the threshold is breached,
FT launchertakes over and restarts the training process.
Supported & Unsupported Configurations
To ensure correct behavior with inprocess:
✅ Supported:
restart-policy=min-healthy(Required):Prevents unintended upscaling.
Disables any-failed worker monitoring.
❌ Unsupported:
any-failedwith inprocess (Not allowed):Incompatible with inprocess restarts.
Causes FT launcher to misinterpret terminated processes as failures, triggering unnecessary restarts.
Enables upscaling, allowing FT launcher to restart training when a new node becomes available.
Can lead to undefined behavior when combined with inprocess restarts.
In short, any-failed must not be used with inprocess, as it disrupts the intended fault recovery process.
Please refer to the FT Launcher & Inprocess integration example for an implementation example.