Rank Assignment
- class nvidia_resiliency_ext.inprocess.rank_assignment.FillGaps
A class for reassigning distributed ranks, filling in gaps caused by terminated or unhealthy ranks.
The FillGaps class is a specialized rank assignment strategy that reorders ranks to fill gaps created by terminated or unhealthy ranks. It preserves the previous assignment for healthy ranks whose index is below world_size - len(terminated_ranks), which is the new world size; healthy ranks at or above that index are reassigned to fill the gaps left by unhealthy ranks.
Example:
    |<--- preserved --->|<- moved ->|       |<--new world size->|
    +---+---+---+---+---+---+---+---+       +---+---+---+---+---+
    | 0 | X | 2 | 3 | X | X | 6 | 7 |  -->  | 0 | 6 | 2 | 3 | 7 |
    +---+---+---+---+---+---+---+---+       +---+---+---+---+---+
          ^           ^       |   |
          |           |       |   |
          ---------------------   |
                      |           |
                      -------------
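The remapping in the example can be reproduced with a few lines of plain Python. This is only a sketch of the documented behavior, not the library's implementation: `fill_gaps` is a hypothetical helper, and pairing moved ranks with gaps in ascending order is an assumption chosen because it matches the diagram.

```python
def fill_gaps(world_size, terminated_ranks):
    """Hypothetical helper: return ({old_rank: new_rank}, new_world_size)."""
    terminated = set(terminated_ranks)
    new_world_size = world_size - len(terminated)

    mapping = {}
    movers = []  # healthy ranks that fall outside the new world size
    for rank in range(world_size):
        if rank in terminated:
            continue
        if rank < new_world_size:
            mapping[rank] = rank  # preserved: keeps its previous rank
        else:
            movers.append(rank)

    # Gaps are terminated ranks inside the new world size.
    gaps = [rank for rank in range(new_world_size) if rank in terminated]

    # Assumption: movers and gaps are paired in ascending order.
    for old_rank, new_rank in zip(movers, gaps):
        mapping[old_rank] = new_rank
    return mapping, new_world_size


mapping, new_world_size = fill_gaps(8, [1, 4, 5])
# Previous ranks listed in order of their new rank.
layout = [old for old, _ in sorted(mapping.items(), key=lambda kv: kv[1])]
print(mapping)         # {0: 0, 2: 2, 3: 3, 6: 1, 7: 4}
print(new_world_size)  # 5
print(layout)          # [0, 6, 2, 3, 7]
```

Only ranks 6 and 7 change identity here, which is the point of the strategy: ranks inside the new world size keep any rank-dependent state they already hold.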
- class nvidia_resiliency_ext.inprocess.rank_assignment.FilterGroupedByKey(key_or_fn, condition, timeout=datetime.timedelta(seconds=60))
A class for filtering distributed ranks by grouping by a key.
FilterGroupedByKey organizes ranks into groups based on a specified string key. For each group, it increments a group counter by 1 for every healthy rank. A given boolean condition is then evaluated for each rank, with the corresponding group counter passed as input.
If condition(group_counter) evaluates to True, the rank is preserved. If it evaluates to False, the rank is considered unhealthy and marked for termination.
FilterGroupedByKey needs to be followed by another RankAssignment that performs the actual rank termination by raising a RankDiscarded exception.

    condition = lambda count: count == 2

    +---+---+---+---+---+---+---+---+       +---+---+---+---+---+---+---+---+
    | 0 | X | 2 | 3 | X | X | 6 | 7 |  -->  | X | X | 2 | 3 | X | X | 6 | 7 |
    +---+---+---+---+---+---+---+---+       +---+---+---+---+---+---+---+---+
    | key=0 | key=1 | key=2 | key=3 |       | key=0 | key=1 | key=2 | key=3 |
    |       |       |       |       |       |       |       |       |       |
    |count=1|count=2|count=0|count=2|       | False | True  | False | True  |
Example:
    import socket

    import nvidia_resiliency_ext.inprocess as inprocess

    # The hostname is the group key. The condition checks whether exactly
    # 8 ranks on a given hostname are healthy; if the count differs from 8,
    # all ranks on that hostname are considered unhealthy and terminated.
    # The remaining healthy ranks are shifted to the left to fill all gaps
    # created by unhealthy ranks.
    rank_assignment = inprocess.Compose(
        inprocess.rank_assignment.ShiftRanks(),
        inprocess.rank_assignment.FilterGroupedByKey(
            # Both arguments (rank, world_size) are ignored here; distinct
            # parameter names are required for the lambda to be valid.
            key_or_fn=lambda rank, world_size: socket.gethostname(),
            condition=lambda count: count == 8,
        ),
    )
- Parameters:
  - key_or_fn (str | Callable[[int, int], str]) – a string key, or a Callable evaluated with (rank, world_size) as the input to produce a string key
  - condition (Callable[[int], bool]) – condition evaluated with the group counter as the input; if False, the rank is terminated
  - timeout (timedelta) – timeout for the distributed barrier
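The counting and filtering steps described for FilterGroupedByKey can be sketched in a single process as follows. This is illustrative only: `filter_grouped_by_key` is a hypothetical helper, it takes the set of healthy ranks directly instead of coordinating across processes, and it ignores the barrier and `timeout` entirely.

```python
from collections import Counter


def filter_grouped_by_key(world_size, healthy_ranks, key_fn, condition):
    """Hypothetical helper: return the healthy ranks that pass the condition."""
    healthy = set(healthy_ranks)

    # One counter per group key, incremented once for every healthy rank.
    counts = Counter(key_fn(rank, world_size) for rank in healthy)

    # A rank survives only if the condition holds for its group's counter.
    return sorted(
        rank for rank in healthy
        if condition(counts[key_fn(rank, world_size)])
    )


# Mirrors the diagram: pairs of ranks share a key, ranks 1, 4 and 5 are
# already unhealthy, and a group survives only if both of its ranks are healthy.
survivors = filter_grouped_by_key(
    world_size=8,
    healthy_ranks=[0, 2, 3, 6, 7],
    key_fn=lambda rank, world_size: str(rank // 2),
    condition=lambda count: count == 2,
)
print(survivors)  # [2, 3, 6, 7]
```

Rank 0 is healthy on its own but is dropped because its group counter is 1, which is the behavior the diagram shows for key=0.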
- class nvidia_resiliency_ext.inprocess.rank_assignment.RankAssignment
Abstract base class for the rank_assignment argument of inprocess.Wrapper.
RankAssignment is responsible for reassigning distributed ranks and computing the new world size for the next iteration of the wrapped function.
Multiple instances of RankAssignment can be composed with inprocess.Compose to achieve the desired behavior.
- exception nvidia_resiliency_ext.inprocess.rank_assignment.RankDiscarded
Exception raised when an unhealthy distributed rank is discarded by inprocess.rank_assignment.RankAssignment.
- class nvidia_resiliency_ext.inprocess.rank_assignment.ShiftRanks
A class for reassigning distributed ranks, filling in gaps caused by terminated or unhealthy ranks.
The ShiftRanks class is a specialized rank assignment strategy that shifts all healthy ranks to the left to fill gaps created by terminated or unhealthy ranks. ShiftRanks preserves the relative order of all healthy ranks, but all ranks past the first unhealthy rank are reassigned (shifted).
Example:
    |<->|<--------- moved --------->|       |<--new world size->|
          -----
          v   |
    +---+---+---+---+---+---+---+---+       +---+---+---+---+---+
    | 0 | X | 2 | 3 | X | X | 6 | 7 |  -->  | 0 | 2 | 3 | 6 | 7 |
    +---+---+---+---+---+---+---+---+       +---+---+---+---+---+
              ^  | ^  ^       |   |
              |  | |  |       |   |
              ---- ------------   |
                      |           |
                      -------------
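The shift in the example amounts to renumbering the healthy ranks consecutively in their existing order. A minimal sketch in plain Python, assuming a hypothetical helper `shift_ranks` that is not part of the library:

```python
def shift_ranks(world_size, terminated_ranks):
    """Hypothetical helper: return ({old_rank: new_rank}, new_world_size)."""
    terminated = set(terminated_ranks)

    # Healthy ranks in ascending order; relative order is preserved.
    healthy = [rank for rank in range(world_size) if rank not in terminated]

    # Each healthy rank takes its position in the compacted sequence.
    mapping = {old_rank: new_rank for new_rank, old_rank in enumerate(healthy)}
    return mapping, len(healthy)


shifted, shifted_world_size = shift_ranks(8, [1, 4, 5])
print(shifted)             # {0: 0, 2: 1, 3: 2, 6: 3, 7: 4}
print(shifted_world_size)  # 5
```

Compared with FillGaps on the same input, four ranks change identity here instead of two; in exchange, the relative order of the healthy ranks is kept.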