Rank Assignment

class nvidia_resiliency_ext.inprocess.rank_assignment.FillGaps[source]

A class for reassigning distributed ranks, filling in gaps caused by terminated or unhealthy ranks.

The FillGaps class is a specialized rank assignment strategy that reorders ranks to fill gaps created by terminated or unhealthy ranks. The new world size is world_size - len(terminated_ranks). Every healthy rank whose index is below the new world size keeps its previous rank; the remaining healthy ranks are reassigned to fill the gaps left by unhealthy ranks.

Example:

|<--- preserved --->|<- moved ->|     |<--new world size->|

+---+---+---+---+---+---+---+---+     +---+---+---+---+---+
| 0 | X | 2 | 3 | X | X | 6 | 7 | --> | 0 | 6 | 2 | 3 | 7 |
+---+---+---+---+---+---+---+---+     +---+---+---+---+---+
      ^           ^        |  |
      |           |        |  |
      ---------------------   |
                  |           |
                  -------------
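
The reassignment shown in the diagram can be sketched in plain Python. This is only an illustration of the mapping, not the library's implementation: the function name fill_gaps and its arguments are hypothetical, and pairing the moved ranks with the gaps in ascending order is an assumption chosen to match the diagram.

```python
from typing import Dict, Iterable


def fill_gaps(world_size: int, terminated: Iterable[int]) -> Dict[int, int]:
    """Hypothetical helper: map each healthy old rank to its new rank."""
    terminated = set(terminated)
    new_world_size = world_size - len(terminated)
    healthy = [r for r in range(world_size) if r not in terminated]

    # Healthy ranks below the new world size keep their previous rank.
    mapping = {r: r for r in healthy if r < new_world_size}

    # Gaps are terminated ranks that fall inside the new world size.
    gaps = [r for r in sorted(terminated) if r < new_world_size]

    # Healthy ranks at or above the new world size are moved into the gaps
    # (assumption: lowest moved rank takes the lowest gap).
    movers = [r for r in healthy if r >= new_world_size]
    mapping.update(zip(movers, gaps))
    return mapping


# Ranks 1, 4 and 5 are unhealthy, as in the diagram above.
mapping = fill_gaps(8, {1, 4, 5})
```

Under these assumptions, ranks 0, 2 and 3 keep their rank, rank 6 takes over rank 1, and rank 7 takes over rank 4.
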
class nvidia_resiliency_ext.inprocess.rank_assignment.FilterGroupedByKey(key_or_fn, condition, timeout=datetime.timedelta(seconds=60))[source]

A class for filtering distributed ranks by grouping by a key.

FilterGroupedByKey organizes ranks into groups based on a specified string key. For each group, it increments a group counter by 1 for every healthy rank. A given boolean condition is then evaluated for each rank, with the corresponding group counter passed as input.

  • If condition(group_counter) evaluates to True, the rank is preserved.

  • If it evaluates to False, the rank is considered unhealthy and marked for termination.

FilterGroupedByKey must be followed by another RankAssignment that performs the actual rank termination by raising a RankDiscarded exception.

condition = lambda count: count == 2

+---+---+---+---+---+---+---+---+     +---+---+---+---+---+---+---+---+
| 0 | X | 2 | 3 | X | X | 6 | 7 | --> | X | X | 2 | 3 | X | X | 6 | 7 |
+---+---+---+---+---+---+---+---+     +---+---+---+---+---+---+---+---+
| key=0 | key=1 | key=2 | key=3 |     | key=0 | key=1 | key=2 | key=3 |
|       |       |       |       |     |       |       |       |       |
|count=1|count=2|count=0|count=2|     | False | True  | False | True  |
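
The counting and filtering steps in the diagram can be sketched as a single-process Python function. This is a simplified illustration: the real class coordinates across ranks (hence the timeout for a distributed barrier), whereas the hypothetical filter_grouped_by_key below just takes the set of healthy ranks as an argument. The key function that maps rank r to str(r // 2) is an assumption used to reproduce the four groups in the diagram.

```python
from collections import Counter
from typing import Callable, Iterable, Set


def filter_grouped_by_key(
    healthy: Iterable[int],
    world_size: int,
    key_fn: Callable[[int, int], str],
    condition: Callable[[int], bool],
) -> Set[int]:
    """Hypothetical helper: return the ranks that remain healthy."""
    healthy = sorted(healthy)

    # Increment the group counter once for every healthy rank in the group.
    counts = Counter(key_fn(rank, world_size) for rank in healthy)

    # Keep a rank only if the condition holds for its group's counter.
    return {
        rank
        for rank in healthy
        if condition(counts[key_fn(rank, world_size)])
    }


# Two consecutive ranks share a key, as in the diagram above.
survivors = filter_grouped_by_key(
    healthy={0, 2, 3, 6, 7},
    world_size=8,
    key_fn=lambda rank, world_size: str(rank // 2),
    condition=lambda count: count == 2,
)
```

With these inputs, rank 0 is dropped because its group has only one healthy member, while ranks 2, 3, 6 and 7 are preserved.
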

Example:

import socket

import nvidia_resiliency_ext.inprocess as inprocess

# The hostname is the group key, and the condition checks whether exactly
# 8 ranks on a given host are healthy. If the count differs from 8, all
# ranks on that host are considered unhealthy and terminated; the
# remaining healthy ranks are shifted to the left to fill the gaps.

# The lambda needs two distinct parameter names: it is called with
# (rank, world_size), and both are ignored here.
rank_assignment = inprocess.Compose(
    inprocess.rank_assignment.ShiftRanks(),
    inprocess.rank_assignment.FilterGroupedByKey(
        key_or_fn=lambda rank, world_size: socket.gethostname(),
        condition=lambda count: count == 8,
    ),
)

Parameters:
  • key_or_fn (str | Callable[[int, int], str]) – a string key, or a Callable evaluated with (rank, world_size) as the input to produce a string key

  • condition (Callable[[int], bool]) – condition evaluated with the group counter as the input; if it returns False, the rank is terminated

  • timeout (timedelta) – timeout for the distributed barrier

class nvidia_resiliency_ext.inprocess.rank_assignment.RankAssignment[source]

Abstract base class for the rank_assignment argument of inprocess.Wrapper.

RankAssignment is responsible for reassigning distributed ranks and computing the new world size for the next iteration of the wrapped function.

Multiple instances of RankAssignment can be composed with inprocess.Compose to achieve the desired behavior.

exception nvidia_resiliency_ext.inprocess.rank_assignment.RankDiscarded[source]

Exception raised when an unhealthy distributed rank is discarded by an inprocess.rank_assignment.RankAssignment.

class nvidia_resiliency_ext.inprocess.rank_assignment.ShiftRanks[source]

A class for reassigning distributed ranks, filling in gaps caused by terminated or unhealthy ranks.

The ShiftRanks class is a specialized rank assignment strategy that shifts all healthy ranks to the left to fill gaps created by terminated or unhealthy ranks. ShiftRanks preserves the relative order of all healthy ranks, but all ranks past the first unhealthy rank are reassigned (shifted).

Example:

 <-   ->|<------- moved ------->|     |<--new world size->|

          ----
          v   |
+---+---+---+---+---+---+---+---+     +---+---+---+---+---+
| 0 | X | 2 | 3 | X | X | 6 | 7 | --> | 0 | 2 | 3 | 6 | 7 |
+---+---+---+---+---+---+---+---+     +---+---+---+---+---+
      ^   |   ^   ^       |   |
      |   |   |   |       |   |
      ----     ------------   |
                  |           |
                  ------------
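
The shift shown in the diagram amounts to renumbering the healthy ranks in ascending order. A minimal sketch in plain Python follows; the function name shift_ranks and its arguments are hypothetical and are not part of the library's API.

```python
from typing import Dict, Iterable


def shift_ranks(world_size: int, terminated: Iterable[int]) -> Dict[int, int]:
    """Hypothetical helper: map each healthy old rank to its new rank."""
    terminated = set(terminated)
    healthy = [r for r in range(world_size) if r not in terminated]

    # Healthy ranks keep their relative order and are renumbered from 0.
    return {old: new for new, old in enumerate(healthy)}


# Ranks 1, 4 and 5 are unhealthy, as in the diagram above.
mapping = shift_ranks(8, {1, 4, 5})
```

Only rank 0, which precedes the first unhealthy rank, keeps its number here; every later healthy rank moves left by the number of unhealthy ranks before it.
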