Rank Filter

class nvidia_resiliency_ext.inprocess.rank_filter.MaxActiveWorldSize(max_active_world_size)

MaxActiveWorldSize ensures that the active world size is no greater than the specified max_active_world_size. Ranks with indices less than the active world size are active and call the wrapped function, while ranks outside this range are inactive (sleeping).

Parameters:

max_active_world_size (int | None) – maximum active world size; no limit if None
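
The capping rule described above can be illustrated with a short, library-independent sketch. The helper name split_ranks is hypothetical and is not part of nvidia_resiliency_ext; it only models the arithmetic, assuming ranks are numbered from 0 to world_size - 1.

```python
from typing import List, Optional, Tuple


def split_ranks(
    world_size: int, max_active_world_size: Optional[int]
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: model which rank indices are active or inactive.

    This is an illustration of the documented rule, not the library's code.
    """
    if max_active_world_size is None:
        # None means "no limit": every rank stays active.
        active_world_size = world_size
    else:
        active_world_size = min(world_size, max_active_world_size)

    # Ranks with an index below the active world size are active.
    active = list(range(active_world_size))
    # The remaining ranks are inactive (sleeping).
    inactive = list(range(active_world_size, world_size))
    return active, inactive


active, inactive = split_ranks(world_size=8, max_active_world_size=6)
print(active)    # [0, 1, 2, 3, 4, 5]
print(inactive)  # [6, 7]
```

With a limit of 6 and 8 ranks available, ranks 6 and 7 stay idle in this model.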

class nvidia_resiliency_ext.inprocess.rank_filter.RankFilter

RankFilter selects which ranks are active in the current restart iteration of inprocess.Wrapper.

Active ranks call the provided wrapped function. Inactive ranks wait idle and can serve as a pool of static, preallocated, and preinitialized spare ranks. Spare ranks are activated in a subsequent restart iteration if previously active ranks were terminated or became unhealthy.

Multiple instances of RankFilter can be composed with inprocess.Compose to achieve the desired behavior.
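
To show how composed filters and spare ranks interact across restart iterations, the following sketch models each filter as a plain function on the number of available ranks. Every name here (CountFilter, max_active, divisible_by, apply_filters) is hypothetical and stands in for the real classes. The sketch applies filters in list order and assumes that a divisibility filter rounds down; it does not model the order in which inprocess.Compose applies its arguments, nor how ranks are renumbered after a failure.

```python
from functools import reduce
from typing import Callable, Iterable, Optional

# Hypothetical model: a filter maps an available rank count to an active count.
CountFilter = Callable[[int], int]


def max_active(limit: Optional[int]) -> CountFilter:
    """Cap the active count at `limit`; None means no cap."""
    return lambda n: n if limit is None else min(n, limit)


def divisible_by(divisor: int = 1) -> CountFilter:
    """Round the active count down to a multiple of `divisor` (assumed rule)."""
    return lambda n: (n // divisor) * divisor


def apply_filters(available_ranks: int, filters: Iterable[CountFilter]) -> int:
    """Apply each filter in turn, in list order (an assumption of this sketch)."""
    return reduce(lambda n, f: f(n), filters, available_ranks)


filters = [max_active(8), divisible_by(4)]

# First iteration: 10 ranks available, 8 become active, 2 wait as spares.
first_iteration = apply_filters(10, filters)

# Later iteration: 3 ranks were lost, so 7 remain. The cap of 8 no longer
# binds, and rounding down to a multiple of 4 leaves 4 active, 3 idle.
second_iteration = apply_filters(7, filters)

print(first_iteration, second_iteration)  # 8 4
```

In this model the idle ranks from the first iteration are what allow a later iteration to keep a usable active world size after failures.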

class nvidia_resiliency_ext.inprocess.rank_filter.WorldSizeDivisibleBy(divisor=1)

WorldSizeDivisibleBy ensures that the active world size is divisible by a given number. Ranks within the adjusted world size are marked as active and call the wrapped function, while ranks outside this range are marked as inactive (sleeping).

Parameters:

divisor (int) – the number by which the adjusted active world size must be divisible
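
The adjustment can be sketched as simple integer arithmetic. The function name adjusted_active_world_size is hypothetical, and the sketch assumes the adjustment rounds down to the largest multiple of divisor that does not exceed the current active world size, since the filter cannot create ranks that do not exist.

```python
def adjusted_active_world_size(active_world_size: int, divisor: int = 1) -> int:
    """Hypothetical helper: largest multiple of `divisor` <= active_world_size.

    Models the documented behavior under a round-down assumption; this is
    not the library's implementation.
    """
    if divisor < 1:
        raise ValueError("divisor must be a positive integer")
    return (active_world_size // divisor) * divisor


world_size = 10
adjusted = adjusted_active_world_size(world_size, divisor=4)

# Ranks below the adjusted size are active; the rest are inactive (sleeping).
active = list(range(adjusted))
inactive = list(range(adjusted, world_size))

print(adjusted)  # 8
print(inactive)  # [8, 9]
```

A divisor of this kind is typically chosen to match a unit that must stay whole, such as the number of ranks per node or per model-parallel group.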