State
- class nvidia_resiliency_ext.inprocess.Mode(value)
Indicates operational mode of the current distributed rank.
- ACTIVE
the rank calls the wrapped function
- INACTIVE
the rank is waiting idle
- TERMINATED
the rank was terminated
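As an illustration of how these modes might be consumed, the sketch below branches on a stand-in enum rather than on the library's own `Mode`. The member values and the `describe` helper are hypothetical. `INITIALIZED` is included only because it appears as the default `mode` in the `FrozenState` and `State` signatures; its meaning is not documented in this section.

```python
from enum import Enum, auto


class Mode(Enum):
    """Stand-in for inprocess.Mode; member values are assumed."""
    INITIALIZED = auto()
    ACTIVE = auto()
    INACTIVE = auto()
    TERMINATED = auto()


def describe(mode: Mode) -> str:
    """Hypothetical helper: map a mode to the description listed above."""
    if mode is Mode.ACTIVE:
        return "calls the wrapped function"
    if mode is Mode.INACTIVE:
        return "waits idle"
    if mode is Mode.TERMINATED:
        return "was terminated"
    return "no description documented here"


for mode in Mode:
    print(f"{mode.name}: {describe(mode)}")
```

Comparing with `is` works because enum members are singletons; real code would import `Mode` from `nvidia_resiliency_ext.inprocess` instead of redefining it.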
- class nvidia_resiliency_ext.inprocess.FrozenState(rank, world_size, active_rank=None, active_world_size=None, initial_rank=None, initial_world_size=None, iteration=0, mode=Mode.INITIALIZED, fn_exception=None)
inprocess.FrozenState is identical to inprocess.State, except that all fields are read-only.
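The read-only relationship can be pictured with plain dataclasses. This is a minimal sketch under stated assumptions, not the library's implementation: both classes are stand-ins carrying only a subset of the documented fields, and the `freeze` helper is hypothetical.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass
class State:
    """Stand-in with a subset of the documented fields."""
    rank: Optional[int]
    world_size: int
    iteration: int = 0


@dataclasses.dataclass(frozen=True)
class FrozenState:
    """Same fields, but assignment raises FrozenInstanceError."""
    rank: Optional[int]
    world_size: int
    iteration: int = 0


def freeze(state: State) -> FrozenState:
    """Hypothetical helper: copy a mutable state into a read-only snapshot."""
    return FrozenState(**dataclasses.asdict(state))


state = State(rank=0, world_size=4)
snapshot = freeze(state)

state.iteration += 1          # the mutable stand-in can be updated
try:
    snapshot.iteration = 1    # the frozen stand-in rejects assignment
except dataclasses.FrozenInstanceError:
    print("snapshot is read-only")

print(state.iteration, snapshot.iteration)
```

A read-only snapshot of this kind is useful when state is handed to user-supplied code that should observe it without being able to change it.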
- class nvidia_resiliency_ext.inprocess.State(rank, world_size, active_rank=None, active_world_size=None, initial_rank=None, initial_world_size=None, iteration=0, mode=Mode.INITIALIZED, fn_exception=None)
Represents the current state of the inprocess.Wrapper.
- Parameters:
rank (int | None) – the distributed rank index as seen by the inprocess.Wrapper, None for terminated ranks
world_size (int) – the total number of distributed ranks controlled by the inprocess.Wrapper
active_rank (int | None) – the distributed rank index passed to the wrapped function
active_world_size (int | None) – the total number of distributed ranks passed to the wrapped function
initial_rank (int | None) – the distributed rank index, captured when the Wrapper was invoked
initial_world_size (int | None) – the total number of initial distributed ranks
iteration (int) – index of the current restart iteration
mode (Mode) – operational mode
fn_exception (Exception | None) – an instance of Exception raised by the wrapped function in the current restart iteration, None if no exception was raised
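To make the three rank views (`rank`, `active_rank`, `initial_rank`) easier to tell apart, the sketch below builds a few snapshots with a stand-in dataclass that mirrors the documented constructor signature. The `Mode` values, the `summarize` helper, and every number in the scenario are invented for illustration; they do not describe how the library actually reassigns ranks after a fault.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Mode(Enum):
    """Stand-in for inprocess.Mode; member values are assumed."""
    INITIALIZED = auto()
    ACTIVE = auto()
    INACTIVE = auto()
    TERMINATED = auto()


@dataclass
class State:
    """Stand-in mirroring the documented constructor signature."""
    rank: Optional[int]
    world_size: int
    active_rank: Optional[int] = None
    active_world_size: Optional[int] = None
    initial_rank: Optional[int] = None
    initial_world_size: Optional[int] = None
    iteration: int = 0
    mode: Mode = Mode.INITIALIZED
    fn_exception: Optional[Exception] = None


def summarize(state: State) -> str:
    """Hypothetical helper: one-line description of a state snapshot."""
    origin = f"initial rank {state.initial_rank}/{state.initial_world_size}"
    if state.mode is Mode.TERMINATED:
        return f"{origin}: terminated in iteration {state.iteration}"
    line = f"{origin}: rank {state.rank}/{state.world_size}"
    if state.mode is Mode.ACTIVE:
        line += (f", runs the function as "
                 f"{state.active_rank}/{state.active_world_size}")
    elif state.mode is Mode.INACTIVE:
        line += ", idle"
    if state.fn_exception is not None:
        line += f", raised {type(state.fn_exception).__name__}"
    return line


# Invented values: four initial ranks, one lost, one kept idle as a spare.
snapshots = [
    State(rank=0, world_size=3, active_rank=0, active_world_size=2,
          initial_rank=0, initial_world_size=4, iteration=1, mode=Mode.ACTIVE),
    State(rank=1, world_size=3, active_rank=1, active_world_size=2,
          initial_rank=1, initial_world_size=4, iteration=1, mode=Mode.ACTIVE,
          fn_exception=RuntimeError("example failure")),
    State(rank=2, world_size=3, initial_rank=3, initial_world_size=4,
          iteration=1, mode=Mode.INACTIVE),
    State(rank=None, world_size=3, initial_rank=2, initial_world_size=4,
          iteration=1, mode=Mode.TERMINATED),
]

for snapshot in snapshots:
    print(summarize(snapshot))
```

In this made-up scenario the terminated rank has `rank=None`, as the parameter description states, while `initial_rank` still records where it started.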