State

class nvidia_resiliency_ext.inprocess.Mode(value)

Indicates operational mode of the current distributed rank.

INITIALIZED: the State was initialized, but RankAssignment has not yet been performed

ACTIVE: the rank calls the wrapped function

INACTIVE: the rank is waiting idle

TERMINATED: the rank was terminated
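The four modes above can be pictured as an ordinary Python enumeration. The sketch below is a stand-in built with the standard library, not the library's own definition: the member names come from this page, while the member values and the describe helper are assumptions made for illustration.

```python
import enum


class Mode(enum.Enum):
    # Stand-in mirroring the documented member names.
    # The values are placeholders; the real values are not stated on this page.
    INITIALIZED = enum.auto()
    ACTIVE = enum.auto()
    INACTIVE = enum.auto()
    TERMINATED = enum.auto()


def describe(mode: Mode) -> str:
    """Hypothetical helper: map a mode to the description given above."""
    descriptions = {
        Mode.INITIALIZED: "initialized, rank assignment pending",
        Mode.ACTIVE: "calling the wrapped function",
        Mode.INACTIVE: "waiting idle",
        Mode.TERMINATED: "terminated",
    }
    return descriptions[mode]


print(describe(Mode.ACTIVE))
```

In real code the enumeration would be imported from nvidia_resiliency_ext.inprocess rather than redefined.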

class nvidia_resiliency_ext.inprocess.FrozenState(rank, world_size, active_rank=None, active_world_size=None, initial_rank=None, initial_world_size=None, iteration=0, mode=Mode.INITIALIZED, fn_exception=None)

inprocess.FrozenState is identical to inprocess.State, except all fields are read-only.

Parameters:

rank (int | None)
world_size (int)
active_rank (int | None)
active_world_size (int | None)
initial_rank (int | None)
initial_world_size (int | None)
iteration (int)
mode (Mode)
fn_exception (Exception | None)
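To illustrate what read-only fields mean in practice, the sketch below models the documented constructor signature as a frozen dataclass. This is an assumption made for illustration: the page does not say how FrozenState is implemented, and FrozenStateSketch is a hypothetical name. Only the field names and defaults are taken from the signature above.

```python
import dataclasses
import enum
from typing import Optional


class Mode(enum.Enum):
    # Stand-in for inprocess.Mode; values are placeholders.
    INITIALIZED = enum.auto()
    ACTIVE = enum.auto()
    INACTIVE = enum.auto()
    TERMINATED = enum.auto()


@dataclasses.dataclass(frozen=True)
class FrozenStateSketch:
    # Field names and defaults follow the documented signature.
    rank: Optional[int]
    world_size: int
    active_rank: Optional[int] = None
    active_world_size: Optional[int] = None
    initial_rank: Optional[int] = None
    initial_world_size: Optional[int] = None
    iteration: int = 0
    mode: Mode = Mode.INITIALIZED
    fn_exception: Optional[Exception] = None


snapshot = FrozenStateSketch(rank=0, world_size=2)

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
error_name = None
try:
    snapshot.iteration = 1
except dataclasses.FrozenInstanceError as exc:
    error_name = type(exc).__name__

# A modified copy can still be derived without touching the original.
next_snapshot = dataclasses.replace(snapshot, iteration=1)
```

A read-only snapshot of this kind can be handed to callbacks without the risk that they modify the state it was taken from.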

class nvidia_resiliency_ext.inprocess.State(rank, world_size, active_rank=None, active_world_size=None, initial_rank=None, initial_world_size=None, iteration=0, mode=Mode.INITIALIZED, fn_exception=None)

Represents the current state of the inprocess.Wrapper.

Parameters:

rank (int | None) – a distributed rank index as seen by the inprocess.Wrapper, None for terminated ranks
world_size (int) – the total number of distributed ranks controlled by the inprocess.Wrapper
active_rank (int | None) – a distributed rank index passed to the wrapped function
active_world_size (int | None) – the total number of distributed ranks passed to the wrapped function
initial_rank (int | None) – a distributed rank index, captured when the Wrapper was invoked
initial_world_size (int | None) – the initial total number of distributed ranks
iteration (int) – index of the current restart iteration
mode (Mode) – operational mode
fn_exception (Exception | None) – an instance of Exception raised by the wrapped function in the current restart iteration, None if no exception was raised
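The parameters above can be read together as a single record. The sketch below is a minimal stand-in, assuming a plain dataclass with the documented field names and defaults. StateSketch and summarize are hypothetical names, and the field values are illustrative rather than output captured from a real run.

```python
import dataclasses
import enum
from typing import Optional


class Mode(enum.Enum):
    # Stand-in for inprocess.Mode; values are placeholders.
    INITIALIZED = enum.auto()
    ACTIVE = enum.auto()
    INACTIVE = enum.auto()
    TERMINATED = enum.auto()


@dataclasses.dataclass
class StateSketch:
    # Field names and defaults follow the documented signature.
    rank: Optional[int]
    world_size: int
    active_rank: Optional[int] = None
    active_world_size: Optional[int] = None
    initial_rank: Optional[int] = None
    initial_world_size: Optional[int] = None
    iteration: int = 0
    mode: Mode = Mode.INITIALIZED
    fn_exception: Optional[Exception] = None


def summarize(state: StateSketch) -> str:
    """Hypothetical helper: one-line summary suitable for a log message."""
    if state.fn_exception is not None:
        failure = type(state.fn_exception).__name__
    else:
        failure = "none"
    return (
        f"iteration={state.iteration} mode={state.mode.name} "
        f"rank={state.rank}/{state.world_size} "
        f"active={state.active_rank}/{state.active_world_size} "
        f"exception={failure}"
    )


# Illustrative values: a rank in its third restart iteration (index 2)
# whose wrapped function raised an exception in that iteration.
state = StateSketch(
    rank=1,
    world_size=4,
    active_rank=1,
    active_world_size=3,
    initial_rank=1,
    initial_world_size=4,
    iteration=2,
    mode=Mode.ACTIVE,
    fn_exception=RuntimeError("boom"),
)
print(summarize(state))
```

Code that reads the state should treat rank, active_rank, and active_world_size as possibly None, as the type annotations above indicate.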