State

class nvidia_resiliency_ext.inprocess.Mode(value)

Indicates operational mode of the current distributed rank.

INITIALIZED

the State was initialized, but RankAssignment has not been performed yet

ACTIVE

the rank calls the wrapped function

INACTIVE

the rank is waiting idle

TERMINATED

the rank was terminated
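The four modes above can be illustrated with a small sketch. It does not import nvidia_resiliency_ext: the Mode enum below is a local stand-in that only mirrors the documented member names, and the helper describe() is hypothetical, added purely for illustration.

```python
import enum


class Mode(enum.Enum):
    """Local stand-in that mirrors the four documented member names."""

    INITIALIZED = enum.auto()
    ACTIVE = enum.auto()
    INACTIVE = enum.auto()
    TERMINATED = enum.auto()


def describe(mode: Mode) -> str:
    """Return the documented meaning of a mode (hypothetical helper)."""
    meanings = {
        Mode.INITIALIZED: "state initialized, rank assignment not yet performed",
        Mode.ACTIVE: "rank calls the wrapped function",
        Mode.INACTIVE: "rank is waiting idle",
        Mode.TERMINATED: "rank was terminated",
    }
    return meanings[mode]


for mode in Mode:
    print(f"{mode.name}: {describe(mode)}")
```

Comparing enum members by identity (for example `mode is Mode.ACTIVE`) is the usual idiom when branching on a value like this.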

class nvidia_resiliency_ext.inprocess.FrozenState(rank, world_size, active_rank=None, active_world_size=None, initial_rank=None, initial_world_size=None, iteration=0, mode=Mode.INITIALIZED, fn_exception=None)

inprocess.FrozenState is identical to inprocess.State, except all fields are read-only.

Parameters:
  • rank (int | None)

  • world_size (int)

  • active_rank (int | None)

  • active_world_size (int | None)

  • initial_rank (int | None)

  • initial_world_size (int | None)

  • iteration (int)

  • mode (Mode)

  • fn_exception (Exception | None)
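A minimal sketch of what read-only fields mean in practice. FrozenStateSketch is a hypothetical stand-in built with a frozen dataclass and carries only three of the documented fields; how the library itself enforces immutability is not stated here, so the frozen dataclass is an assumption made for illustration.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass(frozen=True)
class FrozenStateSketch:
    """Hypothetical stand-in with a subset of the documented fields."""

    rank: Optional[int]
    world_size: int
    iteration: int = 0


snapshot = FrozenStateSketch(rank=0, world_size=8)

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    snapshot.iteration = 1
    write_rejected = False
except dataclasses.FrozenInstanceError:
    write_rejected = True

# A modified copy can still be derived without mutating the original.
next_snapshot = dataclasses.replace(snapshot, iteration=snapshot.iteration + 1)
```

A read-only snapshot of this kind can be handed to user callbacks without the risk that they alter the state they are inspecting.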

class nvidia_resiliency_ext.inprocess.State(rank, world_size, active_rank=None, active_world_size=None, initial_rank=None, initial_world_size=None, iteration=0, mode=Mode.INITIALIZED, fn_exception=None)

Represents the current state of the inprocess.Wrapper.

Parameters:
  • rank (int | None) – a distributed rank index as seen by the inprocess.Wrapper, None for terminated ranks

  • world_size (int) – the total number of distributed ranks controlled by the inprocess.Wrapper

  • active_rank (int | None) – a distributed rank index passed to the wrapped function

  • active_world_size (int | None) – the total number of distributed ranks passed to the wrapped function

  • initial_rank (int | None) – a distributed rank index, captured when the Wrapper was invoked

  • initial_world_size (int | None) – the total number of initial distributed ranks

  • iteration (int) – index of the current restart iteration

  • mode (Mode) – operational mode

  • fn_exception (Exception | None) – an instance of Exception raised by the wrapped function in the current restart iteration, None if no exception was raised
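The fields and defaults listed in the signature can be sketched as a plain dataclass. StateSketch, the local Mode enum, and the helper summarize() are hypothetical stand-ins that copy only the documented field names, types, and defaults; they are not the library's classes, and the sample values are illustrative.

```python
import dataclasses
import enum
from typing import Optional


class Mode(enum.Enum):
    """Local stand-in for the documented operational modes."""

    INITIALIZED = enum.auto()
    ACTIVE = enum.auto()
    INACTIVE = enum.auto()
    TERMINATED = enum.auto()


@dataclasses.dataclass
class StateSketch:
    """Hypothetical stand-in with the documented fields and defaults."""

    rank: Optional[int]
    world_size: int
    active_rank: Optional[int] = None
    active_world_size: Optional[int] = None
    initial_rank: Optional[int] = None
    initial_world_size: Optional[int] = None
    iteration: int = 0
    mode: Mode = Mode.INITIALIZED
    fn_exception: Optional[Exception] = None


def summarize(state: StateSketch) -> str:
    """Format a one-line summary of a state (hypothetical helper)."""
    if state.fn_exception is None:
        failure = "no exception"
    else:
        failure = f"raised {type(state.fn_exception).__name__}"
    return (
        f"iteration={state.iteration} mode={state.mode.name} "
        f"rank={state.rank}/{state.world_size} "
        f"active={state.active_rank}/{state.active_world_size} ({failure})"
    )


# A freshly constructed state relies on the documented defaults.
fresh = StateSketch(rank=0, world_size=4)
print(summarize(fresh))

# Illustrative values for a rank after one restart caused by an exception.
restarted = StateSketch(
    rank=1, world_size=4, active_rank=1, active_world_size=3,
    initial_rank=1, initial_world_size=4, iteration=1,
    mode=Mode.ACTIVE, fn_exception=RuntimeError("simulated fault"),
)
print(summarize(restarted))
```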