Initialize
- class nvidia_resiliency_ext.inprocess.initialize.Initialize[source]
Abstract base class for the initialize argument for inprocess.Wrapper.
Initialize is executed at the start of every restart iteration, including the first one. Initialize can raise exceptions (e.g., if specific preconditions are not met). Raising a standard Python Exception triggers another restart, while raising a BaseException terminates the wrapper.
Multiple instances of Initialize can be composed with inprocess.Compose to achieve the desired behavior.
- abstract __call__(state)[source]
Implementation of an Initialize.
- Parameters:
state (FrozenState) – read-only Wrapper state
- Returns:
Forwarded read-only input state.
- Return type:
FrozenState
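The contract described above (take a read-only state, either raise or forward it unchanged) can be sketched without the library installed. Everything in this block is a hypothetical stand-in: FrozenStateStub, InitializeStub, RequirePrecondition and run_initialize are illustrative names, the iteration field is an assumption about what the state carries, and the toy driver only imitates the documented restart-on-Exception behavior rather than reproducing inprocess.Wrapper.

```python
import abc
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenStateStub:
    """Hypothetical stand-in for the read-only Wrapper state."""

    iteration: int  # assumed field: index of the current restart iteration


class InitializeStub(abc.ABC):
    """Stand-in mirroring the documented __call__(state) -> state contract."""

    @abc.abstractmethod
    def __call__(self, state: FrozenStateStub) -> FrozenStateStub:
        raise NotImplementedError


class RequirePrecondition(InitializeStub):
    """Raises a standard Exception until a (made-up) precondition holds."""

    def __init__(self, ready_after: int):
        self.ready_after = ready_after

    def __call__(self, state: FrozenStateStub) -> FrozenStateStub:
        if state.iteration < self.ready_after:
            # A standard Exception requests another restart iteration.
            raise RuntimeError(f"precondition not met at iteration {state.iteration}")
        # Forward the read-only input state unchanged.
        return state


def run_initialize(init: InitializeStub, max_attempts: int = 5):
    """Toy driver: retry on Exception; a BaseException would propagate."""
    for iteration in range(max_attempts):
        state = FrozenStateStub(iteration=iteration)
        try:
            return init(state)
        except Exception:
            continue
    return None


result = run_initialize(RequirePrecondition(ready_after=2))
```

With the real library, an instance of a class derived from Initialize would instead be passed as the initialize argument of inprocess.Wrapper, which drives the restart iterations itself.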
- class nvidia_resiliency_ext.inprocess.initialize.RetryController(max_iterations=None, min_world_size=1, min_active_world_size=1)[source]
Controls retry logic for distributed training based on specified iteration and world size limits.
This class manages the conditions under which distributed training retries are allowed, raising an inprocess.exception.RestartAbort exception when the conditions are not met.
- Parameters:
max_iterations (int | None) – The maximum number of iterations allowed before aborting retries. If None, there is no iteration limit.
min_world_size (int) – The minimum required world size to proceed with execution.
min_active_world_size (int) – The minimum required active world size to proceed with execution.
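The three limits can be illustrated with a small stand-alone sketch. This is not the library's implementation: RetryControllerSketch, RestartAbortStub and FrozenStateStub are hypothetical stand-ins, the state field names (iteration, world_size, active_world_size) are assumptions, and so is the exact comparison used for the iteration limit. Deriving the stub exception from BaseException is likewise an assumption, chosen so that it would terminate rather than restart.

```python
from dataclasses import dataclass
from typing import Optional


class RestartAbortStub(BaseException):
    """Hypothetical stand-in for inprocess.exception.RestartAbort."""


@dataclass(frozen=True)
class FrozenStateStub:
    """Hypothetical stand-in for the read-only Wrapper state."""

    iteration: int
    world_size: int
    active_world_size: int


class RetryControllerSketch:
    """Illustrative re-creation of the documented parameters and defaults."""

    def __init__(
        self,
        max_iterations: Optional[int] = None,
        min_world_size: int = 1,
        min_active_world_size: int = 1,
    ):
        self.max_iterations = max_iterations
        self.min_world_size = min_world_size
        self.min_active_world_size = min_active_world_size

    def __call__(self, state: FrozenStateStub) -> FrozenStateStub:
        # Assumption: the limit counts as reached once iteration >= max_iterations.
        too_many = (
            self.max_iterations is not None
            and state.iteration >= self.max_iterations
        )
        too_small = state.world_size < self.min_world_size
        too_few_active = state.active_world_size < self.min_active_world_size
        if too_many or too_small or too_few_active:
            raise RestartAbortStub(f"retry conditions not met: {state}")
        # Conditions hold: forward the state unchanged.
        return state


def check(controller: RetryControllerSketch, state: FrozenStateStub) -> str:
    """Report whether the controller lets execution continue or aborts."""
    try:
        controller(state)
        return "continue"
    except RestartAbortStub:
        return "abort"


controller = RetryControllerSketch(max_iterations=3, min_world_size=4)
outcomes = [
    check(controller, FrozenStateStub(0, 8, 8)),  # within all limits
    check(controller, FrozenStateStub(3, 8, 8)),  # iteration limit reached
    check(controller, FrozenStateStub(1, 2, 2)),  # world size below minimum
]
```

Leaving max_iterations at None in this sketch disables only the iteration check; the two world size checks still apply with their default minimum of 1.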