Initialize
- class nvidia_resiliency_ext.inprocess.initialize.Initialize
Abstract base class for the initialize argument for inprocess.Wrapper.
Initialize is executed at the start of every restart iteration, including the first one. Initialize can raise exceptions (e.g., if specific preconditions are not met). Raising a standard Python Exception triggers another restart, while raising a BaseException terminates the wrapper.
Multiple instances of Initialize can be composed with inprocess.Compose to achieve the desired behavior.
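The difference between raising an Exception and raising a BaseException can be illustrated with a small, self-contained toy loop. This is only a sketch of the semantics described above, not the library's implementation: run_with_restarts, Abort, wait_until_ready, and abort_immediately are hypothetical names introduced for illustration, and the real inprocess.Wrapper does considerably more.

```python
class Abort(BaseException):
    """Hypothetical BaseException subclass; not part of the library."""


def run_with_restarts(initialize, fn, max_restarts=5):
    """Toy stand-in for a restart wrapper (assumption, not inprocess.Wrapper)."""
    attempts = []
    for iteration in range(max_restarts):
        try:
            # The initialize callable runs at the start of every iteration,
            # including the first one.
            initialize(iteration)
            return fn(iteration), attempts
        except Exception as exc:
            # A standard Exception is recorded and triggers another restart.
            attempts.append((iteration, repr(exc)))
        # A BaseException that is not an Exception is deliberately not
        # caught here, so it propagates and terminates the loop.
    raise RuntimeError("restart budget exhausted")


def wait_until_ready(iteration):
    # Hypothetical precondition: not satisfied on the first two iterations.
    if iteration < 2:
        raise RuntimeError(f"not ready at iteration {iteration}")


def abort_immediately(iteration):
    # Hypothetical fatal precondition failure.
    raise Abort("fatal precondition failure")


result, attempts = run_with_restarts(wait_until_ready, lambda i: i)
```

In this sketch, wait_until_ready causes two restarts before the function runs, whereas passing abort_immediately makes run_with_restarts exit on the first iteration with Abort propagating to the caller.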
- class nvidia_resiliency_ext.inprocess.initialize.RetryController(max_iterations=None, min_world_size=1, min_active_world_size=1)
Controls retry logic for distributed training based on specified iteration and world size limits.
This class manages the conditions under which distributed training retries are allowed, raising an inprocess.exception.RestartAbort exception when the conditions are not met.
- Parameters:
max_iterations (int | None) – The maximum number of iterations allowed before aborting retries. If None, there is no iteration limit.
min_world_size (int) – The minimum required world size to proceed with execution.
min_active_world_size (int) – The minimum required active world size to proceed with execution.
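The way the three parameters interact can be sketched with a self-contained stand-in. Everything below is an assumption made for illustration: StateSketch, RetryControllerSketch, and RestartAbortSketch are hypothetical names, the state fields are guessed, and the exact comparison used for max_iterations (here a zero-based iteration counter that aborts once iteration >= max_iterations) is not confirmed by the text above.

```python
from dataclasses import dataclass
from typing import Optional


class RestartAbortSketch(BaseException):
    """Stand-in for inprocess.exception.RestartAbort (base class assumed)."""


@dataclass(frozen=True)
class StateSketch:
    """Hypothetical view of the state passed to an initialize callable."""
    iteration: int
    world_size: int
    active_world_size: int


class RetryControllerSketch:
    """Illustrative re-creation of the documented checks; not the library class."""

    def __init__(
        self,
        max_iterations: Optional[int] = None,
        min_world_size: int = 1,
        min_active_world_size: int = 1,
    ):
        self.max_iterations = max_iterations
        self.min_world_size = min_world_size
        self.min_active_world_size = min_active_world_size

    def __call__(self, state: StateSketch) -> StateSketch:
        # None means there is no iteration limit.
        too_many = (
            self.max_iterations is not None
            and state.iteration >= self.max_iterations  # assumed comparison
        )
        too_small = state.world_size < self.min_world_size
        too_few_active = state.active_world_size < self.min_active_world_size
        if too_many or too_small or too_few_active:
            raise RestartAbortSketch(f"retry conditions not met: {state}")
        return state


controller = RetryControllerSketch(max_iterations=3, min_world_size=2)
healthy = controller(StateSketch(iteration=0, world_size=4, active_world_size=4))
```

With these assumed semantics, a state at iteration 3, or one whose world size has dropped to 1, would raise the abort exception instead of being returned. In the library itself, the corresponding object would be constructed with the signature shown above, for example RetryController(max_iterations=3, min_world_size=2), and supplied as the initialize argument for inprocess.Wrapper.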