Initialize

class nvidia_resiliency_ext.inprocess.initialize.Initialize

Abstract base class for the initialize argument of inprocess.Wrapper.

Initialize is executed at the start of every restart iteration, including the first one. Initialize can raise exceptions (e.g., if specific preconditions are not met). Raising a standard Python Exception triggers another restart, while raising a BaseException terminates the wrapper.

Multiple instances of Initialize can be composed with inprocess.Compose to achieve the desired behavior.
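
The exception rule described above can be modeled with a small, self-contained sketch. This is not the library's Wrapper: run_with_restarts, Abort, and require_ready are hypothetical names used only for illustration, and the loop is a toy model of "Exception restarts, BaseException terminates".

```python
class Abort(BaseException):
    """Hypothetical exception that derives from BaseException, not Exception."""


def run_with_restarts(initialize, fn, max_restarts=5):
    """Toy restart loop (an assumption, not the library's implementation)."""
    log = []
    for iteration in range(max_restarts):
        try:
            # The initialize callable runs at the start of every iteration,
            # including the first one.
            initialize(iteration)
            result = fn(iteration)
            log.append(("done", iteration))
            return result, log
        except Exception:
            # An ordinary Exception triggers another restart iteration.
            log.append(("restart", iteration))
        # Anything that is only a BaseException is not caught here,
        # so it propagates and ends the loop.
    raise RuntimeError("restart budget of this toy loop exhausted")


def require_ready(iteration):
    # Hypothetical precondition: pretend a resource is ready from iteration 2 on.
    if iteration < 2:
        raise RuntimeError("precondition not met")


def always_abort(iteration):
    raise Abort("stop restarting")


result, log = run_with_restarts(require_ready, lambda i: f"ran on iteration {i}")

try:
    run_with_restarts(always_abort, lambda i: None)
    terminated = False
except Abort:
    terminated = True
```

In this sketch the first call restarts twice before succeeding, while the second call stops on the first iteration because Abort is never caught by the except Exception clause.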

class nvidia_resiliency_ext.inprocess.initialize.RetryController(max_iterations=None, min_world_size=1, min_active_world_size=1)

Controls retry logic for distributed training based on specified iteration and world size limits.

This class manages the conditions under which distributed training retries are allowed, raising an inprocess.exception.RestartAbort exception when the conditions are not met.

Parameters:
  • max_iterations (int | None) – The maximum number of iterations allowed before aborting retries. If None, there is no iteration limit

  • min_world_size (int) – The minimum required world size to proceed with execution

  • min_active_world_size (int) – The minimum required active world size to proceed with execution
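
The three limits above can be illustrated with a standalone sketch of the kind of check such a controller might perform. Everything here is an assumption made for illustration: StateSketch and its field names, RestartAbortSketch (a stand-in for inprocess.exception.RestartAbort, assumed to derive from BaseException so that it stops restarts), and the choice to abort once the iteration index reaches max_iterations. The real class may differ in these details.

```python
from dataclasses import dataclass
from typing import Optional


class RestartAbortSketch(BaseException):
    """Hypothetical stand-in for inprocess.exception.RestartAbort."""


@dataclass(frozen=True)
class StateSketch:
    """Hypothetical state record; the field names are assumptions."""
    iteration: int
    world_size: int
    active_world_size: int


class RetryControllerSketch:
    """Illustrative re-creation of the documented limits, not the library class."""

    def __init__(
        self,
        max_iterations: Optional[int] = None,
        min_world_size: int = 1,
        min_active_world_size: int = 1,
    ):
        self.max_iterations = max_iterations
        self.min_world_size = min_world_size
        self.min_active_world_size = min_active_world_size

    def __call__(self, state: StateSketch) -> StateSketch:
        # Assumption: iterations are zero-based and the limit is exclusive.
        too_many_iterations = (
            self.max_iterations is not None
            and state.iteration >= self.max_iterations
        )
        world_too_small = state.world_size < self.min_world_size
        too_few_active = state.active_world_size < self.min_active_world_size
        if too_many_iterations or world_too_small or too_few_active:
            raise RestartAbortSketch(f"retry conditions not met for {state}")
        return state


def allowed(controller, state):
    """Return True if the controller lets this state proceed."""
    try:
        controller(state)
        return True
    except RestartAbortSketch:
        return False


controller = RetryControllerSketch(
    max_iterations=3, min_world_size=4, min_active_world_size=2
)
healthy = allowed(controller, StateSketch(iteration=0, world_size=8, active_world_size=8))
over_limit = allowed(controller, StateSketch(iteration=3, world_size=8, active_world_size=8))
shrunk = allowed(controller, StateSketch(iteration=1, world_size=3, active_world_size=3))
few_active = allowed(controller, StateSketch(iteration=1, world_size=8, active_world_size=1))
unlimited = allowed(RetryControllerSketch(), StateSketch(iteration=1000, world_size=1, active_world_size=1))
```

With the default arguments (max_iterations=None and both minimums set to 1), the sketch never aborts for any non-empty world, which matches the documented meaning of None as "no iteration limit".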