Initialize

class nvidia_resiliency_ext.inprocess.initialize.RetryController(max_iterations=None, min_world_size=1, min_active_world_size=1)

Controls retry logic for distributed training based on specified iteration and world size limits.

This class manages the conditions under which distributed training retries are allowed, raising an inprocess.exception.RestartAbort exception when the conditions are not met.

Parameters:
  • max_iterations (int | None) – The maximum number of iterations allowed before aborting retries. If None, there is no iteration limit

  • min_world_size (int) – The minimum required world size to proceed with execution

  • min_active_world_size (int) – The minimum required active world size to proceed with execution
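
The three limits above can be illustrated with a small standalone sketch. It is not the library's implementation and does not import nvidia_resiliency_ext: the class RetryPolicySketch, its check method, and the local RestartAbort exception are hypothetical stand-ins. The exact comparisons (for example, whether the iteration limit is inclusive) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


class RestartAbort(Exception):
    """Hypothetical local stand-in for inprocess.exception.RestartAbort."""


@dataclass
class RetryPolicySketch:
    """Hypothetical model of the documented limits, not the real RetryController."""

    max_iterations: Optional[int] = None
    min_world_size: int = 1
    min_active_world_size: int = 1

    def check(self, iteration: int, world_size: int, active_world_size: int) -> None:
        # Assumption: retries stop once the iteration count reaches the limit.
        if self.max_iterations is not None and iteration >= self.max_iterations:
            raise RestartAbort(f"iteration {iteration} reached max_iterations")
        # Assumption: a world size below the minimum aborts the retry.
        if world_size < self.min_world_size:
            raise RestartAbort(f"world_size {world_size} below min_world_size")
        # Assumption: the same rule applies to the active world size.
        if active_world_size < self.min_active_world_size:
            raise RestartAbort(
                f"active_world_size {active_world_size} below min_active_world_size"
            )


def is_retry_allowed(policy: RetryPolicySketch, iteration: int,
                     world_size: int, active_world_size: int) -> bool:
    """Return True if the sketch policy permits another retry."""
    try:
        policy.check(iteration, world_size, active_world_size)
    except RestartAbort:
        return False
    return True
```

With max_iterations=None the iteration check is skipped entirely, so in this sketch only the two world size limits can stop a retry.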