Initialize
- class nvidia_resiliency_ext.inprocess.initialize.RetryController(max_iterations=None, min_world_size=1, min_active_world_size=1)
Controls retry logic for distributed training based on specified iteration and world size limits.
This class manages the conditions under which distributed training retries are allowed, raising an
inprocess.exception.RestartAbort exception when the conditions are not met.
- Parameters:
  - max_iterations (int | None) – The maximum number of iterations allowed before aborting retries. If None, there is no iteration limit.
  - min_world_size (int) – The minimum required world size to proceed with execution.
  - min_active_world_size (int) – The minimum required active world size to proceed with execution.
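
The three limits described above can be illustrated with a small, self-contained sketch. It does not import or reproduce the library: `RetryPolicySketch`, its `check` method, the `allowed` helper, and the local `RestartAbort` class are hypothetical stand-ins, and the exact comparison used for the iteration limit (`>=`) is an assumption, not something this page states.

```python
from dataclasses import dataclass
from typing import Optional


class RestartAbort(Exception):
    """Local stand-in for inprocess.exception.RestartAbort (not the library class)."""


@dataclass(frozen=True)
class RetryPolicySketch:
    """Hypothetical illustration of the documented limits; not RetryController itself."""

    max_iterations: Optional[int] = None  # None means no iteration limit
    min_world_size: int = 1
    min_active_world_size: int = 1

    def check(self, iteration: int, world_size: int, active_world_size: int) -> None:
        """Raise RestartAbort if any limit is violated; otherwise return None."""
        # Assumption: retries stop once `iteration` reaches max_iterations.
        if self.max_iterations is not None and iteration >= self.max_iterations:
            raise RestartAbort(
                f"iteration {iteration} reached max_iterations={self.max_iterations}"
            )
        if world_size < self.min_world_size:
            raise RestartAbort(
                f"world size {world_size} is below min_world_size={self.min_world_size}"
            )
        if active_world_size < self.min_active_world_size:
            raise RestartAbort(
                f"active world size {active_world_size} is below "
                f"min_active_world_size={self.min_active_world_size}"
            )


def allowed(policy: RetryPolicySketch, iteration: int, world_size: int,
            active_world_size: int) -> bool:
    """Return True if a retry would be permitted under the sketch policy."""
    try:
        policy.check(iteration, world_size, active_world_size)
    except RestartAbort:
        return False
    return True
```

With `RetryPolicySketch(max_iterations=3, min_world_size=2, min_active_world_size=2)`, a retry at iteration 0 with 4 ranks, 2 of them active, passes all three checks, while the same call with only 1 active rank is rejected. How the real `RetryController` receives the iteration and world sizes is not covered by this sketch.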