BaseCheckpointManager
BaseCheckpointManager defines the interface for managing local checkpoints.
Each CheckpointManager handles tasks such as:
- cleaning up old checkpoints
- tracking the iteration of the latest valid checkpoint
- saving and loading checkpoints using the implemented backend
The manager works through a state_dict interface: users must adapt their state_dict as needed, and MCore (Megatron Core) facilitates these modifications.
- class nvidia_resiliency_ext.checkpointing.local.ckpt_managers.base_manager.BaseCheckpointManager(session_id, repl_strategy=None)[source]
Bases:
ABC
The Base Checkpoint Manager provides an interface for integrating different checkpoint managers, abstracting replication mechanisms from the underlying implementations.
- Parameters:
repl_strategy (ReplicationStrategy)
- find_latest()[source]
Searches for the most recent complete checkpoint and returns its iteration number.
If no complete checkpoints are found, the method returns -1.
All training ranks have to call this method at once.
- Returns:
The iteration number of the most recent complete checkpoint, or -1 if no checkpoints are available.
- Return type:
int
- load()[source]
Loads the most recent complete checkpoint.
Ensure that find_latest() has been called first to identify the latest checkpoint.
All training ranks have to call this method at once.
- Returns:
state_dict: The state dictionary loaded from the most recent complete checkpoint.
ckpt_id: The identifier of the checkpoint that was successfully loaded.
- Return type:
Tuple[TensorAwareStateDict, str]
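The call order described above (find_latest() first, then load() only if a checkpoint exists) can be sketched as follows. This is a minimal illustration, not the library's implementation: ToyCheckpointManager is a hypothetical single-process, in-memory stand-in that only mirrors the documented method names and return shapes, the ckpt_id format is invented, and resuming at iteration + 1 is an assumed convention.

```python
from typing import Any, Dict, Tuple


class ToyCheckpointManager:
    """Hypothetical in-memory stand-in that mirrors the documented method names."""

    def __init__(self) -> None:
        self._store: Dict[int, Dict[str, Any]] = {}
        self._latest = -1

    def save(self, state_dict: Dict[str, Any], iteration: int, is_async: bool = False) -> None:
        # The toy always saves synchronously and returns None.
        self._store[iteration] = dict(state_dict)
        return None

    def find_latest(self) -> int:
        # Returns -1 when nothing has been saved, as the documentation describes.
        self._latest = max(self._store, default=-1)
        return self._latest

    def load(self) -> Tuple[Dict[str, Any], str]:
        if self._latest == -1:
            raise RuntimeError("call find_latest() first and check that it is not -1")
        # The ckpt_id format below is invented for this sketch.
        return dict(self._store[self._latest]), f"iter_{self._latest}"


def resume(manager: ToyCheckpointManager, fresh_state: Dict[str, Any]) -> Tuple[Dict[str, Any], int]:
    """Return (state_dict, next_iteration), starting fresh when no checkpoint exists."""
    iteration = manager.find_latest()
    if iteration == -1:
        return fresh_state, 0
    state_dict, _ckpt_id = manager.load()
    return state_dict, iteration + 1  # assumed convention: continue after the saved iteration
```

In a real multi-rank job every rank would run the same sequence, since both find_latest() and load() must be called by all training ranks.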
- property rank
- save(state_dict, iteration, is_async=False)[source]
Saves the state_dict associated with the specified iteration number.
If is_async is set to True, the save operation is performed asynchronously, and the function returns an AsyncRequest object. Otherwise, the save operation is completed synchronously.
All training ranks have to call this method at once.
- Parameters:
- Returns:
An AsyncRequest object if is_async is True; otherwise, None as the operation completes synchronously.
- Return type:
AsyncRequest or None
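The difference between the two return values of save() can be illustrated with a small sketch. Everything here is hypothetical: ToyAsyncRequest and ToySaver are invented stand-ins, and the real AsyncRequest has its own API that is not reproduced here. The sketch only shows that a caller should branch on whether save() returned None (work already done) or a request object (work still pending).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class ToyAsyncRequest:
    """Hypothetical stand-in for AsyncRequest: holds the deferred write."""
    run: Callable[[], None]


class ToySaver:
    """Hypothetical manager that mirrors the save() signature only."""

    def __init__(self) -> None:
        self.saved: Dict[int, Dict[str, Any]] = {}

    def save(self, state_dict: Dict[str, Any], iteration: int,
             is_async: bool = False) -> Optional[ToyAsyncRequest]:
        snapshot = dict(state_dict)  # copy so later training updates do not leak in

        def write() -> None:
            self.saved[iteration] = snapshot

        if is_async:
            # Defer the write; the caller decides when to execute the request.
            return ToyAsyncRequest(run=write)
        write()
        return None


saver = ToySaver()

# Synchronous save: the checkpoint exists as soon as save() returns.
assert saver.save({"step": 1}, iteration=1) is None

# Asynchronous save: nothing is written until the request is executed.
request = saver.save({"step": 2}, iteration=2, is_async=True)
if request is not None:
    request.run()
```

How the real AsyncRequest is scheduled and finalized depends on the library's asynchronous checkpointing utilities, which this sketch does not model.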
- exception nvidia_resiliency_ext.checkpointing.local.ckpt_managers.base_manager.CheckpointingException[source]
Bases:
Exception
Base exception for checkpointing-related errors.
- exception nvidia_resiliency_ext.checkpointing.local.ckpt_managers.base_manager.SameMachineReplicationException(ckpt_id)[source]
Bases:
CheckpointingException
Exception raised when an attempt is made to overwrite a file during replication.
Inherits from CheckpointingException.
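Because SameMachineReplicationException derives from CheckpointingException, a handler for the base class also catches the subclass, so the more specific handler has to come first. The sketch below defines local stand-in classes with the same names so that it runs without the library installed; storing ckpt_id as an attribute and the message text are assumptions made for illustration.

```python
from typing import Callable


class CheckpointingException(Exception):
    """Local stand-in for the library's base checkpointing exception."""


class SameMachineReplicationException(CheckpointingException):
    """Local stand-in; keeping ckpt_id as an attribute is an assumption."""

    def __init__(self, ckpt_id: str) -> None:
        super().__init__(f"replication would overwrite checkpoint {ckpt_id}")
        self.ckpt_id = ckpt_id


def describe_failure(action: Callable[[], None]) -> str:
    """Run action and report which kind of checkpointing error occurred, if any."""
    try:
        action()
    except SameMachineReplicationException as exc:
        # Specific handler first; the base-class handler below would also match.
        return f"replication conflict: {exc.ckpt_id}"
    except CheckpointingException:
        return "checkpointing error"
    return "ok"


def conflicting_replication() -> None:
    raise SameMachineReplicationException("ckpt_42")


def generic_failure() -> None:
    raise CheckpointingException("backend unavailable")
```

With the real library, the two classes would be imported from nvidia_resiliency_ext.checkpointing.local.ckpt_managers.base_manager instead of being defined locally.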