BaseCheckpointManager

BaseCheckpointManager defines the interface for managing local checkpoints.

Each CheckpointManager handles tasks such as:
  • cleaning up old checkpoints

  • tracking the iteration of the latest valid checkpoint

  • saving and loading checkpoints using the implemented backend.

The manager works through a state_dict interface: users must adapt their state_dict as needed, and MCore (Megatron Core) facilitates these modifications.

class nvidia_resiliency_ext.checkpointing.local.ckpt_managers.base_manager.BaseCheckpointManager(session_id, repl_strategy=None)

Bases: ABC

The Base Checkpoint Manager provides an interface for integrating different checkpoint managers, abstracting replication mechanisms from the underlying implementations.

Parameters:

repl_strategy (ReplicationStrategy)

find_latest()

Searches for the most recent complete checkpoint and returns its iteration number.

If no complete checkpoints are found, the method returns -1.

This is a collective call: all training ranks must call this method together.

Returns:

The iteration number of the most recent complete checkpoint, or -1 if no checkpoints are available.

Return type:

int

load()

Loads the most recent complete checkpoint.

Ensure that find_latest() has been called first to identify the latest checkpoint.

This is a collective call: all training ranks must call this method together.

Returns:

Tuple[TensorAwareStateDict, str]
  • state_dict: The state dictionary loaded from the most recent complete checkpoint.

  • ckpt_id: The identifier of the checkpoint that was successfully loaded.

Return type:

Tuple[TensorAwareStateDict, str]
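
The relationship between find_latest() and load() can be sketched as a resume routine. This is a minimal illustration, not the library's implementation: ToyCheckpointManager is a hypothetical in-memory stand-in that mimics only the documented method names and return conventions (including the -1 sentinel), the ckpt_id format is invented for the example, and resume_or_start is a helper name chosen here. In a real job every training rank would make these calls together.

```python
from typing import Any, Dict, Tuple


class ToyCheckpointManager:
    """Hypothetical in-memory stand-in; NOT the library's implementation."""

    def __init__(self) -> None:
        self._store: Dict[int, Dict[str, Any]] = {}
        self._latest = -1

    def save(self, state_dict: Dict[str, Any], iteration: int, is_async: bool = False) -> None:
        # Keep a shallow copy so later updates to the caller's dict are not stored.
        self._store[iteration] = dict(state_dict)

    def find_latest(self) -> int:
        # Mirrors the documented convention: -1 when nothing has been saved.
        self._latest = max(self._store, default=-1)
        return self._latest

    def load(self) -> Tuple[Dict[str, Any], str]:
        if self._latest < 0:
            raise RuntimeError("no checkpoint found; call find_latest() first")
        # The "iter_<n>" identifier format is an assumption made for this sketch.
        return self._store[self._latest], f"iter_{self._latest}"


def resume_or_start(manager: ToyCheckpointManager, initial_state: Dict[str, Any]) -> Tuple[Dict[str, Any], int]:
    """Return (state, first iteration to run), resuming when a checkpoint exists."""
    latest = manager.find_latest()
    if latest == -1:
        return initial_state, 0
    state, _ckpt_id = manager.load()
    return state, latest + 1
```

Checking the sentinel before calling load() keeps the "no checkpoint yet" case on the normal control path instead of relying on an exception.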

property rank

save(state_dict, iteration, is_async=False)

Saves the state_dict associated with the specified iteration number.

If is_async is set to True, the save operation is performed asynchronously, and the function returns an AsyncRequest object. Otherwise, the save operation is completed synchronously.

This is a collective call: all training ranks must call this method together.

Parameters:
  • state_dict (dict) – The state dictionary to be saved.

  • iteration (int) – The iteration number for identifying the checkpoint.

  • is_async (bool) – Whether to perform the save operation asynchronously.

Returns:

An AsyncRequest object if is_async is True; otherwise, None as the operation completes synchronously.

Return type:

AsyncRequest or None
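
The two return paths of save() can be handled as in the sketch below. It rests on stated assumptions: ToyAsyncManager is a hypothetical stand-in, and it returns a concurrent.futures.Future where the real interface documents an AsyncRequest, whose own API is not described in this section and should be checked in the library. The helper name save_and_track is also chosen only for this example.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Dict, List, Optional


class ToyAsyncManager:
    """Hypothetical stand-in: a Future plays the role of the documented AsyncRequest."""

    def __init__(self) -> None:
        self.saved: Dict[int, Dict[str, Any]] = {}
        self._pool = ThreadPoolExecutor(max_workers=1)

    def _write(self, state_dict: Dict[str, Any], iteration: int) -> None:
        self.saved[iteration] = state_dict

    def save(self, state_dict: Dict[str, Any], iteration: int, is_async: bool = False) -> Optional[Future]:
        # Snapshot first so later training updates do not leak into the checkpoint.
        snapshot = dict(state_dict)
        if is_async:
            return self._pool.submit(self._write, snapshot, iteration)
        self._write(snapshot, iteration)
        return None


def save_and_track(manager: ToyAsyncManager, state_dict: Dict[str, Any], iteration: int,
                   is_async: bool, pending: List[Future]) -> None:
    """Call save() and remember the returned request only when one is returned."""
    request = manager.save(state_dict, iteration, is_async=is_async)
    if request is not None:
        pending.append(request)
```

The caller keeps the pending list and waits on its entries (here with Future.result()) before treating those checkpoints as complete.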

exception nvidia_resiliency_ext.checkpointing.local.ckpt_managers.base_manager.CheckpointingException

Bases: Exception

Base exception for checkpointing-related errors.

exception nvidia_resiliency_ext.checkpointing.local.ckpt_managers.base_manager.SameMachineReplicationException(ckpt_id)

Bases: CheckpointingException

Exception raised when an attempt is made to overwrite a file during replication.

Inherits from CheckpointingException.
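
Because of this hierarchy, a handler for CheckpointingException also catches SameMachineReplicationException, so the more specific handler has to come first. The sketch below shows that ordering with local mirror classes defined only for illustration. The message text and the ckpt_id attribute are assumptions, not the library's behavior, and describe_failure is a hypothetical helper.

```python
class CheckpointingException(Exception):
    """Local mirror of the documented base exception (illustration only)."""


class SameMachineReplicationException(CheckpointingException):
    """Local mirror; the message text and ckpt_id attribute are assumptions."""

    def __init__(self, ckpt_id):
        super().__init__(f"file conflict during replication of checkpoint {ckpt_id}")
        self.ckpt_id = ckpt_id


def describe_failure(action):
    """Run action() and report which level of the hierarchy handled the error."""
    try:
        action()
    except SameMachineReplicationException as exc:
        # Most specific handler first; otherwise the base handler would take it.
        return f"replication conflict: {exc.ckpt_id}"
    except CheckpointingException:
        return "other checkpointing failure"
    return "ok"
```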