LocalCheckpointManager

A basic manager for local checkpoints.

class nvidia_resiliency_ext.checkpointing.local.ckpt_managers.local_manager.LocalCheckpointManager(root_local_ckpt_dir, session_id='', repl_strategy=None)

Bases: BaseCheckpointManager

Local Checkpoint Manager designed for handling checkpoints on local storage devices like SSDs or RAM disks.

Parameters:
  • root_local_ckpt_dir (str, Path) – root checkpoint directory on local storage. Checkpoints from different iterations can be saved in the same root directory because each has a unique name.

  • session_id (str, optional) – additional identifier that distinguishes local checkpoints belonging to different training workloads. For example, the cluster administrator may configure root_local_ckpt_dir (e.g. /tmp/…) while the end user sets session_id to tell their own local checkpoints apart.

  • repl_strategy (ReplicationStrategy, optional) – strategy used to replicate local checkpoint shards.
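
As a minimal sketch of how the three parameters above fit together, the snippet below collects them into keyword arguments for the constructor. The helper name manager_kwargs, the path /mnt/ramdisk/ckpt, and the session id experiment_a are hypothetical and chosen only for illustration. The constructor call itself is left as a comment, because it needs the package installed and possibly further runtime setup that this page does not describe.

```python
from pathlib import Path


def manager_kwargs(root_local_ckpt_dir, session_id=""):
    """Collect the documented constructor arguments into a dict (hypothetical helper)."""
    return {
        # Accepts str or Path, as documented; normalized to Path here.
        "root_local_ckpt_dir": Path(root_local_ckpt_dir),
        # Empty string is the documented default.
        "session_id": session_id,
        # None is the documented default for the replication strategy.
        "repl_strategy": None,
    }


# Hypothetical values: a root chosen by the administrator, a session id chosen by the user.
kwargs = manager_kwargs("/mnt/ramdisk/ckpt", session_id="experiment_a")

# With the package available, the manager would be created from the documented path:
#   from nvidia_resiliency_ext.checkpointing.local.ckpt_managers.local_manager import (
#       LocalCheckpointManager,
#   )
#   manager = LocalCheckpointManager(**kwargs)
```

Keeping root_local_ckpt_dir and session_id as separate arguments mirrors the split described above: the storage location can stay fixed per cluster while each workload supplies its own identifier.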

property local_ckpt_dir