nvidia-resiliency-ext

API documentation

  • PTL Callback support
  • BaseCheckpointManager
    • BaseCheckpointManager
    • CheckpointingException
    • SameMachineReplicationException
  • LocalCheckpointManager
    • LocalCheckpointManager
  • Replication
    • CliqueReplicationStrategy
    • LazyCliqueReplicationStrategy
    • LazyReplicationStrategyBuilder
    • NoReplicasAvailableError
    • ReplicationStrategy
  • BaseTensorAwareStateDict
    • TensorAwareStateDict
  • BasicTensorAwareStateDict
    • BasicTensorAwareStateDict

© Copyright 2024, NVIDIA Corporation.
