nvidia-resiliency-ext

Documentation contents:

  • Fault Tolerance
  • Inprocess Restart
  • Async Checkpointing
    • Usage guide
    • API documentation
      • Asynchronous Checkpoint Core Utilities
      • Asynchronous FileSystemWriter Implementation
      • Asynchronous Pytorch Distributed Checkpoint save with optimized FileSystemWriter
      • Asynchronous PyTorch torch.save with the Core utility
    • Examples
  • Local Checkpointing
  • Straggler Detection
nvidia-resiliency-ext
  • Async Checkpointing
  • API documentation
  • View page source

API documentation

API documentation

  • Asynchronous Checkpoint Core Utilities
    • AsyncCaller
    • AsyncCallsQueue
    • AsyncRequest
    • PersistentAsyncCaller
    • TemporalAsyncCaller
  • Asynchronous FileSystemWriter Implementation
    • FileSystemWriterAsync
  • Asynchronous Pytorch Distributed Checkpoint save with optimized FileSystemWriter
    • CheckpointMetadataCache
    • get_metadata_caching_status()
    • init_checkpoint_metadata_cache()
    • save_state_dict_async_finalize()
    • save_state_dict_async_plan()
    • verify_global_md_reuse()
  • Asynchronous PyTorch torch.save with the Core utility
    • TorchAsyncCheckpoint
Previous Next

© Copyright 2024, NVIDIA Corporation.

Built with Sphinx using a theme provided by Read the Docs.