nvidia-resiliency-ext

Documentation contents:

  • Fault Tolerance
  • Inprocess Restart
  • Async Checkpointing
  • Local Checkpointing
  • Straggler Detection
nvidia-resiliency-ext
  • Overview: module code

All modules for which code is available

  • nvidia_resiliency_ext.checkpointing.async_ckpt.core
  • nvidia_resiliency_ext.checkpointing.async_ckpt.filesystem_async
  • nvidia_resiliency_ext.checkpointing.async_ckpt.state_dict_saver
  • nvidia_resiliency_ext.checkpointing.async_ckpt.torch_ckpt
  • nvidia_resiliency_ext.checkpointing.local.base_state_dict
  • nvidia_resiliency_ext.checkpointing.local.basic_state_dict
  • nvidia_resiliency_ext.checkpointing.local.ckpt_managers.base_manager
  • nvidia_resiliency_ext.checkpointing.local.ckpt_managers.local_manager
  • nvidia_resiliency_ext.checkpointing.local.replication.strategies
  • nvidia_resiliency_ext.fault_tolerance.config
  • nvidia_resiliency_ext.fault_tolerance.rank_monitor_client
  • nvidia_resiliency_ext.inprocess.exception
  • nvidia_resiliency_ext.inprocess.finalize
  • nvidia_resiliency_ext.inprocess.health_check
  • nvidia_resiliency_ext.inprocess.initialize
  • nvidia_resiliency_ext.straggler.reporting
  • nvidia_resiliency_ext.straggler.statistics
  • nvidia_resiliency_ext.straggler.straggler

© Copyright 2024, NVIDIA Corporation.

Built with Sphinx using a theme provided by Read the Docs.