nvidia-resiliency-ext


API documentation

  • Config
    • FaultToleranceConfig
      • FaultToleranceConfig.from_args()
      • FaultToleranceConfig.from_kwargs()
      • FaultToleranceConfig.from_yaml_file()
      • FaultToleranceConfig.to_yaml_file()
  • Client
    • RankMonitorClient
      • RankMonitorClient.calculate_and_set_hb_timeouts()
      • RankMonitorClient.calculate_and_set_section_timeouts()
      • RankMonitorClient.end_all_sections()
      • RankMonitorClient.end_section()
      • RankMonitorClient.init_workload_monitoring()
      • RankMonitorClient.load_state_dict()
      • RankMonitorClient.send_heartbeat()
      • RankMonitorClient.send_workload_control_request()
      • RankMonitorClient.shutdown_workload_monitoring()
      • RankMonitorClient.start_section()
      • RankMonitorClient.state_dict()
    • RankMonitorClientError
  • Server
    • RankMonitorLogger
    • RankMonitorServer
  • Callback
    • FaultToleranceCallback
      • FaultToleranceCallback.on_exception()
      • FaultToleranceCallback.on_load_checkpoint()
      • FaultToleranceCallback.on_save_checkpoint()
      • FaultToleranceCallback.on_train_batch_end()
      • FaultToleranceCallback.on_train_end()
      • FaultToleranceCallback.on_train_start()
      • FaultToleranceCallback.on_validation_batch_end()
      • FaultToleranceCallback.on_validation_end()
      • FaultToleranceCallback.on_validation_start()
      • FaultToleranceCallback.setup()
      • FaultToleranceCallback.teardown()

© Copyright 2024, NVIDIA Corporation.
