nvidia-resiliency-ext

Documentation contents:

  • Fault Tolerance
    • Usage guide
    • Integration Guides
    • API documentation
      • Config
      • Client
      • Server
      • Callback
    • Examples
  • Inprocess Restart
  • Async Checkpointing
  • Local Checkpointing
  • Straggler Detection
nvidia-resiliency-ext
  • Fault Tolerance
  • API documentation
  • View page source

API documentation

API documentation

  • Config
    • FaultToleranceConfig
      • FaultToleranceConfig.from_args()
      • FaultToleranceConfig.from_kwargs()
      • FaultToleranceConfig.from_yaml_file()
      • FaultToleranceConfig.to_yaml_file()
  • Client
    • RankMonitorClient
      • RankMonitorClient.calculate_and_set_hb_timeouts()
      • RankMonitorClient.calculate_and_set_section_timeouts()
      • RankMonitorClient.end_all_sections()
      • RankMonitorClient.end_section()
      • RankMonitorClient.init_workload_monitoring()
      • RankMonitorClient.load_state_dict()
      • RankMonitorClient.send_heartbeat()
      • RankMonitorClient.send_workload_control_request()
      • RankMonitorClient.shutdown_workload_monitoring()
      • RankMonitorClient.start_section()
      • RankMonitorClient.state_dict()
    • RankMonitorClientError
  • Server
  • Callback
Previous Next

© Copyright 2024, NVIDIA Corporation.

Built with Sphinx using a theme provided by Read the Docs.