API documentation
- Config
- Client
  - RankMonitorClient
    - RankMonitorClient.calculate_and_set_hb_timeouts()
    - RankMonitorClient.calculate_and_set_section_timeouts()
    - RankMonitorClient.end_all_sections()
    - RankMonitorClient.end_section()
    - RankMonitorClient.init_workload_monitoring()
    - RankMonitorClient.load_state_dict()
    - RankMonitorClient.send_heartbeat()
    - RankMonitorClient.send_workload_control_request()
    - RankMonitorClient.shutdown_workload_monitoring()
    - RankMonitorClient.start_section()
    - RankMonitorClient.state_dict()
  - RankMonitorClientError
- Server
- Callback
  - FaultToleranceCallback
    - FaultToleranceCallback.on_exception()
    - FaultToleranceCallback.on_load_checkpoint()
    - FaultToleranceCallback.on_save_checkpoint()
    - FaultToleranceCallback.on_train_batch_end()
    - FaultToleranceCallback.on_train_end()
    - FaultToleranceCallback.on_train_start()
    - FaultToleranceCallback.on_validation_batch_end()
    - FaultToleranceCallback.on_validation_end()
    - FaultToleranceCallback.on_validation_start()
    - FaultToleranceCallback.setup()
    - FaultToleranceCallback.teardown()