API documentation
- Config
- Client
  - RankMonitorClient
    - RankMonitorClient.calculate_and_set_hb_timeouts()
    - RankMonitorClient.calculate_and_set_section_timeouts()
    - RankMonitorClient.end_all_sections()
    - RankMonitorClient.end_section()
    - RankMonitorClient.init_workload_monitoring()
    - RankMonitorClient.load_state_dict()
    - RankMonitorClient.send_heartbeat()
    - RankMonitorClient.send_workload_control_request()
    - RankMonitorClient.shutdown_workload_monitoring()
    - RankMonitorClient.start_section()
    - RankMonitorClient.state_dict()
  - RankMonitorClientError
- Server
- Callback
  - FaultToleranceCallback
    - FaultToleranceCallback.on_exception()
    - FaultToleranceCallback.on_load_checkpoint()
    - FaultToleranceCallback.on_save_checkpoint()
    - FaultToleranceCallback.on_train_batch_end()
    - FaultToleranceCallback.on_train_end()
    - FaultToleranceCallback.on_train_start()
    - FaultToleranceCallback.on_validation_batch_end()
    - FaultToleranceCallback.on_validation_end()
    - FaultToleranceCallback.on_validation_start()
    - FaultToleranceCallback.setup()
    - FaultToleranceCallback.teardown()