Callback
- class nvidia_resiliency_ext.ptl_resiliency.straggler_det_callback.StragglerDetectionCallback(report_time_interval, calc_relative_gpu_perf, calc_individual_gpu_perf, num_gpu_perf_scores_to_print, gpu_relative_perf_threshold, gpu_individual_perf_threshold, stop_if_detected, enable_ptl_logging, profiling_interval=1, logger_name='nemo_logger.StragglerDetectionCallback')
Bases:
Callback
Initialize straggler detection callback instance.
- Parameters:
report_time_interval (float) – Interval, in seconds, between straggler checks
calc_relative_gpu_perf (bool) – Calculate relative GPU performance
calc_individual_gpu_perf (bool) – Calculate individual GPU performance
num_gpu_perf_scores_to_print (int) – Number of best and worst performance scores to print. If 0, scores are not printed periodically and are printed only when stragglers are detected
gpu_relative_perf_threshold (float) – Threshold for relative GPU performance scores
gpu_individual_perf_threshold (float) – Threshold for individual GPU performance scores
stop_if_detected (bool) – Set to True to terminate the workload if stragglers are detected
enable_ptl_logging (bool) – Set to True to log GPU performance scores to all PTL loggers enabled through the trainer
profiling_interval (int) – profiling_interval passed to straggler.Detector.initialize. Defaults to 1.
logger_name (Optional[str], optional) – Name of the logger to use. Defaults to “nemo_logger.StragglerDetectionCallback”.
- Raises:
ValueError – If an invalid configuration is provided.
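The parameters above can be collected into a keyword-argument dictionary and passed to the constructor. The following is a minimal sketch, not a definitive setup: the numeric values are illustrative assumptions rather than documented defaults, and `STRAGGLER_KWARGS`, `merged_kwargs`, and `make_straggler_callback` are hypothetical helper names introduced here. The import is deferred so the snippet loads even where `nvidia_resiliency_ext` is not installed.

```python
# Illustrative values only (assumptions); tune them for your workload.
STRAGGLER_KWARGS = dict(
    report_time_interval=300.0,         # seconds between straggler checks
    calc_relative_gpu_perf=True,
    calc_individual_gpu_perf=True,
    num_gpu_perf_scores_to_print=3,     # 0 prints only when stragglers are detected
    gpu_relative_perf_threshold=0.7,
    gpu_individual_perf_threshold=0.7,
    stop_if_detected=False,             # True terminates the workload on detection
    enable_ptl_logging=True,
)


def merged_kwargs(**overrides):
    """Return a copy of the base arguments with any overrides applied."""
    return {**STRAGGLER_KWARGS, **overrides}


def make_straggler_callback(**overrides):
    """Hypothetical helper: build the callback, importing the package lazily."""
    from nvidia_resiliency_ext.ptl_resiliency.straggler_det_callback import (
        StragglerDetectionCallback,
    )

    # May raise ValueError if the configuration is invalid (see "Raises" above).
    return StragglerDetectionCallback(**merged_kwargs(**overrides))
```

Assuming a PyTorch Lightning setup, the returned object would typically be added to the trainer's `callbacks` list, for example `callbacks=[make_straggler_callback(stop_if_detected=True)]`.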