Callback

class nvidia_resiliency_ext.ptl_resiliency.straggler_det_callback.StragglerDetectionCallback(report_time_interval, calc_relative_gpu_perf, calc_individual_gpu_perf, num_gpu_perf_scores_to_print, gpu_relative_perf_threshold, gpu_individual_perf_threshold, stop_if_detected, enable_ptl_logging, profiling_interval=1, logger_name='nemo_logger.StragglerDetectionCallback')

Bases: Callback

Initialize straggler detection callback instance.

Parameters:

report_time_interval (float) – Interval, in seconds, between straggler checks
calc_relative_gpu_perf (bool) – Calculate relative GPU performance
calc_individual_gpu_perf (bool) – Calculate individual GPU performance
num_gpu_perf_scores_to_print (int) – Number of best and worst performance scores to print. If 0, scores are not printed periodically, only when stragglers are detected
gpu_relative_perf_threshold (float) – Threshold for relative GPU performance scores
gpu_individual_perf_threshold (float) – Threshold for individual GPU performance scores
stop_if_detected (bool) – Set to True to terminate the workload if stragglers are detected
enable_ptl_logging (bool) – Set to True to log GPU performance scores to all PTL loggers enabled through the trainer
profiling_interval (int) – Value passed as profiling_interval to straggler.Detector.initialize. Defaults to 1.
logger_name (Optional[str], optional) – Name of the logger to use. Defaults to "nemo_logger.StragglerDetectionCallback".

Raises:

ValueError – If an invalid configuration is provided.
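As a minimal sketch of how the constructor parameters above fit together, the following collects them into a keyword dictionary and builds the callback from it. The numeric values are illustrative assumptions, not recommended settings, and make_straggler_callback is a hypothetical helper, not part of the library. The import is deferred so that the snippet loads even where the package is not installed.

```python
# Keyword arguments mirror the constructor signature documented above.
# All values are illustrative assumptions, not recommended settings.
STRAGGLER_KWARGS = {
    "report_time_interval": 300.0,  # check for stragglers every 300 seconds
    "calc_relative_gpu_perf": True,
    "calc_individual_gpu_perf": True,
    "num_gpu_perf_scores_to_print": 3,
    "gpu_relative_perf_threshold": 0.7,
    "gpu_individual_perf_threshold": 0.7,
    "stop_if_detected": False,  # report stragglers but keep training
    "enable_ptl_logging": True,
}


def make_straggler_callback(**overrides):
    """Hypothetical helper: build the callback from STRAGGLER_KWARGS plus overrides."""
    # Imported lazily so this module loads even where the package is not installed.
    from nvidia_resiliency_ext.ptl_resiliency.straggler_det_callback import (
        StragglerDetectionCallback,
    )

    return StragglerDetectionCallback(**{**STRAGGLER_KWARGS, **overrides})
```

The returned object would then typically go into the callbacks list of a PyTorch Lightning Trainer, for example Trainer(callbacks=[make_straggler_callback(stop_if_detected=True)]).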

on_train_batch_end(trainer, pl_module, outputs, batch, batch_idx): Called when the train batch ends.

Note

The value outputs["loss"] here is the loss returned from training_step, normalized with respect to accumulate_grad_batches.
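To make the note concrete, here is a small arithmetic sketch. It assumes that "normalized" means the loss is divided by accumulate_grad_batches. The function name normalized_loss is hypothetical and used for illustration only.

```python
def normalized_loss(training_step_loss: float, accumulate_grad_batches: int) -> float:
    """Illustrative only: assumes normalization divides by accumulate_grad_batches."""
    if accumulate_grad_batches < 1:
        raise ValueError("accumulate_grad_batches must be >= 1")
    return training_step_loss / accumulate_grad_batches


# A training_step loss of 2.0 with gradients accumulated over 4 batches
# would be seen by the hook as 0.5 under this assumption.
example = normalized_loss(2.0, 4)
```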

setup(trainer, pl_module, stage): Called when fit, validate, test, predict, or tune begins.

teardown(trainer, pl_module, stage): Called when fit, validate, test, predict, or tune ends.