Usage guide

Straggler Detection Package Design Overview

Package users can mark sections of their code as “straggler detection sections”. Timing information is collected for such sections, which includes CPU (wall-clock) time spent in the section. Additionally, CUDA kernels launched in the detection sections can be benchmarked.

Based on the timings collected, the following performance scores are computed:
  • For each detection section, there is a per-rank performance score based on the measured CPU wall-clock time spent in the section.

  • [Optionally] All CUDA kernel timing information from all monitored sections is aggregated into a single per-rank GPU performance score.

Performance scores are scalar values from 0.0 (worst) to 1.0 (best), reflecting each rank’s performance. A performance score can be interpreted as the ratio of current performance to reference performance.

Depending on the reference used, there are two types of performance scores:
  • Relative performance score: The best-performing rank in the workload is used as a reference.

  • Individual performance score: The best historical performance of the rank is used as a reference.

Examples:
  • If the relative performance score is 0.5, the rank took twice as long as the fastest rank.

  • If the individual performance score is 0.5, the rank took twice as long as its own best run.
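The score arithmetic can be illustrated with a short sketch. The helpers below are hypothetical (not part of the package, which computes scores internally): a relative score divides the fastest rank's section time by a given rank's time, and an individual score divides a rank's best historical time by its current time.

```python
# Hypothetical helpers illustrating the score arithmetic described above;
# the package computes these scores internally.

def relative_scores(section_times):
    """Per-rank relative scores: fastest rank's time / this rank's time."""
    best = min(section_times.values())
    return {rank: best / t for rank, t in section_times.items()}

def individual_score(best_historical_time, current_time):
    """Individual score: rank's best historical time / current time."""
    return best_historical_time / current_time

# Wall-clock seconds spent in a section, per rank (made-up numbers).
times = {0: 1.0, 1: 2.0, 2: 1.25}
scores = relative_scores(times)   # rank 1 took twice as long -> score 0.5
score = individual_score(1.0, 2.0)  # twice as slow as its best -> 0.5
```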

With 8 ranks and two detection sections, “data_loading” and “fwd”, the scores might look like this:

Relative section performance scores:
{
'data_loading': {0: 0.961, 1: 0.919, 2: 0.941, 3: 0.999, 4: 1.0, 5: 0.847, 6: 0.591, 7: 0.788},
'fwd': {0: 0.994, 1: 0.949, 2: 0.994, 3: 0.998, 4: 1.0, 5: 0.947, 6: 0.991, 7: 0.988}
}
Relative GPU performance scores:
{
0: 0.703, 1: 0.978, 2: 0.689, 3: 0.663, 4: 0.673, 5: 0.925, 6: 0.720, 7: 0.737
}

Note

GPU performance scores are not related to any particular section. Individual scores have the same structure as relative scores.

If the performance score drops below the specified threshold, the corresponding section or GPU is identified as a straggler. To detect the stragglers, users should call the reporting function periodically and check the returned report object for stragglers.
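The threshold check can be sketched as a simple filter over a per-rank score dict shaped like the examples above. The `find_stragglers` helper below is hypothetical; in practice the package's report object exposes `identify_stragglers()` for this purpose.

```python
# Hypothetical sketch of threshold-based straggler detection on a
# per-rank score dict; the package's report object provides
# identify_stragglers() for the real check.

def find_stragglers(scores, threshold=0.7):
    """Return the ranks whose performance score fell below the threshold."""
    return sorted(rank for rank, score in scores.items() if score < threshold)

gpu_scores = {0: 0.703, 1: 0.978, 2: 0.689, 3: 0.663}
stragglers = find_stragglers(gpu_scores, threshold=0.7)  # ranks 2 and 3
```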

Straggler Detection Usage Overview

Sections of the user code that should be monitored for performance (“detection sections”) can be defined using:
  • Detector.detection_section context manager that allows wrapping a custom code block.

  • Detector.wrap_callables method that wraps given callables within Detector.detection_section.

Both methods yield the same results. Users can choose the API that is most convenient for their needs.

Note

If a detection section is monitored for CPU performance, it is assumed to perform a similar amount of work on all ranks (e.g., if rank 0 did a significant amount of additional work, rank 0 would be flagged as a straggler). Captured CUDA kernels can differ between ranks, but some subset should be common to all ranks. The simplest approach is often to wrap the model’s forward call.

Using Detector.detection_section:

import nvidia_resiliency_ext.straggler as straggler

straggler.Detector.initialize(
   scores_to_compute=["relative_perf_scores", "individual_perf_scores"],
   gather_on_rank0=True,  # all ranks' results will be available on rank 0
)

for training_iter in range(num_iters):

   ...

   with straggler.Detector.detection_section("data_loading", profile_cuda=False):
      data, labels = data_loader.get_next_batch()

   with straggler.Detector.detection_section("fwd", profile_cuda=True):
      out = model(data)
   ...

   if (training_iter % STRAGGLER_REPORT_INTERVAL) == 0:
      report = straggler.Detector.generate_report()
      if rank == 0:
         stragglers = report.identify_stragglers()
         maybe_notify_user(stragglers)

...

straggler.Detector.shutdown()

Using Detector.wrap_callables:

import nvidia_resiliency_ext.straggler as straggler

straggler.Detector.initialize(
   scores_to_compute=["relative_perf_scores", "individual_perf_scores"],
   gather_on_rank0=True,  # all ranks' results will be available on rank 0
)

straggler.Detector.wrap_callables(
   callable_ids=[
      straggler.CallableId(DataLoaderClass, "get_next_batch", ignored_args=("self",)),
      straggler.CallableId(ModelClass, "forward", ignored_args=("self",)),
   ]
)

for training_iter in range(num_iters):

   ...

   data, labels = data_loader.get_next_batch()
   out = model(data)

   ...

   if (training_iter % STRAGGLER_REPORT_INTERVAL) == 0:
      report = straggler.Detector.generate_report()
      if rank == 0:
         stragglers = report.identify_stragglers()
         maybe_notify_user(stragglers)

...

straggler.Detector.shutdown()

Alternative Reporting

Besides calling Detector.generate_report after a fixed number of iterations tuned for a particular workload, users can call Detector.generate_report_if_interval_elapsed on each training step. A report is then generated approximately once per time period specified via Detector.initialize(report_time_interval=...).
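The time-gating behavior can be sketched with a small hypothetical class: a report is produced only once the configured interval has elapsed since the last one. This is a simplification; the real method also performs the actual report generation and any inter-rank logic.

```python
import time

# Hypothetical sketch of the interval gating behind
# generate_report_if_interval_elapsed; simplified for illustration.
class IntervalGate:
    def __init__(self, interval_sec):
        self.interval_sec = interval_sec
        self.last_report_time = time.monotonic()

    def should_report(self, now=None):
        """True (and reset the timer) only if the interval has elapsed."""
        now = time.monotonic() if now is None else now
        if now - self.last_report_time >= self.interval_sec:
            self.last_report_time = now
            return True
        return False

gate = IntervalGate(interval_sec=300.0)
# Called every training step; reports roughly every 300 seconds.
ready = gate.should_report()
```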

Note

Straggler detection might involve inter-rank synchronization and should be invoked with reasonable frequency (e.g., every few minutes).

import nvidia_resiliency_ext.straggler as straggler

straggler.Detector.initialize(report_time_interval=300, ...) # Report every 5 minutes

for training_iter in range(num_iters):

   ...

   # During each training iteration
   report = straggler.Detector.generate_report_if_interval_elapsed()
   if report is not None:
      handle_report(report)

   # Note: straggler.Detector.generate_report() works as usual, users can alternatively use:
   # if (iter_idx % REPORT_INTERVAL) == 0:
   #     straggler.Detector.generate_report()

...

straggler.Detector.shutdown()

Reducing the Overhead

CUDA kernel profiling adds a small step-time overhead that depends on the workload but is generally expected to be below 1%. If needed, the overhead can be reduced with Detector.initialize(profiling_interval=...): if profiling_interval is greater than 1, only a fraction of section runs is profiled.
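Assuming profiling_interval=k means roughly every k-th execution of a section is profiled (a simplification of the package's actual sampling), the effect on overhead can be sketched as:

```python
# Sketch only: assumes every k-th section run is profiled, which is a
# simplification of the package's actual sampling behavior.

def profiled_runs(num_runs, profiling_interval):
    """Indices of section runs that would incur profiling overhead."""
    return [i for i in range(num_runs) if i % profiling_interval == 0]

# With profiling_interval=4, only ~25% of section runs are profiled.
runs = profiled_runs(12, 4)  # [0, 4, 8]
```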

Detector.generate_report() overhead depends on the Detector.initialize parameters:
  • Relative performance score calculation (scores_to_compute=["relative_perf_scores", ...]) involves sharing some data between ranks.

  • If gather_on_rank0=True, results from all ranks are gathered on rank 0.

Hence, the following initialization parameters can be used to avoid any inter-rank synchronization:

straggler.Detector.initialize(
   scores_to_compute=["individual_perf_scores"],
   gather_on_rank0=False # each rank will report its own results
)

In that case, all ranks compute and report their own, individual results only.

Straggler Detection Integration Guide

This section describes integration with a PyTorch Lightning (PTL) based workload (e.g., NeMo) using StragglerDetectionCallback. All that is needed is to include StragglerDetectionCallback in the PTL trainer's callbacks.

import pytorch_lightning as pl

from nvidia_resiliency_ext.ptl_resiliency import StragglerDetectionCallback

straggler_cb_args = dict(
   report_time_interval=300.0,
   calc_relative_gpu_perf=True,
   calc_individual_gpu_perf=True,
   num_gpu_perf_scores_to_print=3,
   gpu_relative_perf_threshold=0.7,
   gpu_individual_perf_threshold=0.7,
   stop_if_detected=False,
   enable_ptl_logging=False,
   logger_name="test_logger",
)

straggler_det_cb = StragglerDetectionCallback(**straggler_cb_args)

trainer = pl.Trainer(
   ...
   callbacks=[..., straggler_det_cb],
)

StragglerDetectionCallback initialization parameters:

StragglerDetectionCallback.__init__(report_time_interval, calc_relative_gpu_perf, calc_individual_gpu_perf, num_gpu_perf_scores_to_print, gpu_relative_perf_threshold, gpu_individual_perf_threshold, stop_if_detected, enable_ptl_logging, profiling_interval=1, logger_name='nemo_logger.StragglerDetectionCallback')[source]

Initialize straggler detection callback instance.

Parameters:
  • report_time_interval (float) – Interval [seconds] of the straggler check

  • calc_relative_gpu_perf (bool) – Calculate relative GPU performance

  • calc_individual_gpu_perf (bool) – Calculate individual GPU performance

  • num_gpu_perf_scores_to_print (int) – How many best and worst performance scores to print (0 means scores are not printed periodically, only when stragglers are detected)

  • gpu_relative_perf_threshold (float) – Threshold for relative GPU performance scores

  • gpu_individual_perf_threshold (float) – Threshold for individual GPU performance scores

  • stop_if_detected (bool) – Set to True to terminate the workload if stragglers are detected

  • enable_ptl_logging (bool) – Set to True to log GPU performance scores to all PTL loggers enabled through the trainer

  • profiling_interval (int) – profiling_interval passed to straggler.Detector.initialize. Defaults to 1.

  • logger_name (Optional[str], optional) – Defaults to “nemo_logger.StragglerDetectionCallback”.

Raises:

ValueError – If an invalid config was provided.