Usage guide
Straggler Detection Package Design Overview
Package users can mark sections of their code as “straggler detection sections”. Timing information is collected for such sections, which includes CPU (wall-clock) time spent in the section. Additionally, CUDA kernels launched in the detection sections can be benchmarked.
Based on the timings collected, the following performance scores are computed:
- For each detection section, there is a per-rank performance score based on the measured CPU wall-clock time spent in the section.
- [Optionally] All CUDA kernel timing information from all monitored sections is aggregated into a single per-rank GPU performance score.
Performance scores are scalar values from 0.0 (worst) to 1.0 (best), reflecting each rank’s performance. A performance score can be interpreted as the ratio of current performance to reference performance.
Depending on the reference used, there are two types of performance scores:
- Relative performance score: The best-performing rank in the workload is used as a reference.
- Individual performance score: The best historical performance of the rank is used as a reference.
Examples:
- If the relative performance score is 0.5, it means that a rank is twice as slow as the fastest rank.
- If the individual performance score is 0.5, it means that a rank is twice as slow as its best performance.
- If there are 8 ranks and 2 sections “data_loading” and “fwd” defined, scores might look like this:
Relative section performance scores:
{
'data_loading': {0: 0.961, 1: 0.919, 2: 0.941, 3: 0.999, 4: 1.0, 5: 0.847, 6: 0.591, 7: 0.788},
'fwd': {0: 0.994, 1: 0.949, 2: 0.994, 3: 0.998, 4: 1.0, 5: 0.947, 6: 0.991, 7: 0.988}
}
Relative GPU performance scores:
{
0: 0.703, 1: 0.978, 2: 0.689, 3: 0.663, 4: 0.673, 5: 0.925, 6: 0.720, 7: 0.737
}
# NOTE: GPU performance scores are not related to any particular section.
#
# Individual scores have the same structure as relative scores.
#
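To make the "ratio of current performance to reference performance" interpretation concrete, here is a minimal sketch. It assumes a score is simply the reference time divided by the measured time; the package may aggregate timings differently (for example with medians or per-kernel weighting), and the function names and timings are hypothetical.

```python
def relative_scores(section_times):
    """Score each rank against the fastest rank (the lowest time) in the workload."""
    best = min(section_times.values())
    return {rank: best / elapsed for rank, elapsed in section_times.items()}


def individual_score(current_time, best_historical_time):
    """Score a rank against its own best (lowest) historical time, capped at 1.0."""
    return min(1.0, best_historical_time / current_time)


# Hypothetical per-rank wall-clock times (seconds) for a single section.
times = {0: 0.10, 1: 0.20, 2: 0.125}
rel = relative_scores(times)

# Rank 1 takes twice as long as the fastest rank, so its relative score is 0.5.
print(rel)
# A rank that now needs 2.0 s but once needed 1.0 s has an individual score of 0.5.
print(individual_score(current_time=2.0, best_historical_time=1.0))
```

Under this assumption, a rank that takes twice as long as the reference gets a score of 0.5, which matches the examples above.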
If the performance score drops below the specified threshold, the corresponding section or GPU is identified as a straggler. To detect the stragglers, users should call the reporting function periodically and check the returned report object for stragglers.
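The thresholding step can be illustrated on the example scores shown earlier. This is only a sketch of the idea in plain Python, not the package's reporting code; the helper name and the threshold value of 0.7 are assumptions chosen for illustration.

```python
# Example scores copied from the sample output above.
relative_section_scores = {
    "data_loading": {0: 0.961, 1: 0.919, 2: 0.941, 3: 0.999, 4: 1.0, 5: 0.847, 6: 0.591, 7: 0.788},
    "fwd": {0: 0.994, 1: 0.949, 2: 0.994, 3: 0.998, 4: 1.0, 5: 0.947, 6: 0.991, 7: 0.988},
}
relative_gpu_scores = {0: 0.703, 1: 0.978, 2: 0.689, 3: 0.663, 4: 0.673, 5: 0.925, 6: 0.720, 7: 0.737}

THRESHOLD = 0.7  # hypothetical threshold, chosen for this illustration only


def below_threshold(scores, threshold):
    """Return the sorted ranks whose score is strictly below the threshold."""
    return sorted(rank for rank, score in scores.items() if score < threshold)


section_stragglers = {
    name: below_threshold(scores, THRESHOLD)
    for name, scores in relative_section_scores.items()
}
gpu_stragglers = below_threshold(relative_gpu_scores, THRESHOLD)

print(section_stragglers)  # rank 6 is slow in "data_loading"; no rank is slow in "fwd"
print(gpu_stragglers)      # ranks 2, 3 and 4 fall below 0.7
```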
Straggler Detection Usage Overview
Sections of the user code that should be monitored for performance (“detection sections”) can be defined using:
- Detector.detection_section: a context manager that allows wrapping a custom code block.
- Detector.wrap_callables: a method that wraps the given callables within Detector.detection_section.
Both methods yield the same results. Users can choose the API that is most convenient for their needs.
Note
If a detection section is monitored for CPU performance, the section is expected to perform a similar amount of work on all ranks
(e.g., if rank 0 performed a significant amount of additional work, rank 0 would be identified as a straggler).
Captured CUDA kernels can differ between ranks, but there should be some subset common to all ranks.
The simplest approach might be to wrap the model’s forward call.
Using Detector.detection_section:
import nvidia_resiliency_ext.straggler as straggler

straggler.Detector.initialize(
    scores_to_compute=["relative_perf_scores", "individual_perf_scores"],
    gather_on_rank0=True  # all ranks' results will be available on rank 0
)

# num_iters, data_loader, model, rank, STRAGGLER_REPORT_INTERVAL and
# maybe_notify_user are placeholders supplied by the user's training script.
for training_iter in range(num_iters):
    ...
    with straggler.Detector.detection_section("data_loading", profile_cuda=False):
        data, labels = data_loader.get_next_batch()
    with straggler.Detector.detection_section("fwd", profile_cuda=True):
        out = model(data)
    ...
    if (training_iter % STRAGGLER_REPORT_INTERVAL) == 0:
        report = straggler.Detector.generate_report()
        if rank == 0:
            stragglers = report.identify_stragglers()
            maybe_notify_user(stragglers)
    ...

straggler.Detector.shutdown()
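The maybe_notify_user placeholder in the loop above is left to the user. One possible shape is sketched below; it assumes, purely for illustration, that the value returned by report.identify_stragglers() behaves like a mapping from a category name to a collection of straggler identifiers. Check the actual return type before relying on this, since the category names used in the usage lines are made up.

```python
import logging

logger = logging.getLogger("straggler_demo")  # hypothetical logger name


def maybe_notify_user(stragglers):
    """Log one warning per non-empty straggler category.

    `stragglers` is assumed to be a mapping: category name -> collection of ids.
    Returns the number of categories that contained at least one straggler.
    """
    flagged = {kind: found for kind, found in stragglers.items() if found}
    for kind, found in flagged.items():
        logger.warning("Stragglers detected (%s): %s", kind, sorted(found, key=str))
    return len(flagged)


# Usage with made-up category names and rank ids:
count = maybe_notify_user({"slow_gpus": {3, 5}, "slow_sections": set()})
print(count)
```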
Using Detector.wrap_callables:
import nvidia_resiliency_ext.straggler as straggler

straggler.Detector.initialize(
    scores_to_compute=["relative_perf_scores", "individual_perf_scores"],
    gather_on_rank0=True  # all ranks' results will be available on rank 0
)

# DataLoaderClass and ModelClass are placeholders for the user's own classes.
straggler.Detector.wrap_callables(
    callable_ids=[
        straggler.CallableId(DataLoaderClass, "get_next_batch", ignored_args=("self",)),
        straggler.CallableId(ModelClass, "forward", ignored_args=("self",)),
    ]
)

for training_iter in range(num_iters):
    ...
    data, labels = data_loader.get_next_batch()
    out = model(data)
    ...
    if (training_iter % STRAGGLER_REPORT_INTERVAL) == 0:
        report = straggler.Detector.generate_report()
        if rank == 0:
            stragglers = report.identify_stragglers()
            maybe_notify_user(stragglers)
    ...

straggler.Detector.shutdown()
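To see why wrapping callables and using a context manager can yield the same measurements, the following sketch replaces a method on a class with a wrapper that runs the original inside a timing context manager. It is a generic Python illustration, not the package's implementation; timed_section, wrap_method, timings and ToyModel are all hypothetical names.

```python
import functools
import time
from contextlib import contextmanager

timings = {}  # section name -> list of elapsed wall-clock seconds


@contextmanager
def timed_section(name):
    """Record the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.setdefault(name, []).append(time.perf_counter() - start)


def wrap_method(cls, method_name):
    """Replace cls.method_name with a version that runs inside timed_section."""
    original = getattr(cls, method_name)
    section_name = f"{cls.__name__}.{method_name}"

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        with timed_section(section_name):
            return original(*args, **kwargs)

    setattr(cls, method_name, wrapper)


class ToyModel:
    def forward(self, x):
        return x * 2


wrap_method(ToyModel, "forward")
result = ToyModel().forward(21)  # the call site is unchanged; timing happens in the wrapper
print(result, timings)
```

The appeal of the wrapping approach is that the training loop itself does not need to change, which can be convenient when the loop lives in third-party code.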
Alternative Reporting
Besides calling Detector.generate_report after a fixed number of iterations tuned for a particular workload, users can call Detector.generate_report_if_interval_elapsed at each training step.
A report is then generated approximately once per time period specified through Detector.initialize(report_time_interval=...).
Note
Straggler detection might involve inter-rank synchronization and should be invoked with reasonable frequency (e.g., every few minutes).
import nvidia_resiliency_ext.straggler as straggler

straggler.Detector.initialize(report_time_interval=300, ...)  # Report every 5 minutes

for training_iter in range(num_iters):
    ...
    # During each training iteration
    report = straggler.Detector.generate_report_if_interval_elapsed()
    if report is not None:
        handle_report(report)

    # Note: straggler.Detector.generate_report() works as usual, users can alternatively use:
    # if (training_iter % REPORT_INTERVAL) == 0:
    #     straggler.Detector.generate_report()
    ...

straggler.Detector.shutdown()
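The "call on every step, report only when the interval has elapsed" pattern can be sketched in a single process as follows. This is an illustration of the gating logic only: the real method may also coordinate between ranks, and IntervalGate, its method name and the injectable clock are hypothetical.

```python
import time


class IntervalGate:
    """Run a callback at most once per `interval_s` seconds."""

    def __init__(self, interval_s, clock=time.monotonic):
        self.interval_s = interval_s
        self.clock = clock
        self.last = clock()  # time of the last report (or of initialization)

    def generate_if_elapsed(self, make_report):
        """Return make_report() if the interval has elapsed, otherwise None."""
        now = self.clock()
        if now - self.last < self.interval_s:
            return None
        self.last = now
        return make_report()


# Drive the gate with a fake clock so the example runs instantly.
fake_now = [0.0]
gate = IntervalGate(300, clock=lambda: fake_now[0])

results = []
for step_time in (100.0, 299.0, 300.0, 450.0, 600.0):
    fake_now[0] = step_time
    results.append(gate.generate_if_elapsed(lambda: f"report@{fake_now[0]:.0f}"))

print(results)  # reports are produced only at t=300 and t=600
```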
Reducing the Overhead
CUDA kernel profiling imposes a small step-time overhead that depends on the workload but is generally expected to be <1%.
If needed, the amount of overhead can be reduced with Detector.initialize(profiling_interval=...).
If profiling_interval is > 1, only a fraction of section runs are profiled.
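One way to picture profiling "only a fraction of section runs" is a counter that enables profiling on every N-th entry into a section. The sketch below assumes that profiling_interval=N means roughly one in N runs is profiled; the text above does not state the exact sampling rule, and SampledProfiler is a hypothetical class.

```python
class SampledProfiler:
    """Decide, per section run, whether that run should be profiled."""

    def __init__(self, profiling_interval=1):
        if profiling_interval < 1:
            raise ValueError("profiling_interval must be >= 1")
        self.profiling_interval = profiling_interval
        self.run_count = 0

    def should_profile(self):
        """Return True on every profiling_interval-th run, starting with the first."""
        profile = (self.run_count % self.profiling_interval) == 0
        self.run_count += 1
        return profile


profiler = SampledProfiler(profiling_interval=4)
decisions = [profiler.should_profile() for _ in range(8)]
print(decisions)  # 2 of the 8 runs are profiled
```

With an interval of 1 every run is profiled, which corresponds to the default behaviour described for profiling_interval.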
Detector.generate_report() overhead depends on the Detector.initialize parameters:
- Relative performance score calculation (scores_to_compute=["relative_perf_scores", ...]) involves sharing some data between ranks.
- If gather_on_rank0=True, results from all ranks are gathered on rank 0.
Hence, the following initialization parameters can be used to avoid any inter-rank synchronization:
straggler.Detector.initialize(
    scores_to_compute=["individual_perf_scores"],
    gather_on_rank0=False  # each rank will report its own results
)
In that case, each rank computes and reports only its own individual results.
Straggler Detection Integration Guide
This section describes integration with a PyTorch Lightning (PTL) based workload (e.g., NeMo) using StragglerDetectionCallback.
All that is needed is to include StragglerDetectionCallback in the PTL trainer callbacks.
import pytorch_lightning as pl  # assumption: adjust to your Lightning install (e.g., lightning.pytorch)
from nvidia_resiliency_ext.ptl_resiliency import StragglerDetectionCallback

straggler_cb_args = dict(
    report_time_interval=300.0,
    calc_relative_gpu_perf=True,
    calc_individual_gpu_perf=True,
    num_gpu_perf_scores_to_print=3,  # name as given in the __init__ signature
    gpu_relative_perf_threshold=0.7,
    gpu_individual_perf_threshold=0.7,
    stop_if_detected=False,
    enable_ptl_logging=True,  # listed without a default in the __init__ signature
    logger_name="test_logger",
)
straggler_det_cb = StragglerDetectionCallback(**straggler_cb_args)

trainer = pl.Trainer(
    ...
    callbacks=[..., straggler_det_cb],
)
StragglerDetectionCallback initialization parameters:
- StragglerDetectionCallback.__init__(report_time_interval, calc_relative_gpu_perf, calc_individual_gpu_perf, num_gpu_perf_scores_to_print, gpu_relative_perf_threshold, gpu_individual_perf_threshold, stop_if_detected, enable_ptl_logging, profiling_interval=1, logger_name='nemo_logger.StragglerDetectionCallback')
Initialize straggler detection callback instance.
- Parameters:
report_time_interval (float) – Interval, in seconds, between straggler checks
calc_relative_gpu_perf (bool) – Calculate relative GPU performance
calc_individual_gpu_perf (bool) – Calculate individual GPU performance
num_gpu_perf_scores_to_print (int) – Number of best and worst performance scores to print (if 0, scores are not printed periodically, only when stragglers are detected)
gpu_relative_perf_threshold (float) – Threshold for relative GPU performance scores
gpu_individual_perf_threshold (float) – Threshold for individual GPU performance scores
stop_if_detected (bool) – Set to True to terminate the workload if stragglers are detected
enable_ptl_logging (bool) – Set to True to log GPU performance scores to all PTL loggers enabled through the trainer
profiling_interval (int) – profiling_interval passed to straggler.Detector.initialize. Defaults to 1.
logger_name (Optional[str], optional) – Defaults to “nemo_logger.StragglerDetectionCallback”.
- Raises:
ValueError – If invalid config was provided.