Reporting
- class nvidia_resiliency_ext.straggler.reporting.Report(gpu_relative_perf_scores, section_relative_perf_scores, gpu_individual_perf_scores, section_individual_perf_scores, rank_to_node, local_section_summaries, local_kernel_summaries, generate_report_elapsed_time, gather_on_rank0, rank)[source]
Bases:
object
There are two types of performance scores provided in this report:
- Relative performance scores (0…1) that compare the performance of the current rank to the best performing rank.
- Individual performance scores (0…1) that compare the performance of the current rank to the rank’s past performance.
Relative scores need inter-rank synchronization to be computed, while individual scores can be computed on each rank separately. All performance scores can be interpreted as a ratio of: current performance/reference performance.
For example:
- If the relative performance score is 0.5, it means that the current rank is 2x slower than the best performing rank.
- If the individual performance score is 0.5, it means that the current rank is 2x slower than its best performance.
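The ratio interpretation above can be sketched in a few lines. This is only an illustration of the arithmetic, not part of the library: `slowdown_factor` is a hypothetical helper name, and it assumes the score is strictly positive so that the reciprocal is defined.

```python
def slowdown_factor(perf_score: float) -> float:
    """Convert a performance score in (0, 1] into a slowdown factor.

    A score is current performance / reference performance, so the
    slowdown relative to the reference is its reciprocal.
    (Hypothetical helper, not provided by the library.)
    """
    if not 0.0 < perf_score <= 1.0:
        raise ValueError(f"expected a score in (0, 1], got {perf_score}")
    return 1.0 / perf_score


# A score of 0.5 corresponds to the "2x slower" case described above.
for score in (1.0, 0.8, 0.5):
    print(f"score={score:.2f} -> slowdown factor {slowdown_factor(score):.2f}")
```

The same conversion applies to both score types; only the reference differs (the best performing rank for relative scores, the rank’s own best performance for individual scores).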
- If gather_on_rank0=True: the *_perf_scores fields contain results for all ranks (only on rank 0; otherwise undefined).
- If gather_on_rank0=False: the *_perf_scores fields contain only the current rank’s results.
Containers can be empty if there are no results.
Attributes:
- gpu_relative_perf_scores: Rank -> GPU relative performance score.
- section_relative_perf_scores: Section name -> (Rank -> Section relative performance score).
- gpu_individual_perf_scores: Rank -> GPU individual performance score.
- section_individual_perf_scores: Section name -> (Rank -> Section individual performance score).
- rank_to_node: Rank -> Node name, an auxiliary mapping useful for results reporting.
- local_section_summaries: Local (i.e., this rank’s) timing stats for each user-defined section/context.
- local_kernel_summaries: Local (i.e., this rank’s) timing stats for each captured CUDA kernel.
- generate_report_elapsed_time: How long it took to generate this report.
- gather_on_rank0: Mode for results gathering.
- rank: Rank of the current report. None if the report does not correspond to a single rank (i.e., gather_on_rank0=True).
- Parameters:
- identify_stragglers(gpu_rel_threshold=0.75, section_rel_threshold=0.75, gpu_indiv_threshold=0.75, section_indiv_threshold=0.75)[source]
Identify the ranks with straggler GPUs based on performance thresholds.
- Parameters:
gpu_rel_threshold (float) – The threshold for relative GPU performance scores. Default is 0.75.
section_rel_threshold (float) – The threshold for relative section performance scores. Default is 0.75.
gpu_indiv_threshold (float) – The threshold for individual GPU performance scores. Default is 0.75.
section_indiv_threshold (float) – The threshold for individual section performance scores. Default is 0.75.
- Returns:
‘straggler_gpus_relative’ (Set[StragglerId]): Stragglers with relative GPU performance scores below the threshold
‘straggler_gpus_individual’ (Set[StragglerId]): Stragglers with individual GPU performance scores below the threshold
‘straggler_sections_relative’ (Dict[str, Set[StragglerId]]): Sections with ranks having relative performance scores below the threshold
‘straggler_sections_individual’ (Dict[str, Set[StragglerId]]): Sections with ranks having individual performance scores below the threshold
- Return type:
Dict with string keys
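The thresholding described above can be illustrated with plain dictionaries. This is a minimal sketch, not the library’s implementation: `StragglerIdSketch` and `ranks_below_threshold` are hypothetical names, the fields of the stand-in identifier are an assumption, and the strict `<` comparison is inferred from the phrase “below the threshold”. The scores are made up.

```python
from typing import Dict, NamedTuple, Set


class StragglerIdSketch(NamedTuple):
    """Hypothetical stand-in for the library's StragglerId (fields assumed)."""
    rank: int
    node: str


def ranks_below_threshold(
    scores: Dict[int, float],
    rank_to_node: Dict[int, str],
    threshold: float = 0.75,
) -> Set[StragglerIdSketch]:
    """Return an identifier for every rank whose score is below the threshold."""
    return {
        StragglerIdSketch(rank, rank_to_node.get(rank, "<unknown>"))
        for rank, score in scores.items()
        if score < threshold  # assumption: strictly below, per the description
    }


# Made-up data shaped like gpu_relative_perf_scores (rank -> score).
gpu_relative_perf_scores = {0: 1.0, 1: 0.97, 2: 0.61, 3: 0.75}
rank_to_node = {0: "node-a", 1: "node-a", 2: "node-b", 3: "node-b"}

stragglers = ranks_below_threshold(gpu_relative_perf_scores, rank_to_node)
print(sorted(stragglers))
```

With the default threshold of 0.75, only rank 2 is flagged in this sketch; rank 3 sits exactly at the threshold and is not below it.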
- class nvidia_resiliency_ext.straggler.reporting.ReportGenerator(scores_to_compute, gather_on_rank0=True, pg=None, node_name='<notset>')[source]
Bases:
object
This class is responsible for generating the performance report, based on the section and kernel summaries (section/kernel name -> stats).
It is guaranteed that .generate_report is called on all ranks, so any required data synchronization can be done here.
- Parameters:
scores_to_compute (list) – List of scores to compute, e.g. [‘relative_perf_scores’, ‘individual_perf_scores’]
gather_on_rank0 (bool) – If True, the report on rank 0 includes results for all ranks, and other ranks receive no report (generate_report returns None there). If False, the report on any rank contains just the results for that particular rank.
pg – Process group for communication
node_name – User-friendly name of the current node, used in reports
- generate_report(section_summaries, kernel_summaries)[source]
Generate a performance report based on the given summaries.
All ranks need to call .generate_report, as there can be synchronization between ranks.
- If gather_on_rank0 is True:
The report on rank 0 includes results for all ranks.
On all other ranks the return value is None.
- If gather_on_rank0 is False:
The report on any rank contains just the results for that particular rank.
- Parameters:
- Returns:
A Report object containing the performance scores and summaries, or None if gather_on_rank0 is True and the current rank is not 0
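Because the return value can be None on ranks other than 0 when gather_on_rank0 is True, calling code needs to handle both cases. The following is a sketch under stated assumptions: `describe_report` is a hypothetical helper, and a `SimpleNamespace` with made-up values stands in for a real Report so the example runs without the library or a distributed job. Only attribute names documented above are used.

```python
from types import SimpleNamespace
from typing import Any, List, Optional


def describe_report(report: Optional[Any]) -> List[str]:
    """Format GPU relative scores from a report, tolerating a None report."""
    if report is None:
        # gather_on_rank0=True and this is not rank 0: nothing to show here.
        return []
    lines = []
    for rank, score in sorted(report.gpu_relative_perf_scores.items()):
        node = report.rank_to_node.get(rank, "<unknown>")
        lines.append(f"rank {rank} on {node}: relative GPU score {score:.2f}")
    return lines


# Stand-in for a gathered rank-0 report; a real one would be returned
# by generate_report. All values below are invented for illustration.
fake_report = SimpleNamespace(
    gpu_relative_perf_scores={1: 0.62, 0: 1.0},
    rank_to_node={0: "node-a", 1: "node-b"},
    gather_on_rank0=True,
    rank=None,
)

for line in describe_report(fake_report):
    print(line)
print(describe_report(None))
```

Checking for None before touching any attribute keeps the same code path usable on every rank, regardless of the gathering mode.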