Reporting
- class nvidia_resiliency_ext.straggler.reporting.Report(gpu_relative_perf_scores, section_relative_perf_scores, gpu_individual_perf_scores, section_individual_perf_scores, rank_to_node, local_section_summaries, local_kernel_summaries, generate_report_elapsed_time, gather_on_rank0, rank)[source]
Bases:
object
- There are 2 types of the performance scores provided in this report:
Relative performance scores (0…1) that compare the performance of the current rank to the best performing rank.
Individual performance scores (0…1) that compare the performance of the current rank to the rank’s past performance.
Relative scores need inter-rank synchronization to be computed, while individual scores can be computed on each rank separately. All performance scores can be interpreted as a ratio of: current performance/reference performance. For example:
If the relative performance score is 0.5, it means that the current rank is 2x slower than the best performing rank.
If the individual performance score is 0.5, it means that the current rank is 2x slower than its best performance.
If gather_on_rank0=True: *_perf_scores fields contain results for all ranks (only on rank0, otherwise undefined). If gather_on_rank0=False: *_perf_scores fields contain only current rank results.
Containers can be empty if there are no results.
gpu_relative_perf_scores: Rank -> GPU relative performance score.
section_relative_perf_scores: Section name -> (Rank -> Section relative performance score).
gpu_individual_perf_scores: Rank -> GPU individual performance score.
section_individual_perf_scores: Section name -> (Rank -> Section individual performance score).
rank_to_node: Rank -> Node name, aux mapping useful for results reporting.
local_section_summaries: Local (e.g. this rank) timing stats for each user defined section/context.
local_kernel_summaries: Local (e.g. this rank) timing stats for each captured CUDA kernel.
generate_report_elapsed_time: How long did it take to generate this report.
gather_on_rank0 : Mode for results gather, if gather_on_rank0 = False,
rank: Rank of the current report. None if not corresponding to any rank (i.e. gather_on_rank0=True).
- Parameters:
- identify_stragglers(gpu_rel_threshold=0.75, section_rel_threshold=0.75, gpu_indiv_threshold=0.75, section_indiv_threshold=0.75)[source]
Identify the ranks with straggler GPUs based on performance thresholds.
- Parameters:
gpu_rel_threshold (float) – The threshold for relative GPU performance scores to identify stragglers. Default is 0.75
section_rel_threshold (float) – The threshold for relative sections performance scores. Default is 0.75
gpu_indiv_threshold (float) – The threshold for individual GPU performance scores. Default is 0.75
section_indiv_threshold (float) – The threshold for individual section performance scores. Default is 0.75
- Returns:
‘straggler_gpus_relative’ (Set[StragglerId]): Stragglers with relative GPU performance scores below the threshold
’straggler_gpus_individual’ (Set[StragglerId]): Stragglers with individual GPU performance scores below the threshold
’straggler_sections_relative’ (Dict[str, Set[StragglerId]]): Sections with ranks having relative performance scores below the threshold
’straggler_sections_individual’ (Dict[str, Set[StragglerId]]): Sections with ranks having individual performance scores below the threshold
- Return type:
Dict with string keys
- class nvidia_resiliency_ext.straggler.reporting.ReportGenerator(scores_to_compute, gather_on_rank0=True, pg=None, node_name='<notset>')[source]
Bases:
object
This class is responsible for generating the performance report, based on the section and kernel summaries (section/kernel name -> stats).
It is guaranteed that .generate_report is called on all ranks, so any required data synchronization can be done here.
- Parameters:
scores_to_compute (list) – List of scores to compute, e.g. [‘relative_perf_scores’, ‘individual_perf_scores’]
gather_on_rank0 (bool) – If True, report on rank 0 includes results for all ranks, reports on other ranks are empty If False, report on any rank contains just the results for that particular rank
pg – Process group for communication
node_name – User-friendly name of the current node, that will be used in reports
- generate_report(section_summaries, kernel_summaries)[source]
Generate a performance report based on the given summaries.
All ranks need to call .generate_report, as there can be synchronization between ranks.
- If gather_on_rank0 is True:
The report on rank 0 includes results for all ranks.
On all other ranks the return value is None.
- If gather_on_rank0 is False:
The report on any rank contains just the results for that particular rank.
- Parameters:
- Returns:
A Report object containing the performance scores and summaries, or None if gather_on_rank0 is True and the current rank is not 0