Reporting

class nvidia_resiliency_ext.straggler.reporting.Report(gpu_relative_perf_scores, section_relative_perf_scores, gpu_individual_perf_scores, section_individual_perf_scores, rank_to_node, local_section_summaries, local_kernel_summaries, generate_report_elapsed_time, gather_on_rank0, rank)[source]

Bases: object

There are 2 types of the performance scores provided in this report:
  • Relative performance scores (0…1) that compare the performance of the current rank to the best performing rank.

  • Individual performance scores (0…1) that compare the performance of the current rank to the rank’s past performance.

Relative scores need inter-rank synchronization to be computed, while individual scores can be computed on each rank separately. All performance scores can be interpreted as a ratio of: current performance/reference performance. For example:

  • If the relative performance score is 0.5, it means that the current rank is 2x slower than the best performing rank.

  • If the individual performance score is 0.5, it means that the current rank is 2x slower than its best performance.

If gather_on_rank0=True: *_perf_scores fields contain results for all ranks (only on rank0, otherwise undefined). If gather_on_rank0=False: *_perf_scores fields contain only current rank results.

Containers can be empty if there are no results.

  • gpu_relative_perf_scores: Rank -> GPU relative performance score.

  • section_relative_perf_scores: Section name -> (Rank -> Section relative performance score).

  • gpu_individual_perf_scores: Rank -> GPU individual performance score.

  • section_individual_perf_scores: Section name -> (Rank -> Section individual performance score).

  • rank_to_node: Rank -> Node name, aux mapping useful for results reporting.

  • local_section_summaries: Local (e.g. this rank) timing stats for each user defined section/context.

  • local_kernel_summaries: Local (e.g. this rank) timing stats for each captured CUDA kernel.

  • generate_report_elapsed_time: How long did it take to generate this report.

  • gather_on_rank0 : Mode for results gather, if gather_on_rank0 = False,

  • rank: Rank of the current report. None if not corresponding to any rank (i.e. gather_on_rank0=True).

Parameters:
identify_stragglers(gpu_rel_threshold=0.75, section_rel_threshold=0.75, gpu_indiv_threshold=0.75, section_indiv_threshold=0.75)[source]

Identify the ranks with straggler GPUs based on performance thresholds.

Parameters:
  • gpu_rel_threshold (float) – The threshold for relative GPU performance scores to identify stragglers. Default is 0.75

  • section_rel_threshold (float) – The threshold for relative sections performance scores. Default is 0.75

  • gpu_indiv_threshold (float) – The threshold for individual GPU performance scores. Default is 0.75

  • section_indiv_threshold (float) – The threshold for individual section performance scores. Default is 0.75

Returns:

  • ‘straggler_gpus_relative’ (Set[StragglerId]): Stragglers with relative GPU performance scores below the threshold

  • ’straggler_gpus_individual’ (Set[StragglerId]): Stragglers with individual GPU performance scores below the threshold

  • ’straggler_sections_relative’ (Dict[str, Set[StragglerId]]): Sections with ranks having relative performance scores below the threshold

  • ’straggler_sections_individual’ (Dict[str, Set[StragglerId]]): Sections with ranks having individual performance scores below the threshold

Return type:

Dict with string keys

class nvidia_resiliency_ext.straggler.reporting.ReportGenerator(scores_to_compute, gather_on_rank0=True, pg=None, node_name='<notset>')[source]

Bases: object

This class is responsible for generating the performance report, based on the section and kernel summaries (section/kernel name -> stats).

It is guaranteed that .generate_report is called on all ranks, so any required data synchronization can be done here.

Parameters:
  • scores_to_compute (list) – List of scores to compute, e.g. [‘relative_perf_scores’, ‘individual_perf_scores’]

  • gather_on_rank0 (bool) – If True, report on rank 0 includes results for all ranks, reports on other ranks are empty If False, report on any rank contains just the results for that particular rank

  • pg – Process group for communication

  • node_name – User-friendly name of the current node, that will be used in reports

generate_report(section_summaries, kernel_summaries)[source]

Generate a performance report based on the given summaries.

All ranks need to call .generate_report, as there can be synchronization between ranks.

If gather_on_rank0 is True:
  • The report on rank 0 includes results for all ranks.

  • On all other ranks the return value is None.

If gather_on_rank0 is False:
  • The report on any rank contains just the results for that particular rank.

Parameters:
  • section_summaries (dict) – Timing stats collected for user-defined sections on the current rank.

  • kernel_summaries (dict) – Timing stats collected for captured CUDA kernels on the current rank.

Returns:

A Report object containing the performance scores and summaries, or None if gather_on_rank0 is True and the current rank is not 0

class nvidia_resiliency_ext.straggler.reporting.StragglerId(rank, node)[source]

Bases: object

Straggler identity

Parameters: