Straggler

class nvidia_resiliency_ext.straggler.straggler.CallableId(obj, name, arg_filter_fn=None, extra_args_fn=None, ignored_args=None)[source]

Bases: object

Represents a unique identifier for a callable object.

Parameters:
obj

The object that contains the callable

Type:

object

name

The name of the callable

Type:

str

__str__()[source]

Returns a string representation of the CallableId, which includes the name of the object and the name of the callable.

class nvidia_resiliency_ext.straggler.straggler.CustomSection(name, location, total_entry_cnt=0, max_elapseds_len=8192, cpu_elapsed_times=<factory>)[source]

Bases: object

CustomSection represents a user-defined section of code (Detector.detection_section).

CPU execution time is computed for each section.

If CUDA profiling is enabled for the section, kernels launched in the section will be profiled with CUPTI. All kernel profiling results are collected by the CUPTI extension.

class nvidia_resiliency_ext.straggler.straggler.Detector[source]

Bases: object

Main class for straggler detection. The Detector class uses class methods and is not intended to be instantiated.

initialized

Class-level attribute to track initialization.

Type:

bool

scores_to_compute

List of scores to compute; can include ‘relative_perf_scores’ and ‘individual_perf_scores’.

Type:

list

gather_on_rank0

If True, the report produced by generate_report on rank 0 includes results for all ranks, and reports on other ranks are empty. If False, the report produced by generate_report on any rank contains only the results for that rank.

Type:

bool

profiling_interval

Profile each profiling_interval-th section entry. Defaults to 1.

Type:

int, optional

report_time_interval

Interval in seconds for generate_report_if_interval_elapsed. Defaults to 60.

Type:

float, optional

custom_sections

Dict for recording CustomSection objects.

Type:

dict

cupti_manager

CuptiManager manages the use of CUPTI methods and calculates timing statistics.

Type:

CuptiManager

reporter

ReportGenerator handles result parsing and the performance scoring algorithms.

Type:

ReportGenerator

report_interval_tracker

ReportIntervalTracker is used to synchronize report_time_interval between ranks.

Type:

ReportIntervalTracker

original_callables

Type:

dict, optional

classmethod detection_section(name=None, profile_cuda=True)[source]

Context manager for monitoring user-defined sections of code.

NOTE: The profiling_interval parameter of Detector.initialize determines how frequently sections are monitored. It can be set greater than 1 to reduce the profiling overhead.

Parameters:
  • name (str, optional) – Section name used in reports. Must be unique within the user code. Defaults to None, in which case the detection_section entry location (with …) (file path and line) is used as the section name.

  • profile_cuda (bool, optional) – If True, CUDA kernels launched under this section will be captured and used to compute the rank’s “GPU performance score”. Defaults to True.
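The interaction between a detection section and profiling_interval can be illustrated with a small pure-Python sketch. This is not the library's implementation: SectionTimer and its section method are hypothetical names, CUDA profiling is omitted, and which entries are sampled (the 1st, the (1+N)-th, and so on) is an assumption made for illustration.

```python
import time
from collections import defaultdict, deque
from contextlib import contextmanager


class SectionTimer:
    """Hypothetical stand-in for a section monitor; not the library's Detector."""

    def __init__(self, profiling_interval=1, max_elapseds_len=8192):
        self.profiling_interval = profiling_interval
        self.entry_counts = defaultdict(int)
        # Bounded buffers so that long runs do not grow memory without limit.
        self.cpu_elapsed_times = defaultdict(lambda: deque(maxlen=max_elapseds_len))

    @contextmanager
    def section(self, name):
        self.entry_counts[name] += 1
        # Assumption: sample entries 1, 1+N, 1+2N, ... for N = profiling_interval.
        sampled = (self.entry_counts[name] - 1) % self.profiling_interval == 0
        start = time.perf_counter() if sampled else None
        try:
            yield
        finally:
            if sampled:
                self.cpu_elapsed_times[name].append(time.perf_counter() - start)


timer = SectionTimer(profiling_interval=3)
for _ in range(7):
    with timer.section("forward"):
        sum(range(1000))  # placeholder workload
```

With profiling_interval=3, the section is entered seven times but timed only three times, which is how a larger interval trades measurement density for lower overhead.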

classmethod generate_report()[source]

Calls the ReportGenerator.generate_report method and resets the recorded results.

classmethod generate_report_if_interval_elapsed()[source]

Calls the ReportGenerator.generate_report method if the reporting interval has elapsed. Intended to be called during each training iteration on every rank. Whether the interval has elapsed is synchronized between ranks through ReportIntervalTracker. Returns None if the interval has not elapsed; otherwise returns the value returned by generate_report.
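A training loop that follows this calling pattern might look like the sketch below. The train helper and FakeDetector are hypothetical: FakeDetector only mimics the two class methods used here and "reports" on every third call, whereas the real Detector decides based on elapsed time.

```python
from contextlib import contextmanager


def train(detector, num_iters, step_fn):
    """Run step_fn inside a detection section and collect periodic reports."""
    reports = []
    for _ in range(num_iters):
        with detector.detection_section("train_step", profile_cuda=True):
            step_fn()
        # Called on every iteration; returns None until the interval elapses.
        report = detector.generate_report_if_interval_elapsed()
        if report is not None:
            reports.append(report)
    return reports


class FakeDetector:
    """Hypothetical stand-in so the loop can run without the library or a GPU."""

    calls = 0

    @classmethod
    @contextmanager
    def detection_section(cls, name=None, profile_cuda=True):
        yield

    @classmethod
    def generate_report_if_interval_elapsed(cls):
        cls.calls += 1
        # Count-based only for this dry run; the real check is time-based.
        return {"call": cls.calls} if cls.calls % 3 == 0 else None


reports = train(FakeDetector, num_iters=7, step_fn=lambda: None)
```

With the real library, the same function would presumably be passed straggler.Detector (imported from nvidia_resiliency_ext.straggler, assuming the module path shown in the class signatures) after a call to Detector.initialize, with Detector.shutdown called once training ends.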

classmethod initialize(scores_to_compute='all', gather_on_rank0=True, profiling_interval=1, report_time_interval=60, node_name=None)[source]
Parameters:
  • scores_to_compute (list|str, optional) – List of scores to compute; can include ‘relative_perf_scores’ and ‘individual_perf_scores’. Alternatively, the string “all”, meaning that all scores should be computed. Defaults to “all”.

  • gather_on_rank0 (bool, optional) – If True, the report produced by generate_report on rank 0 includes results for all ranks, and reports on other ranks are empty. If False, the report produced by generate_report on any rank contains only the results for that rank. Defaults to True.

  • profiling_interval (int, optional) – Profile each profiling_interval-th section entry. Defaults to 1.

  • report_time_interval (float, optional) – Interval in seconds for generate_report_if_interval_elapsed. Defaults to 60.

  • node_name (str, optional) – User-friendly name of the current node to be used in reports. If None, socket.gethostname will be used.
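The effect of gather_on_rank0 on what each rank sees can be modeled with a few lines of plain Python. This is only a model of the semantics described above: report_contents is a hypothetical helper, and the real report object and its notion of "empty" may differ.

```python
def report_contents(per_rank_results, rank, gather_on_rank0=True):
    """Return the rank -> results mapping that a report on `rank` would carry."""
    if gather_on_rank0:
        # Rank 0 sees everything; every other rank gets an empty report.
        return dict(per_rank_results) if rank == 0 else {}
    # Without gathering, each rank sees only its own results.
    return {rank: per_rank_results[rank]}


results = {0: "r0", 1: "r1", 2: "r2"}  # placeholder per-rank results
on_rank0 = report_contents(results, rank=0)
on_rank1 = report_contents(results, rank=1)
local_rank1 = report_contents(results, rank=1, gather_on_rank0=False)
```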

classmethod is_interval_elapsed()[source]

Returns a boolean flag that is True if the interval elapsed during the previous generate_report_if_interval_elapsed call, and False otherwise.

Return type:

bool

classmethod restore_original_callables()[source]

Restores the original callable objects after the cls._build_wrapper method was used to wrap profiled callables.

classmethod shutdown()[source]

Shuts down the Detector.

classmethod wrap_callables(callable_ids, profile_cuda=True)[source]

Each time fn (fn = getattr(callable_id.obj, callable_id.name)) is called, it will do the following:

    with straggler.Detector.detection_section(str(callable_id)):
        fn(*args, **kwargs)

Parameters:
  • callable_ids (List[CallableId]) – List of callables to wrap with the detection context.

  • profile_cuda (bool, optional) – If True, CUDA kernels launched under this section will be captured and used to compute the rank’s “GPU performance score”. Defaults to True.
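The wrap-and-restore pattern described for wrap_callables and restore_original_callables can be sketched in plain Python as follows. Everything here is a hypothetical stand-in: detection_section only records the section name, and the "ClassName.method" naming is an assumption, not the documented output of CallableId.__str__.

```python
import functools
from contextlib import contextmanager

entered = []              # names of the sections that were entered
original_callables = []   # (obj, name, fn) triples saved for restoration


@contextmanager
def detection_section(name):
    # Stand-in for straggler.Detector.detection_section; records the name only.
    entered.append(name)
    yield


def wrap_callable(obj, name):
    """Replace obj.<name> with a wrapper that runs it inside a section."""
    fn = getattr(obj, name)
    original_callables.append((obj, name, fn))

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Assumed naming scheme, for illustration only.
        with detection_section(f"{type(obj).__name__}.{name}"):
            return fn(*args, **kwargs)

    setattr(obj, name, wrapper)


def restore_original_callables():
    """Put the saved callables back and forget them."""
    for obj, name, fn in original_callables:
        setattr(obj, name, fn)
    original_callables.clear()


class Model:
    def forward(self, x):
        return x * 2


model = Model()
wrap_callable(model, "forward")
result_wrapped = model.forward(21)   # runs inside a section
restore_original_callables()
result_restored = model.forward(1)   # runs without a section
```

Saving the original callable before replacing it is what makes restoration possible without the caller having to keep its own reference.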