Straggler
- class nvidia_resiliency_ext.straggler.straggler.CallableId(obj, name, arg_filter_fn=None, extra_args_fn=None, ignored_args=None)[source]
Bases:
object
Represents a unique identifier for a callable object.
- Parameters:
obj (object)
name (str)
arg_filter_fn (Callable[[BoundArguments], bool] | None)
extra_args_fn (Callable[[BoundArguments], dict] | None)
- class nvidia_resiliency_ext.straggler.straggler.CustomSection(name, location, total_entry_cnt=0, max_elapseds_len=8192, cpu_elapsed_times=<factory>)[source]
Bases:
object
CustomSection represents a user-defined section of code (Detector.detection_section).
CPU execution time is computed for each section.
If CUDA profiling is enabled for the section, kernels launched in the section will be profiled with CUPTI. All kernel profiling results are collected by the CUPTI extension.
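The per-section bookkeeping described above can be pictured with a small stand-in. This is a minimal sketch, not the library class: `SectionSketch` and its `record` method are hypothetical names, and it assumes that `max_elapseds_len` caps how many CPU timing samples are retained, which the field name suggests but this page does not confirm. CUDA/CUPTI profiling is not modeled.

```python
import time
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SectionSketch:
    """Hypothetical stand-in for CustomSection; not the library class."""

    name: str
    location: str
    total_entry_cnt: int = 0
    max_elapseds_len: int = 8192
    cpu_elapsed_times: deque = field(default_factory=deque)

    def __post_init__(self):
        # Assumption: only the most recent max_elapseds_len samples are kept.
        self.cpu_elapsed_times = deque(
            self.cpu_elapsed_times, maxlen=self.max_elapseds_len
        )

    def record(self, elapsed_s: float) -> None:
        # Count every entry and store its CPU wall-clock duration.
        self.total_entry_cnt += 1
        self.cpu_elapsed_times.append(elapsed_s)


section = SectionSketch(name="fwd", location="train.py:42", max_elapseds_len=3)
for _ in range(5):
    start = time.perf_counter()
    sum(range(1000))  # placeholder workload
    section.record(time.perf_counter() - start)
```

With a cap of 3, the sketch counts all five entries but keeps only the last three timings.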
- class nvidia_resiliency_ext.straggler.straggler.Detector[source]
Bases:
object
Main class for straggler detection. The Detector class uses class methods and is not intended to be instantiated.
- scores_to_compute
List of scores to compute; can include ‘relative_perf_scores’ and ‘individual_perf_scores’.
- Type:
list
- gather_on_rank0
If True, when .generate_report is called, the report on rank 0 includes results for all ranks and the reports on other ranks are empty. If False, when .generate_report is called, the report on any rank contains just the results for that particular rank.
- Type:
bool
- profiling_interval
Profile each profiling_interval-th section entry. Defaults to 1.
- Type:
int, optional
- report_time_interval
Interval in seconds for generate_report_if_interval_elapsed. Defaults to 60.
- Type:
float, optional
- cupti_manager
CuptiManager is used for managing CUPTI methods and calculating timing statistics.
- Type:
CuptiManager
- reporter
ReportGenerator is used for result parsing and performance scoring algorithms.
- Type:
ReportGenerator
- report_interval_tracker
ReportIntervalTracker is used to synchronize report_time_interval between ranks.
- Type:
ReportIntervalTracker
- classmethod detection_section(name=None, profile_cuda=True)[source]
Context manager for monitoring user-defined sections of code.
- NOTE: The profiling_interval parameter of Detector.initialize determines how frequently
sections are monitored. It can be > 1 to reduce the profiling overhead.
- Parameters:
name (str, optional) – Section name used in reports. Must be unique per user code. Defaults to None, in which case the detection_section entry location (with …) (file path and line) is used as the section name.
profile_cuda (bool, optional) – If True, CUDA kernels launched under this section will be captured and used to compute the rank “GPU performance score”. Defaults to True.
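To illustrate how a section context manager and profiling_interval interact, here is a minimal CPU-only sketch. It is not the library's implementation: `section_sketch`, `entry_counts`, and `cpu_times` are hypothetical names, and which entries get sampled (the first, then every N-th after it) is an assumption. CUDA kernel capture is left out entirely.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

PROFILING_INTERVAL = 2  # mirrors the documented profiling_interval parameter
entry_counts = defaultdict(int)  # section name -> number of entries
cpu_times = defaultdict(list)    # section name -> sampled CPU durations


@contextmanager
def section_sketch(name):
    """Hypothetical stand-in for Detector.detection_section (CPU timing only)."""
    entry_counts[name] += 1
    # Assumption: entries 1, 1+N, 1+2N, ... are the ones that get profiled.
    profiled = (entry_counts[name] - 1) % PROFILING_INTERVAL == 0
    start = time.perf_counter() if profiled else None
    try:
        yield
    finally:
        if profiled:
            cpu_times[name].append(time.perf_counter() - start)


for _ in range(6):
    with section_sketch("forward"):
        sum(range(1000))  # placeholder for the monitored code
```

With an interval of 2, six entries yield three timing samples, which is the overhead reduction the note above refers to.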
- classmethod generate_report()[source]
Calls the ReportGenerator.generate_report method and resets the recorded results.
- classmethod generate_report_if_interval_elapsed()[source]
Calls the ReportGenerator.generate_report method if the reporting interval has elapsed. Intended to be called during each training iteration on every rank. Whether the reporting interval has elapsed is synchronized between ranks through ReportIntervalTracker. Returns None if the interval has not elapsed. Otherwise, the generate_report return value is returned.
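The "return None until the interval elapses" contract can be sketched in a single process as follows. This is an illustration only: `IntervalReporterSketch` is a hypothetical class, the injected clock exists just to make the example deterministic, and the cross-rank synchronization that ReportIntervalTracker provides is not modeled.

```python
class IntervalReporterSketch:
    """Hypothetical single-process stand-in; the real tracker syncs ranks."""

    def __init__(self, report_time_interval, generate_report, clock):
        self.interval = report_time_interval
        self.generate_report = generate_report
        self.clock = clock
        self.last_report_time = clock()

    def generate_report_if_interval_elapsed(self):
        now = self.clock()
        if now - self.last_report_time < self.interval:
            return None  # interval has not elapsed yet
        self.last_report_time = now
        return self.generate_report()


# Simulated clock: each "training iteration" advances time by 25 seconds.
fake_time = [0.0]
reporter = IntervalReporterSketch(60.0, lambda: {"rank": 0}, lambda: fake_time[0])

results = []
for step in range(5):
    fake_time[0] += 25.0
    results.append(reporter.generate_report_if_interval_elapsed())
```

Calling the method on every iteration is cheap in this sketch: only the third call, at 75 simulated seconds, produces a report.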
- classmethod initialize(scores_to_compute='all', gather_on_rank0=True, profiling_interval=1, report_time_interval=60, node_name=None)[source]
- Parameters:
scores_to_compute (list|str, optional) – List of scores to compute; can include ‘relative_perf_scores’ and ‘individual_perf_scores’, or the string “all”, meaning that all scores should be computed.
gather_on_rank0 (bool, optional) – If True, when .generate_report is called, the report on rank 0 includes results for all ranks and the reports on other ranks are empty. If False, when .generate_report is called, the report on any rank contains just the results for that particular rank.
profiling_interval (int, optional) – Profile each profiling_interval-th section entry. Defaults to 1.
report_time_interval (float, optional) – Interval in seconds for generate_report_if_interval_elapsed. Defaults to 60.
node_name (str | None) – User-friendly name of the current node to be used in reports. If None, socket.gethostname will be used.
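A possible way to assemble the arguments documented above is sketched here. `build_initialize_kwargs` and `initialize_detector` are hypothetical helpers, not part of the library; the import path is taken from the class path shown on this page. The sketch only builds and validates the keyword arguments and does not call `initialize_detector`, on the assumption that Detector.initialize should run only inside an already set up distributed training job.

```python
VALID_SCORES = ("relative_perf_scores", "individual_perf_scores")


def build_initialize_kwargs(scores="all", profiling_interval=1,
                            report_time_interval=60.0, node_name=None):
    """Hypothetical helper: validate and collect Detector.initialize kwargs."""
    if scores != "all":
        unknown = [s for s in scores if s not in VALID_SCORES]
        if unknown:
            raise ValueError(f"unknown scores: {unknown}")
    if profiling_interval < 1:
        raise ValueError("profiling_interval must be >= 1")
    return {
        "scores_to_compute": scores,
        "gather_on_rank0": True,
        "profiling_interval": profiling_interval,
        "report_time_interval": report_time_interval,
        "node_name": node_name,  # None: the library falls back to the hostname
    }


def initialize_detector(kwargs):
    """Hypothetical helper: call this from within the training job."""
    # Imported lazily so the sketch runs without the package installed.
    from nvidia_resiliency_ext.straggler.straggler import Detector

    Detector.initialize(**kwargs)


kwargs = build_initialize_kwargs(scores=["relative_perf_scores"],
                                 profiling_interval=4)
```

Validating score names up front is a design choice of this sketch; the page does not state how the library itself reacts to unknown names.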
- classmethod is_interval_elapsed()[source]
Returns a boolean flag that is True if the interval elapsed during the previous generate_report_if_interval_elapsed call, and False otherwise.
- Return type:
bool
- classmethod restore_original_callables()[source]
Restores the original callable objects after the cls._build_wrapper method was called to wrap profiled callables.
- classmethod wrap_callables(callable_ids, profile_cuda=True)[source]
Each time fn (fn = getattr(callable_id.obj, callable_id.name)) is called, it will do the following:
    with straggler.Detector.detection_section(str(callable_id)):
        fn(*args, **kwargs)
- Parameters:
callable_ids (List[CallableId]) – list of callables to wrap with detection context
profile_cuda (bool, optional) – If True, CUDA kernels launched under this section will be captured and used to compute the rank “GPU performance score”. Defaults to True.