Asynchronous Checkpoint Core Utilities
This module provides an async utilities which allow to start a checkpoint save process in the background.
- class nvidia_resiliency_ext.checkpointing.async_ckpt.core.AsyncCallsQueue[source]
Bases:
object
Manages a queue of async calls.
Allows adding a new async call with schedule_async_request and finalizing active calls with maybe_finalize_async_calls.
- maybe_finalize_async_calls(blocking=False, no_dist=True)[source]
Finalizes all available calls.
This method must be called on all ranks.
- Parameters:
- Returns:
- list of indices (as returned by schedule_async_request)
of async calls that have been successfully finalized.
- Return type:
List[int]
- schedule_async_request(async_request)[source]
Start a new async call and add it to a queue of active async calls.
This method must be called on all ranks.
- Parameters:
async_request (AsyncRequest) – async request to start.
- Returns:
- index of the async call that was started.
This can help the user keep track of the async calls.
- Return type:
- class nvidia_resiliency_ext.checkpointing.async_ckpt.core.AsyncRequest(async_fn, async_fn_args, finalize_fns, async_fn_kwargs={}, is_frozen=False)[source]
Bases:
NamedTuple
Represents an async request that needs to be scheduled for execution.
- Parameters:
async_fn (Callable, optional) – async function to call. None represents noop.
async_fn_args (Tuple) – args to pass to async_fn.
finalize_fns (List[Callable]) – list of functions to call to finalize the request. These functions will be called synchronously after async_fn is done on all ranks.
async_fn_kwargs (Dict)
is_frozen (bool)
Create new instance of AsyncRequest(async_fn, async_fn_args, finalize_fns, async_fn_kwargs, is_frozen)
- add_finalize_fn(fn)[source]
Adds a new finalize function to the request.
- Parameters:
fn (Callable) – function to add to the async request. This function will be called after existing finalization functions.
- Returns:
None
- Return type:
None
- execute_sync()[source]
Helper to synchronously execute the request.
This logic is equivalent to what should happen in case of the async call.
- Return type:
None
- class nvidia_resiliency_ext.checkpointing.async_ckpt.core.DistributedAsyncCaller[source]
Bases:
object
Wrapper around mp.Process that ensures correct semantic of distributed finalization.
Starts process asynchronously and allows checking if all processes on all ranks are done.
- is_current_async_call_done(blocking=False, no_dist=True)[source]
Check if async save is finished on all ranks.
For semantic correctness, requires rank synchronization in each check. This method must be called on all ranks.
- Parameters:
- Returns:
- True if all ranks are done (immediately of after active wait
if blocking is True), False if at least one rank is still active.
- Return type:
- schedule_async_call(async_fn, save_args, save_kwargs={})[source]
Spawn a process with async_fn as the target.
This method must be called on all ranks.
- Parameters:
async_fn (Callable, optional) – async function to call. If None, no process will be started.
save_args (Tuple) – async function args.
save_kwargs (Dict) – async function kwargs. Default is None
- Return type:
None