Asynchronous Checkpoint Core Utilities

This module provides an async utilities which allow to start a checkpoint save process in the background.

class nvidia_resiliency_ext.checkpointing.async_ckpt.core.AsyncCallsQueue[source]

Bases: object

Manages a queue of async calls.

Allows adding a new async call with schedule_async_request and finalizing active calls with maybe_finalize_async_calls.

close()[source]

Finalize all calls upon closing.

get_num_unfinalized_calls()[source]

Get the number of active async calls.

maybe_finalize_async_calls(blocking=False, no_dist=True)[source]

Finalizes all available calls.

This method must be called on all ranks.

Parameters:
  • blocking (bool, optional) – if True, will wait until all active requests are done. Otherwise, finalizes only the async request that already finished. Defaults to False.

  • no_dist (bool, Optional) – if True, training ranks simply check its asynchronous checkpoint writer without synchronization.

Returns:

list of indices (as returned by schedule_async_request)

of async calls that have been successfully finalized.

Return type:

List[int]

schedule_async_request(async_request)[source]

Start a new async call and add it to a queue of active async calls.

This method must be called on all ranks.

Parameters:

async_request (AsyncRequest) – async request to start.

Returns:

index of the async call that was started.

This can help the user keep track of the async calls.

Return type:

int

class nvidia_resiliency_ext.checkpointing.async_ckpt.core.AsyncRequest(async_fn, async_fn_args, finalize_fns, async_fn_kwargs={}, is_frozen=False)[source]

Bases: NamedTuple

Represents an async request that needs to be scheduled for execution.

Parameters:
  • async_fn (Callable, optional) – async function to call. None represents noop.

  • async_fn_args (Tuple) – args to pass to async_fn.

  • finalize_fns (List[Callable]) – list of functions to call to finalize the request. These functions will be called synchronously after async_fn is done on all ranks.

  • async_fn_kwargs (Dict)

  • is_frozen (bool)

Create new instance of AsyncRequest(async_fn, async_fn_args, finalize_fns, async_fn_kwargs, is_frozen)

add_finalize_fn(fn)[source]

Adds a new finalize function to the request.

Parameters:

fn (Callable) – function to add to the async request. This function will be called after existing finalization functions.

Returns:

None

Return type:

None

async_fn: Callable | None

Alias for field number 0

async_fn_args: Tuple

Alias for field number 1

async_fn_kwargs: Dict

Alias for field number 3

execute_sync()[source]

Helper to synchronously execute the request.

This logic is equivalent to what should happen in case of the async call.

Return type:

None

finalize_fns: List[Callable]

Alias for field number 2

freeze()[source]

Freezes the async request, disallowing adding new finalization functions.

Returns:

new async request with all same fields except for the

is_frozen flag.

Return type:

AsyncRequest

is_frozen: bool

Alias for field number 4

class nvidia_resiliency_ext.checkpointing.async_ckpt.core.DistributedAsyncCaller[source]

Bases: object

Wrapper around mp.Process that ensures correct semantic of distributed finalization.

Starts process asynchronously and allows checking if all processes on all ranks are done.

is_current_async_call_done(blocking=False, no_dist=True)[source]

Check if async save is finished on all ranks.

For semantic correctness, requires rank synchronization in each check. This method must be called on all ranks.

Parameters:
  • blocking (bool, optional) – if True, will wait until the call is done on all ranks. Otherwise, returns immediately if at least one rank is still active. Defaults to False.

  • no_dist (bool, Optional) – if True, training ranks simply check its asynchronous checkpoint writer without synchronization.

Returns:

True if all ranks are done (immediately of after active wait

if blocking is True), False if at least one rank is still active.

Return type:

bool

schedule_async_call(async_fn, save_args, save_kwargs={})[source]

Spawn a process with async_fn as the target.

This method must be called on all ranks.

Parameters:
  • async_fn (Callable, optional) – async function to call. If None, no process will be started.

  • save_args (Tuple) – async function args.

  • save_kwargs (Dict) – async function kwargs. Default is None

Return type:

None