Async Checkpointing
The asynchronous checkpointing feature in the NVIDIA Resiliency Extension provides core utilities to offload checkpointing routines to the background. It leverages torch.multiprocessing to either fork a temporary process or spawn a persistent process for efficient, non-blocking checkpointing.
Applications can monitor asynchronous checkpoint progress in a non-blocking manner and define a custom finalization step once all ranks complete their background checkpoint saving.
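The fork, poll, and finalize pattern described above can be sketched with the Python standard library alone. This is a minimal illustration, not the extension's API: `AsyncSaveRequest`, `maybe_finalize`, and `_write_checkpoint` are hypothetical names, `pickle` stands in for `torch.save`, the "fork" start method assumes a POSIX system, and the cross-rank synchronization that a real distributed job needs before finalization is left out.

```python
import multiprocessing as mp
import os
import pickle


def _write_checkpoint(state, path):
    """Runs in the background process: serialize `state` to `path`."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    # Rename only after the write completes, so readers never see a partial file.
    os.replace(tmp_path, path)


class AsyncSaveRequest:
    """Hypothetical helper: one background save plus an optional finalize step."""

    def __init__(self, state, path, finalize_fn=None):
        self.path = path
        self.finalize_fn = finalize_fn
        self.finalized = False
        # Assumption: POSIX "fork" start method (a temporary forked process).
        ctx = mp.get_context("fork")
        self.process = ctx.Process(target=_write_checkpoint, args=(state, path))
        self.process.start()

    def maybe_finalize(self):
        """Non-blocking check; runs finalize_fn once after the save has finished."""
        if self.finalized:
            return True
        if self.process.is_alive():
            return False  # still saving; the caller keeps training
        self.process.join()
        if self.process.exitcode != 0:
            raise RuntimeError(f"background save failed: exit code {self.process.exitcode}")
        if self.finalize_fn is not None:
            self.finalize_fn(self.path)
        self.finalized = True
        return True
```

In a training loop, the caller would create the request after a step, call `maybe_finalize()` once per iteration, and continue training while it returns `False`.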
This repository includes an implementation of asynchronous checkpointing utilities for both torch.save and torch.distributed.checkpoint.save_state_dict. Our modified torch.distributed.checkpoint.save_state_dict interface is integrated with an optimized backend, FileSystemWriterAsync, which:
Runs in the async checkpoint process and creates parallel child processes for intra-node parallelism, avoiding GIL contention.
Minimizes metadata communication overhead through metadata caching, which keeps checkpoint saving efficient.
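The first property above, writing with parallel child processes so that serialization does not contend for a single GIL, can be illustrated with a small standard-library sketch. It is not the FileSystemWriterAsync implementation: `save_sharded`, `load_sharded`, the round-robin key assignment, and the `shard_<i>.pkl` file layout are all assumptions made for illustration, and the "fork" start method assumes a POSIX system.

```python
import multiprocessing as mp
import os
import pickle


def _write_shard(items, path):
    """Runs in a child process: write one shard of the state to its own file."""
    with open(path, "wb") as f:
        pickle.dump(dict(items), f)


def save_sharded(state, directory, num_workers=2):
    """Hypothetical helper: split `state` by key and write shards in parallel."""
    os.makedirs(directory, exist_ok=True)
    keys = sorted(state)
    # Round-robin assignment of keys to workers (an illustrative choice).
    buckets = [keys[i::num_workers] for i in range(num_workers)]
    ctx = mp.get_context("fork")  # assumption: POSIX fork is available
    procs, paths = [], []
    for i, bucket in enumerate(buckets):
        if not bucket:
            continue  # fewer keys than workers
        path = os.path.join(directory, f"shard_{i}.pkl")
        items = [(k, state[k]) for k in bucket]
        proc = ctx.Process(target=_write_shard, args=(items, path))
        proc.start()  # each child has its own interpreter and its own GIL
        procs.append(proc)
        paths.append(path)
    for proc in procs:
        proc.join()
        if proc.exitcode != 0:
            raise RuntimeError(f"shard writer failed: exit code {proc.exitcode}")
    return paths


def load_sharded(paths):
    """Merge the shards back into a single dictionary."""
    merged = {}
    for path in paths:
        with open(path, "rb") as f:
            merged.update(pickle.load(f))
    return merged
```

Because every shard is serialized in a separate process, the CPU-bound pickling of one shard does not block the others, which is the benefit the list item attributes to intra-node parallelism.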
Contents: