BasicTensorAwareStateDict

BasicTensorAwareStateDict provides a simple implementation of the TensorAwareStateDict interface, which is used to manage state dictionaries within a CheckpointManager.

This class requires that all tensors in the user-provided state_dict are located on CUDA devices and are easily accessible (i.e., they can only be nested within dictionaries or lists).
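The "nested within dictionaries or lists" requirement can be pictured with a small traversal sketch. It is an illustration only, not the library's code: `FakeTensor` and `find_tensors` are hypothetical names, and `FakeTensor` stands in for a CUDA tensor so the example runs without a GPU. The sketch descends only into dicts and lists and treats everything else as a leaf.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a CUDA tensor; not part of the library."""
    name: str


def find_tensors(node, path=()):
    """Yield (path, tensor) for each FakeTensor reachable through dicts and lists."""
    if isinstance(node, FakeTensor):
        yield path, node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from find_tensors(value, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_tensors(value, path + (index,))
    # Any other object (int, str, ...) is a non-tensor leaf in this sketch.


state_dict = {
    "model": {"weight": FakeTensor("w"), "bias": FakeTensor("b")},
    "optimizer": [{"exp_avg": FakeTensor("m")}, {"step": 10}],
    "epoch": 3,
}

# Paths of all tensors found, in dict/list iteration order.
paths = [path for path, _ in find_tensors(state_dict)]
```

Every tensor in this layout is reachable by following dict keys and list indices alone, which is the shape of state_dict this class expects.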

This implementation covers the most common use cases for state dict management in distributed training scenarios.

class nvidia_resiliency_ext.checkpointing.local.basic_state_dict.BasicTensorAwareStateDict(state_dict)

Bases: TensorAwareStateDict

The most basic implementation of TensorAwareStateDict, defining the interface between user code and the checkpoint manager.

This class requires that all tensors in the user state_dict are on CUDA devices and are easily accessible (nested only within dicts or lists).

copy_tensors_to_cpu(non_blocking=False)

Replaces the tensors in the state_dict with CPU copies. The original tensors are not destroyed.

Parameters:

non_blocking (bool) – If True, allows the copy to be performed asynchronously.

init_tensors()

Initializes empty tensors with the same properties as the original tensors.

This function should only be called after the original tensors have been popped. It ensures that the newly created empty tensors match the shape, dtype, and device of the originals, but contain no data.

insert_tensors(tensor_data)

Reverse of pop_tensors. Replaces tensor placeholders with actual values. The value of self is considered unchanged after:

self.insert_tensors(self.pop_tensors())

Parameters:

tensor_data – An iterable containing the tensor data to be inserted

property is_hollow

True iff tensors have been extracted and have not yet been inserted back.

pop_tensors()

Extracts the tensor data from the state dict, preserving metadata.

Removes the tensor data while retaining the metadata (e.g., shape, dtype, device) needed to recreate empty tensors. After this operation, the state dictionary is “hollow”, containing no tensor data. Further calls to pop_tensors will raise an error.

Returns:

List of extracted tensors
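The contract between pop_tensors, insert_tensors, and is_hollow can be illustrated with a toy stand-in. This is a minimal sketch under stated assumptions, not the library's implementation: `FakeTensor`, `Placeholder`, and `ToyHollowState` are hypothetical names, the state dict is kept flat for brevity, and the use of `RuntimeError` on a second pop is an assumption (the documentation only says an error is raised).

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor: metadata plus data."""
    shape: tuple
    dtype: str
    device: str
    data: list


@dataclass(frozen=True)
class Placeholder:
    """Hypothetical metadata record left behind once a tensor is popped."""
    shape: tuple
    dtype: str
    device: str


class ToyHollowState:
    """Toy model of the pop/insert contract on a flat dict."""

    def __init__(self, state_dict):
        self.state_dict = state_dict
        self._is_hollow = False

    @property
    def is_hollow(self):
        return self._is_hollow

    def pop_tensors(self):
        if self._is_hollow:
            # Assumed exception type; the docs only state that an error is raised.
            raise RuntimeError("tensors have already been popped")
        popped = []
        for key, value in list(self.state_dict.items()):
            if isinstance(value, FakeTensor):
                popped.append(value)
                # Keep only the metadata needed to recreate an empty tensor.
                self.state_dict[key] = Placeholder(value.shape, value.dtype, value.device)
        self._is_hollow = True
        return popped

    def insert_tensors(self, tensor_data):
        tensors = iter(tensor_data)
        for key, value in list(self.state_dict.items()):
            if isinstance(value, Placeholder):
                # Placeholders are refilled in the order the tensors were popped.
                self.state_dict[key] = next(tensors)
        self._is_hollow = False


state = ToyHollowState(
    {"weight": FakeTensor((2,), "float32", "cuda:0", [1.0, 2.0]), "epoch": 3}
)
before = dict(state.state_dict)

tensors = state.pop_tensors()
hollow_after_pop = state.is_hollow   # state now holds only metadata
state.insert_tensors(tensors)        # round trip restores the original value
```

In the sketch, non-tensor entries such as `epoch` are untouched by the round trip, and the state compares equal to its value before the pop.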

restore_tensor_device(non_blocking=True)

Restores all tensors to their original CUDA devices if a move is required.

Parameters:

non_blocking (bool) – If True, allows the copy to be performed asynchronously.

property tensors

Get the tensor data from the state dict.