distribute
torch.distributed utils.
Classes

NFSWorkspace – A shared workspace implementation using Network File Storage (NFS).

Functions

get_configs_parallel – Gathers the layer config across distributed processes using shm or NFS.

get_tensors_parallel – Gathers the tensors across distributed processes using shm.
- class NFSWorkspace
Bases:
object
A shared workspace implementation using Network File Storage (NFS).
- NOTE: reads, writes, and modifications to the NFS dir do not involve any collective
communication or barrier. It is the user's responsibility to synchronize all ranks (local and remote processes).
This implementation uses torch.save and torch.load for serialization.
- Parameters:
workspace_path – the path to the NFS directory used for cross-rank communication during post-processing. If not provided, SharedMemory will be used instead.
- __init__(workspace_path=None)
Create the NFS work dir and clean up existing state files.
- Parameters:
workspace_path (Path | str | None) –
- property is_initialized
Whether the workspace is initialized.
- read_configs_and_weights_from_rank(target_rank)
All ranks read the target_rank state file.
- Parameters:
target_rank (int) – the target rank
- Returns:
the model/module config and the weights
- Return type:
Tuple[Dict[str, Any] | None, Dict[str, Any] | None]
- write_configs_and_weights(config_json, weights)
Each rank writes its own state file to the shared NFS dir.
- Parameters:
config_json (Dict[str, Any]) – model or module config in json
weights (Dict[str, Any]) – module weights in torch’s state_dict format
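The per-rank state-file pattern above can be sketched in plain Python. This is a simplified analogy, not the real class: it serializes with pickle rather than torch.save/torch.load, `TinyNFSWorkspace` and the explicit `rank` argument are illustrative inventions, and synchronizing ranks remains the caller's responsibility.

```python
import pickle
from pathlib import Path
from tempfile import TemporaryDirectory


class TinyNFSWorkspace:
    """Minimal sketch of a shared workspace: one state file per rank.

    NOTE: hypothetical analogy of the documented NFSWorkspace. The real
    class uses torch.save/torch.load and infers the writing rank itself;
    here `rank` is passed explicitly for illustration.
    """

    def __init__(self, workspace_path):
        self.path = Path(workspace_path)
        self.path.mkdir(parents=True, exist_ok=True)
        # Clean up stale state files from a previous run.
        for f in self.path.glob("rank_*.pkl"):
            f.unlink()

    def write_configs_and_weights(self, rank, config_json, weights):
        # Each rank writes its own state file to the shared dir.
        with open(self.path / f"rank_{rank}.pkl", "wb") as f:
            pickle.dump((config_json, weights), f)

    def read_configs_and_weights_from_rank(self, target_rank):
        # Any rank may read another rank's state file once it exists.
        # No barrier is performed here: callers must synchronize ranks.
        state_file = self.path / f"rank_{target_rank}.pkl"
        if not state_file.exists():
            return None, None
        with open(state_file, "rb") as f:
            return pickle.load(f)


with TemporaryDirectory() as tmp:
    ws = TinyNFSWorkspace(tmp)
    ws.write_configs_and_weights(0, {"layer": "linear"}, {"w": [1.0, 2.0]})
    config, weights = ws.read_configs_and_weights_from_rank(0)
    print(config)  # {'layer': 'linear'}
```

The key design point mirrored here is that the workspace itself performs no communication: the shared filesystem is the only channel, so correctness depends entirely on the caller ordering writes before reads.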
- get_configs_parallel(config, ranks, group, workspace_path=None)
Gathers the layer config across distributed processes using shm or NFS.
- Parameters:
config – the config (may be None) that each rank wants to pass to the first rank.
ranks (List[int]) – the list of the ranks
group – the barrier sync group.
workspace_path (Path | str | None) – the path to the NFS directory for postprocess cross rank communication.
- Yields:
the first rank in ranks gets full access to the configs from all the ranks; the other ranks yield an empty list.
When workspace_path is provided, an NFSWorkspace object is created to perform communication across ranks. Otherwise, SharedMemory is used for local multi-process communication. The shm will be destroyed after consumption.
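The yield-then-clean-up shape of this generator can be illustrated with a single-process sketch. Everything here is a stand-in: `configs_parallel_sketch` and the in-memory `store` dict replace the real shm/NFS transport, and ranks are simulated sequentially rather than as separate processes.

```python
from contextlib import contextmanager

# Stand-in for the shm/NFS transport (hypothetical, for illustration only).
store = {}


@contextmanager
def configs_parallel_sketch(config, rank, ranks):
    """Yield every rank's config to the first rank; others get an empty list."""
    store[rank] = config  # each rank publishes its config
    try:
        if rank == ranks[0]:
            # The first rank sees all configs, in rank order.
            yield [store[r] for r in ranks]
        else:
            yield []
    finally:
        # Mirror "the shm will be destroyed after consumption".
        store.pop(rank, None)


# Single-process demo: pretend we are rank 0 and ranks == [0].
with configs_parallel_sketch({"hidden": 128}, rank=0, ranks=[0]) as configs:
    print(configs)  # [{'hidden': 128}]
```

The context-manager form matters: the gathered data is only valid inside the `with` block, because the backing storage is torn down on exit.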
- get_tensors_parallel(tensor, ranks, group=None)
Gathers the tensors across distributed processes using shm.
- Parameters:
tensor (Tensor) – the tensor that each rank wants to pass to the first rank. The tensors across the ranks must have the same size.
ranks (List[int]) – the list of the ranks
group – the barrier sync group.
- Yields:
the first rank in ranks gets full access to the tensors from all the ranks; the other ranks yield an empty list.
The shm will be destroyed after consumption.
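The shared-memory mechanics behind this gather can be sketched with the standard library alone. This shows only the transport pattern, not the real function's internals; the byte buffer below is an arbitrary stand-in for one rank's tensor data.

```python
from multiprocessing import shared_memory

data = bytes(range(8))  # stand-in for one rank's tensor bytes

# "Sender" side: copy the bytes into a named shared-memory block.
shm = shared_memory.SharedMemory(create=True, size=len(data))
shm.buf[: len(data)] = data

# "Receiver" side: attach to the same block by name and read it back.
peer = shared_memory.SharedMemory(name=shm.name)
received = bytes(peer.buf[: len(data)])
print(received == data)  # True

# Destroy the shm after consumption, as the documented function does.
peer.close()
shm.close()
shm.unlink()
```

Destroying the block after consumption is why the gathered tensors are only usable inside the `with` block of `get_tensors_parallel`: once the context exits, the backing memory is unlinked.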