distribute

Utilities for torch.distributed.

Classes

NFSWorkspace

A shared workspace implementation using Network File Storage (NFS).

Functions

get_configs_parallel

Gathers the layer config across distributed processes using shm or NFS.

get_tensors_parallel

Gathers the tensors across distributed processes using shm.

class NFSWorkspace

Bases: object

A shared workspace implementation using Network File Storage (NFS).

NOTE: reads, writes, and modifications to the NFS dir do not involve any collective communication or barrier. It is the user's responsibility to synchronize all ranks (local and remote processes).

This implementation uses torch.save and torch.load for serialization.

Parameters:

workspace_path – the path to the NFS directory for postprocess cross rank communication. If not provided, SharedMemory will be used instead.

__init__(workspace_path=None)

Create the NFS work dir and clean up existing state files.

Parameters:

workspace_path (Path | str | None) –

property is_initialized

Whether the workspace is initialized.

read_configs_and_weights_from_rank(target_rank)

All ranks read the target rank's state file.

Parameters:

target_rank (int) – the target rank

Returns:

the model/module config and the weights

Return type:

Tuple[Dict[str, Any] | None, Dict[str, Any] | None]

write_configs_and_weights(config_json, weights)

All ranks write their state files to the shared NFS dir.

Parameters:
  • config_json (Dict[str, Any]) – model or module config in json

  • weights (Dict[str, Any]) – module weights in torch’s state_dict format
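A minimal usage sketch follows, assuming torch.distributed is already initialized and that the module is importable as distribute (adjust the import to the actual package path). The NFS path and payloads are illustrative, not part of the API; note the explicit barrier, which the class itself does not perform.

    import torch.distributed as dist

    from distribute import NFSWorkspace  # adjust to the actual package path

    # Hypothetical NFS directory visible to all ranks.
    workspace = NFSWorkspace("/mnt/shared/postprocess")

    # Each rank writes its own state file; the workspace issues no collectives.
    workspace.write_configs_and_weights(
        config_json={"rank": dist.get_rank()},  # illustrative config payload
        weights={},  # module weights in torch state_dict format
    )

    # The caller is responsible for synchronization before any rank reads.
    dist.barrier()

    config, weights = workspace.read_configs_and_weights_from_rank(target_rank=0)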

get_configs_parallel(config, ranks, group, workspace_path=None)

Gathers the layer config across distributed processes using shm or NFS.

Parameters:
  • config – the config (nullable) that each rank wants to pass to the first rank.

  • ranks (List[int]) – the list of the ranks

  • group – the barrier sync group.

  • workspace_path (Path | str | None) – the path to the NFS directory for postprocess cross rank communication.

Yields:

the first rank in ranks has full access to the configs gathered from all ranks; the other ranks yield an empty list.

When workspace_path is provided, an NFSWorkspace object is created to perform communication across ranks. Otherwise, SharedMemory is used for local multi-process communication. The shm will be destroyed after consumption.
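The sketch below treats get_configs_parallel as a context manager, which the yield-and-cleanup description above suggests; the import path, the group argument, and the config payload are assumptions for illustration.

    import torch.distributed as dist

    from distribute import get_configs_parallel  # adjust to the actual package path

    rank = dist.get_rank()
    ranks = list(range(dist.get_world_size()))
    local_config = {"rank": rank, "hidden_size": 1024}  # illustrative payload

    # group=None assumes the default process group is acceptable for the barrier.
    with get_configs_parallel(local_config, ranks, group=None) as configs:
        if rank == ranks[0]:
            # The first rank sees every rank's config; other ranks get an empty list.
            print(f"gathered {len(configs)} configs")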

get_tensors_parallel(tensor, ranks, group=None)

Gathers the tensors across distributed processes using shm.

Parameters:
  • tensor (Tensor) – the tensor that each rank wants to pass to the first rank. The tensors across the ranks need to have the same size.

  • ranks (List[int]) – the list of the ranks

  • group – the barrier sync group.

Yields:

the first rank in ranks has full access to the tensors gathered from all ranks; the other ranks yield an empty list.

The shm will be destroyed after consumption.
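As with get_configs_parallel, the following sketch assumes the function is used as a context manager and that the module is importable as distribute; the tensor shape is illustrative and must be identical on every rank.

    import torch
    import torch.distributed as dist

    from distribute import get_tensors_parallel  # adjust to the actual package path

    rank = dist.get_rank()
    ranks = list(range(dist.get_world_size()))
    local_tensor = torch.full((4,), float(rank))  # same shape on every rank

    with get_tensors_parallel(local_tensor, ranks) as tensors:
        if rank == ranks[0]:
            # The first rank receives the tensors from all ranks; others get an empty list.
            merged = torch.stack(tensors)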