distribute

Utilities for torch.distributed.

Classes

NFSWorkspace

A shared workspace implementation using Network File Storage (NFS).

Functions

get_configs_parallel

Gathers the layer config across distributed processes using shm or NFS.

get_tensors_parallel

Gathers the tensors across distributed processes using shm.

class NFSWorkspace

Bases: object

A shared workspace implementation using Network File Storage (NFS).

NOTE: reads, writes, and modifications to the NFS dir do not involve any collective communication or barrier. It is the user's responsibility to synchronize all ranks (local and remote processes).

This implementation uses torch.save and torch.load for serialization.

Parameters:

workspace_path – the path to the NFS directory for postprocess cross rank communication. If not provided, SharedMemory will be used instead.

__init__(workspace_path=None)

Create the NFS work dir and clean up existing state files.

Parameters:

workspace_path (Path | str | None) –

property is_initialized

Whether the workspace is initialized.

read_configs_and_weights_from_rank(target_rank)

All ranks read the target rank's state file.

Parameters:

target_rank (int) – the target rank

Returns:

the model/module config and the weights

Return type:

Tuple[Dict[str, Any] | None, Dict[str, Any] | None]

write_configs_and_weights(config_json, weights)

All ranks write their state files to the shared NFS dir.

Parameters:
  • config_json (Dict[str, Any]) – model or module config in json

  • weights (Dict[str, Any]) – module weights in torch’s state_dict format
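A minimal usage sketch follows, assuming torch.distributed is already initialized and that the module is importable as distribute (adjust the import to the actual package path). The NFS path and payloads are illustrative, not part of the API; note the explicit barrier, which the class itself does not perform.

    import torch.distributed as dist

    from distribute import NFSWorkspace  # adjust to the actual package path

    # Hypothetical NFS directory visible to all ranks.
    workspace = NFSWorkspace("/mnt/shared/postprocess")

    # Each rank writes its own state file; the workspace issues no collectives.
    workspace.write_configs_and_weights(
        config_json={"rank": dist.get_rank()},  # illustrative config payload
        weights={},  # module weights in torch state_dict format
    )

    # The caller is responsible for synchronization before any rank reads.
    dist.barrier()

    config, weights = workspace.read_configs_and_weights_from_rank(target_rank=0)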

get_configs_parallel(config, ranks, group, workspace_path=None)

Gathers the layer config across distributed processes using shm or NFS.

Parameters:
  • config – the config (nullable) that each rank wants to pass to the first rank.

  • ranks (List[int]) – the list of the ranks

  • group – the barrier sync group.

  • workspace_path (Path | str | None) – the path to the NFS directory for postprocess cross rank communication.

Yields:

the first rank in ranks has full access to the configs gathered from all ranks; the other ranks yield an empty list.

When workspace_path is provided, an NFSWorkspace object is created to perform communication across ranks. Otherwise, SharedMemory is used for local multi-process communication. The shm will be destroyed after consumption.
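The sketch below treats get_configs_parallel as a context manager, which the yield-and-cleanup description above suggests; the import path, the group argument, and the config payload are assumptions for illustration.

    import torch.distributed as dist

    from distribute import get_configs_parallel  # adjust to the actual package path

    rank = dist.get_rank()
    ranks = list(range(dist.get_world_size()))
    local_config = {"rank": rank, "hidden_size": 1024}  # illustrative payload

    # group=None assumes the default process group is acceptable for the barrier.
    with get_configs_parallel(local_config, ranks, group=None) as configs:
        if rank == ranks[0]:
            # The first rank sees every rank's config; other ranks get an empty list.
            print(f"gathered {len(configs)} configs")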

get_tensors_parallel(tensor, ranks, group=None)

Gathers the tensors across distributed processes using shm.

Parameters:
  • tensor (Tensor) – the tensor that each rank wants to pass to the first rank. The tensors across the ranks need to have the same size.

  • ranks (List[int]) – the list of the ranks

  • group – the barrier sync group.

Yields:

the first rank in ranks has full access to the tensors gathered from all ranks; the other ranks yield an empty list.

The shm will be destroyed after consumption.
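As with get_configs_parallel, the following sketch assumes the function is used as a context manager and that the module is importable as distribute; the tensor shape is illustrative and must be identical on every rank.

    import torch
    import torch.distributed as dist

    from distribute import get_tensors_parallel  # adjust to the actual package path

    rank = dist.get_rank()
    ranks = list(range(dist.get_world_size()))
    local_tensor = torch.full((4,), float(rank))  # same shape on every rank

    with get_tensors_parallel(local_tensor, ranks) as tensors:
        if rank == ranks[0]:
            # The first rank receives the tensors from all ranks; others get an empty list.
            merged = torch.stack(tensors)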