
Megatron utils

is_only_data_parallel()

Checks to see if you are in a distributed Megatron environment with only data parallelism active.

This is useful if you are working on a model, loss, etc., and you know that you do not yet support Megatron model parallelism. You can test that the only kind of parallelism in use is data parallelism.

Returns:

bool: True if data parallel is the only parallel mode, False otherwise.
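A minimal usage sketch (not part of the source): guard code that does not yet support Megatron model parallelism by checking this function first. The import path follows the module location shown below; compute_loss, logits, and targets are hypothetical names, and torch.distributed plus Megatron's parallel_state are assumed to be initialized already (otherwise the call raises RuntimeError).

from bionemo.llm.utils.megatron_utils import is_only_data_parallel


def compute_loss(logits, targets):  # hypothetical loss that only supports pure data parallelism
    if not is_only_data_parallel():
        raise NotImplementedError(
            "This loss does not yet support tensor or pipeline model parallelism."
        )
    ...  # regular per-rank loss computation; DDP handles the gradient all-reduce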

Source code in bionemo/llm/utils/megatron_utils.py
def is_only_data_parallel() -> bool:
    """Checks to see if you are in a distributed megatron environment with only data parallelism active.

    This is useful if you are working on a model, loss, etc and you know that you do not yet support megatron model
    parallelism. You can test that the only kind of parallelism in use is data parallelism.

    Returns:
        True if data parallel is the only parallel mode, False otherwise.
    """
    if not (torch.distributed.is_available() and parallel_state.is_initialized()):
        raise RuntimeError("This function is only defined within an initialized megatron parallel environment.")
    # Idea: when world_size == data_parallel_world_size, then you know that you are fully DDP, which means you are not
    #  using model parallelism (meaning virtual GPUs composed of several underlying GPUs that you need to reduce over).

    world_size: int = torch.distributed.get_world_size()
    dp_world_size: int = parallel_state.get_data_parallel_world_size()
    return world_size == dp_world_size
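To make the arithmetic behind the comment concrete (a sketch under assumed settings, not from the source): Megatron derives the data-parallel world size by dividing the total world size by the product of the model-parallel group sizes, so any active model parallelism makes the two numbers differ.

# Sketch of the check's arithmetic, assuming 8 ranks in total.
# With tensor_model_parallel_size=2 (and pipeline size 1), the data-parallel
# world size is 8 // 2 == 4, so is_only_data_parallel() would return False.
# With no model parallelism, both sizes are 8 and it would return True.
world_size = 8
tensor_parallel_size = 2
pipeline_parallel_size = 1
dp_world_size = world_size // (tensor_parallel_size * pipeline_parallel_size)
assert dp_world_size == 4 and dp_world_size != world_size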