Checks to see if you are in a distributed megatron environment with only data parallelism active.

This is useful if you are working on a model, loss, etc., and you know that you do not yet support megatron model parallelism. You can test that the only kind of parallelism in use is data parallelism.
Returns:

| Type | Description |
| --- | --- |
| `bool` | True if data parallel is the only parallel mode, False otherwise. |
Source code in bionemo/llm/utils/megatron_utils.py
```python
def is_only_data_parallel() -> bool:
    """Checks to see if you are in a distributed megatron environment with only data parallelism active.

    This is useful if you are working on a model, loss, etc and you know that you do not yet support megatron model
    parallelism. You can test that the only kind of parallelism in use is data parallelism.

    Returns:
        True if data parallel is the only parallel mode, False otherwise.
    """
    if not (torch.distributed.is_available() and parallel_state.is_initialized()):
        raise RuntimeError("This function is only defined within an initialized megatron parallel environment.")
    # Idea: when world_size == data_parallel_world_size, then you know that you are fully DDP, which means you are not
    # using model parallelism (meaning virtual GPUs composed of several underlying GPUs that you need to reduce over).
    world_size: int = torch.distributed.get_world_size()
    dp_world_size: int = parallel_state.get_data_parallel_world_size()
    return world_size == dp_world_size
```
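A minimal sketch of how this check might be used as a guard in code that does not yet support megatron model parallelism. The `compute_loss` function and its arguments are hypothetical, and the example assumes `torch.distributed` and Megatron's `parallel_state` have already been initialized elsewhere in the program, as the docstring requires.

```python
import torch

from bionemo.llm.utils.megatron_utils import is_only_data_parallel


def compute_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Hypothetical loss that only supports pure data parallelism (DDP). Assumes
    # torch.distributed and megatron parallel_state were initialized earlier.
    if not is_only_data_parallel():
        raise NotImplementedError(
            "This loss currently assumes pure data parallelism; "
            "tensor/pipeline parallel sizes must all be 1."
        )
    return torch.nn.functional.cross_entropy(logits, targets)
```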