Convert zero3 to zero1
convert_zero_checkpoint_to_fp32_state_dict(checkpoint_dir, output_dir, tag=None, exclude_frozen_parameters=False, mp_size=8, overwrite=False, num_workers=1, ranks_to_process=None)
Converts a DeepSpeed ZeRO-3 checkpoint to a PyTorch FP32 state_dict.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `checkpoint_dir` | `str` | Path to the desired checkpoint folder. | required |
| `output_dir` | `str` | Directory to save the PyTorch FP32 state_dict output files. | required |
| `tag` | `Optional[str]` | Checkpoint tag used as a unique identifier or sub-directory that contains the checkpoint. | `None` |
| `exclude_frozen_parameters` | `bool` | Whether to exclude frozen parameters. | `False` |
| `mp_size` | `int` | Model parallel size of the source checkpoint. | `8` |
| `overwrite` | `bool` | Whether to overwrite existing MP shards. | `False` |
| `num_workers` | `int` | Number of workers to use for processing. | `1` |
| `ranks_to_process` | `Optional[List[int]]` | List of ranks to process. | `None` |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If the checkpoint directory does not exist. |
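A call to this function might look like the sketch below. It is illustrative only: `build_conversion_kwargs` and `run_conversion` are hypothetical helpers, not part of the library, and the import path is an assumption inferred from the source file location (`bionemo.evo2.utils.checkpoint.convert_zero3_to_zero1`). The helper checks the checkpoint directory up front so that the documented `FileNotFoundError` case surfaces before any work starts.

```python
from pathlib import Path
from typing import Any, Dict, List, Optional


def build_conversion_kwargs(
    checkpoint_dir: str,
    output_dir: str,
    tag: Optional[str] = None,
    mp_size: int = 8,
    overwrite: bool = False,
    num_workers: int = 1,
    ranks_to_process: Optional[List[int]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: collect keyword arguments for the converter.

    Defaults mirror the documented signature. The early directory check is
    this sketch's own addition, modelled on the documented FileNotFoundError.
    """
    if not Path(checkpoint_dir).is_dir():
        raise FileNotFoundError(f"Checkpoint directory not found: {checkpoint_dir}")
    return {
        "checkpoint_dir": checkpoint_dir,
        "output_dir": output_dir,
        "tag": tag,
        "exclude_frozen_parameters": False,
        "mp_size": mp_size,
        "overwrite": overwrite,
        "num_workers": num_workers,
        "ranks_to_process": ranks_to_process,
    }


def run_conversion(**kwargs: Any) -> None:
    """Hypothetical wrapper that forwards the arguments to the converter."""
    # Assumed import path, inferred from the file location shown on this page.
    from bionemo.evo2.utils.checkpoint.convert_zero3_to_zero1 import (
        convert_zero_checkpoint_to_fp32_state_dict,
    )

    convert_zero_checkpoint_to_fp32_state_dict(**kwargs)
```

With BioNeMo installed, `run_conversion(**build_conversion_kwargs("/path/to/zero3_ckpt", "/path/to/output"))` would run the conversion with the documented defaults; both paths here are placeholders.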
Source code in bionemo/evo2/utils/checkpoint/convert_zero3_to_zero1.py