Reproducible Scaling
A special use case is to re-run or continue a training run with the exact same data order, but using a different number of nodes or ranks.
Since version 2.0.0, Megatron Energon supports this behavior if a few constraints are met:
- The global batch size must stay the same across runs
- The global batch size must be a multiple of `micro-batch size * world_size * num_workers`
  - The multiple of that is the number of gradient accumulation steps in your training
- The product `world_size * num_workers` must stay the same across runs, such that the global number of workers stays the same
- You need to set the same `torch.manual_seed(...)` on each rank before constructing the dataset and the data loader, as sketched below
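As a rough sketch of these rules in code (the variable names below are illustrative, not Energon API; only the call to `torch.manual_seed(...)` is an actual requirement):

```python
import torch

# Illustrative configuration values; match them to your training setup.
global_batch_size = 8
micro_batch_size = 2
world_size = 4
num_workers = 1

# The global batch size must be divisible by micro_batch_size * world_size * num_workers.
assert global_batch_size % (micro_batch_size * world_size * num_workers) == 0

# Same seed on every rank, set before the dataset and the data loader are constructed.
torch.manual_seed(1234)
```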
By obeying these rules, you will be able to reproduce the same global batches. Let's look at an example.
| Name  | Global batch size | Micro batch size | World size | Number of workers | Gradient accumulation steps |
|-------|-------------------|------------------|------------|-------------------|-----------------------------|
| Run 1 | 8                 | 2                | 4          | 1                 | 1                           |
| Run 2 | 8                 | 2                | 1          | 4                 | 4                           |
If the seed is set correctly, iterating the dataset will yield the same global batches for both of these runs.
In practice, you will need to adapt your worker config accordingly.
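As a sketch of what that adaptation could look like, assuming the quickstart-style `WorkerConfig`, `get_train_dataset`, and `get_loader` calls and a placeholder dataset path; the two `WorkerConfig` lines mirror Run 1 and Run 2 from the table:

```python
import os

import torch
from megatron.energon import WorkerConfig, get_loader, get_train_dataset

torch.manual_seed(1234)  # same seed on every rank, and the same value in both runs

rank = int(os.environ.get("RANK", 0))  # this process's rank, e.g. set by the launcher

# Run 1: world_size=4, num_workers=1  (world_size * num_workers = 4)
worker_config = WorkerConfig(rank=rank, world_size=4, num_workers=1)
# Run 2 would instead use: WorkerConfig(rank=0, world_size=1, num_workers=4)

train_ds = get_train_dataset(
    "/path/to/dataset",       # placeholder path
    batch_size=2,             # micro batch size from the table
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=worker_config,
)
train_loader = get_loader(train_ds)
```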