Reproducible Scaling
A special use case is to re-run or continue a training run with the exact same data order, but using a different number of nodes or ranks.
Since version 2.0.0, Megatron Energon supports this behavior if a few constraints are met:
- The global batch size must stay the same across runs
- The global batch size must be a multiple of `micro-batch size * world_size * num_workers`
  - The multiple of that is the number of gradient accumulation steps in your training
- The product `world_size * num_workers` must stay the same across runs, such that the global number of workers stays the same
- You need to set the same `torch.manual_seed(...)` on each rank before constructing the dataset and the data loader, as sketched below
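As a rough sketch of these rules in code (the variable names below are illustrative, not Energon API; only the call to `torch.manual_seed(...)` is an actual requirement):

```python
import torch

# Illustrative configuration values; match them to your training setup.
global_batch_size = 8
micro_batch_size = 2
world_size = 4
num_workers = 1

# The global batch size must be divisible by micro_batch_size * world_size * num_workers.
assert global_batch_size % (micro_batch_size * world_size * num_workers) == 0

# Same seed on every rank, set before the dataset and the data loader are constructed.
torch.manual_seed(1234)
```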
By obeying these rules, you will be able to reproduce the same global batches. Let's look at an example.
| Name  | Global batch size | Micro batch size | World size | Number of workers | Gradient accumulation steps |
|-------|-------------------|------------------|------------|-------------------|-----------------------------|
| Run 1 | 8                 | 2                | 4          | 1                 | 1                           |
| Run 2 | 8                 | 2                | 1          | 4                 | 4                           |
If the seed is set correctly, iterating the dataset will yield the same global batches for both of these runs.
In practice, you will need to adapt your worker config accordingly.
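As a sketch of what that adaptation could look like, assuming the quickstart-style `WorkerConfig`, `get_train_dataset`, and `get_loader` calls and a placeholder dataset path; the two `WorkerConfig` lines mirror Run 1 and Run 2 from the table:

```python
import os

import torch
from megatron.energon import WorkerConfig, get_loader, get_train_dataset

torch.manual_seed(1234)  # same seed on every rank, and the same value in both runs

rank = int(os.environ.get("RANK", 0))  # this process's rank, e.g. set by the launcher

# Run 1: world_size=4, num_workers=1  (world_size * num_workers = 4)
worker_config = WorkerConfig(rank=rank, world_size=4, num_workers=1)
# Run 2 would instead use: WorkerConfig(rank=0, world_size=1, num_workers=4)

train_ds = get_train_dataset(
    "/path/to/dataset",       # placeholder path
    batch_size=2,             # micro batch size from the table
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=worker_config,
)
train_loader = get_loader(train_ds)
```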