Epochized Blending
As an alternative to blending datasets by weight, epochized blending repeats each dataset a fixed number of times. This makes the blend exact and lets iteration follow epochs (i.e. stop after each epoch).
Here is an example `metadataset.yaml` config file that uses epochized blending:
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    # Blend the following datasets, repeating coco 5 times, coyo-train 2
    # times and coyo-val 1 time
    blend_epochized:
      - repetitions: 5
        path: ./coco
        # ... Other parameters
      - repetitions: 2
        path: ./coyo
      - repetitions: 1
        path: ./coyo
        split_part: val
```
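Conceptually, one epoch of an epochized blend contains every dataset repeated `repetitions` times, with all the resulting samples shuffled together. The following is a minimal pure-Python sketch of that idea (not the actual energon implementation; the sample names are made up for illustration):

```python
import random

def epochized_blend(datasets, seed=0):
    """Sketch of epochized blending.

    datasets: list of (samples, repetitions) pairs.
    Returns one epoch: each dataset repeated `repetitions` times,
    shuffled across all datasets.
    """
    epoch = []
    for samples, repetitions in datasets:
        epoch.extend(samples * repetitions)  # repeat the whole dataset
    random.Random(seed).shuffle(epoch)       # shuffle across all datasets
    return epoch

# Toy datasets standing in for coco, coyo-train and coyo-val
coco = ["coco_0", "coco_1"]
coyo_train = ["coyo_train_0", "coyo_train_1", "coyo_train_2"]
coyo_val = ["coyo_val_0"]

epoch = epochized_blend([(coco, 5), (coyo_train, 2), (coyo_val, 1)])
# The epoch length is exact: 5*2 + 2*3 + 1*1 = 17 samples
print(len(epoch))
```

Unlike weighted blending, the per-dataset sample counts in each epoch are exact rather than only correct in expectation.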
Now, the call to `get_train_dataset` requires the additional parameter `repeat=False` to stop iteration after one epoch:
```python
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

loader = get_loader(get_train_dataset(
    'metadataset.yaml',
    batch_size=2,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
    repeat=False,
))

# This will now stop iterating after the datasets have been iterated (coco 5
# times, coyo-train 2 times and coyo-val 1 time). Of course, the data is still
# shuffled across all those datasets.
for batch in loader:
    print(batch)

# This will iterate the second epoch
for batch in loader:
    print(batch)
```
If the metadataset is used with `get_val_dataset`, the `repetitions` are ignored.
The metadataset would also work without setting `repeat=False`, but then the shuffle buffer will shuffle samples across epoch boundaries.
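To see why epoch boundaries matter for shuffling, here is a toy shuffle buffer (a simplified sketch, not energon's implementation). When each epoch is iterated separately, the buffer is drained at the epoch boundary, so no sample from epoch 2 can be emitted before epoch 1 is complete; when the epochs are concatenated into one endless stream (as with `repeat=True`), leftover samples from epoch 1 sit in the buffer while epoch-2 samples flow in, so the two epochs can mix:

```python
import random

def shuffle_buffer(stream, size, rng):
    """Minimal shuffle buffer: hold up to `size` samples, emit a random one."""
    buf = []
    for sample in stream:
        buf.append(sample)
        if len(buf) > size:
            yield buf.pop(rng.randrange(len(buf)))
    while buf:  # stream ended: drain the remaining samples
        yield buf.pop(rng.randrange(len(buf)))

rng = random.Random(42)
epoch1 = [("epoch1", i) for i in range(5)]
epoch2 = [("epoch2", i) for i in range(5)]

# Like repeat=False: each epoch is shuffled on its own, the buffer is drained
# at the boundary, so all epoch-1 samples come out before any epoch-2 sample.
separate = list(shuffle_buffer(iter(epoch1), 3, rng)) + \
           list(shuffle_buffer(iter(epoch2), 3, rng))

# Like repeat=True: one continuous stream, so epoch-1 samples still in the
# buffer can be emitted after epoch-2 samples have already appeared.
mixed = list(shuffle_buffer(iter(epoch1 + epoch2), 3, rng))
```

Both variants emit every sample exactly once per epoch; they differ only in whether samples may cross the epoch boundary.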