Quickstart
You can use Megatron Energon to load datasets in the energon dataset format. This format is essentially WebDataset with some metadata added.
For a moment, let’s assume you already have a dataset prepared in the required format, stored on disk at /my/dataset/path. If you want to create a dataset now, check out Data Preparation.
If you simply want some dummy data to try this out, check out the unit test method create_captioning_test_dataset inside tests/test_dataset.py.
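For context: a WebDataset is a set of tar-file shards in which all files belonging to one sample share a common basename. A captioning shard might look like this (the shard and member names are made up for illustration):

data-0000.tar
    000001.jpg    # the image of sample 000001
    000001.txt    # its caption
    000002.jpg
    000002.txt
    ...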
Good to know
You can also store your dataset inside an S3-compatible object store and load it from there! See Remote Dataset.
You can then load the dataset like this:
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

ds = get_train_dataset(
    '/my/dataset/path',
    batch_size=1,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
)

loader = get_loader(ds)

for batch in loader:
    # Do something with batch
    # Infer, gradient step, ...
    pass
First, we call get_train_dataset (click to see its signature). The method checks what kind of dataset is on disk and instantiates the correct class for it.
A worker configuration is always needed to specify how the work is distributed across multiple ranks and workers. In this simple example, we use the helper method default_worker_config to get reasonable default values.
The dataset should not be iterated directly; instead, use it with a loader, which handles the worker processes. The batches will contain samples of the sample type specified in the task encoder.
Good to know
Since we did not specify a task encoder above, the DefaultTaskEncoder will be used. It does not transform the data. For batching, it uses common-sense magic: it pads and stacks tensors, and builds lists if the type is unknown.
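To illustrate what that batching behavior amounts to, here is a minimal, hypothetical sketch; this is not energon’s actual implementation, and the helper name naive_default_batch is made up for illustration:

import torch

def naive_default_batch(items):
    # Tensors: pad every item to the largest shape, then stack along a new batch dim.
    if isinstance(items[0], torch.Tensor):
        max_shape = [max(t.shape[d] for t in items) for d in range(items[0].dim())]
        padded = []
        for t in items:
            pad = []
            for d in reversed(range(t.dim())):  # F.pad takes (left, right) pairs, last dim first
                pad.extend([0, max_shape[d] - t.shape[d]])
            padded.append(torch.nn.functional.pad(t, pad))
        return torch.stack(padded)
    # Unknown types (e.g. strings): just collect them in a list.
    return list(items)

For example, naive_default_batch([torch.ones(3, 2), torch.ones(3, 4)]) would zero-pad the first tensor and return a tensor of shape (2, 3, 4).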
Wait. Why does the dataset create batches? Shouldn’t the dataloader do that?
Energon creates batches at the dataset level. Internally, most of the cool things that energon can do (such as blending datasets together, sequence packing, etc.) are dataset wrappers. Even batching is such a wrapper, and the default get_train_dataset function constructs a suitable combination of all these wrappers based on the arguments you pass to it.
Check out the Data Flow section to see the steps in which the data is processed.
Why must shuffle_buffer_size and max_samples_per_sequence be set explicitly?
The library is designed to work on (sequential) WebDatasets while still providing proper shuffling, which is why these parameters are required. To make sure the user does not forget them, we enforce that they are set explicitly. A value of 100 for both settings works well for image datasets (i.e. it balances shuffling randomness against the performance impact of seeking), but datasets whose samples are much larger or smaller may require different settings. Setting the sequence length very small compared to the number of samples in the dataset results in more random access and thus slows down dataloading, so the recommendation is to set it high enough. At the same time, a high value reduces the shuffling randomness, which requires a larger shuffle buffer to compensate (i.e. a higher memory footprint and longer state-restore times).
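As a purely illustrative sketch of how you might adapt these values (the numbers below are placeholders, not recommendations): for a dataset with much larger samples, e.g. long videos, both values could be lowered.

ds = get_train_dataset(
    '/my/dataset/path',
    batch_size=1,
    # Much larger samples: seeking is cheap relative to decoding each sample,
    # and every buffered sample costs a lot of memory.
    shuffle_buffer_size=20,
    max_samples_per_sequence=20,
    worker_config=WorkerConfig.default_worker_config(),
)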
Tutorial 1: Preparing the Dataset
For running any of the tutorials, you’ll need your data structured as a WebDataset. For more details and more options, check out Data Preparation.
For example, in your folder containing the tar files, run:

$ energon prepare ./

# Example answers to the interactive questions:
Ratio: 8,1,1
Dataset class: CaptioningWebdataset
Field map: Yes
  image: jpg
  caption: txt            # if the .txt file contains the caption
  # or:
  caption: json[caption]  # if the .json file contains {"caption": "My nice image"}
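The wizard stores its answers in a .nv-meta folder next to your shards. The resulting .nv-meta/dataset.yaml should look roughly like this (verify against your own output, as the exact layout may differ between versions):

__module__: megatron.energon
__class__: CaptioningWebdataset
field_map:
  image: jpg
  caption: txt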
Tutorial 2: Loading a Dataset
Let’s be a bit more concrete and try out the above data loading code with a real dataset. We are going to print the first batch and stop.
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

ds = get_train_dataset(
    '/path/to/your/dataset',
    batch_size=1,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
)

loader = get_loader(ds)

for batch in loader:
    print(batch)
    break
This prints
CaptioningSample(__key__='part_00123/00403.tar/004030195', image=tensor([[[0.4549, 0.4784, 0.5059, ..., 0.8392, 0.8275, 0.7961],
[0.4549, 0.4784, 0.5020, ..., 0.6431, 0.6275, 0.5882],
[0.4510, 0.4706, 0.4941, ..., 0.6235, 0.6353, 0.6078],
...,
[0.4471, 0.4196, 0.4510, ..., 0.8471, 0.8039, 0.8275],
[0.4667, 0.4353, 0.4667, ..., 0.8196, 0.7804, 0.8078],
[0.4824, 0.4549, 0.4824, ..., 0.8196, 0.7843, 0.8118]],
[[0.3608, 0.3843, 0.4118, ..., 0.7373, 0.7255, 0.6941],
[0.3608, 0.3843, 0.4078, ..., 0.5412, 0.5255, 0.4863],
[0.3569, 0.3765, 0.4000, ..., 0.5098, 0.5216, 0.4941],
...,
[0.3608, 0.3333, 0.3647, ..., 0.7529, 0.7098, 0.7333],
[0.3804, 0.3490, 0.3804, ..., 0.7255, 0.6863, 0.7137],
[0.3961, 0.3686, 0.3961, ..., 0.7255, 0.6902, 0.7176]],
[[0.2510, 0.2745, 0.3020, ..., 0.6000, 0.5882, 0.5569],
[0.2510, 0.2745, 0.2980, ..., 0.4039, 0.3882, 0.3490],
[0.2471, 0.2667, 0.2902, ..., 0.3765, 0.3882, 0.3608],
...,
[0.2667, 0.2392, 0.2706, ..., 0.6510, 0.6000, 0.6235],
[0.2863, 0.2549, 0.2863, ..., 0.6235, 0.5765, 0.6039],
[0.3020, 0.2745, 0.3020, ..., 0.6235, 0.5882, 0.6157]]]), caption='Cello Renting vs. Buying: Which is Right for You?')
Awesome, it returns a CaptioningSample with the attributes:

- __key__: 'part_00123/00403.tar/004030195', the identifier of the sample, in the form TAR_FILE/INDEX. All sample types have a key; it is defined in the base class Sample.
- image: The image as a tensor of shape (1, 3, 267, 400) (an RGB image in a batch of size 1)
- caption: A list of strings (here just one, since the batch size is one)
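Since the sample is a dataclass, you access these fields as plain attributes, e.g.:

for batch in loader:
    images = batch.image      # tensor of shape (1, 3, H, W)
    captions = batch.caption  # list of caption strings
    break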
Let’s also talk about the WorkerConfig. As energon is made for distributed training, you always need to provide a worker config to the dataset to specify how many ranks and workers there are and which rank you’re currently on.
For this simple tutorial, we don’t really distribute the work, so we use only a single rank with 4 workers. Check out the helper method default_worker_config to see how the worker config is constructed. Also, don’t be afraid to click the [source] link and look at its very short source code.
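For reference, on a single non-distributed rank this default should be roughly equivalent to constructing the config yourself (assuming the default of 4 workers):

from megatron.energon import WorkerConfig

worker_config = WorkerConfig(
    rank=0,         # this process is data rank 0 ...
    world_size=1,   # ... of a world of 1 data-parallel rank
    num_workers=4,  # 4 worker processes load data in parallel
)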
Tutorial 3: Batch Size
Actually, we would like to use a batch_size of more than one; let’s go with 2 for now.
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

loader = get_loader(get_train_dataset(
    '/path/to/your/dataset',
    batch_size=2,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
))

for batch in loader:
    print(batch)
    break
The output will be similar to above but with different shapes and lengths:
- batch.__key__: A list of two keys
- batch.image: Tensor of shape (2, 3, 267, 400)
- batch.caption: A list of two caption strings
The default task encoder automagically padded and stacked the items into a batch. This may be fine for some cases, but usually you will want to process and batch your data differently.
Hence, we can either use an existing task encoder or define a custom one (see Task Encoder), as sketched below.
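As a rough sketch of the pattern (the class name and the caption-lowercasing step are made up for illustration; see the Task Encoder section for the exact API):

from megatron.energon import (
    CaptioningSample, DefaultTaskEncoder, WorkerConfig,
    get_loader, get_train_dataset,
)

class LowercaseCaptionTaskEncoder(DefaultTaskEncoder):
    # Hypothetical preprocessing: normalize each caption before batching.
    def encode_sample(self, sample: CaptioningSample) -> CaptioningSample:
        sample.caption = sample.caption.lower()
        return sample

loader = get_loader(get_train_dataset(
    '/path/to/your/dataset',
    batch_size=2,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
    task_encoder=LowercaseCaptionTaskEncoder(),
))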
Tutorial 4: Blending using Metadataset
A typical use case is to blend multiple datasets of the same (or a similar) type together. For example, you may want to blend the COCO dataset with the COYO dataset.
The easiest way to do this is to use the metadataset pattern. For this, you need to create a new yaml file that defines the meta dataset:
coyo-coco-dataset.yaml:

__module__: megatron.energon
__class__: MetadatasetV2
splits:
  # Train dataset, the datasets will be blended according to their weights
  train:
    blend:
      - weight: 5
        path: ./coco
      - weight: 2
        path: ./coyo
  # For val and test, datasets will be concatenated
  val:
    path: ./coco
  test:
    path: ./coyo
This assumes that the datasets coyo and coco exist in subfolders next to the coyo-coco-dataset.yaml file. You could also use absolute paths, but that will not work well when using object storage such as S3.
To use it in your loader, simply pass the yaml file to get_train_dataset:
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

loader = get_loader(get_train_dataset(
    'coyo-coco-dataset.yaml',
    batch_size=4,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
))

for batch in loader:
    print(batch)
    break
If you need to handle samples from different datasets differently in your pipeline, you will want to use subflavors; a sketch follows below. For these and other details, check out the Metadataset section. Energon also supports blending by specifying the number of repetitions of each dataset, using Epochized Blending.
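As a sketch of what tagging the blended datasets with subflavors could look like (verify the exact syntax against the Metadataset section):

  train:
    blend:
      - weight: 5
        path: ./coco
        subflavors:
          source: coco
      - weight: 2
        path: ./coyo
        subflavors:
          source: coyo

Each loaded sample then carries these tags (e.g. in sample.__subflavors__), so a task encoder can branch on the originating dataset.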
Tutorial 5: Distributed Loading
For multi-GPU support, you may need to adapt the worker config.
So far we have only used the default worker config, which you get by calling WorkerConfig.default_worker_config(). This default config tries to infer your multi-GPU setup using torch.distributed, which is fine in most cases.
If you are not using any distributed setup, the default config will work too; in that case, it assumes a single local rank.
However, if you have a more complex multi-node setup with non-data-parallel strategies, you may need to set it up yourself. The following example shows how it could be set up.
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

worker_config = WorkerConfig(
    rank=SET_YOUR_GLOBAL_DATA_RANK_HERE,
    world_size=SET_YOUR_GLOBAL_WORLD_SIZE_HERE,
    num_workers=2,
)

loader = get_loader(get_train_dataset(
    'coyo-coco-dataset.yaml',
    batch_size=4,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=worker_config,
))

for batch in loader:
    print(batch)
    break
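As one hedged example of how the placeholders might be filled: with tensor-model parallelism, all ranks in one tensor-parallel group should see the same data, so the data rank could be derived from the global rank. The TP size and the derivation below are illustrative (they assume contiguously numbered TP groups), not energon API:

import torch.distributed as dist

TP = 4  # hypothetical tensor-model-parallel group size

worker_config = WorkerConfig(
    rank=dist.get_rank() // TP,              # all ranks in a TP group share one data rank
    world_size=dist.get_world_size() // TP,  # number of data-parallel ranks
    num_workers=2,
)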
Tutorial 6: Saving and Restoring
For saving and restoring the state (e.g. when pausing and resuming training), the loader must be instantiated as a savable loader.
from megatron.energon import get_train_dataset, get_savable_loader, WorkerConfig

ds = get_train_dataset(
    'coyo-coco-dataset.yaml',
    batch_size=4,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
)

# Must use the savable loader here. It provides methods to save
# and restore the state of the data loader.
loader = get_savable_loader(ds)

# Iterate a few batches, then save the state
for i, batch in zip(range(10), loader):
    print(batch)

# Save the state
state = loader.save_state_rank()
# Could persist the state now using torch.save()

# ... when loading:
# Could load the persisted state with torch.load()

# Restore the state into a new loader
ds = get_train_dataset(
    'coyo-coco-dataset.yaml',
    batch_size=4,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
)
loader = get_savable_loader(ds)
loader.restore_state_rank(state)
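For persisting the state to disk, the torch.save() / torch.load() calls mentioned in the comments could look like this (the per-rank file name is just an illustrative convention; each rank saves and restores its own state):

import torch

rank = 0  # your data-parallel rank

# After save_state_rank(), on each rank:
torch.save(state, f'dataloader_state_rank{rank}.pt')

# ... later, before restore_state_rank(), on the same rank:
state = torch.load(f'dataloader_state_rank{rank}.pt')
loader.restore_state_rank(state)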
We provide code for various saving and restoring scenarios in distributed settings in the Save and Restore section.
More Features
Check out the topics in Advanced Usage for details on specific features.