Dataset Format on Disk

The energon library supports loading large multi-modal datasets from disk. To be loadable, a dataset must comply with the format described in this section.

A valid energon dataset must contain an .nv-meta folder with certain files as shown below.

my_dataset
├── .nv-meta
│   ├── dataset.yaml
│   ├── split.yaml
│   └── .info.yaml
├── shards
│   ├── shard_000.tar
│   ├── shard_001.tar
│   ├── ...

Note that the shards folder is just an example. The shards and their folder can be named differently. The dataset does not even need to be a webdataset: other formats are supported, but the .nv-meta structure is always the same.
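
The layout can be checked before attempting to load a dataset. The following is a minimal sketch using only the Python standard library. The helper name missing_meta_files is hypothetical and not part of energon, and the check only tests that the three files exist, not that their contents are valid.

```python
from pathlib import Path

# File names taken from the layout described above.
REQUIRED_META_FILES = ("dataset.yaml", "split.yaml", ".info.yaml")


def missing_meta_files(dataset_root):
    """Return the required .nv-meta files that are missing under dataset_root.

    Hypothetical helper for illustration; it is not part of the energon API.
    """
    meta_dir = Path(dataset_root) / ".nv-meta"
    # If .nv-meta itself is absent, every required file is reported as missing.
    return [name for name in REQUIRED_META_FILES if not (meta_dir / name).is_file()]
```

An empty return value, for example from missing_meta_files("my_dataset"), means that all three files are present.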

Files in .nv-meta

dataset.yaml

The dataset.yaml file contains the dataset definition, i.e. the dataset class to use as loader and, optionally, decoder settings. To create such a file, consider using the CLI preparation tool.

Here's an example:

sample_type:
  __module__: megatron.energon
  __class__: CaptioningSample
field_map:
  image: jpg
  caption: txt

The __class__ and __module__ values help the library construct the correct object. The field_map specifies how the fields from each webdataset sample are mapped to the members of the sample dataclass.

In this example, the dataclass is:

@dataclass
class CaptioningSample(Sample):
    image: torch.Tensor
    caption: str
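
To illustrate what the field_map does, the following minimal sketch applies it to one already-decoded webdataset sample, represented as a plain dict keyed by file extension. It uses a stand-in dataclass instead of the real CaptioningSample, and the function apply_field_map is hypothetical. Energon's actual loading code is not shown here and may differ.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class CaptioningSampleSketch:
    # Stand-in for CaptioningSample; the real class stores a torch.Tensor image.
    image: Any
    caption: str


# Same mapping as in the dataset.yaml example: dataclass member -> sample field.
FIELD_MAP = {"image": "jpg", "caption": "txt"}


def apply_field_map(raw_sample, field_map, sample_cls):
    """Build sample_cls from raw_sample by looking up each mapped field.

    Hypothetical helper for illustration; it is not part of the energon API.
    """
    kwargs = {member: raw_sample[field] for member, field in field_map.items()}
    return sample_cls(**kwargs)


# Placeholder values; a real sample would carry decoded image data.
raw = {"jpg": "<decoded image>", "txt": "A cat on a sofa."}
sample = apply_field_map(raw, FIELD_MAP, CaptioningSampleSketch)
```

Each key of the field_map names a member of the dataclass, and each value names the field of the webdataset sample that fills it.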

In some scenarios, you might need a more advanced way to map samples into the dataclass. In that case, please check out this page.

split.yaml

This file defines the splits (e.g. train, val, test), each as a list of shards. It can also contain a "denylist" (the exclude key) to exclude certain samples or shards from training. Example:

exclude: []
split_parts:
  train:
  - shards/shard_000.tar
  - shards/shard_001.tar
  val:
  - shards/shard_002.tar
  test:
  - shards/shard_003.tar

To exclude certain shards or samples, you need to add those to the exclude list as follows:

exclude:
  - shards/shard_004.tar
  - shards/shard_001.tar/000032
  - shards/shard_001.tar/000033
split_parts:
...

The example above excludes the entire shard shard_004.tar and two samples from shard_001.tar.
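
The effect of the exclude list can be sketched as follows. This is an illustrative approximation, not energon's implementation. It assumes that the split.yaml content has already been parsed into a dict, that an entry equal to a shard path excludes the whole shard, and that an entry of the form "shard path/sample key" excludes a single sample. The function name resolve_split is hypothetical.

```python
def resolve_split(split_config, split_name):
    """Return (shards to read, excluded sample keys per shard) for one split.

    Hypothetical helper for illustration; it is not part of the energon API.
    """
    excluded = set(split_config.get("exclude", []))
    # Drop shards that are excluded as a whole.
    shards = [s for s in split_config["split_parts"][split_name] if s not in excluded]

    skipped_samples = {}
    for entry in excluded:
        shard, _, key = entry.rpartition("/")
        # An entry names a sample if its prefix is one of the remaining shards.
        if shard in shards:
            skipped_samples.setdefault(shard, set()).add(key)
    return shards, skipped_samples


# shard_004.tar is listed under train here only to show the effect of excluding it.
config = {
    "exclude": ["shards/shard_004.tar", "shards/shard_001.tar/000032"],
    "split_parts": {
        "train": ["shards/shard_000.tar", "shards/shard_001.tar", "shards/shard_004.tar"],
    },
}
shards, skipped = resolve_split(config, "train")
```

In this sketch, shards keeps only shard_000.tar and shard_001.tar, and skipped records that sample 000032 of shard_001.tar should be dropped while reading.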

.info.yaml

The hidden info file is auto-generated and contains statistics about each shard.

Example:

shard_counts:
  shards/000.tar: 1223
  shards/001.tar: 1420
  shards/002.tar: 1418
  shards/003.tar: 1358
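
These statistics make it possible to compute dataset sizes without opening the tar files. The following is a minimal sketch, assuming that shard_counts maps each shard path to its number of samples (an interpretation of the example above) and that the file has already been parsed into a dict. The function name total_samples is hypothetical.

```python
def total_samples(shard_counts, shards=None):
    """Sum the counts over all shards, or over the given subset of shards.

    Hypothetical helper for illustration; it is not part of the energon API.
    """
    if shards is None:
        return sum(shard_counts.values())
    return sum(shard_counts[shard] for shard in shards)


# Values copied from the .info.yaml example above.
shard_counts = {
    "shards/000.tar": 1223,
    "shards/001.tar": 1420,
    "shards/002.tar": 1418,
    "shards/003.tar": 1358,
}
all_samples = total_samples(shard_counts)
first_two = total_samples(shard_counts, ["shards/000.tar", "shards/001.tar"])
```

Passing the shard list of a split from split.yaml as the second argument would give the size of that split under the same assumption.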