# Data Preparation

The aim of data preparation is to convert your data to a format that the energon loader can understand and iterate over.
The outcome will be a WebDataset with some extra information stored in a folder called `.nv-meta`.
Below, in Dataset Format on Disk, we explain the details of this format.
## Important Considerations

Depending on what your data looks like and how you plan to use it, you will have to make a few choices before preparing your dataset:

### Monolithic vs. Polylithic (Primary and Auxiliary) Datasets

You can include the media (images/video/audio) inside the same webdataset along with the text and metadata of each sample, or you can keep the media separate (either in another indexed webdataset or as individual files on disk).

If you can, you should go for the monolithic option, because it is faster to load. However, there are a few reasons why the polylithic option may be needed:

- You need to keep the original media files and don't want to duplicate them.
- Your media data is very large (e.g. long videos) and you need to keep your primary dataset small (containing just the text-based data and meta information).
- You want to re-use the same media with different labels, or you want to train on different subsets.
- You want to train with online packing and cannot fit all the media of the packing buffer in memory. With polylithic datasets, you can use caching to avoid that issue.
### How to Shard the Data

The WebDataset will be split into a number of shards (i.e. tar files). You'll have to decide how many samples to put in one shard, and thus how many shards you get overall.

To maximize loading speed, use as few shards as possible; even a single shard can work well. However, if you cannot handle files above a certain size, you may need to split the data into more shards. A good rule of thumb is to keep the number of shards below 10k.

If you are using remote filesystems like S3, there may be an opposing constraint: S3 limits the number of requests per second that you can make to a single prefix (e.g. filename). By using more shards, you can increase the overall request rate. Ideally, you would still want to stay below 10k shards.
### Raw vs. Baked Data

When using images, for example, you could store the encoded JPG, the decoded pixel values, or even the encoded features in the dataset.

Typically, we recommend going with the "original form" (e.g. JPG) and doing all the processing on the fly inside the cooker and task encoder. This way, you can change the processing later and keep your dataset as-is.

However, if the processing becomes a bottleneck, you can move some of it into the dataset creation phase by baking the information in. Keep in mind that others may also want to use your dataset for a different project.
## Steps to Create a Monolithic Dataset

These are the typical steps to get your data ready:

1. Create a normal WebDataset from your data (including all the media content).
2. Run our preparation tool `energon prepare` to create the additional metadata needed by energon. See Dataset Format on Disk.
## Steps to Create a Polylithic Dataset

1. Create the primary WebDataset from the text-based part of your data (meta information, labels, sizes etc.).
   - Include the file names (don't use absolute paths) of the media that belongs to each sample (e.g. as strings inside a json entry).
2. Create the auxiliary dataset(s). These can be multiple datasets, e.g. one per modality.
   - Either as a folder on disk with all the media files inside,
   - or as another WebDataset that contains just the media files (with the exact same names).
3. Run our preparation tool `energon prepare` on both WebDatasets to convert them to an energon-compatible format.
4. Configure both datasets as `CrudeWebdataset`.
5. Create a metadataset that specifies what auxiliary data to load for each primary dataset, as sketched below.

For more details, read about crude data.
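For illustration, such a metadataset might look roughly like the following (a minimal sketch; all paths and the `media` name are placeholders, and the exact schema is described in the crude data documentation):

```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    # The primary (crude) dataset with text and metadata
    path: ./primary_dataset
    aux:
      # An auxiliary dataset holding the referenced media files
      media: ./auxiliary_media_dataset
```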
## Step 1: Creating a WebDataset

Example of a WebDataset (e.g. an image captioning dataset):

```
shards
├── shard_0000.tar
│   ├── sample_0000.jpg
│   ├── sample_0000.txt
│   ├── sample_0000.detail.json
│   ├── sample_0001.jpg
│   ├── sample_0001.txt
│   └── sample_0001.detail.json
├── shard_0001.tar
│   ├── sample_0002.jpg
│   ├── sample_0002.txt
│   ├── sample_0002.detail.json
│   ├── sample_0003.jpg
│   ├── sample_0003.txt
│   └── sample_0003.detail.json
└── ...
```
In the example you can see two shards (i.e. tar files) with multiple samples each. Each group of files with the same basename makes up one sample. So `sample_0000.jpg`, `sample_0000.txt`, and `sample_0000.detail.json` are three parts that belong to the first sample.

This example shows a monolithic dataset; for a polylithic dataset, you would drop the JPGs from the primary dataset.

Note that each sample may have a different number of parts; for example, some samples may have more images than others. In this case, the parts should still share the same basename, for example `sample_0000.img1.jpg` and `sample_0000.img2.jpg`. For an advanced example with interleaved data, check out this section.
The order of samples in the tar file is important: samples with the same base name (i.e. the part before the first dot of the filename) must follow each other. The base name is used to group the files into samples; in the example, `sample_0000` is the first group name, with the part types `jpg`, `txt`, and `detail.json`.

The default behavior of energon is to parse the contents by extension (e.g. a part ending in `.json` is automatically parsed with `json.loads`, and a `.png` part is loaded as an image).
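For illustration, after decoding, the first sample from the tree above would roughly correspond to a Python dict like this (a sketch only; the actual decoded types depend on the configured decoder):

```python
sample = {
    "__key__": "sample_0000",          # the group basename, without extensions
    "jpg": ...,                        # the decoded image (e.g. a PIL image or tensor, depending on the decoder)
    "txt": "a caption",                # .txt parts are decoded as plain strings
    "detail.json": {"source": "..."},  # .json parts are parsed with json.loads
}
```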
### Building a WebDataset using Python

The easiest way to construct a WebDataset from existing data (e.g. from another torch dataset or a folder of files) is to use the `ShardWriter` from the webdataset library:

```python
import webdataset as wds

if __name__ == '__main__':
    # Wherever your dataset comes from
    my_dataset = ...

    with wds.ShardWriter("parts/data-%d.tar", maxcount=10000) as shard_writer:
        for key, data in my_dataset:
            sample = {
                "__key__": key,
                "png": data['image'],
            }
            shard_writer.write(sample)
```
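Samples may of course contain more than one part. As a sketch (the field names and values here are illustrative), the default webdataset encoder chooses the serialization based on each dict key's extension:

```python
sample = {
    "__key__": f"sample_{index:04d}",      # basename shared by all parts of this sample
    "jpg": jpeg_bytes,                     # raw bytes are written as-is
    "txt": "a caption",                    # strings are written UTF-8 encoded
    "json": {"width": 640, "height": 480}, # dicts under a .json key are serialized with json.dumps
}
shard_writer.write(sample)
```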
## Step 2: Preparing the Dataset

Once you have a WebDataset ready, you will want to prepare it for use with Energon. This means adding additional metadata files next to the data. This step does not change or copy the contents of your tar files.

Just run the `energon prepare /path/to/dataset` command, which will interactively walk you through the process.

The command will

- search for all `*.tar` files in the given folder,
- index them so samples can be accessed randomly,
- ask you how you want to split the data into train/val/test partitions,
- ask you about the sample type (optionally crude),
- ask you how to decode the data if not using crude data (field map or `sample_loader.py`),
- store all this information in a subfolder `.nv-meta/`, see details below.
### Splitting the dataset into train/val/test

The first thing that the `energon prepare` assistant will ask you is how you want to split the data by ratios. However, if you have a pre-determined split, you can also pass that to energon. See the examples below.

#### Example 1: Let energon do the split

```
shards
├── shard_0000.tar
├── shard_0001.tar
└── ...
```

Command line:

```
> energon prepare ./
# Exemplary answers to the interactive questions:
Ratio: 8,1,1
Dataset class: CaptioningWebdataset
Field map: Yes
image: jpg
caption: txt            # if the .txt file contains the caption
# or
caption: json[caption]  # if the .json file contains {"caption": "My nice image"}
```
#### Example 2: Presplit shards by prefix

```
shards
├── train_shard_0000.tar
├── train_shard_0001.tar
├── ...
├── val_shard_0000.tar
├── val_shard_0001.tar
└── ...
```

Command line:

```
> energon prepare --split-parts 'train:shards/train_.*' --split-parts 'val:shards/val_.*' ./
```

Note that the pattern-matching syntax uses regexes, so for arbitrary characters insert `.*`, not just `*`.
#### Example 3: Presplit shards by folder

```
shards
├── train
│   ├── shard_00000.tar
│   ├── shard_00001.tar
│   └── ...
├── val
│   ├── shard_00000.tar
│   ├── shard_00001.tar
│   └── ...
└── ...
```

Command line:

```
> energon prepare --split-parts 'train:shards/train/.*' --split-parts 'val:shards/val/.*' ./
```
### Good to know

You can inspect your prepared dataset like a normal file system by using the `energon mount` feature.
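For example (an illustrative invocation with placeholder paths; check `energon mount --help` for the exact arguments):

```
> energon mount /path/to/dataset /path/to/mountpoint
```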
## Sample Types

After the split is set up, the assistant will ask you which sample type you want to use. We provide a set of common sample types, e.g. for image captioning or visual question answering; they are listed below.

These will be sufficient in simple scenarios, but if none of them fits, you can create your own sample type. Here are your options:

- Your new sample type is rather common but not in our list below:
  - Please add your type to energon and create a pull request, so we can add it.
- Your sample type is experimental, very special, or used only temporarily:
  - You can add the sample type class in your code repository and create the `dataset.yaml` manually, referring to your class with `__class__` (as sketched below).
  - Alternatively, you can add the sample type class in your code repository and use a crude dataset with cookers (no need to put the sample type in `dataset.yaml`).
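For illustration, a hand-written `dataset.yaml` referring to your own class might look like this (a minimal sketch; the module, class, and field names are placeholders for your own code, and the `field_map` syntax follows the examples above):

```yaml
sample_type:
  __module__: my_project.sample_types
  __class__: MyCustomSample
field_map:
  image: jpg
  my_field: json[my_key]
```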
### Available Sample Types

These are the integrated types you can currently choose from:

- `Sample`: Base dataclass for samples from source webdatasets. Attributes:
  - `__key__: str`: Unique identifier of the sample within the dataset. Useful for tracking down the source of a single sample.
  - `__restore_key__: tuple[str | int | tuple, ...]`: Structured key of the sample, which can be used to regenerate the sample without storing the whole sample.
  - `__subflavors__: dict[str, Any] | None`: Represents the subflavors (i.e. custom dict data) set for the source dataset (typically in the metadataset).
- `CaptioningSample`: Represents a sample for captioning. Attributes:
  - `image: torch.Tensor`: The input image tensor
  - `caption: str`: The target caption string
- `ImageSample`: Represents a sample which only contains an image (e.g. for reconstruction). Attributes:
  - `image: torch.Tensor`: The image tensor
- `ImageClassificationSample`: Represents a sample which contains an image and its class label. Attributes:
  - `image: torch.Tensor`: The image tensor
  - `label: int | None`: The label of the sample, as integral representation
  - `label_name: str | None`: The label of the sample, as its string name
- `InterleavedSample`: Represents a sample which contains interleaved media, such as images and text. Attributes:
  - `sequence: list[torch.Tensor | str]`: The interleaved media (either a torch.Tensor or a string for text)
- `MultiChoiceVQASample`: Represents a sample for visual question answering, with a choice of answers and one correct answer. Attributes:
  - `image: torch.Tensor`: The input image tensor
  - `context: str`: The context/question for the image
  - `choices: list[str] | None`: The candidate answers
  - `correct_choice_idx: int | None`: The index of the correct answer
- `OCRSample`: Sample type for optical character recognition. Attributes:
  - `image: str`: The input image
  - `text: str`: The text string for the whole image
  - `block_boxes: torch.Tensor | None`: The bounding boxes of the blocks in the image, as a float tensor of shape (N, 4) or (N, 5) with columns x, y, w, h[, confidence]
  - `block_classes: torch.Tensor | list[str] | None`: The classes of the blocks
  - `block_text: torch.Tensor | None`: The text content of the blocks
  - `lines_boxes: torch.Tensor | None`: The bounding boxes of the text lines
  - `lines_text: list[str] | None`: The text content of the text lines
  - `words_boxes: torch.Tensor | None`: The bounding boxes of the text words
  - `words_text: list[str] | None`: The text content of the text words
  - `chars_boxes: torch.Tensor | None`: The bounding boxes of the text characters
  - `chars_text: list[str] | None`: The text content of the text characters
- `TextSample`: Represents a sample which only contains a text string (e.g. for text generation). Attributes:
  - `text: str`: The text string
- `VidQASample`: Represents a sample which contains a video and a question with its answer. Attributes:
  - `video: VideoData`: The input video data
  - `context: str`: The context/question
  - `answers: list[str] | None`: The answer strings
  - `answer_weights: torch.Tensor | None`: Weights for possibly multiple answers
- `VQASample`: Represents a sample which contains an image, a question/context, and an answer. Attributes:
  - `image: torch.Tensor`: The input image tensor
  - `context: str`: The context/question
  - `answers: list[str] | None`: The answer strings
  - `answer_weights: torch.Tensor | None`: Weights for possibly multiple answers
- `VQAOCRSample`: Sample type for question answering related to optical character recognition. Attributes:
  - `image: str`: The input image
  - `context: str`: The context/question
  - `text: str`: The text contained in the image
  - `answers: list[str] | None`: The answer strings
  - `answer_weights: torch.Tensor | None`: Weights for possibly multiple answers
  - `words_boxes: torch.Tensor | None`: The bounding boxes of the text words
  - `words_text: list[str] | None`: The text content of the text words
## Sample Loading

When you actually use and load your dataset, the data stored in the tar files needs to be converted to an instance of your chosen sample type. There are three options:

- If the conversion is a simple 1:1 mapping of files to fields of the sample type class, you can use a simple field map.
- Otherwise, the now preferred way is to use a `CrudeWebdataset` and do the conversion inside a cooker.
- There is another (now legacy) way: creating a custom `sample_loader.py` file next to your dataset. This option will continue to work, but we encourage you to move to crude datasets in the future.
When running `energon prepare`, you can choose "Crude sample" as the sample type, and the assistant will end there. If you picked another sample type, the assistant will ask whether you want to use a "simple field map" or a "sample loader".
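If you go the crude route, the conversion happens in a cooker in your code. As a minimal sketch (assuming the `Cooker` and `basic_sample_keys` helpers from megatron.energon; see the crude data documentation for the exact API), a cooker turning a crude sample into a `TextSample` could look like this:

```python
from megatron.energon import Cooker, TextSample, basic_sample_keys

def cook_text(sample: dict) -> TextSample:
    # `sample` holds the raw parts of one crude sample; text parts arrive as bytes.
    return TextSample(
        **basic_sample_keys(sample),  # carries over __key__ and the other sample bookkeeping
        text=sample["txt"].decode("utf-8"),
    )

# Registered in your TaskEncoder, e.g.: cookers = [Cooker(cook_text)]
```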
### Simple Field Map

If your data consists of simple text, json, and images that can be decoded by the standard webdataset auto-decoder, and they map directly to the attributes of your chosen sample type from the list above, use a "field map". The field map stores which file extension in the webdataset is mapped to which attribute of the sample class.
### Sample Loader (Deprecated)

If your data needs some custom decoding code to compute the sample attributes from the data in the tar, you can use a custom sample loader. However, starting from Energon 7, we recommend using crude datasets and a cooker instead.

If you use a `sample_loader.py`, its code should only contain the dataset-specific decoding, not any project-specific processing.

Example of a special format (e.g. an OCR dataset) for which we will use a custom `sample_loader.py`:
```
parts
├── segs-000000.tar
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0025).jp2
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0025).lines.png
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0025).mp
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0025).words.png
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0075).jp2
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0075).lines.png
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0075).mp
│   └── ...
└── ...
```
The `.mp` files (`msgpack` content) are automatically decoded and contain:

```json
{
  "identifier": "componentsbenefi00andr",
  "pageno": 25,
  "size": {"w": 2286, "h": 3179},
  "lines": [
    {"l": 341, "t": 569, "b": 609, "r": 1974, "text": "CHAPTER 4 ADVANCED TRAFFIC CONTROL SYSTEMS IN INDIANA"},
    {"l": 401, "t": 770, "b": 815, "r": 2065, "text": "A variety of traffic control systems currently exist"},
    //...
  ],
  "words": [
    {"l": 341, "t": 577, "b": 609, "r": 544, "text": "CHAPTER"},
    {"l": 583, "t": 578, "b": 607, "r": 604, "text": "4"},
    //...
  ],
  "chars": [
    {"t": 579, "b": 609, "l": 341, "r": 363, "text": "C"},
    {"t": 579, "b": 609, "l": 370, "r": 395, "text": "H"},
    //...
  ]
}
```
`sample_loader.py`:

```python
import torch

def sample_loader(raw: dict) -> dict:
    # Converts each box from (l, t, r, b) corners to (x, y, w, h).
    return dict(
        __key__=raw["__key__"],
        image=raw["jp2"],
        text="\n".join(line["text"] for line in raw["mp"]["lines"]),
        lines_boxes=torch.tensor(
            [
                (line["l"], line["t"], line["r"] - line["l"], line["b"] - line["t"])
                for line in raw["mp"]["lines"]
            ],
            dtype=torch.int64,
        ),
        lines_text=[line["text"] for line in raw["mp"]["lines"]],
        words_boxes=torch.tensor(
            [
                (word["l"], word["t"], word["r"] - word["l"], word["b"] - word["t"])
                for word in raw["mp"]["words"]
            ],
            dtype=torch.int64,
        ),
        words_text=[word["text"] for word in raw["mp"]["words"]],
        chars_boxes=torch.tensor(
            [
                (char["l"], char["t"], char["r"] - char["l"], char["b"] - char["t"])
                for char in raw["mp"]["chars"]
            ],
            dtype=torch.int64,
        ),
        chars_text=[char["text"] for char in raw["mp"]["chars"]],
    )

def part_filter(part: str) -> bool:
    # Only load the parts we actually need; skip the pre-rendered *.png files.
    return part in ("jp2", "mp")
```
For more information, please also read Custom Sample Loader.
## Dataset Format on Disk

The energon library supports loading large multi-modal datasets from disk. To be loadable, a dataset must comply with the format described in this section.

A valid energon dataset must contain an `.nv-meta` folder with certain files, as shown below.

```
my_dataset
├── .nv-meta
│   ├── dataset.yaml
│   ├── split.yaml
│   ├── .info.json
│   ├── index.sqlite
│   └── index.uuid
└── shards
    ├── shard_000.tar
    ├── shard_001.tar
    └── ...
```

Note that the `shards` folder is just an example. The shards and their folder can be named differently, but the `.nv-meta` structure is always the same.
### Files in `.nv-meta`

#### dataset.yaml (user editable)

The `dataset.yaml` contains the dataset definition, i.e. the dataset class to use as loader and optional decoders. If you want to create such a file, you should consider using the CLI preparation tool `energon prepare` instead of writing it by hand.

Here's an example:

```yaml
sample_type:
  __module__: megatron.energon
  __class__: CaptioningSample
field_map:
  image: jpg
  caption: txt
```
For a crude dataset, the `dataset.yaml` will simply be:

```yaml
__module__: megatron.energon
__class__: CrudeWebdataset
```

The `__class__` and `__module__` values help the library construct the correct object. The `field_map` specifies how the fields from each webdataset sample are mapped to the members of the sample dataclass.

In this example, the dataclass is:

```python
@dataclass
class CaptioningSample(Sample):
    image: torch.Tensor
    caption: str
```
#### split.yaml (user editable)

This file contains the splits (i.e. train, val, test), each with a list of its shards. It can also contain an "exclude list" to exclude certain samples or shards from training. Example:

```yaml
exclude: []
split_parts:
  train:
    - shards/shard_000.tar
    - shards/shard_001.tar
  val:
    - shards/shard_002.tar
  test:
    - shards/shard_003.tar
```

To exclude certain shards or samples, add them to the `exclude` list as follows:

```yaml
exclude:
  - shards/shard_004.tar
  - shards/shard_001.tar/000032
  - shards/shard_001.tar/000033
split_parts:
  ...
```

The above example excludes the entire shard 004 and two samples from shard 001.
#### .info.json (read-only)

The hidden info file is auto-generated and contains the list of all shards and the number of samples in each. Example:

```json
{
  "shard_counts": {
    "shards/000.tar": 1223,
    "shards/001.tar": 1420,
    "shards/002.tar": 1418,
    "shards/003.tar": 1358
  }
}
```

The order of the tar files is important, as it is used by the sqlite database below.
#### index.sqlite and index.uuid (read-only)

The sqlite database was introduced in Energon 7 and allows fully random access to samples and files by their names. This is a precondition for polylithic datasets and for the `energon mount` command.

Below is some detailed information for the interested reader. Note that the internal table structure can change in any release without notice.

The database contains an entry for each sample and each sample part, including their byte offsets and sizes in the tar files.

Example `samples` table:

| tar_file_id | sample_key | sample_index | byte_offset | byte_size |
|---|---|---|---|---|
| 0 | 00000 | 0 | 0 | 35840 |
| 0 | 00001 | 1 | 35840 | 35840 |
| 0 | 00002 | 2 | 71680 | 35840 |
| 0 | … | … | … | … |

The byte offsets describe the range spanning all the tar entries that are part of that sample, including the tar headers.
The corresponding example `sample_parts` table:

| tar_file_id | sample_index | part_name | content_byte_offset | content_byte_size |
|---|---|---|---|---|
| 0 | 0 | json | 1536 | 31 |
| 0 | 0 | png | 3584 | 30168 |
| 0 | 0 | txt | 35328 | 16 |
| 0 | 1 | json | 37376 | 31 |
| 0 | 1 | png | 39424 | 30168 |
| 0 | 1 | txt | 71168 | 16 |
| 0 | … | … | … | … |

The byte offsets in the `sample_parts` table refer to the byte ranges of the actual file contents and can be used to access the contents directly without parsing the tar headers.

Both tables can be joined over `tar_file_id` and `sample_index`. Note that `tar_file_id` refers to the list of tar files in the `.info.json` file.
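For illustration, such a join could be performed like this (a sketch only; since the internal schema may change between releases without notice, don't build tooling on top of it):

```python
import sqlite3

# Open the auto-generated index of a prepared dataset.
con = sqlite3.connect("my_dataset/.nv-meta/index.sqlite")

# List the parts of the first few samples with their content byte ranges.
rows = con.execute(
    """
    SELECT s.sample_key, p.part_name, p.content_byte_offset, p.content_byte_size
    FROM samples AS s
    JOIN sample_parts AS p
      ON p.tar_file_id = s.tar_file_id AND p.sample_index = s.sample_index
    ORDER BY s.tar_file_id, s.sample_index
    LIMIT 10
    """
).fetchall()
for sample_key, part_name, offset, size in rows:
    print(f"{sample_key}.{part_name}: {size} bytes at offset {offset}")
con.close()
```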