Data Preparation

The aim of data preparation is to convert your data to a format that the energon loader can understand and iterate over. The outcome will be a WebDataset with some extra information stored in a folder called .nv-meta. The details of this format are explained below in Dataset Format on Disk.

These are the typical steps to get your data ready:

  1. Create a normal WebDataset from your data

  2. Run our preparation tool energon prepare to convert to an energon-compatible format

Step 1: Creating a WebDataset

Example for a WebDataset (e.g. image captioning dataset):

shards
β”œβ”€β”€ shard_0000.tar
β”‚   β”œβ”€β”€ sample_0000.jpg
β”‚   β”œβ”€β”€ sample_0000.txt
β”‚   β”œβ”€β”€ sample_0000.detail.json
β”‚   β”œβ”€β”€ sample_0001.jpg
β”‚   β”œβ”€β”€ sample_0001.txt
β”‚   └── sample_0001.detail.json
β”œβ”€β”€ shard_0001.tar
β”‚   β”œβ”€β”€ sample_0002.jpg
β”‚   β”œβ”€β”€ sample_0002.txt
β”‚   β”œβ”€β”€ sample_0002.detail.json
β”‚   β”œβ”€β”€ sample_0003.jpg
β”‚   β”œβ”€β”€ sample_0003.txt
β”‚   └── sample_0003.detail.json
└── ...

In the example you can see two shards (i.e. tar files) with multiple samples. Each group of files with the same basename makes one sample. So sample_0000.jpg, sample_0000.txt and sample_0000.detail.json are three parts that belong to the first sample.

Note that each sample may have a different number of parts, for example some samples may have more images than others. In this case, they should still have the same basename, for example sample_0000.img1.jpg and sample_0000.img2.jpg. For an advanced example for interleaved data, check out this section.

The order of samples inside the tar file is important: files with the same base name (the part of the filename before the first dot) must directly follow each other, since the base name is used to group the files into samples. In the example, sample_0000 is the first group name, with the part types jpg, txt and detail.json.
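The grouping rule can be illustrated in plain Python. This is a sketch of the rule itself, not energon's actual implementation:

```python
import itertools

# Tar member names in the order they appear in the shard (as in the
# example above). The order matters: parts of one sample are adjacent.
members = [
    "sample_0000.jpg",
    "sample_0000.txt",
    "sample_0000.detail.json",
    "sample_0001.jpg",
    "sample_0001.txt",
    "sample_0001.detail.json",
]

def base_name(name: str) -> str:
    # Everything before the first dot: the group key of the sample
    return name.split(".", 1)[0]

def part_type(name: str) -> str:
    # Everything after the first dot: the part type (may itself contain dots)
    return name.split(".", 1)[1]

groups = {
    key: [part_type(n) for n in names]
    for key, names in itertools.groupby(members, key=base_name)
}

print(groups)
# {'sample_0000': ['jpg', 'txt', 'detail.json'],
#  'sample_0001': ['jpg', 'txt', 'detail.json']}
```

Note that because grouping works on adjacent entries, a sample whose parts are scattered across the tar file would be split into multiple incomplete groups.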

By default, energon decodes each part based on its file extension (e.g. a part ending in .json is parsed with json.loads, and .png is loaded as an image).

Building a WebDataset using Python

The easiest way to construct a WebDataset from existing data (e.g. from another torch dataset or a folder with files) is to use the ShardWriter from the webdataset library:

import webdataset as wds


if __name__ == '__main__':
    # Wherever your dataset comes from, e.g. another torch dataset
    my_dataset = ...

    # maxcount limits the number of samples per shard; "%d" in the pattern
    # is replaced by the running shard index.
    with wds.ShardWriter("parts/data-%d.tar", maxcount=10000) as shard_writer:
        for key, data in my_dataset:
            sample = {
                "__key__": key,          # base name that groups the parts
                "png": data['image'],    # stored as <key>.png in the tar
                "txt": data['caption'],  # stored as <key>.txt
            }
            shard_writer.write(sample)

Step 2: Preparing the Dataset

Once you have a WebDataset ready, you will want to prepare it for use with Energon. This means adding additional metadata files next to the data. This step does not change or copy the contents of your tar files.

Just run the energon prepare /path/to/dataset command, which will interactively walk you through the process.

The command will

  • Search for all *.tar files in the given folder

  • Index them so samples can be accessed randomly

  • Ask you how you want to split the data into train/val/test paritions

  • Ask you how to decode the data (field map or sample_loader.py)

  • store all this information in a subfolder .nv-meta/, see details below.

Splitting the dataset into train/val/test

The first thing the energon prepare assistant will ask you is how you want to split the data by ratios. If you have a pre-determined split, you can pass that to energon instead. See the examples below.

Example 1: Let energon do the split

shards
β”œβ”€β”€ shard_0000.tar
β”œβ”€β”€ shard_0001.tar
└── ...

Commandline:

> energon prepare ./
# Exemplary answers to interactive questions:
Ratio: 8,1,1
Dataset class: CaptioningWebdataset
Field map: Yes
  image: jpg
  caption: txt  # if txt contains the caption
# or
  caption: json[caption]  # if .json contains {"caption": "My nice image"}

Example 2: Presplit shards by prefix

shards
β”œβ”€β”€ train_shard_0000.tar
β”œβ”€β”€ train_shard_0001.tar
β”œβ”€β”€ ...
β”œβ”€β”€ val_shard_0000.tar
β”œβ”€β”€ val_shard_0001.tar
└── ...

Commandline:

> energon prepare --split-parts 'train:shards/train_.*' --split-parts 'val:shards/val_.*' ./

Example 3: Presplit shards by folder

shards
β”œβ”€β”€ train
β”‚   β”œβ”€β”€ shard_00000.tar
β”‚   β”œβ”€β”€ shard_00001.tar
β”‚   └── ...
β”œβ”€β”€ val
β”‚   β”œβ”€β”€ shard_00000.tar
β”‚   β”œβ”€β”€ shard_00001.tar
β”‚   └── ...
└── ...

Commandline:

> energon prepare --split-parts 'train:shards/train/.*' --split-parts 'val:shards/val/.*' ./

Sample Types

After the split is set up, the assistant will ask which sample type you want to use. We provide a set of common sample types, e.g. for image captioning or visual question answering; they are listed below.

If none of these fits, you may need to set up your own new sample type. Here are your options:

  • You have a new type sample which is rather common but not in our list below

    • Please add your type to energon and create a pull request so we can add it

  • Your sample type is experimental or used temporarily only

    • You can add the sample type class in your code repository and create the dataset.yaml manually, referring to your class with __class__

Available Sample Types

These are the built-in sample types you can currently choose from:

  • Sample: Base dataclass for samples from source webdatasets.

    • Attributes:

      • __key__: str: Unique identifier of the sample within the dataset. Useful for backtracking the source of a single sample.

      • __key__: str: Structured key of the sample, which can be used to regenerate the sample without storing the whole sample.

      • __subflavor__: str: Deprecated.

      • __subflavors__: dict[str, Any] | None: Represents the subflavors (i.e. custom dict data) set for the source dataset (typically in the metadataset).

    • CaptioningSample: Represents a sample for captioning

      • Attributes:

        • image: torch.Tensor: The input image tensor

        • caption: str: The target caption string

    • ImageSample: Represents a sample which only contains an image (e.g. for reconstruction)

      • Attributes:

        • image: torch.Tensor: The image tensor

    • ImageClassificationSample: Represents a sample which contains an image with a caption

      • Attributes:

        • image: torch.Tensor: The image tensor

        • label: int | None: The label of the sample, as integral representation

        • label_name: str | None: The label of the sample

    • InterleavedSample: Represents a sample which contains interleaved media, such as image and text.

      • Attributes:

        • sequence: list[torch.Tensor | str]: The interleaved media (either a torch.Tensor or string for text)

    • MultiChoiceVQASample: Represents a sample for visual question answering, with a choice of answers and one correct answer.

      • Attributes:

        • image: torch.Tensor: The input image tensor

        • context: str: The context/question for the image

        • choices: List[str] | None: The candidate answers

        • correct_choice_idx: int | None: The index of the correct answer

    • OCRSample: Sample type for optical character recognition.

      • Attributes:

        • image: str: The input image

        • text: str: The text string for the whole image

        • block_boxes: torch.Tensor | None: The bounding boxes of the block in the image float(N, 4|5<x,y,w,h,confidence>)

        • block_classes: torch.Tensor | list[str] | None: The classes of th blocks

        • block_text: torch.Tensor | None: The text content of the blocks

        • lines_boxes: torch.Tensor | None: The bounding boxes of the text lines

        • lines_text: list[str] | None: The text content of the text lines

        • words_boxes: torch.Tensor | None: The bounding boxes of the text words

        • words_text: list[str] | None: The text content of the text words

        • chars_boxes: torch.Tensor | None: The bounding boxes of the text characters

        • chars_text: list[str] | None: The text content of the text characters

    • TextSample: Represents a sample which only contains a text string (e.g. for text generation)

      • Attributes:

        • text: str: The text string

    • VidQASample: Represents a sample which contains a video and a question with answer.

      • Attributes:

        • video: VideoData: The input image tensor

        • context: str: The context/question

        • answers: list[str] | None: The answer string

        • answer_weights: torch.Tensor | None: Weights for possibly multiple answers

    • VQASample: Represents a sample which contains an image, a question/context and an answer

      • Attributes:

        • image: torch.Tensor: The input image tensor

        • context: str: The context/question

        • answers: list[str] | None: The answer string

        • answer_weights: torch.Tensor | None: Weights for possibly multiple answers

    • VQAOCRSample: Sample type for question answering related to optical character recognition.

      • Attributes:

        • image: str: The input image

        • context: str: The context/question

        • text: str: The text contained in the image

        • answers: list[str] | None: The answer string

        • answer_weights: torch.Tensor | None: Weights for possibly multiple answers

        • words_boxes: torch.Tensor | None: The bounding boxes of the text words

        • words_text: list[str] | None: The text content of the text words

Sample Loading

There are multiple options for how to convert the data stored in the tar files to an instance of one of the sample types above.

After choosing the sample type, energon prepare will ask if you want to use a β€œsimple field map” or a β€œsample loader”. There is also a third method called β€œCrudeWebdataset”.

Field Map

If your data consists of simple text, json and images that can be decoded by the standard webdataset auto decoder, and they map directly to the attributes of your chosen sample type from the list above, use a β€œfield map”. The field map stores which file extension in the webdataset shall be mapped to which attribute of the sample class.
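As a sketch, the field map for the captioning example could be stored in dataset.yaml like this (the jpg and json[caption] extensions are assumptions matching the example shard layout and the interactive session shown earlier):

```yaml
field_map:
  image: jpg                # decoded image from <sample>.jpg
  caption: json[caption]    # "caption" key of the decoded <sample>.json
```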

Sample Loader

If your data needs custom decoding code to compute the sample attributes from the data in the tar, you should use a custom sample loader. The code should only contain the dataset-specific decoding, not any project-specific processing.

Example of a special format (here, an OCR dataset) for which we will use a custom sample_loader.py:

parts
β”œβ”€β”€ segs-000000.tar
β”‚   β”œβ”€β”€ 636f6d706f6e656e747362656e6566693030616e6472(0025).jp2
β”‚   β”œβ”€β”€ 636f6d706f6e656e747362656e6566693030616e6472(0025).lines.png
β”‚   β”œβ”€β”€ 636f6d706f6e656e747362656e6566693030616e6472(0025).mp
β”‚   β”œβ”€β”€ 636f6d706f6e656e747362656e6566693030616e6472(0025).words.png
β”‚   β”œβ”€β”€ 636f6d706f6e656e747362656e6566693030616e6472(0075).jp2
β”‚   β”œβ”€β”€ 636f6d706f6e656e747362656e6566693030616e6472(0075).lines.png
β”‚   β”œβ”€β”€ 636f6d706f6e656e747362656e6566693030616e6472(0075).mp
β”‚   └── ...
└── ...

.mp (msgpack content) files are automatically decoded, containing:

{
  "identifier": "componentsbenefi00andr",
  "pageno": 25,
  "size": {"w": 2286, "h": 3179},
  "lines": [
    {"l": 341, "t": 569, "b": 609, "r": 1974, "text": "CHAPTER 4  ADVANCED TRAFFIC CONTROL SYSTEMS IN INDIANA"},
    {"l": 401, "t": 770, "b": 815, "r": 2065, "text": "A variety of traffic control systems currently exist"},
    //...
  ],
  "words": [
    {"l": 341, "t": 577, "b": 609, "r": 544, "text": "CHAPTER"},
    {"l": 583, "t": 578, "b": 607, "r": 604, "text": "4"},
    //...
  ],
  "chars": [
    {"t": 579, "b": 609, "l": 341, "r": 363, "text": "C"},
    {"t": 579, "b": 609, "l": 370, "r": 395, "text": "H"},
    //...
  ],
}

sample_loader.py:

import torch


def sample_loader(raw: dict) -> dict:
    return dict(
        __key__=raw["__key__"],
        image=raw["jp2"],
        text="\n".join(line["text"] for line in raw["mp"]["lines"]),
        lines_boxes=torch.tensor(
            [
                (line["l"], line["t"], line["r"] - line["l"], line["b"] - line["t"])
                for line in raw["mp"]["lines"]
            ],
            dtype=torch.int64,
        ),
        lines_text=[line["text"] for line in raw["mp"]["lines"]],
        words_boxes=torch.tensor(
            [
                (line["l"], line["t"], line["r"] - line["l"], line["b"] - line["t"])
                for line in raw["mp"]["words"]
            ],
            dtype=torch.int64,
        ),
        words_text=[line["text"] for line in raw["mp"]["words"]],
        chars_boxes=torch.tensor(
            [
                (line["l"], line["t"], line["r"] - line["l"], line["b"] - line["t"])
                for line in raw["mp"]["chars"]
            ],
            dtype=torch.int64,
        ),
        chars_text=[line["text"] for line in raw["mp"]["chars"]],
    )


def part_filter(part: str) -> bool:
    # Only load the parts of each sample that the loader actually uses
    return part in ("jp2", "mp")

For more information please also read Custom Sample Loader.

Special format

Sometimes, your data cannot easily be represented with a field_map as explained above. For example, your data may contain

  • structured data like nested boxes for each sample

  • custom binary formats

  • xml / html / pickle etc.

In those cases you have two options:

  1. Creating a custom sample_loader.py in the .nv-meta folder as explained above.

    • This will typically do the job and is preferred if you only have to do some small conversions.

  2. Using a CrudeWebdataset

    • For more intricate conversions, you can use a CrudeWebdataset that will pass your samples in a raw form into your TaskEncoder where you can then convert them based on the subflavor for example. For more details see Crude Data and How to Cook It πŸ‘¨β€πŸ³.

Even for these special formats, you start preparing your data with the dataset preparation command as usual, but you will need to define a custom sample loader or select CrudeWebdataset in the dataprep wizard.

Dataset Format on Disk

The energon library supports loading large multi-modal datasets from disk. To load the dataset, it must comply with the format described in this section.

A valid energon dataset must contain an .nv-meta folder with certain files as shown below.

my_dataset
β”œβ”€β”€ .nv-meta
β”‚   β”œβ”€β”€ dataset.yaml
β”‚   β”œβ”€β”€ split.yaml
β”‚   └── .info.yaml
└── shards
    β”œβ”€β”€ shard_000.tar
    β”œβ”€β”€ shard_001.tar
    └── ...

Note that the shards folder is just an example. The shards and their folder can be named differently, but the .nv-meta structure is always the same.

Files in .nv-meta

dataset.yaml

The dataset.yaml contains the dataset definition, i.e. the dataset class to use as the loader and optional decoders. If you want to create such a file, consider using the CLI preparation tool.

Here’s an example:

sample_type:
  __module__: megatron.energon
  __class__: CaptioningSample
field_map:
  image: jpg
  caption: txt

The __class__ and __module__ values help the library construct the correct object. The field_map specifies how the fields from each webdataset sample are mapped to the members of the sample dataclass.

In this example, the dataclass is

@dataclass
class CaptioningSample(Sample):
    image: torch.Tensor
    caption: str

In some scenarios, you might need a more advanced way to map samples into the dataclass. In that case, please check out this page.

split.yaml

This file defines the splits (i.e. train, val, test), each with a list of the shards belonging to that split. It can also contain an β€œexclude list” to exclude certain samples or shards from training. Example:

exclude: []
split_parts:
  train:
  - shards/shard_000.tar
  - shards/shard_001.tar
  val:
  - shards/shard_002.tar
  test:
  - shards/shard_003.tar

To exclude certain shards or samples, you need to add those to the exclude list as follows:

exclude:
  - shards/shard_004.tar
  - shards/shard_001.tar/000032
  - shards/shard_001.tar/000033
split_parts:
...

The above configuration excludes the entire shard 004 and two individual samples from shard 001.

.info.yaml

The hidden info file is auto-generated and contains statistics about each shard.

Example:

shard_counts:
  shards/000.tar: 1223
  shards/001.tar: 1420
  shards/002.tar: 1418
  shards/003.tar: 1358