Data Preparation
The aim of data preparation is to convert your data to a format that the energon loader can understand and iterate.
The outcome will be a WebDataset with some extra information stored in a folder called .nv-meta.
Below in Dataset Format on Disk we explain the details about this format.
These are the typical steps to get your data ready:
1. Create a normal WebDataset from your data
2. Run our preparation tool energon prepare to convert to an energon-compatible format
Step 1: Creating a WebDataset
Example for a WebDataset (e.g. image captioning dataset):
shards
├── shard_0000.tar
│   ├── sample_0000.jpg
│   ├── sample_0000.txt
│   ├── sample_0000.detail.json
│   ├── sample_0001.jpg
│   ├── sample_0001.txt
│   └── sample_0001.detail.json
├── shard_0001.tar
│   ├── sample_0002.jpg
│   ├── sample_0002.txt
│   ├── sample_0002.detail.json
│   ├── sample_0003.jpg
│   ├── sample_0003.txt
│   └── sample_0003.detail.json
└── ...
In the example you can see two shards (i.e. tar files) with multiple samples. Each group of files with the same basename makes one sample.
So sample_0000.jpg, sample_0000.txt and sample_0000.detail.json are three parts that belong to the first sample.
Note that each sample may have a different number of parts, for example some samples may have more images than others.
In this case, they should still have the same basename, for example sample_0000.img1.jpg and sample_0000.img2.jpg. For an advanced example for interleaved data, check out this section.
The order of files in the tar file is important. Files with the same base name (the part of the filename before the first dot) must follow each other directly.
The base name is used to group the files into samples, i.e. in the example sample_0000 is the first group name, with the part types jpg, txt, detail.json.
The default behavior of energon is to parse the contents by extension (e.g. a part ending in .json will automatically be parsed with json.loads, and a .png part will be loaded as an image).
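The grouping rule can be illustrated with a minimal sketch. It is not energon's implementation: the helper names split_name and group_samples are made up for this example, and it assumes member names without directory components. Because itertools.groupby only merges consecutive items, it also shows why the files of one sample must follow each other.

```python
from itertools import groupby


def split_name(filename: str) -> tuple[str, str]:
    """Split 'sample_0000.detail.json' into ('sample_0000', 'detail.json')."""
    base, _, part_type = filename.partition(".")
    return base, part_type


def group_samples(filenames: list[str]) -> list[tuple[str, list[str]]]:
    """Group consecutive files that share a base name into samples."""
    return [
        (base, [split_name(name)[1] for name in names])
        for base, names in groupby(filenames, key=lambda name: split_name(name)[0])
    ]


members = [
    "sample_0000.jpg",
    "sample_0000.txt",
    "sample_0000.detail.json",
    "sample_0001.jpg",
    "sample_0001.txt",
    "sample_0001.detail.json",
]
samples = group_samples(members)
for base, part_types in samples:
    print(base, part_types)
```

If the files of one sample are interrupted by a file of another sample, this sketch yields two separate groups for the first base name, which is the situation the ordering rule is meant to prevent.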
Building a WebDataset using Python
The easiest way to construct a WebDataset from existing data (e.g. from another torch dataset or a folder with files) is to use the ShardWriter from the webdataset library:
import webdataset as wds

if __name__ == '__main__':
    # Wherever your dataset comes from
    my_dataset = ...

    with wds.ShardWriter("parts/data-%d.tar", maxcount=10000) as shard_writer:
        for key, data in my_dataset:
            sample = {
                "__key__": key,
                "png": data['image'],
            }
            shard_writer.write(sample)
Step 2: Preparing the Dataset
Once you have a WebDataset ready, you will want to prepare it for use with Energon. This means adding additional metadata files next to the data. This step does not change or copy the contents of your tar files.
Just run the energon prepare /path/to/dataset command, which will interactively walk you through the process.
The command will:
Search for all *.tar files in the given folder
Index them so samples can be accessed randomly
Ask you how you want to split the data into train/val/test partitions
Ask you how to decode the data (field map or sample_loader.py)
Store all this information in a subfolder .nv-meta/, see details below.
Splitting the dataset into train/val/test
The first thing that the energon prepare assistant will ask you is how you want to split the data by ratios.
However, if you have a pre-determined split, you can also pass that to energon. See the examples below.
Example 1: Let energon do the split
shards
├── shard_0000.tar
├── shard_0001.tar
└── ...
Commandline:
> energon prepare ./
# Example answers to interactive questions:
Ratio: 8,1,1
Dataset class: CaptioningWebdataset
Field map: Yes
  image: jpg
  caption: txt  # if txt contains the caption
# or
  caption: json[caption]  # if .json contains {"caption": "My nice image"}
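To make the effect of a ratio such as 8,1,1 concrete, here is a minimal sketch of splitting a list of shards proportionally. The function split_by_ratio is hypothetical, and how energon actually assigns shards to splits (for example, whether it shuffles them first) is not described here, so treat this only as an illustration of the arithmetic.

```python
def split_by_ratio(shards: list[str], ratios: dict[str, int]) -> dict[str, list[str]]:
    """Cut the shard list into consecutive chunks proportional to the ratios."""
    total = sum(ratios.values())
    result: dict[str, list[str]] = {}
    start = 0
    cumulative = 0
    for name, ratio in ratios.items():
        cumulative += ratio
        # Use the cumulative ratio so rounding errors do not accumulate.
        end = round(len(shards) * cumulative / total)
        result[name] = shards[start:end]
        start = end
    return result


shards = [f"shards/shard_{i:04d}.tar" for i in range(10)]
parts = split_by_ratio(shards, {"train": 8, "val": 1, "test": 1})
for name, files in parts.items():
    print(name, len(files))
```

Note that the split works on whole shards, so with very few shards the resulting sizes can only approximate the requested ratios.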
Example 2: Presplit shards by prefix
shards
├── train_shard_0000.tar
├── train_shard_0001.tar
├── ...
├── val_shard_0000.tar
├── val_shard_0001.tar
└── ...
Commandline:
> energon prepare --split-parts 'train:shards/train_.*' --split-parts 'val:shards/val_.*' ./
Example 3: Presplit shards by folder
shards
├── train
│   ├── shard_00000.tar
│   ├── shard_00001.tar
│   └── ...
├── val
│   ├── shard_00000.tar
│   ├── shard_00001.tar
│   └── ...
└── ...
Commandline:
> energon prepare --split-parts 'train:shards/train/.*' --split-parts 'val:shards/val/.*' ./
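The --split-parts arguments above pair a split name with a regular expression over shard paths. The following sketch shows the idea in plain Python. The function assign_splits is hypothetical, and matching the pattern against the whole relative path is an assumption of this example rather than documented energon behavior.

```python
import re


def assign_splits(shards: list[str], split_parts: dict[str, str]) -> dict[str, list[str]]:
    """Assign each shard path to the first split whose pattern matches it."""
    result: dict[str, list[str]] = {name: [] for name in split_parts}
    for shard in shards:
        for name, pattern in split_parts.items():
            # Assumption: the pattern has to match the whole relative path.
            if re.fullmatch(pattern, shard):
                result[name].append(shard)
                break
    return result


shards = [
    "shards/train_shard_0000.tar",
    "shards/train_shard_0001.tar",
    "shards/val_shard_0000.tar",
]
splits = assign_splits(shards, {"train": "shards/train_.*", "val": "shards/val_.*"})
for name, files in splits.items():
    print(name, files)
```

The same helper covers the folder layout of Example 3 when called with the patterns shards/train/.* and shards/val/.*.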
Sample Types
After the split is set up, the assistant will ask you which sample type you want to use. We provide a set of common sample types, such as for image captioning or visual question answering; they are listed below.
If none of these fits, you may need to set up your own new sample type. Here are your options:
You have a new sample type which is rather common but not in our list below
    Please add your type to energon and create a pull request so we can add it
Your sample type is experimental or used temporarily only
    You can add the sample type class in your code repository and create the dataset.yaml manually, referring to your class with __class__
Available Sample Types
These are the possible integrated types you can currently choose from:
Sample: Base dataclass for samples from source webdatasets.
Attributes:
    __key__: str
        Unique identifier of the sample within the dataset. Useful for backtracking the source of a single sample.
    __restore_key__: tuple
        Structured key of the sample, which can be used to regenerate the sample without storing the whole sample.
    __subflavor__: str
        Deprecated.
    __subflavors__: dict[str, Any] | None
        Represents the subflavors (i.e. custom dict data) set for the source dataset (typically in the metadataset).
CaptioningSample: Represents a sample for captioning.
Attributes:
    image: torch.Tensor
        The input image tensor
    caption: str
        The target caption string
ImageSample: Represents a sample which only contains an image (e.g. for reconstruction).
Attributes:
    image: torch.Tensor
        The image tensor
ImageClassificationSample: Represents a sample which contains an image with a label.
Attributes:
    image: torch.Tensor
        The image tensor
    label: int | None
        The label of the sample, as integral representation
    label_name: str | None
        The label of the sample, as a name
InterleavedSample: Represents a sample which contains interleaved media, such as image and text.
Attributes:
    sequence: list[torch.Tensor | str]
        The interleaved media (either a torch.Tensor or string for text)
MultiChoiceVQASample: Represents a sample for visual question answering, with a choice of answers and one correct answer.
Attributes:
    image: torch.Tensor
        The input image tensor
    context: str
        The context/question for the image
    choices: List[str] | None
        The candidate answers
    correct_choice_idx: int | None
        The index of the correct answer
OCRSample: Sample type for optical character recognition.
Attributes:
    image: str
        The input image
    text: str
        The text string for the whole image
    block_boxes: torch.Tensor | None
        The bounding boxes of the blocks in the image, float(N, 4|5<x,y,w,h,confidence>)
    block_classes: torch.Tensor | list[str] | None
        The classes of the blocks
    block_text: torch.Tensor | None
        The text content of the blocks
    lines_boxes: torch.Tensor | None
        The bounding boxes of the text lines
    lines_text: list[str] | None
        The text content of the text lines
    words_boxes: torch.Tensor | None
        The bounding boxes of the text words
    words_text: list[str] | None
        The text content of the text words
    chars_boxes: torch.Tensor | None
        The bounding boxes of the text characters
    chars_text: list[str] | None
        The text content of the text characters
TextSample: Represents a sample which only contains a text string (e.g. for text generation).
Attributes:
    text: str
        The text string
VidQASample: Represents a sample which contains a video and a question with answer.
Attributes:
    video: VideoData
        The input video
    context: str
        The context/question
    answers: list[str] | None
        The answer string
    answer_weights: torch.Tensor | None
        Weights for possibly multiple answers
VQASample: Represents a sample which contains an image, a question/context and an answer.
Attributes:
    image: torch.Tensor
        The input image tensor
    context: str
        The context/question
    answers: list[str] | None
        The answer string
    answer_weights: torch.Tensor | None
        Weights for possibly multiple answers
VQAOCRSample: Sample type for question answering related to optical character recognition.
Attributes:
    image: str
        The input image
    context: str
        The context/question
    text: str
        The text contained in the image
    answers: list[str] | None
        The answer string
    answer_weights: torch.Tensor | None
        Weights for possibly multiple answers
    words_boxes: torch.Tensor | None
        The bounding boxes of the text words
    words_text: list[str] | None
        The text content of the text words
Sample Loading
There are multiple options for how to convert the data stored in the tar files to an instance of one of the sample types above.
After choosing the sample type, energon prepare will ask if you want to use a "simple field map" or a "sample loader".
There is also a third method called "CrudeWebdataset".
Field Map
If your data consists of simple text, json and images that can be decoded by the standard webdataset auto decoder, and they map directly to the attributes of your chosen sample type from the list above, use a "field map". The field map stores which file extension in the webdataset shall be mapped to which attribute of the sample class.
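Conceptually, a field map is a lookup from sample attribute to file extension, optionally with a key into decoded JSON (the json[caption] form used in the preparation example). The sketch below mimics that lookup on an already decoded sample dict. The function apply_field_map and the regular expression are assumptions made for illustration and do not reflect energon's internal code.

```python
import re

# Matches "jpg" as well as "json[caption]" (extension plus one optional key).
_FIELD = re.compile(r"^(?P<ext>[^\[\]]+)(?:\[(?P<key>[^\[\]]+)\])?$")


def apply_field_map(raw: dict, field_map: dict[str, str]) -> dict:
    """Build the sample attributes from a decoded webdataset sample."""
    sample = {}
    for attribute, source in field_map.items():
        match = _FIELD.match(source)
        if match is None:
            raise ValueError(f"Unsupported field map entry: {source!r}")
        value = raw[match["ext"]]
        if match["key"] is not None:
            # Index into the decoded content, e.g. a dict loaded from JSON.
            value = value[match["key"]]
        sample[attribute] = value
    return sample


raw = {
    "__key__": "sample_0000",
    "jpg": "<decoded image>",  # placeholder for the decoded image
    "json": {"caption": "My nice image"},
}
fields = apply_field_map(raw, {"image": "jpg", "caption": "json[caption]"})
print(fields)
```

Anything that needs more than such a direct lookup is a case for a sample loader.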
Sample Loader
If your data needs some custom decoding code to compute the sample attributes from the data in the tar, you should use a custom sample loader. The code shall only contain the dataset-specific decoding, no project-specific decoding.
Example for a special format (e.g. ocr dataset) for which we will use a custom sample_loader.py:
parts
├── segs-000000.tar
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0025).jp2
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0025).lines.png
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0025).mp
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0025).words.png
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0075).jp2
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0075).lines.png
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0075).mp
│   └── ...
└── ...
.mp (msgpack content) files are automatically decoded, containing:
{
    "identifier": "componentsbenefi00andr",
    "pageno": 25,
    "size": {"w": 2286, "h": 3179},
    "lines": [
        {"l": 341, "t": 569, "b": 609, "r": 1974, "text": "CHAPTER 4 ADVANCED TRAFFIC CONTROL SYSTEMS IN INDIANA"},
        {"l": 401, "t": 770, "b": 815, "r": 2065, "text": "A variety of traffic control systems currently exist"},
        //...
    ],
    "words": [
        {"l": 341, "t": 577, "b": 609, "r": 544, "text": "CHAPTER"},
        {"l": 583, "t": 578, "b": 607, "r": 604, "text": "4"},
        //...
    ],
    "chars": [
        {"t": 579, "b": 609, "l": 341, "r": 363, "text": "C"},
        {"t": 579, "b": 609, "l": 370, "r": 395, "text": "H"},
        //...
    ],
}
sample_loader.py:
import torch


def sample_loader(raw: dict) -> dict:
    # Boxes are stored as left/top/right/bottom and converted to (x, y, w, h).
    return dict(
        __key__=raw["__key__"],
        image=raw["jp2"],
        text="\n".join(line["text"] for line in raw["mp"]["lines"]),
        lines_boxes=torch.tensor(
            [
                (line["l"], line["t"], line["r"] - line["l"], line["b"] - line["t"])
                for line in raw["mp"]["lines"]
            ],
            dtype=torch.int64,
        ),
        lines_text=[line["text"] for line in raw["mp"]["lines"]],
        words_boxes=torch.tensor(
            [
                (line["l"], line["t"], line["r"] - line["l"], line["b"] - line["t"])
                for line in raw["mp"]["words"]
            ],
            dtype=torch.int64,
        ),
        words_text=[line["text"] for line in raw["mp"]["words"]],
        chars_boxes=torch.tensor(
            [
                (line["l"], line["t"], line["r"] - line["l"], line["b"] - line["t"])
                for line in raw["mp"]["chars"]
            ],
            dtype=torch.int64,
        ),
        chars_text=[line["text"] for line in raw["mp"]["chars"]],
    )


def part_filter(part: str) -> bool:
    # Only the image and the msgpack metadata are needed by sample_loader.
    return part in ("jp2", "mp")
For more information please also read Custom Sample Loader.
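The box conversion used in the sample loader above can be checked in isolation. The following sketch uses plain Python instead of torch and a hypothetical helper ltrb_to_xywh; the input values are taken from the first entry of "lines" in the msgpack example.

```python
def ltrb_to_xywh(box: dict) -> tuple[int, int, int, int]:
    """Convert left/top/right/bottom coordinates to (x, y, width, height)."""
    return (
        box["l"],
        box["t"],
        box["r"] - box["l"],  # width
        box["b"] - box["t"],  # height
    )


first_line = {"l": 341, "t": 569, "b": 609, "r": 1974}
converted = ltrb_to_xywh(first_line)
print(converted)  # (341, 569, 1633, 40)
```

Factoring the conversion into a helper like this would also remove the three repeated tuple expressions in the loader, at the cost of one more function in sample_loader.py.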
Special format
Sometimes, your data will not be easily represented as a field_map explained above.
For example, your data may contain
structured data like nested boxes for each sample
custom binary formats
xml / html / pickle etc.
In those cases you have two options:
Creating a custom sample_loader.py in the .nv-meta folder as explained above.
    This will typically do the job and is preferred if you only have to do some small conversions.
Using a CrudeWebdataset
    For more intricate conversions, you can use a CrudeWebdataset that will pass your samples in a raw form into your TaskEncoder, where you can then convert them, for example based on the subflavor. For more details see Crude Data and How to Cook It 👨‍🍳.
Even for these special WebDataset formats, you start preparing your data with the dataset preparation command, but you will need to define a custom sample loader or select CrudeWebdataset in the dataprep wizard.
Dataset Format on Disk
The energon library supports loading large multi-modal datasets from disk. To load the dataset, it must comply with the format described in this section.
A valid energon dataset must contain an .nv-meta folder with certain files as shown below.
my_dataset
├── .nv-meta
│   ├── dataset.yaml
│   ├── split.yaml
│   └── .info.yaml
└── shards
    ├── shard_000.tar
    ├── shard_001.tar
    └── ...
Note that the shards folder is just an example. The shards and their folder can be named differently, but the .nv-meta structure is always the same.
Files in .nv-meta
dataset.yaml
The dataset.yaml contains the dataset definition, i.e. the dataset class to use as loader and optional decoders.
If you want to create such a file, you should consider using the CLI preparation tool.
Here's an example:
sample_type:
  __module__: megatron.energon
  __class__: CaptioningSample
field_map:
  image: jpg
  caption: txt
The __class__ and __module__ values help the library construct the correct object.
The field_map specifies how the fields from each webdataset sample are mapped to the members of the sample dataclass.
In this example, the dataclass is:
@dataclass
class CaptioningSample(Sample):
    image: torch.Tensor
    caption: str
In some scenarios, you might need a more advanced way to map samples into the dataclass. In that case, please check out this page.
split.yaml
This file contains the splits (i.e. train, val, test), each given as a list of shards. It can also contain an "exclude list" to exclude certain samples or shards from training. Example:
exclude: []
split_parts:
  train:
    - shards/shard_000.tar
    - shards/shard_001.tar
  val:
    - shards/shard_002.tar
  test:
    - shards/shard_003.tar
To exclude certain shards or samples, you need to add those to the exclude list as follows:
exclude:
  - shards/shard_004.tar
  - shards/shard_001.tar/000032
  - shards/shard_001.tar/000033
split_parts:
  ...
The above configuration excludes the entire shard 004 and two samples from the shard 001.
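The semantics of the exclude list can be sketched as a simple membership test. This is an illustration only: is_excluded is a hypothetical helper, and it assumes that a sample entry is written as the shard path followed by a slash and the sample key, as the example entries suggest.

```python
from typing import Optional


def is_excluded(shard: str, sample_key: Optional[str], exclude: list[str]) -> bool:
    """Return True if the whole shard or this particular sample is excluded."""
    excluded = set(exclude)
    if shard in excluded:
        # The entire shard is listed, so all of its samples are skipped.
        return True
    return sample_key is not None and f"{shard}/{sample_key}" in excluded


exclude = ["shards/shard_004.tar", "shards/shard_001.tar/000032"]
print(is_excluded("shards/shard_004.tar", "000001", exclude))  # True
print(is_excluded("shards/shard_001.tar", "000032", exclude))  # True
print(is_excluded("shards/shard_001.tar", "000033", exclude))  # False
```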
.info.yaml
The hidden info file is auto-generated and contains statistics about each shard.
Example:
shard_counts:
  shards/000.tar: 1223
  shards/001.tar: 1420
  shards/002.tar: 1418
  shards/003.tar: 1358
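Per-shard sample counts like these could be reproduced roughly as follows. This is a sketch using only the standard library tarfile module, not the code that energon prepare runs; count_samples is a hypothetical helper, and it assumes that the files of one sample are stored consecutively and share the part of the filename before the first dot.

```python
import io
import os
import tarfile
import tempfile


def count_samples(tar_path: str) -> int:
    """Count groups of consecutive files that share the same base name."""
    count = 0
    previous_key = None
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            directory, filename = os.path.split(member.name)
            # The sample key is the path up to the first dot of the filename.
            key = (directory, filename.split(".", 1)[0])
            if key != previous_key:
                count += 1
                previous_key = key
    return count


# Build a tiny throwaway shard to try the function on.
with tempfile.TemporaryDirectory() as tmp:
    shard_path = os.path.join(tmp, "shard_000.tar")
    with tarfile.open(shard_path, "w") as tar:
        for name in ["sample_0000.jpg", "sample_0000.txt", "sample_0001.jpg"]:
            info = tarfile.TarInfo(name)
            info.size = 3
            tar.addfile(info, io.BytesIO(b"abc"))
    demo_count = count_samples(shard_path)

print(demo_count)  # 2
```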