# Data Preparation

The aim of data preparation is to convert your data to a format that the energon loader can understand and iterate over.
The outcome will be a WebDataset with some extra information stored in a folder called `.nv-meta`.
Below, in Dataset Format on Disk, we explain the details of this format.
## Important Considerations

Depending on what your data looks like and how you plan to use it, you will have to make a few choices before preparing your dataset:

### Monolithic vs. Polylithic (Primary and Auxiliary) Datasets

You can include the media (images/video/audio) inside the same webdataset along with the text and metadata of each sample, or you can keep the media separate (either in another indexed webdataset or as individual files on disk).

If you can, you should go for the monolithic option, because it is faster to load. However, there are a few reasons why the polylithic option may be needed:

- You need to keep the original media files and don't want to duplicate them.
- Your media data is very large (e.g. long videos) and you need to keep your primary dataset small (containing just the text-based data and meta information).
- You want to re-use the same media with different labels, or you want to train on different subsets.
- You want to train with online packing and cannot fit all the media of the packing buffer in memory. With polylithic datasets, you can use caching to avoid that issue.
### How to Shard the Data

The WebDataset will be split into a number of shards (i.e. tar files). You'll have to decide how many samples to put in one shard, and thus how many shards you get overall.

To maximize loading speed, use as few shards as possible; even a single shard can work well. However, if you cannot handle files above a certain size, you may need to split the data into more shards. A good rule of thumb is to keep the number of shards below 10k.

If you are using remote filesystems like S3, there may be an opposing constraint: S3 limits the number of requests per second that you can make to a single prefix (e.g. filename). By using more shards, you can increase the overall request rate. Ideally, you would still want to stay below 10k shards.
### Raw vs. Baked Data

When using images, for example, you could store the encoded JPG, the decoded pixel values, or even the encoded features in the dataset.

Typically, we recommend going with the "original form" (e.g. JPG) and doing all the processing on the fly inside the cooker and task encoder. This way, you can change the processing later and keep your dataset as-is.

However, if the processing becomes a bottleneck, you can move some of it into the dataset creation phase by baking the information in. Keep in mind that others may also want to use your dataset for a different project.
## Steps to Create a Monolithic Dataset

These are the typical steps to get your data ready:

1. Create a normal WebDataset from your data (including all the media content).
2. Run our preparation tool `energon prepare` to create the additional metadata needed by energon. See Dataset Format on Disk.
## Steps to Create a Polylithic Dataset

1. Create the primary WebDataset from the text-based part of your data (meta information, labels, sizes etc.).
   - Include the file names (don't use absolute paths) of the media that belongs to each sample (e.g. as strings inside a json entry).
2. Create the auxiliary dataset(s). These can be multiple datasets, e.g. one per modality.
   - Either as a folder on disk with all the media files inside,
   - or as another WebDataset that contains just the media files (with the exact same names).
3. Run our preparation tool `energon prepare` on both WebDatasets to convert them to an energon-compatible format.
4. Configure both datasets as `CrudeWebdataset`.
5. Create a metadataset that specifies what auxiliary data to load for each primary dataset, as sketched below.

For more details, read about crude data.
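For illustration, such a metadataset might look roughly like the following (a minimal sketch; all paths and the `media` name are placeholders, and the exact schema is described in the crude data documentation):

```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    # The primary (crude) dataset with text and metadata
    path: ./primary_dataset
    aux:
      # An auxiliary dataset holding the referenced media files
      media: ./auxiliary_media_dataset
```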
## Step 1: Creating a WebDataset

Example of a WebDataset (e.g. an image captioning dataset):

```
shards
├── shard_0000.tar
│   ├── sample_0000.jpg
│   ├── sample_0000.txt
│   ├── sample_0000.detail.json
│   ├── sample_0001.jpg
│   ├── sample_0001.txt
│   └── sample_0001.detail.json
├── shard_0001.tar
│   ├── sample_0002.jpg
│   ├── sample_0002.txt
│   ├── sample_0002.detail.json
│   ├── sample_0003.jpg
│   ├── sample_0003.txt
│   └── sample_0003.detail.json
└── ...
```
In the example you can see two shards (i.e. tar files) with multiple samples each. Each group of files with the same basename makes up one sample. So `sample_0000.jpg`, `sample_0000.txt`, and `sample_0000.detail.json` are three parts that belong to the first sample.

This example shows a monolithic dataset; for a polylithic dataset, you would drop the JPGs from the primary dataset.

Note that each sample may have a different number of parts; for example, some samples may have more images than others. In this case, the parts should still share the same basename, for example `sample_0000.img1.jpg` and `sample_0000.img2.jpg`. For an advanced example with interleaved data, check out this section.
The order of samples in the tar file is important: samples with the same base name (i.e. the part before the first dot of the filename) must follow each other. The base name is used to group the files into samples; in the example, `sample_0000` is the first group name, with the part types `jpg`, `txt`, and `detail.json`.

The default behavior of energon is to parse the contents by extension (e.g. a part ending in `.json` is automatically parsed with `json.loads`, and a `.png` part is loaded as an image).
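For illustration, after decoding, the first sample from the tree above would roughly correspond to a Python dict like this (a sketch only; the actual decoded types depend on the configured decoder):

```python
sample = {
    "__key__": "sample_0000",          # the group basename, without extensions
    "jpg": ...,                        # the decoded image (e.g. a PIL image or tensor, depending on the decoder)
    "txt": "a caption",                # .txt parts are decoded as plain strings
    "detail.json": {"source": "..."},  # .json parts are parsed with json.loads
}
```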
### Building a WebDataset using Python

The easiest way to construct a WebDataset from existing data (e.g. from another torch dataset or a folder of files) is to use the `ShardWriter` from the webdataset library:

```python
import webdataset as wds

if __name__ == '__main__':
    # Wherever your dataset comes from
    my_dataset = ...

    with wds.ShardWriter("parts/data-%d.tar", maxcount=10000) as shard_writer:
        for key, data in my_dataset:
            sample = {
                "__key__": key,
                "png": data['image'],
            }
            shard_writer.write(sample)
```
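Samples may of course contain more than one part. As a sketch (the field names and values here are illustrative), the default webdataset encoder chooses the serialization based on each dict key's extension:

```python
sample = {
    "__key__": f"sample_{index:04d}",      # basename shared by all parts of this sample
    "jpg": jpeg_bytes,                     # raw bytes are written as-is
    "txt": "a caption",                    # strings are written UTF-8 encoded
    "json": {"width": 640, "height": 480}, # dicts under a .json key are serialized with json.dumps
}
shard_writer.write(sample)
```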
## Step 2: Preparing the Dataset

Once you have a WebDataset ready, you will want to prepare it for use with Energon. This means adding additional metadata files next to the data. This step does not change or copy the contents of your tar files.

Just run the `energon prepare /path/to/dataset` command, which will interactively walk you through the process.

The command will

- search for all `*.tar` files in the given folder,
- index them so samples can be accessed randomly,
- ask you how you want to split the data into train/val/test partitions,
- ask you about the sample type (optionally crude),
- ask you how to decode the data if not using crude data (field map or `sample_loader.py`),
- store all this information in a subfolder `.nv-meta/`, see details below.
### Splitting the dataset into train/val/test

The first thing that the `energon prepare` assistant will ask you is how you want to split the data by ratios. However, if you have a pre-determined split, you can also pass that to energon. See the examples below.

#### Example 1: Let energon do the split

```
shards
├── shard_0000.tar
├── shard_0001.tar
└── ...
```

Command line:

```
> energon prepare ./
# Exemplary answers to the interactive questions:
Ratio: 8,1,1
Dataset class: CaptioningWebdataset
Field map: Yes
image: jpg
caption: txt            # if the .txt file contains the caption
# or
caption: json[caption]  # if the .json file contains {"caption": "My nice image"}
```
#### Example 2: Presplit shards by prefix

```
shards
├── train_shard_0000.tar
├── train_shard_0001.tar
├── ...
├── val_shard_0000.tar
├── val_shard_0001.tar
└── ...
```

Command line:

```
> energon prepare --split-parts 'train:shards/train_.*' --split-parts 'val:shards/val_.*' ./
```

Note that the pattern-matching syntax uses regexes, so for arbitrary characters insert `.*`, not just `*`.
#### Example 3: Presplit shards by folder

```
shards
├── train
│   ├── shard_00000.tar
│   ├── shard_00001.tar
│   └── ...
├── val
│   ├── shard_00000.tar
│   ├── shard_00001.tar
│   └── ...
└── ...
```

Command line:

```
> energon prepare --split-parts 'train:shards/train/.*' --split-parts 'val:shards/val/.*' ./
```
### Good to know

You can inspect your prepared dataset like a normal file system by using the `energon mount` feature.
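For example (an illustrative invocation with placeholder paths; check `energon mount --help` for the exact arguments):

```
> energon mount /path/to/dataset /path/to/mountpoint
```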
## Sample Types

After the split is set up, the assistant will ask you which sample type you want to use. We provide a set of common sample types, e.g. for image captioning or visual question answering; they are listed below.

These will be sufficient in simple scenarios, but if none of them fits, you can create your own sample type. Here are your options:

- Your new sample type is rather common but not in our list below:
  - Please add your type to energon and create a pull request, so we can add it.
- Your sample type is experimental, very special, or used only temporarily:
  - You can add the sample type class in your code repository and create the `dataset.yaml` manually, referring to your class with `__class__` (as sketched below).
  - Alternatively, you can add the sample type class in your code repository and use a crude dataset with cookers (no need to put the sample type in `dataset.yaml`).
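For illustration, a hand-written `dataset.yaml` referring to your own class might look like this (a minimal sketch; the module, class, and field names are placeholders for your own code, and the `field_map` syntax follows the examples above):

```yaml
sample_type:
  __module__: my_project.sample_types
  __class__: MyCustomSample
field_map:
  image: jpg
  my_field: json[my_key]
```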
### Available Sample Types

These are the integrated types you can currently choose from:

- `Sample`: Base dataclass for samples from source webdatasets. Attributes:
  - `__key__: str`: Unique identifier of the sample within the dataset. Useful for tracking down the source of a single sample.
  - `__restore_key__: tuple[str | int | tuple, ...]`: Structured key of the sample, which can be used to regenerate the sample without storing the whole sample.
  - `__subflavors__: dict[str, Any] | None`: Represents the subflavors (i.e. custom dict data) set for the source dataset (typically in the metadataset).
- `CaptioningSample`: Represents a sample for captioning. Attributes:
  - `image: torch.Tensor`: The input image tensor
  - `caption: str`: The target caption string
- `ImageSample`: Represents a sample which only contains an image (e.g. for reconstruction). Attributes:
  - `image: torch.Tensor`: The image tensor
- `ImageClassificationSample`: Represents a sample which contains an image and its class label. Attributes:
  - `image: torch.Tensor`: The image tensor
  - `label: int | None`: The label of the sample, as integral representation
  - `label_name: str | None`: The label of the sample, as its string name
- `InterleavedSample`: Represents a sample which contains interleaved media, such as images and text. Attributes:
  - `sequence: list[torch.Tensor | str]`: The interleaved media (either a torch.Tensor or a string for text)
- `MultiChoiceVQASample`: Represents a sample for visual question answering, with a choice of answers and one correct answer. Attributes:
  - `image: torch.Tensor`: The input image tensor
  - `context: str`: The context/question for the image
  - `choices: list[str] | None`: The candidate answers
  - `correct_choice_idx: int | None`: The index of the correct answer
- `OCRSample`: Sample type for optical character recognition. Attributes:
  - `image: str`: The input image
  - `text: str`: The text string for the whole image
  - `block_boxes: torch.Tensor | None`: The bounding boxes of the blocks in the image, as a float tensor of shape (N, 4) or (N, 5) with columns x, y, w, h[, confidence]
  - `block_classes: torch.Tensor | list[str] | None`: The classes of the blocks
  - `block_text: torch.Tensor | None`: The text content of the blocks
  - `lines_boxes: torch.Tensor | None`: The bounding boxes of the text lines
  - `lines_text: list[str] | None`: The text content of the text lines
  - `words_boxes: torch.Tensor | None`: The bounding boxes of the text words
  - `words_text: list[str] | None`: The text content of the text words
  - `chars_boxes: torch.Tensor | None`: The bounding boxes of the text characters
  - `chars_text: list[str] | None`: The text content of the text characters
- `TextSample`: Represents a sample which only contains a text string (e.g. for text generation). Attributes:
  - `text: str`: The text string
- `VidQASample`: Represents a sample which contains a video and a question with its answer. Attributes:
  - `video: VideoData`: The input video data
  - `context: str`: The context/question
  - `answers: list[str] | None`: The answer strings
  - `answer_weights: torch.Tensor | None`: Weights for possibly multiple answers
- `VQASample`: Represents a sample which contains an image, a question/context, and an answer. Attributes:
  - `image: torch.Tensor`: The input image tensor
  - `context: str`: The context/question
  - `answers: list[str] | None`: The answer strings
  - `answer_weights: torch.Tensor | None`: Weights for possibly multiple answers
- `VQAOCRSample`: Sample type for question answering related to optical character recognition. Attributes:
  - `image: str`: The input image
  - `context: str`: The context/question
  - `text: str`: The text contained in the image
  - `answers: list[str] | None`: The answer strings
  - `answer_weights: torch.Tensor | None`: Weights for possibly multiple answers
  - `words_boxes: torch.Tensor | None`: The bounding boxes of the text words
  - `words_text: list[str] | None`: The text content of the text words
## Sample Loading

When you actually use and load your dataset, the data stored in the tar files needs to be converted to an instance of your chosen sample type. There are three options:

- If the conversion is a simple 1:1 mapping of files to fields of the sample type class, you can use a simple field map.
- Otherwise, the now preferred way is to use a `CrudeWebdataset` and do the conversion inside a cooker.
- There is another (now legacy) way: creating a custom `sample_loader.py` file next to your dataset. This option will continue to work, but we encourage you to move to crude datasets in the future.
When running `energon prepare`, you can choose "Crude sample" as the sample type, and the assistant will end there. If you picked another sample type, the assistant will ask whether you want to use a "simple field map" or a "sample loader".
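If you go the crude route, the conversion happens in a cooker in your code. As a minimal sketch (assuming the `Cooker` and `basic_sample_keys` helpers from megatron.energon; see the crude data documentation for the exact API), a cooker turning a crude sample into a `TextSample` could look like this:

```python
from megatron.energon import Cooker, TextSample, basic_sample_keys

def cook_text(sample: dict) -> TextSample:
    # `sample` holds the raw parts of one crude sample; text parts arrive as bytes.
    return TextSample(
        **basic_sample_keys(sample),  # carries over __key__ and the other sample bookkeeping
        text=sample["txt"].decode("utf-8"),
    )

# Registered in your TaskEncoder, e.g.: cookers = [Cooker(cook_text)]
```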
### Simple Field Map

If your data consists of simple text, json, and images that can be decoded by the standard webdataset auto-decoder, and they map directly to the attributes of your chosen sample type from the list above, use a "field map". The field map stores which file extension in the webdataset is mapped to which attribute of the sample class.
### Sample Loader (Deprecated)

If your data needs some custom decoding code to compute the sample attributes from the data in the tar, you can use a custom sample loader. However, starting from Energon 7, we recommend using crude datasets and a cooker instead.

If you use a `sample_loader.py`, its code should only contain the dataset-specific decoding, not any project-specific processing.

Example of a special format (e.g. an OCR dataset) for which we will use a custom `sample_loader.py`:
```
parts
├── segs-000000.tar
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0025).jp2
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0025).lines.png
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0025).mp
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0025).words.png
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0075).jp2
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0075).lines.png
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0075).mp
│   └── ...
└── ...
```
The `.mp` files (`msgpack` content) are automatically decoded and contain:

```json
{
  "identifier": "componentsbenefi00andr",
  "pageno": 25,
  "size": {"w": 2286, "h": 3179},
  "lines": [
    {"l": 341, "t": 569, "b": 609, "r": 1974, "text": "CHAPTER 4 ADVANCED TRAFFIC CONTROL SYSTEMS IN INDIANA"},
    {"l": 401, "t": 770, "b": 815, "r": 2065, "text": "A variety of traffic control systems currently exist"},
    //...
  ],
  "words": [
    {"l": 341, "t": 577, "b": 609, "r": 544, "text": "CHAPTER"},
    {"l": 583, "t": 578, "b": 607, "r": 604, "text": "4"},
    //...
  ],
  "chars": [
    {"t": 579, "b": 609, "l": 341, "r": 363, "text": "C"},
    {"t": 579, "b": 609, "l": 370, "r": 395, "text": "H"},
    //...
  ]
}
```
`sample_loader.py`:

```python
import torch

def sample_loader(raw: dict) -> dict:
    # Converts each box from (l, t, r, b) corners to (x, y, w, h).
    return dict(
        __key__=raw["__key__"],
        image=raw["jp2"],
        text="\n".join(line["text"] for line in raw["mp"]["lines"]),
        lines_boxes=torch.tensor(
            [
                (line["l"], line["t"], line["r"] - line["l"], line["b"] - line["t"])
                for line in raw["mp"]["lines"]
            ],
            dtype=torch.int64,
        ),
        lines_text=[line["text"] for line in raw["mp"]["lines"]],
        words_boxes=torch.tensor(
            [
                (word["l"], word["t"], word["r"] - word["l"], word["b"] - word["t"])
                for word in raw["mp"]["words"]
            ],
            dtype=torch.int64,
        ),
        words_text=[word["text"] for word in raw["mp"]["words"]],
        chars_boxes=torch.tensor(
            [
                (char["l"], char["t"], char["r"] - char["l"], char["b"] - char["t"])
                for char in raw["mp"]["chars"]
            ],
            dtype=torch.int64,
        ),
        chars_text=[char["text"] for char in raw["mp"]["chars"]],
    )

def part_filter(part: str) -> bool:
    # Only load the parts we actually need; skip the pre-rendered *.png files.
    return part in ("jp2", "mp")
```
For more information, please also read Custom Sample Loader.
## Dataset Format on Disk

The energon library supports loading large multi-modal datasets from disk. To be loadable, a dataset must comply with the format described in this section.

A valid energon dataset must contain an `.nv-meta` folder with certain files, as shown below.

```
my_dataset
├── .nv-meta
│   ├── dataset.yaml
│   ├── split.yaml
│   ├── .info.json
│   ├── index.sqlite
│   └── index.uuid
└── shards
    ├── shard_000.tar
    ├── shard_001.tar
    └── ...
```

Note that the `shards` folder is just an example. The shards and their folder can be named differently, but the `.nv-meta` structure is always the same.
### Files in `.nv-meta`

#### dataset.yaml (user editable)

The `dataset.yaml` contains the dataset definition, i.e. the dataset class to use as loader and optional decoders. If you want to create such a file, you should consider using the CLI preparation tool `energon prepare` instead of writing it by hand.

Here's an example:

```yaml
sample_type:
  __module__: megatron.energon
  __class__: CaptioningSample
field_map:
  image: jpg
  caption: txt
```
For a crude dataset, the `dataset.yaml` will simply be:

```yaml
__module__: megatron.energon
__class__: CrudeWebdataset
```

The `__class__` and `__module__` values help the library construct the correct object. The `field_map` specifies how the fields from each webdataset sample are mapped to the members of the sample dataclass.

In this example, the dataclass is:

```python
@dataclass
class CaptioningSample(Sample):
    image: torch.Tensor
    caption: str
```
#### split.yaml (user editable)

This file contains the splits (i.e. train, val, test), each with a list of its shards. It can also contain an "exclude list" to exclude certain samples or shards from training. Example:

```yaml
exclude: []
split_parts:
  train:
    - shards/shard_000.tar
    - shards/shard_001.tar
  val:
    - shards/shard_002.tar
  test:
    - shards/shard_003.tar
```

To exclude certain shards or samples, add them to the `exclude` list as follows:

```yaml
exclude:
  - shards/shard_004.tar
  - shards/shard_001.tar/000032
  - shards/shard_001.tar/000033
split_parts:
  ...
```

The above example excludes the entire shard 004 and two samples from shard 001.
#### .info.json (read-only)

The hidden info file is auto-generated and contains the list of all shards and the number of samples in each. Example:

```json
{
  "shard_counts": {
    "shards/000.tar": 1223,
    "shards/001.tar": 1420,
    "shards/002.tar": 1418,
    "shards/003.tar": 1358
  }
}
```

The order of the tar files is important, as it is used by the sqlite database below.
#### index.sqlite and index.uuid (read-only)

The sqlite database was introduced in Energon 7 and allows fully random access to samples and files by their names. This is a precondition for polylithic datasets and for the `energon mount` command.

Below is some detailed information for the interested reader. Note that the internal table structure can change in any release without notice.

The database contains an entry for each sample and each sample part, including their byte offsets and sizes in the tar files.

Example `samples` table:

| tar_file_id | sample_key | sample_index | byte_offset | byte_size |
|---|---|---|---|---|
| 0 | 00000 | 0 | 0 | 35840 |
| 0 | 00001 | 1 | 35840 | 35840 |
| 0 | 00002 | 2 | 71680 | 35840 |
| 0 | … | … | … | … |

The byte offsets describe the range spanning all the tar entries that are part of that sample, including the tar headers.
The corresponding example `sample_parts` table:

| tar_file_id | sample_index | part_name | content_byte_offset | content_byte_size |
|---|---|---|---|---|
| 0 | 0 | json | 1536 | 31 |
| 0 | 0 | png | 3584 | 30168 |
| 0 | 0 | txt | 35328 | 16 |
| 0 | 1 | json | 37376 | 31 |
| 0 | 1 | png | 39424 | 30168 |
| 0 | 1 | txt | 71168 | 16 |
| 0 | … | … | … | … |

The byte offsets in the `sample_parts` table refer to the byte ranges of the actual file contents and can be used to access the contents directly without parsing the tar headers.

Both tables can be joined over `tar_file_id` and `sample_index`. Note that `tar_file_id` refers to the list of tar files in the `.info.json` file.
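For illustration, such a join could be performed like this (a sketch only; since the internal schema may change between releases without notice, don't build tooling on top of it):

```python
import sqlite3

# Open the auto-generated index of a prepared dataset.
con = sqlite3.connect("my_dataset/.nv-meta/index.sqlite")

# List the parts of the first few samples with their content byte ranges.
rows = con.execute(
    """
    SELECT s.sample_key, p.part_name, p.content_byte_offset, p.content_byte_size
    FROM samples AS s
    JOIN sample_parts AS p
      ON p.tar_file_id = s.tar_file_id AND p.sample_index = s.sample_index
    ORDER BY s.tar_file_id, s.sample_index
    LIMIT 10
    """
).fetchall()
for sample_key, part_name, offset, size in rows:
    print(f"{sample_key}.{part_name}: {size} bytes at offset {offset}")
con.close()
```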