Glossary
Batch Grouping
Allows you to programmatically decide which samples (out of a buffer) will be put into one batch. See Grouping.
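The idea can be illustrated with a minimal, library-independent sketch. It does not use Energon's grouping API; the function `group_into_batches`, the `orientation` key function, and the tuple-shaped samples are all hypothetical names chosen for illustration. The sketch assumes a buffer of samples is already available and that samples sharing a key should end up in the same batch.

```python
from collections import defaultdict


def group_into_batches(buffer, key_fn, batch_size):
    """Group buffered samples by key, then cut each group into batches.

    Hypothetical helper for illustration only; not part of Energon.
    """
    groups = defaultdict(list)
    for sample in buffer:
        groups[key_fn(sample)].append(sample)

    batches = []
    for key, members in groups.items():
        # Every batch contains only samples that share the same key.
        for start in range(0, len(members), batch_size):
            batches.append((key, members[start:start + batch_size]))
    return batches


def orientation(sample):
    """Hypothetical grouping key: image orientation."""
    _, width, height = sample
    return "landscape" if width >= height else "portrait"


# Hypothetical buffer of (name, width, height) samples.
buffer = [
    ("a", 640, 480),
    ("b", 480, 640),
    ("c", 800, 600),
    ("d", 600, 800),
    ("e", 1024, 768),
]

batches = group_into_batches(buffer, orientation, batch_size=2)
for key, members in batches:
    print(key, [name for name, _, _ in members])
```

Grouping by a property such as orientation or resolution keeps each batch homogeneous, which is the typical reason to decide batch membership programmatically.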
Cooking
Used to transform crude (raw) samples into a populated instance of a sample data class.
Crude Dataset
An Energon dataset that does not yield a readily populated sample (an instance of a dataclass), but a raw dict.
A cooker is used to handle this transformation in the user’s custom task encoder. See Crude Datasets and Auxiliary Data.
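As a rough sketch of what cooking does conceptually, the following plain-Python example turns a raw dict into a populated dataclass instance. It does not use Energon's cooker API. The `CaptionSample` dataclass, the `cook_caption` function, and the dict layout (a `__key__` entry plus one entry per file extension, holding raw bytes) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CaptionSample:
    """Hypothetical sample dataclass; not Energon's own sample type."""
    key: str
    image: bytes
    caption: str


def cook_caption(raw: dict) -> CaptionSample:
    """Turn a crude (raw) dict into a populated CaptionSample.

    Assumes the raw dict holds the sample key under "__key__" and the
    undecoded file contents under their extensions ("jpg", "txt").
    """
    return CaptionSample(
        key=raw["__key__"],
        image=raw["jpg"],
        caption=raw["txt"].decode("utf-8").strip(),
    )


# Hypothetical crude sample with placeholder image bytes.
raw_sample = {
    "__key__": "004",
    "jpg": b"\xff\xd8\xff",
    "txt": b"A dog on a beach\n",
}

cooked = cook_caption(raw_sample)
print(cooked.key, cooked.caption)
```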
Grouping
See “Batch Grouping”.
Monolithic Dataset
The simple form of putting all your text and media data into the same WebDataset (see Steps to Create a Monolithic Dataset).
The other option is to use a “Polylithic Dataset”.
Packing
In Energon, “packing” means “sequence packing”. See “Sequence Packing” below.
Polylithic Dataset
Used to split the text-based data from the (usually larger) media data.
Each modality will be put in its own dataset and one dataset can refer to the other by file names.
For more information, see Steps to Create a Polylithic Dataset.
Sample
In Energon, by sample we typically mean an instance of Sample (e.g. one of its subclasses).
Sometimes we also call the source files that are inside the WebDataset and are used to create that dataclass instance a “sample”.
For example, inside one tar file there may be 004.jpg and 004.txt (image and label), together forming a captioning sample.
The Sample dataclass has several mandatory and optional fields that describe one piece of training data for your ML workload. Typically it contains the input data to the model and the label data.
Sample Part
A “sample part” is one of the components of a sample inside the WebDataset tar file. A captioning sample may be created from 004.jpg and 004.txt, and each of those files is a sample part. This sample with the key 004 has two parts, txt and jpg.
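The relationship between file names, sample keys, and sample parts can be sketched in a few lines of plain Python. This is an illustration, not Energon or WebDataset code: the helper names `split_member_name` and `group_parts` are hypothetical, and the sketch assumes the common convention that the key is everything before the first dot of the file name and the part is everything after it.

```python
import posixpath
from collections import defaultdict


def split_member_name(name):
    """Split a tar member name into (sample key, part name).

    Assumes the key ends at the first dot of the base file name.
    """
    dirname, basename = posixpath.split(name)
    stem, _, part = basename.partition(".")
    key = posixpath.join(dirname, stem) if dirname else stem
    return key, part


def group_parts(member_names):
    """Collect the part names belonging to each sample key."""
    samples = defaultdict(list)
    for name in member_names:
        key, part = split_member_name(name)
        samples[key].append(part)
    return dict(samples)


# Hypothetical listing of the files inside one tar shard.
names = ["003.jpg", "003.txt", "004.jpg", "004.txt"]

parts_by_key = group_parts(names)
print(parts_by_key["004"])
```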
Sequence Packing
A method to better utilize the available context length / sequence length of a model and reduce padding.
Explained in Packing.
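To make the padding argument concrete, here is a minimal sketch of one simple packing strategy, greedy first-fit over sequences sorted by decreasing length. It is not Energon's packing implementation; the function `pack_first_fit` and the example lengths are hypothetical and only show how several short sequences can share one context window.

```python
def pack_first_fit(lengths, max_len):
    """Greedy first-fit-decreasing packing of sequence lengths.

    Returns a list of packs; each pack is a list of lengths whose sum
    does not exceed max_len. Illustrative only.
    """
    packs = []
    for length in sorted(lengths, reverse=True):
        if length > max_len:
            raise ValueError(f"sequence of length {length} exceeds {max_len}")
        for pack in packs:
            # Put the sequence into the first pack that still has room.
            if sum(pack) + length <= max_len:
                pack.append(length)
                break
        else:
            # No existing pack has room, so start a new one.
            packs.append([length])
    return packs


# Hypothetical token counts of six samples and a context length of 1024.
lengths = [700, 300, 512, 200, 900, 100]
max_len = 1024

packed = pack_first_fit(lengths, max_len)
padding_unpacked = len(lengths) * max_len - sum(lengths)
padding_packed = len(packed) * max_len - sum(lengths)
print(packed, padding_unpacked, padding_packed)
```

With these example lengths, the six samples fit into three packs instead of six individually padded sequences, so far fewer padding tokens are needed.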
Task Encoder
An Energon-specific concept: The TaskEncoder is a user-defined class to customize the steps of the data flow pipeline.
See Data Flow and Task Encoder.
WebDataset
A file format for storing your dataset on disk, based on TAR files. See https://github.com/webdataset/webdataset.
Energon’s dataset format builds on WebDataset and extends it with additional files, see Dataset Format on Disk.