Multi epoch dataset
EpochIndex
Bases: NamedTuple
A tuple that contains both the current epoch and index for multi-epoch training.
Source code in bionemo/core/data/multi_epoch_dataset.py
epoch: int
instance-attribute
An integer representing the current epoch.
idx: int
instance-attribute
An integer representing the index within the current epoch.
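As a minimal sketch of the tuple described above, the following stand-in mirrors the two documented fields. It is re-declared locally for illustration rather than imported from bionemo, so everything beyond the field names and types is an assumption.

```python
from typing import NamedTuple


class EpochIndex(NamedTuple):
    """Local stand-in mirroring the documented fields (not imported from bionemo)."""

    epoch: int  # the current epoch
    idx: int  # the index within the current epoch


index = EpochIndex(epoch=2, idx=17)

# A NamedTuple supports both attribute access and ordinary tuple unpacking.
epoch, idx = index
```

Because it is a plain tuple, an EpochIndex is hashable and compares equal to `(2, 17)`, which makes it convenient as a dictionary key or as a seed source for per-sample randomness.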
IdentityMultiEpochDatasetWrapper
dataclass
Bases: MultiEpochDatasetWrapper[T, T]
An implementation of the MultiEpochDatasetWrapper that does not apply any transformations.
Source code in bionemo/core/data/multi_epoch_dataset.py
apply_transform(sample, index)
Return the sample as is.
Source code in bionemo/core/data/multi_epoch_dataset.py
MultiEpochDataset
Bases: Protocol[T_co]
A protocol for datasets for multi-epoch training in Megatron-LM.
Dataset determinism in Megatron-LM
In Megatron training, the sampler and dataset objects are used to ensure consistent data loading across
model-parallel ranks. For datasets to work with Megatron training, they must return exactly the same data for
every call to __getitem__ with the same index.
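One way to satisfy this determinism requirement while still using randomness is to derive the random generator from the index itself. The sketch below assumes a hypothetical `NoisyNumbers` dataset and a locally declared `EpochIndex` stand-in; the string-based seeding scheme is an illustrative choice, not something prescribed by the library.

```python
import random
from typing import NamedTuple


class EpochIndex(NamedTuple):
    """Local stand-in for the documented tuple."""

    epoch: int
    idx: int


class NoisyNumbers:
    """Hypothetical dataset: the stored value plus epoch-dependent noise."""

    def __init__(self, values, seed=0):
        self.values = list(values)
        self.seed = seed

    def __len__(self):
        return len(self.values)

    def __getitem__(self, index: EpochIndex) -> float:
        # Build a fresh RNG from (seed, epoch, idx) so that every call with
        # the same index returns the same value, on any rank.
        rng = random.Random(f"{self.seed}-{index.epoch}-{index.idx}")
        return self.values[index.idx] + rng.random()


ds = NoisyNumbers([1.0, 2.0, 3.0])
first = ds[EpochIndex(epoch=0, idx=1)]
second = ds[EpochIndex(epoch=0, idx=1)]
```

Here `first` and `second` are identical because no state is shared between calls. A dataset that instead drew from a global or instance-level generator would return different values depending on call order, which is what the requirement rules out.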
Source code in bionemo/core/data/multi_epoch_dataset.py
MultiEpochDatasetResampler
dataclass
Bases: Dataset[T_co]
A dataset wrapper class that converts the sequential sampling from Megatron-LM to epoch-based sampling.
Either num_epochs or num_samples should be provided. If neither is provided, the dataset will use a single
epoch. If num_epochs is given, the resampled dataset will have len(dataset) * num_epochs samples. If
num_samples is given, the resampled dataset will have num_samples samples. For num_samples, the dataset will
be repeated for multiple epochs until the desired number of samples is reached (with the final epoch being
truncated).
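The length rules above, and the conversion from a sequential index to an (epoch, index) pair, can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `resampled_length` and `to_epoch_index` are hypothetical helpers, raising when both arguments are given is an assumption, and the per-epoch seed `seed + epoch` is an invented scheme. The documented class pre-shuffles each epoch in `__post_init__`, whereas this sketch recomputes the permutation on each call for brevity.

```python
import math
import random
from typing import NamedTuple, Optional


class EpochIndex(NamedTuple):
    """Local stand-in for the documented tuple."""

    epoch: int
    idx: int


def resampled_length(
    dataset_len: int,
    num_epochs: Optional[int] = None,
    num_samples: Optional[int] = None,
) -> int:
    """Length of the resampled dataset under the documented rules."""
    if num_epochs is not None and num_samples is not None:
        # Assumption: treating "both given" as an error.
        raise ValueError("Provide at most one of num_epochs and num_samples.")
    if num_samples is not None:
        return num_samples
    # Neither given: fall back to a single epoch.
    return dataset_len * (num_epochs if num_epochs is not None else 1)


def to_epoch_index(i: int, dataset_len: int, seed: int = 42, shuffle: bool = True) -> EpochIndex:
    """Map a sequential sample index to an (epoch, idx) pair."""
    epoch, position = divmod(i, dataset_len)
    if not shuffle:
        return EpochIndex(epoch, position)
    # Assumption: one permutation per epoch, seeded by seed + epoch.
    order = list(range(dataset_len))
    random.Random(seed + epoch).shuffle(order)
    return EpochIndex(epoch, order[position])


# 25 samples over a 10-item dataset span 3 epochs, the last one truncated to 5.
epochs_needed = math.ceil(25 / 10)
```

With a dataset of 10 items, `num_epochs=3` gives 30 samples, while `num_samples=25` gives 25 samples spread over three epochs. Without shuffling, sequential index 23 maps to epoch 2, index 3.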
Source code in bionemo/core/data/multi_epoch_dataset.py
dataset: MultiEpochDataset[T_co]
instance-attribute
The dataset to resample. Must support indexing with an EpochIndex.
num_epochs: int | None = None
class-attribute
instance-attribute
The total number of epochs. The length of the resampled dataset will be len(dataset) * num_epochs.
num_samples: int | None = None
class-attribute
instance-attribute
The total number of samples to draw.
The number of epochs will be determined by the number of samples and the length of the dataset.
seed: int = 42
class-attribute
instance-attribute
A random seed for reproducibility.
shuffle: bool = True
class-attribute
instance-attribute
Whether to shuffle the samples in the dataset each epoch.
__getitem__(index)
Get the sample at the given index.
Source code in bionemo/core/data/multi_epoch_dataset.py
__len__()
Return the length of the resampled dataset.
Source code in bionemo/core/data/multi_epoch_dataset.py
__post_init__()
Pre-shuffle each epoch's samples.
Source code in bionemo/core/data/multi_epoch_dataset.py
MultiEpochDatasetWrapper
dataclass
Bases: Dataset[U_co], Generic[T, U_co], ABC
A wrapper to convert a standard PyTorch dataset into one that supports multi-epoch Megatron training.
The underlying dataset's __getitem__ method must be deterministic, i.e. it must return the same data for the
same index every time it is called. If there are any non-deterministic operations, they should be moved to the
apply_transform method. This method must also be deterministic for every (epoch, index) pair, but it can use
the epoch to implement data augmentation each epoch.
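The division of labor described above can be illustrated with a simplified stand-in. `WrapperSketch` below is not the bionemo class; it assumes that `__getitem__` looks up the sample by its integer index and then passes the full (epoch, index) pair to `apply_transform`. `JitterWrapper` and its string-based seeding are hypothetical.

```python
import random
from abc import ABC, abstractmethod
from typing import NamedTuple, Sequence


class EpochIndex(NamedTuple):
    """Local stand-in for the documented tuple."""

    epoch: int
    idx: int


class WrapperSketch(ABC):
    """Simplified stand-in for the documented wrapper (not the bionemo class)."""

    def __init__(self, dataset: Sequence[float]):
        self.dataset = dataset  # deterministic, integer-indexed

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, index: EpochIndex) -> float:
        # Deterministic lookup first, then the epoch-aware transform.
        return self.apply_transform(self.dataset[index.idx], index)

    @abstractmethod
    def apply_transform(self, sample: float, index: EpochIndex) -> float:
        """Transform the sample for the given (epoch, idx) pair."""


class JitterWrapper(WrapperSketch):
    """Hypothetical augmentation: small noise that changes from epoch to epoch."""

    def apply_transform(self, sample: float, index: EpochIndex) -> float:
        # Seeding from (epoch, idx) keeps the result reproducible per pair.
        rng = random.Random(f"{index.epoch}-{index.idx}")
        return sample + rng.uniform(-0.1, 0.1)


wrapped = JitterWrapper([1.0, 2.0])
a = wrapped[EpochIndex(epoch=0, idx=1)]
b = wrapped[EpochIndex(epoch=0, idx=1)]
```

The augmentation varies between epochs because the epoch is part of the seed, yet `a` and `b` match because the same (epoch, index) pair always produces the same generator state.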
Source code in bionemo/core/data/multi_epoch_dataset.py
dataset: SizedDataset[T]
instance-attribute
A deterministic dataset that supports indexing with an integer index.
__getitem__(index)
Get the sample at the given epoch and index.
Source code in bionemo/core/data/multi_epoch_dataset.py
__len__()
Return the length of the dataset.
Source code in bionemo/core/data/multi_epoch_dataset.py
apply_transform(sample, index)
abstractmethod
Apply any transformations to the sample for the given epoch.
Source code in bionemo/core/data/multi_epoch_dataset.py
SizedDataset
Bases: Protocol[T_co]
A protocol for integer-indexed datasets that have a fixed length.
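A structural protocol of this kind might look like the sketch below. `SizedDatasetSketch` is a hypothetical name, and the `runtime_checkable` decorator is added only so the example can demonstrate an `isinstance` check; the source does not say whether the real protocol is runtime-checkable.

```python
from typing import Protocol, TypeVar, runtime_checkable

T_co = TypeVar("T_co", covariant=True)


@runtime_checkable
class SizedDatasetSketch(Protocol[T_co]):
    """Anything with integer indexing and a fixed length matches structurally."""

    def __getitem__(self, index: int) -> T_co: ...

    def __len__(self) -> int: ...


# A plain list already provides both methods, so it satisfies the protocol.
is_sized = isinstance([1, 2, 3], SizedDatasetSketch)

# A generator has neither method, so it does not.
is_generator_sized = isinstance((x for x in range(3)), SizedDatasetSketch)
```

Because the check is structural, no inheritance is required: any class that defines both methods qualifies. Note that a runtime `isinstance` check only verifies that the methods exist, not their signatures or return types.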
Source code in bionemo/core/data/multi_epoch_dataset.py