Megatron dataset compatibility
DatasetDistributedNondeterministic
Bases: AssertionError
Datasets are not deterministic across distributed ranks.
Source code in bionemo/testing/megatron_dataset_compatibility.py
DatasetLocallyNondeterministic
Bases: AssertionError
Datasets are not locally deterministic.
Source code in bionemo/testing/megatron_dataset_compatibility.py
assert_dataset_compatible_with_megatron(dataset, index=0, assert_elements_equal=assert_dict_tensors_approx_equal)
Make sure that a dataset passes some basic sanity checks for Megatron determinism constraints.
Constraints tested
- `dataset[i]` returns the same element regardless of device
- `dataset[i]` doesn't make calls to known problematic randomization procedures (currently `torch.manual_seed`).
As more constraints are discovered, they should be added to this test.
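The "same element every time" constraint can be illustrated with a minimal pure-Python sketch. Note this is only a conceptual analog: the real check in `bionemo.testing` also compares results across devices and guards against `torch.manual_seed` calls, and the names `ToyDataset` and `check_locally_deterministic` below are hypothetical, not part of the library.

```python
import random


class ToyDataset:
    """Hypothetical map-style dataset whose elements depend only on the index."""

    def __getitem__(self, index: int) -> list[float]:
        # Seed a *local* RNG from the index, so repeated lookups of the same
        # index always return the same element (no hidden global state).
        rng = random.Random(index)
        return [rng.random() for _ in range(4)]


def check_locally_deterministic(dataset, index: int = 0) -> None:
    """Sketch of the local-determinism constraint: dataset[i] twice, same result."""
    first = dataset[index]
    second = dataset[index]
    assert first == second, "dataset[i] must return the same element every call"


check_locally_deterministic(ToyDataset(), index=0)
```

A dataset that instead drew from the global `random` module (or called `torch.manual_seed`) inside `__getitem__` would fail this kind of check, because the element returned would depend on call order rather than on the index alone.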
Source code in bionemo/testing/megatron_dataset_compatibility.py
assert_dataset_elements_not_equal(dataset, index_a=0, index_b=1, assert_elements_equal=assert_dict_tensors_approx_equal)
Test the case where two indices return different elements on datasets that employ randomness, like masking.
NOTE: if your dataset does not employ any randomness, just use the assert_dataset_compatible_with_megatron test and skip this one. This test is for the case where a dataset applies a random transform to elements as a function of index, and you want to verify that two different indices mapping to the same underlying object actually produce different results. This test also runs assert_dataset_compatible_with_megatron behind the scenes, so you do not need to run both.
With epoch upsampling approaches, some underlying index, say index=0, will be requested multiple times by a wrapping dataset object. For example, if you have a dataset of length 1 and wrap it in an up-sampler of length 2 that maps both index 0 and index 1 to underlying index 0, and that wrapper applies randomness to the result, then we expect a different mask for each call even though the underlying object is the same. Again, this test only applies to datasets that employ randomness.

Another approach some of our datasets take is to use a special index that captures both the underlying index and the epoch index. This tuple of indices is used internally to seed the mask. With that kind of dataset, index_a could be (epoch=0, idx=0) and index_b could be (epoch=1, idx=0), for example; we expect those to return different random features.
To use this test effectively, identify two indices that return the same underlying object but to which you expect the dataset to apply different randomization.
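To make the (epoch, index)-seeded pattern concrete, here is a hypothetical pure-Python sketch (no torch, and `MaskedUpsampledDataset` is an illustrative name, not a bionemo class) of a dataset whose masking is seeded from the full key, so the same key is deterministic while different epochs yield different masks for the same underlying index:

```python
import random


class MaskedUpsampledDataset:
    """Hypothetical dataset: one underlying element list, masked per (epoch, idx)."""

    def __init__(self, base: list[float]):
        self.base = base

    def __getitem__(self, key: tuple[int, int]) -> list[float]:
        epoch, idx = key
        # Seed the RNG from the full (epoch, idx) key, combined into one int.
        rng = random.Random(epoch * 100003 + idx)
        # Replace roughly half the entries with a random "mask token"; both the
        # masked positions and the token values depend only on the seed.
        return [rng.random() if rng.random() < 0.5 else x for x in self.base]


ds = MaskedUpsampledDataset([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
# Deterministic per key: the property assert_dataset_compatible_with_megatron checks.
assert ds[(0, 0)] == ds[(0, 0)]
# Different epochs give different masks for the same underlying index:
# the property assert_dataset_elements_not_equal checks.
assert ds[(0, 0)] != ds[(1, 0)]
```

The design point is that all randomness flows from the index key itself, never from global RNG state, so upsampled epochs see fresh masks while every individual lookup stays reproducible.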
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`dataset` | `Dataset[TensorCollectionOrTensor]` | Dataset object with randomness (e.g. masking) to test. | required |
`index_a` | `Index` | Index for some element. Defaults to 0. | `0` |
`index_b` | `Index` | Index for a different element. Defaults to 1. | `1` |
`assert_elements_equal` | `Callable[[TensorCollectionOrTensor, TensorCollectionOrTensor], None]` | Function to compare two returned batch elements. | `assert_dict_tensors_approx_equal` |
Source code in bionemo/testing/megatron_dataset_compatibility.py
assert_dict_tensors_approx_equal(actual, expected)
Assert that two dicts of tensors are approximately equal.
Source code in bionemo/testing/megatron_dataset_compatibility.py
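The shape of such a comparison helper can be sketched in pure Python. This is a hypothetical stand-in, not the bionemo implementation: the real helper operates on torch tensors, whereas `assert_dicts_approx_equal` below compares dicts of float lists with `math.isclose`.

```python
import math


def assert_dicts_approx_equal(
    actual: dict[str, list[float]],
    expected: dict[str, list[float]],
    rel_tol: float = 1e-6,
) -> None:
    """Hypothetical sketch: assert two dicts of float lists match elementwise."""
    assert actual.keys() == expected.keys(), "key sets differ"
    for key in actual:
        a, e = actual[key], expected[key]
        assert len(a) == len(e), f"length mismatch for {key!r}"
        for x, y in zip(a, e):
            assert math.isclose(x, y, rel_tol=rel_tol), f"values differ for {key!r}"


# Tiny differences within tolerance pass; mismatched keys or values raise.
assert_dicts_approx_equal(
    {"input_ids": [1.0, 2.0], "mask": [0.0, 1.0]},
    {"input_ids": [1.0, 2.0 + 1e-9], "mask": [0.0, 1.0]},
)
```

An approximate (rather than exact) comparison is the right default here because the same dataset element computed on different devices can differ by floating-point rounding even when it is logically identical.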