Dataset
AMPLIFYMaskedResidueDataset
Bases: Dataset
Dataset class for AMPLIFY pretraining that implements sampling of UR100P sequences.
Source code in bionemo/amplify/dataset.py
__getitem__(index)
Deterministically masks and returns a protein sequence from the dataset.
This function is largely copied from the ESM2 dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | EpochIndex | The current epoch and the index of the cluster to sample. | required |
Returns:
Type | Description |
---|---|
BertSample | A (possibly truncated) masked protein sequence with CLS and EOS tokens and associated mask fields. |
Source code in bionemo/amplify/dataset.py
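The deterministic behavior described for `__getitem__` can be illustrated without the library. The snippet below is a minimal sketch, not the code in bionemo/amplify/dataset.py: the `EpochIndex` stand-in and the helpers `rng_for` and `sample_mask_positions` are hypothetical names, and deriving the generator from the (seed, epoch, idx) triple is an assumption about how the seed is mixed with the index.

```python
import random
from typing import List, NamedTuple


class EpochIndex(NamedTuple):
    """Hypothetical stand-in: the current epoch and the index to sample."""

    epoch: int
    idx: int


def rng_for(seed: int, index: EpochIndex) -> random.Random:
    """Return a generator whose state depends only on (seed, epoch, idx)."""
    # A string seed is hashed deterministically by the random module, so the
    # same triple yields the same stream in every process.
    return random.Random(f"{seed}:{index.epoch}:{index.idx}")


def sample_mask_positions(length: int, mask_prob: float, rng: random.Random) -> List[bool]:
    """Flag each position for masking independently with probability mask_prob."""
    return [rng.random() < mask_prob for _ in range(length)]


# Asking for the same (seed, epoch, idx) twice gives the same mask.
first = sample_mask_positions(20, 0.15, rng_for(42, EpochIndex(epoch=0, idx=7)))
again = sample_mask_positions(20, 0.15, rng_for(42, EpochIndex(epoch=0, idx=7)))
print(first == again)  # True
```

Building a fresh generator per lookup, rather than sharing one across calls, is what makes the result independent of the order in which samples are requested.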
__init__(hf_dataset, seed=42, max_seq_length=512, mask_prob=0.15, mask_token_prob=0.8, mask_random_prob=0.1, random_mask_strategy=RandomMaskStrategy.AMINO_ACIDS_ONLY, tokenizer=None)
Initializes the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
hf_dataset | HFAmplifyDataset | HuggingFace dataset containing AMPLIFY protein sequences. This should likely be created via a call like | required |
total_samples | | Total number of samples to draw from the dataset. | required |
seed | int | Random seed for reproducibility. This seed is mixed with the index of the sample to retrieve to ensure that `__getitem__` is deterministic, but can be random across different runs. If None, a random seed is generated. | 42 |
max_seq_length | int | Crop long sequences to a maximum of this length, including BOS and EOS tokens. | 512 |
mask_prob | float | The overall probability a token is included in the loss function. Defaults to 0.15. | 0.15 |
mask_token_prob | float | Proportion of masked tokens that get assigned the mask token. | 0.8 |
mask_random_prob | float | Proportion of tokens that get assigned a random natural amino acid. Defaults to 0.1. | 0.1 |
random_mask_strategy | RandomMaskStrategy | Whether to replace random masked tokens with all tokens or amino acids only. Defaults to RandomMaskStrategy.AMINO_ACIDS_ONLY. | AMINO_ACIDS_ONLY |
tokenizer | BioNeMoAMPLIFYTokenizer \| None | The input AMPLIFY tokenizer. Defaults to the standard AMPLIFY tokenizer. | None |
Source code in bionemo/amplify/dataset.py
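How `mask_prob`, `mask_token_prob`, and `mask_random_prob` interact can be sketched in plain Python. This is an illustrative sketch of BERT-style masking under stated assumptions, not the library's implementation: `mask_sequence`, the `MASK` placeholder string, and the `AMINO_ACIDS` alphabet are hypothetical, positions are selected independently, and both proportions are assumed to apply to the tokens already selected for the loss, with the remainder left unchanged.

```python
import random
from typing import List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "<mask>"  # placeholder string; the real tokenizer defines its own mask token


def mask_sequence(
    seq: str,
    rng: random.Random,
    mask_prob: float = 0.15,
    mask_token_prob: float = 0.8,
    mask_random_prob: float = 0.1,
) -> Tuple[List[str], List[bool]]:
    """Return (tokens, loss_mask) after masking a sequence of residues."""
    tokens = list(seq)
    loss_mask = [False] * len(tokens)
    for i in range(len(tokens)):
        # Step 1: decide whether this position contributes to the loss.
        if rng.random() >= mask_prob:
            continue
        loss_mask[i] = True
        # Step 2: decide how the selected position is corrupted.
        roll = rng.random()
        if roll < mask_token_prob:
            tokens[i] = MASK
        elif roll < mask_token_prob + mask_random_prob:
            tokens[i] = rng.choice(AMINO_ACIDS)
        # Otherwise the original residue is kept but still scored.
    return tokens, loss_mask


tokens, loss_mask = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ", random.Random(0))
```

With the defaults, roughly 15% of positions enter the loss; of those, about 80% become the mask placeholder, about 10% become a random amino acid, and the remaining 10% keep their original residue.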
__len__()
Returns the total number of sequences in the dataset.
Source code in bionemo/amplify/dataset.py
HFAmplifyDataset
Bases: Protocol
Protocol for HuggingFace datasets containing AMPLIFY protein sequences.
Source code in bionemo/amplify/dataset.py
HFDatasetRow
Bases: TypedDict
TypedDict for HuggingFace dataset rows.
Source code in bionemo/amplify/dataset.py
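The relationship between a `Protocol` describing a dataset and a `TypedDict` describing its rows can be sketched as follows. The field layout of `HFDatasetRow` and the methods required by `HFAmplifyDataset` are not listed on this page, so everything here is an assumption for illustration: `Row` with a single `sequence` field, `SequenceDataset`, `InMemoryDataset`, and `longest_sequence` are hypothetical names.

```python
from typing import List, Protocol, TypedDict


class Row(TypedDict):
    """Hypothetical row layout with a single protein sequence field."""

    sequence: str


class SequenceDataset(Protocol):
    """Structural type: anything indexable by int that reports a length."""

    def __getitem__(self, index: int) -> Row: ...

    def __len__(self) -> int: ...


class InMemoryDataset:
    """Satisfies SequenceDataset without inheriting from it."""

    def __init__(self, sequences: List[str]) -> None:
        self._sequences = sequences

    def __getitem__(self, index: int) -> Row:
        return {"sequence": self._sequences[index]}

    def __len__(self) -> int:
        return len(self._sequences)


def longest_sequence(dataset: SequenceDataset) -> str:
    """Return the longest sequence in any dataset matching the protocol."""
    return max((dataset[i]["sequence"] for i in range(len(dataset))), key=len)


toy = InMemoryDataset(["MKT", "MKTAYIAK", "MKTAY"])
print(longest_sequence(toy))  # MKTAYIAK
```

Typing the dataset as a protocol lets a small in-memory stand-in like this replace a full HuggingFace dataset in tests, since only the shape of the object is checked.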