Masking
BertMaskConfig
dataclass
Configuration for masking tokens in a BERT-style model.
Attributes:
Name | Type | Description |
---|---|---|
mask_prob | float | Probability of masking a token. |
mask_token_prob | float | Probability of replacing a masked token with the mask token. |
random_token_prob | float | Probability of replacing a masked token with a random token. |
Source code in bionemo/llm/data/masking.py
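A minimal construction sketch, assuming the dataclass can be instantiated directly from the three documented fields (the real class may take additional arguments, and the values shown are the conventional BERT 80/10/10 split, not necessarily the library defaults):

```python
from bionemo.llm.data.masking import BertMaskConfig

# Conventional BERT-style masking: select 15% of tokens; of those, replace
# 80% with the mask token and 10% with a random token, leaving 10% unchanged.
# NOTE: assumes only the three documented fields are required.
config = BertMaskConfig(
    mask_prob=0.15,
    mask_token_prob=0.8,
    random_token_prob=0.1,
)
```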
__post_init__()
Check that the sum of mask_token_prob and random_token_prob is less than or equal to 1.0.
Raises:
Type | Description |
---|---|
ValueError | If the sum of mask_token_prob and random_token_prob is greater than 1.0. |
Source code in bionemo/llm/data/masking.py
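A sketch of the constraint this check enforces; the constructor call and the exact error message are assumptions:

```python
from bionemo.llm.data.masking import BertMaskConfig

try:
    # Invalid: 0.9 + 0.2 > 1.0, so __post_init__ should raise ValueError.
    BertMaskConfig(mask_prob=0.15, mask_token_prob=0.9, random_token_prob=0.2)
except ValueError as err:
    print(f"rejected config: {err}")
```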
add_cls_and_eos_tokens(sequence, labels, loss_mask, cls_token=None, eos_token=None)
Prepends the CLS token and appends the EOS token to the masked sequence, updating the loss mask and labels.
These tokens should never be masked, so this step is performed after masking.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sequence | Tensor | The input (likely masked) sequence. | required |
labels | Tensor | The true values of the input sequence at the mask positions. | required |
loss_mask | Tensor | A boolean tensor indicating which tokens should be included in the loss. | required |
cls_token | int or None | The token to use for the CLS token. If None, no CLS token is added. | None |
eos_token | int or None | The token to use for the EOS token. If None, no EOS token is added. | None |
Returns:
Type | Description |
---|---|
tuple[Tensor, Tensor, Tensor] | The same input tensors with the CLS and EOS tokens added, and the labels and loss_mask updated accordingly. |
Source code in bionemo/llm/data/masking.py
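A usage sketch under stated assumptions: the token IDs (CLS_ID, EOS_ID, MASK_ID) are placeholders rather than the library's values, and the -1 convention for labels at unmasked positions is an assumption:

```python
import torch

from bionemo.llm.data.masking import add_cls_and_eos_tokens

CLS_ID, EOS_ID, MASK_ID = 0, 2, 32  # placeholder IDs; use your tokenizer's values

masked_sequence = torch.tensor([5, MASK_ID, 7, MASK_ID, 9])
labels = torch.tensor([-1, 11, -1, 14, -1])  # true tokens at masked positions (assumed -1 elsewhere)
loss_mask = torch.tensor([False, True, False, True, False])

masked_sequence, labels, loss_mask = add_cls_and_eos_tokens(
    masked_sequence, labels, loss_mask, cls_token=CLS_ID, eos_token=EOS_ID
)
# The sequence grows by two tokens, and per the description above the new
# CLS/EOS positions are excluded from the loss via the updated loss_mask.
```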
apply_bert_pretraining_mask(tokenized_sequence, random_seed, mask_config)
Applies the pretraining mask to a tokenized sequence.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokenized_sequence | Tensor | Tokenized protein sequence. | required |
random_seed | int | Random seed for reproducibility. | required |
mask_config | BertMaskConfig | Configuration for masking tokens in a BERT-style model. | required |
Returns:
Name | Type | Description |
---|---|---|
masked_sequence | Tensor | The tokenized sequence with some tokens masked. |
labels | Tensor | A tensor the same shape as masked_sequence containing the true values of the input sequence at the masked positions. |
loss_mask | Tensor | A boolean tensor the same shape as masked_sequence indicating which tokens should be included in the loss. |
Source code in bionemo/llm/data/masking.py
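A usage sketch, again assuming BertMaskConfig can be built from the three documented fields alone and using toy token IDs:

```python
import torch

from bionemo.llm.data.masking import BertMaskConfig, apply_bert_pretraining_mask

# Assumed constructor; the real config may require additional arguments.
config = BertMaskConfig(mask_prob=0.15, mask_token_prob=0.8, random_token_prob=0.1)

tokenized_sequence = torch.tensor([5, 11, 7, 14, 9, 21, 6])  # toy token IDs
masked_sequence, labels, loss_mask = apply_bert_pretraining_mask(
    tokenized_sequence, random_seed=42, mask_config=config
)

# All three outputs share the input's shape; only the positions selected
# by loss_mask contribute to the pretraining loss.
assert masked_sequence.shape == tokenized_sequence.shape
assert loss_mask.dtype == torch.bool
```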