Types
BertSample
Bases: TypedDict
The type expected by NeMo/Megatron for a single dataset item.
Attributes:
Name | Type | Description |
---|---|---|
text | Tensor | The tokenized, masked input text. |
types | Tensor | The token type ids, if applicable. |
attention_mask | Tensor | A mask over all valid tokens, excluding padding. |
labels | Tensor | The true values of the masked tokens at each position covered by loss_mask. |
loss_mask | Tensor | The mask over the text indicating which tokens are masked and should be predicted. |
is_random | Tensor | ?? |
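
As a rough illustration of how the fields listed above fit together, the following minimal sketch builds one item by hand. It is not the library's class: `BertSampleSketch`, `make_sample`, and `mask_id` are hypothetical names, a plain list of integers stands in for `Tensor` so the example runs without PyTorch, and the values chosen for `types` and `is_random` are placeholders (the meaning of `is_random` is not documented here).

```python
from typing import List, TypedDict

# Stand-in for torch.Tensor so this sketch runs without PyTorch installed.
Tensor = List[int]


class BertSampleSketch(TypedDict):
    """Illustrative mirror of the documented fields (not the library's class)."""

    text: Tensor
    types: Tensor
    attention_mask: Tensor
    labels: Tensor
    loss_mask: Tensor
    is_random: Tensor


def make_sample(
    token_ids: List[int], masked_positions: List[int], mask_id: int
) -> BertSampleSketch:
    """Replace the chosen positions with mask_id and record what was hidden."""
    masked = set(masked_positions)
    n = len(token_ids)
    return BertSampleSketch(
        # Input text with the selected positions masked out.
        text=[mask_id if i in masked else t for i, t in enumerate(token_ids)],
        # Assumption: a single segment, so every token type id is 0.
        types=[0] * n,
        # No padding in this toy item, so every token is valid.
        attention_mask=[1] * n,
        # True token ids; only positions flagged in loss_mask are scored.
        labels=list(token_ids),
        # 1 where the token was masked and should be predicted.
        loss_mask=[1 if i in masked else 0 for i in range(n)],
        # Placeholder value: the meaning of this field is undocumented.
        is_random=[0],
    )


sample = make_sample([5, 8, 13, 21], masked_positions=[1, 3], mask_id=99)
```

Because a `TypedDict` is an ordinary `dict` at runtime, the result can be indexed by field name (for example `sample["loss_mask"]`), while a static type checker verifies that all six keys are present.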
Source code in bionemo/llm/data/types.py
Tokenizer
Bases: Protocol
Required attributes for a tokenizer provided to apply_bert_pretraining_mask.
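
This page does not list the required attributes, so the sketch below only illustrates how a `Protocol` expresses such a requirement structurally. `TokenizerSketch`, `ToyTokenizer`, and `describe`, as well as the attribute names `mask_token_id` and `all_special_ids`, are assumptions made for illustration and may not match the library's definition.

```python
from typing import List, Optional, Protocol, runtime_checkable


@runtime_checkable
class TokenizerSketch(Protocol):
    """Hypothetical protocol; the attribute names are assumptions."""

    @property
    def mask_token_id(self) -> Optional[int]: ...

    @property
    def all_special_ids(self) -> List[int]: ...


class ToyTokenizer:
    """Satisfies the protocol structurally, without inheriting from it."""

    mask_token_id = 99
    all_special_ids = [0, 1, 99]


def describe(tokenizer: TokenizerSketch) -> str:
    """Accept any object that exposes the two attributes above."""
    special = sorted(tokenizer.all_special_ids)
    return f"mask={tokenizer.mask_token_id}, special={special}"


summary = describe(ToyTokenizer())
```

Any tokenizer class that exposes the expected attributes can be passed in without subclassing, which is the main reason to type a parameter against a `Protocol` rather than a concrete base class.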
Source code in bionemo/llm/data/types.py