Types

BertSample

Bases: TypedDict

The type expected by NeMo/Megatron for a single dataset item.

Attributes:

Name            Type    Description
text            Tensor  The tokenized, masked input text.
types           Tensor  The token type ids, if applicable.
attention_mask  Tensor  A mask over all valid tokens, excluding padding.
labels          Tensor  The true values of the masked tokens at each position covered by loss_mask.
loss_mask       Tensor  The mask over the text indicating which tokens are masked and should be predicted.
is_random       Tensor  ??

Source code in bionemo/llm/data/types.py
from typing import TypedDict

from torch import Tensor


class BertSample(TypedDict):
    """The type expected by NeMo/Megatron for a single dataset item.

    Attributes:
        text: The tokenized, masked input text.
        types: The token type ids, if applicable.
        attention_mask: A mask over all valid tokens, excluding padding.
        labels: The true values of the masked tokens at each position covered by loss_mask.
        loss_mask: The mask over the text indicating which tokens are masked and should be predicted.
        is_random: ??
    """

    text: Tensor
    types: Tensor
    attention_mask: Tensor
    labels: Tensor
    loss_mask: Tensor
    is_random: Tensor
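
To make the relationship between text, attention_mask, labels, and loss_mask concrete, here is a minimal sketch that builds one sample by hand. It is not the library's implementation. Plain Python lists stand in for tensors so the sketch runs without PyTorch, and ListBertSample, make_sample, MASK_ID, and PAD_ID are hypothetical names. Keeping the original token ids at every position of labels is an assumption, and is_random is filled with zeros only as a placeholder because its meaning is undocumented above.

```python
from typing import TypedDict

MASK_ID = 4  # hypothetical id of the mask token
PAD_ID = 0   # hypothetical id of the padding token


class ListBertSample(TypedDict):
    """Stand-in for BertSample that uses lists of ints instead of tensors."""

    text: list[int]
    types: list[int]
    attention_mask: list[int]
    labels: list[int]
    loss_mask: list[int]
    is_random: list[int]


def make_sample(
    token_ids: list[int], masked_positions: set[int], padded_len: int
) -> ListBertSample:
    """Mask the chosen positions and pad the sequence to padded_len."""
    n = len(token_ids)
    pad = padded_len - n
    # Masked input: chosen positions are replaced by the mask token.
    text = [MASK_ID if i in masked_positions else t for i, t in enumerate(token_ids)]
    return ListBertSample(
        text=text + [PAD_ID] * pad,
        types=[0] * padded_len,                  # single segment, so all zeros
        attention_mask=[1] * n + [0] * pad,      # 1 for real tokens, 0 for padding
        labels=token_ids + [PAD_ID] * pad,       # assumption: original ids kept everywhere
        loss_mask=[1 if i in masked_positions else 0 for i in range(n)] + [0] * pad,
        is_random=[0] * padded_len,              # placeholder; meaning not documented
    )


sample = make_sample([5, 6, 7], masked_positions={1}, padded_len=5)
print(sample["text"])       # position 1 now holds MASK_ID
print(sample["loss_mask"])  # only position 1 contributes to the loss
```

Only the positions where loss_mask is 1 carry a prediction target, so the label at index 1 is the original token that the mask replaced.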

Tokenizer

Bases: Protocol

Required attributes for a tokenizer provided to apply_bert_pretraining_mask.

Source code in bionemo/llm/data/types.py
from typing import Protocol


class Tokenizer(Protocol):
    """Required attributes for a tokenizer provided to apply_bert_pretraining_mask."""

    @property
    def mask_token_id(self) -> int | None:  # noqa: D102
        ...

    @property
    def all_special_ids(self) -> list[int]:  # noqa: D102
        ...